Correctness evaluation of LLM-generated code: probabilistic approach


Avagian D.A. (MSU, Moscow, Russia)

Abstract

Large language models (LLMs) are increasingly used in software development. However, the absence of a formal definition of code correctness complicates studying the correctness of generated code. This paper presents a probabilistic approach to defining the correctness of LLM-generated code. We introduce the Test Suite Accuracy (TSA) correctness metric, which arises naturally within the presented approach, and compare it with Pass@1. Our evaluation of five LLMs (Phi-1, Phi-2, Phi-3-mini-4k, Phi-4-mini, and Qwen2.5-Coder) confirms the described properties of both metrics. The key contributions of this research are the HumanEval++ dataset, which extends HumanEval+, and the TSA metric implementation built upon it.

Keywords

large language models; software engineering; code generation; code quality; code correctness; metrics.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 111-128

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-8

For citation

Avagian D.A. Correctness evaluation of LLM-generated code: probabilistic approach. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 111-128. DOI: 10.15514/ISPRAS-2026-38(2)-8.

Full text of the paper in PDF (in Russian).