Correctness evaluation of LLM-generated code: probabilistic approach


Avagian D.A. (MSU, Moscow, Russia)

Abstract

Large language models (LLMs) are increasingly used in software development. However, the absence of a formal definition of code correctness complicates studying the correctness of generated code. This paper presents a probabilistic approach to defining the correctness of LLM-generated code. We introduce the Test Suite Accuracy (TSA) correctness metric, which arises naturally within the presented approach, and compare it with Pass@1. Our evaluation of five LLMs (Phi-1, Phi-2, Phi-3-mini-4k, Phi-4-mini, and Qwen2.5-Coder) confirms the described properties of both metrics. The key contributions of this research are the HumanEval++ dataset, which extends HumanEval+, and the TSA metric implementation built upon it.

Keywords

large language models; software engineering; code generation; code quality; code correctness; metrics.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 111-128

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-8

For citation

Avagian D.A. Correctness evaluation of LLM-generated code: probabilistic approach. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 111-128. DOI: 10.15514/ISPRAS-2026-38(2)-8.

Full text of the paper in PDF (in Russian).