Correctness evaluation of LLM-generated code: probabilistic approach
Abstract
Large language models (LLMs) are increasingly used in software development. However, the absence of a formal definition of code correctness complicates studying the correctness of generated code. This paper describes a probabilistic approach to defining the correctness of LLM-generated code. We introduce the TSA (Test Suite Accuracy) correctness metric, which arises naturally within this approach, and compare it with Pass@1. Our evaluation of five LLMs (Phi-1, Phi-2, Phi-3-mini-4k, Phi-4-mini, and Qwen2.5-Coder) confirms the described properties of both metrics. The key contributions of this research are the HumanEval++ dataset, which extends HumanEval+, and the TSA metric implementation built on it.
Edition
Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 111-128
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
DOI: 10.15514/ISPRAS-2026-38(2)-8
Full text of the paper in pdf (in Russian)