Experimental Study of Instruction-Based Models for Extracting Domain-Specific Entities from Student Reports


Melnikova A.V. (UTMN, Tyumen, Russia)
Vorobeva M.S. (UTMN, Tyumen, Russia)
Glazkova A.V. (UTMN, Tyumen, Russia; RNC, Moscow, Russia)
Morozov D.A. (NSU, Novosibirsk, Russia; RNC, Moscow, Russia)

Abstract

This work investigates the task of extracting domain-specific entities (DSEs) from student reports in the field of information technology. DSEs are key terms, skills, and named entities that reflect the thematic specifics of a text. The solutions evaluated were the keyword extraction tool rutermextract, a fine-tuned mBART language model, and instruction-tuned large language models (YandexGPT, Saiga, Tlite). The study found that fine-tuning mBART is effective given a sufficient volume of data. Instruction-based models outperformed rutermextract and showed promise for low-data scenarios, with the Saiga model being particularly effective at identifying the core set of entities. Highlighting DSEs within the text proved more accurate than extracting them as a simple list. However, the task requires further research: the high rate of erroneous extractions (67-89%), which manifest as a complete lack of overlap with the gold-standard entities, indicates that the models struggle to separate the core entity from its context. The main limitations of the study are the small corpus size (2,933 texts) and the use of simple instructions. Promising research directions include developing more detailed instructions and evaluating the approaches in other domains and on other text types.

Keywords

domain-specific entities; entity extraction; natural language processing; pre-trained language models; instruction-based models; generative language models; report document analysis; instruction tuning.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 165-182

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-11

For citation

Melnikova A.V., Vorobeva M.S., Glazkova A.V., Morozov D.A. Experimental Study of Instruction-Based Models for Extracting Domain-Specific Entities from Student Reports. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 165-182. DOI: 10.15514/ISPRAS-2026-38(2)-11.
