Emotion Recognition Capabilities of Large Language Models: A Comparative Analysis
Abstract
Large language models (LLMs) are increasingly integrated into conversational systems, where understanding emotional cues is essential for maintaining coherent, engaging, and safe interactions. This study evaluates how effectively modern instruction-tuned LLMs can recognize emotions from text alone, without task-specific fine-tuning. We benchmark multiple open-weight LLM families (<15B parameters) across four prompting strategies (Baseline, Context, Few-shot, and Context+Few-shot) on two English emotion recognition in conversation (ERC) benchmarks, IEMOCAP and MELD, and one Russian dataset, RESD. We find that the optimal prompting strategy is dataset-dependent: semantically redundant data such as IEMOCAP benefits most from few-shot demonstrations (best weighted F1-score (WF1) of 73.3% with Context+Few-shot), whereas MELD gains primarily from incorporating dialogue history (best WF1 of 60.3% with Context). Robustness experiments show that LLMs are largely insensitive to the reordering of few-shot examples, but performance degrades substantially when the label space is corrupted, indicating that a coherent label space matters more than the order of the examples or their ground-truth labels. Cross-lingual evaluation reveals a notable drop on the Russian RESD dataset (best WF1 of 45.8%), highlighting a persistent gap between English and Russian affect understanding in current LLMs. Overall, non-fine-tuned LLMs serve as strong prompt-only baselines for ERC, yet remain clearly behind specialized supervised systems.
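For readers unfamiliar with the WF1 metric reported above, the following is a minimal sketch of the standard support-weighted F1-score: per-class F1 values are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels are illustrative assumptions, not the evaluation code or data used in the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of true labels
    return score


# Toy example with made-up labels (not taken from IEMOCAP, MELD, or RESD).
gold = ["joy", "joy", "anger", "sad"]
pred = ["joy", "anger", "anger", "sad"]
print(round(weighted_f1(gold, pred), 4))  # prints 0.75
```

Because the weights follow class frequency, frequent emotion classes dominate the score, which is worth keeping in mind when comparing results across datasets with different label distributions.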
Keywords
Edition
Proceedings of the Institute for System Programming, vol. 38, issue 3, part 4, 2026, pp. 157-174
ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).
DOI: 10.15514/ISPRAS-2026-38(3)-53
For citation