A Research on embedding clustering for paraphrase retrieval in texts instructions for medical use of drugs


Kilmishkin N.V. (REA, Moscow, Russia)
Kubrakov D.D. (REA, Moscow, Russia)
Titov Y.P. (REA, Moscow, Russia)
Panteleev V.I. (REA, Moscow, Russia)
Kuropatkina T.A. (REA, Moscow, Russia)
Kochna N.A. (REA, Moscow, Russia)
Ivanova P.M. (REA, Moscow, Russia)

Abstract

This study developed an integrated approach to paraphrase detection in medical instruction texts, combining modern natural language processing (NLP) methods, dimensionality reduction and cluster analysis. The best results were obtained by combining the distiluse_base_multilingual model with the UMAP algorithm (parameters: n_components=2, n_neighbors=10, min_dist=0.1, metric=cosine) and agglomerative clustering (n_clusters=200, linkage=ward). A distinctive feature of the methodology is a strategy of dimensionality reduction followed by the addition of class information, which preserved semantic relationships and improved clustering quality. A comparative analysis of different language models (including Clinical Modern BERT, paraphrase-multilingual and rubert-tiny) showed the advantages of the distiluse_base_multilingual model in accuracy and computational efficiency. Visualisation of the results confirmed that the method clearly separates semantic clusters, and storing the results in JSON format makes them convenient to integrate into practical applications. The developed approach has the potential to automate the processing of medical texts, especially in tasks of unifying the terminology of drug instructions.
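The pipeline named in the abstract (sentence embeddings, UMAP reduction, agglomerative clustering, JSON storage) might be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the sentence-transformers, umap-learn and scikit-learn packages. The checkpoint name distiluse-base-multilingual-cased-v2, the function names and the JSON layout (cluster label mapped to a list of sentences) are hypothetical choices, since the abstract does not specify them. The step of adding class information after reduction is omitted because the abstract does not describe it.

```python
import json
from collections import defaultdict


def embed_reduce_cluster(sentences, n_clusters=200,
                         model_name="distiluse-base-multilingual-cased-v2"):
    """Embed sentences, reduce to 2D with UMAP, then cluster.

    The UMAP and clustering parameters follow the abstract; model_name is
    an assumed checkpoint. n_clusters must not exceed len(sentences).
    """
    # Third-party imports are local so that the JSON helper below can be
    # used without these packages installed.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import AgglomerativeClustering
    import umap

    embeddings = SentenceTransformer(model_name).encode(sentences)
    reducer = umap.UMAP(n_components=2, n_neighbors=10,
                        min_dist=0.1, metric="cosine")
    reduced = reducer.fit_transform(embeddings)
    clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    labels = clusterer.fit_predict(reduced)
    return [int(label) for label in labels]


def clusters_to_json(sentences, labels):
    """Group sentences by cluster label and serialise them as JSON.

    The layout {label: [sentences]} is an assumption made for illustration.
    """
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[str(label)].append(sentence)
    # ensure_ascii=False keeps Cyrillic text readable in the output file.
    return json.dumps(groups, ensure_ascii=False, indent=2, sort_keys=True)
```

Sentences that land in the same cluster are then candidate paraphrases. With n_clusters=200 the input must contain at least 200 sentences, so a smaller value is needed for quick experiments.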

Keywords

machine learning; natural language processing (NLP); clustering; drug instructions; feature space dimensionality reduction; paraphrase search.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 3, part 4, 2026, pp. 175-190

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(3)-54

For citation

Kilmishkin N.V., Kubrakov D.D., Titov Y.P., Panteleev V.I., Kuropatkina T.A., Kochna N.A., Ivanova P.M. A Research on embedding clustering for paraphrase retrieval in texts instructions for medical use of drugs. Proceedings of the Institute for System Programming, vol. 38, issue 3, part 4, 2026, pp. 175-190. DOI: 10.15514/ISPRAS-2026-38(3)-54.

Full text of the paper in PDF (in Russian)