NewsXLM: A Multilingual Dataset and Model for Information Extraction from News Web Pages



Bedrin P.A. (ISP RAS, Moscow, Russia; MSU, Moscow, Russia)
Varlamov M.I. (ISP RAS, Moscow, Russia)
Yatskov A.K. (ISP RAS, Moscow, Russia; MSU, Moscow, Russia)

Abstract

This paper addresses automatic attribute extraction from news article web pages across multiple languages, a task crucial for multilingual web mining, aggregation, and analytics applications. Recent neural approaches are effective on English web page extraction datasets, but they are pre-trained on English data, which limits their applicability to other languages. We present the first large-scale multilingual dataset for news web page attribute extraction, containing 29,081 annotated pages from 759 websites across 56 languages. Each page includes DOM-node-linked annotations for up to five key attributes (title, publication date, text, authors, and tags), together with HTML and MHTML sources, English-translated versions, screenshots, and node-level render metadata. We evaluate a variety of open-source extraction methods, including heuristic tools and modern transformer-based models. Specifically, we fine-tune the English pre-trained MarkupLM on both original and English-translated pages, and we pre-train a multilingual DOM-LM-based model from scratch on a multilingual news web corpus before fine-tuning it on our dataset. Experimental results show that the multilingual DOM-LM achieves the best overall performance across most attributes and languages without relying on translation, while MarkupLM benefits from translation but remains less consistent across languages. The collected dataset and all trained models are publicly available to support practical use and future research in multilingual web information extraction and downstream applications in the news domain.

Keywords

web data extraction; information extraction; web page dataset; news; multilingual dataset; multilingual model; neural networks.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 149-164

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-10

For citation

Bedrin P.A., Varlamov M.I., Yatskov A.K. NewsXLM: A Multilingual Dataset and Model for Information Extraction from News Web Pages. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 149-164. DOI: 10.15514/ISPRAS-2026-38(2)-10.
