NewsXLM: A Multilingual Dataset and Model for Information Extraction from News Web Pages

Bedrin P.A. (ISP RAS, Moscow, Russia; MSU, Moscow, Russia)
Varlamov M.I. (ISP RAS, Moscow, Russia)
Yatskov A.K. (ISP RAS, Moscow, Russia; MSU, Moscow, Russia)

Abstract

This paper addresses the automatic extraction of attributes from news article web pages across multiple languages, a task that is crucial for multilingual web mining, aggregation, and analytics applications. Recent neural approaches are effective on English web page extraction datasets, but they are pre-trained on English data, which limits their applicability to other languages. We present the first large-scale multilingual dataset for news web page attribute extraction, containing 29,081 annotated pages from 759 websites across 56 languages. Each page includes DOM-node-linked annotations for up to five key attributes (title, publication date, text, authors, and tags), together with HTML and MHTML sources, English-translated versions, screenshots, and node-level render metadata. We evaluate a variety of open-source extraction methods, including heuristic tools and modern transformer-based models. Specifically, we fine-tune the English pre-trained MarkupLM on both original and English-translated pages, and we pre-train a multilingual DOM-LM-based model from scratch on a multilingual news web corpus before fine-tuning it on our dataset. Experimental results show that the multilingual DOM-LM achieves the best overall performance across most attributes and languages without relying on translation, while MarkupLM benefits from translation but remains less consistent across languages. The collected dataset and all trained models are publicly available to support practical use and future research in multilingual web information extraction and downstream applications in the news domain.

Keywords

web data extraction; information extraction; web page dataset; news; multilingual dataset; multilingual model; neural networks.

Edition

Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 149-164

ISSN 2220-6426 (Online), ISSN 2079-8156 (Print).

DOI: 10.15514/ISPRAS-2026-38(2)-10

For citation

Bedrin P.A., Varlamov M.I., Yatskov A.K. NewsXLM: A Multilingual Dataset and Model for Information Extraction from News Web Pages. Proceedings of the Institute for System Programming, vol. 38, issue 2, 2026, pp. 149-164. DOI: 10.15514/ISPRAS-2026-38(2)-10.