Cross-lingual language model pretraining for retrieval

Puxuan Yu, Hongliang Fei, Ping Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Scopus citations

Abstract

Existing research on cross-lingual retrieval cannot take good advantage of large-scale pretrained language models such as multilingual BERT and XLM. We hypothesize that the absence of cross-lingual passage-level relevance data for finetuning and the lack of query-document style pretraining are key factors behind this issue. In this paper, we introduce two novel retrieval-oriented pretraining tasks to further pretrain cross-lingual language models for downstream retrieval tasks such as cross-lingual ad-hoc retrieval (CLIR) and cross-lingual question answering (CLQA). We construct distant supervision data from multilingual Wikipedia using section alignment to support retrieval-oriented language model pretraining. We also propose to directly finetune language models on part of the evaluation collection by making Transformers capable of accepting longer sequences. Experiments on multiple benchmark datasets show that our proposed model can significantly improve upon general multilingual language models in both the cross-lingual retrieval setting and the cross-lingual transfer setting.
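
The following is a minimal sketch of how section-aligned multilingual Wikipedia might be turned into cross-lingual query-passage pairs for this kind of distant supervision. The data layout, the precomputed alignment input, and the "title + heading as pseudo-query" heuristic are illustrative assumptions for exposition, not the authors' released pipeline.

```python
# Illustrative sketch (not the paper's code): build cross-lingual
# query-passage pairs from section-aligned multilingual Wikipedia.
from typing import Dict, List, Tuple

# article title -> {section heading -> section text}, one dict per language edition
WikiSections = Dict[str, Dict[str, str]]

def build_distant_supervision(
    en_wiki: WikiSections,
    xx_wiki: WikiSections,
    alignment: Dict[Tuple[str, str], Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Pair an English article title plus section heading (pseudo-query)
    with the body text of the aligned section in another language
    (treated as a relevant passage).

    `alignment` maps (en_title, en_heading) -> (xx_title, xx_heading);
    how that section alignment is obtained is outside this sketch.
    """
    pairs: List[Tuple[str, str]] = []
    for (en_title, en_heading), (xx_title, xx_heading) in alignment.items():
        passage = xx_wiki.get(xx_title, {}).get(xx_heading)
        if passage:  # skip sections missing from the target-language edition
            query = f"{en_title} {en_heading}"  # title + heading as pseudo-query
            pairs.append((query, passage))
    return pairs
```

In a scheme like this, the heading-as-query pairing only approximates genuine query-document relevance, which is why such data serves as distant supervision for retrieval-oriented pretraining rather than as direct relevance labels.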

Original language: English (US)
Title of host publication: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
Publisher: Association for Computing Machinery, Inc
Pages: 1029-1039
Number of pages: 11
ISBN (Electronic): 9781450383127
DOIs
State: Published - Jun 3 2021
Event: 30th World Wide Web Conference, WWW 2021 - Ljubljana, Slovenia
Duration: Apr 19 2021 - Apr 23 2021

Publication series

Name: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

Conference

Conference: 30th World Wide Web Conference, WWW 2021
Country/Territory: Slovenia
City: Ljubljana
Period: 4/19/21 - 4/23/21

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
