TY - GEN
T1 - Cross-lingual language model pretraining for retrieval
AU - Yu, Puxuan
AU - Fei, Hongliang
AU - Li, Ping
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/3
Y1 - 2021/6/3
AB - Existing research on cross-lingual retrieval cannot take good advantage of large-scale pretrained language models such as multilingual BERT and XLM. We hypothesize that the absence of cross-lingual passage-level relevance data for finetuning and the lack of query-document style pretraining are key factors of this issue. In this paper, we introduce two novel retrieval-oriented pretraining tasks to further pretrain cross-lingual language models for downstream retrieval tasks such as cross-lingual ad-hoc retrieval (CLIR) and cross-lingual question answering (CLQA). We construct distant supervision data from multilingual Wikipedia using section alignment to support retrieval-oriented language model pretraining. We also propose to directly finetune language models on part of the evaluation collection by making Transformers capable of accepting longer sequences. Experiments on multiple benchmark datasets show that our proposed model can significantly improve upon general multilingual language models in both the cross-lingual retrieval setting and the cross-lingual transfer setting.
UR - http://www.scopus.com/inward/record.url?scp=85107989824&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107989824&partnerID=8YFLogxK
U2 - 10.1145/3442381.3449830
DO - 10.1145/3442381.3449830
M3 - Conference contribution
AN - SCOPUS:85107989824
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 1029
EP - 1039
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery, Inc
T2 - 30th World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -