AIBox: CTR prediction model training on a single node

Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, Ping Li

Research output: Chapter in Book/Report/Conference proceedingConference contribution

44 Scopus citations

Abstract

As one of the major search engines in the world, Baidu's Sponsored Search has long adopted the use of deep neural network (DNN) models for Ads click-through rate (CTR) predictions, as early as in 2013. The input futures used by Baidu's online advertising system (a.k.a. “Phoenix Nest”) are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is obviously important to keep the model training process highly efficient so that engineers (and researchers) are able to quickly refine and test their new models or new features. Moreover, as billions of user ads click history entries are arriving every day, the models have to be re-trained rapidly because CTR prediction is an extremely time-sensitive task. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. And, of course, the maintenance costs for clusters are also substantial. This paper presents AIBox, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one is suitable for CPUs and another for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses. Extensive experiments on production data reveal the effectiveness of the new system. AIBox has comparable training performance with a large MPI cluster, while requiring only a small fraction of the cost for the cluster.

Original languageEnglish (US)
Title of host publicationCIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management
PublisherAssociation for Computing Machinery
Pages319-328
Number of pages10
ISBN (Electronic)9781450369763
DOIs
StatePublished - Nov 3 2019
Externally publishedYes
Event28th ACM International Conference on Information and Knowledge Management, CIKM 2019 - Beijing, China
Duration: Nov 3 2019Nov 7 2019

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Conference

Conference28th ACM International Conference on Information and Knowledge Management, CIKM 2019
Country/TerritoryChina
CityBeijing
Period11/3/1911/7/19

All Science Journal Classification (ASJC) codes

  • Decision Sciences(all)
  • Business, Management and Accounting(all)

Keywords

  • GPU Computing
  • SSD Cache Management
  • Sponsored Search

Fingerprint

Dive into the research topics of 'AIBox: CTR prediction model training on a single node'. Together they form a unique fingerprint.

Cite this