TY - GEN
T1 - AIBox
T2 - 28th ACM International Conference on Information and Knowledge Management, CIKM 2019
AU - Zhao, Weijie
AU - Zhang, Jingyuan
AU - Xie, Deping
AU - Qian, Yulei
AU - Jia, Ronglai
AU - Li, Ping
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/11/3
Y1 - 2019/11/3
N2 - As one of the major search engines in the world, Baidu's Sponsored Search has long adopted the use of deep neural network (DNN) models for Ads click-through rate (CTR) predictions, as early as in 2013. The input futures used by Baidu's online advertising system (a.k.a. “Phoenix Nest”) are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is obviously important to keep the model training process highly efficient so that engineers (and researchers) are able to quickly refine and test their new models or new features. Moreover, as billions of user ads click history entries are arriving every day, the models have to be re-trained rapidly because CTR prediction is an extremely time-sensitive task. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. And, of course, the maintenance costs for clusters are also substantial. This paper presents AIBox, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one is suitable for CPUs and another for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses. Extensive experiments on production data reveal the effectiveness of the new system. AIBox has comparable training performance with a large MPI cluster, while requiring only a small fraction of the cost for the cluster.
AB - As one of the major search engines in the world, Baidu's Sponsored Search has long adopted the use of deep neural network (DNN) models for Ads click-through rate (CTR) predictions, as early as in 2013. The input futures used by Baidu's online advertising system (a.k.a. “Phoenix Nest”) are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is obviously important to keep the model training process highly efficient so that engineers (and researchers) are able to quickly refine and test their new models or new features. Moreover, as billions of user ads click history entries are arriving every day, the models have to be re-trained rapidly because CTR prediction is an extremely time-sensitive task. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. And, of course, the maintenance costs for clusters are also substantial. This paper presents AIBox, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one is suitable for CPUs and another for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses. Extensive experiments on production data reveal the effectiveness of the new system. AIBox has comparable training performance with a large MPI cluster, while requiring only a small fraction of the cost for the cluster.
KW - GPU Computing
KW - SSD Cache Management
KW - Sponsored Search
UR - http://www.scopus.com/inward/record.url?scp=85075445465&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075445465&partnerID=8YFLogxK
U2 - 10.1145/3357384.3358045
DO - 10.1145/3357384.3358045
M3 - Conference contribution
AN - SCOPUS:85075445465
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 319
EP - 328
BT - CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 3 November 2019 through 7 November 2019
ER -