AUTOTRAINER: An automatic DNN training problem detection and repair system

Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

With machine learning models, especially Deep Neural Network (DNN) models, becoming an integral part of new intelligent software, new tools to support their engineering process are in high demand. Existing DNN debugging tools either work post-training, which wastes considerable time training a buggy model and requires expertise, or are limited to collecting training logs without analyzing the problems, let alone fixing them. In this paper, we propose AUTOTRAINER, a DNN training monitoring and automatic repair tool that supports detecting and automatically repairing five commonly seen training problems. During training, it periodically checks the training status and detects potential problems. Once a problem is found, AUTOTRAINER tries to fix it using built-in state-of-the-art solutions. It supports various model structures and input data types, such as Convolutional Neural Networks (CNNs) for images and Recurrent Neural Networks (RNNs) for text. Our evaluation on 6 datasets and 495 models shows that AUTOTRAINER detects all potential problems, with a 100% detection rate and no false positives. Among all models with problems, it fixes 97.33% of them, increasing accuracy by 47.08% on average.
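
The monitor-detect-repair cycle described in the abstract can be illustrated with a minimal sketch. This is not AUTOTRAINER's code or API: the function names, the two symptoms ("diverged" and "stalled"), the thresholds, and the repair (dividing the learning rate by 10) are all hypothetical stand-ins chosen for illustration, and they are not the five problems or the built-in solutions the paper covers.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]


def detect_problem(losses: List[float], window: int = 5,
                   tol: float = 1e-3) -> Optional[str]:
    """Hypothetical detector: flags two illustrative symptoms only."""
    recent = losses[-window:]
    # A NaN or infinite loss in the recent window counts as divergence.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "diverged"
    # A full window with almost no change in loss counts as a stall.
    if len(recent) == window and max(recent) - min(recent) < tol:
        return "stalled"
    return None


def monitored_training(
    step: Callable[[Config], float],
    config: Config,
    repairs: Dict[str, Callable[[Config], Config]],
    epochs: int = 20,
    check_every: int = 5,
) -> Tuple[Config, List[Tuple[int, str]]]:
    """Train, check the loss history periodically, and repair when needed."""
    losses: List[float] = []
    log: List[Tuple[int, str]] = []
    for epoch in range(1, epochs + 1):
        losses.append(step(config))
        if epoch % check_every != 0:
            continue
        problem = detect_problem(losses)
        if problem is not None and problem in repairs:
            config = repairs[problem](config)  # apply the assumed repair
            log.append((epoch, problem))
            losses.clear()  # restart monitoring after the repair
    return config, log


def make_toy_step() -> Callable[[Config], float]:
    """Toy stand-in for one training epoch (assumption, not a real model)."""
    calls = 0

    def step(config: Config) -> float:
        nonlocal calls
        calls += 1
        if config["lr"] > 1.0:
            return float("nan")  # pretend a large learning rate diverges
        return 1.0 / calls       # otherwise the loss slowly decreases

    return step


def lower_learning_rate(config: Config) -> Config:
    """Illustrative repair: divide the learning rate by 10."""
    return {**config, "lr": config["lr"] / 10}
```

Calling `monitored_training(make_toy_step(), {"lr": 10.0}, {"diverged": lower_learning_rate})` runs the toy loop, detects the divergence at the first periodic check, lowers the learning rate once, and then trains on without further repairs.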

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE/ACM 43rd International Conference on Software Engineering, ICSE 2021
Publisher: IEEE Computer Society
Pages: 359-371
Number of pages: 13
ISBN (Electronic): 9780738113197
State: Published - May 2021
Event: 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021 - Virtual, Online, Spain
Duration: May 22 2021 to May 30 2021

Publication series

Name: Proceedings - International Conference on Software Engineering
ISSN (Print): 0270-5257

Conference

Conference: 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021
Country/Territory: Spain
City: Virtual, Online
Period: 5/22/21 to 5/30/21

All Science Journal Classification (ASJC) codes

  • Software

Keywords

  • Deep learning training
  • Software engineering
  • Software tools
