Failure prediction in IBM BlueGene/L event logs

Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra Sahoo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

102 Scopus citations

Abstract

Frequent failures are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in size and complexity. In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world. In this study, we first show how the event records can be converted into a data set that is appropriate for running classification techniques. Then we apply classifiers on the data, including RIPPER (a rule-based classifier), Support Vector Machines (SVMs), a traditional Nearest Neighbor method, and a customized Nearest Neighbor method. We show that the customized nearest neighbor approach can outperform RIPPER and SVMs in terms of both coverage and precision. The results suggest that the customized nearest neighbor approach can be used to alleviate the impact of failures.
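The pipeline the abstract describes (event records turned into a classification data set, a nearest neighbor predictor, and evaluation by precision and coverage) can be illustrated with a minimal sketch. The abstract does not specify how it is done, so everything here is an assumption: the fixed-width observation window, the prediction horizon, the severity names, the count-based features, the plain 1-nearest-neighbor rule (not the customized method), and the reading of coverage as the fraction of actual failures that were predicted. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical severity levels used as features; the real log schema may differ.
SEVERITIES = ("INFO", "WARNING", "FATAL")


def build_dataset(events, window, horizon):
    """Turn a time-sorted list of (timestamp, severity) records into samples.

    Each sample is (features, label): features are per-severity counts in an
    observation window, and label is True if a FATAL event occurs in the
    horizon that immediately follows the window. This windowing scheme is an
    assumption, not the paper's stated procedure.
    """
    if not events:
        return []
    start, end = events[0][0], events[-1][0]
    samples = []
    t = start
    while t + window + horizon <= end + 1:
        observed = Counter(s for ts, s in events if t <= ts < t + window)
        upcoming = [s for ts, s in events
                    if t + window <= ts < t + window + horizon]
        features = tuple(observed[s] for s in SEVERITIES)
        samples.append((features, "FATAL" in upcoming))
        t += window
    return samples


def nearest_neighbor_predict(train, query):
    """Return the label of the training sample closest to query (plain 1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda sample: sq_dist(sample[0], query))
    return label


def precision_and_coverage(predicted, actual):
    """Precision = TP / (TP + FP); coverage (assumed to mean recall) = TP / (TP + FN)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / (tp + fn) if tp + fn else 0.0
    return precision, coverage


if __name__ == "__main__":
    # Toy log, invented purely for illustration.
    log = [(0, "INFO"), (1, "WARNING"), (2, "WARNING"),
           (5, "FATAL"), (8, "INFO"), (11, "INFO")]
    data = build_dataset(log, window=4, horizon=4)
    print(data)
    print(nearest_neighbor_predict(data, (1, 1, 0)))
```

The sketch keeps the three stages separate so that a different feature encoding or classifier could be swapped in without touching the evaluation step.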

Original language: English (US)
Title of host publication: Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007
Pages: 583-588
Number of pages: 6
State: Published - 2007
Event: 7th IEEE International Conference on Data Mining, ICDM 2007 - Omaha, NE, United States
Duration: Oct 28 2007 → Oct 31 2007

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Other

Other: 7th IEEE International Conference on Data Mining, ICDM 2007
Country/Territory: United States
City: Omaha, NE
Period: 10/28/07 → 10/31/07

All Science Journal Classification (ASJC) codes

  • Engineering(all)
