Failure prediction in IBM BlueGene/L event logs

Yinglung Liang, Yanyong Zhang, Hui Xiong, Ramendra Sahoo

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

66 Citations (Scopus)

Abstract

Frequent failures are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in size and complexity. In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world. In this study, we first show how the event records can be converted into a data set that is appropriate for running classification techniques. Then we apply classifiers on the data, including RIPPER (a rule-based classifier), Support Vector Machines (SVMs), a traditional Nearest Neighbor method, and a customized Nearest Neighbor method. We show that the customized nearest neighbor approach can outperform RIPPER and SVMs in terms of both coverage and precision. The results suggest that the customized nearest neighbor approach can be used to alleviate the impact of failures.
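The pipeline the abstract outlines (event records turned into a classification data set, a nearest neighbor classifier, and evaluation by precision and coverage) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `(timestamp, severity)` record shape, the severity names, the fixed observation window and prediction horizon, the per-severity count features, and the plain 1-nearest-neighbor rule. It is not the paper's customized method, which the abstract does not describe. "Coverage" is taken here to mean recall, the fraction of actual failures that were predicted.

```python
from collections import Counter
from math import dist

# Hypothetical severity levels; the real log schema is not given in the abstract.
SEVERITIES = ("INFO", "WARNING", "ERROR", "FATAL")


def build_dataset(events, window, horizon):
    """Turn (timestamp, severity) records into (features, label) pairs.

    Features: per-severity event counts in [t, t + window).
    Label: True if a FATAL event occurs in [t + window, t + window + horizon).
    """
    if not events:
        return []
    end = max(ts for ts, _ in events)
    samples = []
    t = 0
    while t + window + horizon <= end + 1:
        counts = Counter(sev for ts, sev in events if t <= ts < t + window)
        label = any(
            sev == "FATAL" and t + window <= ts < t + window + horizon
            for ts, sev in events
        )
        samples.append((tuple(counts[s] for s in SEVERITIES), label))
        t += window
    return samples


def predict_1nn(train, features):
    """Return the label of the closest training sample (Euclidean distance)."""
    _, label = min(train, key=lambda sample: dist(sample[0], features))
    return label


def precision_and_coverage(predicted, actual):
    """Precision = TP / (TP + FP); coverage (recall) = TP / (TP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / (tp + fn) if tp + fn else 0.0
    return precision, coverage
```

In this framing, precision measures how many raised failure warnings were correct, while coverage measures how many real failures were warned about. A predictor has to do well on both to be useful for fault tolerance.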

Original language: English (US)
Title of host publication: Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007
Pages: 583-588
Number of pages: 6
DOIs: https://doi.org/10.1109/ICDM.2007.46
State: Published - Dec 1 2007
Event: 7th IEEE International Conference on Data Mining, ICDM 2007 - Omaha, NE, United States
Duration: Oct 28 2007 to Oct 31 2007

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Other

Other: 7th IEEE International Conference on Data Mining, ICDM 2007
Country: United States
City: Omaha, NE
Period: 10/28/07 to 10/31/07

Fingerprint

Support vector machines
Classifiers
Supercomputers

All Science Journal Classification (ASJC) codes

  • Engineering (all)

Cite this

Liang, Y., Zhang, Y., Xiong, H., & Sahoo, R. (2007). Failure prediction in IBM BlueGene/L event logs. In Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007 (pp. 583-588). [4470294] (Proceedings - IEEE International Conference on Data Mining, ICDM). https://doi.org/10.1109/ICDM.2007.46
@inproceedings{38e41a96abe643f69fb0c7e5727dbba7,
title = "Failure prediction in IBM BlueGene/L event logs",
author = "Yinglung Liang and Yanyong Zhang and Hui Xiong and Ramendra Sahoo",
year = "2007",
month = "12",
day = "1",
doi = "10.1109/ICDM.2007.46",
language = "English (US)",
isbn = "0769530184",
series = "Proceedings - IEEE International Conference on Data Mining, ICDM",
pages = "583--588",
booktitle = "Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007",

}


TY - GEN

T1 - Failure prediction in IBM BlueGene/L event logs

AU - Liang, Yinglung

AU - Zhang, Yanyong

AU - Xiong, Hui

AU - Sahoo, Ramendra

PY - 2007/12/1

Y1 - 2007/12/1

UR - http://www.scopus.com/inward/record.url?scp=49749107565&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=49749107565&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2007.46

DO - 10.1109/ICDM.2007.46

M3 - Conference contribution

SN - 0769530184

SN - 9780769530185

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 583

EP - 588

BT - Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007

ER -
