BlueGene/L failure analysis and prediction models

Yinglung Liang, Yanyong Zhang, Morris Jette, Anand Sivasubramaniam, Ramendra Sahoo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

177 Citations (Scopus)

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether in hardware or in software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on these observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.
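
The abstract reports prediction coverage as a percentage of failures predicted. As a rough illustration of how such a figure could be computed, the sketch below counts the fraction of fatal events that were preceded by at least one non-fatal event within a fixed time window. This is not the authors' method: the function name `precursor_recall`, the window length, and all timestamps are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def precursor_recall(fatal_times, nonfatal_times, window):
    """Fraction of fatal events with a non-fatal event in the preceding `window` seconds.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not fatal_times:
        return 0.0
    nonfatal = sorted(nonfatal_times)
    hits = 0
    for t in fatal_times:
        # Index of the first non-fatal event at or after (t - window).
        i = bisect_left(nonfatal, t - window)
        # It counts as a precursor only if it occurs strictly before the fatal event.
        if i < len(nonfatal) and nonfatal[i] < t:
            hits += 1
    return hits / len(fatal_times)


# Made-up timestamps in seconds; not data from the BlueGene/L logs.
fatal = [100, 500, 900, 1500, 2000]
nonfatal = [90, 480, 700, 1490, 1950]
recall = precursor_recall(fatal, nonfatal, window=60)
print(recall)
```

With these made-up inputs, four of the five fatal events have a precursor inside the window. A real evaluation would also need to count false alarms, which this sketch ignores.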

Original language: English (US)
Title of host publication: Proceedings - DSN 2006
Subtitle of host publication: 2006 International Conference on Dependable Systems and Networks
Pages: 425-434
Number of pages: 10
DOI: https://doi.org/10.1109/DSN.2006.18
State: Published - Dec 22 2006
Event: DSN 2006: 2006 International Conference on Dependable Systems and Networks - Philadelphia, PA, United States
Duration: Jun 25 2006 to Jun 28 2006

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks
Volume: 2006

Other

Other: DSN 2006: 2006 International Conference on Dependable Systems and Networks
Country: United States
City: Philadelphia, PA
Period: 6/25/06 to 6/28/06

Fingerprint

Failure analysis
Hardware
Data storage equipment

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Liang, Y., Zhang, Y., Jette, M., Sivasubramaniam, A., & Sahoo, R. (2006). BlueGene/L failure analysis and prediction models. In Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks (pp. 425-434). [1633531] (Proceedings of the International Conference on Dependable Systems and Networks; Vol. 2006). https://doi.org/10.1109/DSN.2006.18
Liang, Yinglung ; Zhang, Yanyong ; Jette, Morris ; Sivasubramaniam, Anand ; Sahoo, Ramendra. / BlueGene/L failure analysis and prediction models. Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks. 2006. pp. 425-434 (Proceedings of the International Conference on Dependable Systems and Networks).
@inproceedings{61b511e2a218404fa88800fb8b9b1869,
title = "BlueGene/L failure analysis and prediction models",
abstract = "The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether it be in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective to the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80{\%} of the memory and network failures, and 47{\%} of the application I/O failures.",
author = "Yinglung Liang and Yanyong Zhang and Morris Jette and Anand Sivasubramaniam and Ramendra Sahoo",
year = "2006",
month = "12",
day = "22",
doi = "10.1109/DSN.2006.18",
language = "English (US)",
isbn = "0769526071",
series = "Proceedings of the International Conference on Dependable Systems and Networks",
pages = "425--434",
booktitle = "Proceedings - DSN 2006",
}

Liang, Y, Zhang, Y, Jette, M, Sivasubramaniam, A & Sahoo, R 2006, BlueGene/L failure analysis and prediction models. in Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks, 1633531, Proceedings of the International Conference on Dependable Systems and Networks, vol. 2006, pp. 425-434, DSN 2006: 2006 International Conference on Dependable Systems and Networks, Philadelphia, PA, United States, 6/25/06. https://doi.org/10.1109/DSN.2006.18

BlueGene/L failure analysis and prediction models. / Liang, Yinglung; Zhang, Yanyong; Jette, Morris; Sivasubramaniam, Anand; Sahoo, Ramendra.

Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks. 2006. p. 425-434 1633531 (Proceedings of the International Conference on Dependable Systems and Networks; Vol. 2006).

TY - GEN
T1 - BlueGene/L failure analysis and prediction models
AU - Liang, Yinglung
AU - Zhang, Yanyong
AU - Jette, Morris
AU - Sivasubramaniam, Anand
AU - Sahoo, Ramendra
PY - 2006/12/22
Y1 - 2006/12/22

N2 - The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether it be in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective to the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.

AB - The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether it be in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective to the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.

UR - http://www.scopus.com/inward/record.url?scp=33845589803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33845589803&partnerID=8YFLogxK
U2 - 10.1109/DSN.2006.18
DO - 10.1109/DSN.2006.18
M3 - Conference contribution
AN - SCOPUS:33845589803
SN - 0769526071
SN - 9780769526072
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 425
EP - 434
BT - Proceedings - DSN 2006
ER -

Liang Y, Zhang Y, Jette M, Sivasubramaniam A, Sahoo R. BlueGene/L failure analysis and prediction models. In Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks. 2006. p. 425-434. 1633531. (Proceedings of the International Conference on Dependable Systems and Networks). https://doi.org/10.1109/DSN.2006.18