Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)

Research output: Contribution to journal › Article

Abstract

Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. Assessing the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. Conversely, neither the patient/control ratio nor marker preselection nor coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with smaller effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
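
The abstract names the AUC as the main score for comparing methods. As a minimal sketch of what that statistic measures (not the authors' code), the AUC can be read as the probability that a randomly chosen patient receives a higher model score than a randomly chosen control. The function name `roc_auc` and the toy labels and scores below are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: sequence of 1 (case) or 0 (control).
    scores: model outputs, higher meaning "more likely a case".
    A tie between a case and a control counts as half a correct pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

# 10 of the 12 case/control pairs are ranked correctly here.
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, so it only suits small illustrations. On a dataset the size of the one described above, a rank-based implementation from an established library would be the practical choice.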

Original language: English (US)
Article number: 10351
Journal: Scientific Reports
Volume: 9
Issue number: 1
DOI: 10.1038/s41598-019-46649-z
State: Published - Dec 1 2019
Externally published: Yes

Fingerprint

Crohn Disease
Genome
Logistic Models
ROC Curve
Area Under Curve
Genome-Wide Association Study
Quality Control
Inborn Genetic Diseases
Nonlinear Dynamics
Machine Learning
Genetic Markers
Inflammatory Bowel Diseases
Genotype
Genes

All Science Journal Classification (ASJC) codes

  • General

Cite this

International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). / Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. In: Scientific reports. 2019 ; Vol. 9, No. 1.
@article{1918ba6d0d9b4f8b8ddf2c8e2ab50bd8,
title = "Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data",
abstract = "Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches incited us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC) has been re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistics. The impact of quality control (QC), imputing and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. At the opposite, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected near all the genetic variants previously identified by GWAS among the best predictors plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.",
author = "{International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)} and Alberto Romagnoni and Simon J{\'e}gou and {Van Steen}, Kristel and Gilles Wainrib and Hugot, {Jean Pierre} and Laurent Peyrin-Biroulet and Mathias Chamaillard and Colombel, {Jean Frederick} and Mario Cottone and Mauro D’Amato and Renata D’Inc{\`a} and Jonas Halfvarson and Paul Henderson and Amir Karban and Kennedy, {Nicholas A.} and Khan, {Mohammed Azam} and Marc L{\'e}mann and Arie Levine and Dunecan Massey and Monica Milla and Ng, {Sok Meng Evelyn} and Ioannis Oikonomou and Harald Peeters and Proctor, {Deborah D.} and Rahier, {Jean Francois} and Paul Rutgeerts and Frank Seibold and Laura Stronati and Taylor, {Kirstin M.} and Leif T{\"o}rkvist and Kullak Ublick and {Van Limbergen}, Johan and {Van Gossum}, Andre and Vatn, {Morten H.} and Hu Zhang and Wei Zhang and Andrews, {Jane M.} and Bampton, {Peter A.} and Murray Barclay and Florin, {Timothy H.} and Richard Gearry and Krupa Krishnaprasad and Lawrance, {Ian C.} and Gillian Mahy and Montgomery, {Grant W.} and Graham Radford-Smith and Roberts, {Rebecca L.} and Simms, {Lisa A.} and Katherine Hanigan and Brant, {Steve R.}",
year = "2019",
month = "12",
day = "1",
doi = "10.1038/s41598-019-46649-z",
language = "English (US)",
volume = "9",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
number = "1",

}

Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. / International Inflammatory Bowel Disease Genetics Consortium (IIBDGC).

In: Scientific reports, Vol. 9, No. 1, 10351, 01.12.2019.

TY - JOUR

T1 - Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

AU - International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)

AU - Romagnoni, Alberto

AU - Jégou, Simon

AU - Van Steen, Kristel

AU - Wainrib, Gilles

AU - Hugot, Jean Pierre

AU - Peyrin-Biroulet, Laurent

AU - Chamaillard, Mathias

AU - Colombel, Jean Frederick

AU - Cottone, Mario

AU - D’Amato, Mauro

AU - D’Incà, Renata

AU - Halfvarson, Jonas

AU - Henderson, Paul

AU - Karban, Amir

AU - Kennedy, Nicholas A.

AU - Khan, Mohammed Azam

AU - Lémann, Marc

AU - Levine, Arie

AU - Massey, Dunecan

AU - Milla, Monica

AU - Ng, Sok Meng Evelyn

AU - Oikonomou, Ioannis

AU - Peeters, Harald

AU - Proctor, Deborah D.

AU - Rahier, Jean Francois

AU - Rutgeerts, Paul

AU - Seibold, Frank

AU - Stronati, Laura

AU - Taylor, Kirstin M.

AU - Törkvist, Leif

AU - Ublick, Kullak

AU - Van Limbergen, Johan

AU - Van Gossum, Andre

AU - Vatn, Morten H.

AU - Zhang, Hu

AU - Zhang, Wei

AU - Andrews, Jane M.

AU - Bampton, Peter A.

AU - Barclay, Murray

AU - Florin, Timothy H.

AU - Gearry, Richard

AU - Krishnaprasad, Krupa

AU - Lawrance, Ian C.

AU - Mahy, Gillian

AU - Montgomery, Grant W.

AU - Radford-Smith, Graham

AU - Roberts, Rebecca L.

AU - Simms, Lisa A.

AU - Hanigan, Katherine

AU - Brant, Steve R.

PY - 2019/12/1

Y1 - 2019/12/1

N2 - Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches incited us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC) has been re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistics. The impact of quality control (QC), imputing and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. At the opposite, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected near all the genetic variants previously identified by GWAS among the best predictors plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.

UR - http://www.scopus.com/inward/record.url?scp=85069470428&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069470428&partnerID=8YFLogxK

U2 - 10.1038/s41598-019-46649-z

DO - 10.1038/s41598-019-46649-z

M3 - Article

AN - SCOPUS:85069470428

VL - 9

JO - Scientific Reports

JF - Scientific Reports

SN - 2045-2322

IS - 1

M1 - 10351

ER -