Learning a complex metabolomic dataset using random forests and support vector machines

Young Truong, Xiaodong Lin, Chris Beecher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Scopus citations

Abstract

Metabolomics is the omics science of biochemistry. The associated data include quantitative measurements of all small-molecule metabolites in a biological sample. These datasets provide a window into dynamic biochemical networks and, together with other omic data on genes and proteins, have great potential to unravel complex human diseases. The dataset used in this study comprises 63 individuals, either normal or diseased; the diseased individuals are either drug-treated or untreated, giving three classes. The goal is to classify these individuals using the observed levels of 317 measured metabolites. There are a number of statistical challenges: the data are non-normal; the number of samples is smaller than the number of metabolites; data are missing, and the fact that they are missing is itself informative (assay values below detection limits can point to a specific class); and the metabolites are highly correlated. We investigate support vector machines (SVM) and random forests (RF) for outlier detection, variable selection, and classification. We use the variables selected with RF in SVM and vice versa. The benefit of this study is insight into the interplay of variable selection and classification methods. We link our selected predictors to the biochemistry of the disease.
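The abstract notes that missingness is itself informative, since values below the detection limit can point to a class. One common way to keep that signal available to a classifier is to fill each missing value and append a 0/1 "was missing" flag per metabolite. The sketch below illustrates that idea only; it is not the authors' procedure. The function name `add_missing_indicators`, the half-minimum fill rule, and the toy values are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def add_missing_indicators(
    rows: Sequence[Sequence[Optional[float]]],
) -> List[List[float]]:
    """Fill missing values and append one 0/1 missingness flag per column.

    Assumption (not from the paper): a missing value is treated as
    "below detection limit" and filled with half of the smallest
    observed value in that column.
    """
    n_cols = len(rows[0])

    # Per-column fill value: half the smallest observed measurement.
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        fills.append(min(observed) / 2 if observed else 0.0)

    encoded = []
    for r in rows:
        values = [fills[j] if r[j] is None else float(r[j]) for j in range(n_cols)]
        flags = [1.0 if r[j] is None else 0.0 for j in range(n_cols)]
        # Output layout: filled values first, then the missingness flags.
        encoded.append(values + flags)
    return encoded


# Toy data (hypothetical): 3 samples x 2 metabolites, None = not detected.
samples = [
    [4.0, None],
    [2.0, 8.0],
    [None, 6.0],
]
encoded = add_missing_indicators(samples)
```

The encoded rows could then be passed to any classifier, such as an RF or SVM implementation, so that the pattern of non-detection is available as a predictor alongside the measured levels.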

Original language: English (US)
Title of host publication: KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 835-840
Number of pages: 6
ISBN (Print): 1581138881, 9781581138887
State: Published - 2004
Externally published: Yes
Event: KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Seattle, WA, United States
Duration: Aug 22 2004 to Aug 25 2004

Publication series

Name: KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

Other: KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/Territory: United States
City: Seattle, WA
Period: 8/22/04 to 8/25/04

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Keywords

  • Metabolomics
  • Missing Data
  • Random Forest
  • Support Vector Machines
