Resampling-based similarity measures for high-dimensional data

Dhammika Amaratunga, Javier Cabrera, Yung Seop Lee

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

An important issue in classification is the assessment of sample similarity. This is nontrivial in high-dimensional or megavariate datasets, that is, datasets that comprise simultaneous measurements on thousands of features, many of which carry little or no information regarding consistent sample differences. Conventional similarity measures do not work particularly well for such data. As an alternative, we propose a similarity measure that is based on a resampling process: at each step of the process a random subset of features is selected and a cluster analysis is performed using only this subset; the relative frequency with which a pair of samples clusters together across several such random subsets forms the similarity measure. The features chosen at any step may be completely random or enriched by awarding the more informative features a higher chance of selection; this enrichment turns out to be particularly effective. We use actual datasets from the burgeoning genomics literature to demonstrate the superior performance of this similarity measure, especially its enriched form, compared to more conventional measures such as Euclidean distance or correlation, or, if the data are categorical, Hamming distance.
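
The resampling procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering method, subset size, or enrichment weights they use, so the k-means step, the square-root subset size, and the user-supplied `weights` vector are all assumptions, and the names `kmeans_labels` and `resampling_similarity` are hypothetical.

```python
import random


def kmeans_labels(rows, k, rng, n_iter=20):
    """Lloyd's k-means on equal-length numeric rows; returns one label per row.
    (Assumed clustering step; the abstract only says "a cluster analysis".)"""
    centers = [list(r) for r in rng.sample(rows, k)]  # k distinct rows as starting centers
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign each sample to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        # Move each center to the mean of its members; leave it in place if it has none.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def resampling_similarity(data, k=2, n_rounds=200, subset_size=None, weights=None, seed=0):
    """Fraction of rounds in which each pair of samples lands in the same cluster.

    data: list of samples, each a list of feature values.
    weights: optional per-feature selection weights (hypothetical stand-in for
             the "enriched" variant); None means uniformly random features.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    m = subset_size or max(1, int(p ** 0.5))  # assumed default subset size
    together = [[0] * n for _ in range(n)]
    for _ in range(n_rounds):
        if weights is None:
            feats = rng.sample(range(p), m)  # uniform, without replacement
        else:
            # Weighted draw with replacement, then de-duplicated.
            feats = sorted(set(rng.choices(range(p), weights=weights, k=m)))
        rows = [[sample[j] for j in feats] for sample in data]
        labels = kmeans_labels(rows, k, rng)
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    together[a][b] += 1
    return [[count / n_rounds for count in row] for row in together]


# Toy usage: two groups of three samples, six features each (made-up numbers).
toy = [[10.0 * g + 0.1 * i + 0.01 * j for j in range(6)] for g in range(2) for i in range(3)]
similarity = resampling_similarity(toy, k=2, n_rounds=50, seed=1)
print(similarity[0])
```

The returned matrix is symmetric with ones on the diagonal. Passing larger `weights` for features believed to be informative mimics the enriched form, although how the authors score informativeness is not stated in the abstract.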

Original language: English (US)
Pages (from-to): 54-62
Number of pages: 9
Journal: Journal of Computational Biology
Volume: 22
Issue number: 1
State: Published - Jan 1 2015

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Keywords

  • Deep sequencing
  • dissimilarity
  • feature selection
  • microarrays
  • similarity
  • supervised classification
  • unsupervised classification
