Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization

Stephen L. France, J. Douglas Carroll, Hui Xiong

Research output: Contribution to journalArticlepeer-review

32 Scopus citations

Abstract

Previous work has shown that the Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. We extend this work. We frame statistical theory on the relationships between the Euclidean, cosine, and correlation distance metrics in terms of item neighborhoods. We discuss the differences between the cosine and correlation distance metrics and illustrate our discussion with an example from collaborative filtering. We introduce a family of normalized Minkowski metrics and test their use on both document data and synthetic data generated from the uniform distribution. We describe a range of criteria for testing neighborhood homogeneity relative to underlying latent classes. We discuss how these criteria are explicitly and implicitly linked to classification performance. By testing both normalized and non-normalized Minkowski-p metrics for multiple values of p, we separate out distance compression effects from normalization effects. For multi-class classification problems, we believe that distance compression on high dimensional data aids classification and data analysis. For document data, we find that the cosine (and normalized Euclidean), correlation, and proportioned city block metrics give strong neighborhood recovery. The proportioned city block metric gives particularly good results for nearest neighbors recovery and should be used when utilizing document data analysis techniques for which nearest neighborhood recovery is important. For data generated from the uniform distribution, neighborhood recovery improves as the value of p increases.

Original languageEnglish (US)
Pages (from-to)92-110
Number of pages19
JournalInformation Sciences
Volume184
Issue number1
DOIs
StatePublished - Feb 1 2012
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence

Keywords

  • Dimensionality
  • GINI
  • Latent classes
  • Minkowski metrics
  • Nearest neighbors
  • Normalization

Fingerprint Dive into the research topics of 'Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization'. Together they form a unique fingerprint.

Cite this