Scaling up top-K cosine similarity search

Shiwei Zhu, Junjie Wu, Hui Xiong, Guoping Xia

Research output: Contribution to journalArticlepeer-review

34 Scopus citations

Abstract

Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy for developing a TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to more I/O costs. Therefore, we develop a max-first traversal strategy and propose a TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiencies of both TOP-DATA and TOP-MATA algorithms. Also, we show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items.

Original languageEnglish (US)
Pages (from-to)60-83
Number of pages24
JournalData and Knowledge Engineering
Volume70
Issue number1
DOIs
StatePublished - Jan 2011

All Science Journal Classification (ASJC) codes

  • Information Systems and Management

Keywords

  • Cosine similarity
  • Diagonal traversal strategy
  • Max-first traversal strategy
  • Similarity search

Fingerprint

Dive into the research topics of 'Scaling up top-K cosine similarity search'. Together they form a unique fingerprint.

Cite this