Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy for developing a TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to more I/O costs. Therefore, we develop a max-first traversal strategy and propose a TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiencies of both TOP-DATA and TOP-MATA algorithms. Also, we show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items.
All Science Journal Classification (ASJC) codes
- Information Systems and Management
- Cosine similarity
- Diagonal traversal strategy
- Max-first traversal strategy
- Similarity search