TY - GEN
T1 - Hypergraph partitioning for document clustering
T2 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM SIGIR 2008
AU - Tianming, Hu
AU - Hui, Xiong
AU - Wenjun, Zhou
AU - Sam, Yuan Sung
AU - Hangzai, Luo
PY - 2008
Y1 - 2008
N2 - Hypergraph partitioning has been considered as a promising method to address the challenges of high dimensionality in document clustering. With documents modeled as vertices and the relationship among documents captured by the hyperedges, the goal of graph partitioning is to minimize the edge cut. Therefore, the definition of hyperedges is vital to the clustering performance. While several definitions of hyperedges have been proposed, a systematic understanding of desired characteristics of hyperedges is still missing. To that end, in this paper, we first provide a unified clique perspective of the definition of hyperedges, which serves as a guide to define hyperedges. With this perspective, based on the concepts of hypercliques and shared (reverse) nearest neighbors, we propose three new types of clique hyperedges and analyze their properties regarding purity and size issues. Finally, we present an extensive evaluation using real-world document datasets. The experimental results show that, with shared (reverse) nearest neighbor based hyperedges, the clustering performance can be improved significantly in terms of various external validation measures without the need for fine tuning of parameters.
AB - Hypergraph partitioning has been considered as a promising method to address the challenges of high dimensionality in document clustering. With documents modeled as vertices and the relationship among documents captured by the hyperedges, the goal of graph partitioning is to minimize the edge cut. Therefore, the definition of hyperedges is vital to the clustering performance. While several definitions of hyperedges have been proposed, a systematic understanding of desired characteristics of hyperedges is still missing. To that end, in this paper, we first provide a unified clique perspective of the definition of hyperedges, which serves as a guide to define hyperedges. With this perspective, based on the concepts of hypercliques and shared (reverse) nearest neighbors, we propose three new types of clique hyperedges and analyze their properties regarding purity and size issues. Finally, we present an extensive evaluation using real-world document datasets. The experimental results show that, with shared (reverse) nearest neighbor based hyperedges, the clustering performance can be improved significantly in terms of various external validation measures without the need for fine tuning of parameters.
KW - Clique
KW - Document clustering
KW - Hypergraph partitioning
KW - Shared nearest neighbor
UR - http://www.scopus.com/inward/record.url?scp=57349195523&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57349195523&partnerID=8YFLogxK
U2 - 10.1145/1390334.1390548
DO - 10.1145/1390334.1390548
M3 - Conference contribution
AN - SCOPUS:57349195523
SN - 9781605581644
T3 - ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings
SP - 871
EP - 872
BT - ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings
Y2 - 20 July 2008 through 24 July 2008
ER -