Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds

Sepehr Assadi, Vaggos Chatziafratis, Jakub Łacki, Vahab Mirrokni, Chen Wang

Research output: Contribution to journalConference articlepeer-review

4 Scopus citations

Abstract

The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta Dasgupta (2016). Assuming that the input is an n-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: • With O(n · polylog n) space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm Charikar and Chatziafratis (2017). • When the space is more limited, namely, n1−o(1), we prove that no algorithm can even estimate the value of the optimum hierarchical tree to within an o(logloglognn) factor, even when allowed polylog n passes over the input and exponential time. • In the most stringent setting of polylog n space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely, graphs that have an n1/2−o(1) factor gap between their HC objective value. • Finally, we prove that any single-pass streaming algorithm that computes an optimal HC clustering requires storing almost the entire input even if allowed exponential time. Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graphs can preserve the cost of “balanced” hierarchical trees to within a constant factor, and thus can be used in place of the original (dense) graphs when solving HC. Our lower bound results include a new streaming lower bound for a novel problem “One-vs-Many-Expanders”, which can be of independent interest.

Original languageEnglish (US)
Pages (from-to)4643-4702
Number of pages60
JournalProceedings of Machine Learning Research
Volume178
StatePublished - 2022
Event35th Conference on Learning Theory, COLT 2022 - London, United Kingdom
Duration: Jul 2 2022Jul 5 2022

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Keywords

  • Communication Complexity
  • Dasgupta’s Objective
  • Hierarchical Clustering
  • Lower Bounds
  • Streaming Algorithms
  • Sublinear Memory

Fingerprint

Dive into the research topics of 'Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds'. Together they form a unique fingerprint.

Cite this