TY - GEN
T1 - Robust communication-optimal distributed clustering algorithms
AU - Awasthi, Pranjal
AU - Bakshi, Ainesh
AU - Balcan, Maria-Florina
AU - White, Colin
AU - Woodruff, David P.
N1 - Funding Information:
This work was supported in part by NSF grants CCF-1422910, CCF-1535967, IIS-1618714, an Office of Naval Research (ONR) grant N00014-18-1-2562, an Amazon Research Award, a Microsoft Research Faculty Fellowship, and a National Defense Science & Engineering Graduate (NDSEG) fellowship. Part of this work was done while Ainesh Bakshi and David Woodruff were visiting the Simons Institute for the Theory of Computing.
Publisher Copyright:
© Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David Woodruff; licensed under Creative Commons License CC-BY
PY - 2019/7/1
Y1 - 2019/7/1
AB - In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond-worst-case analysis. Our main motivation is the following: for many applications, such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce a clustering that is close to the target clustering. Worst-case results can guarantee constant-factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely approximation stability [12], even when a constant fraction of the data points are outliers. The communication complexity is Õ(sk + z), where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show that this amount of communication cannot be improved even when the input satisfies various non-worst-case assumptions. We give a matching Ω(sk + z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant factor, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Ω(sk + z) is a communication bottleneck, even for real-world instances.
KW - Communication complexity
KW - Robust distributed clustering
UR - http://www.scopus.com/inward/record.url?scp=85069186837&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069186837&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.ICALP.2019.18
DO - 10.4230/LIPIcs.ICALP.2019.18
M3 - Conference contribution
AN - SCOPUS:85069186837
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019
A2 - Baier, Christel
A2 - Chatzigiannakis, Ioannis
A2 - Flocchini, Paola
A2 - Leonardi, Stefano
PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
T2 - 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019
Y2 - 9 July 2019 through 12 July 2019
ER -