TY - GEN
T1 - On efficient low distortion ultrametric embedding
AU - Cohen-Addad, Vincent
AU - Karthik, C. S.
AU - Lagarde, Guillaume
N1 - Funding Information:
Karthik C. S. gratefully acknowledges the support of the Israel Science Foundation (grant number 552/16) and of Len Blavatnik and the Blavatnik Family Foundation. Guillaume Lagarde gratefully acknowledges the support of the DeepSynth CNRS Momentum project. This project benefited from state aid managed by the Agence Nationale de la Recherche under the generic call for projects program JCJC 2018, reference ANR-18-CE40-0004-01.
Publisher Copyright:
© 37th International Conference on Machine Learning, ICML 2020.
PY - 2020
Y1 - 2020
N2 - A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of n points in Ω(log n) dimensions incurs a quite prohibitive running time of Θ(n²). In this paper, we provide a new algorithm which takes as input a set of points P in ℝ^d and, for every c ≥ 1, runs in time n^(1+ρ/c²) (for some universal constant ρ > 1) to output an ultrametric ∆ such that for any two points u, v in P, ∆(u, v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric ∆̃ that minimizes the maximum distance distortion with respect to the ℓ₂ distance, namely the one that minimizes max_{u,v ∈ P} ∆̃(u, v)/‖u − v‖₂. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^(2−ε) can distinguish between inputs in the ℓ∞-metric that admit an isometric embedding and those that incur a distortion of 3/2. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
AB - A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of n points in Ω(log n) dimensions incurs a quite prohibitive running time of Θ(n²). In this paper, we provide a new algorithm which takes as input a set of points P in ℝ^d and, for every c ≥ 1, runs in time n^(1+ρ/c²) (for some universal constant ρ > 1) to output an ultrametric ∆ such that for any two points u, v in P, ∆(u, v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric ∆̃ that minimizes the maximum distance distortion with respect to the ℓ₂ distance, namely the one that minimizes max_{u,v ∈ P} ∆̃(u, v)/‖u − v‖₂. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^(2−ε) can distinguish between inputs in the ℓ∞-metric that admit an isometric embedding and those that incur a distortion of 3/2. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
UR - http://www.scopus.com/inward/record.url?scp=85105147464&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105147464&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85105147464
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 2056
EP - 2066
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daume, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
T2 - 37th International Conference on Machine Learning, ICML 2020
Y2 - 13 July 2020 through 18 July 2020
ER -