TY - GEN

T1 - On efficient low distortion ultrametric embedding

AU - Cohen-Addad, Vincent

AU - Karthik, C. S.

AU - Lagarde, Guillaume

N1 - Funding Information:
Karthik C. S. would like to thank the support of the Israel Science Foundation (grant number 552/16) and the Len Blavatnik and the Blavatnik Family Foundation. Guillaume Lagarde would like to thank the support of the DeepSynth CNRS Momentum project. This project benefited from state funding managed by the Agence Nationale de la Recherche under the generic call for projects JCJC 2018 program, reference ANR-18-CE40-0004-01.
Publisher Copyright:
© 37th International Conference on Machine Learning, ICML 2020.

PY - 2020

Y1 - 2020

AB - A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of n points in Ω(log n) dimensions incurs a quite prohibitive running time of Θ(n²). In this paper, we provide a new algorithm which takes as input a set of points P in R^d and, for every c ≥ 1, runs in time n^(1+ρ/c²) (for some universal constant ρ > 1) to output an ultrametric ∆ such that for any two points u, v in P, ∆(u, v) is within a multiplicative factor of 5c of the distance between u and v in the “best” ultrametric representation of P. Here, the best ultrametric is the ultrametric ∆̃ that minimizes the maximum distortion with respect to the ℓ₂ distance, namely that minimizes max_{u,v ∈ P} ∆̃(u, v)/‖u − v‖₂. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^(2−ε) can distinguish between inputs in the ℓ∞ metric that admit an isometric embedding and those that incur a distortion of 3/2. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.

UR - http://www.scopus.com/inward/record.url?scp=85105147464&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85105147464&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85105147464

T3 - 37th International Conference on Machine Learning, ICML 2020

SP - 2056

EP - 2066

BT - 37th International Conference on Machine Learning, ICML 2020

A2 - Daumé III, Hal

A2 - Singh, Aarti

PB - International Machine Learning Society (IMLS)

T2 - 37th International Conference on Machine Learning, ICML 2020

Y2 - 13 July 2020 through 18 July 2020

ER -