TY - GEN
T1 - Theory of the GMM Kernel
AU - Li, Ping
AU - Zhang, Cun-Hui
N1 - Publisher Copyright:
© 2017 International World Wide Web Conference Committee (IW3C2)
PY - 2017
Y1 - 2017
AB - In web search, data mining, and machine learning, two popular measures of data similarity are the cosine and the resemblance (the latter is for binary data). In this study, we develop theoretical results for both the cosine and the GMM (generalized min-max) kernel [26], which is a generalization of the resemblance. GMM has direct applications in machine learning as a positive definite kernel and can be efficiently linearized via probabilistic hashing to handle big data. Owing to their discrete nature, the hashed values can also be used to build hash tables for efficient near neighbor search. We prove the theoretical limit of GMM and the corresponding consistency result, assuming that the data follow an elliptical distribution, a general family of distributions that includes the multivariate normal and the t-distribution as special cases. The consistency result holds as long as the data have a bounded first moment (an assumption which typically holds for data commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. We also prove the limit of the cosine under elliptical distributions. In comparison, the consistency of GMM requires much weaker conditions. For example, when data follow a t-distribution with ν degrees of freedom, GMM typically provides a better estimate of similarity than the cosine when ν < 8 (ν = 8 means the distribution is very close to normal). These theoretical results help explain the recent success of GMM and lay the foundation for further research.
UR - http://www.scopus.com/inward/record.url?scp=85029074760&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85029074760&partnerID=8YFLogxK
U2 - 10.1145/3038912.3052679
DO - 10.1145/3038912.3052679
M3 - Conference contribution
AN - SCOPUS:85029074760
SN - 9781450349130
T3 - 26th International World Wide Web Conference, WWW 2017
SP - 1053
EP - 1062
BT - 26th International World Wide Web Conference, WWW 2017
PB - International World Wide Web Conferences Steering Committee
T2 - 26th International World Wide Web Conference, WWW 2017
Y2 - 3 April 2017 through 7 April 2017
ER -