A spectral algorithm for learning class-based n-gram models of natural language

Karl Stratos, Do Kyum Kim, Michael Collins, Daniel Hsu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

23 Scopus citations

Abstract

The Brown clustering algorithm (Brown et al., 1992) is widely used in natural language processing (NLP) to derive lexical representations that are then used to improve performance on various NLP problems. The algorithm assumes an underlying model that is essentially an HMM, with the restriction that each word in the vocabulary is emitted from a single state. A greedy, bottom-up method is then used to find the clustering; this method does not have a guarantee of finding the correct underlying clustering. In this paper we describe a new algorithm for clustering under the Brown et al. model. The method relies on two steps: first, the use of canonical correlation analysis to derive a low-dimensional representation of words; second, a bottom-up hierarchical clustering over these representations. We show that given a sufficient number of training examples sampled from the Brown et al. model, the method is guaranteed to recover the correct clustering. Experiments show that the method recovers clusters of comparable quality to the algorithm of Brown et al. (1992), but is an order of magnitude more efficient.

Original languageEnglish (US)
Title of host publicationUncertainty in Artificial Intelligence - Proceedings of the 30th Conference, UAI 2014
EditorsNevin L. Zhang, Jin Tian
PublisherAUAI Press
Pages762-771
Number of pages10
ISBN (Electronic)9780974903910
StatePublished - 2014
Externally publishedYes
Event30th Conference on Uncertainty in Artificial Intelligence, UAI 2014 - Quebec City, Canada
Duration: Jul 23 2014Jul 27 2014

Publication series

NameUncertainty in Artificial Intelligence - Proceedings of the 30th Conference, UAI 2014

Other

Other30th Conference on Uncertainty in Artificial Intelligence, UAI 2014
Country/TerritoryCanada
CityQuebec City
Period7/23/147/27/14

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'A spectral algorithm for learning class-based n-gram models of natural language'. Together they form a unique fingerprint.

Cite this