Using sketches to estimate associations

Ping Li, Kenneth W. Church

Research output: Contribution to conference, Paper, peer-review

25 Scopus citations


We should not have to look at the entire corpus (e.g., the Web) to know if two words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. Not surprisingly, computational work and statistical accuracy (variance or errors) depend on the sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10^-4 may suffice.
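To make the idea concrete, here is a minimal Python sketch of estimating a two-word contingency table from sketches. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_sketch`, `estimate_cooccurrence`, `contingency_table`, `demo`) and all parameter values are hypothetical, and the co-occurrence count is estimated by a simple proportional scale-up rather than the maximum likelihood estimator described in the abstract. Each sketch keeps the k smallest document IDs of a posting list after a shared random permutation, so the permuted IDs up to the smaller of the two sketch maxima behave like a random sample of documents in which both posting lists are fully known.

```python
import random


def build_sketch(postings, perm, k):
    """Return the k smallest permuted document IDs of a posting list, sorted."""
    return sorted(perm[doc_id] for doc_id in postings)[:k]


def estimate_cooccurrence(sketch1, sketch2, num_docs):
    """Estimate how many documents contain both words.

    Assumption: a plain scale-up of the sampled count, swapped in for the
    maximum likelihood estimator mentioned in the abstract.
    """
    if not sketch1 or not sketch2:
        return 0.0
    # Both sketches are complete for permuted IDs 1..ds, so those IDs
    # act as a random sample of ds documents out of num_docs.
    ds = min(sketch1[-1], sketch2[-1])
    # Any ID present in both sketches is necessarily <= ds.
    a_s = len(set(sketch1) & set(sketch2))
    return a_s * num_docs / ds


def contingency_table(a, f1, f2, num_docs):
    """Fill the 2x2 table from the joint count a and the margins f1, f2."""
    return a, f1 - a, f2 - a, num_docs - f1 - f2 + a


def demo(num_docs=100_000, k=500, seed=0):
    """Hypothetical data: two words sharing at least 2,000 documents."""
    rng = random.Random(seed)
    ids = list(range(1, num_docs + 1))
    shuffled = ids[:]
    rng.shuffle(shuffled)
    perm = dict(zip(ids, shuffled))  # shared random permutation of doc IDs

    word1 = rng.sample(ids, 5_000)
    word2 = set(word1[:2_000]) | set(rng.sample(ids, 3_000))
    true_a = len(set(word1) & word2)

    s1 = build_sketch(word1, perm, k)
    s2 = build_sketch(word2, perm, k)
    return estimate_cooccurrence(s1, s2, num_docs), true_a
```

Calling `demo()` returns the estimated and the true co-occurrence counts for the synthetic data, and `contingency_table` then turns an estimate plus the two document frequencies into the four cell counts. Using the margins inside the estimator itself, as the abstract describes, should reduce the error relative to this scale-up.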


Other: Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, HLT/EMNLP 2005, Co-located with the 2005 Document Understanding Conference, DUC and the 9th International Workshop on Parsing Technologies, IWPT
City: Vancouver, BC

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

