Histogramming data streams with fast per-item processing

Sudipto Guha, Piotr Indyk, S. Muthukrishnan, Martin J. Strauss

Research output: Chapter in Book/Report/Conference proceedingConference contribution

24 Scopus citations

Abstract

A vector A of length N can be approximately represented by a histogram H, by writing [0,N) as the non-overlapping union of B intervals Ij, assigning a value bj to Ij, and approximating A i by Hi = bj for i ∈ Ij. An optimal histogram representation Hopt consists of the choices of Ij and bj that minimize the sum-square-error ∥A - H∥22 = ∑i|Ai-H i|2. Numerous applications in statistics, signal processing and databases rely on histograms; typically B is (significantly) smaller than N and, hence, representing A by H yields substantial compression. We give a deterministic algorithm that approximates Hopt and outputs a histogram H such that ∥A -H∥22≤ (1 + ε) ∥A -Hopt22 Our algorithm considers the data items A0,A1,.. in order, i.e., in one pass, spends processing time O(1) per item, uses total space B poly(log(N), log ∥A∥, 1/ε ), and determines the histogram in time poly((B, log(N), log ∥A∥, 1/ε ). Our algorithm is suitable to emerging applications where signal is presented in a stream, size of the signal is very large, and one must construct the histogram using significantly smaller space than the signal size. In particular, our algorithm is suited to high performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., Ω(N), or worked longer, i.e., N log Ω(1)(N) total time over the N data items. Our algorithm is the first that simultaneously uses small space as well as runs fast, taking O(1) worst case time for per-item processing. In addition, our algorithm is quite simple.

Original languageEnglish (US)
Title of host publicationAutomata, Languages and Programming - 29th International Colloquium, ICALP 2002, Proceedings
EditorsPeter Widmayer, Stephan Eidenbenz, Francisco Triguero, Rafael Morales, Ricardo Conejo, Matthew Hennessy
PublisherSpringer Verlag
Pages681-692
Number of pages12
ISBN (Print)3540438645, 9783540438649
DOIs
StatePublished - 2002
Event29th International Colloquium on Automata, Languages, and Programming, ICALP 2002 - Malaga, Spain
Duration: Jul 8 2002Jul 13 2002

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume2380 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other29th International Colloquium on Automata, Languages, and Programming, ICALP 2002
CountrySpain
CityMalaga
Period7/8/027/13/02

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Keywords

  • Histograms
  • Streaming algorithms

Fingerprint Dive into the research topics of 'Histogramming data streams with fast per-item processing'. Together they form a unique fingerprint.

Cite this