Query-aware partitioning for monitoring massive network data streams

Theodore Johnson, Shan Muthukrishnan, Vladislav Shkapenyuk, Oliver Spatscheck

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

Data Stream Management Systems (DSMSs) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently far exceeds the computational capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of the new OC-768 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high-speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning the query execution plans and partitioning the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load compared to a centralized system, and can even increase it. We present an alternative approach, query-aware data stream partitioning, that allows for more efficient scaling. We present methods for analyzing any given query set and choosing the optimal partitioning scheme, and show how to reconcile potentially conflicting requirements that different queries might place on partitioning. We conclude with experiments on a small cluster of processing nodes fed by a high-rate network traffic feed. Across different query sets, the experiments demonstrate that our methods effectively distribute the load across all processing nodes and facilitate efficient scaling whenever more processing nodes become available.
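The contrast between query-independent and query-aware partitioning can be illustrated with a minimal sketch. It is not Gigascope code and not the paper's algorithm: the packet tuples, the per-source byte-count query, the CRC32 hash, and every function name below are assumptions made for illustration. The idea shown is that when packets are routed by the query's grouping attribute, each group is aggregated on exactly one node, whereas round-robin routing scatters a group across nodes and leaves partial aggregates to be merged elsewhere.

```python
from collections import Counter, defaultdict
import zlib

# Hypothetical packet records: (src_ip, dst_ip, length_bytes).
PACKETS = [
    ("10.0.0.1", "10.0.0.9", 100),
    ("10.0.0.2", "10.0.0.9", 200),
    ("10.0.0.1", "10.0.0.8", 300),
    ("10.0.0.3", "10.0.0.9", 400),
    ("10.0.0.2", "10.0.0.7", 500),
    ("10.0.0.1", "10.0.0.7", 600),
]


def stable_hash(value: str) -> int:
    """Deterministic hash (CRC32), so routing is repeatable across runs."""
    return zlib.crc32(value.encode("utf-8"))


def partition_round_robin(packets, num_nodes):
    """Query-independent partitioning: ignore packet contents entirely."""
    nodes = [[] for _ in range(num_nodes)]
    for i, pkt in enumerate(packets):
        nodes[i % num_nodes].append(pkt)
    return nodes


def partition_by_key(packets, num_nodes, key_index):
    """Query-aware partitioning: route on the query's grouping attribute."""
    nodes = [[] for _ in range(num_nodes)]
    for pkt in packets:
        nodes[stable_hash(pkt[key_index]) % num_nodes].append(pkt)
    return nodes


def bytes_per_source(packets):
    """Example query: total bytes grouped by source address."""
    totals = Counter()
    for src, _dst, length in packets:
        totals[src] += length
    return totals


def groups_split_across_nodes(nodes):
    """Count groups whose partial aggregates sit on more than one node."""
    seen = defaultdict(set)
    for node_id, pkts in enumerate(nodes):
        for src in bytes_per_source(pkts):
            seen[src].add(node_id)
    return sum(1 for ids in seen.values() if len(ids) > 1)


def merge_local_results(nodes):
    """Combine per-node results by summing the per-node counters."""
    merged = Counter()
    for pkts in nodes:
        merged.update(bytes_per_source(pkts))
    return merged
```

With key-based routing, `groups_split_across_nodes` is zero by construction, so each node's local result is already final for its groups. With round-robin routing on this sample, some sources appear on both nodes and their partial sums still have to be combined downstream.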

Original language: English (US)
Title of host publication: SIGMOD 2008
Subtitle of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008
Pages: 1135-1146
Number of pages: 12
DOI: https://doi.org/10.1145/1376616.1376730
State: Published - Dec 10 2008
Event: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08 - Vancouver, BC, Canada
Duration: Jun 9 2008 to Jun 12 2008

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
Country: Canada
City: Vancouver, BC
Period: 6/9/08 to 6/12/08

Fingerprint

Monitoring
Processing
Servers
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Cite this

Johnson, T., Muthukrishnan, S., Shkapenyuk, V., & Spatscheck, O. (2008). Query-aware partitioning for monitoring massive network data streams. In SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008 (pp. 1135-1146). [1376730] (Proceedings of the ACM SIGMOD International Conference on Management of Data). https://doi.org/10.1145/1376616.1376730
@inproceedings{3b38cfb32a0c425f99c999c146f2d9b6,
title = "Query-aware partitioning for monitoring massive network data streams",
author = "Theodore Johnson and Shan Muthukrishnan and Vladislav Shkapenyuk and Oliver Spatscheck",
year = "2008",
month = "12",
day = "10",
doi = "10.1145/1376616.1376730",
language = "English (US)",
isbn = "9781605581026",
series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
pages = "1135--1146",
booktitle = "SIGMOD 2008",

}

