infolab.stanford.edu

Stanford Stream Data Manager

Motivation

In applications such as network monitoring, telecommunications data management, clickstream monitoring, manufacturing, sensor networks, and others, data takes the form of continuous data streams rather than finite stored data sets, and clients require long-running continuous queries as opposed to one-time queries. Traditional database systems and data processing algorithms are ill-equipped to handle complex and numerous continuous queries over data streams, and many aspects of data management and processing need to be reconsidered in their presence. In the STREAM project, we are reinvestigating data management and query processing in the presence of multiple, continuous, rapid, time-varying data streams. We are attacking problems ranging from basic theory results to algorithms to implementing a comprehensive prototype data stream management system.

The STREAM project has been supported in part by the National Science Foundation under grants IIS-0118173, IIS-9817799, IIS-0324431, and IIS-1098447.

News

(January 2006) The STREAM project has officially wound down. Key students have finished their Ph.D.'s and left Stanford, while others have moved on to new and exciting different research topics. (Not to be taken as an indication that research in streams and continuous queries is "done" -- many topics are still wide open.) We will endeavor to continue answering questions and supporting the publicly available system, however resources are minimal at this stage.
(ongoing) We maintain a Stream Query Repository as a resource for researchers in data streams.
(March 2004) A new overview paper on the STREAM project is available: STREAM: The Stanford Data Stream Management System. It will appear in a book on data stream management edited by Garofalakis, Gehrke, and Rastogi.
(August 2003) The latest informal get-together of data streams research groups was hosted by David Maier's group at OGI in Portland. Meeting notes are available from the Stream Team Web page. The next Stream Team meeting is planned for spring '04, location TBD. [Summer '04: It was held in Berkeley, but the notes are in an indefinite state of "being assembled".]
(May 2003) We instantiated the Linear Road Benchmark with a detailed suite of CQL schemas and queries.

System Availability: Internet System and Source Code

Our prototype Data Stream Management System is available for public use. You may try the system over the internet, or you may download the code.

Internet System

You can try the STREAM prototype without downloading and installing the source -- we start up a server at Stanford and a client on your machine. CLICK HERE to give it a try, and don't forget to read the Online Help so you know what you're doing. If the server above is overloaded or unavailable, try this one instead.

Using the internet-accessible system you can try out much of the prototype's functionality: a few predefined streams and queries (including the Linear Road Benchmark), registering new streams and queries, visualizing and monitoring query plans, and other features.

Source Code

If you're interested in installing the prototype system at your own site, the source code is available for download. This release contains the code for the main STREAM server and for a GUI client to interact with the server over a network. The server can also be used as a library, and directly linked from a C++ program. The code release includes a fairly comprehensive user manual with detailed information about the functionality and design of the STREAM prototype.

People

Faculty

Students (grad and undergrad)

Alums

John Cieslewicz
Mayur Datar
Keith Ito
Jon McAlister
Itaru Nishizawa (Hitachi, Ltd.)
Justin Rosenstein
Vitthal Srinivasan
Dilys Thomas

Talks

The Stanford Data Stream Management System.
Jennifer Widom's standard talk on the project, updated January 2006
- Slides in PowerPoint
- Slides in pdf
CQL: A Language for Continuous Queries over Streams and Relations.
Invited talk given by Jennifer Widom at the 2003 DBPL Workshop (Sept. 2003)
- Slides in PowerPoint
- Slides in pdf
Randomization for Massive and Streaming Data Sets.
Talk given by Rajeev Motwani at the Stanford Computer Science Forum - Annual Affiliates Meeting (May 2003)
- Slides in PowerPoint
Models and Issues in Data Stream Systems.
Invited talk given by Rajeev Motwani at the 2002 PODS Conference (June 2002)
- Slides in PowerPoint

Demos

The STREAM prototype Data Stream Management System is available here to try over the Internet. You may also download the source code here.
We demo'd StreaMon, the adaptive query processing engine of the STREAM system, at the ACM SIGMOD 2004 Conference. Here was the demonstration description.
We demo'd the STREAM system and its dynamic visualizer at the ACM SIGMOD 2003 Conference. Here was the demonstration proposal.

Papers

Overviews and Surveys

The STREAM Group. Stanford Data Stream Management System (latest overview paper)
To appear in a book on data stream management edited by Garofalakis, Gehrke, and Rastogi.
The STREAM Group. STREAM: The Stanford Stream Data Manager (short overview paper)
IEEE Data Engineering Bulletin, March 2003
R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System
In Proc. of CIDR 2003, Jan. 2003
This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language, and it copes with high data rates and query workloads by providing approximate answers when resources are limited. This paper describes specific contributions made so far and enumerates our next steps in developing a general-purpose Data Stream Management System.
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems
Invited paper in Proc. of PODS 2002, June 2002
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
S. Babu and J. Widom. Continuous Queries over Data Streams
In SIGMOD Record, Sep. 2001
We specify a general and flexible framework for query processing in the presence of data streams. The framework captures most previous work on continuous queries and data streams, as well as subsuming related concepts such as triggers and materialized views. We further map out problems, techniques, and challenges in processing continuous queries over data streams.

System

U. Srivastava and J. Widom. Flexible Time Management in Data Stream Systems
In Proc. of PODS 2004, June 2004
Flexible application-defined time poses challenges to a Data Stream Management System, since streams may be out of order and uncoordinated with each other, they may incur latency reaching the DSMS, and they may pause or stop. We formalize these challenges and specify how to generate heartbeats so that queries can be evaluated correctly and continuously in an application-defined time domain. Our heartbeat generation algorithm is based on parameters capturing skew between streams, unordering within streams, and latency in streams reaching the DSMS. We also describe how to estimate these parameters at run-time, and we discuss how heartbeats can be used for processing continuous queries.
B. Babcock, M. Datar, and R. Motwani. Load Shedding for Aggregation Queries over Data Streams
In Proc. of ICDE 2004, March 2004
We present load shedding techniques for a restricted class of stream queries: Aggregation queries over sliding windows, possibly with selections, projections and foreign key joins with stored relations. We present optimal solutions for placing load shedders (operators which randomly drop tuples) in the query plan, which reduce the load on the system below the required threshold, while minimizing the inaccuracy introduced in the queries.
D. Thomas and R. Motwani. Caching Queues in Memory Buffers
In Proc. of SODA 2004, Jan. 2004
We study the problem of maintaining queues in a cache, which occurs in a number of important settings like DataStream systems, Network Router design and Distributed Messaging services. We analyze why DataStream systems built on top of buffer managers that use traditional algorithms like LRU perform badly. We provide online competitive algorithms for this problem for different interesting cost models.
B. Babcock, S. Babu, M. Datar, R. Motwani, and D. Thomas. Operator Scheduling in Data Stream Systems
To appear in VLDB Journal, 2005
This paper is an extended version of our paper titled "Chain: Operator Scheduling for Memory Minimization in Data Stream Systems" that appeared in the proceedings of SIGMOD 2003. This paper extends the Chain operator-scheduling strategy proposed in the SIGMOD paper to minimize run-time memory requirements subject to user-specified latency constraints. This paper also proves an NP-completeness result showing the intractability of the problem of minimizing run-time memory requirements in the stream setting.

Query Language

A. Arasu and J. Widom. A Denotational Semantics for Continuous Queries over Streams and Relations
In SIGMOD Record, Sep. 2004
We present formal, denotational semantics for a generic continuous query language based on streams, time-varying relations, and three classes of operators over streams and relations.
A. Arasu, S. Babu and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution
To appear in VLDB Journal, 2005
We first present an abstract semantics for continuous queries over streams based on several building blocks: formal definitions for streams and relations, mappings among them, and any relational query language. We then propose a concrete language, CQL (for Continuous Query Language), which instantiates the abstract semantics using SQL as the relational query language and window specifications derived from SQL-99 to map from streams to relations. Finally, we present the implementation of CQL in the STREAM prototype, describing query execution plans, operators, inter-operator queues, synopses, and sharing of data and computation among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road Benchmark recently proposed for Data Stream Management Systems.
Note: A preliminary, much shorter version of this paper appeared as the November 2002 Technical Report An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations, and an even shorter version titled "CQL: A Language for Continuous Queries over Streams and Relations" appeared as an invited paper in the DBPL workshop, Sep. 2003.

Query Processing

K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous Queries with Shared Expensive Filters
Technical Report, Nov. 2005
We consider the problem of optimizing and executing multiple continuous queries, where each query is a conjunction of filters and each filter may occur in multiple queries. When filters are expensive, significant performance gains are achieved by sharing filter evaluations across queries. A shared execution strategy in our scenario can either be fixed, in which filters are evaluated in the same predetermined order for all input, or adaptive, in which the next filter to be evaluated is chosen at runtime based on the results of the filters evaluated so far. We show that as filter costs increase, the best adaptive strategy is superior to any fixed strategy, despite the overhead of adaptivity. We show that it is NP-hard to find the optimal adaptive strategy, even if we are willing to approximate within any factor smaller than logarithmic in the number of queries. We present a greedy execution strategy and show that it approximates the best adaptive strategy to within a factor polylogarithmic in the number of queries and filters. We also show how the execution overhead of adaptive strategies can be reduced by appropriate precomputation. Finally, we present an experimental evaluation demonstrating the effectiveness of our techniques.
U. Srivastava, K. Munagala and J. Widom. Operator Placement for In-Network Query Processing
In Proc. of PODS 2005, June 2005
In this paper we consider the problem of executing queries over a network of nodes where data is acquired at low-capability sensors and then transmitted through a hierarchy of nodes having progressively increasing network bandwidth and computational power. The goal is to perform ``in-network'' query processing and to decide the placement of the query plan operators such that the total cost of computation and transmission is minimized. We give optimal operator-placement algorithms for queries involving possibly expensive conjunctive filters, and joins.
S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive Caching for Continuous Queries
In Proc. of ICDE 2005, April 2005
We study the problem of using caches to improve performance and adaptivity in continuous multiway joins. We propose different cache types and algorithms for cache maintenance, monitoring cache cost and benefits, selecting caches to use, allocating memory to caches, and adapting over the entire spectrum between stateless MJoins and cache-rich join trees as stream and system conditions change. Although we focus on joins, our algorithms generalize easily to query plans composed of of one or more operator pipelines, and to any number of such query plans.
K. Munagala, S. Babu, R. Motwani, and J. Widom. The Pipelined Set Cover Problem
In Proc. of ICDT 2005, Jan. 2005
A classical problem in query optimization is to find the optimal ordering of a set of possibly correlated selections or joins. We provide an abstraction of this problem as a generalization of set cover called pipelined set cover, where the sets are applied sequentially to the elements to be covered and the elements covered at each stage are discarded. We show that several natural heuristics for this NP-hard problem, such as the greedy set-cover heuristic and a local-search heuristic, can be analyzed using a linear-programming framework which bounds not only the approximation ratio, but also the running time of the corresponding algorithms. We also consider the online version of pipelined set cover and present a competitive algorithm with a logarithmic performance guarantee.
A. Arasu and J. Widom. Resource Sharing in Continuous Sliding-Window Aggregates
In Proc. of VLDB 2004, Sep. 2004
We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address sliding-window aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (time-based, tuple-based, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through a thorough experimental study.
U. Srivastava and J. Widom. Memory-Limited Execution of Windowed Stream Joins
In Proc. of VLDB 2004, Sep. 2004
We address the problem of computing approximate answers to sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. The objective of the approximation may be either to return a maximum-size subset of the result or a random sample of the result. We introduce a new age-based model of stream arrival that is often more appropriate for addressing these problems than the traditional frequency-based model used in previous work. We also provide an algorithm for optimal memory allocation across multiple joins being executed in the system.
S. Babu, U. Srivastava, and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams
In ACM TODS, Sep. 2004
We introduce the important concept of k-constraints, which are likely to hold in data stream environments even when strict constraints do not hold. We demonstrate how to incorporate k-constraints into a data stream query processor in order to reduce memory overhead for continuous queries. We show empirically that k-constraints are very effective at reducing the memory requirement in a wide variety of SPJ queries and that these constraints can be monitored and exploited with very low computational overhead.
S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive Ordering of Pipelined Stream Filters
In Proc. of SIGMOD 2004, June 2004
We consider the problem of pipelined filters, where a continuous stream of elements is processed by a set of commutative filters. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) We identify and study a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity.
A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams
In ACM TODS, March 2004.
We consider conjunctive queries with arithmetic comparisons over multiple continuous data streams. We specify an algorithm for determining whether or not a query can be evaluated using a bounded amount of memory for all possible instances of the data streams. When a query can be evaluated using bounded memory, our algorithm produces an evaluation plan based on constant-sized synopses of the data streams.
Note: This paper is an extension of the paper of the same name that appeared in PODS 2002.
U. Srivastava, S. Babu, and J. Widom. Monitoring Stream Properties for Continuous Query Processing (short paper)
In Proc. of MPDS 2003, June 2003

Distributed Streams

C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries over Distributed Data Streams
In Proc. of SIGMOD 2003, June 2003
We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique for reducing the overhead. Users register continuous queries with precision requirements at the central stream processor, which installs filters at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times.
B. Babcock and C. Olston. Distributed Top-K Monitoring
In Proc. of SIGMOD 2003, June 2003
We study a useful class of queries that continuously report the k largest values obtained from distributed data streams ("top-k monitoring queries"), which are of particular interest because they can be used to reduce the overhead incurred while running other types of monitoring queries. We show that transmitting entire data streams is unnecessary to support these queries and present an alternative approach that reduces communication significantly. In our approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a user-specified error tolerance.

Statistics

A. Arasu and G. Manku. Approximate Counts and Quantiles over Sliding Windows
In Proc. of PODS 2004, June 2004
We consider the problem of maintaining approximate counts and quantiles over fixed- and variable-size sliding windows in limited space. For quantiles, we present deterministic algorithms whose space requirements are O(1/e log(1/e)log N) and O(1/e log(1/e) log(eN) log N) in the worst-case for fixed- and variable-size windows, respectively, where N denotes the current number of elements in the window and e, the relative error. Our space bounds improve upon the previous best bounds of O(1/e^2 polylog (1/e,N)). For counts, we present both deterministic and randomized algorithms. The deterministic algorithms require O(1/e log^2 (1/e)) and O(1/e log^2 (1/e) log eN) for worst-case space for fixed- and variable-size windows, respectively, while the randomized ones require O(1/e log (1/(e d))) and O(1/e log(1/(ed)) log eN) worst-case space, where d denotes the probability of failure. We believe no previous work on space-efficient approximate counts for sliding windows exists.
B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining Variance and k-Medians over Data Stream Windows
In Proc. of PODS 2003, June 2003
We present a novel technique for solving two important and related problems in the sliding window model -- maintaining variance and maintaining k-medians clustering.
M. Datar and S. Muthukrishnan. Estimating Rarity and Similarity over Data Stream Windows
In Proc. of European Symposium of Algorithms, Sep. 2002
We present solutions to two problems in the sliding window model: estimating rarity and similarity over streams. The rarity of a stream is defined as the ratio of the number of elements that occur exactly once to the number of distinct elements. The similarity between two streams is defined as the similarity between the sets of distinct elements seen over the two streams; the ratio of the size of the intersection to the size of the union.
G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing Data Streams Using Hamming Norm
In Proc. of VLDB 2002, Aug. 2002
We present solution to the problem of computing Hamming norm over data streams. Hamming norm computation is more general than the well studied distinct value estimation problem. Our solution uses sketching techniques and works in the presence of inserts and deletes.
G. S. Manku and R. Motwani. Approximate Frequency Counts over Streaming Data
In Proc. of VLDB 2002, Aug. 2002
This paper present algorithms for computing frequency counts exceeding a user-specified threshold over data streams. Some applications are also presented.
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining Stream Statistics Over Sliding Windows
In SIAM Journal on Computing, Vol. 31 No. 6
We consider the problem of maintaining statistics over sliding windows. We design data structures with small memory requirements and provide matching lower bounds.
Note: This paper is an extension of the paper of the same name that appeared in SODA 2002.
B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window Over Streaming Data
In Proc. of SODA 2002, Jan. 2002
We introduce the problem of sampling from a moving window of recent items from a data stream and develop the "chain-sample" and "priority-sample" algorithms for this problem.

Clustering

S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams: Theory and Practice
IEEE Trans. on Knowledge and Data Engineering, vol. 15 (2003)
Under the data stream model, the data set to be processed is assumed to be too large to be processed together in RAM, and to be only accessible via linear scans, so that, for example, random access is unavailable. This model has recently attracted attention for its applicability to numerous types of data, including telephone records, web documents and clickstreams. For algorithms designed to analyze such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe a one-pass, memory-efficient streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. High-Performance Clustering of Streams and Large Data Sets
In Proc. of ICDE 2002, Feb. 2002
We give innovative techniques to transform theoretically well-founded algorithms for clustering into ones that perform well in practice. We further show that their performance is competitive with popular empirical approaches for clustering data streams.
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams
In Proc. of FOCS 2000, Nov. 2000
We study clustering under the data stream model of computation where given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far using little memory and time.

Applications

S. Babu, L. Subramanian, and J. Widom. A Data Stream Management System for Network Traffic Management
In Proc. of NRDM 2001, May 2001
In this short position paper, we describe the demands of network traffic management applications and we discuss how a Data Stream Management System can provide a general and scalable platform for deploying these applications.

Last modified: January 5, 2006