Making sense of the fast data and stream processing conundrum

Shashank Baravani
5 min read · Jun 9, 2018


With the mushrooming of streaming frameworks, I believe there is way too much literature around some of the underlying constructs. Fast data, streaming, bounded, windowing, unbounded, reactive…you get the drift. This blog attempts to bring some method to the madness by characterizing the data and the systems that process it.

1. What does the data look like? Data may be bounded or unbounded.

1.1 In simple terms, bounded data is a slice of data in time, typically consumed with a fixed periodicity and at high latency, e.g. seller transaction data used to create monthly reports.

1.2 By unbounded data we mean data that is essentially an infinite stream of events and is consumed as it arrives, e.g. clickstream data from the users of a website.

1.3 Bounded data can also be viewed as a finite subset of unbounded data, depending on the computation strategy adopted. For example, clickstream data can also be treated as a series of micro-batches, as sketched below.
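To make 1.3 concrete, here is a minimal sketch in plain Python (not tied to any particular framework) of carving an unbounded event stream into bounded micro-batches. The event schema and batch size are illustrative assumptions, not taken from a real system.

```python
import itertools
import random
import time

def clickstream():
    """Simulate an unbounded stream of click events (hypothetical schema)."""
    while True:
        yield {"user_id": random.randint(1, 100), "ts": time.time()}

def micro_batches(stream, batch_size):
    """Carve bounded micro-batches out of an unbounded stream."""
    while True:
        yield list(itertools.islice(stream, batch_size))

# Consume only the first three micro-batches of five events each.
for batch in itertools.islice(micro_batches(clickstream(), 5), 3):
    print(len(batch), "events in this micro-batch")
```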

2. How do we process this data? Data processing is broadly either batch or streaming, with variations.

2.1 Batch processing on a bounded data set, e.g. offline analytics to compute trends across e-commerce product collections.

2.2 Batch processing on unbounded data, which requires fixed, sliding or session windows to carve out micro-batches of data, e.g. user web session analysis for a social networking website.

2.3 Stream processing on unbounded data, e.g. trending tweets or hot-selling products. Even though the system runs perpetually, it still has to employ windowing in order to process data in temporal chunks. This means it has to account for the event-time to processing-time lag as well as the overall ordering of events for the sake of consistency and correctness (see the sketch below).
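To illustrate the windowing referred to in 2.2 and 2.3, here is a minimal plain-Python sketch that assigns events to fixed (tumbling) one-minute windows keyed on event time. The event fields and window size are assumptions for illustration, not framework code.

```python
from collections import defaultdict

WINDOW_SIZE_SEC = 60  # fixed one-minute windows

def window_start(event_ts, size=WINDOW_SIZE_SEC):
    """Truncate an event timestamp to the start of its fixed window."""
    return event_ts - (event_ts % size)

def count_per_window(events):
    """Count events per product per one-minute event-time window."""
    counts = defaultdict(int)
    for ev in events:
        key = (window_start(ev["ts"]), ev["product_id"])
        counts[key] += 1
    return counts

events = [
    {"ts": 100, "product_id": "p1"},
    {"ts": 130, "product_id": "p1"},
    {"ts": 190, "product_id": "p2"},  # falls in the next window
]
print(count_per_window(events))
```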

3. What results are we looking to produce? The outcome of data processing may be time-bound or time-agnostic, and an accurate or an approximate computation of the result.

3.1 Data may be processed as a series of time-agnostic micro-batches, usually to perform filtering operations or inner joins on the data. For example, a bus booking website that identifies failed bookings for immediate manual follow-up, or filtering failure alerts so that action is taken on the most critical ones, such as hardware crashes.

3.2 Data may be processed as a series of event-time ordered micro-batches. For example, in a bank or a payments company, the data comprising multiple ordered events pertaining to a single transaction: SUBMITTED, IN PROCESS, ON HOLD, SUCCESS, CANCEL, etc.

3.3 Data may be processed as a series of processing-time ordered micro-batches. For example, in an e-commerce company, order stream analysis to find the geographic clusters from which orders originate. A small sketch contrasting the first two cases follows below.
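Here is a minimal sketch contrasting a time-agnostic filter with event-time ordering: the filter needs no notion of time at all, while event-time ordering sorts by when each event actually happened rather than when it arrived. The record fields are hypothetical.

```python
bookings = [
    {"booking_id": 1, "status": "FAILED"},
    {"booking_id": 2, "status": "CONFIRMED"},
]
txn_events = [
    {"txn_id": "t1", "state": "SUCCESS", "event_ts": 300},
    {"txn_id": "t1", "state": "SUBMITTED", "event_ts": 100},
    {"txn_id": "t1", "state": "IN PROCESS", "event_ts": 200},
]

# Time-agnostic: a plain filter, no notion of time needed.
failed_bookings = [b for b in bookings if b["status"] == "FAILED"]

# Event-time ordered: sort by when the event actually happened,
# regardless of the order in which the events arrived.
ordered = sorted(txn_events, key=lambda e: e["event_ts"])

print(failed_bookings)
print([e["state"] for e in ordered])
```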

4. Are there any stateful computations? When are results calculated and materialized? The moment we introduce a window, the computation becomes stateful. For example, computing the average temperature reported by a sensor over the last hour is a stateful computation. The outcome of data processing is then the value of that state at a given point in time, and it has to be tied to a strategy for materializing results, which is a function of watermarks, triggers and accumulators.
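As an illustration of such a stateful computation, here is a small plain-Python sketch of the hourly sensor-average example; the state is simply the set of readings still inside the sliding one-hour window. The class and field names are my own, not from any framework.

```python
from collections import deque

WINDOW_SEC = 3600  # one-hour sliding window

class HourlyAverage:
    def __init__(self):
        self.readings = deque()  # (timestamp, temperature) pairs = the state

    def add(self, ts, temp):
        self.readings.append((ts, temp))
        # Evict readings that have slid out of the one-hour window.
        while self.readings and self.readings[0][0] < ts - WINDOW_SEC:
            self.readings.popleft()

    def value(self):
        """The materialized result: the value of the state right now."""
        if not self.readings:
            return None
        return sum(t for _, t in self.readings) / len(self.readings)

avg = HourlyAverage()
for ts, temp in [(0, 20.0), (1800, 22.0), (4000, 24.0)]:
    avg.add(ts, temp)
print(avg.value())  # only the readings within the last hour contribute
```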

4.1 Watermarks: temporal boundaries that define the completeness of observation of all past events up to that point in time. It's like saying, "I can confirm that there were only 50 events in the one-minute window from 15:10 to 15:11." Watermarks can be perfect boundaries or based on heuristics. A perfect watermark either loses late-arriving events if it advances too fast, or increases latency if it advances too slowly. A heuristic watermark, on the other hand, is an approximation that maximizes the observability of events without compromising on speed.

4.2 Triggers: a mechanism that declares the condition for emitting a result. Triggers can be periodic or based on certain conditions; they can also be compositional, i.e. a predicate over a set of child triggers. For example, triggers can be based on event count, event-time progress (on watermark boundaries), server-time progress, punctuations in the incoming stream (think EOF, \n), logical operators over multiple child triggers (AND, OR), etc.

4.3 Accumulators: define the relationships between multiple results observed within the same window and how state is carried forward to the next window. There are multiple strategies: discarding the previous state, accumulating, retracting plus accumulating, etc. For example, computing the top K selling products of all time employs accumulation.
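Putting 4.1 and 4.2 together, here is a toy sketch of a heuristic watermark that trails the maximum observed event time by an allowed lateness, with a trigger that emits a window's (accumulating) count once the watermark passes the window's end. All constants and names are illustrative assumptions, not from any framework.

```python
ALLOWED_LATENESS = 5   # seconds the watermark lags behind the max event time
WINDOW_SIZE = 60

def window_end(event_ts):
    return event_ts - (event_ts % WINDOW_SIZE) + WINDOW_SIZE

max_event_ts = 0
counts = {}            # window_end -> event count (accumulating state)
emitted = set()

def on_event(ev_ts):
    global max_event_ts
    counts[window_end(ev_ts)] = counts.get(window_end(ev_ts), 0) + 1
    max_event_ts = max(max_event_ts, ev_ts)
    watermark = max_event_ts - ALLOWED_LATENESS   # heuristic watermark
    # Trigger: emit every window whose end the watermark has passed.
    for w_end, count in counts.items():
        if w_end <= watermark and w_end not in emitted:
            emitted.add(w_end)
            print(f"window ending at {w_end}: {count} events")

# The watermark passes 60 after the event at 70, and 120 after the one at 130.
for ts in [10, 40, 59, 61, 70, 130]:
    on_event(ts)
```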

5. Is the state a continuously mutating state? If the state is carried over from one window to the next, then in addition to the accumulators described above, the system has to support checkpoints, i.e. periodic snapshots of the state (either full or incremental), so that it can recover from a failure and resume from the last computed value of the state. This can be accomplished using barriers that flow alongside the regular events without interrupting them. A barrier acts as a trigger and a reference point at which the system aligns all input streams, temporarily buffers newer events, snapshots the state and then resumes processing the newer events. The snapshotting can also be done asynchronously using copy-on-write principles.
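A toy sketch of the barrier idea on a single input stream: when the operator sees a barrier it snapshots its state before touching any later events, so a restart can resume from that snapshot. This is a deliberate simplification; a real system aligns barriers across multiple input streams and writes snapshots to durable storage.

```python
import copy

BARRIER = object()  # a marker that flows alongside regular events

class CountingOperator:
    def __init__(self):
        self.state = {"count": 0}
        self.checkpoints = []      # stand-in for durable snapshot storage

    def process(self, item):
        if item is BARRIER:
            # Snapshot the state (copy-on-write would avoid this full copy).
            self.checkpoints.append(copy.deepcopy(self.state))
        else:
            self.state["count"] += 1

op = CountingOperator()
for item in ["e1", "e2", BARRIER, "e3", BARRIER]:
    op.process(item)
print(op.state, op.checkpoints)   # {'count': 3} [{'count': 2}, {'count': 3}]
```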

With such a plethora of stream processing frameworks, how does one make an informed choice? Here is a list of traits that can help decide.

  • Streaming model: is it native or micro-batching? Native streaming generally means lower latency at lower throughput (Apache Storm, Samza, Flink), and the reverse holds for micro-batching (Storm Trident, Spark Streaming).
  • Guarantees: exactly-once (Storm Trident, Spark, Flink) versus at-least-once (Storm, Samza).
  • State management: except for the default flavor of Storm, all of the above support state management.
  • Primitives: how good are the functional primitives for stateful computations such as windowing, aggregation, triggering, watermarking, checkpointing, etc.?
  • Fault tolerance: how is it accomplished? Via record acks (Apache Storm), RDD-based checkpointing (Apache Spark), log-based recovery (Samza) or checkpoint snapshots (Flink)?

That concludes this blog post on stream processing. I hope it has cleared some of the air and that you, as a reader, are better equipped to model streaming problems.

