A few years back, I got my hands dirty building a quintessential Lambda architecture that involved computing a user affinity model for an e-commerce company. The objective was to understand users’ historical and recent tastes and surface interesting recommendations by pivoting around these interests. Check the image below for reference.
Since then, the technology landscape has evolved rapidly, particularly in favor of streaming systems, so much so that experts are already writing obituaries proclaiming that Lambda is dead. This blog post examines the veracity of such claims.
Lambda for User Affinity Model computation
The ingestion pipeline is responsible for ingesting all product- or store-centric user activity that happens in real time. Events include product page views, free text search queries (translated to store page views), product reviews, ratings, orders, cart adds, wishlist adds, etc. The events are streamed to Kafka and persisted in HBase.
The batch computation pipeline was responsible for computing the user affinity model, i.e. the user’s affinity to the catalogue expressed as a scored and sorted collection of stores, product clusters, product groups, etc. A model might look like this: [(Store=T-Shirt, brand=Nike, score=11.23), (Store=Mobiles, OS=Android, Price-Range=0–10K, score=5.17), …].
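The model shape above can be sketched in a few lines of Python. The field names (`store`, `attrs`, `score`) and the helper are illustrative assumptions, not the original schema:

```python
# Minimal sketch of a user affinity model entry. Field names are
# illustrative, not the production schema.
from typing import List, NamedTuple, Tuple

class Affinity(NamedTuple):
    store: str                          # catalogue store, e.g. "T-Shirt"
    attrs: Tuple[Tuple[str, str], ...]  # extra pivots, e.g. (("brand", "Nike"),)
    score: float                        # computed affinity score

def top_affinities(model: List[Affinity], k: int = 2) -> List[Affinity]:
    """Return the k highest-scored affinities, as served at query time."""
    return sorted(model, key=lambda a: a.score, reverse=True)[:k]

model = [
    Affinity("Mobiles", (("OS", "Android"), ("Price-Range", "0-10K")), 5.17),
    Affinity("T-Shirt", (("brand", "Nike"),), 11.23),
]
print(top_affinities(model))
```

At query time the serving layer would simply read such a sorted collection per user and pivot recommendations around the top entries.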
The heavy lifting was done by MapReduce jobs, which took advantage of the data locality principle by sourcing the user data from HDFS. The jobs ran daily, triggered by the arrival of batch data. The computed model was written back to HDFS, transformed to HFiles, and then ETLed into HBase, to be made available for serving at query time.
The realtime pipeline was designed as a reactive and asynchronous system responsible for staying up to date on the user’s most recent interests. This meant that every single user event (page views, searches, etc.) was ingested using Apache Storm and the corresponding model updated in real time. As the user continued their journey on the site, the system remained up to date on their most recent areas of interest along with their historic preferences.
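A hedged sketch of what the realtime update might look like. The event shape, the score bump, and the decay factor are assumptions for illustration, not the production logic:

```python
# Illustrative realtime update: slightly decay all existing scores, then
# boost the store the user just interacted with. Bump and decay values
# are assumed, not taken from the original system.
def apply_event(model, event, bump=1.0, decay=0.99):
    updated = {store: score * decay for store, score in model.items()}
    updated[event["store"]] = updated.get(event["store"], 0.0) + bump
    return updated

model = {"T-Shirt": 11.23, "Mobiles": 5.17}
model = apply_event(model, {"type": "page_view", "store": "Mobiles"})
```

In the actual system a Storm topology would apply such per-event updates against the model stored in HBase; the point is only that each update is cheap and local to one user.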
Challenges with this architecture
- Two execution paths: One ends up with two separate execution paths for streaming and batch. As is evident from the image above, dealing with such a plethora of clusters, components and frameworks is a maintenance nightmare.
- Two programming models: A side effect is that even the code bases tend to diverge, since code that executes in the batch world works on a large but finite data set, while a stream processing system works on an infinite event stream.
- Diverse skill sets: Evidently, more developers with diverse skill sets are needed just to manage the platform instead of focusing on core business problems.
Is Lambda dead?
The advocates of stream-only processing systems (also called the Kappa architecture) proclaim that, by going for a stream-only model (Spark, Flink, Beam, etc.), all of the above drawbacks of Lambda systems are taken care of. Engineers use the same code, skills and systems to solve for both the batch and the real time world. All of this rests on the hypothesis that batch is a special case of streaming, and hence a generic programming model can be devised that caters to both needs.
While it is not difficult to visualize batch as a subset of streaming, I disagree with the overgeneralization of this statement, and I would like to offer a rebuttal.
Full recompute vs incremental updates
Let’s say we had built the user affinity model computation on a so-called Kappa architecture, wherein we recompute the user model after every n events. It would have been much simpler, and this is how it would look.
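The recompute-after-every-n-events idea can be sketched as follows. The buffer, the tiny value of n, and the count-based recompute are all illustrative assumptions:

```python
# Stream-only ("Kappa") sketch: buffer events per user and fully
# recompute the model every N events. N is tiny here for illustration.
from collections import Counter, defaultdict

RECOMPUTE_EVERY = 3  # the "n" from the text; assumed value

buffers = defaultdict(list)
models = {}

def recompute(events):
    """Full recompute: score each store by raw event counts (assumed logic)."""
    return dict(Counter(e["store"] for e in events))

def on_event(user, event):
    buffers[user].append(event)
    if len(buffers[user]) % RECOMPUTE_EVERY == 0:
        models[user] = recompute(buffers[user])

for store in ("T-Shirt", "T-Shirt", "Mobiles"):
    on_event("u1", {"store": store})
print(models["u1"])
```

Note that every recompute touches the user’s entire event history, which is exactly where the trouble described below begins.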
Now, one may ask: why recompute instead of incremental updates? Well, herein lies the rub. Not all use cases are amenable to an incremental-update-only model. Some model computations are so heavy that they have to be split into a heavy recompute in the batch world plus a light update in real time.
Furthermore, assume our model computation involves resolving dependencies by making many external calls, say, one call for every batch of 50 events from a user activity trail of 10k events. This means we will potentially need to make 200 external calls to compute the model of a highly active user. This is not surprising: in the absence of data locality in real time, we are forced to gather data by making external calls.
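The arithmetic is easy to sanity-check. The latency figure below is an assumed value, purely to show how quickly I/O time adds up:

```python
# Back-of-the-envelope check: 1 external call per batch of 50 events,
# over a 10k-event activity trail.
import math

events = 10_000
batch_size = 50
external_calls = math.ceil(events / batch_size)   # 200 calls

# Assumed 20 ms latency per external call, issued sequentially:
latency_per_call_s = 0.020
total_io_s = external_calls * latency_per_call_s
print(external_calls, f"{total_io_s:.1f}s of pure waiting")
```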
Clearly, too many I/O calls will slow down the entire pipeline, since most of the time is spent waiting rather than computing. Thus we have lost one end of the performance bargain: latency.
Parallelism is not limitless
To make up for latency, we are tempted to bump up parallelism, to ensure good throughput at least. But beyond a point, the CPUs start maxing out because of such a high degree of concurrency. So either the async pipeline will begin to lag, or we will be forced to augment capacity to allow for more processing power. Thus the other leg of performance, throughput, is also compromised.
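This ceiling can be made concrete with a rough Little’s-law sketch. All numbers (I/O latency, CPU cost per call, core count) are illustrative assumptions:

```python
# Why parallelism has a ceiling: while calls are I/O-bound, throughput
# grows with concurrency, but the CPU cost per call imposes a hard cap.
def max_throughput(concurrency, io_latency_s, cpu_per_call_s, cores):
    io_bound = concurrency / io_latency_s   # Little's law: L = lambda * W
    cpu_bound = cores / cpu_per_call_s      # hard ceiling from CPU work
    return min(io_bound, cpu_bound)

# Assumed 50 ms I/O latency, 2 ms CPU per call, 8 cores:
for c in (10, 100, 1000):
    print(c, max_throughput(c, 0.05, 0.002, 8))
```

With these assumed numbers, throughput scales from 200 to 2000 calls/s as concurrency rises, then flatlines at 4000 calls/s no matter how many more workers we add; past that point only more hardware helps.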
Reprocessing the whole data set
Even if the computations are simpler, in some scenarios the entire data set may have to be recomputed whenever there is a drastic change in dependent systems, bugs get fixed, or the underlying data is modified. For example, in our case, whenever the catalogue is re-arranged and products are moved across stores, or when products are decommissioned, we are forced to reprocess the whole dataset in order to correct our understanding of the user.
Cost of serializing data in a streaming pipeline
One of the most ignored challenges faced by streaming pipelines is that whenever the DAGs (the underlying data flow model) become complex and the data transferred between nodes grows large, a lot of time and CPU is spent just serializing and deserializing data. It gets worse when the computation model is stream-only, because in such cases the DAGs are bound to get complex, since they have to implement all the checkpoints of intermediate results which would otherwise have been handled locally by MapReduce.
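The cost is easy to observe directly. The sketch below uses `pickle` as a stand-in for whatever wire format the framework actually uses, and the payload size is an arbitrary assumption:

```python
# Measuring the time one serialize/deserialize hop costs when a large
# intermediate result moves between DAG stages. Pickle stands in for
# the framework's real wire format.
import pickle
import time

# Assumed intermediate result: 5000 users x 100 floats each.
payload = {f"user-{i}": [float(j) for j in range(100)] for i in range(5000)}

start = time.perf_counter()
blob = pickle.dumps(payload)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start
print(f"{len(blob)} bytes, {elapsed * 1000:.1f} ms per hop")
```

Multiply that per-hop cost by the number of edges in a complex DAG and by the event rate, and serialization alone can dominate the pipeline’s CPU budget.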
My experience with Lambda
Our system was initially built as a de facto streaming-only pipeline and wasn’t scalable or performant. We found ourselves in the same situation that I have tried to paint in the preceding paragraphs. We came to the conclusion that batch is not a special case of streaming, at least not in every case. We went back to the drawing board and re-architected the platform to have a clear separation between OLAP vs OLTP, batch vs real time, bounded data set vs unbounded event stream, and historical vs real time view. Last but not the least, the DAGs in the streaming system were simplified to have just two stages, one for compute and one for I/O, in order to reduce the cost of serializing and deserializing data.
The hypothesis holds good if the computation is very light and never requires the output to be fully reprocessed. For example, a streaming system is good enough to cater to the problem statement “what is trending this week?”. Here, a single programming model for both batch and real time is sufficient.
To conclude, whether to opt for an elaborate batch+realtime system or to settle for a streaming-only system is mainly determined by how much external data each computation requires. For complex computations, I recommend a full Lambda; for simpler aggregated views over quickly “decaying” data, a streaming-only system would do.