Without the right tools and environment, Science cannot thrive as it should. The Data Science team at Xcede is constantly tasked with keeping up with cutting-edge technologies within Software Engineering to make Data Science work for organisations in a genuine and quantifiable manner.
Over the past few years, demand for talented professionals capable of real-time data analysis has grown enormously, and with it the speed at which data becomes available for that analysis. From Advertising to Energy, Finance and Media, the variety of data is certainly there, and the end goal is the same: quicker insight often means a quicker impact on the bottom-line figures.
It can be hard to know where to start when explaining the relationship between analysis of this nature and the software that enables it, as it is a vast and all-encompassing area of the field. Without delving too deeply into specific technologies, let’s analyse the aforementioned “same goal” that all companies share, and the paradox that it throws up. I say paradox because there are two problems within real “Data Science” and “Big Data”, not one!
Firstly, we’re gaining information from a continuous inflow of data. Secondly, we’re analysing an immense volume of data recurrently. Each is a mammoth problem in its own right. For a few years now, each of these problems has had its own favoured and popular solution. On the batch side, the ubiquitous Hadoop has become widely adopted for “Big Data” tasks because it performs batch processing so well.
As Rajat Jain of Qubole has summed up extremely well in the past:
“The problem with these approaches is that business requirements are both historic and real-time—simultaneously. Many organizations find the two challenges of extracting real-time data and analysing immense volumes of data converge with time. Real-time data accumulates and the inevitable demand for an aggregated historic view requires batch processing. And the batch processing solution is slow, which eventually leads to business users or customers asking to get immediate or near real-time insight, such as the most recent data updates to react faster to market changes.”
One solution, therefore, is to create an architecture that addresses both problems in tandem. This is what Nathan Marz (the creator of the open source projects Storm and Cascalog) dubbed “Lambda Architecture”.
Lambda Architecture Broken Down into its Separate Parts
First, a batch layer computes views over all of your collected data and, as soon as one run finishes, starts the next, indefinitely (hence its output is always slightly out of date by the time it is available, since new data has arrived while the run was in progress). Alongside this process, a parallel speed layer closes the gap by constantly processing the most recent data in near real-time.
As Jain explained as early as 2013, “Any query against the data is answered by querying both the speed and the batch layers’ serving stores, and the result is merged to give a near real-time view on the complete data set.”
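To make the idea concrete, here is a minimal Python sketch of the two layers and the merged query. It is an illustration only, not how Hadoop, Storm or any particular product implements it: the page-view events, the cutoff timestamp and the names build_batch_view, SpeedLayer and query are all hypothetical, and plain in-memory counters stand in for the real serving stores.

```python
from collections import Counter

# Hypothetical master dataset: an append-only log of (timestamp, page) events.
events = [(1, "home"), (2, "pricing"), (3, "home"), (4, "home"), (5, "pricing")]


def build_batch_view(master_log, up_to):
    """Batch layer: recompute page counts from scratch over every event
    with a timestamp strictly before `up_to`."""
    return Counter(page for ts, page in master_log if ts < up_to)


class SpeedLayer:
    """Speed layer: incrementally counts only the events that the latest
    batch run has not covered yet (timestamp >= `since`)."""

    def __init__(self, since):
        self.since = since
        self.view = Counter()

    def on_event(self, ts, page):
        if ts >= self.since:
            self.view[page] += 1


def query(page, batch_view, speed_view):
    """Answer a query by merging the batch view with the speed view."""
    return batch_view[page] + speed_view[page]


# Assume the last batch run covered everything before timestamp 4.
cutoff = 4
batch_view = build_batch_view(events, up_to=cutoff)

# The speed layer sees the same stream but keeps only the recent events.
speed = SpeedLayer(since=cutoff)
for ts, page in events:
    speed.on_event(ts, page)

print(query("home", batch_view, speed.view))     # → 3
print(query("pricing", batch_view, speed.view))  # → 2
```

When the next batch run completes, the cutoff moves forward and the speed layer's view for the newly covered period can be thrown away, which is why the speed layer only ever needs to hold a small, recent slice of the data.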
Lambda Architecture is only a concept (check out the illustration below for a demonstration of this). The technologies used to implement the different parts of a Lambda Architecture are separate from the concept itself, and can be chopped and changed to suit the team or department.
Hadoop and Storm have proved to be a popular combination for this set-up but, in the interest of unbiased writing, other good software solutions are available! Have a look at some of the innovative solutions being created at Lambda-Architecture.net. For the more visual among you, check out an example of the type of set-up possible in the image below (via Trivadis).
For the serving layer, systems such as HBase, Cassandra and Vertica are available. VoltDB has been used as an alternative to Storm and, instead of Kafka, Beanstalk, RabbitMQ, ZeroMQ, Apache ActiveMQ and OpenMP are being toyed with (RabbitMQ finding the most favour so far, as I understand it). As usual, it all depends on the team in question.
At Xcede we’ve helped to build world-class Data Science and Engineering units for start-ups and multinational institutions alike, based on the Lambda Architecture concept and the technologies involved. If you’re interested in creating a similar environment for your company, please feel free to get in touch with our Data Recruitment Consultants.