Faster Batch Processing: The Untold Story

Big Data is About to Get a Lot Bigger

We live in a world where an unrelenting deluge of data has become the new normal, and data generation keeps accelerating. Analyst firm IDC expects data volumes to reach 44 zettabytes by 2020, up from around 4.4 zettabytes in 2013. Split among the roughly 7 billion people in the world, that works out to about 6.3 terabytes per person!
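As a quick sanity check, the per-person figure can be worked out directly. The sketch below assumes decimal (SI) units, where one zettabyte is 10^21 bytes and one terabyte is 10^12 bytes. The function name and the second population value of 7.7 billion are illustrative assumptions, included only to show how sensitive the result is to the headcount used.

```python
# Back-of-the-envelope check of the per-person data volume.
# Assumption: decimal (SI) units, not binary (ZiB/TiB).
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes


def per_person_terabytes(total_zettabytes, population):
    """Return the average data volume per person, in terabytes."""
    total_bytes = total_zettabytes * ZETTABYTE
    return total_bytes / population / TERABYTE


# 44 ZB is the IDC forecast quoted in the text.
share_7_0 = per_person_terabytes(44, 7.0e9)  # 7 billion people
share_7_7 = per_person_terabytes(44, 7.7e9)  # hypothetical 7.7 billion people

print(f"{share_7_0:.1f} TB per person")  # 6.3 TB per person
print(f"{share_7_7:.1f} TB per person")  # 5.7 TB per person
```

The result moves by more than half a terabyte depending on the population figure, so the per-person number is best read as a rough order of magnitude.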

But that’s only half the story!

The mind-boggling volume of data streaming in constantly from social media, clickstreams, mobile devices and sensor-embedded, connected “things” means that big data must now be faster than ever before. True, “velocity” has always been one of the key dimensions of big data, along with “volume” and “variety”. However, with the explosion in new types of information and the need for instant analysis, enterprises must find more effective ways to use information as it happens, so that insights can be gleaned when they are at their most valuable.

Fast Data: the New Currency in Business

As fast data emerges as the new currency in business, corporations need to embrace the concept wholeheartedly. An important first step toward realizing the benefits of fast data is choosing the right frameworks for data processing.

Traditionally, many systems have focused on storing, moving and retrieving data in large batches, all at once. This is batch processing. It comes with certain limitations, chief among them the inability to process data quickly. To handle data in near real-time, batch processing itself has to become faster, and it needs the right set of processing tools to support it.
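One common way to make batch processing "faster" is micro-batching: instead of waiting for one large batch, the system cuts the incoming stream into many small batches and processes each as soon as it is full. The sketch below is a plain-Python illustration of that idea, not code from any particular framework. The names `micro_batches` and `count_by_type`, and the sample events, are all hypothetical.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_by_type(batch):
    """Hypothetical per-batch job: count how many events of each type arrived."""
    counts = {}
    for event in batch:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts


# Hypothetical clickstream; in practice this would be an unbounded source.
events = [
    {"type": t}
    for t in ["click", "view", "click", "purchase", "view", "click", "view"]
]

# Each small batch yields a result as soon as it is complete,
# rather than after the whole dataset has been collected.
results = [count_by_type(batch) for batch in micro_batches(events, batch_size=3)]
for result in results:
    print(result)
```

The smaller the batch, the sooner each result is available, at the cost of more per-batch overhead. That trade-off is why the choice of processing framework matters.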

Batch Processing Gaining Steam with Apache Spark

To address the demand for fast data, Apache Spark offers an alternative to Hadoop’s traditional batch MapReduce model. Spark began as a research project at UC Berkeley’s AMPLab and is now developed under the Apache Software Foundation, which also hosts Hadoop. It can run on top of a Hadoop cluster or on its own. The project claims performance up to 100x faster than Hadoop’s MapReduce when data is held in memory and up to 10x faster when running on disk, which makes for a more efficient batch processing framework with significantly lower latency. Much of that speed comes from keeping working data in memory between processing steps instead of writing it back to disk each time, and this is what makes near real-time processing practical.
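To see why holding data in memory helps, consider a job that makes several passes over the same dataset, as many iterative algorithms do. The toy model below is plain Python, not Spark code. It counts how often the data has to be read from disk with and without an in-memory copy. The `Dataset` class, `run_passes` function and file name are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


class Dataset:
    """Toy dataset that counts how often it is read from disk."""

    def __init__(self, path, keep_in_memory):
        self.path = path
        self.keep_in_memory = keep_in_memory
        self.disk_reads = 0
        self._cached = None

    def records(self):
        # Serve from memory if an earlier pass already loaded the data.
        if self._cached is not None:
            return self._cached
        with open(self.path) as f:
            data = json.load(f)
        self.disk_reads += 1
        if self.keep_in_memory:
            self._cached = data
        return data


def run_passes(dataset, passes):
    """Run several passes over the data, as an iterative algorithm would."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.records())
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.json")
    with open(path, "w") as f:
        json.dump(list(range(1000)), f)

    on_disk = Dataset(path, keep_in_memory=False)
    in_memory = Dataset(path, keep_in_memory=True)

    disk_total = run_passes(on_disk, passes=5)
    memory_total = run_passes(in_memory, passes=5)

print(on_disk.disk_reads)    # 5: one disk read per pass
print(in_memory.disk_reads)  # 1: later passes reuse the in-memory copy
```

Both variants compute the same answer. The in-memory one simply avoids repeated trips to disk, and that is the effect the speed claims for in-memory workloads rest on.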

Spark can deliver as-it-happens, actionable insights by running large volumes of live data through trained machine learning models on the fly. It is much like hunting for the proverbial needle in a haystack, except that in this case the needles can be found the instant they are dropped!

Spark in the Real World

It may seem that Spark has just popped onto the scene, but its speed and sophistication are helping it catch on quickly. Companies such as Uber, Netflix and Pinterest already use the tool to reduce time to insight and action.

It has diverse, game-changing applications in the real world, ranging from demand forecasting to detecting faulty products on a manufacturing line to identifying fraudulent transactions by matching them against previously identified fraud footprints, among many others.

Companies that use a recommendation engine will find that Spark gets the job done incredibly fast. Yahoo, for instance, uses machine learning algorithms running on Spark to figure out what news stories individual Web visitors are interested in and personalize news pages based on user preferences. Similarly, Spark can provide intelligent research solutions for superior deal management by generating targeted alerts that highlight business opportunities for clients in near real-time.

It can also open up new business applications, such as targeting real-time product recommendations or offers to shoppers near a certain aisle in a retail store, or spotting real-time trends on social media to provide predictive intelligence, sentiment analysis or customer segmentation for marketing purposes.

Conclusion

With new sources of data and the internet of things gaining traction, fast data will play an increasingly prominent role in business. According to Gartner, the internet of things will be the single biggest driver of fast data demand. How enterprises use this data could mean the difference between business success and failure. Already dubbed the Next Big Thing, Spark looks like the engine that could catapult data processing into a beautifully fast big data future.

Author

Krittika Banerjee

Research Analyst at Aspire Systems

Fond of exploring contemporary technical and digital innovations, Krittika stays up to date with what is new on this front. She writes about innovation and the latest technology trends in various sectors.
