Big Data is a term used to refer to huge volumes of data that belong to one logical entity. The volumes in question are ones that commonly used software and systems are generally not capable of handling within acceptable time limits. There is no specific limit or range for the size of a single Big Data set; however, it typically ranges anywhere from a few dozen terabytes (1 TB = 1024 GB) to a few petabytes (1 PB = 1024 TB).

Didn’t Big Data exist earlier?

Of course it did; however, what was considered Big Data (in terms of size) has always been a moving target. The first 5 1/4 inch floppy disk stored only about 100 KB. Until 1987, the maximum capacity of a floppy disk was only 2.8 MB, and the machines that handled these disks could only read so much data within a given elapsed time. Today our smartphones are expected to deal with gigabytes of data. In other words, the quantum of data being processed has been growing constantly.

So, what is the problem now?

Until recent years the problem of Big Data was mostly confined to areas like scientific research, military applications, astronomy, etc. Lately, however, there has been a tremendous increase in the number of systems and users, which has resulted in far more data. Just to give you an idea: every day, 2.5 quintillion (1 quintillion = 10^18) bytes of data are created, and 90% of the data in the world today was created within the past two years. This rapid growth rate poses a big challenge for solution developers, as they have to continuously scale up their solutions to keep pace with it. On top of the growth rate, the speed at which data flows in and out and the variety of the data make life much more difficult for solution developers.

Some of the day-to-day areas into which Big Data has spread include social data (like Facebook), internet search indexing (like Google), centralized medical systems, stock markets, census systems, etc. Users have become so demanding that they want to see any sort of data at the click of a button, and they really don’t want to wait an impractical amount of time.

Just to give you an example, around four years back Facebook had 7 million users, but today it has 850+ million users, and more than 75% of this growth happened in the last 18 months. Facebook gets 100 billion hits per day, processing hundreds of millions of requests per second.

What is the solution?

In the past, the approach to solving this problem leaned towards building more powerful hardware, but that approach just could not keep up with the pace of data growth. Moreover, the cost of such a solution was so high that only the likes of NASA and Fortune 100 companies could afford it.

Later, the Grid Computing concept emerged as a potential solution to this problem. While Grid Computing promised a lot of innovative ideas, it took a while to turn the concept into something that could be used commercially.

Distributed Computing made this possible and viable for small-scale solutions, and Cloud Computing is a boon for enabling distributed computing in a much simpler and more cost-effective manner. Cloud Computing, as many of us know, provides scalable infrastructure that can be leveraged dynamically, on the fly. In its absence, one has to invest in buying hardware sized for the peak load.

But hardware alone is not enough. We also need a mechanism through which a single big task can be logically broken down and assigned to the different nodes involved in the computation, and the partial results then consolidated back into a single output. One has to remember that there are a lot of complexities involved in doing this, particularly in scenarios dealing with failures.
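To make the idea concrete, here is a minimal, hypothetical single-machine sketch in Java of the split-compute-consolidate pattern, with worker threads standing in for cluster nodes. It deliberately ignores the hard parts a real framework must handle, such as node failures, retries and data locality.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Illustrative only: each worker thread stands in for a node in a cluster.
    public class SplitAndConsolidate {

        public static void main(String[] args) throws Exception {
            // Pretend this array is a "big" data set that one machine cannot scan fast enough.
            long[] data = new long[10_000_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = i % 100;
            }

            int nodes = 4; // number of simulated nodes
            ExecutorService cluster = Executors.newFixedThreadPool(nodes);
            List<Future<Long>> partialResults = new ArrayList<>();

            // Step 1: break the single big task into smaller chunks, one per node.
            int chunkSize = data.length / nodes;
            for (int n = 0; n < nodes; n++) {
                final int start = n * chunkSize;
                final int end = (n == nodes - 1) ? data.length : start + chunkSize;
                partialResults.add(cluster.submit(() -> {
                    long sum = 0;
                    for (int i = start; i < end; i++) {
                        sum += data[i];
                    }
                    return sum; // each "node" returns only its partial result
                }));
            }

            // Step 2: consolidate the partial results into a single output.
            long total = 0;
            for (Future<Long> partial : partialResults) {
                total += partial.get();
            }
            cluster.shutdown();

            System.out.println("Total = " + total);
        }
    }

Replacing the threads with actual machines, and adding fault tolerance, scheduling and data distribution on top, is exactly the problem space that distributed processing frameworks set out to solve.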

This is where Apache Hadoop brings in a lot of value. Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
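To give a feel for that programming model, here is the well-known WordCount example, closely following the Hadoop MapReduce tutorial (using the newer org.apache.hadoop.mapreduce API). Treat it as a sketch rather than production code; the HDFS input and output paths are passed as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // The map step runs on each node against its local slice of the input,
      // emitting (word, 1) pairs.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // The reduce step consolidates the partial counts for each word into one total.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Notice that the developer only writes the map and reduce logic; Hadoop takes care of splitting the input, scheduling the work across the cluster and recovering from failed nodes.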

There are quite a few variants of the Hadoop framework supporting multiple technologies. We will look at a few of them in my next blog.