What is Big Data and is it really so Big?

"Big Data" is a hot term this days. We all hear  "Big Data here",  "Big Data there", "Big Data everywhere".  All sounds fancy and yet mysterious a bit. But what is Big Data? Was there Big Data in the past or it was born recently?  What does "Big Data around us"  means?  Can we touch it? Obviously there is no a single definition for the Big Data and everyone may define it differently while also confusing the true meaning of this term.
The funny part is that few decades ago a sentence " I have a challenge how to process our Big Data" - would probably imply that employee is not qualified enough. But saying the same this days sounds good and perhaps implies that employee is well qualified and working on the real challenges.  So what have changed in the last few decades?

Imagine a first-grade kid who just got his first homework assignment and is about to share the shocking news with his parents: "Mom, Dad! I got 10 pages of math to complete by tomorrow! How can I do it?! Oh no!"

If you think about it, this is a situation where the kid believes he can't complete the entire homework in a single day, no matter what he does. This is exactly the Big Data challenge for the kid.

This example is not so far from the reality of Big Data. "Big Data" is simply an amount of data that we can't load entirely into memory or process entirely using the simple techniques we are aware of. That basically means Big Data is a relative term: while it's big for some, it's normal for others. It all depends on how you want to process the data, what resources you have to process it and what techniques you plan to use.
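To make the "relative" point concrete, here is a minimal Python sketch of processing a file that may be larger than memory by streaming it in chunks and keeping only a running aggregate. The file name `events.csv` and the `amount` column are hypothetical placeholders, not anything from a real dataset.

```python
import pandas as pd

total = 0.0
# Read the file in fixed-size chunks instead of loading it all at once;
# only one chunk and the running total live in memory at any moment.
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    total += chunk["amount"].sum()

print("total amount:", total)
```

With the right technique, the same data stops being "big" even on a modest machine; with the wrong one, it never fits.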

There is absolutely nothing new about storing relatively large amounts of data. There was always data that we stored in databases, on tapes, on storage disks, etc. But you never heard the term "Big Data" in the past. So how is it that our data became Big Data?

The common approach was to buy expensive servers with plenty of disks and RAM. Many then used common tools like an RDBMS to store the data, and SQL was the popular API to process it. But then something changed. Data continued to grow, so we stored more of it. At the same time, expensive RDBMS solutions became less optimal for handling these new amounts of data. Data also became a strategic asset for those who store it, so storage solutions had to provide availability, resiliency and fault tolerance. This is exactly when "MapReduce: Simplified Data Processing on Large Clusters" introduced a new model, describing how to better store and analyze data while trying to reduce the costs of both and improve efficiency. This is also about the time when the term "Big Data" was born.
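The idea from that paper is easiest to see with the classic word-count example. Below is a tiny, non-distributed sketch in Python of the map and reduce phases; a real engine would run the map over many machines and shuffle the emitted pairs to reducers by key. The input lines are made up purely for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts for each word across all emitted pairs.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data here", "big data there", "big data everywhere"]
all_pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(all_pairs))
# {'big': 3, 'data': 3, 'here': 1, 'there': 1, 'everywhere': 1}
```

Because the map step is independent per line and the reduce step only needs pairs grouped by key, both phases can be spread across many cheap machines instead of one expensive server.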

Running analytics on large amounts of data with distributed workloads, when it's not possible to load the entire dataset into memory - this is exactly what Big Data is about. The storage part never comes alone; it is always accompanied by the question of how to process the data. These days, MapReduce still remains one of the most effective ways to process large amounts of data. Big Data engines like Apache Spark, Dask, Modin, Hadoop MapReduce and NoSQL databases are all capable of processing large amounts of data stored in object storage or distributed file systems.
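As one illustration, here is a minimal sketch with Dask that points at many CSV objects in object storage and aggregates them without ever loading everything into memory. The bucket path and column names are hypothetical, and reading from S3-style storage assumes the s3fs package is installed alongside Dask.

```python
import dask.dataframe as dd

# Lazily reference many CSV objects in object storage; nothing is read yet.
df = dd.read_csv("s3://my-bucket/events-*.csv")

# The aggregation is planned as a graph of per-partition tasks and only
# executed, chunk by chunk (possibly across a cluster), on .compute().
result = df.groupby("country")["amount"].sum().compute()
print(result)
```

The same pattern applies to Spark or Modin: describe the computation over data that lives in cheap distributed storage, and let the engine decide how to split the work.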

This is what Big Data is about!
