How big is Big Data?

Big Data adoption was in its seventh or eighth year by the time I got to know it. One of my first questions was, “How big, in terms of volume, must data be to qualify as Big Data?”

It is an obvious question, as the term itself has ‘big’ in it. Over time, I learnt that any answer needs a lot of qualifiers. In fact, the qualifiers themselves need more qualifiers.

Characteristics of Big Data

Volume is one parameter that characterizes Big Data, but there are a couple of other widely accepted parameters: Velocity and Variety. Together with Volume, they are the three V’s of Big Data. Mostly, it is a combination of these parameters (in varying degrees), along with the stated objective, that defines data as Big Data. Three or more further V’s have been added to the list, but these three are the most popular.

Volume

On the question of how big data should be to be classified as big, most experts’ answer will be, “It depends”. Some tend to answer in terms of byte size (5 TB, 1 PB, etc.). One answer I liked was: if the volume of data is so high that it cannot be stored on a single server and/or processed in a timely manner by a single server, you may have big data. In addition, if the volume is expected to grow over time, you may need a big data solution. Those are the minimum Volume criteria. Again, those alone do not characterize Big Data.
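
The rule of thumb above can be written down as a small check. This is only a sketch of the heuristic, not an industry definition: the Workload fields, the three-year planning horizon, and the example capacities are all assumptions made for illustration.

```python
from dataclasses import dataclass

TB = 10**12  # decimal terabyte, in bytes


@dataclass
class Workload:
    data_bytes: int                # current size of the data set
    processing_hours: float        # time a single server needs to process it
    expected_yearly_growth: float  # assumed growth, e.g. 0.5 means +50% per year


def may_need_big_data(
    w: Workload,
    server_capacity_bytes: int,
    deadline_hours: float,
    horizon_years: int = 3,  # hypothetical planning horizon
) -> bool:
    """Return True if any of the Volume criteria from the text is met."""
    projected = w.data_bytes * (1 + w.expected_yearly_growth) ** horizon_years
    too_big_now = w.data_bytes > server_capacity_bytes   # cannot be stored on one server
    too_slow = w.processing_hours > deadline_hours       # cannot be processed in time
    too_big_soon = projected > server_capacity_bytes     # growth will outrun the server
    return too_big_now or too_slow or too_big_soon


# 5 TB that fits and finishes on time today, but doubles every year
growing = Workload(data_bytes=5 * TB, processing_hours=2, expected_yearly_growth=1.0)
print(may_need_big_data(growing, server_capacity_bytes=20 * TB, deadline_hours=24))
```

A True result only says the Volume criteria are met; as noted above, Volume alone does not make data Big Data.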

Velocity

The Velocity of data can be defined as one or more of the following:

The rate at which data is arriving – measured in Volume per Time Unit
The speed at which data needs to be processed – measured in Time Unit
The frequency at which the data needs to be processed – Runs per Time Unit
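
To make the three measures concrete, here is a minimal sketch that derives each of them from a few observed numbers. The function name, its parameters, and the choice of seconds and days as time units are assumptions for illustration only; it reports observed values, which would then be compared against whatever the business requires.

```python
from dataclasses import dataclass


@dataclass
class Velocity:
    arrival_rate: float     # volume per time unit (here: bytes per second)
    processing_time: float  # time unit (here: average seconds per processing run)
    run_frequency: float    # runs per time unit (here: runs per day)


def measure_velocity(
    bytes_received: int,
    window_seconds: float,
    run_durations_seconds: list[float],
    days_observed: float,
) -> Velocity:
    """Summarize the three velocity measures from simple observations."""
    return Velocity(
        arrival_rate=bytes_received / window_seconds,
        processing_time=sum(run_durations_seconds) / len(run_durations_seconds),
        run_frequency=len(run_durations_seconds) / days_observed,
    )


# Two days of observation: 86.4 MB arrived in one day-long window,
# and two processing runs took 20 and 30 minutes.
v = measure_velocity(86_400_000, 86_400, [1200, 1800], days_observed=2)
print(v)
```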

Let me illustrate this with an example. Let us say McDonald’s is giving out Star Wars character toys with every Happy Meal purchase. They may have somehow determined the nationwide popularity of each of the figurines and ordered a certain quantity of each. Now, if they were collecting data on Happy Meal orders on an hourly or daily basis, and also by geography, they might realize that particular toys are more popular than others, either nationwide or in a particular region. It would make sense for them to place orders and stock restaurants accordingly. McDonald’s has about 38,000 stores and serves 68 million customers every day. To implement a solution, McDonald’s would have to process 68 million transactions (at the least), along with the associated store, location and inventory data, to decide which toys to order. Conservatively, it could be 100 petabytes of data per day to ingest, store and process in order to place orders and automate transport logistics. The data arrives at a high rate (every store transaction arrives in near real time), needs to be processed fast, and is processed at a high frequency (every day).
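
To put volume figures like these in perspective, it helps to do the arithmetic explicitly. The sketch below is a back-of-envelope calculation, not McDonald’s actual numbers: the 68 million daily transactions come from the example above, while the 2 KB record size is purely an assumption. It also shows the reverse calculation, how many bytes per transaction a given daily total implies, which is a quick sanity check on any estimate.

```python
TRANSACTIONS_PER_DAY = 68_000_000       # from the example in the text
ASSUMED_BYTES_PER_TRANSACTION = 2_000   # hypothetical: about 2 KB per order record


def daily_volume_bytes(transactions: int, bytes_per_transaction: int) -> int:
    """Raw daily volume if every transaction produced one fixed-size record."""
    return transactions * bytes_per_transaction


def implied_bytes_per_transaction(total_bytes: float, transactions: int) -> float:
    """How much data each transaction must carry to reach a given daily total."""
    return total_bytes / transactions


def human(n: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if n < 1000 or unit == "PB":
            return f"{n:.1f} {unit}"
        n /= 1000


records = daily_volume_bytes(TRANSACTIONS_PER_DAY, ASSUMED_BYTES_PER_TRANSACTION)
print("transaction records per day:", human(records))

per_txn = implied_bytes_per_transaction(100e15, TRANSACTIONS_PER_DAY)
print("bytes per transaction implied by 100 PB/day:", human(per_txn))
```

Under these assumptions the order records alone come to about 136 GB a day, and a daily total of 100 PB works out to roughly 1.5 GB per transaction, so an estimate in that range has to be dominated by the associated store, inventory and logistics data rather than by the orders themselves.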

Variety

Variety may be the spice of life, but it is also the raison d’être for Big Data’s evolution. Data in the computing world once meant data managed by an RDBMS only. Such data is easy to comprehend, store and query. Large volumes of structured data can be stored in Data Warehouses and processed with solutions built on Symmetric Multiprocessing (SMP) or Massively Parallel Processing (MPP) architectures that provide fast performance. These solutions are typically expensive and are not portable across vendors.

In the McDonald’s example, if all sources of data were RDBMS, the solution could be implemented in a Data Warehouse, though perhaps with limitations in the rate of processing and in the associated cost.

But what about semi-structured and unstructured data? E-mails, social media feeds, data center logs, and events from IoT devices were once treated as background noise with no inherent value or insights to offer. This data has little or no structure and, as such, cannot be processed by traditional methods. Companies (Google, Facebook…) that used this kind of data as their only raw material developed technologies to mine it for value. Industries that had been sitting on huge volumes of this kind of data woke up. So, data with a lot of Variety, enriched with structured data or by itself, came to define big data.

Going back to the McDonald’s case, all data sources in McDonald’s OLTP system are most likely RDBMS. Typically, the owners of these systems in any organization are not going to let everyone have direct access to their data. They will offer to publish daily dumps of this RDBMS data in an intermediate format like CSV, JSON or XML. These formats can be processed by big data technologies, thereby enabling ML-based predictions. Also, a Warehouse solution cannot process the Twitter feed for #babyYodaCute along with transaction data to provide better insights, whereas a Big Data solution can.
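
As a small illustration of working from such a dump, the sketch below counts toy orders per region from CSV text using only the Python standard library. The column names (order_id, region, toy) and the sample rows are invented for the example; a real dump would have its own schema, and at real volumes the same aggregation would run on a distributed engine rather than in a single process.

```python
import csv
import io
from collections import Counter


def toy_counts_by_region(csv_text: str) -> dict[str, Counter]:
    """Count how often each toy appears in each region of a CSV dump."""
    counts: dict[str, Counter] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.setdefault(row["region"], Counter())[row["toy"]] += 1
    return counts


# Hypothetical daily dump; the schema is an assumption for illustration.
sample_dump = """order_id,region,toy
1,West,Yoda
2,West,Yoda
3,West,Chewbacca
4,East,Chewbacca
"""

counts = toy_counts_by_region(sample_dump)
for region, counter in counts.items():
    toy, n = counter.most_common(1)[0]
    print(f"{region}: most popular toy is {toy} ({n} orders)")
```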

In conclusion, when your data has Velocity (rate of data, speed of processing and frequency of processing), Variety (unstructured, semi-structured and structured), and large Volume, you probably have data that qualifies as Big Data.
