BD02: The ‘Big’ in Big Data – 4 Vs

“The world is one big data problem.” (Andrew McAfee)

This is Part 2 in our series of Big Data articles. Read Part 1: BIG DATA – SOMETHING NEW OR ‘OLD WINE’?

How does one define ‘big’? While 75,000 IPS (the SAGE system, see Part 1) was huge in 1955, a $100 phone today has hundreds of thousands of times more processing capacity… The first hard drive I ever had was 1.2 GB, and all my friends went WOW… Now even my 1-year-old’s musical toy has 8 gigs…

So there is no point in setting a definitive threshold above which data may be characterized as ‘Big’. Instead, Big Data is defined by a set of four qualities, the 4 Vs of Big Data – Volume, Velocity, Variety and Veracity. Let us take a closer look at each of these, to understand when we start classifying data as Big Data.

Volume – Scale of Data

Volume is about how much data we are dealing with. Gigabytes (10^9)? Any database will do. Terabytes (10^12) or petabytes (10^15)? Get me a data scientist… Exabytes (10^18)? Well, Google has done it, so can we… Zettabytes (10^21, by some estimates roughly the number of stars in the observable universe) or yottabytes (10^24)? I have no clue what to do, ‘today’.
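The size ladder above can be sketched as a tiny helper. This is purely illustrative – the thresholds are just the SI prefixes from the paragraph:

```python
# Toy helper: classify a byte count by the gigabyte-to-yottabyte
# ladder described above (SI decimal prefixes).
SCALES = [
    ("gigabytes", 10**9),
    ("terabytes", 10**12),
    ("petabytes", 10**15),
    ("exabytes", 10**18),
    ("zettabytes", 10**21),
    ("yottabytes", 10**24),
]

def scale_name(num_bytes):
    """Return the largest scale whose threshold num_bytes reaches."""
    name = "bytes"
    for label, threshold in SCALES:
        if num_bytes >= threshold:
            name = label
    return name

print(scale_name(3 * 10**18))  # prints "exabytes"
```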

In the sixties, data used to be produced by a company’s data-entry employees, sitting in front of big mainframe systems. Now it is produced by machines, networks, sensors, human-to-human interaction and human-to-machine interaction. By some estimates, about 3 quintillion bytes (roughly 3 exabytes, or 3 billion gigabytes) of data are produced each day (the amount actually processed is much, much less), and some 40 zettabytes of data will have been created by 2020 – dozens of times more than existed in 2010.

Yet, for 99.9% of organizations, volume alone is not as much of a problem as when it comes combined with the other Vs of Big Data.

Variety – Different forms of data

Variety refers to the many sources and types of data existing in today’s world. Until just about ten years ago, a simple entity-relationship diagram could explain all the business data of an organization. Now, data comes in the form of photos, videos, monitoring devices, PDFs, audio, text, binary streams etc. Around 5 billion pieces of content are shared on Facebook every day, and they follow no structure… 500 million tweets are sent each day, again unstructured…

This variety of unstructured data creates problems for storing, mining and analyzing data. A few terabytes of data in a relational schema, and any decent relational database will handle it… A few gigabytes of unstructured data, and even Oracle RAC would find it difficult to manage.
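As a toy illustration of why a fixed schema struggles here, consider a feed of mixed-shape records handled schema-on-read style. The event types and fields below are invented for the example:

```python
# Hypothetical feed mixing record shapes -- a tweet, a photo upload,
# a PDF -- that no single fixed-column schema would capture.
events = [
    {"type": "tweet", "user": "alice", "text": "hello big data"},
    {"type": "photo", "user": "bob", "width": 1024, "height": 768},
    {"type": "pdf",   "user": "carol", "pages": 12},
]

# Schema-on-read: extract only the fields we care about, tolerating
# whatever extra structure each record happens to carry.
def summarize(event):
    return {"type": event["type"], "user": event.get("user", "unknown")}

summaries = [summarize(e) for e in events]
print(summaries)
```

A relational table would force every row into the same columns up front; here each record keeps its own shape, and structure is imposed only at read time.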

Velocity – Data Streams

Velocity is the pace at which data flows through the system. A huge volume of unstructured data is still (comparatively) easy to handle, but add high velocity to it, and we have to move away from the world of traditional (relational) data solutions.

To give you some examples: the New York Stock Exchange generates more than 2 terabytes of trade information during each trading session, Twitter users send over 300,000 tweets every minute… and so on.

Unless this high-speed data is processed in real time, there is little value to be gained from it. So, in addition to storing and processing a large volume of unstructured data, a Big Data solution must do all of this fast to be effective.
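A minimal sketch of the real-time idea, using only the standard library: process each event as it arrives, keeping a constant-size running aggregate instead of storing the whole stream and batch-querying it later (the traditional relational approach):

```python
# Streaming aggregation sketch: O(1) work and constant memory per
# event, so the computation keeps up with the stream's velocity.
class RunningStats:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one incoming event into the aggregate."""
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

stats = RunningStats()
for trade_size in [100, 250, 400, 50]:  # stand-in for a live trade feed
    stats.update(trade_size)

print(stats.count, stats.mean)  # prints: 4 200.0
```

Real stream processors (Kafka Streams, Spark Streaming, Flink and friends) generalize exactly this pattern: incremental state updated per event, never a full re-scan of history.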

Veracity – Uncertainty of data

When we are dealing with huge amounts of unstructured data, coming in at high velocity, there are bound to be errors and inconsistencies in the data. Gone are the days when a foreign-key constraint violation could simply discard the erroneous data. With all structure gone, there is no way software can distinguish between valid and invalid data unless business-specific rules are applied. As an example, the world’s biggest biometric database, the UIDAI (Aadhaar) system in India, contains thousands of fake identities, as per some conservative estimates.

That is the challenge most Big Data systems have to deal with: to keep the data clean, and to keep ‘dirty data’ from accumulating in the system, as much as possible. They also need safeguards so that dirty data cannot spread or spoil the otherwise correct data. Imagine the impact on a system computing the average income of all citizens of a country, if just a few hundred people with fake incomes worth billions are added to the records.
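The average-income example can be made concrete. A minimal sketch – the income figures and the validity cap are invented for illustration:

```python
# 1,000 genuine records plus just 3 fabricated billion-unit incomes.
incomes = [30_000] * 1_000 + [1_000_000_000] * 3

# Without any veracity check, the fakes dominate the aggregate.
naive_mean = sum(incomes) / len(incomes)

# A simple business rule -- a plausibility cap on income -- acts as
# the safeguard the paragraph describes.
PLAUSIBLE_MAX = 10_000_000  # hypothetical validity threshold
clean = [x for x in incomes if x <= PLAUSIBLE_MAX]
clean_mean = sum(clean) / len(clean)

print(round(naive_mean))  # prints 3020937 -- wildly inflated
print(round(clean_mean))  # prints 30000
```

Three bad records out of a thousand were enough to inflate the mean a hundredfold, which is why rule-based validation (or robust statistics such as the median) matters at scale.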

Take any one of these four ‘V’ characteristics – Volume, Variety, Velocity, Veracity – and any data store worth its salt will be able to handle it. Throw in three or four of them, and most traditional solutions start to sweat. That is where modern Big Data solutions come to the rescue.

To add to these 4 Vs, there are two more that make the Big Data definition complete.

Visualization – If you can’t see it, it doesn’t matter

Once we are able to store and process the huge volume of high-speed data coming from a variety of sources, we need a way to present the results in a human-readable form. That is what is meant by visualization of Big Data. A plethora of tools now exists just to solve the problem of data visualization. After all, what is the point of storing and processing all the data if you cannot see the results?
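As a toy illustration of the idea – turning processed results into something a human can scan at a glance – here is the simplest possible ‘visualization’, a text bar chart over invented counts:

```python
# Render aggregated counts as text bars, sorted largest first.
counts = {"tweets": 42, "photos": 17, "pdfs": 8}

def bar_chart(data, width=40):
    """Return one '#'-bar line per entry, scaled to the largest value."""
    peak = max(data.values())
    lines = []
    for label, value in sorted(data.items(), key=lambda kv: -kv[1]):
        bar = "#" * max(1, round(value / peak * width))
        lines.append(f"{label:>8} | {bar} {value}")
    return "\n".join(lines)

print(bar_chart(counts))
```

Real tools (Tableau, Kibana, matplotlib and the like) do vastly more, but the job is the same: compress processed data into a shape the eye can judge instantly.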

Value – what’s in it for me?

So, you are able to store, process and visualize the data. Now what?

Unless you are doing this as part of academic research, chances are you were asked to do a ‘cost-benefit analysis’ of this whole exercise before you even started. That is what the sixth ‘V’, Value, is all about.

And there is huge value to be gained from Big Data, if done right. It has given organizations new ways of reaching previously left-out customers, opened up new market segments, cut costs through optimized inventory management, and helped them better understand customer spending patterns.

 

When we say Big Data, we are talking about an ecosystem where a huge amount of unstructured data is produced at great speed, stored in some form of persistent store, automatically processed in real time to deduce business insights, and made available in some human-readable format. In coming posts, we will look at each of these aspects in greater detail.

Posted on: 19th November 2016
