BD03: What defines a BigData database

“Complex applications combine different types of problems, so picking the right language for each job may be more productive than trying to fit all aspects into a single language.” (Pramod J. Sadalage, NoSQL Distilled.)

Most people I talk to about BigData equate it with NoSQL systems. By NoSQL some mean non-SQL, and the smarter ones may say ‘Not Only SQL’. But the ones who start the conversation by saying it’s a misnomer, and that it should have been called no-rel (for non-relational), are the ones that grab my attention. Interestingly, the term NoSQL was coined by Carlo Strozzi in 1998 to name his lightweight ‘relational’ database, which did not offer a SQL interface and was not BigData capable. He himself later suggested that the current ‘NoSQL’ movement should have been called no-rel instead. And even that definition is incomplete (and, in some cases, incorrect).

To be defined as a BigData database, a system must be able to handle all, or at least a few, of the 4Vs of Big Data, i.e. Volume, Velocity, Variety and Veracity. In order to do that, most of the current BigData technology landscape differs from the traditional database world in one or more of the following aspects.

Schema on Read (vs Schema on Write) – To handle ‘Variety and Veracity’

For a long time now, the database world has relied on ‘Schema on Write’ semantics: define the structure of your data first (tables with defined columns), and then shape the data to fit into this schema. While this offers a lot of advantages, like early validation of data and easy reporting, it is quite inflexible when it comes to storing data whose schema changes a lot (unstructured data).

To counter this inflexibility, the ‘Schema on Read’ mechanism came into being. It allows incoming data to be stored ‘as-is’, making sense of it only when it is read. This externalizes the schema: the reading application is aware of the actual structure of the data, while the data store itself doesn’t care. Most (but not all) BigData systems allow storage of such unstructured data.
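As a rough illustration (plain Python, with made-up event and field names), the sketch below contrasts the two approaches: the ‘schema on write’ path rejects anything that doesn’t match a predefined structure, while the ‘schema on read’ path stores raw JSON as-is and lets the reader impose a structure later.

```python
import json

# --- Schema on write: structure is enforced before anything is stored ---
USER_COLUMNS = {"id", "name", "email"}          # stands in for a relational table definition

def insert_user(table, record):
    if set(record) != USER_COLUMNS:             # reject anything that doesn't fit the schema
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)

users = []
insert_user(users, {"id": 1, "name": "Alice", "email": "alice@example.com"})
# insert_user(users, {"id": 2, "nick": "Bob"})  # would raise: schema violation at write time

# --- Schema on read: store the raw payload 'as-is', interpret it only when reading ---
def store_raw(log, payload):
    log.append(payload)                         # no validation, no structure imposed

def read_signups(log):
    """The reader decides what the data means; unknown shapes are simply skipped."""
    for line in log:
        event = json.loads(line)
        if event.get("type") == "signup":
            yield {"user": event.get("user", "unknown"), "plan": event.get("plan", "free")}

raw_log = []
store_raw(raw_log, '{"type": "signup", "user": "alice", "plan": "pro"}')
store_raw(raw_log, '{"type": "click", "page": "/home"}')      # different shape, accepted anyway
store_raw(raw_log, '{"type": "signup", "user": "bob"}')       # missing field, accepted anyway

print(list(read_signups(raw_log)))
# [{'user': 'alice', 'plan': 'pro'}, {'user': 'bob', 'plan': 'free'}]
```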

ACID vs BASE (CAP Theorem) – To handle the ‘Velocity’

In 2000, computer scientist Eric Brewer proposed the CAP theorem (formally proven in 2002 by Seth Gilbert and Nancy Lynch), which states (source):

any networked shared-data system can have at most two of three desirable properties: 1) consistency (C) equivalent to having a single up-to-date copy of the data; 2) high availability (A) of that data (for updates); and 3) tolerance to network partitions (P)

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. In even simpler terms: since network partitions cannot be ruled out in a distributed system, a choice has to be made between Consistency and Availability.
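To make the two-node picture concrete, here is a toy sketch (not any particular database’s behaviour) of a single write arriving while a partition may or may not separate the two replicas: a CP-style system refuses the write to stay consistent, while an AP-style system accepts it and lets the replicas diverge.

```python
class Node:
    """A replica holding a single value (the whole 'database' in this toy model)."""
    def __init__(self, name):
        self.name = name
        self.value = None

def write(primary, replica, value, partitioned, prefer="consistency"):
    """One write arrives at 'primary'; 'replica' may be unreachable."""
    if not partitioned:
        primary.value = replica.value = value          # both copies updated: C and A both hold
        return "ok"
    if prefer == "consistency":
        return "rejected (unavailable)"                # CP: refuse rather than let copies diverge
    primary.value = value                              # AP: accept locally, replica is now stale
    return "accepted (replicas inconsistent)"

a, b = Node("A"), Node("B")
print(write(a, b, 1, partitioned=False))                         # ok
print(write(a, b, 2, partitioned=True, prefer="consistency"))    # rejected (unavailable)
print(write(a, b, 3, partitioned=True, prefer="availability"))   # accepted (replicas inconsistent)
print(a.value, b.value)                                          # 3 1 -> readers may see stale data
```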

While traditional database systems are always consistent (ACID compliant), modern Big Data systems are more relaxed and favour availability over consistency for short durations, making them BASE (Basically Available, Soft state, Eventual consistency). To a purist database administrator, giving up consistency would be her worst nightmare (remember dirty/phantom reads?), but organizations have realized that many new-age systems can live with some delay in consistency (your Facebook status update might take a few seconds before it is available on your friend’s timeline, for example).

By giving up on strong consistency requirements, systems achieve high availability, fault tolerance and high write performance (think how?).
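One common way this trade-off is exposed in practice is through Dynamo-style tunable quorums (the function and parameter names below are illustrative, not the article’s own): with N replicas, a write waits for W acknowledgements and a read queries R replicas. If R + W > N, every read quorum overlaps the latest write quorum, so reads are strongly consistent; lowering W makes writes faster and keeps them available even when replicas are down, at the price of temporarily stale reads.

```python
def quorum_profile(n, w, r):
    """Dynamo-style tunable consistency: N replicas, W write acks, R read acks.
    If R + W > N, every read quorum overlaps every write quorum, so reads always
    see the latest acknowledged write. Smaller W/R means fewer nodes must respond,
    so writes/reads stay fast and available when replicas fail -- but only eventually
    consistent."""
    strong = r + w > n
    return {"replicas": n, "write_acks": w, "read_acks": r,
            "consistency": "strong" if strong else "eventual",
            "replicas_that_can_fail_for_writes": n - w}

print(quorum_profile(n=3, w=3, r=1))   # strong, but a single dead replica blocks all writes
print(quorum_profile(n=3, w=1, r=1))   # eventual: writes succeed even with 2 of 3 replicas down
print(quorum_profile(n=3, w=2, r=2))   # strong, and still tolerates one failed replica
```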

Data Partitioning – To handle the Volume

When the volume of data breaches the terabyte mark (and onwards to peta/exa/zettabytes…), storing it on one disk drive becomes infeasible (or at least difficult). Also, for fault tolerance, we would want multiple copies of the data stored across separate racks, and even separate data centres. What this means is that each byte of data written to the system is copied over to multiple nodes (replication)… and that no single node holds all of the data (sharding/partitioning). For example, a cluster might have 5 nodes with a replication factor of three, meaning each byte is copied over to three of the five nodes.

While partitioning may be applied to any database using external logic, most BigData solutions come with sharding capabilities built in. This allows for massive scalability in handling large volumes of data, along with built-in protection against data loss when some nodes fail. It also allows the use of cheap commodity hardware, giving a much lower storage price per byte of data stored.
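A minimal sketch of the placement idea, assuming a hypothetical five-node cluster, hash-based sharding and a replication factor of three (real systems typically use consistent hashing with virtual nodes, token ranges and rack awareness):

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical 5-node cluster
REPLICATION_FACTOR = 3

def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Pick the primary shard by hashing the key, then place the remaining rf - 1
    copies on the next nodes around the ring: each key lives on rf nodes, not all."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

for key in ("user:42", "order:9001", "sensor:7"):
    print(key, "->", replicas_for(key))
# Each key lands on exactly 3 of the 5 nodes; losing one node still leaves two copies.
```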

Conclusion

We started the article by saying NoSQL was a bad name, and that maybe NoREL would have been a better one, as most big data solutions do away with the relational schema. But a new breed of databases, under the moniker ‘NewSQL’ (what a bad name again!), are relational, offer ACID compliance, and yet have Big Data capabilities (so, have they managed to work around the CAP theorem?!).

 
