
We all have that love and hate relationship with the database, more specifically the data management system (DBMS). It’s an integral part that defines how to access one of the most valuable assets of the 21st century, and it drives many of our passion and determines our workflow. But it also is arguably one of the most tedious parts of the data science process. In this post, I will discuss some of the fundamental concepts behind the database and tie a few popular terms we hear around the data world.
Overview
The goal of this post is to identify some of the terms we often hear and briefly define their relationships. This is intended to help us orient ourselves through a myriad of semantics as we dive deeper into each variation.
First of all, a database is a broad term for the storage of organized data. We access and manipulate the database following the principles defined by the database management system (DBMS). MySQL, Microsoft SQL Server (MS SQL Server), Oracle, PostgreSQL (Postgres) are all different DBMS.
All of these example DBMS above have a relational data model and uses SQL (Structured Query Language) as their query language or database language. (Postgres is an object-relational database.) You may have seen the abbreviation of the relational database management system: RDBMS.
Because SQL is a language for the relational model, we often see SQL and relational being used interchangeably, such as SQL database, which indicates that the database uses the relational model and SQL as its language.
The database models are the underlying structure of the DBMS. Of course, there is more than just a relational model. There are a lot of different types of data models: object-oriented, object-relational, hierarchical, network, entity-relationship, flat file, inverted file, multidimensional, associative, to name a few. We won’t go into details of every single one of them. As the relational data model is a majority, many other non-relational data models are often being grouped into a term, NoSQL. But what constitutes relation?
Relational vs. Non-Relational (SQL vs. NoSQL)
A database is an organized set of data that has some relations. So by definition, the term ‘non-relational’ to describe a database could seem a bit misleading. It’ll be more precise to say ‘non-tabular-relational’. Relational databases use tabular structures (rows and columns) to group data with the same attributes and to connect the relations between these tables. Another way to organize data would be by documents (or observations), which is a model many non-relational databases adapt.

If you know SQL, you can immediately see why and how SQL supports relational databases. You must have spent enough time familiarizing yourself with the idea of JOIN. Figuring out how to establish such relations across different tables within a schema is a key aspect of accessing a relational database.
On the other hand, NoSQL simply means not-relational or not only SQL, and naturally, there’s a wide variety of NoSQL databases that fit this category. Some of the popular names in this NoSQL category include MongoDB, Couchbase, InfinityDB, OrientDB, Apache HBase, etc. NoSQL databases are gaining popularity mostly because our databases are getting big.

The Era of Big Data
The relational database makes intuitive sense. It’s an effective system that utilizes efficient indexing to streamline the searching and manipulation of data. So why are the No-SQL databases becoming so popular?
The relational system works and is very efficient. But the fact that each table holds a distinctive part of the whole data and is intricately tied to one another becomes a constraint when we don’t want to (or cannot) process the entire dataset at once. What if our server is maxed out and cannot process more data? We can get a bigger server and migrate the whole database. But data grows again, and again, and again…
As we enter the era of Big Data, we needed to adopt a distributed system that will allow us to divide and delegate processing across many different servers (also to simply attach an additional one as data grows). Consequently, we needed a new database model that offers more flexibility than the relational model and can scale for parallel processing. That’s where NoSQL came in without the usual relations between tables.
What about Hadoop?
Another big word in the Big Data Era is Hadoop. Apache Hadoop offers a framework that allows the aforementioned distributed computing for big data. It’s equipped with the distributed file system called the Hadoop Distributed File System (HDFS) and ** the MapReduce model. But Hadoop is not a database or DBMS as it does not directly define the database system. I’ll cover the details of Hadoop in a later post. But Apache does come with its DBMS, Apache HBase,** designed to support the overall Hadoop framework.
Apache HBase is a column-based NoSQL database management system. Wait, I just said NoSQL means no tabular relations, so what does ‘column-based NoSQL’ even mean? Let’s first discuss what it means to be ‘column-based’ or ‘columnar’.

Column-based (Columnar) vs. Row-based
I think the confusion is caused by our cognitive heuristics that coined these terms. It’s so much easier to think of data space like a table or multiple tables. We grew up seeing tables everywhere and they are implemented in all the analytic tools. But it’s time to expand our dimensions.
In this case, it’s easier to think of the columns and rows as attributes and values instead. When one says a Database is column- or row-based, it’s referring to how data is physically stored. It’s indicating whether each data is stored as a series of values ( Adi, 41, New York, Designer )
or per individual attributes { Name: ( Adi, Erin, John, Kim )}
.

In this diagram, we can quickly see some properties of each case. When the data is stored per each attribute, the database is sparse as it stores nothing when the value is null. Also, a columnar database makes it easier to compute values within the same attribute across a large set of data, since we don’t have to load all other attributes that come with each data points in a row-based database.
Using NoSQL, we take this even further. Instead of assigning direct relations between "tables", we are mapping the structure as a form of the nested pairs of an attribute (key) to value. It’s like a fractal, you flexibly nest data of the same structure and adds sub-attributes as added dimensions. I’m keeping it a bit open here because each variation of NoSQL will treat this part slightly differently.
One important thing to keep in mind is that there is no perfect database type that will solve all problems. NoSQL should be optimized for distributed processing and effective in adapting to new data, but compared to SQL it has limitations in standards and reliability. Plus, NoSQL is an umbrella term that covers several different models that perform differently based on the data. It’s crucial to weigh the business goals with the properties of a specific database before committing to full migration.
Happy Learning!
