Understanding the Relational Model of Database Management Systems (RDBMS)

and why it's so popular across the industry

SudoPurge
Towards Data Science

--

Behind all the hype and glory of Machine Learning in popular media, one extremely crucial component that kinda gets overlooked is the role of Database Management Systems (DBMS). Of course, not everyone really NEEDS to know the ins and outs of how a database is managed. But if you are using Machine Learning in any capacity, there will, to a very high degree of probability, be a time in your professional career where understanding how it accomplishes some of its most common tasks will save you countless hours.

Around the 60s, developers were still using hardcoded software for managing data. Whenever this data would not fit into very rigid predefined architectures, the software would essentially need to be refactored every time to fit the kinds of data stored and the kinds of methods used to retrieve and manipulate the data. If a developer used hash tables to define the data structure, as the database grew and the needs evolved, the system would need to be re-written. And this was a recurring event.

A mathematician named Edgar F. Codd working at IBM saw this need for a general-purpose architecture, which could really bypass the hard-coded rigidness to fit the evolving needs. Enter Relational Database Management Systems (RDBMS). In 1970, he published a paper describing this architecture, in which the system itself could figure out the best possible representation and procedure to store, retrieve and manipulate data.

Codd, E. F. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 26.1 (1983): 64–69. Communications of the ACM. Web.

Codd proposed that we use simple, primitive data structures like scalars, integers, floats, and strings to store data. The data would form relations in the form of tuples. Perhaps in simpler and more relatable terms; imagine a data table (also referred to as a relation), where each cell is one of the simple scalar data types, connected in each row as tuples forming relations of their own. Each entity (or row) has a unique ID, called the primary key. Tables can also talk to each other via these keys which are now referred to as foreign keys to form custom relations using the seven fundamental operations. The fundamental operations are listed below along with their corresponding standard SQL syntax:

i) Select: Selects and returns a new relation that is a subset of the tuples that meet certain conditional predicates: SELECT * FROM Shop_A WHERE author_id = 102;

ii) Projection: Returns a new relation with tuples that contain only the specified attributes, can rearrange attribute ordering, can manipulate the values: SELECT author_id-100, book_id FROM Shop_A WHERE author_id > 102;

iii) Union: Returns all the tuples in the two input relations as one new relation but only works if the input data types and attributes are the same in both: (SELECT * FROM Shop_A) UNION ALL (SELECT * FROM Shop_B);

iv) Intersection: Returns a new relation of tuples that are in both the input relations but only works if the input data types and attributes are the same in both: (SELECT * FROM Shop_A) INTERSECT (SELECT * FROM Shop_B);

v) Difference: Returns a new relation with tuples that are in the first input but not in the second input but only works if the input data types and attributes are the same in both: (SELECT * FROM Shop_A) EXCEPT (SELECT * FROM Shop_B);

vi) Product: Returns a new relation that's the dot product of the two inputs, aka. all possible combinations, generally used for testing: SELECT * FROM Shop_A CROSS JOIN Shop_B; or, SELECT * FROM Shop_A, Shop_B;

vii) Join: Returns a new relation of tuples that are in both input relations, but unlike Union, the data and attribute types do not need to be the same, as long as there is at least one match within the tuples: SELECT * FROM Shop_A NATURAL JOIN Shop_B;

Further advanced operations have since been added like Division, Aggregation, Sorting, and many more.

These lower-level operations are used to evaluate relations with a technique called relational-algebra. The RDBMS uses these primitive relational algebraic expressions in order to determine the most efficient how and this is relational calculus. A very complex component of the system called the query optimizer gets this job done.

Imagine you want to create a new relation between table A and B, each containing 5 billion rows, where you want all the tuples with a primary key less than 500 that are present in both A and B with some attributes that are the same. One procedure to do it would be to NATURAL JOIN all the 10 billion rows first and then filter SELECT keys less than 500. Another procedure could be to filter SELECT less than 500 keys first from A and B and then NATURAL JOIN the 1000 (at most) entries. Depending on the actual size of the database, the performance factor of the second procedure would be much more efficient than the first one.

This is where the Relational Model excels. The user just needs to state what they want to do, and the RDBMS will figure out how to do it in the most efficient way. That’s why querying languages like SQL are called “non-procedural”. The explicit steps (or procedures) are built into the system. Unlike programming languages like Python or C/C++ where the user needs to explicitly state that iterate over this data this many times, look for this characteristic, if n is larger than a certain number choose this other method, yada yada, SQL only needs to be told the desired end result.

The RDBMS is based primarily on the three pillars:

i) Structure (schema): relations and their contents, like the attributes, types, relations within and between data.

ii) Integrity: insurance that the database’s contents satisfy the constraints (aka some predefined native structure like memory allocation) to validate the instance of the database given these constraints, with no redundancy.

iii) Manipulation: how the data can be accessed or modified by running queries.

Because the system can evolve with evolving data and needs, it's extremely versatile and fits the business needs of most databases. RDBMS is sort of the AI of DBMS by virtue of its intelligent design.

References:

  1. Pavlo A, 2019, Intro to Database Systems, lecture notes, 15-445/645, Carnegie Mellon University
  2. Codd, E. F. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 26.1 (1983): 64–69. Communications of the ACM. Web.

P.S. For more short and to the point articles on Data Science, Programming and how a biologist navigates his way through the Data revolution, consider following my blog.

With thousands of videos being uploaded every minute, it’s important to have them filtered out so that you consume only the good quality data. Hand-picked by myself, I will email you educational videos of the topics you are interested to learn. Sign-up here.

Thank you for reading!

--

--