The world’s leading publication for data science, AI, and ML professionals.

Seminal Papers in Data Science: A Relational Model for Large Shared Data Banks

50 years later, a review of some main concepts from E.F. Codd's 1970 paper that laid the groundwork for relational databases and SQL

Photo by Franki Chamaki on Unsplash
Photo by Franki Chamaki on Unsplash

Even with the rising popularity of NoSQL, most companies are still using some form of SQL-based relational database management system. While SQL (then called SEQUEL) was first introduced by IBM’s Donald D. Chamberlain and Raymond F. Boyce in 1974, their work built on the ideas of Edgar F. Codd. Codd was another IBM computer scientist who proposed a relational model for database management in 1970. In this post, I discuss some of the main takeaways from Codd’s influential paper, and how his ideas relate to our modern use of SQL.

Relations

Codd uses the term relation to describe what is essentially the cornerstone of his model. The relation is formally described as follows:

"Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on. We shall refer to Sj as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary."

-Codd (1970)

This definition may look completely foreign, but if you are familiar with SQL, Codd is actually getting at something quite familiar. Codd proposes that a relation can be represented as an array, based on the following conditions:

"An array which represents an n-ary relation R has the following properties:

(1) Each row represents an n-tuple of R.

(2) The ordering of rows is immaterial.

(3) All rows are distinct.

(4) The ordering of columns is significant – it corresponds to the ordering S1, S2, …, Sn of the domains on which R is defined (see, however, remarks below on domain-ordered and domain-unordered relations).

(5) The significance of each column is partially conveyed by labeling it with the name of the corresponding domain."

-Codd (1970)

Codd also presents an example of an array representation of a relation __ in Figure 1 below.

Codd (1970)
Codd (1970)

At first glance, Codd’s array representation of a relation has many similarities with a modern SQL table. For example, the above relation of degree 4 (or, relation with four domains) appears to be analogous to what we would call a table with four columns in SQL. This table would be named supply, and Codd’s domains would correspond to SQL columns. A SQL table also resembles Codd’s relation in the sense that the ordering of rows isn’t important, the rows are distinct, and each column is labeled.

However, there is a notable difference between Codd’s relation and a SQL table: Codd stipulates that the ordering of columns is significant, while this is certainly not the case in SQL. This goes hand-in-hand with Codd’s assertion that each row of the __ relation is an n-tuple – in SQL, since column order of a table is generally not relevant, each row of a SQL table is not required to be a tuple.

So, why would Codd propose the need for column ordering in relations? He gives an example using Figure 2 from the paper (see below).

Codd (1970)
Codd (1970)

In the component relation in Figure 2, Codd presents a situation where two domains (columns) have identical names – part. To keep track of these two part columns, it would make sense to assign a fixed order to the columns. While Codd envisioned this as a legitimate issue during the writing of the 1970 paper, this is no longer the case, as a modern SQL databases will not allow two columns of the same table to have identical names.

Normalized Sets

Codd refers to normalization as the elimination of nonsimple domains from relations. But what does Codd mean by nonsimple domains?

"Nonatomic values can be discussed within the relational framework. Thus, some domains may have have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on."

-Codd (1970)

In Figure 3(a) below, Codd further elaborates by drawing a tree diagram example of an unnormalized set of relations. The employee relation is made up of simple domains man#, name, birthday, and nonsimple domains jobhistory and children. These nonsimple domains can themselves be thought of as relations, containing both simple and nonsimple domains of their own.

Codd observes that this unnormalized set of relations can’t be expressed in a single two-dimensional array. However, after normalization, the set of relations can be expressed as four two dimensional arrays (see Figure 3(b)).

Codd (1970)
Codd (1970)

But, how exactly should this normalization be performed? Codd elaborates:

"Starting with the relation at the top of the tree, take its primary key and expand each of the immediately subordinate relations by inserting this primary key domain or domain combination. The primary key of each expanded relation consists of the primary key before expansion augmented by the primary key copied down from the parent relation. Now, strike out from the parent relation all nonsimple domains, remove the top node of the tree, and repeat the same sequence of operations on each remaining subtree."

-Codd (1970)

So, the tree-based relationship in Figure 3(a) is converted to the normalized form in Figure 3(b), getting rid of the nonsimple domains, and a primary key is defined to uniquely identify the rows in each relation. The primary key can either be a single domain (man# for the employee relation) or multiple domains (man#, jobdate for the jobhistory relation). Additionally, each of the relations can be cross-referenced with each other using the man# domain. This is nearly analogous to one of the foundational concepts in modern SQL schemas – that the primary key of one table can serve as the foreign key of another table, and therefore, the relationship between the tables can be preserved while the tables remain separate 2-d arrays.

Joins

Codd outlines a number of operations that could be performed on relations (tables), one of which is a join.

"Suppose we are given two binary relations, which have some domain in common. Under what circumstances can we combine these relations to form a ternary relation which preserves all of the information in the given relations?"

-Codd (1970)

Codd provides an example of how two relations R and S could be joined using the part domain from each:

Codd (1970)
Codd (1970)

Since the part domains of each relation have nonunique values, the "natural join" of R and S contains more rows (5) than either of the two tables (3), owing to the multiple combinations of supplier and project that can result from identical part values.

If we imagine that r and s are tables in a SQL Database, the following SELECT statements would produce the same result as Codd’s "natural join":

SELECT supplier, r.part, project FROM r JOIN s ON s.part = r.part;

*SELECT FROM r NATURAL JOIN s;**

In Sql, a NATURAL JOIN is one that by default performs the same task as an INNER JOIN, but columns that have the same name in each table will only appear once.

Closing

There are many more concepts from Codd’s 1970 paper that I haven’t discussed, but that’s where I’ll leave it for this post. If you’re interested in learning more about how relational databases and the groundwork for SQL came to be, check out the paper in the references below.

References

2019 Database Trends – SQL vs. NoSQL, Top Databases, Single vs. Multiple Database Use


Related Articles