Data comes in different shapes and sizes

Let’s talk about tables, trees and graphs

Maxim Zaks
Towards Data Science
5 min readNov 11, 2017

--

Tables — tabular representation of data, is probably the oldest way of storing data. It is also the simplest. We define rows and columns, where a column represents a property and a row represents an entry which is a combination of properties.

CSV is a simple representation of tabular data. Below you can see a CSV representing my closest family:

Maxim,June 12,Berlin
Efim,November 24,Essen
Margarita,August 20,Bochum
Issai,May 9,Bochum

Every row represents a family member and family members are represented by following properties:

  • Name
  • Birthday
  • City

Those properties are good representation of a person, but they say nothing about relationships between those people. What if we add another three properties to establish the relationships between them:

  • Father
  • Mother
  • Siblings

In this case the CSV might look something like this:

Maxim,June 12,Berlin,3,2,1
Efim,November 24,Essen,3,2,0
Margarita,August 20,Bochum,,,
Issai,May 9,Bochum,,,

The relationships are described as the index of the row (index starts with 0).

So Maxims father is Issai, Maxims mother in Margarita and Maxims sibling is Efim. On the second row, we see that Maxim is Efims siblings (no surprises there) and that they have same parents. Margarita and Issai have no reference to father, mother and siblings, not because they don’t have any, but because those people are not listed in this data set.

What would happen if Maxim would have multiple siblings?

In case of CSV and generally tabular data representation this is quite unfortunate. However we have two options:

  1. Create a custom syntax for array of values. E.g. use + character to separate multiple row indices 1+10+12.
  2. Have a second table which represents adjacency matrix of the relationships. With CSV format we can only have one table per file, so we would need to create another file.

We learned that tabular data representation is good for one to one and many to one relationships, but can become tricky, if we need to have a one to many relationship.

What about Trees?

A node in a tree (if it is not a leaf) has a one to many relationship with it’s children. Let’s try to represent the same data with XML:

<person name="Maxim" birthday="June 12" city="Berlin">
<father>
<person name="Issai" birthday="May 9" city="Bochum"/>
</father>
<mother>
<person name="Margarita" birthday="August 20" city="Bochum"/>
</mother>
<siblings>
<person name="Efim" birthday="November 24" city="Essen">
<!-- 😔 we have to repeat father and mother now -->
<father>
<person name="Issai" birthday="May 9" city="Bochum"/>
</father>
<mother>
<person name="Margarita" birthday="August 20" city="Bochum"/>
</mother>
<siblings>
<!-- 😨 OMG we have a cycle, abort!!! -->
</siblings>
</person>
</siblings>
</person>

A tree has to have one root element. In our first attempt, we changed the semantics of the data set and decided to represent it as a “family tree” of Maxim. This however also highlighted that our data is too complicated for such representation. We had to replicate Issai and Margarita, when we added Efim as sibling. And than we realised that Efim and Maxim build a referential cycle. Which makes purely hierarchical representation impossible.

OK, let’s try to use XML with explicit references:

<people>
<person id="0" name="Maxim" birthday="June 12" city="Berlin">
<father ref="3"/>
<mother ref="2"/>
<sibling ref="1">
<!-- could have another sibling tag here
<sibling ref="123">
-->
</person>
<person id="1" name="Efim" birthday="November 24" city="Essen">
<father ref="3"/>
<mother ref="2"/>
<sibling ref="0">
</person>
<person id="2" name="Margarita" birthday="August 20" city="Bochum"/>
<person id="3" name="Issai" birthday="May 9" city="Bochum"/>
</people>

Now we can have multiple sibling nodes in a person node. However we have one small disadvantage compared to CSV. The reference in CSV was the index of a row. In XML a node might change it’s index dependent on parser and XML producer implementations, so it is necessary to have an explicit id attribute on every person node and a ref attribute on father mother sibling nodes, which reflect the id of the person. It is not a big inconvenience, when we write the data, but it is a bigger inconvenience, when we read the data. In order to follow the reference, user will have to build up some kind of a lookup table, where a person can be found by id.

Can JSON help us with this problem?

JSON is also a tree structure, but if XML is very homogeneous — we can describe data only through elements and there content. JSON is heterogeneous — we have arrays, objects, strings, numbers, boolean literals and null at our disposal. In JSON we can represent the same data set as following:

[
{
"name":"Maxim",
"birthday":"June 12",
"city":"Berlin",
"father":3,
"mother":2,
"siblings":[1]
},{
"name":"Efim",
"birthday":"November 24",
"city":"Essen",
"father":3,
"mother":2,
"siblings":[0]
},{
"name":"Margarita",
"birthday":"August 20",
"city":"Bochum"
},{
"name":"Issai",
"birthday":"May 9",
"city":"Bochum"
}
]

We say that root element is an array. It has three children which are objects, where siblings property is an array of people indices. Elements inside of an array have a stable index, so we do not need explicit id properties. We also avoid building up an explicit lookup table, as arrays are accessible by index.

There is something we did not talk about yet.

CSV, XML and JSON are all text based formats. In order to work with the data set and follow the references, we need to parse and transform it into some kind of in-memory model. If we have a large dataset it might become very inconvenient.

Textual vs. Binary

When we switch from textual to binary representations, we can chose/build a format which allows random value access. In this case, a reference can be represent by offset of entry we are referring to.

An offset can be absolute, or relative. An absolute offset is simpler for read, write and validation operations. A relative offset, can help you reduce size of the binary and user will be able to combine multiple buffers together in one file. If you are interested, how this kind of technique is solved in FlatBuffersSwift have a look at following article:

Thank you for taking the time, please clap if you liked the article.

--

--

Tells computers how to waste electricity. Hopefully in efficient, or at least useful way.