
NoSQL Databases – The Solution to a Fast-Paced, Smartphone World

What they are and why they're useful


Relational databases are a foundational component of modern technology. They have been around since the 1970s (Shoutout IBM! ✊), and they are everywhere today because they are very reliable and fairly easy to access once you learn their query language, SQL. You can store, track, and analyze data all in one organized place. In most situations, a relational database is a great choice. However, in the age of the internet and smartphones, some forms of data aren’t a great fit for a traditional relational database. For these situations, we can use a newer and growing type of database: a NoSQL database.

Twitter – if ya ain’t got one, you aren’t cool. Everybody is tweeting nowadays. In fact, according to Brandwatch, there are FIVE HUNDRED MILLION TWEETS SENT EVERY DAY. Yes. In numbers, that’s 500,000,000 tweets. Per day. That’s roughly 6,000 tweets per second! And that is a lot of data, folks. But not every tweet is the same. Some tweets are one character; others use all 140 characters. Some people tweet once a year, some once a day, and others once a minute. And from all that, there are all kinds of interesting metadata that can be collected and analyzed – account names, retweets, comments, likes, the time it was sent, the location of the tweeter, mentions – just to name a few. This kind of rapid-fire information storage would not be a very good fit for a traditional relational database.

But… why? To explain, I’ll use one type of metadata as an example – the number of characters in a tweet. If I were to use a traditional database, each table would have to represent tweets with a certain number of characters. For example, "Table – 1 Character", "Table – 2 Characters", "Table – 3 Characters", etc., all the way up to 140 characters. If every tweet were exactly the same length, then maybe it could work, because I would then just have "Tweet 1, Tweet 2, Tweet 3…" But that’s not the case with Twitter and its tweeters. A relational database could also end up storing multiple copies of the same tweet if the tweeter uses it more than once. This results in redundancy, takes up valuable space, and could make querying significantly slower. Storing tweet metadata, however, is the perfect kind of situation for a NoSQL database.

There are different types of NoSQL databases, which are generally grouped into four categories:

  • Key-Value Stores
  • Column Stores
  • Graph Databases
  • Document Stores

I will primarily be focusing on document stores in this post. A document store is essentially a database that stores its records as unique documents. Popular document stores include Couchbase and MongoDB. The length of each document can be determined by the user, and each document store can contain more than one document. Tweets would be perfect for this – any tweet and its metadata can be stored as its own unique document, and that document can then be embedded in another document for the Twitter user who sent the tweet. It’s like a document… inside a document… inside a document ("Inception" reference, FYI).
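As a rough illustration of that nesting, here is a minimal sketch that uses plain Python dictionaries as a stand-in for documents. The field names (`handle`, `tweets`, `text`, `retweets`, `likes`, `mentions`) are made up for this example; they are not Twitter’s or any database’s actual schema.

```python
import json

# Hypothetical user document with tweet documents embedded inside it.
user_doc = {
    "handle": "@example_user",
    "tweets": [
        {"text": "Hi", "retweets": 0, "likes": 2},
        {"text": "NoSQL is neat", "retweets": 5, "likes": 12, "mentions": ["@friend"]},
    ],
}

# Documents are commonly represented as JSON-like text.
print(json.dumps(user_doc, indent=2))

# Reach into the nested document: the text of the second tweet.
second_tweet_text = user_doc["tweets"][1]["text"]
print(second_tweet_text)
```

Notice that the second tweet carries a `mentions` field while the first does not; each embedded document only holds the data it actually has.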


Each document then contains key-value pairs – similar to a dictionary in Python – with the values being the actual data. The keys, however, are unique to each document, which provides an incredible amount of flexibility: documents don’t all have to use the same keys, unlike rows in a traditional relational database, which share a fixed set of columns. You can create whatever key(s) you want as you go, for whatever data you need. This flexibility allows unpredictable data, such as tweets or even text messages, to be managed much more easily than with the more rigid relational databases.
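To make that flexibility concrete, here is a small sketch, again using plain Python dictionaries in place of real documents. The three tweets and their keys are hypothetical; the point is only that each document can carry a different set of keys.

```python
# Three hypothetical tweet documents; note that they do not share the same keys.
tweets = [
    {"text": "Hi"},
    {"text": "Lunch time", "location": "Chicago"},
    {"text": "Great game!", "mentions": ["@friend"], "likes": 3},
]

# Collect every key that appears in at least one document.
all_keys = set()
for doc in tweets:
    all_keys.update(doc.keys())
print(sorted(all_keys))

# .get() returns a default instead of raising KeyError when a key is absent.
locations = [doc.get("location", "unknown") for doc in tweets]
print(locations)
```

A relational table would need a `location` column on every row, filled or not; here, a document simply leaves the key out.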

However, there is one important caveat. With the increase in flexibility in how data is stored comes an increase in difficulty in querying it. It is very possible that each document – because each one can have its own keys – has its own schema. That means you, as the data analyst/scientist, must know exactly what you are looking for, and you must make sure that whoever is storing the data does a good job of keeping key names consistent. "Tweets" and "tweets" are different names, but they could store the same kinds of information. If you run a query looking for all the data in "Tweets", the information in "tweets" won’t be collected, because the key names are different!
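Here is a minimal sketch of that pitfall, using plain Python dictionaries rather than a real document store. The documents and the two helper functions (`collect`, `collect_case_insensitive`) are hypothetical and exist only for this illustration.

```python
# Hypothetical user documents written by two different processes.
docs = [
    {"user": "a", "Tweets": ["first", "second"]},
    {"user": "b", "tweets": ["third"]},  # same kind of data, lowercase key
]

def collect(documents, key):
    """Gather the list stored under `key` from every document that has it."""
    found = []
    for doc in documents:
        found.extend(doc.get(key, []))
    return found

def collect_case_insensitive(documents, key):
    """Same idea as collect, but matches key names ignoring case."""
    found = []
    for doc in documents:
        for name, value in doc.items():
            if name.lower() == key.lower():
                found.extend(value)
    return found

exact = collect(docs, "Tweets")
relaxed = collect_case_insensitive(docs, "Tweets")
print(exact)    # the document that uses "tweets" is silently skipped
print(relaxed)  # both documents contribute
```

The case-insensitive version is only a workaround for this one kind of mismatch. Agreeing on key names up front is the more dependable fix.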

I hope this post was able to provide you with some clarity on what a NoSQL database is and why it is useful. Thank you for reading!
