Relational databases store data in tabular form with labelled rows and columns. Although they are a solid choice for many workloads, speed and scalability can become an issue in some cases.
SQL (Structured Query Language) is used by most relational database management systems to manage data stored in tabular form. NoSQL refers to non-SQL or non-relational database design. It still provides an organized way of storing data, but not in tabular form.
The common structures adopted by NoSQL databases to store data are key-value pairs, wide columns, graphs, and documents. There are several NoSQL databases used in the data science ecosystem. In this article, we will use one of the popular ones, MongoDB.
MongoDB stores data as documents. A document in MongoDB consists of field-value pairs. For instance, the following can be a document in MongoDB.
{
"name": "John",
"age": 26,
"gender": "Male"
}
Documents are organized in a structure called "collection". As an analogy, we can think of documents as rows in a table and collections as tables.
Documents are written in a JSON-like syntax (JSON stands for JavaScript Object Notation). JSON is a commonly used format, but it has some drawbacks.
Since JSON is a text-based format, it is relatively slow to parse. It is also not an optimal choice with regard to memory efficiency, and it supports only a limited number of data types.
To overcome these challenges, MongoDB introduced a format called BSON (Binary JSON), which can be thought of as a binary representation of JSON. BSON is faster to traverse than JSON and supports additional data types such as dates and binary data. It is also designed to be space-efficient, although in some cases a BSON document can take as much space as its JSON equivalent, or more.
Since BSON is binary encoded, it is not human-readable, which might be an issue. However, MongoDB solves this by allowing users to export BSON data in JSON format.
We can store the data in BSON and view it as JSON. Any file in JSON format can be stored in MongoDB as BSON.
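To make the data type limitation of JSON concrete, here is a minimal sketch in plain JavaScript (it runs in Node.js and does not need MongoDB). The "joined" field is a hypothetical addition for illustration and is not part of the example document above. JSON has no date type, so a date is turned into a string when the document is serialized and it stays a string when parsed back. BSON, in contrast, has a dedicated date type.

```javascript
// A customer document with a hypothetical "joined" date field.
const customer = {
  name: "John",
  age: 26,
  joined: new Date("2021-01-21T00:00:00Z"),
};

// Serialize to JSON text and parse it back again.
const roundTripped = JSON.parse(JSON.stringify(customer));

console.log(customer.joined instanceof Date); // true
console.log(typeof roundTripped.joined);      // string
console.log(roundTripped.joined);             // 2021-01-21T00:00:00.000Z
```

After the round trip, the application has to know that "joined" was meant to be a date and convert it back itself.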
After this brief introduction, let’s do some practice. You can install MongoDB Community Edition on Linux, macOS, or Windows. The installation steps are clearly explained in the MongoDB documentation.
I have installed it on my Linux machine. We start the MongoDB service from the terminal using the following command.
$ sudo systemctl start mongod
Then, we just type mongo in the terminal to start using the shell. (In recent MongoDB releases, the legacy mongo shell has been replaced by mongosh.)
$ mongo
The following command prints the databases on the server.
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB
test 0.000GB
We can select a database with the use command.
> use test
switched to db test
We mentioned earlier that documents in MongoDB are organized in collections. To view the collections in a database, we use the show collections command.
> show collections
inventory
There is currently one collection in the test database. We want to create a new collection called "customers". This is fairly simple: we just create a new document and insert it into the collection. If the specified collection does not exist in the database, MongoDB automatically creates it.
> db.customers.insertOne(
... {name: "John",
... age: 26,
... gender: "Male",
... }
... )
{
"acknowledged" : true,
"insertedId" : ObjectId("60098638be451fc77a72a108")
}
As we see in the output, the document was successfully inserted into the collection. Let’s run the "show collections" command again.
> show collections
customers
inventory
We now have two collections in the test database. We can query the documents in a collection with the find command.
> db.customers.find()
{ "_id" : ObjectId("60098638be451fc77a72a108"), "name" : "John", "age" : 26, "gender" : "Male" }
The field-value pairs as well as the object id are displayed. Since there is only one document in the customers collection, we do not specify any condition in the find function.
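The _id value in the output was generated by MongoDB because we did not supply one. An ObjectId is a 12-byte value whose first 4 bytes hold the creation time in seconds since the Unix epoch. As a small sketch in plain JavaScript (runnable in Node.js without a server), we can decode that timestamp from the hexadecimal string ourselves. The variable names are only illustrative.

```javascript
// Hexadecimal string of the ObjectId returned by insertOne above.
const idHex = "60098638be451fc77a72a108";

// The first 8 hex characters (4 bytes) encode seconds since the Unix epoch.
const seconds = parseInt(idHex.slice(0, 8), 16);

// JavaScript dates work in milliseconds.
const createdAt = new Date(seconds * 1000);

console.log(createdAt.toISOString()); // 2021-01-21T13:48:40.000Z
```

In the shell, calling getTimestamp() on an ObjectId gives the same information without manual decoding.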
The find function accepts a condition for filtering the documents it returns. In this sense, it is similar to a select statement with a where clause in SQL.
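The condition is passed to find as a filter document. For example, db.customers.find({ age: { $gt: 25 } }) would return the customers older than 25. The sketch below mimics two such filters in plain JavaScript on an in-memory array so that it can run without a server. Only John comes from this article; the other documents are made up for illustration, and the array filter is just a stand-in for what the server does.

```javascript
// In-memory stand-in for the customers collection.
// Only John is from the article; Jane and Max are hypothetical.
const customers = [
  { name: "John", age: 26, gender: "Male" },
  { name: "Jane", age: 22, gender: "Female" },
  { name: "Max", age: 31, gender: "Male" },
];

// Shell equivalent (needs a running server):
//   db.customers.find({ age: { $gt: 25 } })
const olderThan25 = customers.filter((doc) => doc.age > 25);

// Shell equivalent with two conditions, which are combined with AND:
//   db.customers.find({ gender: "Male", age: { $gt: 30 } })
const malesOver30 = customers.filter(
  (doc) => doc.gender === "Male" && doc.age > 30
);

console.log(olderThan25.map((doc) => doc.name)); // [ 'John', 'Max' ]
console.log(malesOver30.map((doc) => doc.name)); // [ 'Max' ]
```

In SQL terms, the first filter corresponds roughly to SELECT * FROM customers WHERE age > 25.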
Conclusion
We have covered a brief introduction to NoSQL with MongoDB. There is, of course, much more to cover in both NoSQL and MongoDB.
Both SQL and NoSQL are of crucial importance in the data science ecosystem. Data is the fuel of data science, so everything starts with proper, well-maintained, and easily accessible data. Both SQL and NoSQL play a critical role in providing it.
I will be writing more articles on both SQL and NoSQL databases. Stay tuned!
Thank you for reading. Please let me know if you have any feedback.