As a data scientist, analyst, or engineer, you probably deal with tabular data pretty often. The row-column format is the most common way to store data, as it is easy for most non-technical users to grasp.
This two-dimensional format is showcased in several tools and frameworks such as:
- The spreadsheet format in Microsoft Excel or Google Sheets;
- The dataframe format in R or Python’s pandas;
- Tables stored in relational databases.
A lot of the data sources used for machine learning models are stored in the last of these – database tables. On top of that, a lot of model output is also served through tables that fit some kind of relational schema, helping users make sense of the results and query them easily. As more organizations integrate data scientists and engineers into their core information technology teams, job roles tend to get a bit mixed up – and that is not, inherently, a bad thing.
But… is a fuzzy job role the only reason data scientists should learn SQL (Structured Query Language) and general database theory? I don’t think so. Personally, I believe that learning those concepts will be an asset for two major reasons (besides maintaining your own data sources / model output):
- You will be able to communicate with other data professionals more easily, being able to translate dataframe commands into SQL-like statements.
- You will tend to build more efficient data pipelines by performing complex operations inside SQL and leveraging the indexes of your tables.
SQL is one piece of the database world. It’s just an interface for us to communicate with data that lives in some database server – building and maintaining databases involves much more than that.
The goal of this post is to detail some database concepts and give you a roadmap for studying SQL and databases. These are the 5 angles that I believe are relevant for data scientists and ML engineers to explore – don’t consider this an exhaustive list, but a rough guide to the first high-level concepts you can focus on throughout your journey of learning SQL and databases.
Let’s start!
Basic Querying
Retrieving data from tables stored in databases with SQL queries tends to be more efficient than bringing the entire table into Python and working on it in pandas (especially when working in local environments).
Learning the basic query methods is a great start to build good data wrangling skills that will speed up your queries and data pipelines. You can start by checking how to perform simple operations on tables:
- Selecting a subset of columns from a specific table;
- Filtering a table based on a condition;
- Grouping data by a specific column;
The simplest select statement contains only two keywords – `SELECT` and `FROM`. You can start by running this simple query on a sample database in MySQL or another open-source system, spicing it up with other keywords such as `WHERE`, `GROUP BY`, or `HAVING` as your queries become more complex.
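To make this concrete, here is a minimal sketch using Python’s built-in `sqlite3` module and a small, hypothetical `sales` table (the table name, columns, and values are just illustrative). It runs all the keywords above in a single query:

```python
import sqlite3

# In-memory database with a small, hypothetical "sales" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", "north", 10.0), ("widget", "south", 20.0),
     ("gadget", "north", 5.0), ("gadget", "north", 7.5)],
)

# Select a subset of columns, filter rows with WHERE, aggregate with
# GROUP BY, and filter the resulting groups with HAVING
rows = conn.execute(
    """
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE region = 'north'
    GROUP BY product
    HAVING SUM(amount) > 6
    ORDER BY product
    """
).fetchall()

print(rows)  # [('gadget', 12.5), ('widget', 10.0)]
```

Note the order of operations: `WHERE` filters individual rows before grouping, while `HAVING` filters the aggregated groups afterwards.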
To help you, here’s a couple of resources you can use to learn more about simple query statements:
- W3 Schools Select Statement Page;
- Tech on the Net Select Statement Example;
- First Section of my SQL For Absolute Beginners Course;
Indexes and Keys
Learning about table IDs and primary keys will help you better understand how to combine tables, avoid duplicates in your data pipelines, and build smart data structures when working with multiple data sources. No table sitting in a schema is an island (or at least, it shouldn’t be) – primary and foreign keys are the main concepts that help you paint an overall picture of your data, understand its underlying structure, and pin down the column(s) that identify a single row in your table.
Additionally, you will also stumble upon the concept of indexes. Primary keys and indexes are normally tied to each other – but you can have either one in a table without the other. Indexes are mostly used to speed up certain searches, at the expense of extra objects that occupy storage space. A well-designed index can speed up certain queries ten-fold.
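The sketch below illustrates both ideas with `sqlite3` and a hypothetical `users` table: the primary key rejects duplicate IDs, and a secondary index changes the query plan for lookups on a non-key column (the table and index names are just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A primary key both identifies each row and rejects duplicates
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a@example.com", "PT"), (2, "b@example.com", "US")],
)

# Inserting a row with an existing primary key fails
duplicate_rejected = False
try:
    conn.execute("INSERT INTO users VALUES (1, 'c@example.com', 'UK')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)

# A secondary index speeds up lookups on a non-key column
conn.execute("CREATE INDEX idx_users_country ON users (country)")

# EXPLAIN QUERY PLAN confirms the search now uses the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE country = 'PT'"
).fetchall()
print(plan)
```

Without the index, the same `WHERE country = …` query would scan the whole table; with it, the engine can jump straight to the matching rows.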
Check the following resources to know more:
- Techopedia’s Primary Key webpage;
- Microsoft Primary and Foreign Key explanation;
- Oracle’s Guide to Table Indexes;
Data Models
How does one organize a database and represent a specific reality in the format of two-dimensional objects? Data models (not to be confused with machine learning models) are ways to represent multiple real-world entities in table format. Studying data models will help you understand relationships and primary keys much better, because those concepts are central to building proper data schemas.
A basic logical data model (sometimes referred to as a schema) may look like the following:

The data model contains a set of constraints and connections detailing how the underlying data is organized. There are many flavours of data models, and each fulfils different objectives. For instance:
- The Relational Model – probably the most famous data model and one of the first ways to organize database tables. The relational model structures data as tables of entities and the relationships between them.
- The Star Schema – this schema is tailored for modern data warehouses, splitting information into fact and dimension tables. A star schema commonly keeps track of changes using mechanisms such as effective-from and effective-to dates that state the validity of each row.
- The Snowflake Schema – similar to the star schema, the snowflake schema extends it by normalizing dimensions into additional layers, keeping data redundancy low.
The three examples above are just a starting point for understanding how one can model data into different tables – there are more schemas that I have not detailed. In practice, the underlying tables you need to query will often fit one of the schemas described here.
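A tiny star schema can be sketched in a few `CREATE TABLE` statements. The sketch below, again using `sqlite3`, uses hypothetical `fact_sales`, `dim_product`, and `dim_date` tables; the foreign keys are what tie the points of the star back to its centre:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default

# Dimension tables describe the "who/what/when" of each event
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT)")

# The fact table sits at the centre of the star, referencing each dimension
conn.execute(
    """
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product (product_id),
        date_id INTEGER REFERENCES dim_date (date_id),
        amount REAL
    )
    """
)

conn.execute("INSERT INTO dim_product VALUES (1, 'widget')")
conn.execute("INSERT INTO dim_date VALUES (1, '2023-01-01')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 9.99)")

# The foreign keys reject facts pointing at non-existent dimension rows
valid_fk = True
try:
    conn.execute("INSERT INTO fact_sales VALUES (2, 99, 1, 5.00)")
except sqlite3.IntegrityError:
    valid_fk = False
print("dangling fact rejected:", not valid_fk)
```

The fact table stays narrow (keys plus measures), while descriptive attributes live in the dimensions – the layout that makes warehouse aggregations cheap.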
Some resources to learn about schemas:
- MSSQLTips Page about Relational Schemas;
- Microsoft’s Star Schema example;
- Integrate.io Snowflake vs. Star Schema page;
Joining Data
As we’ve seen, building and understanding schemas is anchored in the relationships between tables. Accordingly, connecting different tables is a central concept of data models.
As you stack years of experience in data science, you’ll notice that a significant share of mistakes in data pipelines come from poor joining (and data modelling) skills. Many mistakes can happen with poor joins – the most common ones are:
- A certain portion of your rows get cut due to a poor selection of the join type.
- The entities of the joined tables are ill-defined and duplicates are produced.
Arguably, most data pipeline errors are centered around the two issues above. Either mistake causes significant damage to your data science models – by removing a significant portion of the phenomenon you want to model, or by creating "false" instances that carry more weight than they should.
Joins are characterized by two things – their type and their key. The type relates to the domain they capture from the underlying tables; the key refers to the column(s) used as the connector between tables.

For instance, an `inner join` connects two (or more) tables by selecting the common domain between them. If a key is not present in both tables, the row will not be propagated to the resulting table.
By contrast, a `right join` or `left join` treats one table (the right or left one) as the "controlling" table whose rows are propagated to the result, regardless of whether they match in the other table.
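The difference is easy to see side by side. This sketch joins two hypothetical tables, `customers` and `orders`, on `customer_id` (names and values are illustrative); one customer has no orders, which is exactly where the join types diverge:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ana"), (2, "Bruno"), (3, "Carla")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 50.0), (11, 1, 30.0), (12, 2, 20.0)])

# INNER JOIN keeps only the customer_ids present in both tables
inner = conn.execute(
    """SELECT c.name, o.total
       FROM customers c
       INNER JOIN orders o ON o.customer_id = c.customer_id
       ORDER BY c.name, o.total"""
).fetchall()

# LEFT JOIN keeps every customer; unmatched rows get NULL (None) on the right
left = conn.execute(
    """SELECT c.name, o.total
       FROM customers c
       LEFT JOIN orders o ON o.customer_id = c.customer_id
       ORDER BY c.name, o.total"""
).fetchall()

print(inner)  # [('Ana', 30.0), ('Ana', 50.0), ('Bruno', 20.0)]
print(left)   # same rows, plus ('Carla', None)
```

Carla silently disappears from the inner join – exactly the "rows get cut" failure mode described above; the left join keeps her, with a `NULL` total you must then handle downstream.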
Joining is central to building clear and error-free data pipelines. Of the concepts I’ve detailed, this one should have the most impact on your career as a data scientist – let’s see some resources:
- Fourth Section of my SQL For Absolute Beginners Course;
- W3 Schools Table Join Example;
- DoFactory Join Tutorial;
NoSQL
Outside the realm of relational databases and SQL, and after learning the aforementioned database concepts, you will probably come into contact with the concept of NoSQL.
NoSQL (short for "not only SQL") is an important concept for expanding your database knowledge and understanding how to store data outside of the formats we talked about above. While data models, schemas, and relational databases are an excellent way to store and serve data, their rigidity makes them unfit to deal with unstructured data. NoSQL databases (such as MongoDB) work around that by storing data outside of the typical "schemas" we’ve seen above.
As the cost of storage decreases and data redundancy becomes less of an issue, key-value and document databases have become an efficient solution for storing data that does not fit well into rigid relational schemas.
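To get a feel for the key-value/document idea without installing a database, here is a deliberately naive sketch of a document store as a Python dict of JSON strings (the `put`/`get` helpers and the records are entirely hypothetical – real systems like MongoDB add persistence, indexing, and querying on top of this idea):

```python
import json

# A hypothetical document store sketched as a Python dict: each value is a
# free-form JSON document, so records need not share the same "columns"
store = {}

def put(key, document):
    store[key] = json.dumps(document)

def get(key):
    return json.loads(store[key])

# Two "users" with different shapes - a relational table would need a
# schema change (or NULL-filled columns) to hold both
put("user:1", {"name": "Ana", "email": "ana@example.com"})
put("user:2", {"name": "Bruno", "tags": ["admin", "beta"], "age": 34})

print(get("user:2")["tags"])  # ['admin', 'beta']
```

The flexibility cuts both ways: nothing stops two documents from drifting apart in structure, which is precisely what relational schemas are designed to prevent.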
You can check more resources about NOSQL here:
Thank you for taking the time to read this post. My goal was to give you the concepts that will help you develop a study plan for learning SQL and databases – these resources won’t teach you everything you need, but they should give you a head start on the basics.
As you’ve noticed throughout the post, I’ve set up a course on learning SQL from Scratch on Udemy where I’ve included a deep dive into these concepts and more – the course is suitable for absolute beginners and I would love to have you around!
