Starting a new adventure into DS? Then you need a few things…
Contents: (1), 2, 3
Every day the concept of Data Science keeps evolving and with it we find more concepts of other fields assimilated into data science. That is because we keep discovering new ways of applying the tools that Data Science provides.
As consequence, when you go to a Data Scientist interview, you will encounter questions covering a wide range of tools, algorithms and technologies that try to replicate what you are going to use in your day to day work.

The goal of this article and the following series is to explore together little by little some of the questions and skills that you need to cover to apply for a Data Science Position.
SQL
The first step working with data is…. to be able to gather the datasets that you require so that you can create analytics, reports and models. This includes the data retrieval but as well aggregations, basic data cleaning and filtering.
So let’s start with some examples of simple problems that you may be asked to solve on the spot:
*Students enrolled []**
You have a table with records of students but there are faulty records…
/*
A table containing the students enrolled in a course has
incorrect data in records with ids between 40 and 100 (inclusive).
This is the definition of the table:
TABLE course_enrollments
id INTEGER NOT NULL PRIMARY KEY
year INTEGER NOT NULL
studentId INTEGER NOT NULL
Write a query that updates the field 'year'
of every faulty record to 2019.
*/
This requires to update the table by filtering the specific ids and specifying the new value as per the requirements:
UPDATE course_enrollments
SET year = 2019
WHERE id BETWEEN 40 AND 100
*Session Duration []**
Now suppose that you are managing a website and you want to understand how your users behave and how successful is your website…
/*
Write a query that selects the userId and the average
session duration for each user who has more than one session.
Sessions are kept in the following table:
TABLE sessions
id INTEGER PRIMARY KEY,
userId INTEGER NOT NULL,
duration DECIMAL NOT NULL
*/
The main two points are first the aggregation across userId so that we can calculate the average and second the condition to apply in the aggregation…
SELECT userId, avg(duration)
FROM sessions
GROUP BY userid
HAVING COUNT(*) > 1
*Race participants []**
Retrieve how many race participants we have with the name Jackson.
/*
Write a query that returns the number of race participants
whose first name is Jackson.
String comparisons should be case sensitive.
You are given to following data definition:
TABLE participants
id INTEGER PRIMARY KEY,
firstName VARCHAR(30) NOT NULL,
lastName VARCHAR(30) NOT NULL
*/
This is quite simple, isn’t it?
SELECT COUNT(*)
FROM participants
WHERE firstname = 'Jackson'
*Faulty records []**
/*
A table containing the students enrolled in a course has
incorrect data in records with ids between 40 and 100 (inclusive).
This is the definition of the table:
TABLE course_enrollments
id INTEGER NOT NULL PRIMARY KEY
year INTEGER NOT NULL
studentId INTEGER NOT NULL
Write a query that updates the field 'year' of every faulty record to 2019.
*/
Again the problem definition is longer than the solution…
UPDATE course_enrollments
SET year = 2019
WHERE id BETWEEN 40 AND 100
*Distinct company names []**
/*
We have information about companies in two separate tables:
TABLE uk_companies
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(50) NOT NULL
TABLE usa_companies
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(50) NOT NULL
Write a query that select all distinct company names.
*/
The title is already a big give-away of the problem and the only thing left is to join together the two tables…
SELECT DISTINCT name
FROM (
SELECT name
FROM uk_companies
UNION ALL
SELECT name from usa_companies
)

*Create table []**
Sometime we have need to create new tables. In this case there are going to be variations depending on the database (PrestoDB, MySQL, PostgreSQL…)
/*
We have the following two tables
to allocate users and their roles:
TABLE users
id INTEGER NOT NULL PRIMARY KEY,
userName VARCHAR(50) NOT NULL
TABLE roles
id INTEGER NOT NULL PRIMARY KEY,
role VARCHAR(50) NOT NULL
The new users_roles table should contain the mapping between each user and their roles.
Each user can have many roles, and each role can have many users.
Write a SQLite create table statement so that:
- Only users from the users table can exist within users_roles.
- Only roles from the roles table can exist within users_roles.
- A user can only have a specific role once.
*/
Keeping everything tidy, we need to consider the new key that we will consider in our table as well as the primary keys of the existing tables that will become our foreign keys…
CREATE TABLE users_roles (
userId INTEGER NOT NULL,
roleId INTEGER NOT NULL,
FOREIGN KEY(userId) REFERENCES users(id),
FOREIGN KEY(roleId) REFERENCES roles(id),
PRIMARY KEY (userId, roleId)
)
*Company Employees []**
/*
We own a company with the following employee hierarchy.
We consider that an employee is a manager if other employee
includes in his/her managerId the first employees id.
And any manager may or may not also have a manager.
TABLE employees
id INTEGER NOT NULL PRIMARY KEY
managerId INTEGER REFERENCES employees(id)
name VARCHAR(50) NOT NULL
Write a query that selects the names of employees
who are not managers.
*/
This can be solved with an inner join of the table with itself as follows:
SELECT name
FROM employees
WHERE id NOT IN (
SELECT emp1.Id
FROM employees emp1
INNER JOIN employees emp2 ON emp1.id = emp2.managerid
)

Well done arriving at this point. With this, we close the chapter on SQL and I hope you enjoyed it, and maybe you even learnt something new with these simple examples.
If you already use SQL on your daily routine, then probably this has been too easy. It is true that there is much more to explore in SQL queries (going into the performance of the queries and more complex joins and filters for example) but interviews are time limited. So unless your role specifically focuses on the data management…. your interviewer will move on to other topics like the ones we are about to cover in the following articles.
That will include probability, machine learning models, deep learning and much more….
Contents: (1)
[*] These queries are examples similar to the queries that I use on initial assessments of the people that I interview, but we aware of other queries that may involve more complex tasks… I cannot give everything away, right?
Thanks for reading
I hope you have enjoyed this article. Please leave your thoughts and ideas if you are interested in the topic.