
6 Questions I was Asked at Data Scientist Interviews

A guide that will help you prepare for your next interview

Photo by Scott Graham on Unsplash

Data science has experienced monumental growth in recent years. As a result, the demand for data scientists has increased tremendously, driving many people to change careers and move into the field.

There is one particular step at the core of this process: interviews. The ambition and aspiration to become a data scientist are not enough to get you a job. Candidates are expected to bring a comprehensive set of skills.

Data science is an interdisciplinary field, so the required skills are not limited to a single topic. In this article, I will share 6 questions that I was asked at data scientist interviews.

I have picked the questions so that they cover different subjects and give you an overview of what to typically expect at a data scientist interview. The questions are related to Python, machine learning, SQL, and databases.

I will not only provide the answers but also explain the topic in a broader context.


Question 1: What are the L1 and L2 regularization techniques and the difference between them?

In machine learning, overfitting arises when a model fits the training data so closely that it cannot generalize to new observations. An overfit model captures the details and noise in the training data rather than the general trend. Thus, overfit models look outstanding on training data but perform poorly on new, previously unseen observations. The main cause of overfitting is model complexity.

Regularization controls model complexity by adding a penalty term to the loss function. When a regularization term is added, the model tries to minimize both the loss and the model complexity.

The two main reasons that cause a model to be complex are:

  • Total number of features (handled by L1 regularization), or
  • The weights of features (handled by L2 regularization)

L1 regularization, also called regularization for sparsity, is used to handle sparse feature vectors that consist mostly of zeroes. L1 regularization forces the weights of uninformative features toward zero by subtracting a small constant amount from the weight at each iteration, eventually making the weight exactly zero.

L2 regularization, also called regularization for simplicity, pushes weights toward zero but does not make them exactly zero. L2 regularization acts like a force that removes a small percentage of each weight at every iteration, so the weights never reach zero. If we define model complexity as a function of the weights, under L2 regularization the complexity contributed by a feature is proportional to the square of its weight.

L1 regularization penalizes |weight| whereas L2 regularization penalizes (weight)².
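To make the difference concrete, here is a minimal sketch using scikit-learn's Lasso (L1) and Ridge (L2) estimators. The synthetic data and parameter values below are my own illustrative assumptions, not part of the interview question.

# Minimal sketch: L1 (Lasso) vs L2 (Ridge) regularization on synthetic data
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two features are informative; the other three are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print(lasso.coef_)  # weights of uninformative features are driven exactly to zero
print(ridge.coef_)  # weights of uninformative features are small but not exactly zero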


Question 2: What is the difference between classification and clustering?

Classification and clustering are two types of machine learning tasks.

Classification is a supervised learning task. Samples in a classification task have labels. Each data point is classified according to some measurements. Classification algorithms model the relationship between the measurements (features) of samples and their assigned class, and the resulting model then predicts the class of new samples.

Clustering is an unsupervised learning task. Samples in clustering do not have labels. We expect the model to find structure in the data set so that similar samples can be grouped into clusters. We essentially ask the model to label the samples itself.
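As an illustrative sketch (the data set and model choices are my own assumptions), the same data can be handled both ways with scikit-learn: a classifier trains on the labels, while a clustering algorithm never sees them.

# Minimal sketch: classification (supervised) vs clustering (unsupervised)
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Classification: the model learns from the given labels y
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))   # predicted classes for some samples

# Clustering: the model never sees y and must discover the groups on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])       # cluster assignments found by the model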


Question 3: Given a list of tuples, how can you sort the list according to the second items in the tuples?

This is a coding question. The programming language of choice is usually Python. We have the following list of tuples, which needs to be sorted based on the second item in each tuple.

list_a = [('a', 2), ('b', 3), ('c', 1), ('d', 6), ('e', 5)]

We have two options. The first option is to return a sorted version of the original list so the original one is not modified.

# First option: sorted() returns a new sorted list and leaves the original untouched
list_a = [('a', 2), ('b', 3), ('c', 1), ('d', 6), ('e', 5)]
sorted_list = sorted(list_a, key=lambda tpl: tpl[1])

print(sorted_list)
# [('c', 1), ('a', 2), ('b', 3), ('e', 5), ('d', 6)]

print(list_a)
# [('a', 2), ('b', 3), ('c', 1), ('d', 6), ('e', 5)]

The second option is to sort in place, which means the original list itself is modified.

# Second option: list.sort() sorts in place and modifies the original list
list_a = [('a', 2), ('b', 3), ('c', 1), ('d', 6), ('e', 5)]
list_a.sort(key=lambda tpl: tpl[1])

print(list_a)
# [('c', 1), ('a', 2), ('b', 3), ('e', 5), ('d', 6)]
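As a small side note, the same sort key can be expressed with operator.itemgetter, a common idiom for this kind of sorting. This is an optional alternative, not something the interviewer asked for.

from operator import itemgetter

list_a = [('a', 2), ('b', 3), ('c', 1), ('d', 6), ('e', 5)]
# itemgetter(1) builds a callable equivalent to lambda tpl: tpl[1]
print(sorted(list_a, key=itemgetter(1)))
# [('c', 1), ('a', 2), ('b', 3), ('e', 5), ('d', 6)]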

Question 4: What is the "yield" keyword in Python used for?

In Python, an object is an iterable if we can iterate over its elements using a loop or a comprehension (e.g. a list or a dictionary).

Generators are iterators, a specific kind of iterable. Generators do not store their values in memory, so we can iterate over them only once; the values are generated as we iterate.

The yield keyword is used inside a function much like the return keyword. The difference is that the function returns a generator when yield is used instead of return.

It is very useful and efficient when a function produces a large set of values that will only be used once.

When a function contains the yield keyword, it becomes a generator function. In other words, yield turns a function into a generator, so it returns values one by one.
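Here is a minimal sketch of a generator function (the function name and the sizes used are my own example, not from the interview):

def squares(n):
    """Yield the squares of 0..n-1 one at a time, without building a list."""
    for i in range(n):
        yield i * i

gen = squares(5)
print(next(gen))    # 0 -- values are produced lazily, on demand
print(list(gen))    # [1, 4, 9, 16] -- the remaining values; the generator is now exhausted

# Large sequences never sit in memory all at once
print(sum(squares(1_000_000)))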


Question 5: What is normalization and denormalization in a database?

Both are techniques that are used when designing a database schema.

The goal of normalization is to reduce data redundancy and inconsistency. The number of tables is increased with normalization.

The goal of denormalization is to execute queries faster. It is achieved by adding redundancy, and it results in fewer tables than a normalized design.

Suppose we are designing a database for a retail business. The data to be stored contains customer data (name, email address, phone number) and purchase data (purchase date and amount).

Normalization suggests having separate tables for customer data and purchase data. The tables can be related using a foreign key such as a customer id. In that case, when customer data changes (e.g. an email address), we only update one row in the customer table.

Denormalization suggests keeping all the data in a single table. When we need to update the email address of a customer, we have to update every row that contains a purchase by that customer. The advantage of denormalization over normalization is faster query execution.
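A minimal sketch of what the two designs might look like (table and column names are my own illustration, not from the interview):

-- Normalized design: customer data lives in one place, purchases reference it
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    email       VARCHAR(100),
    phone       VARCHAR(20)
);

CREATE TABLE purchases (
    purchase_id   INT PRIMARY KEY,
    customer_id   INT,
    purchase_date DATE,
    amount        DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- Denormalized alternative: one wide table, customer data repeated for every purchase
CREATE TABLE purchases_denormalized (
    purchase_id   INT PRIMARY KEY,
    name          VARCHAR(100),
    email         VARCHAR(100),
    phone         VARCHAR(20),
    purchase_date DATE,
    amount        DECIMAL(10, 2)
);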


Question 6: SQL query

It is highly likely that you will get a question about SQL queries. I was asked to write a select statement that retrieves data from a table according to a given task.

Suppose we have the following items table.

mysql> select * from items limit 5;
+---------+--------------+-------+----------+
| item_id | description  | price | store_id |
+---------+--------------+-------+----------+
|       1 | apple        |  2.45 |        1 |
|       2 | banana       |  3.45 |        1 |
|       3 | cereal       |  4.20 |        2 |
|       4 | milk 1 liter |  3.80 |        2 |
|       5 | lettuce      |  1.80 |        1 |
+---------+--------------+-------+----------+

The task: find the average price of the items at each store and sort the results by the average price. We can solve it by applying the avg function to the price column and grouping the values by store id. The sorting is achieved by adding an order by clause at the end.

mysql> select avg(price), store_id
    -> from items
    -> group by store_id
    -> order by avg(price);
+------------+----------+
| avg(price) | store_id |
+------------+----------+
|   1.833333 |        1 |
|   3.650000 |        3 |
|   3.820000 |        2 |
+------------+----------+

Conclusion

These are questions I have actually been asked at interviews. You may not encounter the exact same questions, but the topics are usually the same.

It is important to note that the questions are likely to come from different areas. This is an indication of what is expected from a data scientist. Having a broad range of skills will take you one step ahead in the competitive job market.

I’m planning to write more comprehensive articles that include more interview questions. Stay tuned for upcoming articles!

Thank you for reading. Please let me know if you have any feedback.

