3 Types of Data Science Engineer Interview Questions

Nail the data science engineer interview with confidence

Photo by Hunters Race on Unsplash

My background is primarily in software engineering and data science. As I began looking for a job in data science, interviewers noticed my software experience. Many of the people interviewing did not have software backgrounds but came from mathematics, physics, or signal processing. During my interviews, I commonly saw questions in three main areas: data ingestion and cleaning, scalability, and research and development.


1. Data Ingestion and Cleaning

Data ingestion and cleaning are two essential topics in any data science job. When working in data science, you will spend a lot of time on which data sources you treat as your ground truth, how you ingest those data sources, and which methods you use to clean the data.

Data Ingestion

Data ingestion can come in many forms, and depending on the team you are working on, the questions may vary significantly. If you are looking to become a data engineer, you will need good foundational knowledge of database concepts and will need to answer more targeted questions on how you would interact with or develop new databases. Types of questions you may get include:

  • Data Experience – Explain the types of data you have worked with and how you have stored that data. For example, how much image data have you used? How did you store and use those images? Interviewers want to understand the type of data you have experience with.
  • Data Size – This is a common question: what is the scale of data you have worked with? The interviewers want to know if you have worked on large datasets, especially if the job description notes working with big data.
  • Big Data Ingestion – Can you perform ETL on large datasets? What tools and technologies have you worked with when performing ETL?
  • Database Types – What are the main differences between database types, such as NoSQL vs. SQL databases? When would you use each type of database?
  • Merges – How would you merge two datasets?
  • Joins – What are the different kinds of joins in SQL, and in what case would you use each one?
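For the joins question, it can help to have a tiny example in mind. The sketch below is only an illustration: the `customers` and `orders` tables and the `demo_joins` function are hypothetical names invented for this example, and it uses Python's built-in `sqlite3` module with an in-memory database. It contrasts an inner join, which keeps only rows with a match in both tables, with a left join, which keeps every row from the left table and fills in nulls where there is no match.

```python
import sqlite3


def demo_joins():
    """Return (inner_rows, left_rows) for two small hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
        """
    )
    # INNER JOIN: only customers that have at least one order.
    inner_rows = conn.execute(
        """
        SELECT c.name, o.amount
        FROM customers AS c
        INNER JOIN orders AS o ON o.customer_id = c.id
        ORDER BY o.id
        """
    ).fetchall()
    # LEFT JOIN: every customer; amount is NULL (None) when no order exists.
    left_rows = conn.execute(
        """
        SELECT c.name, o.amount
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        ORDER BY c.id, o.id
        """
    ).fetchall()
    conn.close()
    return inner_rows, left_rows
```

Here the customer with no orders is dropped by the inner join but appears in the left join with a missing amount. Right and full outer joins follow the same idea with the roles of the tables swapped or combined.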

For any question, be sure to work through the solution out loud. Explain the experience you have so far, and be honest about what you don’t know.

Data Cleaning

Data cleaning did not come up in every interview I have done, but it came up often enough that I believe it is worth preparing for these types of questions. The most notable case was one in which the interviewers wrote a tiny dataset on a whiteboard.

Sample fake dataset as an example - image from the author.

Pointing to a dataset like the one above, the interviewers asked, "How would you clean each column of this dataset?" The objective of this task was to discuss the issues you see in the dataset and how you would handle each case. As you look over a problem like this, consider:

  • Column Names – Are they consistent, and if not, how can you create a compatible schema?
  • String Columns – What values stand out to you? How would you clean those values out, and why? If values are missing, what would you do with the data?
  • Dates – Is there a consistent date format used? If not, what format should you use and why? What happens if the dates are not complete or are wrong?
  • Numerics – What values stand out to you to be removed or cleaned? If data is missing, would you impute the missing points or leave them blank?

As you work through an example like this, try to talk through your decisions column by column. Explain the process you would go through to clean the data and any assumptions you are making based on your information. It is okay to ask questions if you are not 100% sure. If you do not know what they want to do with columns or rows with missing information, you can recommend some ideas that you would consider and ask them their thoughts. This type of question is a way to gauge how you would tackle a data cleaning problem. I found this to be one of the most memorable questions I have received when interviewing as it brought out some very engaging conversations.
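To make the column-by-column approach concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the column names (`name`, `signup_date`, `amount`), the list of accepted date formats, and the set of strings treated as missing are hypothetical and do not come from the whiteboard dataset. In an interview, each of these choices is something you would state out loud and confirm with the interviewers.

```python
from datetime import datetime

# Assumed conventions for this sketch; confirm these with the data owner.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")
MISSING = {"", "n/a", "na", "nan", "null", "none", "-", "?"}


def clean_column_name(name):
    """Lowercase, trim, and join words with underscores for one schema."""
    return "_".join(name.strip().lower().split())


def clean_string(value):
    """Collapse whitespace; map placeholder values to None."""
    text = " ".join(value.split())
    return None if text.lower() in MISSING else text


def clean_date(value):
    """Try each assumed format and return an ISO date string, else None."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_number(value):
    """Strip currency symbols and thousands separators; None if not numeric."""
    text = value.strip().replace(",", "").replace("$", "")
    if text.lower() in MISSING:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def clean_row(row):
    """Clean one raw record (a dict of strings) column by column."""
    renamed = {clean_column_name(key): value for key, value in row.items()}
    return {
        "name": clean_string(renamed["name"]),
        "signup_date": clean_date(renamed["signup_date"]),
        "amount": clean_number(renamed["amount"]),
    }
```

This sketch leaves missing values as `None` rather than imputing them, which keeps the decision about imputation separate from parsing. It also assumes that slash-separated dates are month first, which is exactly the kind of assumption worth raising as a question.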

What other aspects of data cleaning would you consider in an example like this?


2. Scalability

Moving on from ingestion and cleaning, another common area for questions is scalability. The interviewers want to understand two things: (1) how you can work with their current processes and scale analyses to run on big data, and (2) how you can bring code to a production state.

  • Parallelization – You will need to understand how to parallelize the processing of large amounts of data to rapidly produce analytic results and visualizations. One focus area I see a lot is how you can process the data and reproduce the results 10X faster.
  • Software Development Processes – Depending on the interview, coding and development may not be the focus of the discussion. If you are coming into a data science team as a software developer, the conversations may include how to productionalize the code, develop more rigorous processes, and introduce CI/CD pipelines. Be able to speak to your skills in this area and explain how they can provide value to the team. Explain how these types of processes and development practices can make the team's analytics more robust.
  • Automation – How can you automate processes to speed up analyses? How can we reproduce results 10X faster? Automation can commonly come up in interviews when discussing methods the team already works on and how they would want to utilize your software skills to scale processes.
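As one way to talk through the parallelization question, the sketch below splits a dataset into chunks, summarizes each chunk independently, and combines the partial results. It is a simplified illustration, not a recipe for a 10X speedup: the function names (`chunked`, `summarize_chunk`, `combine`, `parallel_mean`) are hypothetical, the chunk size and worker count are arbitrary, and the real gain depends on how expensive the per-chunk work is compared with the overhead of starting processes and moving data between them.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def summarize_chunk(chunk):
    """Per-chunk work: return (count, total) so results can be merged."""
    return len(chunk), sum(chunk)


def combine(partials):
    """Merge per-chunk (count, total) pairs into an overall mean."""
    count = sum(n for n, _ in partials)
    total = sum(t for _, t in partials)
    return total / count


def parallel_mean(values, chunk_size=1000, workers=4,
                  executor_factory=ProcessPoolExecutor):
    """Compute the mean by mapping chunks across a pool of workers.

    `executor_factory` is injectable so a thread pool can be swapped in
    for testing or for I/O-bound work.
    """
    chunks = list(chunked(values, chunk_size))
    with executor_factory(max_workers=workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    return combine(partials)
```

When using the default process pool in a script, call `parallel_mean` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely. The same split, summarize, and combine pattern is what larger frameworks apply across machines, which makes it a useful way to frame the discussion.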

Scalability and automation are essential topics of discussion if you come from a software background. Software engineers have experience developing production code, working with CI/CD pipelines, and automating processes. Let your skills shine here and explain how they can be valuable to a data science team.


3. Research and Development

This last interview area is broader because it looks at research and development as a whole. In many job interviews with data science teams, you will walk through the projects you have worked on and the analyses you have done.

  • Key Accomplishments – Know what research work you have put on your resume, cover letter, LinkedIn, or application. Only put on your resume what you know you can speak well about. You want to showcase your best work here to discuss it with the interviewers.
  • Collaboration – Understand your part of the research project and explain how you collaborated with others to complete the work. What was the overall outcome of the project? How did processes improve after completion of this work?
  • Fail-Fast – Another aspect of research and development is that not every project leads to a production-ready solution. Learn to recognize approaches that are not working well and be able to fail fast. Explaining this in an interview helps the interviewers understand your project management and critical thinking skills. They want to see that you can operate in this type of environment.

Data science is all about research and development of new processes and methods to drive business value. Understanding how to work on an R&D team to drive valuable actions will help immensely.


Final Thoughts

Data science interviews will vary from company to company, but some common areas to expect questions are data ingestion and cleaning, scalability, and research and development. The team interviewing you wants to know that you can work with various data sources and clean the data effectively for use in analyses. They want to know if and how you can scale their processes and work with big data and automated processes. And lastly, they want to see that you can function in a fast-paced research environment.

What common questions or areas have you discussed during your interviews?


If you would like to read more, check out some of my other articles below!

Top 8 Skills for Every Data Scientist

3 Programming Books Every Data Scientist Should Read

Creating Custom Aggregations to Use with Pandas groupby

Top 3 Articles for Data Visualization

Don’t be too Proud to Ask For Help

Understanding the Analytic Development Lifecycle

