The world’s leading publication for data science, AI, and ML professionals.

Data Structure and Algorithm Questions in Data Science: A Discussion

False positive or false negative?

Alongside data manipulation, analytical insights, and business communication, the ability to translate mathematics and statistics into code to model or explore a problem is one of the hallmarks of a great data scientist.

A well understood facet of data analysis is that more data = better insights, but to write efficient code as dataset sizes increase, we need to understand data structures and Algorithms. Processing gigabytes or even terabytes of data without taking the time-space complexity into account will generally end up with painfully long wait times in the data pipeline, whether it’s to train a machine learning model or just for some simple data cleaning.

This something that can be screened for when looking to hire a data scientist (hello Leetcode and HackerRank), but is it a good indicator of capability or just a false positive/negative?

As always, it depends what you’re looking for.

Photo by Markus Winkler on Unsplash
Photo by Markus Winkler on Unsplash

The good

Where data structure and algorithm questions can really shine is in allowing a candidate to display coding proficiency and problem solving abilities. To solve algorithm problems well the following are needed:

  • Knowledge of the available data structures and operations in a chosen language
  • The problem solving ability to understand and break down a problem on the fly
  • Communication of understanding and thought processes
  • The ability to translate a solution into efficient code
  • All of the above are done real-time under pressure

The bad

Whilst the above are essential analytical skills to have, algorithmic problem solving alone doesn’t define a good data scientist. This can lead to both false positives and false negatives as it generally doesn’t test for what the majority of such roles actually require:

  • Stakeholder management and project (including adhoc questions) prioritisation
  • Orchestration, pipelines, and query optimisation (data engineering skills)
  • Fact, dimension, and denormalised tables (analytics engineering skills)
  • Data visualisation, communication, and strategic insights (data analytics skills)
  • Model prototyping, development, and evaluation (Data Science skills)

The conclusion

Although learning about time-space complexity and taking the time to familiarise myself with Data Structures has definitely widened my data science toolbox and improved my overall coding and problem solving skills, the above are just some examples of skills which aren’t tested for by solely relying on data structure and algorithm questions.

As part of the discussion around this topic, I believe that if used such questions should only be considered as part of a candidate’s overall ability, and ideally should be mixed in with knowledge, project based, and approach type questions. The more old school approach of setting a take-home task reflective of the work the role requires can be a much more useful assessment of technical ability, but in such cases care should be taken to respect a candidates time. I’m a proponent of setting a 3–6 hour time limit to standardise the time taken across candidates whilst giving them the chance to demonstrate their skills.


An aside: Big O notation

Photo by Aron Visuals on Unsplash
Photo by Aron Visuals on Unsplash

For those who haven’t encountered time-space complexity before, here’s a quick aside.

For taking time-space complexity into account, the worst case time-space complexity for an algorithm it described by big O notation, where the space complexity includes the space used by the input (auxiliary space is term for only the extra space used by only the algorithm). Big O notation is in the form of O(complexity) where the complexity is written in terms of the size of the input n.

For example, if a list of names has n entries, the searching for a name from the start of the list to the end has O(n) time complexity because the worst case is that the name is the last value in the list. The space complexity is O(n) because the input is of length n but the auxiliary space used is O(1), which is just the space required to store the target name for comparison with the list entries.

Algorithms which take constant O(1) time regardless of the size of the input are generally data structure operations like accessing an array index, inserting a node into a linked list, or pushing onto a stack.


Related Articles