Do we need a certification of a minimum standard?
I am excited by how many students and professionals are learning programmatic Data Science tools nowadays. More and more, schools and colleges are adding practical components to previously theoretical courses, where people need to get to grips with Python or R and try to work with real-life datasets.
But this causes another problem: everyone is now calling themselves a Data Scientist. No matter what position I am hiring for, that term is on over 80% of the resumes I look at. It has actually made me start to ignore the term because it is not a differentiator of talent any more.
So this raises the question: when does someone actually become a Qualified Data Scientist?
The definition of a qualified Data Scientist
Based on my personal experience, one strong indicator of a Qualified Data Scientist (QDS) is when they can accurately estimate how long it will take them to complete a non-trivial piece of work that contains elements unfamiliar to them. To be able to do this you need to:
- Have the knowledge to work out which steps will be needed to do the work, which of these steps you can already do and which you need to research further
- Have the ability to do the steps you think you can do without unexpected issues
- Have the confidence to learn the parts you don’t know in the time you have estimated
If I were to design a certification exam for a QDS award, it would be open-book, last somewhere between four hours and a full day, and offer a choice of novel problems to solve and chunky datasets to work with. The problems would be set up in steps. Early steps would need strong, fluent knowledge and rapid recall of the core of the chosen language. Later steps would involve packages, modules and methods outside the core of the language, which the examinee would have to research online in order to apply them. Extra credit would be given for efficient and elegant approaches.
The steps to becoming a Qualified Data Scientist
What would be the steps involved in studying to become a QDS? I believe at an absolute minimum it would take a year of full-time study, over two intense semesters, and would have several progressive learning modules with accompanying practical components:
Semester 1:
- Elementary mathematics and statistics: This would ensure that people understand the things they are coding. It doesn’t have to be super complex, but it should cover data types and structures, statistical aggregations and descriptors, measures of error and accuracy in data, discrete mathematics and a few other concepts.
- Language basics: This would teach the basic components of how to manipulate data in the chosen language and relate strongly to the concepts learned in the mathematics and statistics module. Students should be given exercises where they have to perform manipulations and operations in their chosen language and then explain the mathematical meaning of what they just did.
- Operating system management: Students should be given experience of both Unix and Windows environments, using the command line and understanding environment variables, permissions and the numerous other concepts which play a role in how their language interacts with its platform. Setting up core software, IDEs and integrations should be an important part of this module.
- Project work: Substantial project work should follow the first three modules, with the aim of ensuring frequent coding practice, experience in discovering and resolving errors, using version control and producing high-quality output and results. Students should be encouraged to access community resources like Stack Overflow and learn how to appropriately interact with those communities.
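To give a flavor of the kind of Language basics exercise I have in mind, here is a minimal sketch in Python. The function names and the data are made up purely for illustration. The point is that each function should be something the student can explain mathematically, not just run.

```python
import math
import statistics


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation: square root of the summed squared
    deviations from the mean, divided by n - 1 rather than n."""
    m = mean(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


def rmse(actual, predicted):
    """Root mean squared error: a measure of how far predictions
    sit from the observed values, in the units of the data."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared_errors))


# Hypothetical data, invented for this example only.
observed = [2.0, 4.0, 4.0, 5.0, 7.0]
predicted = [2.5, 3.5, 4.0, 5.5, 6.0]

# Checking hand-written code against the standard library is a good habit.
print(mean(observed), statistics.mean(observed))
print(sample_std(observed), statistics.stdev(observed))
print(rmse(observed, predicted))
```

A student who can say why `sample_std` divides by n - 1, or what units `rmse` is expressed in, is connecting the code to the mathematics, which is the whole purpose of the module.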
Semester 2:
- Explanatory and Predictive Modelling: This should teach the theoretical difference between modelling to explain a phenomenon versus modelling to predict a phenomenon, the different design choices that can be made for each and the most typical methods/algorithms used in practice.
- Common algorithms, methods and tools: Students should be exposed to the most common cross-platform algorithms, what they are appropriate for (relating back to the Semester 1 mathematics and statistics module), and how to execute them in their chosen language.
- Code abstraction: Students should learn the importance of being DRY (Don’t Repeat Yourself), become comfortable writing functions and understand the value of abstraction.
- Development: Students should be taught the basic steps and principles of software development and shown the key resources for development in their chosen language. They should be encouraged to participate in language development communities.
- Debugging: Students will need to build the confidence to handle errors. Systematic debugging processes should be taught, with the exploration of typical error messages and tracebacks. Students should be exposed to errors both generated from within the language and from how the language interacts with the operating system.
- More project work: Semester 2 project work should focus on preparing for the QDS exam. It should continue to encourage rapid, confident coding in core language features, while introducing problems that require research into unfamiliar methods, where some code abstraction makes the work more efficient and where some debugging is likely.
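As a rough illustration of how the Code abstraction and Debugging modules fit together, here is a small Python sketch. The helper names and the sample values are hypothetical. The parsing logic is written once and reused for every column, and a bad value produces an error message and traceback that a student can read systematically.

```python
import traceback


def to_float(raw):
    """Convert one raw text field to a float, failing with a clear message."""
    try:
        return float(raw.strip())
    except ValueError as exc:
        # Chain the original error so the traceback shows both causes.
        raise ValueError(f"cannot parse {raw!r} as a number") from exc


def clean_column(raw_values):
    """Apply the same conversion to every value in a column (DRY)."""
    return [to_float(v) for v in raw_values]


# Hypothetical raw data, invented for this example only.
heights = clean_column(["1.70", " 1.82", "1.65 "])

try:
    weights = clean_column(["70.5", "n/a", "82"])
except ValueError:
    # Capture the full traceback as text so it can be read line by line.
    error_report = traceback.format_exc()
    print(error_report)
```

Reading the printed traceback from the bottom up tells the student which value failed and which function call led there, and the fix only ever needs to be made in one place.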
My main reason for writing this is to get some thinking out of my system and down on paper while it’s fresh, but I’d love to get some reactions from readers to this. Would a course/qualification of this kind of structure provide a better basis for hiring Data Scientists? Am I overshooting? What am I missing?
_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com._
