
Three skills you’ll need as a senior data scientist

Machine learning and statistical knowledge goes without saying

Image by Free-Photos from Pixabay

TLDR: This is not about helping you enter the realm of Data Science. Rather, this is about the promotion that comes after. There’s no easy formula for reaching a senior data scientist position. It takes time, passion, experience and persistence. Having a good theoretical foundation of machine learning by itself won’t cut it either. You need to develop habits that allow you to make good, principled decisions under budget constraints.


There’s a soaring demand for data scientists now more than ever. With more and more companies embracing machine learning, that is anything but surprising. The less conspicuous side of this is that the rise in demand is met with fierce competition. The competition goes beyond landing a role; it also extends to moving up the career ladder.

Here, I discuss three things that can make the difference in whether you get the role, or whether you get that promotion.

Before we start: I have spent most of my time in corporate analytics teams (as opposed to tech companies), so most examples I draw are more pertinent to corporate analytics teams than to the tech domain. However, abstracted out, they still matter regardless of which space you’re in.

Also importantly, these opinions are my own and do not represent the views of my workplace.

1. Critical thinking

I have interviewed candidates for both junior and senior roles while working as a data scientist. During my time there, we could not hire a single senior/lead data scientist into our team externally. Because we didn’t want to? No, we darn well needed them. What were they missing? Yes, you guessed it: critical thinking!

What is critical thinking? In the context of data science, I would say critical thinking is:

Answering the "why"s in your data science project

Wait, what?

Before elaborating on what I mean, the most important prerequisite is knowing the general flow of a data science project (even though it’s more chaotic in practice). The diagram below shows that. This is a slightly different view from the cyclic series of steps you might see elsewhere (i.e. "Deployment" is often connected back to the "business problem"). I think this is a more realistic view than seeing it as a cycle (that’s a topic for another day).

Overview of the main steps in a data science project. The solid lines represent the ideal path to move in. However, in practice, it is always a back-and-forth process as indicated by the dashed arrows (there can be many more dashed arrows depending on the complexity of the problem). (Image by Author)

Now, on to elaborating. In a data science project, there are countless decisions you have to make: supervised vs unsupervised learning, selecting raw fields of data, feature engineering techniques, selecting the model, evaluation metrics, etc. Some of these decisions are obvious; for example, if you have a set of features and a label associated with them, you’d go with supervised learning instead of unsupervised learning. On the other hand, there are judgement calls you have to make that are less obvious, such as:

  • Which model would I pick (logistic regression vs xgboost) and why?
  • Which metric would I use (accuracy vs F1-score) and why?
  • With a limited computational budget, what hyperparameters would I tune and why?

There are no universal answers; these decisions are very contextual. For example,

  • If your data set is simple and doesn’t violate the assumptions of the model, then logistic regression will be the better choice due to its widespread use and explainability
  • If you’re working with a problem that has class-imbalance, accuracy might not be a great choice
  • If you’re training an xgboost model on imbalanced data, the hyperparameter [scale_pos_weight](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) is quite important (a small sketch follows below)
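To make the last two points concrete, here is a minimal sketch on a synthetic, imbalanced dataset (the data and numbers are made up purely for illustration, and it assumes scikit-learn and xgboost are installed) of setting `scale_pos_weight` from the class ratio and judging the result with F1-score rather than accuracy:

```python
# Toy illustration: xgboost on an imbalanced binary problem.
# The dataset is synthetic; only the pattern matters.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# scale_pos_weight is commonly set to (negative count / positive count)
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=neg / pos)
model.fit(X_train, y_train)

pred = model.predict(X_test)
# On imbalanced data, accuracy can look flattering while F1 tells the real story
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(f"F1-score: {f1_score(y_test, pred):.3f}")
```

On data this skewed, a model that predicts the majority class for everything already scores around 95% accuracy, which is exactly why F1-score is the more honest metric here.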

I have seen data scientists fall into bad patterns: overwhelmed by the number of things to look at and answer, they start "playing lego" like zombies, just to get something out.

Remember, it doesn’t take much to create a rogue model. Here are a few scenarios.

  • As an example, imagine you’re developing a fraudulent-claim identification model. Even if your model is not great, you can still fake a high success rate by having a very high tolerance (cut-off probability set to a high value) in your model. But this means missing many other potentially fraudulent claims, costing the company (see the sketch after this list).
  • Using an unreliable data field, or one that’s likely to become obsolete, can wreak havoc and leave several data scientists scratching their heads, wasting weeks of their time.
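As a quick, hedged illustration of the first scenario (synthetic data and arbitrary thresholds, not a real claims model), the same classifier can look precise or useless depending on the cut-off you choose to report:

```python
# Toy illustration: one model, two cut-off probabilities.
# A high threshold tends to inflate precision while recall
# (the fraction of fraudulent claims actually caught) collapses.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.5, 0.9):
    pred = (proba >= threshold).astype(int)
    print(
        f"threshold={threshold}: "
        f"precision={precision_score(y_te, pred, zero_division=0):.2f}, "
        f"recall={recall_score(y_te, pred):.2f}"
    )
```

The high-threshold line may show impressive precision, but the drop in recall is the cost the business silently pays.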

A seemingly tiny checkpoint you overlooked might be enough. And it can cost the company money and put your reputation on the line. When you answer not just "what you’re doing" but also "why you’re doing it", you close down most of the cracks where problems like the above can seep in.

Does that mean I have to meticulously scrutinize every decision?

Yes and no. When starting out, it will require lots of effort and persistence. You will force yourself (probably against your will) to question a lot of things (if not everything) you do. But eventually it will develop into a temperament, tapering off the effort required. Additionally, as you gain more experience, you will develop rules of thumb and design patterns that work well for most projects you work on. These will reduce the overhead of having to check every step of the way.

But what if you don’t have time for that? This brings me to the second point in our discussion.

2. Making the right prioritization calls

When you’re working at a company on a project, it is typically governed by a limited timeline and a financial budget (or a time-and-money budget). Oftentimes, the possibilities of a data science project surpass that budget. Who do ya call? Ghostbu… sorry… prioritization of your to-do list. You don’t want to look bad by blowing past deadlines because you over-promised. After all, the company needs to make a profit against the premium data scientist salary they’re paying you.

In data science projects, there are so many unknowns and moving parts that you could stay submerged in them for months.

Working in data science often reminds me of Russian dolls, where one problem opens the door to another (Source: Pixabay)

Therefore, a good data scientist would ask questions like,

  • How much time should I invest in finding the right data for the problem?
  • How much time should I spend on exploratory data analysis (EDA)?

and also trade-offs such as,

  • Is EDA more important than model selection for this project?

For example,

  • If you’re training a computer vision model to assess vehicle damage from photos, you don’t need to invest too much time in EDA. Rather, more effort will be required in selecting a model, as there are so many different computer vision models available with different characteristics.
  • If you’re working on, say, a property price prediction model, you will invest more time understanding available external data such as crime rates, school catchments, etc.

And there’s no universally accepted mantra, tightly held among data science elders, that would be bestowed upon you at a critical point in your life. Different projects simply have different requirements.

Say yes to saying no

With prioritization comes "saying no" to certain things. Many data scientists are terrified of saying no, seeing it as a weakness. On the contrary, it takes guts to say no and keep stakeholder expectations realistic (without rocking the boat, of course). It is even harder to make sure you’re not saying no to anything critical.

What about the stuff I wanted to do but didn’t get to?

When prioritizing, it’s important to track the things you have pushed back or "backlogged". There are two things you need to do with backlog items.

  • Maintain them in an easy-to-access repository. For example, software like JIRA provides a section for maintaining a backlog. When you have time, revisit the backlog and see if you can check off some items.
  • The tasks you de-prioritize or temporarily rule out are likely to introduce "assumptions" into your project. Be upfront about them as soon as you see them. The last thing you need is to reach the end of the project only to find out you made a wrong assumption.

How do you get better at this?

The more projects you do, the better you will get at this. But, most importantly, when things go wrong, learn from them. Don’t just shrug your shoulders and move on. While doing so,

  • Talk to other colleagues in similar spaces, learn how they handle them
  • Read tech blogs of Google/Airbnb/Netflix/Uber and leverage their experience to make better decisions.
  • Also, the knowledge you gain from wearing your critical-thinking-hat would come in handy!

3. MLOps

MLOps is something I’d recommend any data scientist be good at, or at least be aware of.

Why MLOps?

Fueled by the success machine learning has shown in industry, companies are adopting it at a rapid pace. This creates an insatiable hunger in companies to develop many models to reduce pain points, gain efficiency, improve profits, etc. Due to the many bespoke steps involved, such as data ingestion, EDA and model evaluation, it becomes difficult for companies to create standardization across data science projects. The following factors exacerbate the problem:

  • There are too many choices to consider (e.g. EDA steps, libraries, model serving API endpoints, etc.)
  • Best practices / design patterns in machine learning are less well understood

What do we have when we have too much data science and too little standardization? … Chaos

MLOps alleviates this problem significantly.

What is MLOps?

MLOps enables data scientists and machine learning engineers to define machine learning pipelines through which they manage the lifecycle of a data science project. This comes with the added benefit of being able to continuously deploy models. These pipelines are quite flexible and don’t need to start all the way from a data source. They typically provide triggers on which the pipeline can be initiated, e.g. uploading a new version of clean data to an Amazon S3 bucket.

An MLOps framework would typically offer the following facilities (a toy sketch follows the list):

  • Source the raw data and store in a suitable format (e.g. SQL database for structured data)
  • Clean and transform the data to features and store in a feature store
  • Optimize hyperparameters based on the available data using a specific algorithm (e.g. random search, Bayesian optimization)
  • Train a model with optimum hyperparameters found
  • Check the trained model against validation criteria
  • If the model passes, push to a production environment
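
To make these stages concrete, here is a hedged, plain-Python sketch. It is not any particular MLOps product’s API; the function names (`source_data`, `tune_and_train`, `validate`, `run_pipeline`) are made up for illustration, with scikit-learn standing in for whatever stack you actually use:

```python
# Illustrative only: a hand-rolled "pipeline" mirroring the steps above.
# Real MLOps platforms (TFX, SageMaker, Databricks, ...) provide their own
# APIs for defining, triggering and tracking these stages.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def source_data():
    # In practice: pull raw data from a database / S3 and store a snapshot
    X, y = load_breast_cancer(return_X_y=True)
    return train_test_split(X, y, stratify=y, random_state=0)


def tune_and_train(X_train, y_train):
    # Hyperparameter optimization, here a simple random search
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": [100, 200, 400], "max_depth": [None, 5, 10]},
        n_iter=5,
        cv=3,
        random_state=0,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_


def validate(model, X_test, y_test, min_f1=0.9):
    # Check the trained model against validation criteria
    return f1_score(y_test, model.predict(X_test)) >= min_f1


def run_pipeline():
    X_train, X_test, y_train, y_test = source_data()
    model = tune_and_train(X_train, y_train)
    if validate(model, X_test, y_test):
        print("validation passed -> push model to the production environment")
    else:
        print("validation failed -> keep the current production model")


if __name__ == "__main__":
    run_pipeline()
```

A real platform adds what this toy version lacks: triggers, tracking of every dataset, feature and model version, and a controlled path to production.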

Again, these are high-level steps and not in any way exhaustive. For example, there is flexibility to include a manual intervention step before releasing the model. While doing all that, you can also track all the historical data, features and models that were used.

This is a very crude explanation of the MLOps space, and there is much more functionality available. You can read the blogs of the following products if you want to learn more about MLOps.

Popular MLOps platforms

  1. TensorFlow – TFX (https://www.tensorflow.org/tfx)
  2. Databricks – https://databricks.com
  3. Amazon Sagemaker – https://aws.amazon.com/sagemaker/
  4. Some companies develop their own MLOps framework

Why is it important to know about MLOps?

Now that you know what MLOps is, we’ll focus on why you should know about it. In the next few years, I expect a shift in what encompasses a data science role.

With the standardization and modularization of machine learning offered by cloud platforms, data science will become less about customizations and more about plug-and-play and continuous delivery

With cloud platforms offering many standardized components out of the box and removing the need to manage custom development/production environments, data scientists will be freed up to focus on other things. This will enable them to explore areas like MLOps, which makes it possible to produce and track trained models much faster than in the old days.

Conclusion

This is my two cents on what matters for climbing the career ladder once you start your job as a data scientist. These opinions are my own and do not represent the views of my workplace. I’m sure there are other important skills, but these are ones that have helped me grow my career. To reiterate,

  • Question the decisions you’re making in your data science project
  • Prioritize the things to do in your project; you probably won’t have the time or funding to do everything you want to. Clearly define the assumptions you’re making.
  • Learn about MLOps, it’s going to be the future of data science.

Also remember, this is not an exhaustive list of skills. The more skills you garner, the better you’ll be. A few other noteworthy skills that have come in handy are,

  • Communicating complex data science concepts in simpler ways to stakeholders
  • Developing a mindset to iteratively deliver a product, rather than following a waterfall method
  • Good understanding of ML system design

If you enjoy the stories I share about data science and machine learning, consider becoming a member!


