
Overcoming Some of the Worst Parts of Being a Data Scientist

This field isn't perfect so it's best to set expectations properly

Opinion

Image by author

Working in the tech industry, I often see a lot of glamorizing of tech roles, primarily Data Science, Machine Learning, and AI. I love when people get excited about the field and a broader audience joins in, but it’s crucial to manage your expectations properly if you hope to sustain that excitement.

A lot of people starting out in this field don’t have a mentor to convey the truths of what you’ll find in the day-to-day of this role. I hope this story starts that conversation so you can see some of the "worst" parts of the job. I put that in quotes because some people thrive in this kind of environment and love it, while other personalities tend to absolutely hate it.

Although the field becomes immensely stronger with a diverse crowd in it, not everyone will find their calling in Data Science, or at least its current state, and it’s perfectly okay to say it’s not for you. I’ll highlight some of the prominent issues I see today, but I guarantee there are more that I’m not mentioning here or haven’t personally experienced. If nothing else, this story should give you a gauge of what to expect and how I recommend getting past these issues.

The issues I discuss here are likely ones that weren’t there 5 years ago and won’t be there in 5 years given how quickly this field evolves. For now, here are the 5 I discuss:

Issue #1: Ambiguity of Role(s)

Issue #2: As Much SQL As Possible

Issue #3: Quickly Communicating Complex Concepts

Issue #4: Abiding by the Cost-Effort-Value Matrix

Issue #5: Seemingly Limitless Uncertainty

Issue #1: Ambiguity of Role(s)

This is getting better by the day, and it can depend heavily on the company you’re at (and its data culture), but it is still a problem the industry is figuring out. Here’s the current, rough consensus on what each role is primarily, but not solely, responsible for:

Data Engineer: Data Ingestion, Data Validation

Data Analyst: Data Validation, Data Preparation, and Data Exploration

Data Scientist: Data Preparation, Data Exploration, Data Modeling (Creation and Validation), Model Evaluation

Machine Learning Engineer: Model Evaluation, Model Deployment, and Model Monitoring

Full Stack Data Scientist/MLE: Everything Data

I’ll do you a favor and not create [yet another] role Venn diagram of the above. There’s endless content on the difference between roles and the above is a gross generalization.

This differentiation of roles has, in my opinion, the potential to become a large problem, as the handoff between these roles can often be disastrous. But even if everything goes smoothly, not all of these roles exist at every company. Startups often have data professionals doing at least a little of everything, large organizations often have at least one of each role, and mid-size organizations have all the flavors in between.

Some organizations think Data Science is a fad, so they hire Data Engineers and then ask them to learn enough ML to do the portions past their function (or they hire Machine Learning Engineers and then ask them to learn enough DE to do the portions before their function). Other organizations think they want Data Scientists but need Data Analysts [first]. Still others have all of the roles, but the differentiation of work isn’t clear enough, making communication between teams really difficult.

A lot of this comes down to how the company views data and what its data culture looks like, which can be heavily influenced by the perspectives of the Leads, Management, and Executive team. This ambiguity is what causes the confusion across job postings and the misalignment between the organization’s needs and what you may want to do.

Recommendation

If you’re applying for a role, I’d recommend having questions ready to interview the data team you’re trying to join. Analyzing their culture is a key part of the interview so you know what you’d be joining and how they value data functions. Asking what an average day in the role looks like, and what excellent would look like, can give you a great idea of the job beyond the job description.

Furthermore, pay very close attention to how they conduct the technical interview. Everything from what they ask of you (Leetcode questions, take-home assessment, pair programming, etc.) to how they evaluate your work is a gold mine of information on how data is perceived at the organization.

Personally, I think quizzing on arbitrary Leetcode algorithms is a waste of time, and it tells me enough about the data culture that I usually turn down those interviews. It has nothing to do with my ability to complete the task; it has everything to do with the reality of the job looking nothing like that, and with their inability to see data work as different from software engineering. Take-home tests can show a great understanding that data work is heavily context-dependent (while still assessing technical ability) and that success should be measured by an ability to handle uncertainty. These can also be done awfully, with poor evaluation of the submissions, which likewise prompts me to turn down companies pretty quickly.

How I evaluate data interviews deserves a story of its own, but that’s a brief guide to how I personally approach them, and I recommend you create a strategy that reflects the culture you want to be a part of.

If you’re already at a company, my recommendation is to adapt to the definitions your company uses. You’d be much better served as a Data Engineer with ML skills in an organization that strongly values that function than by obsessing over having the Data Scientist title. We’re all doing our part to row the boat, and anyone who tries to impose a hierarchy of importance between arbitrary titles is playing a losing game. Create value from data to guide effective decision-making, regardless of your title.

I also want to separate this conversation from pay, as I believe you should be fairly compensated for your efforts and skills regardless of title. If you’re a Data Analyst but performing the duties of both a Data Engineer and a Data Analyst, the pay should be appropriate. I’m advocating for being adaptable in how you frame yourself so you can be more successful and happier at your company.

Issue #2: As Much SQL As Possible

A Data Science career involves a lot of programming. Even with so many new tools popping up, your ability is measured more on being able to solve technical data problems regardless of the scenario you’re in: with the cloud or not, with data viz tools or not, with Python or not, etc. There are standards the majority of the industry has adopted, but they can still vary quite a bit given the wide range of tools and technologies available. So far, every high-performing data organization I’ve seen has been very strong with programming and also able to learn tools as they need them.

One pattern I’ve seen in every single data infrastructure is the use of SQL, and lots of it. SQL is the cat with 9 lives; it keeps coming back again and again in various shapes and forms no matter where you look. SQL education gets underplayed a lot, but it is a significant part of the job. Being able to write optimized SQL queries and front-load your dataset curation is very much in your favor when dealing with large-scale projects.

The difficulty of the SQL you write can vary from project to project, but it’s important to expect lots of SQL use and even opt for it whenever you can: your project’s workflows will be faster and easier to debug. The issue isn’t so much that doing as much SQL as possible is a "problem" but that it is not discussed nearly enough, leaving aspiring Data Scientists unprepared. Learning SQL is not hard, but practicing SQL can be. There are some really solid free sites out there to practice on, but they’re all simulated, contrived environments with small, easy tables. Organizations can have a really messy data landscape where people may not even know where the data you’re looking for lives or how it connects to other data.
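To make that front-loading concrete, here’s a minimal sketch using Python’s built-in sqlite3 and a hypothetical orders table (the database, table, and column names are placeholders, not anything from this article): push the filtering and aggregation into the query rather than pulling every row into pandas first.

```python
import sqlite3

import pandas as pd

# Hypothetical database and table names, for illustration only.
conn = sqlite3.connect("warehouse.db")

# Anti-pattern: pull every row into pandas, then filter and aggregate there.
# all_orders = pd.read_sql("SELECT * FROM orders", conn)
# monthly = (
#     all_orders[all_orders["status"] == "complete"]
#     .groupby("order_month")["revenue"]
#     .sum()
# )

# Front-loaded: the database filters and aggregates, so only the small
# result set ever reaches your Python process.
query = """
SELECT order_month,
       SUM(revenue) AS total_revenue
FROM orders
WHERE status = 'complete'
GROUP BY order_month
ORDER BY order_month;
"""
monthly = pd.read_sql(query, conn)
conn.close()
```

The database does the heavy lifting, and your notebook only ever sees the aggregated result, which is exactly what makes large-scale workflows faster and easier to debug.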

Recommendation

Try to incorporate SQL work into your personal projects. The need for personal projects in this space is undeniable, but a lot of them typically start with a pd.read_csv() on a small csv file from Kaggle. This is fine, but it won’t help you properly practice and train for the actual job. Try to simulate a real-world scenario by loading the data into a small, open-source database and then querying it to get what you need, as sketched below. The amount you’ll learn in this exercise is incredible, but if that’s overkill you can even use pysql to practice in a better setting than the free sites.
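As a rough sketch of that setup, assuming a Kaggle-style CSV of taxi trips (the file, table, and column names here are placeholders for whatever dataset you pick), you could load it into SQLite and practice querying it like this:

```python
import sqlite3

import pandas as pd

# Placeholder file, table, and column names; swap in your own Kaggle dataset.
df = pd.read_csv("trips.csv")

# Load the CSV into a small, file-based SQLite database.
conn = sqlite3.connect("practice.db")
df.to_sql("trips", conn, if_exists="replace", index=False)

# Now practice answering questions in SQL instead of pandas.
query = """
SELECT pickup_zone,
       COUNT(*)  AS num_trips,
       AVG(fare) AS avg_fare
FROM trips
GROUP BY pickup_zone
HAVING COUNT(*) >= 100
ORDER BY avg_fare DESC
LIMIT 10;
"""
top_zones = pd.read_sql(query, conn)
print(top_zones)
conn.close()
```

It’s the same data you would have read straight into pandas, but forcing yourself to answer questions through queries builds the habit the job actually requires.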

If nothing else, I highly recommend going to the SQL training sites and attempting to solve challenges there. Getting used to the language and thinking through how data can be retrieved, combined, and manipulated via SQL is immensely useful for the job.

My favorite way to learn SQL and sharpen my skills is SQL Murder Mystery.

Issue #3: Quickly Communicating Complex Concepts

Although it is generally known that the job involves communication skills, there is a lot of depth to this skill that doesn’t get discussed. A [very] common scenario: you’ve built a complex statistical model or machine learning system and, regardless of whether it performs well or not, you have < 30 minutes to explain the results to a senior executive. In that time, you need to provide just enough technical detail without inundating them with so much information that the point gets lost, while also providing recommendations for what can be done next and wrapping it all in a business context that matters to them. This is hard.

There’s no great way around this issue; not everyone can put in the time and effort needed to learn the intricacies of the data science world, so you have to do the extra work to meet them where they’re at. This is not the same as obfuscating important points or "dumbing down" the content; it’s carefully providing enough information, concisely and clearly, in a way that your audience understands in a timely fashion.

I’ve seen this done best when you have a good team around you. Having one person discuss the technical details while another zooms out and provides a clear picture of the forest is usually the best scenario. I’ve seen many cases where highly technical people who believe they’re very strong communicators end up rambling and losing the audience they’re communicating to. Tread carefully with letting these people lead the conversation; too much information, where one person hijacks the discussion, is as ineffective as too little information, where nothing is said.

Recommendation

Be honest with yourself about your communication strengths and weaknesses. It helps no one to assume you’re an expert because you know the technical details. Ask for brutally honest feedback from your peers on how discussions go with you involved and where you can improve. Once you can properly identify your weaknesses, look at how you can work with your team to cover the areas where you’re not as effective but they may be. Having a collaborative discussion as a team is usually really successful in resonating with an audience because one person can focus on just the right technical details while another focuses on the broader picture those details relate to.

Taking courses on this topic is fine, but the field needs more people who are willing to admit weaknesses here and learn on the job. It’s really best learned with practice and each person on the team should be willing to improve. It’s challenging to openly admit mistakes or call people out on their weaknesses but a data culture that empowers people to grow into the best versions of themselves is the one that will last over time.

Issue #4: Abiding by the Cost-Effort-Value Matrix

When you start a Data Science or Machine Learning project, there’s usually a PM on board to guide what needs to get done and when. This can go well or not, but it works best if the PM has training in managing advanced data initiatives. If not, this work still gets communicated by the PM, but it ends up being done by the senior data professionals.

In either case, it’s incredibly useful to understand how to manage your project effectively. Knowing how to prioritize the right next steps is a deeply crucial skill to learn. It requires a careful, ongoing balance between cost, effort, and value, one that should evolve with the project as new information is discovered.

Although this can be done at a high level for a project, it becomes challenging when you have to do it at the level of individual feature implementations for a data product. That requires a deep understanding of what a given iteration would involve and how difficult it would be to implement. Data products are inherently very iterative, and estimating what an iteration takes and what value it yields is done best by those who can build it.
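As a toy illustration, not a real prioritization framework, here’s a small sketch of scoring hypothetical backlog items by estimated cost, effort, and value (the items, numbers, and the scoring heuristic are all made up for the example):

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    cost: float    # rough relative cost (infra, licensing, etc.)
    effort: float  # rough effort estimate, e.g., person-weeks
    value: float   # expected business value, relative units


# Hypothetical backlog for a data product; all numbers are illustrative.
backlog = [
    BacklogItem("Add model monitoring dashboard", cost=1, effort=3, value=5),
    BacklogItem("Retrain with new feature set", cost=2, effort=5, value=6),
    BacklogItem("Migrate pipeline to new warehouse", cost=8, effort=13, value=4),
]


def priority(item: BacklogItem) -> float:
    # One simple heuristic: value relative to combined cost and effort.
    return item.value / (item.cost + item.effort)


for item in sorted(backlog, key=priority, reverse=True):
    print(f"{item.name}: priority = {priority(item):.2f}")
```

In practice these estimates live in your head or a planning doc rather than in code, but writing them down forces you to make the trade-offs explicit and revisit them as the project evolves.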

Recommendation

Be more than just a builder. Your knowledge of the technical details means you will inevitably need to understand data product management as well. That requires developing an intuition for the cost of your work in terms of time and effort, as well as a keen eye for what value your efforts will yield.

This work isn’t easy, as it requires an understanding of the business and a strong handle on the context your project operates within. It’s easy to quickly dismiss this work as outside your "scope of responsibilities," but I’d argue your project can’t be done right if you don’t understand the business it is meant to serve.

The best way to develop this is by putting your reps in. The more experience you get with real-world projects, the better your intuition becomes about difficulty. The more diverse the tools and concepts you immerse yourself in, the better your estimates become about cost. And the more connected you become to the business, the better your alignment with high-value work. How you prioritize work based on these three factors can vary by team, department, and company, but being able to read your scenario and navigate it will immediately differentiate you from others.

Almost every experienced Data Scientist I’ve met has this talent. It’s something you develop through years of building, and it serves them well as they work on projects. I highly recommend that juniors starting out lean into this work (as opposed to over-indexing on the math behind algorithms), as it will significantly differentiate them as they progress in their careers.

Issue #5: Seemingly Limitless Uncertainty

I swear this field changes nearly every 6 months. The cycle time for a new tool, new algorithm, new best practice, new "Auto-X", etc. is faster than any one person can keep up with. This comes with a few sub-issues:

  1. Mental fatigue. It’s okay to say you’re in the deep end and can’t handle it. The work in data seems to never end mainly because you’re the expert in not only finding better answers but usually also guiding better questions. This can be a mentally overwhelming job for many, especially under tight deadlines.
  2. Unlearning/Relearning quickly. The evolution and obsolescence of tools/libraries/concepts require us to adapt to the change in tide or eventually be left on shore. For some, the thrill of always being a learner is exciting. For others, it’s absolutely exhausting. From keeping up-to-date with the latest information to understanding how to adapt your workflow (and whether to) it can easily be more than a full-time job.
  3. People problems. Data by itself doesn’t have any value. We ascribe its value through narrative, and the narrative depends on the people and the context surrounding it. This can lead to some really odd problems. I’ve had scenarios where my best-performing model was not the one used because the experts in the industry (the customers of the model) had never seen model results that high before. Similarly, model explainability is often desired more than model performance. Although I believe this is happening less over time as we get better at explaining complex models, it still reflects how this is largely a human craft that inevitably leads to human problems.

Recommendation

  1. Take your days off and make time to truly disconnect. It’s incredible how many times I’ve been frustratingly stuck on a problem, banging my head on the keyboard, only for walking away and doing something completely unrelated to help me figure it out. This work is not a sprint; it’s a marathon that feels like it never ends. You can always conduct deeper analyses, create stronger models, and engineer better systems. Take time to rest so that when you’re on, you’re delivering truly high-quality work.
  2. Honestly, there’s no way around this issue other than knowing that this is what the job is and enjoying it. If you are someone (or can become someone) who thrives in bettering themselves and loves the journey more than the destination, this craft can really feel like a new adventure each day. If that kind of idea exhausts you, you’ll likely burn out sooner or later (and maybe more than once). Keep in mind, burning out and exhaustion can happen even if you’re not going above and beyond. Just being immersed in an environment where unlearning/relearning that often is par for the course can be hard since you feel like you have to keep up or you get left behind.
  3. Make friends with people. Data Scientists are not thought of as a traditional sales role, but it’s incredible how often we have to sell things (our products, results, solutions, ideas, etc.). The easiest way I’ve found to get things sold and to navigate the odd people problems is to make friends with your customers, coworkers, partners, etc. So many of my wins came 10x easier just because I tried to make a friend first.

Conclusion

This field is great for those who deeply enjoy constantly proving their worth and being lifelong learners. That said, it does come with some issues that newcomers should be aware of if they hope to find their calling in this space. I covered 5 common issues that I read about online and hear from friends and coworkers, but this isn’t an exhaustive list. I’d love to hear what other issues people run into and what approaches they recommend for resolving them. The best recommendation I can give that covers most issues that have come and gone in this field is to join and actively participate in a community. It’s amazing what sharing stories and connecting with people can do for a myriad of diverse and complex problems.

Thanks for reading!


Become a Medium Member with my Referral Link

Medium is where I do a large portion of my daily reading, and if you’re in the data space, this platform is a gold mine. If you wish to subscribe, here’s my referral link to sign up. Full disclosure: if you use this link to subscribe to Medium, a portion of your subscription fee will go directly to me. I’d love to have you as part of our community.


