In data science, as in many other fields, you learn more by doing than by reading books or studying the technical aspects of the field. When you start your data science learning journey, you will spend a lot of time and effort learning many concepts, skills, and terms. You will learn coding, maths and statistics, algorithms, visualization, and business basics.
And although all these concepts and topics are extremely important, knowing the theoretical side of a field doesn’t mean you will succeed in it or implement projects flawlessly. As beginners, we tend to make simple, avoidable mistakes, either because we lack experience or because we were never taught to avoid them.
But once we start building more and more projects, working on different themes, with different teams, and on different datasets, we develop an intuition for how to approach a problem, plan specific steps to reach the solution, and solve whatever issues arise along the way. So, although you will find your own way to avoid mistakes by building projects, you can also gain this knowledge by talking to data scientists who are further ahead in their careers.
I have been where you are, and I have talked with many data scientists about their learning journeys and what they wish they had known earlier in their careers that would have helped them progress faster. As I have often heard, you learn better by doing; when you experience something, it sticks in your mind better than when you only hear about it. That being said, reading and gathering information is never a bad thing.
In this article, we will walk through 9 common mistakes, often made by newbies and sometimes by experts, intentionally or unintentionally, that lead to false results or cause a project to take much longer to finish. You can find these mistakes and more in many blog posts, such as those from SmartBoost, JigSaw, and CIO, and in other online resources.
№1: Not having a plan
Let’s start with the most common mistake, one that even professional data scientists make: going ahead with a project without a "plan of attack." Often, when we are given a data science problem, we need to answer why the data behaves the way it does, and to answer that question, we need to be clear on what to do. That means having a plan and a clear idea of the steps we need to take.
№2: Choosing the wrong visualizations
If there’s one thing I repeat a lot, it is this: choose your visualizations wisely. Visualizations are important in all stages of a project. For example, they are critical in data exploration, where they can make you either spot or miss patterns and trends. So, make sure that you know the different visualization tools available, what graphs and charts you can use, and which ones will best describe your data and help you understand it better.
№3: Not considering bias in the data
In the data science field, there’s a famous saying that goes, "your results are only as good as your data." Unfortunately, we often don’t have a say in how or where the data is collected. That’s why, when we set up the steps to solve a problem using a set of data, we need to consider that the data may be biased or may not be a good representation of the entire population. Doing so helps us avoid making wrong decisions and ending up with skewed models.
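One simple way to look for this kind of bias is to compare how often each group appears in your sample with how often it appears in the population. The snippet below is only a minimal sketch of that idea: the function name representation_gaps, the survey sample, and the population shares are all made up for illustration, and in a real project the reference shares would have to come from a trusted source such as a census.

```python
from collections import Counter


def representation_gaps(sample_labels, population_shares):
    """Return (sample share - expected population share) for each group."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }


# Hypothetical survey sample: 80 urban respondents and 20 rural ones.
sample = ["urban"] * 80 + ["rural"] * 20
# Assumed population shares, invented for this example.
population = {"urban": 0.55, "rural": 0.45}

gaps = representation_gaps(sample, population)
for group, gap in gaps.items():
    direction = "over" if gap > 0 else "under"
    print(f"{group}: {direction}-represented by {abs(gap):.0%}")
```

A large gap doesn’t prove that your results are wrong, but it is a signal that you may need to reweight the data, collect more of it, or at least be careful about how far you generalize.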
№4: Not optimizing your model for the data you have
To get better results, your model has to be optimized for the data you have, and it needs to follow changes in that data over time. In machine learning, this means tuning your hyperparameters to reach peak performance. Optimizing your model is not a one-time step; often, every time your data changes, you will need to go back and adjust your hyperparameters to fit that change.
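To make the idea of hyperparameter tuning concrete, here is a small sketch of a grid search written with only the Python standard library. It assumes a toy one-dimensional dataset and a hand-written k-nearest-neighbours classifier, neither of which comes from a real project; the helper names (knn_predict, accuracy, make_data) are hypothetical. In practice, you would more likely use the tuning utilities of your machine learning library.

```python
import random


def knn_predict(train, x, k):
    """Predict a 0/1 label for x by majority vote of the k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, validation, k):
    """Share of validation points that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


def make_data(n, rng):
    """Toy data: the label is 1 when x > 0.5, with 15% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.15:
            label = 1 - label
        data.append((x, label))
    return data


rng = random.Random(0)
train, validation = make_data(200, rng), make_data(100, rng)

# Grid search: score every candidate value of k on held-out data, keep the best.
candidate_ks = [1, 3, 5, 9, 15, 25]  # odd values so a vote can never tie
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

The important part is the last few lines: the value of k is chosen by its score on data the model has not seen, and the same search can be rerun whenever the data changes.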
№5: Focusing more on accuracy than performance
This is a mistake we have all made at some point in our careers. Accuracy is important, but it is not the only sign of a good model. The accuracy of your solution depends on the algorithm you chose, the data you’re working with, and the parameters you set. Changing any of these will affect the accuracy of your results. So, focus first on correctly interpreting your data, and good accuracy will follow.
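A quick way to see why accuracy alone can mislead is to look at an imbalanced dataset. The sketch below uses made-up fraud labels and a "model" that always predicts the majority class; both are assumptions chosen only to illustrate the point.

```python
# Hypothetical labels: 95 legitimate (0) and 5 fraudulent (1) transactions.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: of the truly fraudulent cases, how many did the model catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# accuracy = 0.95, recall = 0.00
```

The model looks excellent on accuracy and yet catches none of the cases we actually care about, which is why it helps to check other measures alongside accuracy.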
№6: Ignoring that correlation doesn’t equal causation
Correlation and causation are two very different things, but we sometimes tend to conflate them, not just in data science projects but also in our personal lives. Correlation is a statistical measure of how strongly two variables or factors are related. But just because a relationship exists doesn’t mean one variable causes the other. So, test the data before jumping to conclusions.
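One common reason for a correlation without causation is a third variable that drives both of the others. The following sketch simulates that situation. Everything in it is invented for illustration: the variable names, the slopes 2.0 and 0.5, and the noise levels. Because the data is simulated, we know the effect of the hidden variable exactly and can subtract it; with real data you would have to estimate it.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
# Hypothetical hidden driver: daily temperature influences both series below.
temperature = [rng.uniform(10, 35) for _ in range(500)]
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn_cases = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream_sales, sunburn_cases)

# Subtract the temperature effect from each series, then correlate what is left.
sales_residual = [s - 2.0 * t for s, t in zip(ice_cream_sales, temperature)]
sunburn_residual = [c - 0.5 * t for c, t in zip(sunburn_cases, temperature)]
adjusted = pearson(sales_residual, sunburn_residual)

print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {adjusted:.2f}")
```

The two series are strongly correlated, yet neither causes the other; once the shared driver is accounted for, the relationship mostly disappears.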
№7: Reusing implementations
Here is another common mistake: when we spend a lot of time working on a project, developing a methodology, and optimizing a model, we may assume that this model can be applied to similar problems with no alterations needed. Unfortunately, this is rarely the case. Each problem has its own variables and needs a custom-made solution. So, avoid reusing implementations unchanged for different problems.
№8: Not picking the correct tools
This is an easy mistake to make, even for the most experienced of us. Today, there is what seems like an infinite number of tools that can help you with the different stages of implementing a data science project. But because there are so many, we may choose the wrong tool or end up using too many tools. So, taking some time in the planning stage to choose the best tools for the project will save you a lot of time and effort in the long run.
№9: Forgetting the business side of the problem
Data science is an interdisciplinary field; it covers a wide range of applications and scenarios. All of these applications have a business side, which should never be ignored. Because the business side is where the data originates and where the results will be applied, always take a moment to examine how and why the data was collected and how the insights you find will be used on future data. Remember, wrong decisions in data science can cost millions.
Final Thoughts
When I first started my data science learning journey, I took months to get a grasp of the basics of the field: revising my maths and statistics, learning how to visualize data efficiently and communicate information in the best way, and picking up some fundamental business knowledge to support my model choices.
But I would say that, although I learned a lot from tutorials, online courses, and books, I gained the most knowledge during my first year of actually building data science projects, working with other data scientists, and exploring the different applications of data science. Through these interactions, I learned to avoid many mistakes and built a more efficient workflow.
When you start designing and implementing projects, I am certain that you will recognize the 9 mistakes we have been through in this article. You may even smile when you remember making one of them in the earlier stages of your data science career. My only hope is that, after reading this article, beginner data scientists will know what to avoid when they start their careers and will be able to build better, more professional projects.