I started learning data science about 3 years ago, and it took me about 2 years to land my first job as a data scientist. I have been writing on Medium since the very first days of my data science journey.
I write articles about my learning experience, interviews, my life as a data scientist, and so on. Most importantly, I write about what I have learned.
What I have observed in the last 3 years is that learning data science is a continuous process that can be divided into 2 periods. The splitting point is when you find your first job.
The learning continues but when you move from one period to the other, it changes dramatically. I do not mean a change in the amount but a change in the way you approach problems, the skills you obtain, and the way you learn.
In the first period, you make doughnuts at home and sell them on the streets. You only accept cash and do not have to worry about taxes. In the second period, you have a doughnut store, you have employees, you accept any form of payment, and you pay taxes. In a sense, you go from amateur to professional.
When I started to work as a data scientist, I felt like I put aside my small doughnut stand and went into a doughnut store. I would still make doughnuts but in a different way.
I hear you saying enough with doughnuts and I agree 🙂 In this article, I will share 5 lessons that I learned the hard way as a data scientist.
The only real mistake is the one from which we learn nothing. – Henry Ford
Check results at least 3 times
There is no magic about the number 3. I just want to emphasize the importance of checking the results thoroughly. In my experience, checking them at least 3 times significantly reduces the risk of missing a mistake.
Whatever algorithm you implement or analysis you make, the results are used in downstream processes or in production. Thus, it is vital to make sure the results are correct.
By correct results, I do not mean predictions without any errors or 100% accuracy, which is neither reasonable nor realistic. In fact, you should be really suspicious of results that are too good to be true.
The mistakes I mention are usually data-related issues. For instance, you might make a mistake while joining the stock information of products from an SQL table to your main table. That results in serious problems if your solution is based on product stocks.
There are almost always controls in your code that prevent making mistakes. However, it is not possible for us to think of each and every possible mistake. Thus, taking a second look is always beneficial.
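A check for the stock join described above could look like the following. This is a minimal sketch, assuming pandas; the tables products and stocks and their columns are made up for illustration. The indicator option of merge adds a _merge column that shows which rows found a match, so unmatched products are easy to count.

```python
import pandas as pd

# Hypothetical data: a main product table and a stock table pulled from SQL.
products = pd.DataFrame({"product_id": [1, 2, 3], "price": [9.5, 4.0, 7.25]})
stocks = pd.DataFrame({"product_id": [1, 2], "stock": [120, 0]})

# indicator=True adds a "_merge" column that records where each row came from.
merged = products.merge(stocks, on="product_id", how="left", indicator=True)

# Rows marked "left_only" are products without any stock information.
unmatched = merged[merged["_merge"] == "left_only"]
n_unmatched = len(unmatched)

print(f"{n_unmatched} of {len(merged)} products have no stock information")
```

Whether a missing stock value should stop the pipeline or only be logged depends on the use case. The point is to look at the number before the result moves on.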
Duplicates, duplicates, duplicates!
Relational databases are quite common in the data science ecosystem. The data are stored in many relational tables. In some cases, we extract data from several tables to get what we need.
This process requires joining several tables, which might create duplicate data points (or rows) in the final table. It is good practice to place duplicate checks at several points in your code. I made this mistake a few times and learned my lesson.
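Here is a small sketch of how a join can silently create duplicates and how to catch it. It assumes pandas, and the orders and categories tables are hypothetical. Comparing row counts before and after the join and calling duplicated on the key column are two cheap checks. The validate option of merge raises an error if the lookup table has repeated keys.

```python
import pandas as pd

# Hypothetical tables: product_id 2 appears twice in the lookup table.
orders = pd.DataFrame({"order_id": [101, 102, 103], "product_id": [1, 2, 2]})
categories = pd.DataFrame(
    {"product_id": [1, 2, 2], "category": ["a", "b", "b"]}
)

# The join runs fine but repeats every order that matches product_id 2.
joined = orders.merge(categories, on="product_id", how="left")

rows_before = len(orders)
rows_after = len(joined)
n_duplicates = int(joined.duplicated(subset="order_id").sum())
print(f"rows before: {rows_before}, after: {rows_after}, "
      f"duplicated orders: {n_duplicates}")

# Remove the repeated lookup rows, then let pandas enforce key uniqueness:
# validate="many_to_one" raises MergeError if the right-hand keys repeat.
clean_categories = categories.drop_duplicates()
joined_clean = orders.merge(
    clean_categories, on="product_id", how="left", validate="many_to_one"
)
assert len(joined_clean) == len(orders)
```

Dropping duplicates is only safe here because the repeated rows are identical. If they differed, you would first need to decide which one is correct.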
Modeling is the last step
You must have read from various resources that data preparation is the most time-consuming part of a data science project. Yes, it takes lots of time to collect, clean, and preprocess the raw data before feeding it to a model.
There is another step between data preparation and modeling: exploratory data analysis. It can be defined as the process of learning the underlying structure and the relationships within the data. We can use a variety of tools and techniques to perform exploratory data analysis such as statistical measures, distribution plots, and other data visualizations.
As an enthusiastic and inexperienced data scientist, I was too eager to start modeling. Thus, I failed to do the exploratory data analysis thoroughly.
Before the modeling phase, it is crucial to learn what the data tells us. It affects everything from selecting features to deciding which algorithm to use.
Know the data like the back of your hand before proceeding to the modeling phase.
The glorious world of machine learning algorithms is very attractive. The urge to use a fancy algorithm and build a model to make some predictions might cause you to skip digging into the data.
There is, of course, nothing wrong with using machine learning algorithms. However, we should all know that they are only as good as the data we give them. We need to learn the data and shape the model accordingly. It would be too optimistic to expect a model to give us everything.
Data should not be mysterious because we have lots of tools to explore it. However, in some cases, we may fail to dig enough to learn from the data. There are certain things that we immediately look for in a dataset such as size, shape, number of missing values, distributions and correlations. They are important but not enough. A decent exploratory data analysis process is only achieved by taking the standard techniques one step further.
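The standard checks listed above take only a few lines. The sketch below assumes pandas and a made-up sales table. The last part is just one possible way of going a step further, namely repeating the same summaries per group instead of over the whole table. It is an illustration, not a complete recipe.

```python
import pandas as pd

# Hypothetical dataset used only for illustration.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B", "C"],
    "price": [2.5, 3.0, 2.0, None, 2.2, 4.0],
    "units": [100, 80, 150, None, 140, 30],
})

# Standard first look: shape, missing values, distributions, correlations.
shape = df.shape                        # (rows, columns)
missing = df.isna().sum()               # missing values per column
summary = df.describe()                 # count, mean, std, quartiles
corr = df[["price", "units"]].corr()    # pairwise Pearson correlation

# One step further (an assumption about what "further" can mean):
# look at the same things per group rather than over the whole table.
by_store = df.groupby("store")[["price", "units"]].agg(["mean", "count"])
missing_by_store = df["price"].isna().groupby(df["store"]).sum()

print(shape)
print(missing)
print(by_store)
print(missing_by_store)
```

In this toy table, the overall missing count looks harmless, while the per-store view shows that all the missing prices come from a single store.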
Your code running without any error does not mean it is correct
Data scientists write code, but they are usually not as deeply involved in it as software developers. Most data scientists do not have a background in programming or software development.
It is a challenging task for us to write clean, efficient, and maintainable scripts. When I first started writing code, I thought it was OK as long as the code ran without any error. However, that is not always enough. Some unexpected problems may arise without raising any error.
One way to detect such problems is checking the results. A few times, I ended up having a data frame consisting of mostly null values although the code did not raise any error.
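One way such a null-filled data frame can happen is a lookup whose keys do not match. The sketch below assumes pandas; the data and the helper check_null_share are hypothetical. The map call runs without complaint but returns a missing value for every key it cannot find, and a simple check on the share of nulls makes the problem visible.

```python
import pandas as pd

# Hypothetical data: country codes with inconsistent case and whitespace.
df = pd.DataFrame({"country": ["us", "de", "fr", "US "]})
region_by_country = {"US": "Americas", "DE": "Europe", "FR": "Europe"}

# Runs without any error, but no key matches, so every value is NaN.
df["region"] = df["country"].map(region_by_country)
null_share = df["region"].isna().mean()
print(f"share of nulls after mapping: {null_share:.0%}")


def check_null_share(frame, column, max_share=0.1):
    """Raise if more than max_share of the values in column are missing."""
    share = frame[column].isna().mean()
    if share > max_share:
        raise ValueError(
            f"{column}: {share:.0%} nulls exceeds the limit of {max_share:.0%}"
        )
    return share


# Normalizing the keys before the lookup fixes the mapping.
df["region"] = df["country"].str.strip().str.upper().map(region_by_country)
fixed_share = check_null_share(df, "region")
```

The 10% threshold is arbitrary. What matters is that the check fails loudly instead of passing a mostly empty column downstream.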
Conclusion
We all learn from our mistakes, which is a good thing. What is even better is to learn from other people’s mistakes. In this sense, the lessons I shared in this article might serve as a resource to learn from.
I hope you do not make the same mistakes. If you do, try to come up with best practices that will prevent you from making similar mistakes again.
Thank you for reading. Please let me know if you have any feedback.