
3 Lessons I Have Learned After I Started Working as a Data Scientist

What I would have done differently

Photo by engin akyurt on Unsplash

I started my data science journey by completing the IBM Data Science Professional Certificate on Coursera. It took me almost two years to land a data scientist job.

After I started working as a data scientist, it did not take me long to find out what I did right and wrong during my learning journey. The things that I overlooked have become crystal clear.

I’m not sure whether it is because of the relief of finally landing a job or because of working in a production environment with real-life data. Either way, I can assure you I would do some things differently if I had to start over.

In this article, I will write about the 3 lessons I have learned after I started working as a data scientist. Some of you may be aware of these lessons but I’m sure there are some aspiring data scientists who might benefit from them.


SQL is a must

The fuel of data science is data. Without proper, well-maintained, and easily accessible data, we can’t do much. Although NoSQL databases are becoming more common, most companies still use relational databases to store data.

SQL is the key to relational databases. SQL is not only used for accessing and retrieving data but also as an efficient data analysis tool. The versatile and flexible SQL functions allow for performing data analysis while retrieving the data.

We can also use it for filtering and transforming the data so that we only retrieve the data we need. This saves both memory and computation.
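As a minimal sketch of doing the filtering and aggregation inside the query, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the sample rows are made up for illustration; a production database would differ, but the idea of letting SQL return only the summary you need is the same.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (store TEXT, product TEXT, amount REAL, sold_on TEXT)"
    )
    rows = [
        ("north", "apple", 10.0, "2021-01-05"),
        ("north", "pear", 5.0, "2021-01-06"),
        ("south", "apple", 7.5, "2021-01-05"),
        ("south", "apple", 2.5, "2020-12-30"),
        ("north", "apple", 20.0, "2020-12-31"),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


# Filter (WHERE) and aggregate (GROUP BY) in the database, so that only
# one summary row per store is transferred to Python.
QUERY = """
SELECT store, COUNT(*) AS n_sales, SUM(amount) AS total_amount
FROM sales
WHERE sold_on >= ?
GROUP BY store
ORDER BY store
"""


def sales_by_store(conn: sqlite3.Connection, since: str) -> list:
    """Return (store, n_sales, total_amount) for sales on or after `since`."""
    return conn.execute(QUERY, (since,)).fetchall()


conn = build_demo_db()
summary = sales_by_store(conn, "2021-01-01")
print(summary)
```

With five raw rows the saving is trivial, but the same query against millions of rows would still return one row per store instead of loading the whole table into memory first.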

I learned SQL during my data science journey but it was not enough. I consider myself an intermediate user of SQL. If I started over, I would definitely go for being an advanced SQL user.

Only after I started working did I discover the capabilities of SQL and how important it is for the data science ecosystem. To become an advanced user, you need lots of practice.


Git is the way to collaborate with your colleagues

Git is a version control system. It maintains a history of all changes made to the code in a project. The changes are stored in a special database called a "repository", also known as a "repo".

Two main advantages of using Git in software development:

  • Tracking changes and updates. We are able to see who made which changes. Git also records when and why a change was made.
  • Allowing collaborative work. Software development projects usually require many people to work together. Git provides developers with a systematic way of doing that, so they can focus on the project instead of spending time in lengthy coordination sessions with each other.

In a typical data science project, you are likely to work with many people, including data engineers, software developers, and other data scientists. Much of that collaboration happens through Git.

You must be comfortable with Git commands and the Git workflow to collaborate with your colleagues. Although hosting services like GitHub and GitLab provide simple interfaces for using Git, I recommend learning the command-line Git commands as well.


Python is not just about Pandas

Pandas is a great tool for data analysis and manipulation. I have been using it since the first day of my data science journey. I have also used many other Python libraries in the data science ecosystem, such as NumPy, Seaborn, Matplotlib, and Scikit-learn.

All of them are very useful and I definitely suggest learning them. However, Python is not just about data science libraries. I feel like I focused on learning these libraries too much. As a result, I haven’t been able to improve my skills in Python as a general-purpose language.

You may argue that a data scientist is not a software developer. However, most companies will ask you to write basic scripts to implement projects. Besides, you should be able to read and understand the code written by other data scientists or software developers.
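To give a rough idea of what such a basic script can look like, here is a small sketch that uses only the Python standard library, with no Pandas involved. It counts how often each value appears in one column of CSV data. The column names, the sample data, and the function names are hypothetical and chosen only for illustration.

```python
import argparse
import csv
import io
from collections import Counter
from typing import Iterable, Optional, Sequence


def count_values(rows: Iterable[dict], column: str) -> Counter:
    """Count the non-empty values found in `column` across all rows."""
    counts: Counter = Counter()
    for row in rows:
        # Missing or short rows yield None; treat them as empty.
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return counts


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse command-line options; `argv` can be passed in for testing."""
    parser = argparse.ArgumentParser(description="Count values in a CSV column.")
    parser.add_argument("--column", required=True, help="name of the column")
    return parser.parse_args(argv)


# In-memory sample standing in for a real CSV file (made-up data).
sample = io.StringIO(
    "team,language\n"
    "analytics,python\n"
    "platform,go\n"
    "analytics,sql\n"
)

args = parse_args(["--column", "team"])
counts = count_values(csv.DictReader(sample), args.column)

# Print the values from most to least frequent.
for value, n in counts.most_common():
    print(f"{value}\t{n}")
```

Nothing here is specific to data science: argument parsing, iteration, dictionaries, and small testable functions are the kind of general-purpose Python that everyday scripts are built from. In a real script, the sample would be replaced by a file opened from a path given on the command line.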

This does not apply only to Python. Whatever programming language you choose for learning data science, make sure your scope covers more than data science libraries.


Conclusion

The 3 lessons I shared in this article are what I realized after I started working as a data scientist. I knew SQL, Git, and Python were important and spent time learning them, but it was not enough. I should have focused a lot more on these subjects.

I want to emphasize that these are not the only things you need to learn. Rather, they are the things you may overlook.

Since data science is not well established in the traditional education system, the learning path is mostly through certificates and MOOCs. The typical data science certificates usually focus on the libraries. Thus, aspiring data scientists who follow a self-taught path like me are not likely to give enough importance to the tools mentioned in this article.

Thank you for reading. Please let me know if you have any feedback.

