This story will be somewhat different from what I usually post. It won’t be an intro to specific tools and techniques, nor a tutorial or a practical case.
This time I want to answer a question I’ve been receiving through LinkedIn since I started posting on Medium:
how does someone with no technical background become a data scientist?
I’m no expert and my experience in the field is rather short, but I do feel in a position to help and discuss the topic. Everything here is based on my opinion, with no science backing it up.
Disclaimer: I currently work as a data analyst but I’ve studied and been playing with data science for 5 years now. You could say I’m a free-time data scientist who makes ends meet through data analysis.
There are multiple paths someone could take to become a data scientist and it ultimately depends on your preferences and needs.
So I won’t discuss paths here. Do you like college? Enroll now. Do you prefer self-teaching? That’s what I’ve done for so many years. Do you prefer short courses or bootcamps? Those are great too.
Any option works, I’m not here to tell you what to choose.
What I can do, though, is share the tech stuff and details I find myself using every day. The tools and languages you certainly need to become a data scientist.
And it will all be pretty basic. In fact, you don’t even have to read this post. Just go to LinkedIn, find a job offer for a data scientist role, and check the requirements. You’ll understand the basics we all need.
You’ll see me cover some key topics you need to master, and I’ll share my own view, with examples of how each one has mattered in my personal projects.
More specifically, you’ll see me talk about Netty and Bazar, the names I’ve given to two of my projects. Keep reading to learn more about them!
However, data science is a really broad term. It’s like computer science. It involves many things and can get really different from one specialization to another. So I’d like to briefly dissect the data science role into some of its different types.
Types of Data Scientists
This can be quite controversial. There’s no definitive list of types of data scientists; everyone tends to make their own. I found one on the Internet that works well for what I want to share, so that’s the one I’ll be using [1].
The website shows 15 different kinds, really well explained, and with their respective responsibilities.
Check it out if you want to see them, though some people might not really label some of the roles there as real data scientists – roles such as data analyst, data engineer, business analyst…
In the end, there doesn’t seem to be a clear consensus on what a data scientist is!
But I think the list is actually helpful to let people see what each role is about. It should help newbies find their goals and move toward that direction.
Whatever you choose, the core requirements are always the same. Let’s go over them.
1. Master Python (or R)
Python is the most-used language for data science. That’s the main reason why you need to master it.
You could choose other languages such as R or Rust – this last one becoming really famous lately – but I’d choose to learn them as a second language.
Just like we have a mother language that we use every day but we learn a second one to increase our options and set of tools. Make Python your mother language, and then complement your skills with R, Rust, or any option you’re interested in.
Based on my personal experience, this is the greatest asset on today’s list. Python is always the main language I’ve used both in personal projects and professionally as well.
For example, I once built a data science project that I called Netty – the one I’m most proud of – that consisted of a deep neural network with convolutional layers and fancy stuff that worked really well.
All Netty did was predict the winner of a given NBA game. Not to brag, but I tested it daily throughout two entire seasons and it worked quite well.
It didn’t make any sense to build Netty using a language that wasn’t Python. It would have been more difficult and more time-consuming, and the results would probably have been worse.
So don’t ignore me now. Go ahead and learn Python, the ROI it’ll yield is priceless.
However, if you hate Python or just don’t want to follow that path, I’d suggest R. It’s an exceptionally useful language for data science and data analysis, so it’s a good option as well.
Plus, most job offers ask for Python OR R, so mastering either one will probably be fine.
2. Once You Master the Language, Master Its Libraries
Knowing how to make for loops and conditional statements is basic, but that’s not going to make you stand out.
You need to master the libraries and packages that make Python and R so useful for data-related tasks.
Focusing on Python, because that’s what I usually recommend, here are the musts:
- Pandas – A powerful tool for data analysis and manipulation.
- NumPy – Your friend for any scientific computing you’re willing to perform.
- Matplotlib or Seaborn – You’ll want to visualize data and these two let you do just that.
- Keras, TensorFlow, or Scikit-learn – These will let you build the AI and ML models.
Apart from those, we could also talk about Collections, Statistics, Plotly, Dash, SciPy… But I wouldn’t say they are a must or, at least, as important as the previous ones.
If I were to start all over again, I’d focus on Pandas, NumPy, and Seaborn first. Once I became fluent with those and could manipulate data, I’d move on to build some models with Keras, TensorFlow, or Scikit-learn.
As I’m sure you’ve heard before, data scientists are often said to spend around 80% of their time retrieving, cleaning, and manipulating data, while the remaining 20% goes to modeling.
That’s why Pandas is crucial, in my opinion. I’ve been using it daily for some years now and its versatility, combined with NumPy, is what’s helped me develop amazing skills and advanced projects.
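To make the cleaning-and-manipulating part concrete, here’s a minimal sketch of the kind of work Pandas and NumPy take care of. The table, its column names, and every value in it are made up for illustration; this is not Netty’s real data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical raw game data (made-up teams and scores) with the usual mess:
# one duplicated row and one missing score.
raw = pd.DataFrame(
    {
        "home_team": ["BOS", "LAL", "BOS", "MIA", "LAL"],
        "home_pts": [112, 98, np.nan, 105, 98],
        "away_pts": [104, 101, 99, 97, 101],
        "date": ["2023-01-02", "2023-01-03", "2023-01-05", "2023-01-06", "2023-01-03"],
    }
)

clean = (
    raw.drop_duplicates()             # the LAL game appears twice
    .dropna(subset=["home_pts"])      # drop games with a missing home score
    .assign(
        date=lambda df: pd.to_datetime(df["date"]),
        home_win=lambda df: df["home_pts"] > df["away_pts"],
        point_diff=lambda df: np.abs(df["home_pts"] - df["away_pts"]),
    )
    .reset_index(drop=True)
)

# Share of home games won, per home team.
home_win_rate = clean.groupby("home_team")["home_win"].mean()
```

Chaining the steps like this keeps every transformation visible in one place, which helps a lot when you come back to a notebook weeks later.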
Going back to Netty, remember I said it consisted of a neural network. How do you think I made it possible? All the cleaning and manipulation of the data was done with Pandas and NumPy; I obviously needed visualizations, so I combined Matplotlib and Seaborn. And of course, the cool part, the AI model, was built mainly with Keras but also TensorFlow.
Just as it made sense to use Python and no other language, the same goes for these libraries. It’s as if they all came as a pack.
3. Master SQL as well
Even though SQL won’t help you build models and predictive systems, learning how to perform queries and retrieve data from databases is basic.
I’ve said before that most of my coding time is in Python… The rest is in SQL.
Be it professionally or in personal projects, if the data is stored in databases – and it usually is – you’ll have to know how to retrieve the exact data that you need.
I wouldn’t have been hired at my current job if I didn’t know SQL. And honestly, I couldn’t work there if I didn’t master it.
So pick any Relational Database Management System (RDBMS) you like and go ahead and learn to perform simple queries. I wouldn’t go super deep into SQL; I believe the basic statements should be more than enough.
Learn to do joins, GROUP BYs, window functions… Just become fluent in them.
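Here’s a small sketch of those three building blocks in action. It uses Python’s built-in sqlite3 module only so it runs without installing anything (window functions need SQLite 3.25 or newer), and the tables and values are invented for the example. The SQL itself is standard, so it should carry over to other engines with little or no change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 50.0);
    """
)

# JOIN + GROUP BY: total spent per customer.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

# Window function: running total of each customer's orders, in order.
running = conn.execute(
    """
    SELECT c.name, o.amount,
           SUM(o.amount) OVER (PARTITION BY c.id ORDER BY o.id) AS running_total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.id
    """
).fetchall()
conn.close()
```

The difference worth noticing: GROUP BY collapses each customer into a single row, while the window function keeps every order and adds the running total next to it.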
If you want some RDBMS recommendations, I love DuckDB [2], partly because it’s powerful and partly because it integrates really well with Python. Check out a story I posted introducing and analyzing it:
Forget about SQLite, Use DuckDB Instead – And Thank Me Later
Apart from DuckDB, other options I’d consider are SQLite, MySQL, or PostgreSQL…
In the personal portfolio I used to get hired, there was a project called Bazar. It basically stored Amazon product data in a local DuckDB database, and I used it to track the prices of products I was interested in and see when they dropped.
This project was sustained by a lot of SQL queries. Not complex ones, really simple in fact, but they were key. I had to retrieve the products’ prices, the URLs… all of it frequently, to compare against real-time data.
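As a rough idea of what that kind of query looks like, here’s a sketch of a price-drop check. Everything in it is hypothetical: the `products` and `prices` tables, the helper names, the example URLs, and the numbers are made up, and it uses the built-in sqlite3 module as a stand-in for the database so it runs as-is. It is not Bazar’s actual code.

```python
import sqlite3


def latest_prices(conn):
    """Return {product_id: (url, last_stored_price)} with one simple query."""
    rows = conn.execute(
        """
        SELECT p.id, p.url, pr.price
        FROM products AS p
        JOIN prices AS pr ON pr.product_id = p.id
        WHERE pr.checked_at = (
            SELECT MAX(checked_at) FROM prices WHERE product_id = p.id
        )
        """
    ).fetchall()
    return {pid: (url, price) for pid, url, price in rows}


def find_price_drops(conn, live_prices):
    """Compare stored prices with freshly fetched ones, passed in as a dict."""
    drops = []
    for pid, (url, old_price) in latest_prices(conn).items():
        new_price = live_prices.get(pid)
        if new_price is not None and new_price < old_price:
            drops.append((url, old_price, new_price))
    return drops


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE prices (product_id INTEGER, checked_at TEXT, price REAL);
    INSERT INTO products VALUES
        (1, 'https://example.com/item-1'), (2, 'https://example.com/item-2');
    INSERT INTO prices VALUES
        (1, '2024-01-01', 30.0), (1, '2024-01-02', 28.0), (2, '2024-01-01', 15.0);
    """
)

# In a real tracker these values would come from the live site.
drops = find_price_drops(conn, {1: 25.0, 2: 15.0})
conn.close()
```

Only the first product is reported, since its live price (25.0) is below the last stored one (28.0) while the second product’s price is unchanged.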
4. Don’t Be Afraid of Math
I personally love math. I’m lucky my interest in it comes naturally.
But others aren’t as lucky.
However, data science involves a lot of math, be it directly or indirectly. Mastering math is key to understanding what you’re doing and making informed decisions.
Focus especially on calculus, algebra, statistics, and probability. Easier said than learned, I know.
For example, gradient descent is the most-used algorithm when it comes to training neural networks – yes, I used it on Netty too. Gradient descent is basically the process of using the gradient (the derivatives of the error with respect to each parameter) to nudge the model’s parameters, step by step, in the direction that reduces that error.
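To see the idea without any library in the way, here’s a minimal sketch of gradient descent fitting a straight line. It’s a toy example with made-up points, a hand-picked learning rate, and a mean-squared-error loss; real neural networks have far more parameters and let the framework compute the gradients.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so we know what to expect.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

After the loop, `w` and `b` end up very close to 2 and 1. Try a much larger learning rate and you’ll see the values blow up instead, which is exactly the kind of behavior the math helps you understand.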
Another example I’ve run into in some projects, especially Netty, is the exploration phase of the analysis. Lots of stats have to be applied there: distributions, means, percentiles, deviations… You have to understand these.
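A quick sketch of those exploration basics, using only Python’s built-in statistics module (available as shown from Python 3.8). The numbers are made up for illustration; in practice you’d likely compute the same things on a Pandas column.

```python
import statistics

# Made-up point differences from ten games, for illustration only.
point_diffs = [3, 8, 12, 5, 7, 21, 2, 9, 14, 6]

mean = statistics.mean(point_diffs)
median = statistics.median(point_diffs)
stdev = statistics.stdev(point_diffs)               # sample standard deviation
quartiles = statistics.quantiles(point_diffs, n=4)  # 25th, 50th, 75th percentiles
```

Here the mean (8.7) sits above the median (7.5) because one lopsided game pulls it up, a small reminder of why you look at more than one number before trusting a summary.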
5. Improve Your Visualizations
It will depend on your role but, as a data scientist, you’ll probably need to share your insights with someone on your team or with a stakeholder. Visualizations come in handy there, so mastering them is a great asset.
Nothing fancy here: you can either choose specialized software like Power BI or Tableau, or just stick to Python and build amazing interactive dashboards with Plotly and Dash.
I’ll be creating a more in-depth post soon on Dash, but I introduced it briefly on a past story alongside other libraries. I think it could be useful if you’re interested:
Building Interactive Data Visualizations with Python – The Art of Storytelling
The thing is, as data scientists or analysts, we need to communicate our findings. It’s happened to me a lot that a simple visualization has been enough for the stakeholders to understand what I’m talking about.
That’s the magic and power of visualizations: if they’re amazing, few words are needed.
But don’t do it only for others; do it for yourself as well. Again, with Netty, I tinkered with lots of numbers. It’s really easy to get lost and not understand what you’re seeing.
With simple visualizations all the data made sense. I could now see how the accuracy of the model and its error evolved over time during the training phases. Not only that, but I could also see how the winners were distributed (~60% of the time the winner was the home team) and lots of new data that was extremely valuable.
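As an example of that kind of chart, here’s a minimal Matplotlib sketch that plots accuracy and error across training epochs. The numbers are invented for illustration and are not Netty’s real training history; with Keras you would typically take them from the history returned by training.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up training history, for illustration only.
epochs = [1, 2, 3, 4, 5, 6]
accuracy = [0.52, 0.57, 0.61, 0.63, 0.65, 0.66]
loss = [0.71, 0.68, 0.65, 0.63, 0.62, 0.61]

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(9, 3.5))

ax_acc.plot(epochs, accuracy, marker="o")
ax_acc.set(title="Accuracy per epoch", xlabel="Epoch", ylabel="Accuracy")

ax_loss.plot(epochs, loss, marker="o", color="tab:red")
ax_loss.set(title="Loss per epoch", xlabel="Epoch", ylabel="Loss")

fig.tight_layout()
# fig.savefig("training_curves.png") would write the chart to disk.
```

Two small panels side by side are often enough: if accuracy flattens while the error keeps dropping, or the other way around, you spot it at a glance instead of scanning a table of numbers.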
6. Be Curious and Proactive
After you master Python, its libraries, SQL, math, and some visualization tools, you can’t allow yourself to stop there. The good stuff starts after all that core learning.
Curiosity, proactivity, creativity… They’re not tech tools but they’re equally important.
We need to cultivate these skills and learn new ones. Create personal projects that we’re interested in and use these tools to create an amazing portfolio.
This is what will ultimately make your profile stand out from the rest. Putting stuff on your CV isn’t enough. You need to show what you’ve done.
Provide proof.
The good part is that by actually doing these projects you’ll get to learn a lot while having fun.
And what’s most important: you’ll get hired.
Look at me, for example. A college dropout who decided to spend his time learning a lot and creating personal projects that built an entry-level portfolio.
That portfolio led me to where I am today, in an amazing company with amazing benefits and an amazing culture.
Conclusions
Data science is an amazing field and it’s a perfect fit if you happen to be interested in data, math, and forecasting.
Becoming a data scientist requires time and effort, but it’s simple. There are multiple possible paths, and the answer to which one is best ultimately comes down to your preferences.
However, the most basic skills and tools required are pretty straightforward: math, Python for building models and for manipulating and visualizing data, and databases.
If you’re still interested, then go ahead and join us!
Thanks for reading the post!
I really hope you enjoyed it and found it insightful.
Follow me and subscribe to my mailing list for more content like this one, it helps a lot!
@polmarin
If you’d like to support me further, consider subscribing to Medium’s Membership through the link you find below: it won’t cost you an extra penny but will help me through this process.
Resources
[1] 15 Different Types of Data Scientists [With Responsibilities] – Knowledge Hut
[2] DuckDB