Data Science Year Zero: Your First Job

Bottom line: A few soon-to-graduate college students have been asking for my advice on what it takes to bag a Data Science & Analytics role. In this article, I’ll summarize the lessons I’ve learnt over the past year; hopefully, you’ll find some value in them.
**Table of Contents:**
- What I lacked when I started
- Skills and tools to learn
- Which skill to learn first
- Workplace advice
What I lacked when I started (and you do, too)
Figuring out what ‘Prod’ meant took me embarrassingly longer than I’d like to admit. It was indeed a lightbulb moment when I finally did figure it out, for years’ worth of questions were answered in a matter of minutes.
I’ve never cared much for software development, and I believe the CSVs and the Kaggle competitions have all done me dirty. When I started, I felt unprepared to work in a ‘real environment’, where you shouldn’t have to download a Parquet file and load it into PySpark every time you need something done.
Software Engineers get a rundown of the Prod environment on their very first day at the company. Data Science, however, is still finding its footing in most young organizations, unless they sell a data-driven product. The Engineers assume you’re there for Business Intelligence (BI), while the Business team takes it that you have your bases covered on the ‘tech stuff’.
I didn’t, and probably you don’t either!
Here, I’ve compiled the learnings (ergo, the mistakes) from my first year working as the sole Data Science personnel at SmartServ, a SaaS startup for non-desk workers in the US and Canada. Some of these will not be expected of you fresh out of college, but since this article happens to be about ‘standing out’, I’m including everything I know. I didn’t know any of this before I started; you probably should.
In no particular order:

Production: Getting the hang of the ‘production environment’ is a paradigm shift. You’ve always wondered how things must run at Netflix, since they can’t reasonably depend on loading and downloading CSVs every time a movie recommendation has to be made. But Kaggle and AnalyticsVidhya only ever prepared you for the final 10–20% of the stretch, i.e., reading preprocessed data and using it to fit ML models. Those, too, are one-off processes, neither reusable nor scalable. Data & Pipeline Engineering is how it’s done in the real world.
- Bigger companies hire for dedicated roles. However, at a startup, you take projects end-to-end, which (roughly) includes:
- Sensing the data
- Ingesting it (from data lakes or OLTP stores)
- ETLing it (denormalizing, cleaning, and preparing it for use)
- Moving it to an OLAP store
- Setting up a warehouse
- Building the ML model
- Feeding the prepared data to ML/DL models, reports, or dashboards
NOTE: Everything runs in the production environment (i.e., on a server), free from human intervention. A minimal sketch of such a job follows below.
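To make the list concrete, here’s a minimal sketch of a batch Extract-Transform-Load job in Python, using pandas and SQLAlchemy. The table, columns, and connection strings are all hypothetical; your stack will differ.

```python
# A batch ETL job: the table, columns, and connection strings are
# hypothetical; swap in your own stack.
import pandas as pd
from sqlalchemy import create_engine

oltp = create_engine("postgresql://user:pass@oltp-host/app_db")          # source (OLTP)
warehouse = create_engine("postgresql://user:pass@olap-host/warehouse")  # destination (OLAP)

# Extract: pull raw, normalized rows out of the production database.
jobs = pd.read_sql("SELECT job_id, created_at, status FROM jobs", oltp,
                   parse_dates=["created_at"])

# Transform: denormalize/clean into an analysis-friendly shape.
daily = (
    jobs.assign(created_date=jobs["created_at"].dt.date)
        .groupby(["created_date", "status"])
        .size()
        .reset_index(name="job_count")
)

# Load: write the prepared table to the warehouse for models/dashboards.
daily.to_sql("daily_job_counts", warehouse, if_exists="replace", index=False)
```

Schedule something like this to run nightly (via cron or a workflow manager; see the ETL section below) and you’ve replaced the manual download-a-CSV loop.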

- Docker: Once you’re through with ‘How do I put it in production?’, you’ll encounter ‘Why is this not running in production? It works on my machine!’. Managing dependencies across environments is a headache, and knowing basic Docker/Kubernetes can take you a long way. Docker containers are analogous to virtual environments (like virtualenv in Python 3): they freeze the current state of your environment (including all dependencies) on which your application is going to run. The image is then deployed to production, where it remains unchanged through upgrades and deprecations of the packages it includes.
At bigger organizations, you might find dedicated DevOps personnel for this, but it’s always better to be full-stack. Docker can do a lot more than what I’ve touched upon, but I’ll leave that to the experts. Tip: Docker can also help you build a portable personal development environment that you can use while on the move.
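As an illustration, here’s a minimal Dockerfile sketch that freezes a Python app’s dependencies into an image; the base image and file names are hypothetical.

```dockerfile
# Pin the interpreter so 'it works on my machine' also works in production.
FROM python:3.10-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached across rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and define how the container runs it.
COPY . .
CMD ["python", "main.py"]
```

Build and run it with `docker build -t my-app .` followed by `docker run my-app`; the same image then runs identically on your laptop and the production server.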

- Databases: Imbibe the fundamental DB concepts. You’ll be expected to know at least one relational query language (MySQL/PostgreSQL) and one non-relational one (MongoDB/Cassandra). Once you figure one out, the rest are easily accessible. Query logging and optimization are definitely a plus; however, they’re unlikely to be acknowledged or appreciated. There are many articles to get you started with DBMS, and a toy example follows below.
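For instance, here’s a toy relational query using Python’s built-in sqlite3 module. The schema is made up, but the JOIN + GROUP BY pattern is the bread and butter of analytics work.

```python
# A toy relational example using Python's built-in sqlite3 module.
# The schema is made up; the JOIN + GROUP BY pattern is the point.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE invoices  (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'CA');
    INSERT INTO invoices  VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 150.0);
""")

# Revenue by region: the bread-and-butter aggregate query.
rows = conn.execute("""
    SELECT c.region, SUM(i.amount) AS revenue
    FROM invoices i
    JOIN customers c ON c.id = i.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
""").fetchall()

print(rows)  # [('US', 200.0), ('CA', 150.0)]
```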

- ETL: I was oblivious to ETL right until the need for it hit me. Unless your product utilizes Intentional Data Transfer (IDT), you can’t escape this. ETL is an acronym for Extract-Transform-Load, a process you’re going to have to perform repeatedly, possibly hundreds of times a week for a hundred different tasks. ETL applications allow you to template the entire process (called a ‘pipeline’) using code or a GUI, and even schedule it to run at specified intervals, free from intervention.
- My Engineering lead pointed me towards Pentaho DI Community Edition (formerly Kettle), and it’s a pretty neat application. Alternatives include Matillion (wide support for cloud data sources), Talend, and Informatica. Savvy Data Scientists prefer Bash for its low-level control, so take your pick. For a walkthrough, see ‘Getting Started with Pentaho Data Integration (Kettle) and its Components’.
Tip: You’ll eventually want to graduate to a fully-fledged Workflow Management System (e.g., Luigi or Apache Airflow). I’m trying to take on Airflow at the moment, and it’s a world of pain, but it does seem to be worth it: a one-stop solution. A minimal DAG sketch follows below.
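To give a flavour of what that looks like, here’s a minimal Airflow DAG sketch (Airflow 2.x import paths; the task bodies are placeholders for your own extract/transform/load code):

```python
# A minimal Airflow DAG sketch (Airflow 2.x import paths); the task
# bodies are placeholders for your own extract/transform/load code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")  # placeholder

def transform():
    print("clean and denormalize it")       # placeholder

def load():
    print("write it to the warehouse")      # placeholder

with DAG(
    dag_id="daily_reporting_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # run once a day, free from intervention
    catchup=False,
) as dag:
    # Chain the steps so each runs only after the previous one succeeds.
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```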

- Scripting: Reporting is not something you’re ever going to grow out of. While SQL is quite powerful by itself, knowing a scripting language often comes in handy: it lets you automate the fiddly parts of a report with code that’s interpreted at runtime, step by step (see the example after this list).
- Python is the de facto standard, and that’s what I use for reporting (and for workflow management). I moved from R (my personal favourite) to Python because:
- People are not familiar with R, so they resist it.
- R has parallelization constraints (poor scalability) and is not deployment-friendly.
- R is highly abstracted, so debugging gets challenging quickly.
- Everybody can read Python 3 code: reusability goes up, and passing work on to newer folks becomes easier.
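Here’s the kind of small reporting script I mean: SQL could produce these numbers, but the pivot and week-over-week arithmetic are far more pleasant in Python/pandas. The CSV file and its columns are hypothetical.

```python
# A small reporting script: pivot weekly revenue by region and compute
# week-over-week change. weekly_revenue.csv and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("weekly_revenue.csv")  # columns: week, region, revenue

report = df.pivot(index="week", columns="region", values="revenue")
report["total"] = report.sum(axis=1)
report["wow_change_pct"] = report["total"].pct_change() * 100  # week-over-week %

print(report.round(1).to_string())  # paste-ready summary for stakeholders
```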

- ML/DL/Visualization: This is your forté. You’ll keep accumulating other skills as you go, but at heart you’re always going to be an ML/DL wiz or a Visualization aficionado. In the early years, however, the projects may not always come as you’d like them to: you may not get to build statistical models or fancy recommendation engines. You’ll need to scratch your itch by working on personal projects or contributing to open-source projects. And when you find a window where your skills can be useful to the product/company, hustle a little to get your foot in (refer to ‘Workplace learnings’ below). Your preparedness will ensure that you stand out when the need arises.
- You’re probably already familiar with TensorFlow, MXNet, or Torch for ML/DL, and ggplot2, Bokeh, or Plotly for visualization. NVIDIA, for one, maintains great open-source projects you can contribute to.
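As a toy end-to-end example, here’s the fit-then-visualize loop using scikit-learn and Plotly Express (common stand-ins for the libraries above):

```python
# A toy end-to-end: fit a model, then visualize what it leaned on.
# scikit-learn and Plotly Express are stand-ins for the libraries above.
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")

# Plot which features the model relied on most.
fig = px.bar(x=X.columns, y=model.feature_importances_,
             labels={"x": "feature", "y": "importance"})
fig.show()
```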

- Cloud Computing: THE top skill to have in your arsenal as a Data Analyst/Scientist. Most developers and DevOps personnel are not comfortable with data-driven applications (remember, you might be the first, or only the fifth, Data Science hire; the team is going to take time to get used to your needs), which is why you want to start learning the ropes of cloud computing as soon as you can.
- Fortunately, there is a plethora of managed platforms (AWS, GCP, Azure, Databricks) that make it infinitely easier for you to build, test, and deploy your ML applications. Moreover, managed ML services (e.g., SageMaker) can help you cut through the auxiliary work that you might not want to put in.
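Even something this small goes a long way: a minimal boto3 sketch that pushes a trained model artifact to S3, where a serving job (or a SageMaker endpoint) can pick it up. The bucket and key names are hypothetical; credentials come from your environment or IAM role.

```python
# A minimal boto3 sketch: push a trained model artifact to S3, where a
# serving job (or a SageMaker endpoint) can pick it up. The bucket and
# key are hypothetical; credentials come from your environment/IAM role.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="model.joblib",             # local artifact from training
    Bucket="my-ds-artifacts",            # hypothetical bucket name
    Key="lead-scoring/v1/model.joblib",  # versioned path within the bucket
)
print("uploaded to s3://my-ds-artifacts/lead-scoring/v1/model.joblib")
```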

- Version Control (Git): On large teams, multiple people inevitably work on the same project, which requires collaboration and some way to track the history of the codebase. Git does exactly that: it lets multiple people collaborate on the same project while keeping track of every change made to the codebase. It can do a lot more, but it’s something you get better at with experience. I still struggle with Git.
tl;dr: What you should learn first
In my opinion:
- Mandatory skills: DBMS & Scripting >> Git
- Skills to stand out: [Cloud Computing, ETL] >> Docker >> your forté
Workplace learnings:
- Use statistics wisely: People with authority are only convinced by the things they understand (which is why they’re the ones making the decisions). Go overboard with numbers or compound metrics and you risk losing them; go in without your numbers and you won’t be able to back yourself up. Balancing numbers with first-principles reasoning is a craft I haven’t mastered yet; maybe you can get started on it sooner.
- You can’t escape the grunt work: There are innumerable reasons why you can’t; just accept it and try to get it done. Unless you’re specifically filling the role of an ML Engineer (which would be pointless for a fresh graduate), you’ll be required to do a bit of everything: you’ll be responsible for all data-oriented tasks. Forgetting the ‘science’ in ‘Data Science’ helps ease the pain sometimes.
- Be adamant: To get things done, you’re going to need people in your corner, preferably senior folk who back you and your ideas. If you don’t have one on your team, find a champion outside it. I had a difficult time getting anything done in my first few months because I had no such champion: others would push through with their ideas while I stood there waiting for things to come my way (they never did). Be a little scrappy and more than a little adamant. Get your shit done!
Conclusion
Data Science roles are not purist tech roles. At the end of the day, you’re there to cater to the product and its needs. Dare to step out of the bubble the Data Science community has created, and you might just become a good leader somewhere down the line, owing to the unique position Data Science holds in the product-business mix.
That’s it for this article. I’ll try to cover each of these points separately in future posts. Let me know if you have any feedback or anything you’d like me to cover. I work at SmartServ, where we’re always looking for fresh talent. Look out for open positions, or drop in a hello!