Answering the question "what skills do I need to have?" can yield different answers depending on who you are talking to, what company you are looking at, and even which job position you are applying for. After many talks with college students on this topic, I wanted to sit down and discuss how I look at this question and the top 8 areas to consider when building your skills.
1. Developing a Business Problem
As you look at what skills you need to work in data science, one key area is learning to develop or understand the business problem you are trying to solve. It is essential to understand the business justification for the work you will be doing and how the customer will utilize it. Often, we can get caught up in a cool idea but miss the business aspect. If no customer is asking for the work and there is no reason to run the analyses, then what are you doing? Understand how this work will be used and how it will provide value to the customer. Developing skills in this area, understanding business justification and value-add, will help you continue to build data science projects and present your work to others.
2. Working with Big and Small Data Ingestion and Processing
As you look for a job in data, you will need to know how to ingest and process the data you are working with. The skill sets in this area vary. Suppose you are leaning more towards data engineering. In that case, you may find yourself developing databases, creating relationships between data sources, and building data marts that others can pull data from. Your skills need to center on creating and maintaining those data sources. If this is the case for you, focus on knowing different database types, how to use those databases, and how to create relationships within the data.
If you are looking for a data science or data analyst job, you may be more focused on how to bring that data into your workspace. Do you need to connect to a database or use an API? Are you writing code that will interact with this data, or using software tools like Power BI and Tableau? Your skills in this area may vary depending on the type of role you are applying to. Still, it is good to have at least a basic understanding of how to interact with different data sources and ingest that data into your tools or environments. Knowing how this data gets ingested before you start your analyses is essential.
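To make this concrete, here is a minimal Python sketch of pulling data from a database and a REST API into one workspace. The connection string, table name, and API URL are hypothetical placeholders, and the API is assumed to return a JSON list of records:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Hypothetical connection string: swap in your own database credentials.
engine = create_engine("postgresql://user:password@localhost:5432/sales_db")

# Pull a table from the database into a DataFrame.
orders = pd.read_sql("SELECT * FROM orders WHERE order_date >= '2023-01-01'", engine)

# Pull supplementary data from a (hypothetical) REST API.
response = requests.get("https://api.example.com/v1/customers", timeout=30)
response.raise_for_status()
customers = pd.DataFrame(response.json())  # assumes a JSON list of records

# Join the two sources on a shared key before starting the analysis.
df = orders.merge(customers, on="customer_id", how="left")
```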
3. Understanding Data Cleaning and Preprocessing
Data cleaning and preprocessing will also vary between jobs. In the first case, a data engineer creating the data sources, your cleaning and preprocessing may be more generic to the overall ingestion process. You may want to remove duplicate entries, clean up how the data sources are interconnected, and create a usable database that others can work with. In this position, you are not as focused on intense data cleaning techniques.
What I mean by that is, if you are a data scientist or analyst, you may ingest a data engineer's dataset and still need to clean and preprocess the data for your own work. You need skills in one-hot encoding, cleaning and handling text data, imputation, and making sure the data columns are the expected data types for what you are doing. Understanding the different ways data cleaning and preprocessing happen, and implementing them based on your end use case, are valuable skills you will often need as you work with the data. You should understand the main concepts of data cleaning and preprocessing relative to the job you are looking for.
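As a minimal sketch of those tasks, fixing data types, imputing missing values, and one-hot encoding, here is one way to do it with pandas and scikit-learn; the columns and values are purely illustrative:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataset standing in for a data engineer's table (illustrative only).
df = pd.DataFrame({
    "region": ["east", "west", None, "east"],
    "revenue": ["100.5", "200.0", "150.25", None],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-03-01", "2023-04-20"],
})

# Fix column data types before doing anything else.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Impute missing numeric values with the column median.
imputer = SimpleImputer(strategy="median")
df[["revenue"]] = imputer.fit_transform(df[["revenue"]])

# Fill missing categories, then one-hot encode them.
df["region"] = df["region"].fillna("unknown")
df = pd.get_dummies(df, columns=["region"], prefix="region")
```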
4. Working with Tools
Whether you code your models in R or Python or utilize auto-ML tools to create them, you should understand how to use your tools. If you are a programmer in data science, it is expected that you can take a notebook or script and interact with that code: read it, change it, and run it. You should be able to explain the types of packages you use, why you use them, and in what cases you may prefer one package over another. You should have a good understanding of the programming tools you use, from language, packages, IDEs, and notebooks to how this code will be used in production to drive action.
No matter your toolset, you need to explain how you use these tools' different features to ingest, analyze, and present your work. For example, let's take an auto-ML tool. If you put that tool on your resume, you should understand how to ingest and clean data, configure the tool to use your dataset, and finally run the tool and use the results. You need to be able to answer questions about this tool when asked. Whether you are creating your model through code or an auto-ML tool, you should understand how to develop, tune, and interpret your model and its output.
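If you take the code route, the "develop and tune" part often looks something like this scikit-learn sketch: a cross-validated grid search over a couple of hyperparameters on a built-in dataset. The model choice and parameter grid here are just examples, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# A built-in dataset stands in for your real problem.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune a couple of hyperparameters with 5-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, None]},
    cv=5,
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))
```

Being able to walk an interviewer through each step here, why this model, why these parameters, what the cross-validation is doing, is exactly the kind of tool fluency this section is about.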
5. Drafting Visualizations, Reports, and Dashboards
As you develop your models and gather your results, you will need to draft different visualizations, reports, and dashboards from that information. Understanding how to present your work in a meaningful and easy-to-read manner is crucial to sharing your results with others. You want visualizations that a user can pick up and use to drive action based on the presented information.
For example, let’s consider a dashboard. If you are developing a dashboard of your outputs, what data points from your results are most important to your end customer? Would you display the data in a scatter plot, heatmap, or bar chart, and why? Drafting visualizations and turning them into presentations, reports, and dashboards is a skill on its own. It combines the business understanding of the problem, the cleaned data, and a representation of the action you would want someone to take as they review the data. This skill can be improved over time as you adjust to how your particular customer(s) want to see the data presented.
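As one small illustration of the "which chart and why" question, here is a matplotlib sketch. The segment names and churn numbers are made up; a bar chart is chosen because the data is a single metric compared across a few categories:

```python
import matplotlib.pyplot as plt

# Hypothetical model output: predicted churn rate per customer segment.
segments = ["New", "Returning", "Loyal", "At-risk"]
churn_rate = [0.32, 0.18, 0.07, 0.46]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(segments, churn_rate, color="steelblue")
ax.set_ylabel("Predicted churn rate")
ax.set_title("Churn risk by customer segment")

# Annotate the bars so a viewer can read exact values at a glance.
for i, rate in enumerate(churn_rate):
    ax.text(i, rate + 0.01, f"{rate:.0%}", ha="center")

plt.tight_layout()
plt.show()
```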
6. Understanding the Analytics Life Cycle
You go through all the effort to train a model or develop an analytic, but when it comes down to it, that analytic will not last forever. As you collect new data, you will need to develop new analytics. Understanding the life cycle of your analytics and learning how to retire old analytics is a valuable skill. Your analytic will be created, validated, deployed, reworked, and then retired. The most exciting part is the creation and validation of the analytic; this is when you develop the model and validate that it predicts what you expect. The rest of the life cycle focuses on how that model will be deployed. Will it be added to an application front-end or back-end? Will the analytic drive internal value, producing results for a dashboard or report?
Once you know where this analytic will be deployed and how it will be used, you can determine when it will need to be reworked to optimize and tune the results. This rework may happen every few days, weeks, or even months. But after rework comes retirement. The analytic will eventually be retired when it no longer produces valuable results, or when the data has changed so much that you need to start a new analytic. It is essential to understand what you will need to do at each stage of the analytic life cycle.
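One simple way to make the rework-or-retire decision concrete, assuming you can log recent predictions against ground truth, is to compare live performance to a deployment-time baseline. This is only a sketch; the thresholds are hypothetical and should be tuned to your own use case:

```python
from sklearn.metrics import accuracy_score

# Hypothetical thresholds: tune these to your own analytic and use case.
BASELINE_ACCURACY = 0.90   # accuracy measured at deployment time
REWORK_THRESHOLD = 0.05    # acceptable drop before retraining
RETIRE_THRESHOLD = 0.20    # drop at which the analytic should be retired

def lifecycle_action(y_true, y_pred):
    """Decide whether a deployed analytic should be kept, reworked, or retired."""
    current = accuracy_score(y_true, y_pred)
    drop = BASELINE_ACCURACY - current
    if drop >= RETIRE_THRESHOLD:
        return "retire"    # performance has degraded beyond repair
    if drop >= REWORK_THRESHOLD:
        return "rework"    # retrain or tune on newer data
    return "keep"          # analytic is still healthy
```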
7. Analytics to Production
One subsection of the analytic life cycle is creating a production-ready analytic, but how do you move your analytic into production? There are different ways to do this, and you should understand at least one and learn others as needed. The methods I have seen used to move analytics into production are:
- Pickling (serializing) the model after training; see the sketch after this list.
- Creating analytics using OOP and making an analytics library that others can run.
- Creating Docker containers of your analytic code and dependencies.
As you look at different options for creating production-ready models, make sure you understand your use case first, and then design your code architecture to fit what is required.
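As a minimal sketch of the first option, here is what pickling a trained model can look like with joblib, the serialization library scikit-learn recommends for its models. The model and dataset are stand-ins for your real analytic:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a simple model (a stand-in for your real analytic).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk at the end of training...
joblib.dump(model, "model.joblib")

# ...and later, in production code, load it and predict on new data.
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:5]))
```

One caveat worth knowing: a pickled model generally needs the same library versions at load time that it was trained with, which is part of why the Docker option on this list exists.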
8. Embracing Research and Development (R&D)
As you work as a data scientist, you will also encounter many projects focused on R&D. You will need to think critically, research information from different sources, and analyze that information. The area I like to focus on with R&D projects is failing fast. Failing fast means iterating quickly on a project so that failures surface early. As these failures come, you will need to decide where to take the project next or whether to drop it for another.
A soft skill that pairs well with R&D work is listening to and embracing others' feedback and ideas as you work. As you clean data, develop analytics, and deploy your models and visualizations, listen to what others tell you about your work. I am not saying you should take every idea or use all the feedback you are given, but listen to what they have to say and determine if you need to alter your approach to the problem. If you ask someone a question, jot down notes and make sure you understand what they are telling you. Then, take all these notes and determine your path forward. You may use some of their ideas, or you may not, but these discussions can drive some impressive results if you are open to working with others on your projects.
Final Thoughts
As you look for a new job in the data science field, understand that the particular tools, technologies, and skills required will vary from company to company. But if you have a good grasp of the skills in these different areas, understand the tools you know, and are willing to learn as you go, you will be better prepared to explain how your current skills can be used to tackle problems.
If you would like to read more, check out some of my other articles below!
Stop Wasting Your Time and Consult a Subject Matter Expert
Do We Need Object Oriented Programming in Data Science?
Keys to Success when Adopting a Pre-Existing Data Science Project