
As technology develops, so do the skills required to stay at the top of the industry. Being a Data Scientist has been the dream of many for the last couple of years. It goes without saying that it is a highly paid, lucrative job. Some of you may already be working as data scientists, while others may be seeking a position in this field. In this brief article, I will list the technical skills a modern data scientist requires. If you are a working data scientist, you will already possess some of these skills, but you may need to upskill in many new areas. If you are a beginner, this comprehensive list will help you plan your career as a data scientist.
Let me start with the most essential skill.
Choice of Programming Language
Any machine learning project, like any other software project, requires the selection of an appropriate programming language. No software developer would think of using Fortran for business applications; likewise, you would never use COBOL for scientific applications. That is to say, every programming language was designed with a specific purpose in mind.
I need not argue the point: Python is the choice of the majority for ML development. If you survey job postings, you will notice that Python, SQL, and R top the list. I would recommend Python. If you are a seasoned developer, consider Julia for its high performance.
The most important question here is: to what depth do you need to know Python? It may surprise beginners that typical machine learning code comprises just a few lines, hardly 100 to 200. The skeleton remains the same across all ML applications. This is the complete skeleton (algorithm), with a minimal code sketch after the list:
- Data Cleaning
- Data Pre-processing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Creating Training, Testing, Validation Datasets
- Selecting ML Algorithm
- Training
- Evaluating Model’s Performance
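To make this concrete, here is a minimal sketch of that skeleton using scikit-learn. The CSV path, column names, and choice of RandomForestClassifier are placeholders you would swap for your own data and algorithm:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load and clean the data ("data.csv" and "target" are placeholders)
df = pd.read_csv("data.csv").dropna()

# Feature/target selection
X = df.drop(columns=["target"])
y = df["target"]

# Create training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Select and train an ML algorithm
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model's performance on unseen data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```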
After training, if the model's inference on unseen data satisfies you, you can move the model to a production server.
The training and testing code above remains the same across all ML projects; what changes is just the ML algorithm. There will be variations in each step depending on your data types and the problem you are trying to solve, but the top-level view remains the same across all projects. I hope you see the point: to write this kind of model-training code, you need not gain deep skills in Python. I have rarely used object-oriented or functional programming in Python for my ML projects.
Contrast this with other programming languages, where you could never develop high-performing software without using the advanced features. And in ML, what does high performance mean? Even if model training takes some extra time, I am okay with it; what matters to me is the outcome, that is, the model's accuracy. Second, using functional and object-oriented code would make it less readable for trainee developers. Thus, I always stick to basic, simple code that even a novice can easily understand.
You may even find ready-to-use templates for 400+ ML algorithms. Using such services spares you even the work of writing the skeleton code.
The next skill is the choice of a machine learning library.
Machine Learning Library
scikit-learn is the choice of many ML developers. It provides tons of tools for predictive modeling and analysis. You will need Pandas, NumPy, and Matplotlib/Seaborn to supplement sklearn: NumPy and Pandas for creating machine-understandable datasets and for feature/target selection, and Matplotlib and Seaborn to aid in EDA.
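As a small illustration of how these libraries divide the work, here is a hedged EDA sketch; the file path is a placeholder for your own dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Pandas: load the data and inspect its structure
df = pd.read_csv("data.csv")      # placeholder path
print(df.describe())              # summary statistics per column
print(df.isna().sum())            # missing values per column

# Seaborn/Matplotlib: quick EDA plots
sns.heatmap(df.select_dtypes("number").corr(), annot=True)  # feature correlations
plt.show()
sns.pairplot(df)                  # pairwise feature relationships
plt.show()
```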
The biggest challenge in any ML project is the selection of an appropriate algorithm that would eventually give you the best predictions on unseen data. Here comes the AutoML to your aid in selecting the best algorithm.
AutoML
For algorithm selection, use AutoML. AutoML is a tool that applies several ML algorithms to your dataset and ranks them by their performance. With AutoML, you do not need to develop the expertise of hard-core data scientists. That is a lot of saving in the learning process: you just need to learn the concepts behind the various algorithms and the problems they solve.
Even a beginner in ML can learn to use these tools, which help in selecting the best-performing model with all its hyper-parameters fine-tuned. There is a huge list of AutoML and hyper-parameter tuning libraries; I have compiled a comprehensive one for your quick reference.
AutoML Library/Tools:
Auto-sklearn, TPOT, AutoKeras, MLBox, AutoGluon, Auto-WEKA, H2O AutoML, Auto-PyTorch, Amazon Lex, DataRobot, TransmogrifAI, AzureML, Ludwig, Darwin, Google AutoML, AdaNet
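As one example from this list, here is a minimal sketch using TPOT's classic API; the built-in digits dataset keeps it self-contained:

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over many algorithms and pipelines automatically
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

print("Test accuracy:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # emits the winning pipeline as Python source
```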
A recent addition to this space, which is worth investigating, is BlobCity AutoAI. Its most interesting feature is that it spills out the project source code for the generated model, which a data scientist can then claim as his own creation.
Hyper-parameter Tuning Library/Tools:
Optuna, Hyperopt, Scikit-optimize, Ray Tune, RandomizedSearchCV, GridSearchCV, Microsoft NNI
I do not want to endorse or recommend one for you; you can try them out for yourself. Suffice it to say that in ML development, algorithm selection and hyper-parameter fine-tuning are the most arduous tasks, and these tools have made them easy.
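For contrast, this is what manual tuning looks like with scikit-learn's GridSearchCV; the model and the deliberately small parameter grid are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A small illustrative grid; real grids are usually much larger
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Exhaustively evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```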
Next comes the selection between statistical modeling and artificial neural networks.
The Choice between Classical ML and ANN
Here I refer you to my earlier article, Modern Data Scientist's Approach to Model Building, published on Medium, or you may watch the YouTube presentation on the @blobcity channel, which discusses the approach data scientists take to model building.
If you prefer neural networks over classical ML, the top selection amongst developers is TensorFlow. It is not just an ANN development library but a complete platform: it allows you to create data pipelines, use distributed training to reduce training times, run on CPUs/GPUs/TPUs depending on the resources you have, and deploy on on-prem servers, in the cloud, and even on edge devices. For more details, I refer you to my book, which also contains 25+ real-life projects based on ANNs.
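To give a flavour of the library, here is a minimal Keras sketch of a small classifier on MNIST; the layer sizes and epoch count are illustrative:

```python
import tensorflow as tf

# Load and scale the built-in MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected network for 10-class classification
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```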
Another popular ANN development tool is PyTorch.
Transfer Learning
Training a DNN requires enormous computing resources, several weeks of training time, and datasets that run into millions of images or terabytes of text. Many tech giants have created DNN models to solve certain types of problems, models that no individual in this world could build alone.
Here is a comprehensive list for your quick reference.
Image Classification Datasets: ImageNet, CIFAR, MNIST
CNN Models for Computer Vision: VGG19, Inceptionv3 (GoogLeNet), ResNet50, EfficientNet
NLP Models: OpenAI’s GPT-3, Google’s BERT, Microsoft’s CodeBERT, ELMo, XLNet, Google’s ALBERT, ULMFiT, Facebook’s RoBERTa
Luckily for us, these companies have made these models publicly available. As a data scientist, you can reuse these models as-is or extend them to meet your business requirements. We call this Transfer Learning. Each model was trained for a specific purpose, which you need to understand before transferring its learning to your own model.
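Here is a minimal transfer-learning sketch in Keras, reusing ResNet50's ImageNet weights as a frozen feature extractor; the input shape and the five output classes are placeholders for your own task:

```python
import tensorflow as tf

# Load ResNet50 pre-trained on ImageNet, without its classification head
base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False   # freeze the transferred layers

# Attach a new head for your own task (5 classes is a placeholder)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) on your own, much smaller dataset
```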
Now comes a significant shift that changes the way we have been developing models for many years.
Dynamic Models
So far, we have been developing ML models on static data. In today's pandemic situation, do you think models trained on static data about two years ago will help you predict your customers' visit patterns and their new buying trends? You will need to re-train the model on new data. This means that all models developed on static data must be re-trained periodically, which consumes resources, time, and effort, and comes at an additional cost.
So the concept of dynamic modeling came up: you re-train the model continuously on streaming data. The number of data points per unit of time that you collect in the stream decides the window size, and the solution can then boil down to a time-series analysis. One major advantage of dynamic modeling is that you can work with limited resources for model training. Static modeling, in contrast, requires you to load tons of data points into memory, which also results in long training times.
Developing models on streaming data requires additional skills. Have a look at PySpark. You will also need to integrate your model with a web framework: since you train the model continuously and use it for immediate, real-time predictions, the framework must be capable of producing fast responses.
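A full streaming stack is beyond a short snippet, but scikit-learn's partial_fit shows the core idea of updating a model window by window instead of retraining from scratch; the stream here is simulated with random data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])   # all classes must be declared on the first call

# Simulate a stream arriving in windows of 100 data points each
rng = np.random.default_rng(42)
for _ in range(50):
    X_batch = rng.normal(size=(100, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)   # toy labelling rule
    model.partial_fit(X_batch, y_batch, classes=classes)

# The model is usable for real-time predictions at any point in the stream
print(model.predict(rng.normal(size=(3, 5))))
```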
In my next article, I will cover dynamic models in more depth.
Here comes a disruptive innovation.
MLaaS
Finally, I must mention MLaaS, which is probably set to replace data scientists. Tech giants like Amazon, Google, IBM, and Microsoft provide complete Machine Learning as a Service.
You just need to upload your data to their servers. They do the entire data pre-processing, model building, evaluation, and so on. In the end, they give you a model ready to move to their production server. Though this is a threatening situation for data scientists, I would still recommend that data scientists get accustomed to these new technologies and provide quick solutions for their employers. After all, satisfying your employer helps you retain your job.
Summary
I have given you a comprehensive view of the skills you need to be a top-notch data scientist in the current state of Data Science. With technological advancements, you do not need to master every skill. For example, hundreds of statistical ML algorithms have been designed so far; you may now use tools like AutoML to select the best-performing one for your business use case. Even as far as the programming language is concerned, you have seen that you do not need to be an expert to become an ML developer.
The success of ANNs has opened a new chapter in ML development, and several things are now greatly simplified. Creating data pipelines, automatic feature selection, unsupervised learning, and the ability to train networks on terabytes of data are just a few advantages of ANN technology. As a data scientist, you need to gain these skills if you have not done so already. Using DNNs and pre-trained models has further simplified the data scientist's job.
Another important requirement that has come up, perhaps because of the current pandemic, is the need to develop dynamic models. As a data scientist, you are now expected to develop models on streaming data. Developing dynamic models on high-frequency data is a subject of current research.
Finally, you may opt for MLaaS for a quick solution.
Good luck and strive to be on top of the ladder – be the most sought-after data scientist.
You may like to watch my presentation on YouTube.