Want to be an Eagle or a Kaggle data scientist?

There is no doubt that Kaggle is a great place to learn data science. Many data scientists invest a lot of time in Kaggle, and that is fantastic.
But you should not rely only on Kaggle to learn data science skills.
Here are the reasons why:
Data science is not only about prediction
Kaggle focuses only on problems that require predicting something. However, many real-world problems are not about prediction at all.
For example, many companies want to know the most common paths that lead to customer churn. These problems require you to understand different data types and customer touchpoints such as web navigation, billing, call-center interactions, and store visits. You then need to identify important events, such as excess billing or navigation errors. Once the events are identified, you apply path algorithms to find the common paths that lead to churn. Such problems cannot be solved with predictive algorithms alone; they require algorithms that can construct a timeline from events.
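As a minimal sketch of the idea, here is how event timelines and common churn paths could be extracted; the customer events below are entirely made up for illustration:

```python
from collections import Counter

# Hypothetical touch-point events: (customer_id, timestamp, event).
# In a real project these would come from web logs, billing systems,
# call-center records, and store visits.
events = [
    ("c1", 1, "excess_billing"), ("c1", 2, "call_center"), ("c1", 3, "churn"),
    ("c2", 1, "navigation_error"), ("c2", 2, "call_center"), ("c2", 3, "churn"),
    ("c3", 1, "excess_billing"), ("c3", 2, "call_center"), ("c3", 3, "churn"),
    ("c4", 1, "store_visit"), ("c4", 2, "purchase"),
]

# Build a time-ordered event path per customer.
paths = {}
for customer, _ts, event in sorted(events):
    paths.setdefault(customer, []).append(event)

# Count the most common paths that end in churn.
churn_paths = Counter(
    " -> ".join(path) for path in paths.values() if path[-1] == "churn"
)
for path, count in churn_paths.most_common():
    print(count, path)
```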
Likewise, there are many other problems that require skills beyond predictive modelling. It is nice to know how to solve predictive problems, but as a data scientist you are expected to solve multiple types of problems, and the real world offers a far wider range of them. You will have to look outside Kaggle to develop the skills for real-world data science challenges.
You will not develop skills on Graph Algorithms
Social network analysis, influencer prediction, community analysis, fraudster network analysis: all of these are very interesting analytical problems that a data scientist is required to solve. They require knowledge of graph algorithms such as PageRank, modularity-based community detection, shortest paths, eigenvector centrality, and many more.
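To give a taste of this genre, here is a minimal sketch using the networkx library (assuming it is installed) on a classic small social network:

```python
import networkx as nx

# Zachary's karate club, a classic small social network.
G = nx.karate_club_graph()

# PageRank: which member is most "important" in the network?
pagerank = nx.pagerank(G)
print("Top node by PageRank:", max(pagerank, key=pagerank.get))

# Eigenvector centrality: influence weighted by influential neighbours.
centrality = nx.eigenvector_centrality(G)

# Shortest path between two members.
print("Shortest path 0 -> 33:", nx.shortest_path(G, source=0, target=33))

# Modularity-based community detection.
communities = nx.algorithms.community.greedy_modularity_communities(G)
print("Communities found:", len(communities))
```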
Network and community problems are rare on Kaggle. Graph problems require data in the form of nodes and links, which is not how most Kaggle datasets are provided.
Of course, you can sometimes reformulate a problem to use graph algorithms, but that is rare. The absence of such competitions represents a huge gap between Kaggle and the kinds of problems data scientists are expected to solve in the enterprise.
You will not put effort into Algorithm Explainability
Explainability of algorithms is becoming very important. You can have a very sophisticated approach and the most complex algorithms, but if the algorithm cannot explain how it arrived at a prediction, that is a big problem in an enterprise. Such inexplicable algorithms are called "black box" algorithms.
Black box algorithms can cause a loss of confidence in their use, and they can also create legal problems. Say, for example, you have developed a highly accurate ensemble of algorithms to predict credit risk. In production it starts scoring credit applications, and some people receive a low score. People whose applications are rejected have, by law, a right to know why. If the algorithm's decision cannot be explained, that creates a legal problem.
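One common way to peek inside a model is permutation importance. Here is a minimal sketch with scikit-learn; the credit features and data are entirely made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Toy credit data: three invented features.
feature_names = ["income", "debt_ratio", "late_payments"]
X = rng.normal(size=(500, 3))
# Hypothetical ground truth: debt ratio and late payments drive risk.
y = (X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# How much does shuffling each feature hurt accuracy?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, importance in zip(feature_names, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```

An applicant rejected by such a model can at least be told which factors weighed most heavily, which is a first step away from a pure black box.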
In Kaggle competitions, the winner is decided by accuracy measures, not explainability. Competitors can therefore use very sophisticated algorithms to reach very high accuracy without caring about explainability at all. That approach can win a competition, but it might not be acceptable in an enterprise data science project.
Missing link to Return on Investment (ROI) analysis
Companies are investing heavily in data science skills, and they expect data science projects to deliver a return on investment (ROI). Generally, successful analytics projects are those where the data science algorithms are linked to ROI.
One such example is predictive maintenance, where the probability of equipment failure is predicted. If a machine has a 10% probability of failure, do you need to send maintenance staff to inspect it? Probably not. If the probability is 95%, the answer is yes.
In real situations, however, most probabilities come out at values like 55% or 63%, where the answer is not evident. If the company sends maintenance staff to every such machine, the cost can be huge. If it does not, there is a risk of failure.
So what should the threshold be? This is where ROI calculations play a crucial role. It is also very important for the data scientist to come up with this threshold value, which helps the company decide on concrete actions.
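A minimal expected-cost sketch shows how such a threshold could be derived; the cost figures are assumptions for illustration, not from any real project:

```python
# Assumed costs (illustrative only).
inspection_cost = 2_000    # sending a maintenance crew
failure_cost = 10_000      # an unplanned equipment failure

# Doing nothing at failure probability p costs p * failure_cost on average.
# Inspecting is worthwhile when p * failure_cost > inspection_cost,
# so the break-even threshold is:
threshold = inspection_cost / failure_cost
print(f"Inspect when failure probability exceeds {threshold:.0%}")

for p in (0.10, 0.55, 0.63, 0.95):
    action = "inspect" if p > threshold else "do nothing"
    print(f"p = {p:.0%}: expected failure cost {p * failure_cost:,.0f} -> {action}")
```

With these (invented) numbers the break-even point is 20%, which turns the fuzzy 55% and 63% cases into clear decisions.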
Kaggle does not deal with this kind of analysis. The scope of work stops at producing predictions; it does not consider how to turn the data science results into actions that lead to ROI.
You will not get exposure to Simulation and Optimization
Simulation and optimization algorithms, such as system dynamics simulation, agent-based simulation, and Monte Carlo simulation, should be in every data scientist's toolbox. Many problems, such as financial optimization, route optimization, and pricing, belong to a different genre of problems that data scientists are expected to solve.
Take price forecasting as an example. You can use machine learning techniques to predict the price of a product based on features such as season, day, location, and competitor prices. But is the price predicted by the machine learning algorithm an optimal price? Maybe not. To determine the optimal price, you first need an optimization objective; one such objective could be profit optimization. In that scenario, you need to determine a price range that yields optimal profit: the price should not be too high, so that you retain customers, and at the same time not too low, so that you keep a good profit margin.
So you need optimization algorithms to determine the optimal price range. If the predicted price falls within that range, the machine learning result can be accepted; otherwise, it should be rejected.
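A minimal sketch of that idea with scipy; the demand curve, unit cost, and tolerance below are invented for illustration:

```python
from scipy.optimize import minimize_scalar

unit_cost = 20.0

def demand(price):
    # Assumed linear demand curve: higher price, fewer units sold.
    return max(0.0, 1000 - 8 * price)

def negative_profit(price):
    # minimize_scalar minimizes, so negate the profit to maximize it.
    return -(price - unit_cost) * demand(price)

result = minimize_scalar(negative_profit, bounds=(unit_cost, 125), method="bounded")
optimal_price = result.x
print(f"Optimal price: {optimal_price:.2f}, profit: {-result.fun:,.0f}")

# Accept the ML-predicted price only if it is close to the optimum.
predicted_price = 68.0   # hypothetical output of a price-forecasting model
if abs(predicted_price - optimal_price) <= 5:
    print("Accept the predicted price")
else:
    print("Reject it and fall back to the optimized value")
```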
On Kaggle, optimization objectives such as profit are generally not given, so the problem stays restricted to machine learning and the optimization side is never explored.
Deployment and Operationalization cannot be experienced
OK, you have a model that lifts you high up the Kaggle leaderboard. But productive deployment of that model is a completely different ball game, and it cannot be experienced on Kaggle.
Productive deployment of models can involve technologies such as Docker, Kubernetes, or microservices. Data scientists are not expected to be experts in Docker and Kubernetes, but they should at least be comfortable using them. For example, data scientists are expected to contribute to building scoring pipelines that run in Docker.
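In practice, a scoring pipeline often boils down to a small web service baked into a container image. Here is a minimal sketch; Flask is one common choice (nothing here prescribes a framework), and the tiny inline model stands in for a real trained artifact:

```python
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# Stand-in model; in a real pipeline you would load a trained artifact
# that was baked into the Docker image.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

@app.route("/score", methods=["POST"])
def score():
    # Expects JSON like {"features": [1.5]}.
    features = request.get_json()["features"]
    probability = model.predict_proba([features])[0][1]
    return jsonify({"score": float(probability)})

if __name__ == "__main__":
    # Inside a container this would typically run behind a WSGI server
    # such as gunicorn, with the port published by Docker or Kubernetes.
    app.run(host="0.0.0.0", port=8080)
```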
Operationalization and deployment also involve monitoring the model's performance on a regular basis and taking corrective action when required. For example, suppose you have a product recommendation model, and at some point you observe that sales driven by recommendations are decreasing. Has the model suddenly started giving bad recommendations, or is the problem somewhere else?
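Even a simple monitoring job can catch such a drop early. Here is a minimal sketch; the weekly conversion rates and the 10% alert margin are invented for illustration:

```python
import statistics

# Hypothetical weekly conversion rates of a recommendation model
# (fraction of recommendations that led to a sale).
weekly_conversion = [0.120, 0.118, 0.123, 0.119, 0.121, 0.092, 0.088]

baseline = statistics.mean(weekly_conversion[:5])
threshold = 0.9 * baseline   # alert if more than 10% below baseline

for week, rate in enumerate(weekly_conversion, start=1):
    if rate < threshold:
        print(f"Week {week}: conversion {rate:.1%} is below the alert "
              f"threshold {threshold:.1%}; investigate model or data drift")
```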
Data scientists need to get involved in these situations to gain real, enriching experience.
So go beyond the "usual data science". Try your hand at problems that require skills such as algorithm explainability, ROI estimation, and optimization, as explained here. It will be a very enriching experience, and you will get the feeling of mastering the world of data science by solving a wide range of interesting, real-world problems.
Be an Eagle. Soar high. Take in the vantage view of the different types of data science problems. Don't be a Kaggle with a limited view.
Additional resources
Website
You can visit my website to do analytics with zero coding: https://experiencedatascience.com
YouTube channel
Here is the link to my YouTube channel: https://www.youtube.com/c/DataScienceDemonstrated