
Introduction
As a data scientist, you might be eager to start working with the data and running models right away. But I have learned that this is not the best approach: you may realize halfway through the project that you are on the wrong track.
The first step in a machine learning project is to gain a good understanding of the data and the problem statement. You won’t be able to build good models without understanding the data and the patterns that lie in it. This is the step where you apply critical thinking: list your questions and hypotheses, then find the answers to them.
Below are several key questions to ask:
(1) What is the main objective of this machine learning project?
- It is important to align with the company/client on the business goals so that you know the direction of the project, what the end goal should be, and what the exit criteria are.
- Understand the purpose of the project and the value it can deliver to the organization.
(2) Are the data collected sufficient to answer your question and reach the objective of the project?
- Having the right dataset is essential to reaching the end goal. There is no point in having the best solution or the best approach if you do not have the right dataset to answer your questions. Identify whether the dataset collected is relevant to the use case. In addition, determine whether it is sufficient: is there enough historical/training data for the model to train on?
- For example, suppose the goal is to predict house prices in a particular state, but the previous years’ data have large gaps or missing values. This will affect the model’s predictions unless steps are taken to obtain the right data or to pre-process the gaps so that the data reflect the right trend.
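As a rough illustration of the sufficiency check described above, the following sketch computes the share of missing prices per year and flags the years that may need attention. The sample records, the function names, and the 30% threshold are all assumptions made for illustration; they are not from a real dataset and should be adapted to your own data.

```python
from collections import defaultdict

# Hypothetical sample records: (year, sale_price); None marks a missing price.
records = [
    (2018, 250_000), (2018, None), (2018, None), (2018, 265_000),
    (2019, 270_000), (2019, 280_000), (2019, None), (2019, 275_000),
    (2020, 300_000), (2020, 310_000), (2020, 305_000), (2020, 295_000),
]

def missing_share_by_year(rows):
    """Return {year: fraction of rows in that year whose price is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for year, price in rows:
        total[year] += 1
        if price is None:
            missing[year] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}

def years_needing_attention(rows, threshold=0.3):
    """List years whose missing share exceeds an assumed, adjustable threshold."""
    shares = missing_share_by_year(rows)
    return [year for year, share in shares.items() if share > threshold]

shares = missing_share_by_year(records)
flagged = years_needing_attention(records)
print(shares)   # missing share for each year
print(flagged)  # years above the threshold
```

A year that is flagged here is a prompt to ask where the gaps come from and whether better data can be obtained, not an automatic reason to drop it.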
(3) Who is the audience for your analysis?
- It is important to know who will be reading the report and how the information in it will be useful to them. Are they from the technical team? Board members? Marketing team? Sales team? Data analysts? These groups have different levels of technical and statistical understanding, so it is best to tailor your report to your audience to ensure that your findings are conveyed accurately.
- For example, most of my projects’ audience consists of business users with minimal or no technical knowledge who are only interested in the final forecasted value. Therefore, presenting the results in a user-friendly, interactive dashboard is useful for them, as they can select a particular week’s forecast and view the result immediately.
(4) When does the forecast/report need to be delivered?
- Knowing when the forecast or report needs to be ready is important because it helps you schedule the process run and determine which range of data is suitable to use.
- For example, suppose you are working on time-series data to forecast weekly sales demand and the forecast needs to be delivered every Tuesday. You will need to investigate whether the dataset for the latest week is generated and pre-processed by Tuesday for the model to predict on and, if not, what adjustments are required.
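The readiness check in the Tuesday example can be sketched as a small date calculation. This is only a sketch under stated assumptions: sales weeks are assumed to run Monday to Sunday, and the function names and dates are hypothetical.

```python
from datetime import date, timedelta

def last_completed_week_end(delivery_day):
    """Return the most recent Sunday strictly before the delivery day.

    Assumes sales weeks run Monday to Sunday.
    """
    # date.weekday(): Monday is 0, ..., Sunday is 6
    days_since_sunday = (delivery_day.weekday() + 1) % 7
    if days_since_sunday == 0:
        # Delivery falls on a Sunday: use the previous Sunday instead.
        days_since_sunday = 7
    return delivery_day - timedelta(days=days_since_sunday)

def is_data_ready(latest_data_date, delivery_day):
    """True if the dataset already covers the last completed week."""
    return latest_data_date >= last_completed_week_end(delivery_day)

delivery = date(2024, 1, 9)  # a Tuesday
print(last_completed_week_end(delivery))          # 2024-01-07
print(is_data_ready(date(2024, 1, 7), delivery))  # True
print(is_data_ready(date(2023, 12, 31), delivery))  # False
```

If the check returns False on delivery day, that is the signal to decide on an adjustment, such as forecasting from the previous week's data or moving the delivery schedule.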
(5) Do you have the domain knowledge that relates to this project?
- Yes, you may be a data scientist with technical skills, but domain knowledge often helps in the data analysis process: it gives you a better understanding of the data and lets you ask the right questions. Knowing the industry that the project relates to allows you to understand the business process and engineer specific features that can boost model performance. Therefore, doing additional research to better understand the industry behind your use case will definitely be beneficial.
- Domain knowledge can also help in determining the right set of features for the models, as you will know what drives performance. For example, stock market prices can be affected by many factors such as interest rates, GDP, and bond prices, so a background in economics can be an advantage when developing a model to forecast future stock prices.
(6) Do you understand the variables in the dataset?
- This relates back to the previous question: domain knowledge helps you understand the meaning of the attributes in the dataset and whether the variables can be strong features for predicting the target value. If you hit roadblocks in understanding the variables, it is best to do some research and to determine who the right person is to reach out to for help in filling the gaps in your understanding of the dataset.
- For example, helping a client in the financial industry build machine learning models while having no understanding of the terms used in the financial datasets is likely to give you a hard time. In this scenario, it is good to spend some time learning about the financial industry: what the common terms are and which key concepts you should understand. These few additional steps will help in the overall machine learning process.
(7) What are the risks/impacts to the organization if the project fails?
- Data science projects are not always successful; they can fail for reasons such as insufficient data. Determine what impact a failure would have on the organization, and identify whether there is another use case the project can pivot to that still leverages the knowledge already built.
Conclusion:
Data science is not purely about building models and tuning their parameters. It is a combination of having the right domain knowledge, understanding the data, aligning with the project goals, and being able to communicate your findings to the end user. Once you have the right understanding of the data, you can move on to the next step: a deeper data exploration that looks at the overall statistics of the datasets and gives you an idea of how the data should be pre-processed and what feature engineering steps are required.
The questions listed in this article are some of the key ones, and there can be more. If you have good suggestions on other questions to ask, list them in the comments below.
Lastly, thanks for reading this article! 😃