Introduction
In this article, I will share four common mistakes in data exploration and how you can avoid them.
Exploratory data analysis (EDA) is the discovery of trends and patterns in data using graphical representations and summary statistics. Popular graphs for EDA include scatter plots, bar charts, histograms, donut charts, and heat maps. Summary statistics, which are measures that describe a dataset, include count, mean, median, standard deviation, and skewness.
EDA is one of the critical steps in the data science project life cycle and helps you understand the data better before machine learning modeling. It can also produce quick wins that create value for a business through actionable insights.
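As a rough illustration of the summary statistics listed above, here is a minimal sketch that uses only Python's standard library. The `summarize` helper and the `order_values` data are hypothetical and exist only for this example. The skewness shown is the simple population version (mean cubed deviation divided by the cubed standard deviation), and other definitions exist.

```python
import statistics


def summarize(values):
    """Return common EDA summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Population skewness: mean cubed deviation divided by std cubed.
    if std > 0:
        skewness = sum((x - mean) ** 3 for x in values) / n / std ** 3
    else:
        skewness = 0.0  # all values identical, so no asymmetry
    return {
        "count": n,
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skewness,
    }


# Hypothetical order values with one unusually large order.
order_values = [12, 15, 14, 13, 16, 15, 14, 90]
summary = summarize(order_values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the mean sits well above the median and the skewness is positive, which is the kind of signal that would prompt a closer look at the large value.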
The main goals of EDA are to identify errors in the data, gain a better understanding of the data, detect outliers, and uncover variable relationships.
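One of these goals, detecting outliers, can be sketched with the widely used interquartile range (IQR) rule. This is only one of several possible approaches, not a method prescribed by this article. The `iqr_outliers` helper, the multiplier `k=1.5`, and the `daily_sales` data are assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Hypothetical daily sales figures with one suspicious entry.
daily_sales = [12, 15, 14, 13, 16, 15, 14, 90]
flagged = iqr_outliers(daily_sales)
print(flagged)
```

A flagged value is a prompt for investigation rather than a verdict: it may be a data entry error or a genuine event worth reporting.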
Pitfalls in Exploratory Data Analysis
To achieve the stated goals of EDA, data practitioners must avoid the following pitfalls during data exploration:
1. Unclear business problems
At the core of every data science project is the business problem that needs to be solved. How should we plan for seasonal sales? What type of promotions should be offered to different customers? These are some of the questions that may be posed. However, data practitioners are not necessarily business savvy, and managers may not be data experts, which can result in poorly formulated business problems. Also, the available data may not be sufficient to answer the relevant questions.
Tips to avoid this pitfall:
- Get feedback early from stakeholders.
- Clarify requirements as soon as possible.
- Treat EDA as an iterative process requiring frequent cycling back to stakeholders.
2. Shallow insights
The desire to make quick discoveries is not entirely wrong. However, what you consider a great insight may just be "stating the obvious" to stakeholders. The question "so what?" is very common, and data practitioners must prepare for it during analysis. Imagine telling the sales team that customer A is the biggest spender this year. Yes, it is true, but they probably already know this.
Tips to avoid this pitfall:
- Tailor your findings to the business problem and how to add value.
- Answer the "so what?" question beforehand (ensure insights are actionable).
- Have some early discussions on preliminary results as you uncover insights.
3. Wrong inferences
There are several reasons why data practitioners may arrive at wrong conclusions. Common ones include a lack of domain knowledge, treating correlation as causation, and ignoring confounding variables.
Tips to avoid this pitfall:
- Expand your knowledge of the business area.
- Sharpen your statistics skills.
- Consult with the business stakeholders during analysis.
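To make the confounding-variable trap mentioned above more concrete, here is a small sketch with made-up numbers. The variables `ad_spend` and `sales`, the season split, and the `pearson` helper are all hypothetical. The idea is that a third factor (season) raises both variables, so they look strongly related overall even though there is no relationship within either season.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: season is the confounder that lifts both variables.
winter = {"ad_spend": [1, 2, 3, 4], "sales": [11, 10, 10, 11]}
summer = {"ad_spend": [11, 12, 13, 14], "sales": [31, 30, 30, 31]}

overall = pearson(
    winter["ad_spend"] + summer["ad_spend"],
    winter["sales"] + summer["sales"],
)
within_winter = pearson(winter["ad_spend"], winter["sales"])
within_summer = pearson(summer["ad_spend"], summer["sales"])

print(f"overall: {overall:.2f}")
print(f"winter only: {within_winter:.2f}")
print(f"summer only: {within_summer:.2f}")
```

In this toy data the pooled correlation is strong while the per-season correlations are zero. Checking a relationship within groups of a suspected confounder is one simple way to test an inference before presenting it.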
4. Bad visualizations
What can go wrong with a visualization? A lot! To mention just a few: a poor choice of graph, a misleading axis scale, too many colors, palettes that colorblind members of the audience cannot distinguish, and wrong units. There are plenty of resources on how to do data visualization properly. Hence, the tips for this section are tutorial links to help improve your data visualization skills.
Resources:
- https://online.hbs.edu/blog/post/bad-data-visualization
- https://www.datapine.com/blog/misleading-data-visualization-examples/
- https://www.jotform.com/blog/bad-data-visualization/
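As a quick illustration of the misleading axis scale problem named in this section, the sketch below compares how large a difference looks when a bar chart's value axis starts at zero versus at a truncated value. The `apparent_ratio` helper and the figures are hypothetical and only illustrate the arithmetic.

```python
def apparent_ratio(small, large, axis_start=0.0):
    """Ratio of drawn bar heights when the value axis starts at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    return (large - axis_start) / (small - axis_start)


# Hypothetical revenue figures: a 4% increase year over year.
last_year, this_year = 100.0, 104.0

honest = apparent_ratio(last_year, this_year)           # axis starts at 0
truncated = apparent_ratio(last_year, this_year, 98.0)  # axis starts at 98

print(f"axis from 0: second bar is {honest:.2f}x the first")
print(f"axis from 98: second bar is {truncated:.2f}x the first")
```

With the axis starting at 98, a 4% increase is drawn as a bar three times taller, which is why truncated axes on bar charts deserve a second look.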
Conclusions
In this article, we covered the common pitfalls in data exploration and how you can avoid them. In addition, we highlighted the goals of EDA and provided some resources to improve your data visualization skills.
I hope you enjoyed this article. Until next time, cheers!