4 Steps to Be a Good Data Analyst

TL;DR: Don’t be a query monkey.

Travis Tang
Towards Data Science

--

5 months ago, I graduated from Singapore’s Nanyang Technological University and landed a role as a Data Analyst in a tech firm in Singapore. As a data rookie with little formal technical training in computer science or statistics, I was nervous about the prospect of being surrounded by data experts at work but also immensely grateful for the opportunity to learn from them.

Fast forward five months, and I have taken some time to reflect on the things I’ve learnt as an analyst.

What is a Data Analyst?

As a data analyst, I use data to quantify business problems and find insights into them. A data analyst spots trends and anomalies in data, which collectively are manifestations of a business problem. Some of these problems might include:

‘Would the launch of a new product make sense?’

‘How can we segment the users based on their behaviours?’

‘How successful is this product?’

Data analysts do not work independently. Instead, they typically work with domain experts to solve business problems. These domain experts are internal customers from various teams, which run the gamut from strategy and marketing to operations and product. As subject-matter experts, they are the front-liners who identify the problem statements to be addressed.

To quantify the problem, the data analyst mulls over the metrics relevant to the problem statement and extracts the necessary data and insights before presenting findings that address it.

Mining for insights. Photo by Albert Hyseni on Unsplash

The process of extracting insights is not unlike that of hunting for precious metals. Like the experts in precious metals who know where the metal mines are, the internal customers know what the business problem is. They introduce the data analyst to the metal mine and work with them to extract the precious metals.

Having been told the coordinates of the mines, the data analyst uses the tools at his or her disposal to dig out a mixture of precious metals and dirt (the raw data). The data analyst then removes the dirt from the mixture to obtain the purified metal (the insights and findings).

What Makes a Bad Data Analyst?

When I first started as an analyst, I was a query monkey. A query monkey is an example of a bad data analyst: it receives requests from internal customers and fulfills them without understanding the business problem. It executes SQL queries as a knee-jerk reaction and stops short of providing insights and recommendations because it is too busy running queries.

Don’t be a query monkey. Photo by Quinten de Graaf on Unsplash

To avoid the pitfall of turning into a query monkey, a data analyst should actively follow these steps, with a focus on the 1st, 3rd and 4th steps.

  1. Understand the business problem with business stakeholders.
  2. Extract raw data efficiently.
  3. Extract insights and recommendations.
  4. Present findings intuitively.

Today, we’ll talk about the role of a data analyst and walk through these steps together!

One day, you receive an email from a product stakeholder containing the following.

The product team has just built a new chatbot that shows up on the home page. Can the data team plan an experiment and tell us if the new chatbot should be launched? Attached is how the new chatbot looks.

Interface without (left) and with (right) the chatbot. Illustration by author.

Now, let’s get started.

1. Understand the business problem with business stakeholders.

An understanding of the business problem is important because it allows a data analyst to provide an analysis that is in line with the business needs. Executing a request without understanding the problem is a futile exercise at best.

I like to imagine a business problem as an iceberg. When we first encounter a business problem, we only see the internal customer’s email that briefly explains it. There is a lot more to the problem than we can observe at this point.

An iceberg analogy of a business problem. Illustration by author.

As cliché as it sounds, I found it helpful to clarify the task using the 5W1H framework.

5W1H’s of the data

  • Who is the end-user of this data?
  • What is the impact of this work?
  • When is the relevant period?
  • Where can I get the data from?
  • Why do we need this?
  • How will the data analysis be used?

After clarifying the task, we understand much more about the context surrounding the task.

(Why the request is raised) Recently, we have seen a decrease in the number of users making a purchase on the website. The product team hypothesizes that promoting the product with a chatbot will help increase the number of purchasing users.

(Where to get the data) Since this is a brand new feature, there is no existing data set on this new feature. We will need to plan an experiment and collect new data.

(Who is the customer + how will the analysis be used) This will be presented to the business team, who will decide whether this new chatbot feature makes business sense.

(What is the impact of this work) If we can increase the conversion, we can potentially increase the revenue we earn from this product by X USD.

That clarifies our thought process and sheds new light on how we can make our next move.

2. Extract raw data efficiently

To obtain raw data on whether the chatbot increases the number of users who make a purchase, we plan an experiment, or A/B test.

In setting up the A/B test, we divide the users who land on the home page into two equally sized groups, A and B. The users in group A see the old interface without the chatbot, while those in group B see the new interface with the chatbot. Our goal is then to see whether users in group B make a purchase at a higher rate than users in group A.

In planning the experiment, we determine the hypothesis to be tested and then use statistical methods to calculate the required sample size based on the significance level and the desired power of the test. To understand more, you can refer to the video below for a short introduction to A/B testing.
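
To make the sample-size arithmetic concrete, here is a rough sketch of the standard normal-approximation formula for comparing two proportions, written out as plain SQL arithmetic; in practice you would normally reach for a statistical library or an online calculator. The baseline conversion rate (10%), the minimum detectable rate (12%), the 5% significance level (z ≈ 1.96) and the 80% power (z ≈ 0.84) are all numbers I made up for illustration.

```sql
-- Sketch: required users per group for a two-proportion test,
-- n ≈ (z_alpha + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2.
-- All inputs below are hypothetical, not real experiment values.
SELECT CEIL(
    POWER(1.96 + 0.84, 2)                        -- z for alpha = 0.05 (two-sided) and power = 0.80
    * (0.10 * (1 - 0.10) + 0.12 * (1 - 0.12))    -- p1(1 - p1) + p2(1 - p2)
    / POWER(0.12 - 0.10, 2)                      -- squared minimum detectable difference
) AS users_per_group;                            -- roughly 3,800 users in each of groups A and B
```

Note that halving the minimum detectable difference roughly quadruples the required sample size, so these inputs are worth agreeing with stakeholders before the experiment starts.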

To plan the experiment, we need to extract data from the database through SQL.

After 5 months as an analyst, I have realized that the skill of producing efficient and readable SQL is extremely valuable for a data professional. In fact, SQL is the common language of the data scientist, data engineer and data analyst. A data analyst who does not know SQL fluently will have a challenging time communicating with other data colleagues and will face difficulty navigating the data landscape in an organization.

If you are an aspiring data scientist and you scoff at the thought of collecting data through queries, you might want to rethink that stance: according to a survey of 2,360 data scientists across the world by Anaconda [1], a data scientist spends up to 19% of their time loading (extracting) data.

So much data, so little time. Illustration by author.

A beginner in SQL would be familiar with the following simple SQL functions…

  • SELECT and WHERE for filtering and selection
  • COUNT, SUM, MAX, GROUP BY, HAVING for aggregating data
  • DISTINCT, COUNT DISTINCT for producing useful distinct lists and distinct aggregates
  • OUTER (e.g. LEFT) and INNER JOIN, and when/where to use each
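
As a quick sketch of how these basics fit together, here is a hypothetical query that counts purchasing users per country (the orders and users tables, their columns and the thresholds are all made up):

```sql
-- Sketch: distinct purchasing users per country for recent orders,
-- keeping only countries with at least 100 buyers. All names are hypothetical.
SELECT
    u.country,
    COUNT(DISTINCT o.user_id) AS purchasing_users   -- COUNT DISTINCT for distinct aggregates
FROM orders AS o
INNER JOIN users AS u
    ON u.user_id = o.user_id                        -- INNER JOIN keeps only orders with a matching user
WHERE o.order_date >= '2020-10-01'                  -- WHERE filters individual rows
GROUP BY u.country                                  -- GROUP BY aggregates per country
HAVING COUNT(DISTINCT o.user_id) >= 100;            -- HAVING filters the aggregated groups
```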

However, I found my work to be much more efficient once I learnt the following:

  • Temporary tables
  • Handling NULL with COALESCE
  • Sub-queries and their impact on the query’s efficiency
  • Window functions (OVER … PARTITION BY), such as LEAD and LAG
  • User defined functions
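
For instance, here is a small sketch that combines a window function with COALESCE, on a hypothetical purchases table and assuming a PostgreSQL-style dialect:

```sql
-- Sketch: compare each purchase with the user's previous one,
-- treating missing discount values as zero. Table and columns are hypothetical.
SELECT
    user_id,
    purchase_date,
    amount,
    COALESCE(discount, 0) AS discount_filled,          -- handle NULLs
    LAG(amount) OVER (
        PARTITION BY user_id ORDER BY purchase_date
    ) AS previous_amount,                               -- window function
    amount - LAG(amount) OVER (
        PARTITION BY user_id ORDER BY purchase_date
    ) AS change_vs_previous
FROM purchases;
```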

As I mentioned in my previous post How to Learn Data Science in 2020, I picked up the basics of SQL from Datacamp and Dataquest, which did not go into these intermediate concepts extensively.

Alas, when I first started, I resorted to convoluted nested queries that were a pain to read, because I did not know how to structure sub-queries cleanly. Well-delineated sub-queries (for instance, written as common table expressions with the WITH clause) make the overall query much more modular, more readable and easier to troubleshoot.

To illustrate this point, let’s compare the two queries.

The former nested query has queries within queries, while the latter query has clearly delineated queries. Illustration by author.
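
In text form, the contrast looks roughly like this; the table, columns and thresholds are invented for illustration:

```sql
-- Nested version: the logic is buried inside layers of parentheses.
SELECT user_id, total_spend
FROM (
    SELECT user_id, SUM(amount) AS total_spend
    FROM (
        SELECT user_id, amount
        FROM purchases
        WHERE purchase_date >= '2020-10-01'
    ) AS recent_purchases
    GROUP BY user_id
) AS user_spend
WHERE total_spend > 100;

-- Delineated version: each step is named and can be read and debugged on its own.
WITH recent_purchases AS (
    SELECT user_id, amount
    FROM purchases
    WHERE purchase_date >= '2020-10-01'
),
user_spend AS (
    SELECT user_id, SUM(amount) AS total_spend
    FROM recent_purchases
    GROUP BY user_id
)
SELECT user_id, total_spend
FROM user_spend
WHERE total_spend > 100;
```

Both return the same result; the second simply names each intermediate step so that it can be read, tested and reused on its own.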

Some of the resources I used to learn these intermediate functions include Zachary Thomas’ SQL Questions and Leetcode. They helped me tremendously in my role as an analyst.

3. Extract insights and recommendations

Assume that we’ve collected the data and performed the experiment. Next step: we calculate whether there is a statistically significant difference between the A/B groups. In this case, the question is

Does the difference in user interface increase the number of users making a purchase?

Assume that the results come in as follows:

The treatment group performed worse in the experiment… Illustration by author

For each of the control and treatment groups, we can calculate the conversion rate as the number of users who made a purchase as a proportion of the number of users who landed on the home page. We see that the treatment group, which received the chatbot interface, had a lower conversion rate.
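
As a sketch, both conversion rates and a simple two-proportion z-statistic can be computed directly from the experiment results. The ab_results table below (one row per group, with counts of users who landed and who purchased) is hypothetical; an absolute z value above roughly 1.96 would indicate significance at the 5% level.

```sql
-- Sketch: conversion rates and a pooled two-proportion z-statistic.
-- ab_results(group_name, landed, purchased) is a hypothetical one-row-per-group table.
WITH stats AS (
    SELECT
        SUM(CASE WHEN group_name = 'control'   THEN purchased END) * 1.0
          / SUM(CASE WHEN group_name = 'control'   THEN landed END)   AS p_control,
        SUM(CASE WHEN group_name = 'treatment' THEN purchased END) * 1.0
          / SUM(CASE WHEN group_name = 'treatment' THEN landed END)   AS p_treatment,
        SUM(CASE WHEN group_name = 'control'   THEN landed END)       AS n_control,
        SUM(CASE WHEN group_name = 'treatment' THEN landed END)       AS n_treatment,
        SUM(purchased) * 1.0 / SUM(landed)                             AS p_pooled
    FROM ab_results
)
SELECT
    p_control,
    p_treatment,
    (p_treatment - p_control)
      / SQRT(p_pooled * (1 - p_pooled) * (1.0 / n_control + 1.0 / n_treatment)) AS z_statistic
FROM stats;
```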

At this point, a query monkey will report this table and conclude that the chatbot did not help increase the number of users making a purchase… but we’ve learnt not to be a query monkey.

One of the main goals of an analyst is to provide insights. Most of the time, insights come from an explanation of the numbers that are observed in the data.

Providing data without insight is like uncovering impure gold in a gold mine. To get to the real gold (insights), we need a little more work.

At this point, the most important question that we can answer is

Why is the conversion rate of the new interface poorer than the old one?

This is an open-ended question, and it is not necessarily easy to answer. To answer it, we can break the problem down into two smaller parts: checking the validity of the data, and seeking out the differences between the old and new user interfaces.

We can further break these two steps down logically into smaller parts and seek to answer them sequentially. The following is an illustration of how we can break down the problem:

The process of breaking the problem down to get insights. Illustration by author.

Checking the validity of the data

Pre-experiment

  • Faulty experimental design: Was the experimental design correct? Did we calculate the sample size correctly? Did we use the correct test statistic?

During the experiment

  • Faulty data: Was the data collected correctly? For instance, were the clicks of both groups of users logged correctly by the database? (A sketch of one such check appears after this checklist.)

Post experiment

  • Incorrect analysis: Was the test statistic applied correctly?
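
As one example of a mid-experiment sanity check, we can confirm that both groups are actually being logged and that the split is roughly 50/50. The assignment_log table below is hypothetical:

```sql
-- Sketch: check that both groups are logged and roughly equal in size.
-- assignment_log(user_id, group_name, event_time) is a hypothetical logging table.
SELECT
    group_name,
    COUNT(DISTINCT user_id) AS users_assigned,
    MIN(event_time)         AS first_event_logged,
    MAX(event_time)         AS last_event_logged
FROM assignment_log
GROUP BY group_name;
```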

Seeking the Difference between the Old and the New User Interface

To seek out the differences between the old and the new user interface, we can scrutinize how the two differ in appearance and performance. This might provide us with an explanation for the poorer performance of the new interface. For example, we can look out for:

  • Color: Was the color in the new interface less appealing than that of the old interface?
  • Latency: Was the new interface slower than the old interface? If so, by how much?
  • Support: Was the new feature supported by the user’s phone or browser?

There are many, many more differences that we can explore.
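
One way to chase these hypotheses with data is to break the conversion rate down by segment. Here is a rough sketch on a hypothetical experiment_users table, with one row per user who landed on the home page:

```sql
-- Sketch: conversion rate by experiment group and browser, to spot segments
-- where the new interface underperforms. Table and columns are hypothetical;
-- made_purchase is assumed to be a boolean flag.
SELECT
    group_name,
    browser,
    COUNT(*)                                         AS users_landed,
    SUM(CASE WHEN made_purchase THEN 1 ELSE 0 END)   AS users_purchased,
    SUM(CASE WHEN made_purchase THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS conversion_rate
FROM experiment_users
GROUP BY group_name, browser
ORDER BY browser, group_name;
```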

Once we have found the reason that explains the difference in conversion rate, we can then make recommendations on what experiment is to be done next.

The bottom line is this: as an analyst, it helps to constantly ask yourself…

Can I explain this data further? Can I get more insights from this?

And this can be done by breaking problems down into smaller parts and answering them logically.

4. Present findings intuitively

After extracting the insights, a data analyst needs to present his or her findings. More often than not, this comes in the form of data visualizations and presentations. In this example, we will be presenting to the product and business teams our finding that the chatbot did not increase the website’s sales.

Here are some pointers that I found particularly helpful when I make presentations to non-data stakeholders.

Gently introduce the problem statement.

If the attendees do not understand the business value of the presentation, it is easy for them to shut down. Introducing the problem as an important, high-value one to address captures their attention from the start. Here, we can translate the impact of the analysis by showing the increase in revenue if the conversion rate improves.

Provide as much context to the attendees of the meeting as possible.

The attendees might not know as much about the new product as you do. As such, providing relevant background information about the product gets the attendees up to speed. In this example, we can provide pictures of the flow of the chatbot.

Explain metrics and concepts intuitively.

The attendees might not come from a technical background and may not understand some of the technical terms you use in the presentation. Avoiding technical jargon helps the attendees follow your presentation better.

Pause at appropriate points to allow questions mid-presentation.

By allowing the attendees to clarify their doubts as early as possible, we can help them follow the logic of the presentation.

These are simply guidelines and are context-specific. In a future post, I will document more good practices for data presentations.

Don’t be a query monkey.

That’s my biggest lesson from 5 months as an analyst. To avoid being a query monkey, I make sure that I follow these steps:

  1. Understand the business problem with business stakeholders.
  2. Extract raw data efficiently.
  3. Extract insights and recommendations.
  4. Present findings intuitively.

I still have plenty to learn, and I look forward to many more months in the data field ahead. Let me know what you’ve learnt as a data scientist or analyst in your first few months via the comments below. Otherwise, feel free to connect with me on LinkedIn so we can learn together!

Other Readings

If you enjoyed this blog post, feel free to read my other articles on Machine Learning:

References

[1] Anaconda (2020). “The State of Data Science 2020: Moving from Hype to Maturity”. Retrieved 5 Nov 2020.

Acknowledgement

A huge shout-out to my mentors (Geraldine and Reza) and friends (Edo, Deeksha, Aditi, Flo, Alson, Key, YC, Meenakshi, Jiale and Tommy). I wouldn’t have learnt so much as an analyst without their help and patience.
