
What I Learned From My Data Analyst Internship at a Series-A Startup

This article documents 22 lessons and thoughts from my data analyst internship, hoping to help you become a better data analyst at a startup.

Photo by FunNow

As I’m about to fly back to the US, I think it would be a good time to reflect on my last five months as a business data analyst intern at FunNow, a Series-A marketplace startup in Taipei, Taiwan.

Background

My name is John (Yueh-Han) Chen and I'm currently a Computer Science sophomore who specializes in the intersection of data science, product analytics, and user growth. (Say hi to me on LinkedIn.)

Due to the outbreak of Covid-19 in the US in 2020, with everything moving online, I decided to return temporarily to my home country, Taiwan, which was much safer than America last year. At first, Covid seemed like a disappointment, since I lost the chance to meet my friends, classmates, and professors in person. However, every coin has two sides: it was actually an opportunity for me, since people were still going to their offices and companies were still hiring in Taiwan. In contrast, many US firms had canceled their internship programs because of Covid.

Luckily, I found that FunNow, one of the fast-growing Taiwan-based startups I admire, was hiring interns. Without a second thought, I applied for the role, spent several days refreshing my SQL, Python, and Excel/Spreadsheet skills, and went through the interview process. Fortunately, I landed the internship 🙂

My home city, Taipei, Taiwan. – Photo by Timo Volz on Unsplash

So, what is FunNow?

FunNow is an instant booking platform for leisure and entertainment activities in several Asian countries, including Taiwan, Japan, Hong Kong, and Malaysia, with the ambition to expand to all the big cities worldwide. So far, it has accumulated around 1.5 million downloads and 150k monthly active users.

Taiwan startup FunNow gets $5M Series A to help locals in Asian cities find last-minute things to…

What did I do during my internship?

I was on the user operations team, and the overall goal of the team was to improve user retention, which would in turn increase GMV. So everything I participated in was directly or indirectly aimed at improving user retention, including building weekly dashboards, writing SQL to query data, analyzing and segmenting VIP users, training ML models to extract actionable insights, researching the customer loyalty program, designing experiments, and conducting cohort analyses.

Since those are a lot of different tasks, I decided to group the 22 lessons into 7 categories:

  1. Data Analytics: 4 Lessons
  2. Writing SQL: 5 Lessons
  3. Machine Learning: 3 Lessons
  4. Dashboard building: 3 Lessons
  5. Experiment Design: 2 Lessons
  6. Communication/Presentation: 3 Lessons
  7. General Topics: 2 Lessons

Okay, enough with the contextual information. Here are my 22 lessons, takeaways, and thoughts from this 5-month data analyst internship. Please enjoy.

4 Lessons on Data Analytics

  1. Most analyses don't add real value. Think about which direction can drive the most business value first, then dive in.
Photo by NeONBRAND on Unsplash

I had done 2–3 analyses based on pure curiosity during the internship, but none drove any business value. The primary reason was the lack of a prior hypothesis or a clear sense of which direction to optimize. So, after these few non-value-adding analyses, I listed a few things I must do before diving into research, and they helped me a lot. One is to think from the customers' perspective about what makes the product hard to use. The other is to research what other companies have successfully optimized and see whether their cases apply to ours. Make a list of ideas generated with these methods, estimate the underlying ROI of each idea, and then dive into the analysis to verify feasibility.

Besides, reading books on growth strategy and other companies' successful growth experiences can build business intuition for choosing the right directions to explore. One book my manager recommended to me was Lean Analytics by Alistair Croll and Benjamin Yoskovitz. It covers the six most common business models and gives advice on choosing the right metric to analyze for each, which makes it an excellent book for beginners. Later, I came across Hacking Growth by Morgan Brown and Sean Ellis, which was super helpful for thinking about growth strategy and optimization. Other than that, Andrew Chen's and Brian Balfour's blogs were beneficial as well.

2. When exploring the data, use the top-down approach.

After finding a direction, you can jump into the analysis. Nevertheless, there is still a pitfall: don't jump right into the details.

The reason is that the analysis has to be presented to managers, and there is likely an information gap between you and them. A better way to handle this is to deliver the analysis from a broad view down to the details, so that they can build a complete understanding of the research. In other words, to help managers and other team members grasp the whole picture easily, present the overall situation first and then gradually zoom in to the micro-scale information.

3. Make sure the tabular analysis in the spreadsheet is flexible.

Whether you use Python, R, or SQL to conduct an analysis, you might export the results to a spreadsheet to share with your team. And if you decide to present them in tabular form, you might make some structural adjustments.

As I mentioned earlier, your team members might want to check the data or conduct further analysis. If the table's structure has been heavily adjusted by hand, it becomes very difficult for them to build on it (see the example below). Therefore, lesson learned: tabular data should have as few manual structural changes as possible.

This is an example where I had manually over-adjusted the structure, so it was hard for other members to work further with this data. (These are not real data.) – Image by author

  4. If the analysis can be done with a pivot table, use it.

If the data in Google Sheets need a structural change, don't do it manually. Use a pivot table instead. There are three reasons: it's simple, it's easy to check, and the original data remain intact. Besides, if a simple analysis task can be done with a pivot table, why still use Python or R?
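
To show what I mean outside of Google Sheets, here is a minimal sketch of the same idea in pandas; the column names (city, category, order_value) are made up purely for illustration:

```python
import pandas as pd

# Hypothetical order data; column names are invented for illustration.
orders = pd.DataFrame({
    "city": ["Taipei", "Taipei", "Kaohsiung", "Kaohsiung"],
    "category": ["bar", "spa", "bar", "spa"],
    "order_value": [900, 1500, 700, 1200],
})

# A pivot table aggregates without touching the original rows,
# which is exactly why it keeps the source data intact.
summary = orders.pivot_table(
    index="city", columns="category",
    values="order_value", aggfunc="sum", fill_value=0,
)
print(summary)
```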

5 Lessons on Writing SQL

  1. The most important thing in writing excellent SQL is a good understanding of the database schema.

Before writing SQL, you need to understand which tables have the data you want, the default data types, and the keys. If a column is categorical and uses abbreviations, you should know what each abbreviation means. Most important of all is understanding the relationships between tables, which will help you write efficient SQL.

The very first thing that surprised me during the internship was that FunNow has lots of tables in the database (70+). The complexity of the schema varies with the business and product: some companies have 100+ tables, whereas some have only 30. I had practiced many of LeetCode's SQL interview questions, most of which need no more than 3 tables in a query. However, at FunNow, using 5+ tables in a query was super common. Therefore, I would say the most essential thing at the beginning of an analyst job is to fully study the data schema to grasp how the data are structured and connected.
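
To make this concrete, here is a rough sketch of what chaining several tables can look like once you know the keys. Everything here (table names, columns, values) is hypothetical and only mimics the shape of a marketplace schema; I'm using pandas merges as a stand-in for a multi-table SQL join:

```python
import pandas as pd

# Tiny invented tables that mimic a marketplace schema.
users = pd.DataFrame({"user_id": [1, 2], "city_id": [10, 20]})
cities = pd.DataFrame({"city_id": [10, 20], "city": ["Taipei", "Kaohsiung"]})
orders = pd.DataFrame({"order_id": [100, 101], "user_id": [1, 2]})
order_items = pd.DataFrame({"order_id": [100, 100, 101],
                            "product_id": [7, 8, 7],
                            "price": [900, 1500, 700]})
products = pd.DataFrame({"product_id": [7, 8], "category": ["bar", "spa"]})

# Knowing the keys lets you chain the joins confidently,
# the same way a 5-table SQL query would.
joined = (order_items
          .merge(orders, on="order_id")
          .merge(users, on="user_id")
          .merge(cities, on="city_id")
          .merge(products, on="product_id"))
print(joined.groupby(["city", "category"])["price"].sum())
```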

2. Don't reinvent the wheel.

In most cases, unless you are the first data scientist or data analyst in the company, somebody has probably written similar SQL queries already. So the quickest way to get the hang of querying data is to cultivate the habit of searching for existing code before writing your own. It will save you a lot of time.

3. When sharing your analysis, make sure you attach the original SQL code used to query the data.

SQL Code – Image by author

Two reasons. The first is that when a query is complicated, your colleagues might need to check the accuracy of your code. The second is that other team members or your manager might want to conduct further analyses based on your work. Attaching your SQL helps your team move on to the next step faster.

4. Develop the ability to check whether your SQL query results are accurate.

If querying incorrect data leads you to make the wrong recommendations in your analysis, your credibility within the team takes a big hit. Being able to verify the correctness of your data is one of the most critical responsibilities of a DA or DS. In startups, because of the shortage of manpower, people might not check your code line by line. Therefore, the ability to self-check your code is extra vital in startups.

Usually, there are several ways to get comparable data. I would think about what could be wrong and use other tables to cross-check. Another method is to use historical data to judge whether the query result makes sense.
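
As a small illustration of the idea, here's a sketch that computes the same number from two independent sources and compares them; the extracts and table names are invented:

```python
import pandas as pd

# Hypothetical extracts pulled by two independent queries:
# one from a raw orders table, one from a daily revenue summary table.
orders = pd.DataFrame({"order_date": ["2021-04-01"] * 3,
                       "amount": [900, 1500, 700]})
daily_summary = pd.DataFrame({"date": ["2021-04-01"], "gmv": [3100]})

total_from_orders = orders["amount"].sum()
total_from_summary = daily_summary["gmv"].iloc[0]

# If two independent paths to the same number disagree by more than a
# small tolerance, something in the query (or the data) is off.
assert abs(total_from_orders - total_from_summary) < 1e-6, "Totals disagree!"
print("Sanity check passed:", total_from_orders)
```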

5. Minimize the SQL running time, but don’t over-optimize.

Some tables may have millions of rows, and a nuance in syntax can add minutes to the computation time. Especially when queries need to run every day or very frequently, they should be optimized.

I think there are two directions to explore. One is that you may have chosen an inefficient table. I once wrote a query that could be built on either of two tables: one with 2 million rows and the other with 100k rows. I chose the 2-million-row one since I thought the code would be more straightforward, given that it needed fewer related tables for filtering. However, it took 4 more minutes to run. The second direction is to tune the syntax itself.

8 Ways to Fine-Tune Your SQL Queries (for Production Databases) | Sisense

However, some queries might only need to run once, or once every few months. If it would take you more than 3 hours to optimize away one minute of run time, just leave it as is.
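
If you want a quick, rough way to judge whether an optimization is worth the effort, simply timing the candidate queries works. Here's a toy sketch using an in-memory SQLite database; the table names and sizes are invented and only meant to show the timing pattern:

```python
import sqlite3
import time

# In-memory toy database; tables and row counts are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_log (user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE daily_user_summary (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO order_log VALUES (?, ?)",
                 [(i % 1000, 1.0) for i in range(200_000)])
conn.executemany("INSERT INTO daily_user_summary VALUES (?, ?)",
                 [(i, 200.0) for i in range(1000)])

def timed(sql):
    # Run a query and report how long it took.
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, rows

# Same answer, two source tables of very different sizes.
t_big, _ = timed("SELECT user_id, SUM(amount) FROM order_log GROUP BY user_id")
t_small, _ = timed("SELECT user_id, SUM(amount) FROM daily_user_summary GROUP BY user_id")
print(f"large table: {t_big:.4f}s, pre-aggregated table: {t_small:.4f}s")
```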

3 Lessons on Machine Learning

  1. When using a decision tree, bin the most contributing numeric variables into a small number of groups so they aren't split into endless branches at odd split points.

When I was training a decision tree model, I fed in a highly contributing numeric variable (called OrderPoints, see the picture below) whose values range from 0 to 10,000. Because it is strongly correlated with the target variable, the model split it into many mini branches with odd split points like 8881.5 and 7495. The way it cut was far too arbitrary, making the tree hard to interpret and extract insights from.

A part of the decision tree. – Image by author

My colleague advised me to preprocess this data by dividing it into 5 categories: 0–2000 becomes category 1, 2001–4000 becomes category 2, and so on. This turned the numeric variable into a categorical one, and I could then use one-hot encoding to create 5 corresponding columns. That solved the odd-split-point problem!
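
Here is a minimal sketch of that preprocessing step in pandas; the OrderPoints values are simulated, and the bin edges follow the 0–2000, 2001–4000, and so on scheme described above:

```python
import numpy as np
import pandas as pd

# Simulated OrderPoints values in the 0-10000 range described above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"OrderPoints": rng.integers(0, 10_001, size=1_000)})

# Bin the numeric variable into 5 fixed categories so the tree can't
# invent arbitrary split points like 8881.5.
bins = [0, 2000, 4000, 6000, 8000, 10_000]
df["OrderPointsBucket"] = pd.cut(df["OrderPoints"], bins=bins,
                                 labels=[1, 2, 3, 4, 5], include_lowest=True)

# One-hot encode the 5 buckets into 5 corresponding columns.
one_hot = pd.get_dummies(df["OrderPointsBucket"], prefix="OrderPoints")
print(one_hot.head())
```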

2. When selecting the most contributing variables, I should have chosen random forest or XGBoost.

One of the reasons I was building the decision tree model was to find the most contributing variables, since a tree shows the main splitting variables at its top. However, a single decision tree is biased, and I was a machine learning newbie at the time. If I could do it again, I would choose a random forest model and use its feature_importances_ attribute, or choose XGBoost and use xgb.plot_importance() to find the importance scores, since both random forest and XGBoost are ensemble methods. This means that many trees are trained and their results averaged, which is less biased than a single tree.

Feature Importance and Feature Selection With XGBoost in Python – Machine Learning Mastery
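
Here is a small sketch of that ensemble-based approach on synthetic data; with real data you would swap in your own feature matrix and target, and XGBoost's xgb.plot_importance() gives the same kind of ranking:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for real features; only a few are truly informative.
X, y = make_classification(n_samples=2_000, n_features=8,
                           n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# An ensemble averages many trees, so its importance scores are less
# biased than those read off a single decision tree.
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importance = pd.Series(model.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))
```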

3. Using domain knowledge to generate new features will become more critical in the future as many ML processes will be replaced by AutoML-like products.

When using real data to build ML models, I found the most exciting part was creating new features, since it requires intuition, domain knowledge, business sense, and sometimes creativity to think of potential features that could bring value but are not directly available in the database.

Take Netflix, for example. A DS might want to understand what factors lead to renewing the subscription. The features a DS might create include the ratio of finished movies (the number of finished movies divided by the number of movies clicked) or a per-user tag assigned by a K-means model based on each user's go-to show categories, to name a few. As each company has a different business model, the business questions a DS wants to explore vary widely, making the task of creating new features hard for machines to replace.
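
Here is a rough sketch of what those two hand-crafted features could look like in code; the columns and the Netflix framing are illustrative only:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-user viewing log, mimicking the Netflix example above.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "movies_clicked": [40, 10, 25, 60],
    "movies_finished": [30, 2, 20, 15],
    "pct_drama": [0.7, 0.1, 0.5, 0.2],
    "pct_comedy": [0.2, 0.8, 0.3, 0.1],
    "pct_documentary": [0.1, 0.1, 0.2, 0.7],
})

# Feature 1: completion ratio = finished movies / clicked movies.
users["finish_ratio"] = users["movies_finished"] / users["movies_clicked"]

# Feature 2: a taste segment assigned by K-means over genre preferences.
genre_cols = ["pct_drama", "pct_comedy", "pct_documentary"]
users["taste_segment"] = KMeans(n_clusters=2, n_init=10,
                                random_state=0).fit_predict(users[genre_cols])
print(users[["user_id", "finish_ratio", "taste_segment"]])
```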

After this internship, I spent some time researching whether automated ML products like AutoML will replace data scientists in the future. The summary of my research: although these automated ML products can outperform humans at feature selection, feature preprocessing, model selection, and hyper-parameter tuning, data scientists who work with AutoML-like products will be freed from those tasks, have more time to think about the business, and apply their judgment to improve it faster (KDnuggets, "The Death of Data Scientists – will AutoML replace them?").

The human-judgment parts include defining the business problems, applying domain knowledge to generate more valuable features, and extracting actionable insights. In short, the lesson I learned from this post-internship research is that, as a future data scientist, I should devote a good amount of time to the business side, choose an industry to focus on, and build my domain knowledge over time.

The Death of Data Scientists – will AutoML replace them? – KDnuggets

3 Lessons on Dashboard building

  1. Communicate frequently and make drafts before building dashboards.

My very first task in this internship was to build weekly dashboards. Among all the factors in creating great dashboards, I think the most important one is frequent communication with your audience, who may be your teammates, manager, or executives. At FunNow, we used Metabase, an open-source BI server, to build dashboards. Each chart in a dashboard is built from a SQL script, so ten charts mean ten SQL scripts behind the curtain. (We also used a more popular tool, Mixpanel, which is much more efficient for analyzing and building dashboards but not as flexible as Metabase. I personally prefer Tableau since it's both flexible and efficient, but it's too expensive for startups.)

If the final dashboard is not what your end users want, you might waste time arranging unwanted layouts or, if you're using Metabase, writing many unnecessary scripts. Therefore, the most critical principle in creating dashboards is to draft and communicate with your audience frequently before building.

This is one of the drafts I used to communicate with my manager, making sure it contained all the necessary data and the right charts. The data are imaginary, of course. – Image by author

2. If two pieces of data combined indicate important underlying information, put them side by side so they can be viewed together easily.

For example, one of the dashboards I built was to track weekly GMV (Gross Merchandise Value). GMV equals ARPPU (Average Revenue Per Paying User) times Paid UU (paying Unique Users), and also equals OrderCnt (Order Count) times AOV (Average Order Value). Intuitively, I first laid them out vertically: ARPPU first, then Paid UU, then OrderCnt, and then AOV.

But it was really hard to see the dynamics between ARPPU and Paid UU or between OrderCnt and AOV. So, if two pieces of data combined indicate important underlying information, put them side by side so they can be viewed together easily, like in the picture below.

Since GMV = OrderCnt * AOV, putting these two side by side lets viewers easily see which one caused the change in GMV. The data are imaginary, of course. – Image by author
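
As a tiny numeric sketch of why the side-by-side pairing helps (all numbers made up):

```python
# Made-up weekly figures to show how the decomposition reveals the driver.
last_week = {"order_cnt": 10_000, "aov": 800}
this_week = {"order_cnt": 10_100, "aov": 880}

for week, m in [("last week", last_week), ("this week", this_week)]:
    print(week, "GMV =", m["order_cnt"] * m["aov"])

# Comparing the two factors side by side shows AOV (+10%) drove most of
# the GMV change, while order count barely moved (+1%).
print("order count growth:", this_week["order_cnt"] / last_week["order_cnt"] - 1)
print("AOV growth:", this_week["aov"] / last_week["aov"] - 1)
```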

3. The growth of the current period compared to the average of previous periods is more straightforward than just showing an average line.

For example, it can be this week's WAU (Weekly Active Users) divided by the average WAU of the past 8 weeks. This is better than just showing an average line or a trend line on the chart, where viewers have to mentally calculate how much this week has grown relative to the past 8 weeks' average. With the growth rate of the current period against the average of previous periods, the audience can see how much has changed right away.
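
A minimal sketch of that metric (the WAU series is invented):

```python
import pandas as pd

# Invented weekly active user counts; the last value is the current week.
wau = pd.Series([52_000, 53_500, 51_800, 54_200, 55_000,
                 53_900, 56_100, 57_300, 60_200])

current = wau.iloc[-1]
baseline = wau.iloc[-9:-1].mean()  # average of the previous 8 weeks

growth = current / baseline - 1
print(f"WAU growth vs. 8-week average: {growth:+.1%}")
```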

2 Lessons on Experiment Design

1. Make experiments as simple as possible.

In March and April, our marketing department ran a campaign that suddenly brought in many first-time VIP users. The user operations team, the team I was on, wanted those users to become not just one-time purchasers but long-term FunNow users. Therefore, I designed an experiment in which first-time VIPs would get an exclusive discount once they finished ordering two VIP-only product items. The rationale was that if they could experience the benefits of VIP status as quickly as possible, they might be more likely to want to keep it.

We are a startup, and VIP status is recalculated every month. So, to keep this cohort of first-time VIPs, the experiment had to be executed before the end of the month.

However, this deadline was exactly what made the experiment less successful than we expected. Since we started the test in mid-April, some first-time VIPs had already purchased one VIP item, and some hadn't bought anything. Because of these two groups of first-time VIPs, we needed to set up two in-app and push notification tracks in CleverTap, a push notification tool. Each track had 5 to 6 steps: the first step triggered when a user opened the app, then an in-app notification popped up, then we waited until the user purchased a VIP item, then sent a push notification, and then repeated the process until they finished purchasing two items. All of this made the experiment more complicated.

Photo by Nicolas Thomas on Unsplash

At the end of April, we found that the experiment kept 10% more first-time VIPs in the second month than the historical baseline. But we also found that about 40% of the group that had already purchased one VIP item never received our in-app push content, meaning they didn't even enter the push notification track. This indicated that we could have retained even more users if the experiment had had no glitches. Lesson learned: make experiments as simple as possible.

2. If you can use data to back up your hypotheses, use it.

The first step of designing an experiment is to form a hypothesis, which can come from data exploration, user interviews, other companies' experiences, psychology principles, etc. Yet startups usually lack manpower, so we have to choose the tests that can presumably bring the highest ROI. Therefore, if a hypothesis rests only on qualitative observation, it is hard to convince data-driven managers to approve the experiment. So the more a hypothesis can be supported by data evidence, the better.

3 Lessons on Communication/Presentation

1. Explain the data sources first.

When giving a presentation to a data team, you should explain the data sources for a few reasons. The first is that they all understand the data schema and can check whether your analysis matches the data sources. The second is that they might recommend other data sources you didn't consider but that would help your research. The third is that some senior colleagues on your team might know about data that isn't documented in the schema but would help your project.

2. Compare numbers with industry standards.

Photo by Chris Liverani on Unsplash

When you present data, attach comparative figures. Without them, it is hard to tell whether a particular number is high or low. For example, is 30% retention after the first 30 days high or low? According to Localytics, 30% is normal for gaming companies, whereas it is low for social media companies (Bhargava, "Getting the Most out of Push Notifications to Improve User Retention").

3. Always clearly explain the denominators when presenting percentages.

My manager frequently asked me, "Wait, John, what does this denominator mean?" When presenting a ratio, make sure to explain the meaning of the denominator, since the nuance usually hides in the context. Do you mean all logged-in users? Or paying users? Or paying users within a particular time range?
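
A tiny illustration of how much the denominator changes the story (all numbers invented):

```python
# Invented counts for one week.
converted = 1_200        # users who completed the action being measured
logged_in_users = 40_000
paying_users = 6_000

# The same numerator yields very different "rates" depending on the
# denominator, so the denominator always needs to be stated explicitly.
print(f"rate over logged-in users: {converted / logged_in_users:.1%}")  # 3.0%
print(f"rate over paying users:    {converted / paying_users:.1%}")     # 20.0%
```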

2 Lessons/thoughts on general topics

  1. Systems thinking: when forming strategies, consider how they would affect other teams. If you're not sure, ask!

When the strategies you form require collaboration with other teams or departments, you need to think systematically about whether the task would be too much for them, or whether they should focus on more important tasks already on hand. If you're not sure, just talk to them. I didn't consider this crucial perspective at first, which made the first few proposals I presented seem a bit naive.

2. I was too immersed in my tasks. I should have attended more meetings.

Photo by Leon on Unsplash

I remember reading Tools of Titans by Tim Ferriss two years ago, in which Tim asked Chris Sacca, an early-stage investor in Instagram and Uber, "If working in a startup environment, what would one do or focus on to learn or improve as much as possible?" Chris replied, "Go to all meetings you can, even if you're not invited to them, and figure out how to be helpful. If people wonder why you're there, just start taking notes." I largely failed to follow this advice, as I was too wrapped up in my tasks and projects. It wasn't terrible, but I still think that if I could do it over again, I would force myself to attend more meetings.

Wrapping Up

Taipei – Photo by TangChi Lee on Unsplash

I tried to include as many thoughts as possible in one article, but there are still many valuable ideas I didn't include, as the article would be way too long. Generally speaking, the most valuable thing I gained from this internship is the hands-on experience: analyzing real data, designing experiments, building ML models, extracting business insights, executing, analyzing the results, and communicating with team members to get feedback. These are impossible to experience when building side projects alone with Kaggle's sample data. This internship also made me believe even more firmly that using data to drive business value is what I'm genuinely passionate about and want to pursue in my future career!

Thank you, FunNow, and my brilliant colleagues Peggy, Yuan, and Steven, for the amazing experience!

If you've read this far, please send me a connection request on LinkedIn! I enjoy meeting new people and exploring how we can work together in the future.



