Data Analysis of 10,000 AI Startups

Extracting insights from AngelList companies

Introduction

AngelList is a place that connects startups to investors and job candidates looking to work at startups. Their goal is to democratize the investment process, helping startups with both fundraising and talent. Whether you want to find a job, find investors for a startup, or just make connections, it’s a platform that everyone in the tech field should be aware of. Since the website was created in 2010, more than 4M companies, 8M investors and at least 1M candidates have registered on it.

With machine learning growing so quickly, I wanted to take a look at the AI startups on the platform and run an exploratory data analysis of them and their fields of activity. How big is the investment in the AI sector? How do AI startups scale? What markets are the most promising for them?

Data Extraction

To find terms commonly related to machine learning, a nice tool to use is sense2vec, by explosion.ai. It’s a neural network model that read every comment posted to Reddit in 2015 and built a semantic map using word2vec and spaCy. You can search for a word or phrase and get the words most similar to it (I even use it to look up synonyms once in a while). So I typed in machine learning and came up with terms like:

  • Data Science
  • Natural Language Processing
  • Computer Vision

And dozens more. After filtering some terms out, I used the remaining ones as queries in AngelList’s search box.

The web scraper was built with Selenium and Beautiful Soup. It creates a driver that opens the URL (https://angel.co/companies), clicks on the search bar and types a specific query. Then it scrolls through every company in the list and stores its data. Since the website limits results to 400 companies per search, I used filters and increased the number of queries to make sure I’d get almost all the companies related to each term.

Angel Scraper

After removing duplicates, the result was a CSV file containing 10,139 unique data points, comprising features like:

  • 'name' → Name of the company
  • 'joined' → Date that the company joined AngelList
  • 'type' → Company type (Startup, Private Company, Incubator…)
  • 'location' → City where the company is based
  • 'market' → Company’s field of activity (E-Commerce, Games…)
  • 'pitch' → Company’s slogan
  • 'raised' → Amount raised by the company through investments
  • 'tech' → Main programming language (Python, JavaScript…)

Data Analysis

Before looking for insights in the data, I had to clean and pre-process it to make it useful for analysis. That included steps like formatting dates, normalizing text and converting money strings to floating-point numbers. After that, I used the Geopy library to extract coordinates from the location column, so that we can work with latitudes and longitudes later on. Here’s a sample of the processed data frame:
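The money and date steps could look roughly like the sketch below. It is an illustration only, not the notebook code: the input formats ("$1,250,000", "$1.2M", "Jan 17") and the helper names parse_money and parse_joined are assumptions, since the raw strings are not shown in this post.

```python
import re
from datetime import datetime

# Assumed multipliers for suffixes such as "$1.2M"; the real site format may differ.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

# A number with optional thousands separators and decimals, then an optional suffix.
MONEY_RE = re.compile(r"(\d[\d,]*\.?\d*)\s*([KMB])?\b", re.IGNORECASE)


def parse_money(text):
    """Convert a money string such as "$1,250,000" or "$1.2M" to a float.

    Returns None when the cell is empty or contains no number.
    """
    if not text:
        return None
    match = MONEY_RE.search(text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return number * SUFFIXES.get(suffix, 1.0)


def parse_joined(text, fmt="%b %y"):
    """Parse the 'joined' column; the "Jan 17" style format is an assumption."""
    return datetime.strptime(text.strip(), fmt)


raised = [parse_money(s) for s in ["$1,250,000", "$1.2M", "", "N/A"]]
joined = parse_joined("Jan 17")
```

Returning None for empty cells keeps companies without a disclosed amount out of later sums and medians instead of counting them as zero.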

Now, there are many things we could do with a data frame like this. Let’s start by checking on the programming languages those companies are using.

Wow! That’s a huge difference. Python is one of the most used languages when it comes to machine learning and it looks like a clear favorite among AngelList’s AI startups. Note that we are only comparing the top technologies as listed by AngelList, so other important programming languages are not included.

We could rearrange this data by date joined and check the growth of each of those technologies over the last few years:

Tech growth by year

Python is growing, indeed. It’s an amazing high-level, general-purpose language with an extensive range of powerful libraries, and it’s probably the best-known language in data science and machine learning.

Back to our analysis, let’s take a look at the market frequency now. Which are the most common ones?

Market distribution

Nice. Although some of them are too general (like B2B and SaaS) and others could fit in the same category (like Big Data Analytics and Big Data), we get a good comparison of the existing sectors.

Let’s try something more interesting. We can group our data by market and sum up the raised values to see how much money, in total, was invested in each sector:

Total investment by market

Those are the 20 markets with the highest investment. That doesn’t necessarily mean they have the largest number of funded companies. Let’s take a look at the companies that raised the most:

  • Airbnb → $10.3 billion (Hotels)
  • Netscape → $4.2 billion (News)
  • Nest → $3.3 billion (Internet of Things)
  • Palantir → $2.1 billion (Analytics)
  • Grail → $1.7 billion (Diagnostics)

That explains the enormous investment in the hotels market. One or two huge companies can weigh heavily on the total sum of investments. Maybe taking the median investment of each market could give us a different outcome:

Median investment by market

Those are the 10 markets in which the median investment is highest. The Hotels market is not even there anymore. Still, there may be other approaches that lead us to more revealing results.
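The gap between the two rankings is easy to reproduce on a toy example. The sketch below uses made-up numbers, not the scraped data, and plain Python so it runs without dependencies; with a pandas data frame the equivalent would be something like df.groupby("market")["raised"].agg(["sum", "median"]).

```python
from collections import defaultdict
from statistics import median

# Toy rows of (market, amount raised in USD); values are invented for illustration.
rows = [
    ("Hotels", 10_000_000_000.0),  # one giant outlier
    ("Hotels", 200_000.0),
    ("Hotels", 300_000.0),
    ("Analytics", 4_000_000.0),
    ("Analytics", 5_000_000.0),
    ("Analytics", 6_000_000.0),
]


def summarize_by_market(rows):
    """Return {market: (total_raised, median_raised)}."""
    amounts = defaultdict(list)
    for market, raised in rows:
        amounts[market].append(raised)
    return {m: (sum(v), median(v)) for m, v in amounts.items()}


summary = summarize_by_market(rows)

# Rank markets by total and by median; the outlier dominates only the first ranking.
by_total = sorted(summary, key=lambda m: summary[m][0], reverse=True)
by_median = sorted(summary, key=lambda m: summary[m][1], reverse=True)
```

Here the toy Hotels market leads by total because of a single company, but drops behind Analytics once the median is used.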

First, let’s count the number of funded companies in each market, instead of summing the amount invested. Second, it would be nice to have that comparison made between investment ranges. For instance, how many Mobile Advertising companies received an investment that ranges from 1 to 10 million dollars?

For that, I built an interactive chart in which you can click the buttons to switch between ranges (up to 1 Million, from 1 to 10 Million, and so on). For each button, you get a bar plot with the number of companies that raised an amount within that range.

Number of invested companies by market

That’s a much more complex analysis and can give investors and founders a deeper insight into how those markets behave in relation to an investment scale. In which markets is it easier to raise money if you are in the first stage (seed)? And in which ones have companies become billion-dollar unicorns?
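The counting behind a chart like this can be sketched in a few lines. This is an assumption-laden illustration rather than the chart's actual code: only the first two ranges are named in this post, so the remaining edges and labels are guesses, and the sample rows are invented.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of each range in USD. The first two ranges come from the text;
# the last two are assumed for the sake of the example.
EDGES = [1e6, 1e7, 1e8]
LABELS = ["up to 1M", "1M to 10M", "10M to 100M", "over 100M"]


def bucket(raised):
    """Map an amount to its range label (upper edges are inclusive)."""
    return LABELS[bisect_left(EDGES, raised)]


def count_by_range(rows):
    """Return {range_label: Counter({market: number_of_companies})}."""
    counts = {label: Counter() for label in LABELS}
    for market, raised in rows:
        if raised is None or raised <= 0:
            continue  # skip companies with no recorded investment
        counts[bucket(raised)][market] += 1
    return counts


# Invented sample rows of (market, amount raised).
sample = [
    ("Mobile Advertising", 2e6),
    ("Mobile Advertising", 5e6),
    ("Mobile Advertising", 5e5),
    ("Games", 3e8),
    ("Games", None),
]
counts = count_by_range(sample)
```

Each button in an interactive chart would then simply plot one of these counters, for example counts["1M to 10M"].most_common(20).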

Depending on the device you’re using to read this post, you may not be able to use the interactive chart. Otherwise, feel free to interact with it and draw your own conclusions.

Using the amount invested per year for each sector, we could even compare how some markets have evolved since 2011.

Market growth since 2011

Then we can check the average investment by stage:

Average investment by stage

For some reason, Series A shows a lower average investment than Seed. Let’s take a look at the total amount invested over the last few years (in millions of dollars):

Total investments

We clearly see that 2012 was the year when AngelList exploded, probably together with a growth in venture capital financing and an increasing number of startups worldwide. The next plot shows the number of startups registered on the website per year.

Number of startups per year

Finally, we can use the coordinates extracted from location with Geopy to build a cluster map showing the worldwide distribution of those startups. The result is an interactive map that looks like this:

Cluster Map

That’s a location map for every single one of those 10,000 companies. Even if it’s a small sample, it is a pretty good representation of how technology is distributed across countries.

To make it, I used the Folium library and saved the output as HTML. If you want to interact with the map, just go to my GitHub repository → click here, download cmap.html and open it on your computer.

Click on the clusters to open up smaller clusters and click on those to see the companies. If you click on a single company, you’ll get the link to its website.

The picture below shows a heat map (_hmapweighted.html) weighted by the investment amount. In other words: where does the AI money go?

Heat Map

That’s not even half of what we could do with a data set like that. More insights could be obtained from the number of employees (size of the company) and the companies’ lifetimes, and even the pitches could be analyzed using NLP. For now, let’s just check the most common words used in startup slogans.

Word Cloud
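The word counts behind a cloud like this can be computed with a few lines of plain Python. The sketch below is an illustration, not the code used for the image: the stop-word list is a small made-up one and the sample pitches are invented.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "for", "of", "to", "in",
    "your", "with", "that", "is", "on",
}


def top_words(pitches, n=10):
    """Return the n most frequent non-stop-words across all pitches."""
    counts = Counter()
    for pitch in pitches:
        if not pitch:
            continue  # skip companies without a slogan
        words = re.findall(r"[a-z']+", pitch.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return counts.most_common(n)


# Invented sample pitches.
pitches = [
    "AI for the enterprise",
    "The data platform for AI teams",
    "Data-driven marketing",
    None,
]
common = top_words(pitches, n=2)
```

The resulting (word, count) pairs are the kind of frequency table a word-cloud renderer takes as input.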

What else could you extract? Contact information for founders, co-founders and investors, for example. Web scraping is amazing, and together with data analysis and machine learning, it becomes an incredibly powerful tool.


If you want access to the maps, data or notebooks, just go to my GitHub repository → click here, or leave a comment below. Feel free to leave any observations, concerns or ideas, and thank you for reading this post.

