Startup Funding, Investments, and Acquisitions

An Exploratory Data Analysis

Published in

Towards Data Science

15 min readJan 19, 2019

The following is a fun little data analysis project I decided to do recently mainly as a way to practice SQL and do a little exploratory data analysis (EDA). On my internet travels, I came across Mode Analytics and saw they had a really nice interface for writing SQL queries as well as a selection of data sets to choose from ( The interface is called Mode Studio and I am pretty sure the data is kept in a PostgreSQL database). Another great feature is that you can integrate SQL analysis with Python notebooks as well. I am a big fan of this as I was struggling to find other online resources where I could easily access some free SQL databases.

I had three main goals in mind when writing this post:

Practice my SQL skills and implement a fun mini project
Outline the typical workflow of a data analysis project focusing on initial data exploration. (at least a typical workflow for me anyway).
Provide a tutorial for those trying to learn either SQL or Python and show that you can generate some really powerful and interesting insights from a little bit of EDA and some relevant questions.

I really enjoyed analyzing this dataset and found that the more I explored it the more questions I came up with but to keep this post relative concise and on track I am going to come with 3 main themes/questions (below) that I want to explore and get some insights on. I encourage anyone following along to come up with their own questions as there are many more then what I discuss here.

Data set and Initial Questions

The data set we are going to look at today is the Crunchbase data set. It contains funding, acquisition and investment information on over 40,000 startups going back to the early 1900s. I will be using the tutorial versions of the data sets and as far as I can tell they are basically the same except the tutorial version is a bit shorter then the original and some of the columns have been cleaned which is good news for us as we do not have to spend time cleaning the data (ordinarily, however, this is a vital step in any analysis and I would recommend getting comfortable cleaning different types of data). The database consists of three tables:

The main aspects of the data I want to explore are the trends in funding, investment, and acquisitions. To be a bit more explicit I am going to come up with 3 general questions to try and answer. Doing so will help focus our analysis and make our ultimate goals a lot clearer.

What are the key features of funding by industry and across countries?
Do acquisitions tend to be made within the same industry or across industries? i.e. Is there any evidence of vertical or horizontal integration?
Is there any evidence of investment bubbles?

OK so we have come up with a few questions to explore and there is actually quite a bit of analysis we could do here. Initially, I will query the database to answer some of these questions and of course to actually get the data. I will also be using Python to do some further analysis and visualizations. Mode makes switching between the two pretty seamless by converting the results of our queries into Python objects and allowing us to access them from a notebook instance. To get an overview of the data I recommend checking out the links above for each table.

Exploratory Data Analysis (EDA)

Funding

I am just going to just jump straight in and figure out whether we can answer our first question. Well, we can break it down a bit since there are a number of parts to this question. Let’s first look at the average amount funded, total funding and the number of companies funded by industry and limit the results to the top 10.

SELECT  category_code,
        COUNT(*) AS number_of_companies,
        AVG(funding_total_usd) AS avg_funding,
        SUM(funding_total_usd) AS total_funding
FROM tutorial.crunchbase_companies
WHERE funding_total_usd IS NOT NULL AND category_code IS NOT NULL
GROUP BY 1
ORDER BY 4 DESC
LIMIT 10;

The results here are pretty interesting but perhaps not that unexpected as the industries receiving the greatest total funding all appear to be tech related such as biotech/software/e-commerce etc. There is also a pretty close correlation between the industries appearing in all 3 charts below (The charts were all made using mode studio btw). I provide a link at the end of this post to the full report that contains interactive graphs for those wanting to have a look. The highest average amount of funding appears to be in the cleantech sector which is consistent with a push towards producing technology that has less of a negative environmental impact. Ostensibly this is a very welcome development given the stringent carbon targets that many countries have.

What kind of insights do we get if we examine this a little closer? Well, if we look at funding over time we arrive at a different conclusion. From Figure 3 we can see that there was a significant increase in total funding from 2002 until 2006 peaking at around $8 billion. Post-2006, however, there was a sharp reduction in funding in the sector. This appears to have the hallmarks of an investment boom and bust or in other words a bubble. After doing a bit of research it appears that there were a number of factors that could have contributed to this boom including tax breaks and loan guarantees for clean tech around this period. There were also a number of factors that could have contributed to the fall in funding in Cleantech in the subsequent years such as falling fuel prices and of course the financial crisis beginning in 2007/2008. This provided less of a financial incentive to put money into cleaner technology and fuel sources and also likely caused general investor uncertainty given the prevailing economic climate.

Well, this is a nice start to the analysis and appears to go some way in answering both question 1 and 3. Let's see what other interesting insights we can glean from the data.

SELECT category_code,
       DATE_PART('year', first_funding_at::DATE) AS year,
       SUM(total_funding_amt) AS sum_funding_amt
FROM tutorial.crunchbase_companies
WHERE category_code = 'cleantech' AND first_funding_at IS NOT NULL 
GROUP BY 1,2
ORDER BY 2 ASC;

Figure 3: Total Funding in cleantech Sector

Now that we have looked at a specific industry how about we get a bit more general and look at the companies across all industries that received the cumulative highest level of funding over the years and see if we recognize any of them. So am I sure most of you recognize these companies with the likes of Facebook, Groupon, Twitter, Verizon etc (Figure 4). We could also look at what kind of time period the top 10 companies were funded over. Were they given a lump sum once or were they funded over a prolonged period of time? We can do this by simply subtracting the first_funded_at from the last_funded_at variable to get the difference in days. We can also calculate the number of rounds of funding that each received and use this to calculate the average level of funding per round.

Figure 4: Top 10 Companies by Total Funding

SELECT  name,
        funding_total_usd,
        TO_DATE(first_funding_at, 'DD/MM/YY') AS first_month,
        TO_DATE(last_funding_at, 'DD/MM/YY') AS last_month,
        (TO_DATE(last_funding_at, 'DD/MM/YY') - TO_DATE(first_funding_at, 'DD/MM/YY')) AS days_difference,
        funding_rounds,
        funding_total_usd/funding_rounds AS funding_per_roundFROM tutorial.crunchbase_companies
WHERE funding_total_usd is NOT NULL
ORDER BY 2 DESC
LIMIT 10;

We can see the results of this query below in table 1. So what does this table tell us? It looks like there were 3 companies, Sigmacare, Carestream and Terra-Gen Power who had a single round of funding. The rest of the companies in the top 10 seemed to have had multiple rounds of funding with an average of 5. These companies also received their funding spread out over a number of years. Apart from the 3 companies who went through only one round of funding, the average funding per round is about $567 million a staggering amount of money. This compares to an average of just below $8 million for the rest of the startups with more than one funding round (query below).

with t1 as (SELECT funding_total_usd/funding_rounds AS funding_per_round
FROM tutorial.crunchbase_companies
WHERE funding_rounds > 1)
SELECT AVG(funding_per_round) FROM t1;

We can also plot the distribution of the funding which will give us a nice visual overview of the funding amount. For plotting purposes, I have removed a number of very large investments including the figures so we could better visualize the rest of the data (Python code below). The average is right around $16 million while the median is $5 million indicating there are still significant outliers within this sample.

fund_dist = fund_dist[fund_dist['funding_total_usd'] < 2.050000e+08]
plt.figure(figsize=(16,6))sns.distplot(fund_dist['funding_total_usd'], bins = 30)

Figure X: Distribution of Funding Amount

OK now that we have a reasonably good overview of the funding data lets see if there is a relationship between the amount of funding that a company receives and whether or not they are still around? Let’s have a look using Python. First, let’s explore the status of companies. The code below gets the raw counts and percentage of companies in each category. The vast majority of the companies are still operating (82%) but there are also 8% who have closed over the period. Looking at table 3 we can see that the highest average level of funding is generated by IPO’s (Initial Public Offering on the stock market) which is interesting given it is the least common of the categories in the status column.

datasets[3]['status'].value_counts()
datasets[3]['status'].value_counts(normalize = True)datasets[3].groupby('status')['funding_total_usd'].agg(['mean', 'sum'])

Table 3: Average and Total Funding by Status.

Given that the highest average funding is raised by IPO’s, let’s take a quick peek at the top 10 IPO’s by funding raised. There are some familiar faces here again such as Verizon, Facebook, and Tesla all of which generated substantial amounts of funding from their IPO. I have also calculated the funding as a proportion of the total generated by all IPO’s in the data set and was very surprised to see that the top 2 generated 16% of the total. There was obviously a lot of investor interest in these companies at the time.

datasets[3][datasets[3]['status'] == 'ipo'][['name', 'category_code', 'funding_total_usd']].sort_values(by = 'funding_total_usd', ascending=False)[:10]

So what can we say about the difference in companies who are still operating and companies who are now closed? Well from the information above we can say that the companies who are now closed received less funding on average and less funding in total. So was it just a funding problem? Probably not. There are likely a host of other reasons these companies are no longer operating. I tried to find some difference between the characteristics of operating and closed companies such as the number of funding rounds they received and the countries they operated in but didn't really find any stark differences. Unfortunately, it seems like we are not going to find any definitive answer to this particular question. This is actually a pretty important point,

It is not always possible to find the answers you want with the data that you have. This is one of the reasons we carry out exploratory data analysis as it allows us to figure out what questions we can feasibly expect to find answers to.

Doing an initial exploratory analysis of the data is a very important step in an analytical project as it can help formulate realistic goals and also help the analyst set and manage expectations for the outcomes of the project.

The final thing I want to look at in the funding table is the distribution of total funding across the US as I suspect a large proportion of them are in the US, particularly in California. This is where Python excels. I have used Plotly in a previous post and this is another perfect opportunity to use it. Below is the Python code to get the data we need from our object that mode created and then plot a nice interactive Chloropleth (I had to attach a PNG as there is a view limit on the free tier of Plotly).

import matplotlib.pyplot as plt
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True) #do not miss this line
from plotly import tools
comp_df = datasets[3] # object created by mode
temp_plot = comp_df[['funding_total_usd', 'country_code', 'state_code']].dropna()
grouped_data = temp_plot.groupby('state_code').sum().reset_index()
data = [ dict(
 type='choropleth',
 locations = grouped_data['state_code'],
 autocolorscale = True,
 z = grouped_data['funding_total_usd'],
 locationmode = 'USA-states',
 marker = dict(
 line = dict (
 color = 'rgb(255,255,255)',
 width = 2
 )
 ),
 colorbar = dict(
 title = 'Millions USD'
 )
 ) ]
layout = dict(
        title = 'Total Funding for startups across the US',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
fig = go.Figure(data = data, layout = layout)
py.offline.iplot(fig)

It looks like my suspicions were correct, the most intense red colour is over California indicating that the highest levels of funding occurred there primarily due to a high density of tech companies in Silicon Valley. There are also substantial levels of funding in other areas of the US such as New York, Massachusetts, and Maryland. This is also perhaps not surprising given the numerous universities and technology-based companies located in these states. Now that we have looked at the US why don't we have quick look at at the global distribution of funding? (I will leave out the python code here for brevity but it is very similar to the code above.)

Again the bulk of the money is located in the US, however, there are also high levels of funding in China, India and the UK even though it may be somewhat hard to see from the graph above (this is more obvious in the interactive report that I link at the bottom).

Acquisitions

We have spent quite a bit of time analyzing the companies table but what about the other tables? What have acquisitions looked like over the years? The main objective of this section is to look at the types of acquisitions being made i.e. are acquisitions being made across industries or within the same industry. This can help us get a sense of the levels of horizontal and vertical integration (these are important concepts in competition economics). We could start out by looking at some of the biggest acquisitions and seeing whether they were in the same or different industries. This could give us a bit of insight on the level of concentration in each industry.

The first thing we can do is look at the general trends in the number of acquisitions over time and also examine the biggest acquisitions in the data set. This is relatively straightforward to do in PostgreSQL. The first query is pretty straightforward and just returns the top 10 acquisitions ordered by the price. The second query returns the number of acquisitions each year as well as the accumulated number using a window function.

SELECT  company_name,
        company_category_code,
        acquirer_name,
        acquirer_category_code,
        acquired_year,
        price_amount
FROM tutorial.crunchbase_acquisitions
WHERE price_amount IS NOT NULL
ORDER BY price_amount DESC
LIMIT 10;------------------------------------------------------------------SELECT  acquired_year,
        AVG(price_amount),
        COUNT(*),
        SUM(COUNT(*)) OVER (ORDER BY acquired_year) AS total_count
FROM tutorial.crunchbase_acquisitions
WHERE price_amount IS NOT NULL AND acquired_year > 1995
GROUP BY 1
ORDER BY 1;

Looking at the top 10 acquisitions by total price, it seems that the majority of acquisitions are across industries rather than within the same industry. This suggests that there is not a lot of horizontal integration going on as the companies purchasing and being purchased tend to be in separate industries, at least for the biggest acquisitions. Interestingly, there seems to be a number of PR firms being acquired by other companies (Table 2). Could this be something to do with branding or companies trying to improve their image? We can also see that nearly all of these acquisitions have been made over the last 10 years and nearly all of the companies have been acquired by a company in a tech related industry. These observations are pretty consistent with what we have seen so far.

Table 6: Top 10 Acquisitions by Price ($)

The graph below corresponds to the second query above and highlights the proliferation of acquisitions starting in the 2000’s up to 2014 (last year in the data set). Taking all of this together, it is clear that there has been a substantial amount of financial activity in these industries. Overall, the growth in the number of acquisitions is quite volatile over time but there does seem to be a general trend for fewer acquisitions from 2010 onwards (apart from 2013). Interestingly, although the numbers have fallen, the size of acquisitions in terms of average cost has been rising over the same time period. In other words, there has been a trend towards larger acquisitions in more recent years. Could this signify an increase in monopoly power for certain companies or industries?

Figure 7: Acquisitions over time — number and amount paid

Investments

In this section, I want to have a brief look at the investments table. I start off by plotting the distribution of the amount raised for each investment to give us a better sense of the spread of the data. Figure 9 displays a histogram of the investment value over the entire time period. We can see from the figure that the majority of investments seem to be at around $12 million. There are also a number of outliers as well. Let's examine what companies received very large investments and check what type of funding round it was. The results reveal that pretty much all of the biggest investments were post-IPO. This presumably means that the investments were stock purchases.

SELECT funding_round_type, raised_amount_usd
FROM tutorial.crunchbase_investments
WHERE funding_round_type IS NOT NULL AND raised_amount_usd IS NOT NULL
ORDER BY 2 DESC;------------------------------------------SELECT funding_round_type, COUNT(*)
FROM tutorial.crunchbase_investments
WHERE funding_round_type IS NOT NULL AND raised_amount_usd IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;-----------------------------------------SELECT company_name, company_category_code, investor_name, investor_category_code,
funding_round_type, raised_amount_usd, funded_at
FROM tutorial.crunchbase_investments
WHERE raised_amount_usd IS NOT NULL
ORDER BY 6 DESC LIMIT 15;

What about for the rest of the sample? Were they funded by equity, debt or was it perhaps angel investors? The total count of investments by type of funding rounds is plotted in Figure 10. Series A funding appears to be the most frequent type of funding received by companies followed by venture capital and angel investing. As the name suggests, series A is one of the first types of funding a start-up can receive. After this usually comes B and C rounds. The fact that series A funding is likely more frequent than the subsequent rounds as many start-ups do not actually succeed and therefore do not make it the next round of funding. Please see here for a more in-depth explanation of the types of funding rounds.

with t1 AS
  (SELECT company_name, investor_name,  funded_year ,investor_category_code,
          company_category_code,
          SUM(raised_amount_usd) AS total_raised,
          rank() OVER(PARTITION BY investor_category_code
                      ORDER BY SUM(raised_amount_usd) DESC) AS rank
   FROM tutorial.crunchbase_investments
   WHERE raised_amount_usd IS NOT NULL
     AND investor_category_code IS NOT NULL
   GROUP BY 1,
            2,
            3,
            4,
            5
   ORDER BY 5,
            3 DESC)
SELECT *
FROM t1
WHERE rank < 2
ORDER BY 6 DESC
LIMIT 15;

The query above shows us the top investment for each industry and what industry is receiving the investment. We have only selected the top 15 investments for the purposes of presenting the data. The top 15 is dominated by the mobile industry. There is a huge influx of money going towards the mobile industry which begs the question, is this going to one company or multiple companies and why the massive interest in this sector? Upon further examination, it appears that there are large investments in Clearwire in 2008 which is essentially what we saw in Table 1. Why all the interest in Clearwire? There were a number of companies investing in Clearwire in 2008, including Sprint Nextel, Google, Intel, and Comcast among others and appears to have been used to launch 4G wireless technology.

Main Takeaways

There does not seem to be evidence of horizontal integration, particularly with the biggest acquisitions. Thus, there is no obvious evidence of a lack of competition. This is of course not the final say as some industries may already be concentrated. This suggests that at least it hasn't become more concentrated as a result of takeovers.
There does seem to be a general trend towards larger acquisitions which could be indicative of companies increasing monopoly power.
We did find some evidence of investments bubbles, specifically in the cleantech sector. Is there evidence of other bubbles? I will leave that to you guys to find.
The majority of the money has flowed into the US, particularly into California which is really what we would have expected given the proliferation of tech companies in recent years in Silicon Valley.
The highest levels of funding have mainly been concentrated in tech-related industries such as cleantech, biotech, mobile etc.
Without a doubt, the type of funding which generates the most money is the IPO.

Thanks for reading everyone, hope you all found this interesting and learned a little something and see you all in the next post.

Link to report on Mode Analytics