Even you can build a machine learning model.

Seriously!

Good data alone doesn’t always tell the whole story. Are you trying to figure out what someone’s salary should be based on their years of experience? Do you need to examine how much you’re spending on advertising in relation to your yearly sales? Linear regression might be exactly what you need!

Linear regression looks at the relationship between the data you have and the data you want to predict.

Linear regression is a basic and commonly used type of predictive analysis, and one of the most widely used of all statistical techniques. It quantifies the relationship between one or more **predictor variables** and one **outcome variable**.

**Linear regression models** are used to show (or predict) the relationship between two variables or factors. **Regression analysis** is commonly used to show the correlation between two variables.

You could, for example, look at some information about players on a baseball team and predict how well they might do that season. You might want to examine some variables about a company and predict how well their stock might do. You might even just want to examine the number of hours people study and how well they do on a test, or you could look at students’ homework grades overall in relation to how well they might do on their tests. It’s a seriously useful technique!

Just remember: correlation is not causation! Just because a relationship exists between two variables doesn’t mean that one variable caused the other! Regression analysis is not used to prove cause-and-effect relationships. It can look at how variables relate to each other and examine to what extent variables are associated with each other. It’s up to you to take a closer look at those relationships.

The variable that the equation in your linear regression model is predicting is called the **dependent variable**. We call that one **y**. The variables that are being used to predict the dependent variable are called the **independent variables.** We call them **X**.

You can think of it as though the prediction (**y**) is dependent on the other variables (**X**). That makes **y** the dependent variable!

In **simple linear regression analysis**, each observation consists of two variables: the independent variable and the dependent variable. **Multiple regression analysis** looks at two or more independent variables and how they correlate to the dependent variable. The equation that describes how **y** is related to **X** is called the **regression model**!

Regression was first studied in depth by Sir Francis Galton, a man with a wide variety of interests. While he was a very problematic character with a lot of beliefs worth disagreeing with, he did write some books with cool information about things like treating spear wounds and getting your horse unstuck from quicksand. He also did some useful work with fingerprints, hearing tests, and even devised the first weather map. He was knighted in 1909.

While studying data on the relative sizes of parents and their children in plants and animals, he observed that larger-than-average parents have larger-than-average children, but those children will be less large in terms of their relative position within their own generation. He called it **regression towards mediocrity**. That would be **regression to the mean** in modern terms.

(I have to say, though, that there is a certain sparkle to the phrase, “regression towards mediocrity” that I need to work into my day-to-day life...)

To be clear, though, we’re talking about **expectations** (predictions) and not absolute certainty!

Regression models are used for predicting a real value, for example, salary or height. If your independent variable is **time**, then you are forecasting future values. Otherwise, your model is predicting present but unknown values. Examples of regression techniques include:

- Simple regression
- Multiple regression
- Polynomial regression
- Support Vector Regression

Let’s say you’re looking at some data that includes employees’ years of experience and salaries. You want to look at the correlation between those two figures. Maybe you’re running a new business or small company that has been setting the numbers kind of randomly.

So how can you find the correlation between those two variables? To figure that out, we’ll create a model that tells us the best-fitting line for this relationship.

Here’s a simple linear regression formula:
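y = b₀ + b₁X

Here **b₀** is the constant and **b₁** is the coefficient on the independent variable **X**.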

(You might recognize this as the equation for a slope or trend line from high school algebra.)

In this equation, **y** is the dependent variable, which is what you’re trying to explain. For the rest of this article, **y** will be an employee’s salary after a certain number of years of experience.

You can see the independent variable above. That’s the variable that is associated with the change in your predicted values. The independent variable might be causing the change or simply associated with the change. Remember, **linear regression doesn’t prove causation**!

The coefficient is how you account for the fact that a change in your independent variable is probably not exactly equal to the change in **y**.

Now we want to look at the evidence. We want to put a line through our data that best fits our data. A regression line can show a positive linear relationship (the line looks like it’s sloping up), a negative linear relationship (the line is sloping down), or really no relationship at all (a flat line).

The constant (also called the intercept) is the point where the line crosses the vertical axis. For example, if you looked at 0 years of experience in the graph below, your salary would be around $30,000. So the constant in the chart below would be about $30,000.

The steeper the slope, the more money you get for your years of experience. For example, maybe with 1 more year of experience, your salary (y) goes up an additional $10,000, but with a steeper slope, you might wind up with more like $15,000. With a negative slope, you’d actually lose money as you gained experience, but I really hope you won’t be working for that company for long...

When we look at a graph, we can draw vertical lines from the line to our actual observations. You can see the actual observations as the dots, while the line displays the model observations (the predictions).

Those vertical lines represent the difference between what an employee is actually earning and what the model predicts they’re earning. To find the best line, we look for the **minimum sum of squares**, which just means that you’d take the sum of all the squared differences and find the line that minimizes it.

That’s called the **ordinary least squares** method!
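To make that concrete, here’s a tiny numpy sketch of the ordinary least squares formulas for a simple regression (the numbers are made up for illustration, not from the real dataset):

```python
import numpy as np

# Made-up example data: years of experience and salaries
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 50000.0, 60000.0, 70000.0, 80000.0])

# OLS for one predictor:
# slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
x_dev = years - years.mean()
slope = np.sum(x_dev * (salary - salary.mean())) / np.sum(x_dev ** 2)

# intercept = mean(y) - slope * mean(x)
intercept = salary.mean() - slope * years.mean()

print(intercept, slope)  # 30000.0 10000.0
```

This is exactly the line that minimizes the sum of squared differences, and it’s the same calculation scikit-learn performs for us.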

First the imports!

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

Now let’s preprocess our data! If you don’t know much about data cleaning and preprocessing, you might want to check out this article. It will walk you through importing libraries, preparing your data, and feature scaling.

The complete beginner’s guide to data cleaning and preprocessing

We’re going to copy and paste the code from that article and make two tiny changes. We’ll need to change the name of our dataset, of course. Then we’ll take a look at the data. For our example, let’s say our employee data has just two columns: years of experience and salary. Keeping in mind that our index starts at 0, we’ll separate the last column from our data, just like we already have set up. This time, however, we’re grabbing the second column for our dependent variable, so we’ll make a minor change to grab that.

```python
dataset = pd.read_csv('salary.csv')

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
```

Now X is a matrix of features (our independent variable) and y is a vector of the dependent variable. Perfect!

It’s time to split our data into a training set and a test set. Normally, we would do an 80/20 split for our training and testing data. Here, though, we’re working with a small dataset of only 30 observations. Maybe this time we’ll split up our data so that we have 20 training observations and a test size of 10.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
```

You have an X_train, X_test, y_train, and y_test! You’re ready to go!

We set a random state of 0 so that we can all get the same result. (There can be random factors in calculations, and I want to make sure we’re all on the same page so that nobody gets nervous.)

We’ll train our model on the training set and then later predict the results based on our information. Our model will **learn** the correlations on the training set. Then we will test what it learned by having it predict values with our test set. We can compare our results with the actual results on the test set to see how our model is doing!

**Always split your data into training and testing sets!** If you test your results on the same data you used to train your model, you’ll probably have really great results, but your model isn’t good! It just memorized what you wanted it to do rather than learning anything it can use with unknown data. That’s called overfitting, and it means you **did not build a good model**!

We actually don’t need to do any feature scaling here!

Now we can fit the model to our training set!

We’ll use Scikit-learn for this. First, we’ll import the linear model library and the linear regression class. Then we’ll create an object of the class — the regressor. We’ll use a method (the fit method) to fit the regressor object we create to the training set. To create the object, we name it, then call it using the parentheses. We can do all of that in about three lines of code!

Let’s import linear regression from Scikit-learn so that we can go ahead and use it. Between the parentheses of the fit method, we’ll specify which data we want to use so our model knows exactly what we want to fit. We want to grab both X_train and y_train because we’re working with all of our training data.

You can look at the documentation if you want more details!

Now we’re ready to create our regressor and fit it to our training data.

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
```

There it is! We’re using simple linear regression on our data and we’re ready to try out our predictive ability on our test set!

This is what machine learning is! We created a machine, the regressor, and we had it learn the correlation between years of experience and salary on the training set.

Now it can predict future data based on the information that it has. Our machine is ready to predict a new employee’s salary based on the number of years of experience that the employee has!

Let’s use our regressor to predict new observations. We want to see how the machine has learned by looking at what it does with new observations.

We’ll create a vector of predicted values. This is a vector of predictions of dependent variables that we’ll call y_pred. To do this, we can take the regressor we created and trained and use the predict method. We need to specify which predictions to make, so we want to make sure we include the test set. For our input parameter in regressor.predict, we want to specify the matrix of features of new observations, so we’ll specify X_test.

```python
y_pred = regressor.predict(X_test)
```

Seriously. That takes a single line of code!

Now y_test holds the real salaries of the 10 observations in the test set, and y_pred holds the salaries our model predicted for those same 10 employees.
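If you want a single number summarizing how close those predictions are, one option (not part of the original four lines) is the mean absolute error. Here’s a self-contained sketch where synthetic salary-style data stands in for the CSV, since the file itself isn’t included here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary dataset: salary ~ 30k + 10k per year, plus noise
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(30, 1))
y = 30000 + 10000 * X.ravel() + rng.normal(0, 2000, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Average dollar amount the predictions are off by
mae = mean_absolute_error(y_test, y_pred)
print(round(mae, 2))
```

A small average error relative to the salaries themselves tells you the line is fitting well.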

You did it! Linear regression in four lines of code!

Let’s visualize the results! We need to see what the difference is between our predictions and the actual results.

We can plot the graphs in order to interpret the result. First, we can plot the real observations using plt.scatter to make a scatter plot. (We imported matplotlib.pyplot earlier as plt).

We’ll look at the training set first, so we’ll plot X_train on the X coordinates and y_train on y coordinates. Then we probably want some color. We’ll do our observations in blue, and our regression line (predictions) in red. For the regression line we’ll use X_train again for the X coordinates, and then the predictions of the X_train observations.

Let’s also fancy it up a little with a title and labels for the x-axis and y-axis.

```python
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_train, regressor.predict(X_train), color='red')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

Now we can see our blue points, which are our real values and our predicted values along the red line!

Let’s do the same for the test set! We’ll change the test set title and change our “train” to “test” in the code.

```python
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_train, regressor.predict(X_train), color='red')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

Make sure you notice that we aren’t changing X_train to X_test in the second line. Our regressor was already trained on the training set, which gave us one unique model equation. If we swapped in X_test, we’d just be plotting new points along the same regression line.

This is a pretty good model!

Our model is doing a nice job of predicting these new employee salaries. Some of the actual observations are the same as the predictions, which is great. There isn’t a 100% dependency between the **y** and **X** variables, so some of the predictions won’t be completely accurate.

You did it!

Congratulations on making your very first machine learning model!!!

As always, if you’re doing anything cool with this information, let people know about it in the responses below or reach out any time on LinkedIn @annebonnerdata!

You might want to check out some of these articles too:

- Getting started with Git and GitHub: the complete beginner’s guide
- How to effortlessly create a website for free with GitHub
- Getting Started With Google Colab
- Intro to Deep Learning
- WTF is image classification?
- How to build an image classifier with greater than 97% accuracy
- The brilliant beginner’s guide to model deployment

Thanks for reading!

The complete beginner’s guide to machine learning: simple linear regression in four lines of code! was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

*Team Members: Richa Bathija, Abhinaya Ananthakrishnan, Akhilesh Reddy (@akhilesh.narapareddy), Preetika Srivastava (@preetikasrivastava30)*

Did you ever face a situation where you had to scroll through a 400-word article only to realize that there are only 4 key points in it? All of us have been there. In this age of information, with content being generated every second around the world, it has become quite difficult to extract the most important information in an optimal amount of time. Unfortunately, we have only 24 hours in a day, and that is not going to change. In recent years, advancements in machine learning and deep learning techniques have paved the way for the evolution of text summarization, which might solve this problem for us.

In this article, we will give an overview about text summarization, different techniques of text summarization, various resources that are available, our results and challenges that we faced during the implementation of these techniques.

There are broadly two types of summarization — Extractive and Abstractive

- **Extractive** — These approaches select sentences from the corpus that best represent it and arrange them to form a summary.
- **Abstractive** — These approaches use natural language techniques to summarize a text using novel sentences.

We tried out different extractive and abstractive techniques and also a combination approach that uses both of them and evaluated them on some popular summarization datasets.

Data is the most crucial part of any natural language processing application. The more data you can get your hands on to train your model, the more sophisticated and accurate the results will be. This is one of the reasons why Google and Facebook have open-sourced a lot of their work in this domain.

We used 3 open source data sets for our analysis.

It is popularly known as the Gigaword dataset and contains nearly ten million documents (over four billion words) from the English Gigaword Fifth Edition. It consists of articles and their headlines. We used this dataset to train our abstractive summarization model.

The CNN/Daily Mail dataset consists of long news articles (an average of ~800 words). It contains both articles and summaries of those articles, and some of the articles have multi-line summaries. We used this dataset in our pointer-generator model.

This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry”, “sound quality of ipod nano”, etc. The reviews were obtained from various sources — Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics). Each article in the dataset has 5 manually written “gold” summaries. This dataset was used to score the results of the abstractive summarization model.

To evaluate the quality of a generated summary, the common metric in the text summarization space is the ROUGE score.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation

It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). It works by matching overlap of n-grams of the generated and reference summary.
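As a rough illustration, here’s a simplified unigram-recall sketch of that overlap idea (the real ROUGE implementation uses clipped n-gram counts and several variants):

```python
def rouge1_recall(generated, reference):
    """Simplified ROUGE-1 recall: the fraction of the reference's
    unique words that also appear in the generated summary."""
    gen_words = set(generated.lower().split())
    ref_words = set(reference.lower().split())
    return len(gen_words & ref_words) / len(ref_words)

score = rouge1_recall("the cat sat", "the cat sat on the mat")
print(score)  # 3 of the 5 unique reference words are covered -> 0.6
```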

Extractive summarization techniques select relevant phrases from the input document and concatenate them to form sentences. These are very popular in the industry as they are very easy to implement. They use existing natural language phrases and are reasonably accurate. Additionally, since they are unsupervised techniques, they are ridiculously fast. As these techniques only play around with the order of sentences in the document to summarize it, they do not do as great a job as humans at providing context.

We wanted to start our text summarization journey by trying something simple. So we turned to the popular NLP package in python — NLTK. The idea here was to summarize by identifying “top” sentences based on word frequency.
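The core idea, stripped down to the standard library (our real version used NLTK for tokenization and stopword removal), is just frequency-based sentence scoring:

```python
from collections import Counter

def top_sentences(text, n=2):
    # Split into sentences and count word frequencies across the whole text
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    freq = Counter(text.lower().replace('.', ' ').split())
    # Score each sentence by the summed frequency of its words; keep the top n
    return sorted(sentences,
                  key=lambda s: sum(freq[w] for w in s.lower().split()),
                  reverse=True)[:n]

doc = "The cat sat on the mat. The cat ran away. Dogs bark loudly."
print(top_sentences(doc))
```

Sentences full of frequent words float to the top, which is exactly why this approach favors long sentences over short, cohesive ones.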

Although the technique is basic, we found that it did a good job at creating large summaries. As expected, brevity wasn’t one of its strengths and it often failed at creating cohesive short summaries. Following are the summaries of “A Star is Born” Wikipedia page.

You can find the detailed code for this approach here.

The Gensim summarization module implements TextRank, an unsupervised algorithm based on weighted graphs, from a paper by Mihalcea et al. It is built on top of the popular PageRank algorithm that Google used for ranking.

After pre-processing the text, this algorithm builds a graph with sentences as nodes and chooses the sentences with the highest PageRank to summarize the document.

TextRank is based on the PageRank algorithm used by the Google search engine. In simple words, it prefers pages that have a higher number of pages linking to them. Traditionally, the links between pages are expressed as a matrix, as shown in the image below. This matrix is then converted to a transition probability matrix by dividing each entry by the sum of links in its page, which governs the path of the surfer.
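Here’s a small numpy sketch of that conversion and the resulting “random surfer” ranking on a toy 3-page web (for simplicity it ignores the damping factor that real PageRank adds):

```python
import numpy as np

# Toy link matrix: links[i][j] = 1 means page i links to page j
links = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)

# Divide each row by its number of outgoing links -> transition probabilities
transition = links / links.sum(axis=1, keepdims=True)

# Follow the surfer until the rank vector stabilizes (power iteration)
ranks = np.full(3, 1 / 3)
for _ in range(100):
    ranks = ranks @ transition

print(ranks.round(3))  # page 0, with the most inbound links, ranks highest
```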

In the original “TextRank” algorithm, the weight of an edge between two sentences is the percentage of words appearing in both of them.

However, the updated algorithm uses Okapi BM25 function to see how similar the sentences are. BM25 / Okapi-BM25 is a ranking function widely used as the state of the art for Information Retrieval tasks. BM25 is a variation of the TF-IDF model using a probabilistic model.

In a nutshell, this function penalizes words that appear in more than half the documents of the collection by giving them a negative value.
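That penalty comes from the IDF term in BM25. A quick sketch of the classic formula shows where the negative values come from:

```python
import math

def bm25_idf(num_docs, doc_freq):
    # Classic BM25 IDF term; it goes negative once doc_freq
    # exceeds roughly half of num_docs
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

print(bm25_idf(10, 1))  # rare word: positive weight
print(bm25_idf(10, 8))  # word in 8 of 10 docs: negative weight
```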

The gensim algorithm does a good job at creating both long and short summaries. Another cool feature of gensim is that we can get a list of top keywords chosen by the algorithm. This feature can come in handy for other NLP tasks, where we want to use “TextRank” to select words from a document instead of “Bag of Words” or “TF-IDF”. Gensim also has a well-maintained repository and has an active community which is an added asset to using this algorithm.

You can find the detailed code for this approach here.

The summa summarizer is another algorithm that improves on the gensim implementation. It also uses TextRank, but with optimizations to the similarity functions. Like gensim, summa also generates keywords.

You can find the detailed code for this approach here.

We wanted to evaluate how text summarization works on shorter documents like reviews, emails etc. Most publicly available datasets for text summarization are for long documents and articles. Hence we used the Amazon reviews dataset which is available on Kaggle.

The structure of these short documents is very specific. They start off with a context and then talk about the product or specific matter, and they end off with closing remarks. We used K-means clustering to summarize the types of documents following the aforementioned structure.

For clustering the sentences, one has to convert the text to numbers, which is done using pre-trained word embeddings. We used GloVe embeddings to achieve this. The embeddings consisted of a dictionary of the most common English words and a 25-dimensional embedding vector for each of them. The idea is to take each sentence of the document and calculate its embedding, creating a data matrix of (# of sentences × 25) dimensions. We kept it simple for obtaining the sentence embeddings by taking the element-wise weighted average of the word embeddings for all the words in the sentence.

Then, all of the sentences in a document are clustered in k = sqrt(length of document) clusters. Each cluster of sentence embeddings can be interpreted as a set of semantically similar sentences whose meaning can be expressed by just one candidate sentence in the summary.

The candidate sentence is chosen to be the sentence whose vector representation is closest to the cluster center.
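With numpy, picking those candidates is one line per cluster. A toy sketch with made-up 2-D “embeddings” and two cluster centers (real GloVe vectors would have 25 dimensions):

```python
import numpy as np

# Hypothetical sentence embeddings (one row per sentence) and cluster centers
embeddings = np.array([[0.10, 0.20],
                       [0.90, 0.80],
                       [0.15, 0.25],
                       [0.84, 0.86]])
centers = np.array([[0.12, 0.22],
                    [0.85, 0.85]])

# For each cluster center, take the sentence whose vector is closest to it
candidates = [int(np.linalg.norm(embeddings - c, axis=1).argmin())
              for c in centers]
print(candidates)  # sentence indices chosen to represent each cluster
```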

Candidate sentences corresponding to each cluster are then ordered to form a summary for an email. The order of the candidate sentences in the summary is determined by the positions of the sentences in their corresponding clusters in the original document. We ran the entire exercise on the *Amazon Food reviews dataset*, and the results were pretty good on the mid-length reviews. One of them is shown below:

You can find the detailed code for this approach here.

We wanted to make a tangible end to end solution to our project. We wanted a viable UI for our algorithms through which the user could interact with the backend.

For this, we integrated widgets into Jupyter Notebook.

The widget enables the user to get a holistic view of the project by asking for user input and generating the output. The widget takes three parameters as input:

- The type of algorithm the user wants to use for text summarization
- The URL of the Wikipedia page whose text needs to be summarized
- The shrink ratio (Amount of output text to be displayed as a percentage of the original document)

The user enters these three inputs and hits the RUN button. The algorithm gets called from the backend and the output is generated based on the shrink ratio.

You can see the detailed code for this notebook here.

Abstractive summarization began with Banko et al. (2000), whose research suggested the use of a machine translation model. In the past few years, however, **RNNs** using encoder-decoder models with attention have become popular for summarization.

We attempted to train our own encoder-decoder model from scratch on the Gigaword dataset, trying to imitate the first state-of-the-art encoder-decoder model.

*If you’re unfamiliar with Recurrent Neural Networks or the attention mechanism, check out the excellent tutorials by **WildML**, **Andrej Karpathy** and **Distill**.*

We used Google Cloud platform to train our RNN and test it. To learn more about how to setup the environment on Google Cloud Platform, you can view this tutorial also written by our team here.

We used a 16-CPU system with 2 Tesla P100 GPUs to train a simple encoder-decoder recurrent neural network with the following hyperparameters:

Pre-trained TensorFlow embeddings were also used.

Although we ran our code on the cloud, given the sheer size of the Gigaword dataset, training for even 5 epochs took 72 hours.

When we scored the model on validation data, we observed that even after 5 epochs (and 3 days), the model did a good job at matching the actual summaries written by humans. Following are a few examples:

The model also did a good job at providing context and using different words of its own. For example in the image below we can observe how the model uses “Taipei” instead of “Taiwan”. The actual text doesn’t talk about “Taipei”, but the model recognizes context and uses its own learning to use the word “Taipei”.

You can view the detailed code and output used to generate the model here.

Although abstractive summarization can be more intuitive and sound like a human, it has 3 major drawbacks:

- Firstly, training the model requires a lot of data and hence time. Although one can apply transfer learning here, pre-trained weights are not readily available and there is no open-source code that one can leverage to implement it
- An inherent problem with abstraction is that the summarizer reproduces factual details incorrectly. For instance, if the article talks about Germany beating Argentina 3–2, the summarizer may replace 3–2 by 2–0
- Repetition is another problem faced by the summarizer. As we can see in the second example above, some phrases are repeated in the summary

Pointer generator networks solve *problem 2* discussed above by calculating “generating probability” which represents the probability of generating a word from the vocabulary versus copying the word from the source. The image below describes how it weighs and combines the vocabulary distribution (which we use for generating) and the attention distribution into the final distribution to create the “generating probability”
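Numerically, that final distribution is just a weighted average of the two component distributions. A toy sketch (made-up numbers) over a 4-word extended vocabulary, where index 3 is an out-of-vocabulary word the generator can’t produce:

```python
import numpy as np

p_vocab = np.array([0.5, 0.3, 0.2, 0.0])    # generator's vocabulary distribution
attention = np.array([0.1, 0.0, 0.2, 0.7])  # attention over the source words
p_gen = 0.4                                 # probability of generating vs. copying

# Final distribution: generate with probability p_gen, copy otherwise
final = p_gen * p_vocab + (1 - p_gen) * attention
print(final)  # the OOV word at index 3 still gets probability from copying
```

Even though the generator assigns the OOV word zero probability, the copy term keeps it reachable, which is exactly how the network handles unseen words.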

Compared to the sequence-to-sequence-with-attention system, the pointer-generator network does a better job of copying words from the source text. Additionally, it is able to copy out-of-vocabulary words, allowing the algorithm to handle unseen words even if the corpus has a smaller vocabulary.

Hence we can think of pointer generator as a combination approach combining both extraction (pointing) and abstraction (generating).

We used the pre-trained weights generated by training on the CNN/Daily Mail dataset and saw some really cool results on the validation dataset.

The code and pre-trained weights were taken from the developer’s GitHub.

Following is the demo of how we used the weights to run the validation model on Google Cloud Platform:

The pointer generator model also creates an “Attention Map” which shows you which words have received a high “generation probability” and how they play a role in creating the summary.

We generated summaries for all the articles in the “Opinions” dataset and compared them to the “gold summaries” that were already present. Both extractive and abstractive techniques did a good job and it was hard to point out any significant difference in both the summaries.

In fact, in some cases, the generated summaries across both algorithms were significantly different as compared to the gold summary. One such instance can be seen in the image below, where the gold summary talks about how “Gas mileage is below expected”, but the generated summaries only highlight the positives of the “Toyota Camry”

Given the architecture of RNNs and the current computing capabilities, we observed that extractive summarization methods are faster, but equally intuitive as abstractive methods. A few other observations:

- Abstractive methods take much longer to train and require a lot of data. Higher-level abstraction — such as more powerful, compressive paraphrasing — remains unsolved
- The network fails to focus on the core of the source text and summarizes a less important, secondary piece of information
- The attention mechanism, by revealing what the network is “looking at”, shines some precious light into the black box of neural networks, helping us to debug problems like repetition and copying.
- To make further advances, we need greater insight into what RNNs are learning from text and how that knowledge is represented.

It goes without saying that there will be challenges in implementing a task that is still evolving, and text summarization fits that description perfectly. In this section, we will discuss the challenges you might face if you want to try this out on your own.

Research and development in this field is growing at a very fast pace. When we started implementing these techniques, we realized that most of the package versions used in the earlier resources were outdated. This led to a lot of version issues in the TensorFlow and attention packages we used, so make sure to go through your reference code and make the necessary updates beforehand.

With 16 CPUs of 16GB memory and 2 Nvidia Tesla P100 GPUs, our abstractive model took 72 hours to go through just 5 epochs. Just 5 epochs!

It would have taken 2 complete weeks just to replicate the results of the paper we were referring to. So be prepared for long training hours.

We had to go through multiple Github profiles to get the datasets that we used for our analysis. This process took us quite some time to get started with the analysis.

As next steps, we plan to:

- Compare the abstractive and extractive summaries using ROUGE scores
- Train Seq2Seq and Pointer Generator models for more epochs and replicate results in the paper

All of the references used are listed below:

(i) https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1

(ii) https://github.com/thunlp/TensorFlow-Summarization

(iii) https://github.com/thomasschmied/Text_Summarization_with_Tensorflow

(iv) https://github.com/abisee/pointer-generator

*Please do share if you have tried other cool techniques for auto-text summarization. Would love to learn from similar experiences!*

Data Scientist’s Guide to Summarization was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

If you’re thinking about attending a data science bootcamp but have zero data science experience yourself, you probably won’t be able to sort the good from the bad.

You won’t know which ones focus on the right things, the unnecessary things, the weird edge-case things. And most importantly, you won’t know which ones will actually give you a shot at landing that dream job.

I’ve been in data science for about a decade. I’ve worked on all kinds of data science projects with all kinds of colleagues. Some had attended bootcamps, some had PhDs, some were software engineers first, and some had no previous experience at all.

Every day I see new questions on Quora from people seeking objective opinions about these bootcamp programmes.

So I’ve started to compile this article — an in-depth review of every (well-known) data science bootcamp I can find.

Here are the bootcamps I’ve written about below (this article is long, so use CTRL+F if you’re interested in a particular course):

- General Assembly’s in-person data science immersive bootcamp
- Flatiron School’s in-person data science bootcamp
- The Data Science Dojo’s corporate training bootcamp
- Ironhack’s in-person data analytics bootcamp
- Springboard’s online data science bootcamp
- Thinkful’s flexible data science bootcamp
- Brainstation’s data science diploma

If you want to know my overall opinion on data science bootcamps and which ones are best, scroll all the way to the bottom.

Before we get into the reviews, it’ll be good to set the stage and explain what a data science bootcamp is and who they’re for.

Data science bootcamps are rapid training courses that aim to take those unskilled in data science to a point where they can make meaningful contributions to a project.

These courses usually combine lecture-style presentations with hands-on coding challenges and more often than not have some kind of capstone project (or portfolio focus module) and interview advice.

I’m going to say it upfront — by their very nature these bootcamps do not deliver everything that someone needs to become a “successful” data scientist. They necessarily focus on the technical and practical aspects of the role and do not teach business, developing expertise, marketing yourself, managing others etc.

These bootcamps fit somewhere in between a semester long college class and industrial technical training. This is problematic for two reasons.

The first is that college courses do not stand alone, they are bolstered by a range of experiences (and other courses) which increase skills outside of a narrow technical domain. This helps students see the big picture.

The second is that industrial technical training is usually delivered in-house to people who know their businesses and roles very well. They know how they’re going to implement the new skills in their work and they’ll know what will benefit their company.

In summary, data science bootcamps are excellent ways of learning isolated technical skills in a discipline that can otherwise be very hard to break into. If you’re looking to learn these technical skills and prefer to learn in person, they are an excellent choice — just be aware of what they don’t teach.

- I do not work for any of these organisations and I haven’t attended any of these courses. I’m using my experience as a data scientist, consultant, and coach to review these programmes.
- The ordering has been determined by the courses search ranking on Google and is not indicative of quality.
- I’m going to be judging these programmes on their technical content (of course) but also how well they prepare their students for the working world.
- My job, nowadays, is to help data scientists move forward in their careers. It’s in my interest to say that these bootcamps fail to prepare their students for the profession. I’m going to try my best to recognise that and to look at these things honestly.
- If I’ve missed a bootcamp that you’re really interested in attending, drop me a line. Similarly, if you have experience of attending one bootcamp or another and feel I’ve given it an inaccurate review — let me know.

With all that out of the way, let’s get started.

This is the link to the General Assembly course.

This is a 12-week course that costs £10,000.

Like a lot of the other courses in this list, General Assembly’s immersive programme openly talks about how focussed it is on helping you achieve in your career.

They promise to help you optimise your resume and flaunt your skills at networking events.

The course runs from 9:30–5 (after which there are optional activities like Meetups). Each day includes a 2.5-hour lecture, a 3-hour coding session, and another 1-hour lecture.

This unit covers using Python (and NumPy), UNIX (commands), git, calculating descriptive statistics, and visualising data with Matplotlib and Seaborn.

The project for this unit (there’s one for each of the 5 units) is answering questions using NumPy — not exactly a portfolio piece.

**Unit review**

Correctly or incorrectly, I’m assuming that each of the 5 units in this bootcamp takes up equal time. If that’s the case, each unit lasts approximately 2.4 weeks, or about 12 working days.

12 days seems like a long time to be spending on this material, but I can see the benefit of introducing tools like git and UNIX slowly and repeatedly; after all, data scientists use them every single day, so it’s important to get them right from the beginning.

On the in-depth syllabus, which I had to hand over my email and phone number to access, they do not state what is covered in the descriptive statistics section of Unit 1. I imagine it covers things like calculating means, medians, and the interquartile range. I’m hoping it goes as far as skew, kurtosis, and some descriptions of the various probability distributions, but I’m not holding my breath.

In summary, this is an introductory unit of the type I imagined, a lot of focus on syntax and less on ideas, maths, and statistics.

This unit is a little confused. On the one hand it teaches some statistical concepts like p-values and confidence intervals. On the other it teaches web scraping and Pandas to find and clean up datasets.

The project for this unit is cleaning and analysing messy data (presumably the data you scraped from the web).

**Unit review**

To me, exploratory data analysis means making charts. So I think this unit is poorly named but that’s not such a big deal.

The important thing to mention is the time that has gone by (by the end of this unit you’ll be in to week 6 of the bootcamp). This seems like a fairly slow progression. Whereas git and UNIX are crucial everyday skills, I’m not so sure that web scraping needs to be given such high priority.

I have a feeling that combining lots of syntactical information about Pandas with high-level statistical ideas may not be the best way to guarantee that students will retain the more heady topics. Indeed, it seems as though the statistical ideas are introduced as functions and methods in SciPy rather than from a mathematical (or conceptual) standpoint.

This unit covers linear and logistic regression, gradient descent, feature selection and the k-Nearest Neighbours algorithm.

The project is applying these skills to a cleaned dataset that is handed over to the students from the staff.

**Unit Review**

We get the first mention of mathematics in this unit breakdown. But it’s too late — it is introduced as part of the lecture on gradient descent. I’m hoping that the idea of finding the ideal slope in a simple linear equation using covariance and variance is described (as this is a great insight) but I doubt it.
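For the record, that insight fits in a few lines: in simple linear regression the best-fit slope is cov(x, y) / var(x), and the intercept follows from the means. A sketch with made-up data:

```python
def best_fit_line(xs, ys):
    """Closed-form simple linear regression: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    slope = cov / var
    intercept = my - slope * mx  # line passes through the point of means
    return slope, intercept

# Invented data lying exactly on y = 2x + 1.
slope, intercept = best_fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Seeing gradient descent converge to the same answer this formula gives directly is exactly the kind of insight that makes the maths stick.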

Most of the things here again point to the libraries and syntax that will be used.

The weirdest thing about this module is the fact that students get to work on a cleaned dataset even though they just spent 5+ weeks learning how to fetch and clean data themselves.

That means that the output at the end of this project is probably a cookie cutter application of basic techniques to already solved problems. Not great for a portfolio piece.

**Unit 4: Machine Learning Models**

In this unit you’ll cover clustering, ensemble methods, NLP, the naive Bayes algorithm, Hadoop and MapReduce architecture and time series analysis.

In the project you’ll apply these techniques to datasets you’ve found (or scraped) by yourself.

**Unit review**

I think I’m going to get whiplash! 7 or so weeks of slow incline followed by 2 and a bit weeks of racing through technique after technique.

The layout of this unit seems problematic. It starts out with clustering to compare it to classification methods, then goes back to classification algorithms (bagging and boosting) and then moves forward to NLP and naive Bayes (which are probably in the correct order) and then forward to MapReduce (which is really helpful for text related tasks) before ending up in time series analysis land.

The ARIMA model is complex, certainly. But I think it should’ve been introduced much earlier (businesses really care about forecasting). Either way, it definitely doesn’t belong in a module focussed on machine learning.

Each of these things are so different on the surface that there’s just no way the proper attention is given to each of them and how they vary.

Again this unit seems to make the error of mixing conceptual ideas with syntactical ideas and it seems as though Hadoop has been crowbarred in to this module.

The final module covers recommender systems, neural network basics, and multi-armed bandits, and ends with a focus on portfolio and interview prep.

The project for the unit has students make a public presentation on how they used all of the techniques of the bootcamp to analyse a dataset they find or scrape.

**Unit review**

The unit starts off with neural networks and backpropagation. It doesn’t say if things like dropout, regularisation, data transformation or hyper-parameter tuning are also covered but I imagine not as deep learning is probably worthy of its own bootcamp.

It also seems as though recurrent and convolutional nets aren’t covered.

The recommender system section says that students will learn to build a basic recommendation engine, which I can only assume means that deep learning is not involved and that matrix transformations and Euclidean distances are used instead. That is a little bit of a strange mix.
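Such a basic engine can be sketched in a few lines: represent each item as a feature vector and recommend the nearest neighbours by Euclidean distance. The item names and feature values below are invented for illustration:

```python
import math

# Hypothetical item feature vectors (e.g. ratings from a few users).
items = {
    "film_a": [5.0, 1.0, 4.0],
    "film_b": [4.5, 1.5, 4.0],
    "film_c": [1.0, 5.0, 2.0],
}

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(target, k=1):
    """Return the k items closest to `target` in feature space."""
    others = [(name, euclidean(items[target], vec))
              for name, vec in items.items() if name != target]
    others.sort(key=lambda pair: pair[1])
    return [name for name, _ in others[:k]]
```

No deep learning required: items with similar feature vectors end up close together, and the nearest ones become the recommendations.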

Finally we get into the work-related stuff.

It states that the career coaches will help you polish your portfolio which, as far as I can see, contains only 1 or 2 projects that would be unique to you.

You finish by going through some practice technical interviews (which is very helpful).

While this bootcamp may rush through a lot of the topics, having exposure to them is beneficial.

I think that the first few modules take less time than the last two (and I hope I’m right) which helps balance things out a little.

The biggest problems with this bootcamp are:

- The few opportunities to build a portfolio different from everyone else’s.
- The introduction of key topics syntax-first instead of idea-first.
- The end-heavy approach to preparing students for interviews.

If you’re going to take this bootcamp, I’d suggest:

- Making plenty of time outside of the course to do projects that interest you.
- Reading up on the ideas behind the syntax (and not being afraid to ask why when your instructor tells you something).
- Having family and friends give you mock interviews (this would really help you explain what you’ve just learnt to do in simple terms).

This is the link to the Flatiron bootcamp.

This is a 15-week course that costs £12,500.

This data science bootcamp pitches itself as the most comprehensive course.

Like the GA bootcamp there’s a big focus on career development and Flatiron also offer one-on-one career advice and practice interviews.

The Flatiron course runs for 50 hours a week, which means 10-hour days every day, from 9–7.

This module covers the basics of Python, NumPy, Pandas and SQL. It also includes Matplotlib and an explanation of Linear Regression.

The projects in this bootcamp seem to all be at the end of the course in their own module — Module E.

**Module Review**

This 3-week module is about the right length for the topics it needs to cover. I’m glad that it covers SQL, something conspicuously omitted by General Assembly, but it doesn’t include anything about git or basic UNIX commands.

It’s not clear how you are introduced to Linear Regression, perhaps (though this may be wishful thinking) you are taught about the normal equation and you compute the line of best fit using matrix transformations in NumPy.

This module seems like a fine introduction to the technical skills you’ll need in the rest of the course but the lack of git training is a little disconcerting.
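If the normal-equation route mentioned above is taken, the computation is short to write down: stack a column of ones onto the inputs and solve (XᵀX)θ = Xᵀy. A sketch with made-up data (not taken from the syllabus):

```python
import numpy as np

# Toy data generated from y = 3x + 1 (invented for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 * x + 1

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) theta = X^T y for theta.
theta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = theta
```

Solving the linear system directly (rather than explicitly inverting XᵀX) is both faster and numerically safer, which is a useful point for students to pick up early.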

In this module you learn about gradient descent as it applies to Linear Regression, you also learn about Bayes’ theorem (I wonder if they teach the maximum likelihood approach to regression?).

After that you study XML (of all things) and JSON, do some web scraping, and learn about experiment design for A/B testing a website.

**Module Review**

Is it just me or does Advanced Data Retrieval sound like you’re going to dive into the ocean looking for a lost aircraft’s black box?

I like the fact that this bootcamp introduces Bayes so early. However, again I’m concerned that the ideas will be lost on students coming to grips with XPath and querying JSON strings.

The project for this module, A/B testing a website, isn’t quite data science but it is a useful thing to do and to have in a portfolio (assuming it’s done well).
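As a reference point, the statistical core of a website A/B test is small: for example, a two-proportion z-test on conversion rates, which needs only the standard library. A sketch with invented conversion counts:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented counts: variant A converts 120/1000, variant B 150/1000.
z, p = two_proportion_z(120, 1000, 150, 1000)
```

Most of the work in a real A/B test is the experiment design around this calculation (choosing a sample size in advance, not peeking early), which is presumably what the module focuses on.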

In this module (they’re all 3 weeks long) you’ll learn about Logistic Regression, Support Vector Machines, Decision Trees, dimensionality reduction (and Principal Component Analysis), k-means clustering, and time series modelling.

**Module Review**

There’s a theme developing here. Both Flatiron and General Assembly take a long time to introduce linear regression and then rush through as many different algorithms as possible.

Because of the layout of the course, we know that Flatiron teaches Logistic Regression, SVMs, PCA, XGBoost, k-means, and time series modelling at the same speed that they cover Linear Regression and Bayesian statistics. There’s just no way they’re explaining these techniques in enough detail.

Also, there seems to be no opportunity to use these different techniques in projects as you go along. The syllabus does mention that each day you’ll do some pair programming (an odd choice for data science), but it’s not clear if these are just exams / problem sets rather than actual projects.

In this module you’ll learn Regular Expressions (!?), PySpark, deep neural nets, convolutional neural nets and recurrent neural nets.

**Module Review**

The inclusion of regular expressions here is directly related to the fact that the course doesn’t introduce UNIX skills the way the General Assembly course does.

For the most part I agree with the pace and timing of this module. However, the inclusion of convolutional nets is a little surprising as everything else in the module is about text data (including recurrent neural nets).

Seeing as they cover quite a lot of advanced topics, and knowing what they’ve taught students before this module, I’m finding it hard to see how people can really make the jump to understanding deep learning properly.

These last three weeks of the bootcamp are dedicated to a pair project (Flatiron really like the pair programming thing). You pitch your instructor a few different projects that you’d like to work on and the instructor picks one.

**Module Review**

Okay, so finally there’s something to do that can go on your portfolio. I’m not sure about the paired aspect (it’s hard to split work evenly between two people) but at least there’s a considerable amount of time to complete the project.

Because this comes right after the module on deep learning, I wouldn’t be surprised if a very high proportion of these projects were deep learning related.

But there’s a problem with that — businesses like results they can understand.

This bootcamp started out strong and at the right pace. The inclusion of SQL is very good, but it shouldn’t come at the expense of git and UNIX commands.

Flatiron’s course, like GA’s, is very squashed in the middle. Knowing Linear Regression and gradient descent in-depth is not enough — they shouldn’t just pile on the algorithms without sufficient explanation.

Deep learning is very popular, but I question whether it should be taught right before the projects. In most day-to-day work, data scientists build much simpler models and a good portfolio should reflect that.

The biggest problems with this bootcamp are:

- Very little mention of projects and an over-reliance on pair programming (nobody wants to have to carry someone else through the course).
- There’s nothing in the modules that talks about what businesses need from data scientists and the career coaching seems to be a bit looser than GA’s where it was built into the end of the course.
- I’m not confident that the portfolio you’d be left with after this course would really help you secure a job.

If you’re going to take this bootcamp, I’d suggest you:

- Go through everything by yourself after class has ended. Don’t just assume that your partner knows what they’re doing.
- Stretch out the middle by looking into all these different algorithms until you understand in which instances they work and why.
- Ask the instructors a lot of questions about the applications of what you’re learning to real business problems, so that you can build a portfolio for yourself outside of the course.

This is the link to the Data Science Dojo course.

The actual bootcamp section of this course runs for 5 days and costs an unknown amount of money (if anyone knows, I’d appreciate you telling me).

I wouldn’t consider the offering from The Data Science Dojo to be a bootcamp in the way people traditionally think of them. This is more of an industrial training programme that aims to improve the skills of employees within a business.

They list some very impressive companies as past clients including Apple, Google, Amazon and Microsoft and they have a wealth of good reviews.

Unfortunately, I can’t go in depth on the syllabus because I can’t find one.

Regardless, I think it’s safe to say that this isn’t your average data science bootcamp and if you’re looking to change careers, you might want to consider a different programme.

This is the link to the Ironhack Data Analytics course.

This 9 week course costs 7,000 euros (available in Europe and the US).

It’s called a data analytics bootcamp but covers much of the same material as the other courses, so it makes sense to include it here.

The primary market for Ironhack bootcamps is career changers. Their marketing website is tailored for those wishing to change jobs and they seem to offer on-going career support after the course has finished which is a nice touch.

The course runs from 9–6 (with optional evening events).

Having students improve their fundamental skills before the course can only help ensure that time in class is well spent. The prework component covers SQL, git, Python and descriptive statistics, so it’s a good blend of both General Assembly’s and Flatiron School’s courses. Plus, you don’t have to learn these things in person (they’re delivered online) which saves you money.

In the first 3-week module you’ll repeat the prework by learning more about git and MySQL. You’ll also start to explore Pandas for data wrangling and get used to working with APIs.

There’s no mention of a project for this module in their syllabus.

**Module Review**

Alas, the promise of everyone being up to speed with the basics on the first day of the course wasn’t true.

I would be annoyed if I sat through 60 hours of online video (that’s how much they claim to have in the prework package) only to turn up in class and find that I’m going through that material again.

These are important skills but why make everyone go through that effort just to cover the stuff again?

Another issue with this first module is that it’s all syntax. I can’t see anything related to statistics in the syllabus, which is a concern because Ironhack’s offering is the shortest in length of the in-person bootcamps.

In this three week module you’ll learn about using Python for inferential statistics, you’ll do more work in Pandas, learn about Matplotlib and all the types of charts you can make with it and start ‘storytelling’ with data.

**Module Review**

Okay, the syllabus for this module reiterates the fact that this is a data analytics bootcamp more than it is a data science bootcamp.

After 6 weeks you haven’t gotten very far in terms of predictive capabilities (there’s been no forecasting or linear regression).

Given that the students were supposed to learn about descriptive statistics before the course started, leaving inferential statistics until weeks 4–6 makes this course seem very slow.

These guys are just keyword padding their module titles.

In this module you learn about supervised vs unsupervised learning, get an in-depth intro to scikit-learn and do some feature engineering.

Again, there’s no mention of a project here even though this is the last module!

**Module review**

The name of this module is just wrong. But there are things I like about this part of the course.

For one, Ironhack haven’t jumped on the deep learning gravy train. Secondly, they end the bootcamp with 3 weeks of exploring the algorithms that data scientists really use each and every day.

Like Flatiron School’s data science bootcamp, there’s no explicit mention of career guidance in the course syllabus itself so I’m not sure how that works or when it’s delivered.

For something so focussed on delivering value to career changers, there’s a remarkable lack of information in the syllabus about how they help you to change careers.

Ironhack make no claims to be teaching you the in-depth mathematical intuition behind data science (as the others do) and so I don’t have such a big problem with their omissions here.

The biggest problems with this bootcamp are:

- I can’t find anything about projects in their syllabus. If you don’t get a portfolio out of the bootcamp, what’s the point?
- Going over everything from the prework again would annoy the hell out of me.
- They don’t break down the precise algorithms you learn about and so I can only assume that they stop at the more common ones. Which is fine, but means they cover less than their competition.

If you’re going to attend the Ironhack bootcamp, I’d suggest:

- Skimming the prework. Don’t do it all, for the sake of your own sanity.
- Using the skills you learn during the course to make portfolio pieces for yourself in the evening.

This is the link to the Springboard course.

This course lasts 6 months and costs $7,500 (but you only pay once you find a job).

This bootcamp offers weekly 1:1 calls with a mentor, uses DataCamp’s online courses to teach you the technical skills, and includes 14 projects(!).

Springboard offer over 500 hours of online material for you to learn from and the individual modules are pulled from the DataCamp library.

I’m not going to break down each module as they’re not delivered in person, which means you can watch them again and again and refer to outside sources if you struggle with any of the topics.

Once you’ve covered the basic skills you get to choose a specialism, which is a very nice way of handling things instead of smothering students with techniques from lots of different niche fields.

Overall, I think that this course is fairly expensive (which is mitigated by the fact that you only pay once you’re in a job) but one of the better offerings.

Springboard shows that some of the best education is being delivered online. The real question is whether or not the qualification means as much as an in-person course when it comes to looking for that second, third, fourth job.

A potential problem is the projects. I can’t find a breakdown of what these are but imagine that some of them aren’t portfolio pieces. And so anyone taking this course should be aware of the fact that they should still seek opportunities to do projects by themselves.

This is the link to the Thinkful course.

Very similar to Springboard’s offering, the Thinkful programme is also online, also takes 6 months, and costs nearly the same amount at $7,990.

Thinkful offer a job guarantee, which I believe is the same as Springboard’s in that you won’t pay for the course if you fail to secure a position.

I entered my email address to get a hold of the syllabus but didn’t receive anything in-depth. It looks to be about the same as the other bootcamps on this list, moving through SQL, pandas, matplotlib. And then going on to cover linear regression, supervised learning, unsupervised learning and finishing on specialisations which are very similar to Springboard’s.

1:1 mentoring is also offered at Thinkful but instead of the once weekly calls they offer two calls a week with your assigned mentor.

As with Springboard it isn’t clear if the purpose of the mentor is to help you get a job or to complete the course (probably the latter).

There’s no specific number of projects mentioned on the course website and I’m not sure who provides their education content (it’s probably created in house).

Again there’s the question of whether this course will mean anything to employers going forward (after the initial job that satisfies the guarantee), and as it’s more expensive than the Springboard course I’ll have to say that this one also seems overpriced.

If you’re going to participate in the Thinkful flexible data science bootcamp, be sure to use your mentoring time to discuss how your job search is going and use your free time to bolster your portfolio with the skills you’re learning in the course.

This is the link to the Brainstation course.

This programme lasts for 12 weeks and, despite my best efforts, I couldn’t find a price.

Brainstation state that this bootcamp is project-focussed and that at the end you’ll be left with a single major portfolio piece.

Apparently, the application process can take between 4 and 6 weeks.

The programme includes mock interviews and post-‘graduation’ support.

In this unit, you’ll learn about Excel(!) and SQL.

**Unit review**

I have very little to go on here, Brainstation seems to guard their prices and syllabus more closely than the rest (I’d love to hear why).

The fact that Excel is covered is unique and surprising. Data scientists do often use spreadsheets as part of their workflows, but not to the extent that 2 weeks of training on them is worthwhile.

Including SQL training is helpful but is maybe not the best way to spend the first two weeks of a purported data science course.

The Analysing Data unit covers Python, statistical modelling and something called visual analysis (making charts, I think).

**Unit review**

A seemingly good progression from the first unit that shows you how you can do things in code that you’d previously done by hand in Excel (filtering, sorting, etc.)

This unit also includes hypothesis testing which is good to see.

At the end of this unit you’ll have most of the technical skills of a data analyst (though not the business skills as they haven’t been mentioned yet).

I’m a little suspicious of the ‘project-driven’ approach that Brainstation proclaimed — as of yet there’s nothing that would make one person’s portfolio stand out from another’s.

In this unit you’ll learn Tableau.

**Unit review**

If it seems like I’m writing short descriptions of what’s in each unit it’s because I have no idea what the actual contents of the modules are.

Again, Brainstation have chosen to include a proprietary product in their bootcamp. I’m not sure if learning Tableau after spending a couple of weeks with Matplotlib is the best idea. It also takes students away from the ‘data science’ for two weeks, meaning they’ll have a harder time connecting the dots when they return to do more modelling.

In Unit 4 you’ll learn about supervised clustering (that’s a new one) and unsupervised clustering with applications in finance and e-commerce.

**Unit review**

The unit description states that you’ll be using both R and Python to complete the unit which must be super confusing for students who have spent the past 6 weeks being passed around various domain specific languages (Excel, Pandas, Tableau).

From what I can see on the website it looks as though this module is where previous students have developed their portfolio pieces, which is worrying because it seems like they only get taught a handful of algorithms before this.

In this unit you’ll learn … presentation skills, I think?

**Unit review**

It’s a good thing that the course includes presentation skills but 2 weeks of mock presentations sounds like my personal hell.

Hopefully there’s some career advice that’s also a part of this section.

In Unit 6 you’ll learn TensorFlow and Hadoop.

**Unit review**

Bolting this on to the end of the course seems odd to me. These topics are far more advanced and nuanced and edge-casey than anything else the students would have learned up to this point.

Maybe I’m just grumpy because I’ve written all of this in one sitting, or maybe it’s because I’ve received three emails from these guys with no mention of price, but I doubt that’s it.

I feel like this course relies too heavily on proprietary products, has too large gaps between the data science content, starts off too slow, and ends too abruptly.

Those are all the major problems.

If you like what you see at Brainstation, I’d suggest:

- Doing way more projects than they set.
- Reviewing your notes from the previous weeks during the modules on Tableau and making presentations.
- Reading widely about the algorithms they don’t teach.
- Choosing your favourite language and committing to mastering it.

This has been an eye-opening article to write.

I’m surprised to see that so many bootcamps claim to be about helping people make or change careers and then teach nothing about business throughout the courses (careers only really happen inside organisations).

It’s also disconcerting that so many of them blend lots of advanced and nuanced topics together without the proper prerequisites.

If I was looking to get started with data science and thought bootcamps were the way to do that, I’d start by taking a data analysis bootcamp which taught databases and SQL, git, basic stats and visualisation. And then look for a data science bootcamp which skipped those things and dedicated more time to the algorithms and why they work.

Unfortunately, it doesn’t look like many of those exist.

The online courses are both very expensive and have me rethinking whether I’m charging way too little for my own courses. But at least they delve more into why things work than the in-person bootcamps do.

Of all the bootcamps I reviewed, GA’s probably has the best syllabus, despite its problems.

In short, anyone who feels that a bootcamp is the best way for them to move forward should be ready to spend additional time and resources building a great portfolio. They should also allocate time to studying the various algorithms in-depth.

If you do choose to participate in a data science bootcamp, be sure to use up as much of the instructors’ and career coaches’ time as possible — extract all you can.

Good luck!

*Originally published at **carldawson.net** on March 23, 2019.*

Which Data Science Bootcamp is right for you? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

In my previous article, I discussed what Empirical Risk Minimization is and proved that it yields a satisfactory hypothesis under certain assumptions. Now I want to discuss Probably Approximately Correct learning (which is quite a mouthful but kinda cool), a generalization of ERM. For those who are not familiar with ERM, I suggest reading my previous article on the topic, since it is a prerequisite for understanding PAC learning.

Remember that when analyzing ERM we came to the conclusion that, for a finite hypothesis space **H**, we can arrive at a hypothesis that has an error lower than **epsilon** with probability at least 1 − **delta**, assuming that such a hypothesis exists in the hypothesis space. Based on these parameters we could calculate how many samples we need to achieve such accuracy, and we arrived at the following lower bound for the number of samples:
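Assuming this is the standard bound for a finite hypothesis class under the realizability assumption, it reads:

```latex
m \;\ge\; \frac{\log\left(|H| / \delta\right)}{\epsilon}
```

That is, roughly (1/ε)·log(|H|/δ) i.i.d. samples suffice for ERM to return, with probability at least 1 − δ, a hypothesis with true error below ε.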

This fits into the general PAC learning framework. The following formal definition is given in the book Understanding Machine Learning:
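Assuming the definition in question is the usual one from Understanding Machine Learning, it can be paraphrased as follows: a hypothesis class **H** is PAC learnable if there exist a function **m_H** and a learning algorithm such that

```latex
\forall\, \epsilon, \delta \in (0,1),\ \forall\, D \text{ over } X,\ \forall\, f:\qquad
m \ge m_H(\epsilon, \delta)
\;\Longrightarrow\;
\Pr\big[\, L_{D,f}(h) \le \epsilon \,\big] \ge 1 - \delta,
```

where **h** is the hypothesis the algorithm returns from **m** i.i.d. samples drawn from **D** and labeled by **f**.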

For me at least, this definition was a bit confusing at first. What does it mean? The definition states that a **hypothesis class** is PAC learnable if there exist a function **m_H** and an algorithm such that, for any labeling function **f**, distribution **D** over the domain of inputs **X**, **delta** and **epsilon**, given **m ≥ m_H** samples the algorithm produces a hypothesis **h** that, with probability 1-**delta**, yields a **true error** lower than **epsilon**. A labeling function is nothing more than a function **f** that assigns labels to the data in the domain.

Here, the hypothesis class can be any type of binary classifier, since the labeling function assigns the labels **0** or **1** to the examples from the domain. The **m_H** function gives us a bound on the minimal number of samples we need in order to achieve an error lower than **epsilon** with confidence **delta**. The accuracy **epsilon** naturally controls the necessary sample size: the higher the accuracy we demand, the more faithful a sample of the domain our training set needs to be, which in turn increases the number of samples required.
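To make the dependence on **epsilon** and **delta** concrete, here is a minimal sketch of the sample-complexity function **m_H** for a finite hypothesis class under realizability, using the bound discussed above (the function name and the example numbers are my own, not from the article):

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Minimal number of samples m_H(epsilon, delta) such that, for a finite
    hypothesis class of size h_size (under the realizability assumption),
    ERM returns a hypothesis with true error below epsilon with probability
    at least 1 - delta:  m >= log(h_size / delta) / epsilon."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# Demanding higher accuracy (smaller epsilon) or higher confidence
# (smaller delta) both increase the required sample size.
modest = sample_complexity(1000, epsilon=0.1, delta=0.05)
strict = sample_complexity(1000, epsilon=0.01, delta=0.05)
```

Note how the bound grows only logarithmically in the size of the hypothesis space but linearly in 1/epsilon, matching the intuition in the paragraph above.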

The above model has a certain drawback: it is not general enough, because of the realizability assumption (explained in Empirical Risk Minimization). Nobody guarantees that a hypothesis resulting in a true error of 0 exists in our current hypothesis space; the model may simply fail to capture the problem. Another way of looking at it is that perhaps the labels aren’t well defined by the data because of missing features.

The way we circumvent the realizability assumption is by replacing the labeling function with a data-labels distribution. You can look at this as introducing uncertainty into the labeling, since one data point can now have different labels. So why is it called **Agnostic PAC learning**? The word agnostic comes from the fact that the learner is agnostic towards the data-labels distribution: it is going to learn the best labeling function **f** while making no assumptions about that distribution. What changes in this case? The definition of the **true error** changes, since the label of a data point is now a distribution over multiple labels. We also cannot guarantee that the learner achieves the minimal possible true error, since we do not have the data-labels distribution to account for the label uncertainty.

After these considerations, we arrive at the following formal definition from the book:
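Again, the definition was shown as an image; this is a hedged reconstruction of the standard agnostic PAC statement (the notation follows the book, but the exact wording is mine):

```latex
\textbf{Definition (Agnostic PAC learnability).} A hypothesis class $H$ is
agnostic PAC learnable if there exist a function
$m_H : (0,1)^2 \to \mathbb{N}$ and a learning algorithm such that: for every
$\epsilon, \delta \in (0,1)$ and every distribution $D$ over $X \times Y$,
when the algorithm is run on $m \geq m_H(\epsilon, \delta)$ i.i.d. examples
drawn from $D$, it returns a hypothesis $h$ such that, with probability at
least $1 - \delta$,
\[
  L_D(h) \;\leq\; \min_{h' \in H} L_D(h') + \epsilon .
\]
```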

Notice the changes with respect to the definition of **PAC** learnability. By introducing the data-labels distribution **D**, we allow for the fact that the **true error** of the learned hypothesis is going to be less than or equal to the error of the **optimal hypothesis** plus a factor **epsilon**. This also encapsulates **PAC** learning itself in the case where there is an optimal hypothesis in the hypothesis space that produces a **true error** of 0, but we also allow for the possibility that no such hypothesis exists. These definitions are going to be useful later on in explaining the **VC-dimension** and proving the **No Free Lunch theorem**.

In case the terminology was a bit foreign to you, I advise you to take a look at Learning Theory: Empirical Risk Minimization, or to take a more detailed look at the brilliant book by Ben-David mentioned in the article. Other than that, keep machine learning!

Learning Theory: (Agnostic) Probably Approximately Correct Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

*Co-authored by: **Tom N. Collins**, **Ichsani Wheeler** and **Robert A. MacMillan*

Since 2004, OpenStreetMap has been the main platform for building an Open Map of the world. More recently, the concept of building geo-information through distributed work and common ownership has led to several inspiring businesses; for example, Mapillary — a street-level imagery platform. However, currently both OpenStreetMap and Mapillary focus only on cities and urban areas. What about the rest of the planet? What about making open global maps of environmental variables such as land cover, land use, climate, soils, vegetation and biodiversity? Several for-profit and not-for-profit organizations have started doing just that, providing platforms to map world land use (https://osmlanduse.org), land cover (https://geo-wiki.org), biodiversity (https://gbif.org), forest dynamics (https://globalforestwatch.org), weather (https://openweathermap.org), and relief/topography (https://opentopography.org), to mention a few. These open data sources are awesome but scattered. Such information could be even more effective if *combined* into a *complete* and *consistent* “OpenLandMap”-type system of the world’s environment. If realized, such a system could be *“the place”* where anyone could, with Wikipedia-like confidence, track current status and changes in our environment. To contribute to this idea, the OpenGeoHub foundation has recently launched a web-mapping app and data service called “LandGIS”.

How can we judge today if we don’t know what happened yesterday? For anyone to be able to understand ecosystem services and the value they represent to the environment, they must first have insight into past environmental conditions. Some selected point in the past (often referred to in environmental sciences as a *“baseline”*) is the ultimate reference that allows us to quantify the scale of environmental degradation/restoration and the potential costs/benefits involved. One typical, and widely recognized example of tracking environmental history is the changing status of global annual land surface temperatures (see Fig. 1).

If we focus on the last few hundred years of the graph above, we can notice that the planet should have been cooling down, not warming up. There is a distinct difference between the expected (*“natural”*) trend and the *“human-induced”* global annual temperature. In this case it is the global warming effect, which we now know is mainly due to fossil CO2 emissions (fossil fuel industry, transportation, agriculture). Ricke et al. (2018) estimated that the social cost of each additional fossil carbon emission (*“the complete economic cost associated with climate damage that results from the emission of an additional tonne of carbon dioxide”*) is, on average, US$417 per tCO2e emitted. We still emitted about 4.9 tCO2 per capita in 2017 (15.7 tCO2 in the USA; 3.8 tCO2 in other G20 countries; based on Olivier and Peters, 2018), which means that the total social cost of CO2 emissions for the year 2017 is about US$15.4 trillion! Even if the number estimated by Ricke et al. (2018) is replaced by the conservative number used by the US Environmental Protection Agency (US$40), we can see that CO2 emissions have serious economic consequences. Thus, environmental management and accounting for the social costs of environmental damage may well become a major economic activity (that is, if the current economic system run by our global civilization is to survive).
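The US$15.4 trillion figure can be checked with a quick back-of-envelope calculation. The per-tonne cost and per-capita emissions come from the sources cited above; the 2017 world-population figure is an assumption added here:

```python
# Back-of-envelope check of the social-cost figure quoted above.
social_cost_usd_per_t = 417      # US$ per tCO2e (Ricke et al. 2018)
emissions_t_per_capita = 4.9     # global average tCO2 per person in 2017
world_population = 7.55e9        # assumed 2017 world population

total_social_cost = (social_cost_usd_per_t
                     * emissions_t_per_capita
                     * world_population)
# on the order of US$15 trillion, consistent with the text
```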

The previous example demonstrates how physical variables, such as surface temperature, can be related to measurements of CO2 emissions, which can then be assessed in terms of economic costs. But our planet comes with a diversity of landscapes, climate zones and geological materials, and these different parts of the planet react differently to global events and processes. To better display the geographical distribution of these processes, scientists generate global maps at various spatial resolutions. Using these maps means that we know not only a single value for *average* change per physical variable, but also *where* that change is most significant and often also what is *driving* it (Fig. 2a). Furthermore, when documenting and quantifying global dynamics of the environment, scientists document the dynamics of specific scientifically accepted rigorous variables, which we can refer to as **land management indicators** (Fig. 2b).

Here, not all land management indicators are equally important. Some variables are direct indicators for quality of human life and can be valued in billions or even trillions of dollars when used to predict potential future natural hazards, gains or losses of ecosystem services and so on (see previous example with CO2 emissions). The United Nations Convention to Combat Desertification (UNCCD) has, for example, selected the following three key indicators of land degradation:

- Land cover classes,
- Land productivity,
- Above and below ground carbon in biomass,

Keeping each of the three constant, or at least not significantly worse, is a step towards **Land Degradation Neutrality**.

When it comes to land management indicators, it is now increasingly important to be able to *“map the past”* and *“predict the future”*, i.e. to produce maps of land management indicators for various periods while maintaining the same spatial reference. One example of a complete global data set documenting land use changes over the last 12,000 years — basically documenting the evolution of world agriculture — is the **HYDE data set** (Klein Goldewijk et al. 2011), which comprises a time-series of images of percent coverage for 10 main land use types, population density, and urban and rural areas, produced and distributed by the Dutch Environmental Agency (PBL). One can quickly observe that most such historical time-series maps suffer from the same problem: proportionally lower certainty about events and conditions that happened further in the past. For that reason HYDE is only available at a relatively coarse spatial resolution of 10 km, and the reliability of its spatial patterns drops drastically as we move back into pre-industrial times. In 2019 we can map almost all of the world’s forests and canopy heights at fine spatial resolution. But where the forests of 100 years ago were located (let alone 2,000 years ago) is much more difficult to reconstruct.

The spatial resolution of maps often defines most of the technical specifications of a system and the usability of the maps (Hengl and MacMillan, 2019). From a management point of view, the Earth’s land mass can be divided into discrete spatial units, or pixels, ideally corresponding to equal-area blocks of the earth’s surface. The **What3Words** company (Fig. 3), for example, has assigned to each 3⨉3 m pixel a unique combination of three common words (total land area is estimated at 148,939,063 km², so over 1.65⨉10^13 unique pixels!). By using What3Words you can connect with other people within 3⨉3 m spatial blocks. We could now also attach values for all key land management indicators to each of these unique pixels and then track their changes through time. This type of fine-grain management of land, when applied to agricultural land, is referred to as **“Precision Agriculture”**. Mapping the Earth’s land mass at 1–5 m resolution in 2019, as we will see in later sections, is certainly feasible, but will also demand massive computing infrastructures.
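The pixel count quoted above follows directly from the land-area figure; a quick sketch of the arithmetic:

```python
# Reproducing the What3Words pixel count: the number of unique
# 3 x 3 m cells covering the Earth's land area.
land_area_km2 = 148_939_063      # land-area figure used in the text
cell_side_m = 3                  # What3Words grid cell size

land_area_m2 = land_area_km2 * 1_000_000   # 1 km^2 = 10^6 m^2
n_cells = land_area_m2 / cell_side_m**2
# roughly 1.65e13 unique cells, matching the estimate in the text
```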

Thus far, we think that the most important types of global land variables to monitor are land cover, land use, temperature, precipitation, vapor pressure deficit, air-borne particulates, fires, above and below ground biomass, emissions of gases, effective photosynthesis, flora and fauna types and their characteristics, soil water and soil macro and micro-flora and fauna. Variations of these variable types can be further interpreted to derive land management indicators, which can then be used to track performance and to estimate potential financial losses or gains. By monitoring these indicators through time, and by modeling their precise spatial distribution, we can assess various potential states of the environment, predict future states, and use this information to continually improve and optimize land management.

As in many scientific fields, one way to reduce political bias, or controversy, around quantifying land degradation is to rely on robust technology and objective measurements. Since the 1960s it has been evident that the most efficient way to track the status of global vegetation is by using satellite-based Remote Sensing systems (RS) as a basis for continuous Earth Observation (EO). Beginning in 2008, NASA and USGS made all **Landsat** imagery *“Free and Open”*, symbolically starting a new era of RS data (Zhu et al. 2019). Opening RS data held by national agencies has proven to generate financial benefits as well: Zhu et al. (2019) estimated that the use of Landsat data led to productivity savings of $1.70 billion for U.S. users in 2011 alone, plus $400M for international users, which is almost enough to pay for the whole program’s history from the profits of a single year! Applications that support improved resource management will inevitably result in EO applications growing exponentially, and these trends will remain positive for decades (Ouma, 2016). Having Open EO data is especially important for nature conservation and land restoration projects (Turner et al. 2015; Gibbs & Salmon, 2015). NASA and USGS have been maintaining and sharing time-series of land products derived from their **MODIS** imagery for almost 20 years now; the **Copernicus Global Land Service** (CGLS; hosted by VITO NV) provides a similar list of land management indicators, including vegetation indices, Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) and dry matter productivity. EO-based environmental monitoring projects have boomed following the recent launch of the European Union-funded **Copernicus Sentinels** and the Japan Aerospace Exploration Agency’s (JAXA) Advanced Land Observing Satellites (**ALOS**).
In addition to multispectral imagery, the German Aerospace Center (DLR) **TanDEM-X** project has now produced probably the most accurate global land surface model (including a global map of forest canopy heights) at 12 m spatial resolution and at unprecedented vertical accuracy (Martone et al. 2018). The 100 m resolution version of this data set has recently been freely released for research and education uses. DLR’s next-generation topographic/elevation mapping will focus on even finer resolutions (5 m) and faster revisit times (<1 month), so that we will increasingly be able to track all fine-grained changes in land surface elevations or volumetric displacements. The following three trends now apply universally to all EO projects (Ouma, 2016; Herold et al. 2016):

- Continuously finer spatial resolutions and increasingly faster revisit times.
- Lower costs of acquisition, storage and processing per unit area.
- More advanced technologies allowing for penetration of clouds and vegetation (active sensors) and improved accuracy of detection of materials and tissues.

It is now increasingly difficult to conceal poor land management or air or water pollution. The spatial resolution of publicly (or cheaply) available imagery is rapidly progressing towards sub-meter accuracy. Already in 2019, **Planet Labs** have created an image fusion system that supports derivation of daily Normalized Difference Vegetation Index (NDVI) and Leaf Area Index (LAI) at 3 m spatial resolution (Houborg and McCabe, 2018). The **OneSoil** company has likewise automated extraction of farm borders and crop status (for the present just in Europe and the USA) from Sentinel-2 and Landsat 7/8 imagery, so that even smaller farms and their vegetation indices can be tracked on a field-by-field basis. These systems are moving from having just raw images to having pre-processed estimates of land management indicators by land management units (e.g. fields). We anticipate that in the coming years it will become increasingly difficult to conceal any deforestation (even the loss of a few trees?), loss of biomass (Landsat 7/8 and Sentinel-2), loss of soil, CO2 emissions (Sentinel-5), or even smaller-size waste dumps (Planet Labs) anywhere on the planet. We see these as good trends, and also believe that it is good that such awareness continues seamlessly across country borders, i.e. that it gives us the power to identify poor land managers or environmental polluters irrespective of borders. People have a right to know about the status of our global land commons in the past, now and into the future. Such global environmental awareness may well be one of the most positive aspects of globalization.

In previous sections we discussed how RS imagery directly and objectively measures the status of the environment, how it is becoming ever more accessible and how, in consequence, the number of applications and the overall market for RS data are likely to grow exponentially. But most countries and businesses are not set up to handle such data volumes, let alone exploit the new types of information that they convey. For instance, to archive all Landsat scenes you would need petabytes of storage, not to mention massive infrastructures to process the data. For illustration, Planet Labs claims to acquire 50 terapixels of new data every single day. To address such massive data challenges Google has developed a cloud solution for processing all of NASA’s, ESA’s and JAXA’s imagery and all other publicly available imagery: **Google Earth Engine** (GEE) (Gorelick et al. 2017). Several research groups have used the GEE infrastructure to derive global maps of deforestation / forest cover changes (Hansen et al. 2013), surface water dynamics (Pekel et al. 2016), land cover (Song et al. 2018) and cropland distribution (Xiong et al. 2017), to mention just a few of the best known. The recent special issue on GEE in the Remote Sensing journal lists even more applications of GEE.

GEE has literally enabled undergraduate students to map the world at 30 m resolution without any research budget. Should GEE become a universal solution for deriving, storing and distributing land management indicators? It sounds like a perfect option. Having many copies of the same original RS data would certainly be inefficient, and Google thoughtfully pays for the processing costs, which, considering the size of their infrastructure, are probably minor in relative terms. But GEE does come at a cost. First, uploading and downloading larger data sets to GEE comes with an expectation that a user will need to increase the capacity of their Google Drive or Google Cloud storage. A second, more serious, limitation is that Google’s terms of use indicate, among other things, that:

5. Deprecation of Services. Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer.

In other words, a corporation like Google could discontinue GEE at any moment and legally face no consequences, because these are the terms of use that all GEE users accepted at the start of their service. Another issue is: can we trust Google? What if Google developers make a mistake in their code? Although many algorithms used within GEE are based on Open Source software, understanding what happens on Google’s servers before an image of Earth is generated is not trivial. The European Union-funded project **OpenEO** (Open Earth Observation) now looks at ensuring *“cross cloud backend reproducibility”*. It develops a language (API) that can be used to ask questions of different cloud back-ends, including GEE. This aims to make it relatively trivial to compare and verify their results, and also to find, for example, the cheapest options for doing particular types of computations.

EO is the basis of objective monitoring of land dynamics, but not all land management indicators can be mapped directly using RS technology alone. Some land management indicators need to be estimated by interpolating sampled (i.e. observed/measured) values at point locations, or by using predictive mapping approaches. Some variables are simply more complex and can’t be mapped using spectral reflectances only. For example, no EO system can (yet) be used to directly estimate soil organic carbon (SOC) density. The technology to do this simply does not yet exist, and therefore we need to use point samples and Predictive Mapping to estimate SOC at each pixel. Luckily, ever more point data on environmental variables is being made available across borders via Open Data licenses and through international initiatives. Established field observation compilations include the **Global Biodiversity Information Facility** observations, **Global Historical Climatology Network** station data, **WoSIS Soil Profile Database** points, the **sPlot global plant database** and **BirdLife**, to mention just a few of the most widely known. More and more point measurements and observations are also contributed through Citizen Science and crowdsourcing projects such as **iNaturalist** (Fig. 4), **Geo-wiki**, the **Citizen Weather Observer Program** and similar (Irwin, 2018). Even commercial companies are now open to sharing their proprietary data for global data mining projects (provided that their users agree, of course); for example, **Netatmo** shares continuous measurements of temperature, pressure, noise levels etc. from its weather stations (an estimated 150k+ devices sold worldwide). Many users are also open to sharing weather measurements from their mobile phone devices. The **Plantix app** is currently producing a global database of photographs of plant diseases and growth aberrations.
We mention here only a few major global or continental data repositories and initiatives. There are, of course, many more repositories available in local languages and for local areas.

As global compilations of ground observations and measurements become larger and more consistent, there is an increasing need to extract value-added global maps from them. There is now a burgeoning interest in using exascale high-performance computing to process emerging novel, in-situ observation networks to transform the quality and cost-effectiveness of high-resolution weather forecasting, agricultural management and Earth system modeling in general. An especially exciting development in the years to come will likely be a hybrid process-based modeling + machine learning approach which combines the best of data and the best of our geophysical/geochemical knowledge (Reichstein et al. 2019).

To summarize, the most promising path to building trust in data seems to be by achieving computational reproducibility (documenting all processing steps and providing all relevant metadata required to reproduce exactly the same results). There are now increasingly more robust ways to achieve reproducibility, even for projects that are relatively computationally heavy. Due to increasing provision of free RS data associated with the launch of new EO satellites, it seems that the mainstream data analytics will be moving from local networks to (virtual) clouds and **Big Earth Data cubes** i.e. data analysis through web-based workflows (Sudmanns et al. 2019). In that context, GEE will remain a valuable and (hopefully) increasingly trustworthy place to process global public RS data and produce new valuable maps of the status of our environment. But society also needs more Open and not-for-profit infrastructures such as OpenStreetMap to ensure longevity. Global field observation repositories, Citizen Science and Machine Learning will, in any case, play an increasing role in generating more reliable maps of more holistic land management indicators.

We (the OpenGeoHub foundation) have recently started providing hosting and data science services to help produce and share the most up-to-date, fully documented (potentially to the level of full reproducibility) data sets on the actual and potential status of multiple environmental measures, through a system we call “LandGIS”, available via https://landgis.opengeohub.org (Fig. 5). Initially, LandGIS provides access to new and existing data on soil properties/classes, relief, geology, land cover/use/degradation, climate, and current and potential vegetation, through a simple web-mapping interface allowing for interactive queries and overlays. This is a genuine Open Land Data and Services system where anyone can contribute and share global maps and make them accessible to hundreds of thousands of researchers and businesses.

LandGIS is based on the following six main pillars:

- Open data license (Open Data Commons Open Database License and/or Creative Commons Attribution-ShareAlike) with a copy of data placed on zenodo.org,
- Fully-documented, reproducible procedures with most of the code available via a github repository,
- Predictions based on the state-of-the-art Ensemble Machine Learning techniques implemented using Open Source software,
- Distribution of data based on using the Open Geospatial Consortium (OGC) standards: GDAL, WMS, WCS and similar,
- Diversity of web-services optimized for high traffic usage (cloud-optimized GeoTIFFs),
- Managed, open user and developer communities.

We contribute to LandGIS results of our own data mining and spatial prediction processes including mapping of potential natural vegetation (e.g. Hengl et al. 2018) and 3D mapping of soil properties and classes. However, we also host maps contributed by others, especially if the data are already peer-reviewed and fully documented.

To illustrate our general approach to producing usable information to improve global land management, consider the example of soil type mapping. USDA and USGS have invested decades, and possibly billions of dollars, to collect data on soils and to produce and maintain knowledge about soils, mainly through the soil classification system *“USDA Soil Taxonomy”*. As a result of decades of field work and laboratory analysis of thousands of soil samples, USDA and USGS produced a repository of over 350,000 field observations of soil types (here we focus on soil great groups). We combined these point data with other national and international compilations to produce the world’s most consistent and complete training data set of USDA soil great groups. We then overlaid these points on some 300 global covariate layers representing soil-forming factors, and fitted spatial prediction models using Random Forest (Hengl et al. 2017). Although we achieved only limited classification accuracy outside the USA (where the majority of the training points are located), having a large enough training data set allows us to produce initial maps of soil types (at a relatively fine resolution of 250 m) even for countries where we basically had no training points at all (Fig. 6). Because we have fully automated the overlay, modeling and spatial prediction, as soon as we obtain more contributed observations of soil types we can update these initial predictions and gradually produce better and more usable / useful maps.

Other LandGIS functionalities in development include:

- **Multi-user**: relevant and useful functionality for various participants in environmental activities, including landowners, community leaders, scientific advisory agencies, commercial contractors, donors and investors.
- **Multi-module**: various activities in environmental management are being integrated, including project discovery, farm due diligence, fundraising, implementation, and automated land management KPI (Key-Performance-Indicator) tracking.
- **Integrations with enterprise IT and social media**: to achieve enhanced security, data protection, and the interoperability features enabling this.
- **Context-based customization and enrichment**: spatially auto-filled user data (saving manual setup effort), peer activity-based features, and location-based alerts for specific conditions (e.g. frost warnings, temperature thresholds etc.).
- Integration of **‘blockchain’ capability for environmental token trading support**.

So, in summary, there is still a need for an OpenLandMap-type system that allows for archiving and sharing environmental variables and land management indicators. With LandGIS, we have shown that new, value-added information can be produced immediately and affordably using *“old legacy data”*. We have demonstrated technology and knowledge transfer opportunities (from data-rich countries to data-poor countries), which we believe is a win-win scenario. We release all code used to generate LandGIS layers as open source, allowing full replicability of state-of-the-art spatial analytics by anyone, including as a basis for commercial services. We have released all our data as open data, allowing anyone, including businesses, to build upon this soil and other environmental data — hopefully in ways we can’t even imagine.

- Gibbs, H. K., & Salmon, J. M. (2015). Mapping the world’s degraded lands. Applied geography, 57, 12–21. https://doi.org/10.1016/j.apgeog.2014.11.024
- Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031
- Hansen, M. C., Potapov, P. V., Moore, R., Hancher, M., Turubanova, S. A. A., Tyukavina, A., … & Kommareddy, A. (2013). High-resolution global maps of 21st-century forest cover change. science, 342(6160), 850–853. https://dx.doi.org/10.1126/science.1244693
- Hengl, T., de Jesus, J. M., Heuvelink, G. B., Gonzalez, M. R., Kilibarda, M., Blagotić, A., … & Guevara, M. A. (2017). SoilGrids250m: Global gridded soil information based on machine learning. PLoS one, 12(2), e0169748. https://doi.org/10.1371/journal.pone.0169748
- Hengl, T., Walsh, M. G., Sanderman, J., Wheeler, I., Harrison, S. P., & Prentice, I. C. (2018). Global mapping of potential natural vegetation: an assessment of machine learning algorithms for estimating land potential. PeerJ, 6, e5457. https://doi.org/10.7717/peerj.5457
Everybody has a right to know what’s happening with the planet: towards a global commons was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Recent advances in the field of AI have opened the doors to a plethora of smart and personalized methods of fraud detection. What was once an area that required a great deal of manual labor is now one of the many transformed by machine learning. Nevertheless, it is important to note that this technology, like any other, is not perfect and is susceptible to problems, and with so many individuals, enterprises, and services offering solutions based on predictive techniques to detect fraudulent transactions, we have to take a step back and consider the challenges that might arise.

I believe it is safe to say that most of the transactions are non-fraudulent. While this is excellent, of course, it creates the most significant and prevalent issue in the area of fraud detection: imbalanced data. Data is the most critical component of a machine learning predictive model, and an imbalanced dataset — which in this case is a dataset mostly made of non-fraudulent records — might result in a prediction system that won’t be able to learn about the fraudulent transactions properly. In such a case, the straightforward solution would be to obtain more data; however, in practice, this is either expensive, time-consuming or borderline impossible. Fortunately for us, some procedures and algorithms help with the problem of an imbalanced dataset.

The techniques of over-sampling and under-sampling allow us to modify the class distribution of a dataset. As the names imply, over-sampling is a procedure used to create synthetic data that resembles the original dataset, while the goal of under-sampling is the opposite: removal of data. It is worth noting that in practice over-sampling is more common than under-sampling, which is why you have probably heard more about the former than the latter.

The most popular over-sampling algorithm is *Synthetic Minority Over-sampling Technique*, or **SMOTE**. SMOTE relies on the concept of *nearest neighbors* to create its synthetic data. For example, in the image below we have an imbalanced dataset made of 12 observations and two continuous features, x and y. Of these 12 observations, ten belong to class X and the remaining two to class Y. What SMOTE does is select a sample from the under-represented data, and for each point it computes the K nearest neighbors (for the sake of simplicity, let's assume K=1). Then it takes the vector between the data point and its nearest neighbor and multiplies it by a random value between 0 and 1, resulting in a new vector "a." This vector is our new synthetic data point.

The following gist shows how to perform SMOTE in R.
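The gist itself is not reproduced here (and uses R); to illustrate the idea in the language used elsewhere in this collection, here is a minimal Python/numpy sketch of SMOTE-style interpolation. The data points and function name are made up for the example:

```python
import numpy as np

def smote_sample(minority, n_new, k=1, seed=0):
    """Generate n_new synthetic points from the minority class.

    For each new point: pick a minority observation, find one of its
    k nearest neighbors, and interpolate a random fraction of the way
    toward it (the vector "a" described above).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        point = minority[i]
        # distances to every other minority point
        dists = np.linalg.norm(minority - point, axis=1)
        dists[i] = np.inf                      # exclude the point itself
        neighbor = minority[np.argsort(dists)[rng.integers(k)]]
        gap = rng.random()                     # random value in [0, 1)
        synthetic.append(point + gap * (neighbor - point))
    return np.array(synthetic)

# Two minority (class Y) observations, as in the 12-point example
Y = [[1.0, 2.0], [2.0, 3.0]]
new_points = smote_sample(Y, n_new=4)
```

With K=1 each synthetic point lies on the segment between a minority observation and its single nearest neighbor, which is exactly the geometric picture described above.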

A second concept I’d like to bring to the table is the choice of the predictive system. At first, we might think that a supervised learning system that classifies an action into fraud or not would be the most appropriate approach for this kind of problem. While this might sound attractive, it is important to note that sometimes this is not enough and that other areas such as anomaly detection and unsupervised learning could help to find those noisy and anomalous points that might represent fraud.

Anomaly detection is the field of data mining that deals with the discovery of abnormal and rare events, also known as outliers, that diverge from what is considered normal. At a basic level, these detection approaches are more statistically intensive, as they deal closely with distributions and how far a data point deviates from them. For example, the *upper and lower inner fences*, defined as UF = Q3 + (1.5 * IQR) and LF = Q1 - (1.5 * IQR), where IQR is the interquartile range, are one of the many techniques used to create a barrier that divides the data into what is considered normal and what is considered an outlier.
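The inner-fence rule above is a one-liner to implement. A quick sketch, using made-up transaction amounts for illustration:

```python
import numpy as np

def iqr_fences(values):
    """Return the lower and upper inner fences: Q1 - 1.5*IQR, Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

amounts = [12, 15, 14, 13, 16, 14, 15, 13, 250]  # one suspicious transaction
low, high = iqr_fences(amounts)
outliers = [x for x in amounts if x < low or x > high]  # -> [250]
```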

Let’s explain this with a realistic albeit silly example. Imagine there’s this guy, John. John is an early bird who regularly wakes up around 5:30 am (a real champ); however, last night, while John was celebrating his 32nd birthday, he had a little too much fun and woke up the next morning at 6 am (and felt terrible about it). This “6 am” represents an anomaly in our dataset.

The following graph shows John’s last ten wake-up times, and we can see that the last one seems to be out of place. More precisely, we could say that this point falls far from the distribution of the other nine times; the three-sigma (three standard deviations) rule, a rule commonly applied to detect outliers, shows this. Without the 10th time, the mean and standard deviation would be 318.89 and 2.09 (if we express the time as minutes after 12). However, the 10th time (6 am, or 420) is way beyond the mean plus three times the standard deviation (318.89 + (2.09 * 3) = 325.16), indicating that this time is indeed an anomaly.
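The three-sigma check is easy to reproduce. John's exact nine wake-up times are not given in the article, so the values below are illustrative ones chosen to give a similar mean; the logic is what matters:

```python
import numpy as np

# Illustrative wake-up times (the article's exact data isn't given);
# these nine values have a mean of ~318.89, close to the article's
times = [316, 317, 318, 318, 319, 319, 320, 321, 322]

mean, std = np.mean(times), np.std(times)
upper = mean + 3 * std     # three-sigma threshold

new_time = 420             # the value the article assigns to the 6 am wake-up
is_anomaly = new_time > upper
```

Any observation beyond `mean + 3 * std` (or below `mean - 3 * std`) is flagged; 420 clears the threshold by a wide margin, so it is an anomaly.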

Unsupervised learning, particularly clustering, can also be used to detect anomalous and noisy content in a dataset. This kind of machine learning algorithm works under the assumption that similar observations tend to be grouped into the same clusters, while noisy ones won’t be. Going back to John and his fantastic sleeping pattern: if we cluster his data using *k-means* with k=2, we can observe how the anomalous point falls into a cluster of its own.
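A tiny 1-D k-means sketch shows this separation; the wake-up times below are the same illustrative values as before (not the article's actual data), with the anomalous 420 appended:

```python
import numpy as np

def kmeans_1d(values, iters=20):
    """Minimal Lloyd's algorithm for 1-D data with k=2,
    initialized at the min and max of the data."""
    values = np.asarray(values, dtype=float)
    centroids = np.array([values.min(), values.max()])
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([values[labels == j].mean() for j in range(2)])
    return labels

wake_times = [316, 317, 318, 318, 319, 319, 320, 321, 322, 420]
labels = kmeans_1d(wake_times)
# the nine normal times share one cluster; 420 ends up alone in the other
```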

Lastly, the choice of performance metric plays a significant role when training our system. A standard score like accuracy won’t be of much use if the dataset is imbalanced. For example, suppose that only 2% of the test dataset consists of actual fraudulent transactions. If the model classifies all of those cases as non-fraudulent, the accuracy would be 98%: a good-looking number, but meaningless in this case. Some metrics that would be more appropriate for this use case are *precision*, *recall*, and *Cohen’s Kappa coefficient*. Moreover, the type of error (*false positives* or *false negatives*) that we are optimizing for is also of great importance. There will be cases in which we should favor a higher false positive rate in exchange for a lower false negative rate, and the other way around.

The demand to combat fraud, scam, and spam activities will always be there. Even with all the recent advances and breakthroughs in AI, there will be difficulties to overcome during our problem-solving quests. In this article, I talked about three of them: the lack of a balanced dataset, the choice of predictive system, and the selection of an appropriate evaluation metric, and offered some pointers and options to consider with the goal of improving the quality of our predictions and detections.

An appendix with the code used to generate the images is available on my GitHub.

juandes/fraud-challenges-appendix

Thanks for reading.

Juan De Dios Santos (@jdiossantos) | Twitter

Overcoming challenges when designing a fraud detection system was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

The goal of the neural network is to classify the input patterns according to the above truth table. If the input patterns are plotted according to their outputs, it is seen that these points are not linearly separable. Hence the neural network has to be modeled to separate these input patterns using *decision planes.*

As mentioned before, the neural network needs to produce two different decision planes to linearly separate the input data based on the output patterns. This is achieved by using the concept of *hidden layers*. The neural network will consist of one input layer with two nodes (X1,X2); one hidden layer with two nodes (since two decision planes are needed); and one output layer with one node (Y). Hence, the neural network looks like this:

To implement an XOR gate, I will be using a Sigmoid Neuron as nodes in the neural network. The characteristics of a Sigmoid Neuron are:

1. Can accept real values as input.

2. The value of the activation is equal to the weighted sum of its inputs, i.e. ∑ wi·xi

3. The output of the sigmoid neuron is given by the sigmoid function, also known as the logistic function. The sigmoid function is a continuous function which outputs values between 0 and 1:

The information of a neural network is stored in the interconnections between the neurons i.e. the weights. A neural network learns by updating its weights according to a learning algorithm that helps it converge to the expected output. The learning algorithm is a principled way of changing the weights and biases based on the loss function.

- Initialize the weights and biases randomly.
- Iterate over the data

i. Compute the predicted output using the sigmoid function

ii. Compute the loss using the square error loss function

iii. W(new) = W(old) - α ∆W

iv. B(new) = B(old) - α ∆B

- Repeat until the error is minimal

This is a fairly simple learning algorithm consisting of only arithmetic operations to update the weights and biases. The algorithm can be divided into two parts: the *forward pass* and the *backward pass* also known as *“backpropagation.”*

Let’s implement the first part of the algorithm. We’ll initialize our weights and expected outputs as per the truth table of XOR.

import numpy as np

inputs = np.array([[0,0],[0,1],[1,0],[1,1]])

expected_output = np.array([[0],[1],[1],[0]])

Step 1: To initialize the weights and biases with random values

inputLayerNeurons, hiddenLayerNeurons, outputLayerNeurons = 2,2,1

hidden_weights = np.random.uniform(size=(inputLayerNeurons,hiddenLayerNeurons))

hidden_bias =np.random.uniform(size=(1,hiddenLayerNeurons))

output_weights = np.random.uniform(size=(hiddenLayerNeurons,outputLayerNeurons))

output_bias = np.random.uniform(size=(1,outputLayerNeurons))

The forward pass involves computing the predicted output, which is a function of the weighted sum of the inputs given to the neurons:

Where Σwx + b is known as the activation.

def sigmoid(x):

return 1/(1 + np.exp(-x))

hidden_layer_activation = np.dot(inputs,hidden_weights)

hidden_layer_activation += hidden_bias

hidden_layer_output = sigmoid(hidden_layer_activation)

output_layer_activation = np.dot(hidden_layer_output,output_weights)

output_layer_activation += output_bias

predicted_output = sigmoid(output_layer_activation)

This completes a single forward pass, where our predicted_output needs to be compared with the expected_output. Based on this comparison, the weights for both the hidden layers and the output layers are changed using backpropagation. Backpropagation is done using the Gradient Descent algorithm.

The loss function of the sigmoid neuron is the squared error loss. If we plot the loss/error against the weights we get something like this:

Our goal is to find the weight vector corresponding to the point where the error is minimum i.e. the minima of the error gradient. And here is where calculus comes into play.

Error can be simply written as the difference between the predicted outcome and the actual outcome. Mathematically:

where *t* is the targeted/expected output & *y* is the predicted output

However, is it fair to assign different error values for the same amount of error? For example, the absolute difference between -1 and 0 and between 1 and 0 is the same, yet the above formula would sway things negatively for the outcome that predicted -1. To solve this problem, we use the squared error loss. (The modulus is not used, as it is harder to differentiate.) Further, this error is divided by 2 to make it easier to differentiate, as we’ll see in the following steps.
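The equation images from the original post are not included here; written out, the loss just described and its derivative (which is what makes the factor of 1/2 convenient) are:

```latex
E = \frac{1}{2}\,(t - y)^2, \qquad \frac{\partial E}{\partial y} = -(t - y)
```

where, as above, t is the targeted/expected output and y is the predicted output.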

Since there may be many weights contributing to this error, we take the partial derivative with respect to each weight at a time to find the minimum error. The changes in weights are different for the output layer weights (W31 & W32) and for the hidden layer weights (W11, W12, W21, W22).

Let the outer layer weights be wo while the hidden layer weights be wh.

We’ll first find ∆W for the outer layer weights. Since the outcome is a function of activation and further activation is a function of weights, by chain rule:

On solving,

Note that Xo is nothing but the output from the hidden layer nodes.

This output from the hidden layer node is again a function of the activation and correspondingly a function of weights. Hence, the chain rule expands for the hidden layer weights:

Which comes to,

NOTE: Xo can also be considered to be Yh i.e. the output from the hidden layer is the input to the output layer. Xh is the input to the hidden layer, which are the actual input patterns from the truth table.

Let’s then implement the backward pass.

def sigmoid_derivative(x):

return x * (1 - x)

#Backpropagation

error = expected_output - predicted_output

d_predicted_output = error * sigmoid_derivative(predicted_output)

error_hidden_layer = d_predicted_output.dot(output_weights.T)

d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

#Updating Weights and Biases

output_weights += hidden_layer_output.T.dot(d_predicted_output) * lr

output_bias += np.sum(d_predicted_output,axis=0,keepdims=True) * lr

hidden_weights += inputs.T.dot(d_hidden_layer) * lr

hidden_bias += np.sum(d_hidden_layer,axis=0,keepdims=True) * lr

This process is repeated until the predicted_output converges to the expected_output. It is easier to repeat this process a certain number of times (iterations/epochs) rather than setting a threshold for how much convergence should be expected.

Choosing the number of epochs and the value of the learning rate decides two things: how accurate the model is, and how long the model takes to compute the final output. The concept of *hyperparameter tuning* is a whole subject by itself.
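Putting the snippets above together, the complete training loop looks like the sketch below. The learning rate and epoch count are the ones used in this article; the random seed is my addition for reproducibility and was not part of the original code:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # assumes x is already a sigmoid output
    return x * (1 - x)

np.random.seed(42)  # added for reproducibility

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
expected_output = np.array([[0], [1], [1], [0]])

hidden_weights = np.random.uniform(size=(2, 2))
hidden_bias = np.random.uniform(size=(1, 2))
output_weights = np.random.uniform(size=(2, 1))
output_bias = np.random.uniform(size=(1, 1))

lr = 0.1
for epoch in range(10000):
    # forward pass
    hidden_layer_output = sigmoid(np.dot(inputs, hidden_weights) + hidden_bias)
    predicted_output = sigmoid(np.dot(hidden_layer_output, output_weights) + output_bias)

    # backward pass (backpropagation)
    error = expected_output - predicted_output
    d_predicted_output = error * sigmoid_derivative(predicted_output)
    error_hidden_layer = np.dot(d_predicted_output, output_weights.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # update weights and biases
    output_weights += np.dot(hidden_layer_output.T, d_predicted_output) * lr
    output_bias += np.sum(d_predicted_output, axis=0, keepdims=True) * lr
    hidden_weights += np.dot(inputs.T, d_hidden_layer) * lr
    hidden_bias += np.sum(d_hidden_layer, axis=0, keepdims=True) * lr

final_error = float(np.mean(error ** 2))  # mean squared error after training
```

Exact final predictions depend on the random initialization, so they will not match the article's numbers digit for digit, but the error should shrink steadily over the 10,000 epochs.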

The output with epochs = 10,000 and learning rate = 0.1 is:

Output from neural network after 10,000 epochs:

[0.05770383] [0.9470198] [0.9469948] [0.05712647]

Hence, the neural network has converged to the expected output:

[0] [1] [1] [0]. The epoch vs error graph shows how the error is minimized.

- An Introduction to Neural Networks by Kevin Gurney
- https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/
- Thanks to https://www.codecogs.com/latex/eqneditor.php for converting LaTeX Equations to PNGs used in this article.

I would love to hear about any feedback/suggestions! Connect with me here.

Implementing the XOR Gate using Backpropagation in Neural Networks was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

In statistics we are often looking for ways to quantify relationships between factors and responses in real life.

That being said, we can largely divide the responses we want to understand into two types: categorical responses and continuous responses.

For the categorical case, we are looking to see how certain factors influence the determination of which type of response we have out of a set of options.

For example, consider a data set concerning brain tumors. In this case, the factors might include the size of the tumor and the age of the patient, while the response variable would be whether the tumor is benign or malignant.

These types of problems are usually called classification problems and can indeed be handled by a special type of regression called logistic regression.

For the continuous case however, we are looking to see how much our factors influence a measurable change in our response variable.

In our particular example we will be looking at the widely used Boston Housing data set, which can be found in the scikit-learn library.

While this data set is commonly used for predictive multiple regression models, we are going to focus on a statistical treatment to come to understand how features like an increase in room size can come to affect housing value.

The linear regression model is represented as follows:

y = β0 + β1x + ε

where y is the response variable, x is the factor (predictor), β0 and β1 are the intercept and slope coefficients to be estimated, and ε is the random error term.

In order to “fit” a linear regression model, we establish certain assumptions about the random error that would ensure that we have a good approximation of the actual phenomena once these assumptions are applicable.

These assumptions are as follows:

- The errors are independent.
- The errors are normally distributed.
- The errors have zero mean.
- The errors have constant variance.

For the purpose of this model we will be looking at MEDV as the response variable. MEDV is the median value of owner-occupied homes in $1000's

Here we are going to be taking a look at the row of the pairplot that has MEDV on the y-axis. That is, we are going to look at scatter plots of how the housing value relates to all the other features in the dataset.

You can find a full description of all the names of the dataset here:

https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

One particularly linear relationship that stands out is that of the average number of rooms per dwelling (RM).

Here’s a closer look:

The simplest form of linear regression consists only of the intercept and a single factor term.

In this case we will fit the model using the least squares method.

When we say we ‘fit’ the model, we mean that we find the estimates of the factor terms that best suit the given data. In the regression case, that means we find the coefficients that minimize the distance between our regression line and each observed data point.

That is, we minimize the squared errors, represented mathematically as:

SSE = Σ (yᵢ − ŷᵢ)²

where Σ denotes the sum over all the rows in the data set, yᵢ is the observed response, and ŷᵢ is the value predicted by the regression line.

Here’s how we fit the model using the Python library statsmodels.

We first import the library:

import statsmodels.api as sm

We are now ready to fit:

Notice how we have to add in a column of ones called the ‘intercept’.

This is due to the fact that we can rewrite our model using linear algebra so that:

So the ones column is the first column in our X matrix so that when we multiply by the factor coefficients vector we get our intercept value in each equation.
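Under the hood, the least squares fit just solves this linear-algebra problem. A numpy sketch with made-up data (not the Boston set; the true intercept and slope are chosen as 2 and 3 so we can check the recovery):

```python
import numpy as np

# synthetic stand-ins for RM and MEDV, with known true coefficients
rng = np.random.default_rng(0)
n = 200
rooms = rng.uniform(4, 8, size=n)
medv = 2.0 + 3.0 * rooms + rng.normal(0, 0.5, size=n)  # intercept 2, slope 3

# design matrix X: a column of ones (the "intercept") next to the factor,
# so that X @ beta yields intercept + slope * rooms for every row
X = np.column_stack([np.ones(n), rooms])

# least squares minimizes ||y - X beta||^2; lstsq solves it stably
beta_hat, *_ = np.linalg.lstsq(X, medv, rcond=None)
intercept, slope = beta_hat
```

The estimates land close to the true (2, 3), which is exactly what statsmodels reports in its `coef` column, along with the inferential extras (standard errors, p-values, R-squared) discussed next.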

This form is what is used to extend to the multiple regression case, but we won’t be extensively covering that math in this article.

**coef:** These are the estimates of the factor coefficients. Oftentimes it does not make sense to interpret the intercept term. For instance, in our case, the intercept term covers the case where the house has 0 rooms, and it doesn’t make sense for a house to have no rooms.

On the other hand, the RM coefficient has a lot of meaning. It suggests that for each additional room, you could expect an increase of $9,102 in the median value of owner-occupied homes.

**P>|t|:** This is a two-tailed hypothesis test where the null hypothesis is that RM has no effect on MEDV. Since the p-value is so low that it is approximately zero, there is strong statistical evidence to reject the claim that RM has no effect on MEDV.

**R-squared:** This is the proportion of variance explained by the model and is often considered a measure of how well it fits. However, it can be artificially inflated by adding more factors even if they are not significant. For this reason we must also consider the adjusted R-squared, which adjusts the calculation for the number of factors. In simple linear regression, however, the two are the same.

Seeing as we don’t have much else to do, why not just throw in all of ‘em

(Disclaimer: maybe you shouldn’t try that at home…..especially if you have thousands of features….)

- Except CHAS, since it is binary.

Okay…I apologize for how long that summary table was :/

Note that the R-squared and adjusted R-squared increased dramatically. Notice, however, that there is now a feature with strong statistical evidence to support the claim that it does NOT affect MEDV.

That feature is AGE and we will be removing it from our model and refitting.

We will continue to trim and compare until we have all statistically significant features.

As we can see, the columns Age and Indus were not contributing to the fit of the model and are better left out.

**Prob(F-statistic):** This is the p-value associated with the test of the significance of the overall model. In this case the null hypothesis is that the model is not significant overall. Since our value is way below 0.01, we can reject the null hypothesis in favour of the alternative: the model is statistically significant.

***NOTE***

For Multiple linear regression, the beta coefficients have a slightly different interpretation.

For example, the RM coef suggests that for each additional room, we can expect a $3,485 increase in the median value of owner-occupied homes, **all other factors remaining the same**. Also note that the actual value of the coefficient has changed from the simple regression case.

At this point you may be wondering what we could do to improve the fit of the model. Due to the fact that the adj. R-squared is 0.730, there is certainly room for improvement.

One thing we could do is look for new features to add to the data set that may be important to our model.

Apart from that, we would also need to test to see if it was ever actually reasonable to apply linear regression in the first place.

For that we look to residual plots.

The benefit here is that the residual is a good estimate of the random error within the model. That being said, we can plot the residuals against the feature (or the predicted values) to get a feel for the distribution of the residuals.

If the model assumptions hold, we would expect to see a completely scattered diagram with points ranging within some constant band. This would imply that the residuals are independent, normally distributed, with zero mean and constant variance.

However in our SLR case with RM we have

Here we can see that there is some hint of a non-linear pattern as the residual plot seems to be curved at the bottom.

This is a violation of our model assumptions, where we assumed the errors to have constant variance.

In fact, in this case it seems the variance is changing according to some other function. We will use the **Box-Cox** method to deal with this issue in a moment.

We can further check our normality assumption by creating a qqplot. For a normal distribution a qqplot will tend toward a straight line.

Note however that this line is clearly skewed.

This is a process whereby we find the maximum likelihood estimate of the most appropriate transformation to apply to our response variable so that our data would have constant variance.

Implemented using scipy.stats, we got a lambda value of 0.45, which we can round to 0.5 since it won’t make a huge difference in terms of fit, but it will make our answer more interpretable.

import scipy.stats as stats

# Box-Cox is applied to the response variable (MEDV), as described above;
# only the fitted lambda is needed here
_, fitted_lambda = stats.boxcox(df['MEDV'])

Below you will find a table of common lambda values and their suggested response transformations.

After this we transform our response accordingly:

df['MEDV2'] = np.sqrt(df['MEDV'])

And then we fit our model to MEDV2 which is the square root of MEDV.

Both our R-squared and adjusted R-squared values went up by quite a bit with the same amount of features. Kinda cool right?

Additionally, there has also been some improvement in our residual plots:

Further, we could also try applying our transformation to **each** of the the feature variables

X = df[['intercept', 'CRIM', 'ZN', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]

X2 = X.apply(np.sqrt)

We then fit our model consisting of X2 as our feature matrix and MEDV2 as our response variable.

Here we see even more improvement in these values with the **same** number of features.

Our final residual plot looks like this:

So there is clearly still some room for improvement, but we’ve definitely made some progress.

There are many more parts to the statistical understanding of the regression model. For instance, we could compute confidence intervals around each beta coefficient or even around our predicted values.

Further, regarding *CHAS*, it would have been fine to include it due to its {1,0} encoding. For a categorical variable with more options it would be difficult or impossible to interpret the beta coefficient unless there was a natural ordering to the options. When we fit our final model with CHAS we get an adjusted R-squared of 0.794. The interpretation of its beta coefficient of 0.077 in this case would be:

If the house tract bounds the Charles river, then you can expect an increase of 0.077 in the square root of the median value of owner-occupied homes (in $1000's) over a house that does not bound the river, all other factors remaining the same. (Since the response was transformed, the effect in dollars depends on the home’s baseline value rather than being a flat amount.)

I hope that I was able to get across the idea that regression can do more than just develop predictions based on certain features. In fact, there is a whole world of regression analysis dedicated to using these techniques to gain deeper understanding between variables in the real world around us.

Statistical Overview of Linear Regression (Examples in Python) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Excel is a very popular tool in many companies, and Data Analysts and Data Scientists alike often find themselves making it part of their daily arsenal of tools for data analysis and visualization, but not always by choice. This was certainly my experience at my first job as a Data Analyst, where Excel was part of everyone’s workflow.

My team was using Excel’s data tool, Power Query, to aggregate and manipulate CSV files as well as connect to our database. Eventually, this data would be displayed as a pivot table or dashboard and shared with the rest of the company. Almost every report I got my hands on lived on Excel, and it didn’t take me long to realize that this was a big issue.

Just to paint you a picture, here are some of the things I heard my co-workers say multiple times about Excel, and that I eventually started saying myself:

“It crashed again!!”

Refreshing data in Excel reports was a daily task, and sometimes it would be the only one we could perform at a time. Even though our computers had decent hardware, we knew that as soon as we opened other programs (did someone say Chrome?) while Excel was refreshing, it was almost guaranteed to crash.

“It is still refreshing…”

Not only could we not use other applications while Excel was refreshing, but some of our reports would take 30 minutes or even a few hours to finish. Yes, Excel loved holding our computers hostage!

“We can’t load that much data.”

Our biggest frustration with Excel was not being able to load as much data as we needed. Everyone in the company was demanding more and we simply couldn’t deliver.

It was clear that something needed to be done. We were wasting too much time working around these issues, and there was very little time left to do any actual analysis or forecasting. Luckily, I was fluent in Python and its tools for manipulating CSV files, so my team and I began the long-overdue task of optimizing our reports.

Since we needed to keep reporting in Excel, and there was no budget for a BI tool, we decided to use Python to do all of the heavy lifting and let Excel take care of displaying the data. So with the help of Python and Windows Task Scheduler, we automated the entire process of gathering our data, cleaning it, saving the results, and refreshing the Excel reports.

Since everyone’s workflow is different, and I want to make this article as useful as possible, I will keep things high level and include the links to some great tutorials in case you are looking to dig deeper. Keep in mind that some of these tips may only work on a Windows machine, which is what I was using at the time.

Using the ftplib module in Python, you can connect to an FTP server and download files into your computer. This was a module I used almost daily, since we were receiving CSV reports from an outside source. Here is some sample code:
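The original sample gist is not embedded here; a minimal sketch using the standard-library ftplib might look like this (the server details, file names, and function name are placeholders):

```python
from ftplib import FTP

def download_report(host, user, password, remote_name, local_path):
    """Connect to an FTP server and download one file to disk."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        with open(local_path, "wb") as f:
            # RETR is the FTP "retrieve" command; each chunk goes to f.write
            ftp.retrbinary(f"RETR {remote_name}", f.write)

# download_report("ftp.example.com", "user", "secret", "daily.csv", "daily.csv")
```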

To learn more about FTP servers and how to use ftplib check out this tutorial.

Using the pyodbc module in Python, you can easily access ODBC databases. In my case, I used it to connect to Netsuite and extract data using SQL queries. Here is some sample code:
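Again, the original gist is not embedded; as an illustration, a pyodbc query might be wrapped like this (the connection string, table, and column names are placeholders, and pyodbc is imported inside the function so the sketch can be defined without the driver installed):

```python
def fetch_orders(connection_string, start_date):
    """Run a SQL query over an ODBC connection and return all rows."""
    import pyodbc  # third-party: pip install pyodbc

    conn = pyodbc.connect(connection_string)
    try:
        cursor = conn.cursor()
        # "?" is pyodbc's parameter placeholder
        cursor.execute(
            "SELECT id, amount, status FROM orders WHERE created_at >= ?",
            start_date,
        )
        return cursor.fetchall()
    finally:
        conn.close()
```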

Just note that you’ll need to have the appropriate ODBC driver installed in order for the module to work correctly. For more information check out this tutorial.

Using the pandas module in Python, you can manipulate and analyze data very easily and efficiently. This one is without a doubt one of the most valuable tools I possess. Here is some sample code:
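The original gist is not embedded; as a small self-contained illustration (the CSV content below is made up), here is the kind of clean-and-aggregate step pandas makes trivial:

```python
import io
import pandas as pd

# an inline stand-in for one of the CSV reports
csv_data = io.StringIO(
    "date,region,sales\n"
    "2019-01-01,North,100\n"
    "2019-01-01,South,\n"        # note the missing sales value
    "2019-01-02,North,150\n"
    "2019-01-02,South,90\n"
)

df = pd.read_csv(csv_data, parse_dates=["date"])
df["sales"] = df["sales"].fillna(0)            # replace missing sales with 0
totals = df.groupby("region")["sales"].sum()   # aggregate, pivot-table style
```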

This tutorial is a great place to get started with pandas. If you are dealing with large files then you might want to also check out this article on Using pandas with Large Data Sets. It helped me to reduce my memory usage by a lot.

Using the win32com module in Python, you can open up Excel, load a workbook, refresh all data connections and then save the results. Here is how that’s done:
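The original gist is not embedded; a sketch of the refresh step might look like this (Windows-only, the path is a placeholder, and win32com is imported inside the function so the sketch can be defined without pywin32 installed):

```python
def refresh_workbook(path):
    """Open an Excel workbook, refresh all data connections, save, and quit."""
    import win32com.client  # third-party: pip install pywin32

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    workbook = excel.Workbooks.Open(path)
    try:
        workbook.RefreshAll()
        # wait for asynchronous queries to finish before saving
        excel.CalculateUntilAsyncQueriesDone()
        workbook.Save()
    finally:
        workbook.Close()
        excel.Quit()

# refresh_workbook(r"C:\reports\sales.xlsx")
```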

I haven’t yet stumbled upon any good tutorials for the win32com module, but this Stack Overflow thread might be a good starting point.

With the help of Windows Task Scheduler you can run your python scripts at prescribed times and automate your work. Here is how you can do that:

Launch Task Scheduler, and find the **Create Basic Task** action, located under the **Actions** pane.

Clicking on **Create Basic Task** opens a wizard where you define the name of your task, the *trigger* (when it runs), and the *action* (what program to run). The screenshot below shows the **Action** tab where you specify the name of your Python script to run as well as any arguments to the script.

For more details on creating a task, check out this tutorial.

By introducing Python into the equation, my team and I were able to dramatically reduce the time we spent processing data. Also, including historical data in our analysis was no longer an unattainable task. These improvements not only freed us to think more analytically but also let us spend more time collaborating with other teams.

I hope you found this article useful. If you have any questions or thoughts, I’ll be happy to read them in the comments :)

When Excel isn’t enough: Using Python to clean your Data, automate Excel and much more… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

The thought of a plane crash gives me the creeps because I need to fly home regularly to visit my family. Recently there was a tragic Ethiopian Airlines accident in which all passengers died. If you are interested in the details of the crash, you can find them here:

Ethiopian Airlines crash: Six charts on what we know

If such a crash happens, it is apparently highly unlikely that anyone survives the ordeal. I want to warn you upfront: this article contains some alarming statistics regarding flying, so if you are prone to anxiety, I would advise against reading the whole article. But there are also some positive things to be said about flying that may ease the tension.

So, recently, I was flying from Croatia to Germany. At departure it was, of course, tranquil sunny weather coupled with some nasty winds. Before boarding my flight I was already thinking about the Ethiopian Airlines plane crash, which is really not the mindset to be in when boarding an international flight. What I was concretely thinking about was Bayes’s theorem with regard to the airplane fatality rate; in other words, what is the probability that you are going to die by boarding a plane vs. by driving a car?

The reason I thought about this is the classical claim that, statistically, flying is safer than driving. Let me tell you, it sure doesn’t feel like it. You look out the window of your plane and see those wings basically flapping, not looking robust at all. Of course, this was well thought through by mechanical engineers who took into account the forces on the wings; their flexibility makes them more robust and less prone to breaking. So flexible, not-robust-looking wings are a good thing. I need not state the obvious, but I’ll state it anyway: I am not a mechanical engineer. I am, however, equipped with the ability to use Google and, luckily, I learned some physics at university. But despite all that, I can’t get rid of the bias in my brain: robust stuff shouldn’t move like that!

A car ride is relatively smooth sailing; there is not much going on most of the time. A plane ride, on the other hand, is blessed with a fun thing called turbulence. If you fly regularly, chances are you have experienced some turbulence at some point in your life, and you know it is not a good feeling.

Still nestled safely on the ground, I thought about the classical example of cancer tests in Bayesian statistics. I won’t explain Bayesian statistics in detail here, but in a nutshell, it deals with conditional probabilities, i.e. the probability of something happening given that something else has happened. What makes the cancer test example so interesting is that, once you account for how rare the disease is, the probability that you actually have cancer given a positive test is considerably lower than the test’s accuracy would suggest. A more detailed explanation of Bayesian statistics with the cancer example can be found here:

An Intuitive (and Short) Explanation of Bayes' Theorem

What I was concretely interested in is the probability that I am going to die by taking a plane vs. the probability that I am going to die by driving a car (excuse me for sounding a bit grim):

Fortunately, I was thinking about this while experiencing some of the worst turbulence of my life; thank you, Dalmatia, for the Bora (a strong north wind on the Adriatic Sea). To solve the problem I looked up some statistics on crashes and fatalities for both planes and cars. I am not allowed to reproduce the graphs, but I will share the relevant statistics through links.

To make things a bit easier, because this is a quick thought experiment and by no means a full risk evaluation of flying vs. driving, I decided to look only at the statistics for 2018.

So, the first thing we need is the number of global airplane fatalities in 2018; the statistics are available here:

Worldwide air traffic - fatalities 2018 | Statistic

Since we are looking at 2018, the number of fatalities is 561. On a positive note, I want you to remember the following:

The fatality rate, i.e. the number of fatalities divided by the number of flights, is decreasing

How do I know this? Well, here is the statistic for the number of flights per year:

Airline industry worldwide - number of flights 2019 | Statistic

Notice that the number of flights per year is steadily increasing. Comparing this to the previous fatality graph, we can’t really claim that the yearly number of fatalities is steadily decreasing. But since the yearly number of flights is increasing, the fatality rate is effectively decreasing. The wiggly nature of the number of fatalities is understandable, since a plane crash is a rare event.

On one more positive note:

Not all plane crashes result in fatalities

For 2018, we found that the number of flights was approximately 39.8 million. I think it is safe to assume that the average flight carries 300 passengers (this is of course not exact, but it makes things simpler to calculate and is probably not that far from the truth). So the number of passengers on all of those flights is the following:

So, what is the probability of being a casualty on a single flight? Here is a rough estimate:
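Plugging in the numbers (the 300-passenger average is the assumption above, so this is only a back-of-the-envelope estimate):

```python
fatalities_2018 = 561          # worldwide airline fatalities in 2018
flights_2018 = 39.8e6          # flights in 2018
passengers_per_flight = 300    # assumption from the text

# Total passenger journeys and the per-journey probability of dying
total_passengers = flights_2018 * passengers_per_flight
p_plane = fatalities_2018 / total_passengers

print(f"{total_passengers:.3g} passenger journeys")  # ~1.19e10
print(f"p_plane = {p_plane:.2g}")                    # ~4.7e-08
```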

Quite small, no? Don’t get your hopes up just yet. Let us take a look at the number of car rides, based on the following source:

Road Safety Facts - Association for Safe International Road Travel

Approximately 1.25 million car accident deaths per year: this is quite a lot… Another statistic shows that there are 1.2 billion cars on the roads in the world today… 1.2 billion! On a side note, just imagine the pollution that causes; that alone is worthy of a statistical analysis by itself. Again, for mathematical simplicity, let us assume there are 2 passengers per car on average. Also, let us assume that, on average, a car is used twice a day, since most people commute to work and back with them. This is a strong and probably wrong assumption; I would guess the true average is below 2, since many people use cars only when they travel long distances. Anyway, the number we are looking for is the number of person-rides in cars within a year:

Now we can derive the probability that you would be a casualty in a car ride:
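With the assumptions above (2 passengers per car, 2 rides per day), the back-of-the-envelope version looks like this. Note that with these exact inputs the ratio comes out around 15; the article’s "approximately 16" reflects rounding:

```python
road_deaths_per_year = 1.25e6  # worldwide car accident deaths per year
cars = 1.2e9                   # cars on the road worldwide
passengers_per_car = 2         # assumption from the text
rides_per_day = 2              # assumption from the text

person_rides_per_year = cars * passengers_per_car * rides_per_day * 365
p_car = road_deaths_per_year / person_rides_per_year

p_plane = 561 / (39.8e6 * 300)  # per-flight estimate from earlier
print(f"p_car = {p_car:.2g}")            # ~7.1e-07
print(f"ratio = {p_car / p_plane:.1f}")  # ~15.2
```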

How does this compare to flights? Well, you can see it is somewhat higher: it is approximately 16 times more probable that you will die in a car accident than in a plane crash. 16 times sounds quite scary if you hear just this number and nothing else. The ratio can change considerably based on the correct expected values for passengers: if the expected number of passengers per car is higher than 2, the ratio becomes lower. Concretely, if the expected number of passengers is 4, the ratio becomes 8. But disregard the ratio; what you are supposed to see is that **both** probabilities are extremely low. There is no reason to be scared of either planes or cars.

Of course, the more you use your car, the higher the probability that you will have a tragic accident. Needless to say, the probability also depends on other factors, such as where you are driving, your driving experience, the length of the journey and the quality of the car itself. Nevertheless, let us pose another interesting question: how many car rides or plane rides would you need to take for the probability that you will die to converge to 1? We can see this as the probability of surviving converging to 0. This only happens with infinitely many trips, so let us instead consider the point at which your probability of survival, based on the statistics, drops to 1%. Those are quite slim chances of survival.

So, you need to make 9.6 million car trips for your survival chances to drop to 1%. What does this calculation look like for airplanes? Repeating the same procedure, you arrive at this conclusion:
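The underlying formula: if p is the per-trip risk, the survival probability after N trips is (1 − p)^N, so the N at which it drops to a target s is N = ln(s) / ln(1 − p). A quick sketch with the per-trip estimates derived above; the exact millions differ somewhat from the article’s figures, because they depend on how the per-trip probabilities are rounded:

```python
import math

def trips_for_survival(p_per_trip, survival):
    """N such that (1 - p)^N equals the target survival probability."""
    return math.log(survival) / math.log(1 - p_per_trip)

p_car = 1.25e6 / (1.2e9 * 2 * 2 * 365)  # per-ride estimate from earlier
p_plane = 561 / (39.8e6 * 300)          # per-flight estimate from earlier

print(f"car trips for 1% survival:   {trips_for_survival(p_car, 0.01):,.0f}")
print(f"plane trips for 1% survival: {trips_for_survival(p_plane, 0.01):,.0f}")
```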

You need to make over 93 million trips by plane in order for your survival rate to drop to 1%. In fact, the survival rate looks something like this as a function of N (based on the above statistics):
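A numerical sketch of that survival curve, using the per-flight estimate from earlier; even at N = 1,000,000 the survival probability is still about 0.954:

```python
p_plane = 561 / (39.8e6 * 300)  # per-flight estimate from earlier

# Probability of surviving N flights: (1 - p)^N
for n in (1_000, 100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"N = {n:>11,}: survival = {(1 - p_plane) ** n:.4f}")
```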

It takes quite a few trips for the probability of survival to drop; in fact, for N = 1,000,000 the probability of survival is still over 95%! This is kind of reassuring. You know when you are going to make a million airplane trips? **Never**. If you take 1 flight a day, it would take you:
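The arithmetic behind that claim:

```python
# One flight per day: how long does a million flights take?
years = 1_000_000 / 365
lifetimes = years / 100  # using a generous 100-year lifetime

print(f"{years:,.0f} years = about {lifetimes:.0f} lifetimes")  # 2,740 years, 27 lifetimes
```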

So, even if you lived to be over 100, it would take you about 27 lifetimes of daily flying for your probability of surviving all those flights to drop to 95%. This is quite reassuring; I feel better about airplanes now…

Just so I don’t get too positive too early, I always remind myself of Murphy’s law:

Whatever can happen, will happen

Obviously, plane and car crashes do happen sometimes, but we can be sure that they are highly improbable. If you want to know more about the effects of highly improbable events, I suggest you read The Black Swan by Nassim Taleb, a very interesting read:

All in all, based on the rough statistics I’ve worked through in this article, you should feel safe using either a car or a plane for traveling. But don’t forget, it is all randomness and luck.

Should you Fly or Should you Drive? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
