How to use data science principles to improve your Search Engine Optimisation efforts

There are articles out there on this already and a lot of them address the ‘what’ and the ‘why’, but there is a gap in the ‘how’. Here is my attempt to address that.

Sandy Lee
Towards Data Science

--

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Search Engine Optimisation (SEO) is the discipline of using knowledge of how search engines work to build websites and publish content that can be found on search engines by the right people at the right time.

Some people say that you don’t really need SEO, taking a Field of Dreams ‘build it and they shall come’ approach. Yet the SEO industry is predicted to be worth $80 billion by the end of 2020, so there are at least some people who like to hedge their bets.

An often-quoted statistic is that Google’s ranking algorithm contains more than 200 factors for ranking web pages, and SEO is often seen as an ‘arms race’ between its practitioners and the search engines, with people looking for the next ‘big thing’ and sorting themselves into tribes (white hat, black hat and grey hat).

There is a huge amount of data generated by SEO activity and its plethora of tools. For context, the industry-standard crawling tool Screaming Frog has 26 different reports filled with web page metrics on things you wouldn’t even think are important (but are). That is a lot of data to munge and find interesting insights from.

The SEO mindset also lends itself well to the data science ideal of munging data and using statistics and algorithms to derive insights and tell stories. SEO practitioners have been poring over all of this data for two decades, trying to figure out the next best thing to do and to demonstrate value to clients.

Despite access to all of this data, there is still a lot of guesswork in SEO. While some people and agencies test different ideas to see what performs well, a lot of the time it comes down to the opinion of the person on the team with the best track record and overall experience.

I’ve found myself in this position a lot in my career, and it is something I would like to address now that I have acquired some data science skills of my own. In this article, I will point you to some resources that will allow you to take a more data-led approach to your SEO efforts.

SEO Testing

One of the most often asked questions in SEO is ‘We’ve implemented these changes on a client’s website, but did they have an effect?’. This often leads to the reasoning that if the website traffic went up ‘it worked’, and if the traffic went down it was ‘seasonality’. That is hardly a rigorous approach.

A better approach is to put some maths and statistics behind the question and analyse it with a data science mindset. A lot of the maths and statistics behind data science concepts can be difficult, but luckily there are plenty of tools out there that can help, and I would like to introduce one made by Google called Causal Impact.

The Causal Impact package was originally an R package; however, there is a Python version if that is your poison, and that is what I will be going through in this post. To install it in your Python environment using Pipenv, use the command:

pipenv install pycausalimpact

If you want to learn more about Pipenv, see a post I wrote on it here; otherwise, Pip will work just fine too:

pip install pycausalimpact

What is Causal Impact?

Causal Impact is a library used to make predictions on time-series data (such as web traffic) in the event of an ‘intervention’, which could be something like campaign activity, a new product launch or an SEO optimisation that has been put in place.

You supply two time series to the tool. One could be clicks over time for the part of the website that experienced the intervention; the other acts as a control, which in this example would be clicks over time for a part of the website that didn’t experience the intervention.

You also supply the date when the intervention took place, and the tool trains a Bayesian structural time-series model on the data. This model uses the control group as a baseline to predict what the intervention group would have looked like if the intervention hadn’t taken place.
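
To make that concrete, here is a minimal, hypothetical sketch of the shape of data the Python library expects: the intervention (response) series in the first column, the control series in the second, and pre/post windows defined by the intervention date. All of the numbers, column names and dates below are made up purely for illustration.

import numpy as np
import pandas as pd

# Made-up daily click counts: the first column is the group of pages that
# received the SEO change, the second is the untouched control group.
rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=120, freq="D")
control = 1000 + rng.normal(0, 50, size=120)
intervention = 0.9 * control + rng.normal(0, 30, size=120)
intervention[90:] += 150  # pretend the change landed on day 90 and helped

data = pd.DataFrame(
    {"intervention_clicks": intervention, "control_clicks": control},
    index=dates,
)

# The intervention date splits the series into a pre and a post window.
pre_period = ["2020-01-01", "2020-03-30"]
post_period = ["2020-03-31", "2020-04-29"]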

The original paper on the maths behind it is here; however, I recommend watching the video below by a guy at Google, which is far more accessible:

Implementing Causal Impact in Python

After installing the library into your environment as outlined above, using Causal Impact with Python is pretty straightforward, as can be seen in the notebook below by Paul Shapiro:

Causal Impact with Python

After pulling in a CSV with the control group data and the intervention group data, and defining the pre/post periods, you can train the model by calling:

ci = CausalImpact(data[data.columns[1:3]], pre_period, post_period)
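
For context, that one-liner assumes the import and the surrounding setup have already happened. Here is a minimal sketch of what that setup might look like; the file name, column layout and period boundaries are placeholders of my own rather than details from the notebook:

import pandas as pd
from causalimpact import CausalImpact

# Hypothetical CSV layout: a date column, then the intervention (test)
# series and the control series as columns 1 and 2.
data = pd.read_csv("seo_test_data.csv")

# With a plain integer index, the pre/post periods are row positions:
# here, rows 0-89 are before the change and rows 90-119 are after it.
pre_period = [0, 89]
post_period = [90, 119]

ci = CausalImpact(data[data.columns[1:3]], pre_period, post_period)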

That call trains the model and runs the predictions. If you then run the command:

ci.plot()

You will get a chart that looks like this:

Output after training the Causal Impact Model

There are three panels here: the first shows the intervention group alongside the model’s prediction of what would have happened without the intervention.

The second panel shows the pointwise effect, which means the difference between what happened and the prediction made by the model.

The final panel shows the cumulative effect of the intervention as predicted by the model.

Another useful command to know is:

print(ci.summary('report'))

This prints out a full report that is human readable and ideal for summarising and dropping into client slides:

Report output for Causal Impact

Selecting a control group

The best way to build your control group is to pick pages that aren’t affected by the intervention, selecting them at random using a method called stratified random sampling.

Etsy has written a post on how they’ve used Causal Impact for SEO split testing, and they recommend this method. Stratified random sampling is, as the name implies, picking from the population at random to build the sample; however, if what we’re sampling is segmented in some way, we try to maintain the same proportions in the sample as in the population for those segments:

Image courtesy of Etsy

An ideal way to segment web pages for stratified sampling is to use sessions as a metric. If you load your page data into Pandas as a data frame, you can use a lambda function to label each page:

df["label"] = df["Sessions"].apply(lambda x:"Less than 50" if x<=50 else ("Less than 100" if x<=100 else ("Less than 500" if x<=500 else ("Less than 1000" if x<=1000 else ("Less than 5000" if x<=5000 else "Greater than 5000")))))

From there, you can use train_test_split in sklearn to build your control and test groups:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    selectedPages["URL"],
    selectedPages["label"],
    test_size=0.01,
    stratify=selectedPages["label"],
)

Note that stratify is set. If you already have a list of pages you want to test, the sample you draw should match the number of pages you want to test. Also, the more pages you have in your sample, the better the model will be; use too few pages and the predictions will be much less accurate.
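
Once you have your two groups of URLs, they still need to be turned into the two time series that Causal Impact expects. Below is a rough, hypothetical sketch that assumes you have a daily clicks-by-page export (for example from Search Console) with Date, URL and Clicks columns, and that treats X_test as the pages receiving the change and X_train as the control (swap them if your split is set up the other way round):

import pandas as pd

# Hypothetical export: one row per URL per day.
clicks = pd.read_csv("daily_clicks_by_page.csv", parse_dates=["Date"])

# X_test / X_train come from the train_test_split call above.
is_test_page = clicks["URL"].isin(X_test)

data = pd.DataFrame({
    "test_clicks": clicks[is_test_page].groupby("Date")["Clicks"].sum(),
    "control_clicks": clicks[~is_test_page].groupby("Date")["Clicks"].sum(),
}).fillna(0)

# data now has the shape Causal Impact expects: the test (response) series
# in the first column and the control series in the second.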

It is worth noting that JC Chouinard gives a good background on how to do all of this in Python, using a method similar to Etsy’s:

Conclusion

There are a couple of different use cases for this type of testing. The first would be to test ongoing improvements using split testing, and this is similar to the approach that Etsy uses above.

The second would be to test an improvement that was made on the site as part of ongoing work. This is similar to an approach outlined in this post; however, you need to ensure your sample size is sufficiently large, otherwise your predictions will be very inaccurate, so please do bear that in mind.

Both are valid ways of doing SEO testing, the former being a type of A/B split test for ongoing optimisation and the latter being a test of something that has already been implemented.

I hope this has given you some insight into how to apply data science principles to your SEO efforts. Do read around these interesting topics and try to come up with other ways to use this library to validate your efforts. If you need background on the Python used in this post, I recommend this course.

--


I am a data scientist and SEO consultant looking for the best ways to apply data science techniques to uncover actionable insights for search.