What do successful people talk about?

A machine learning analysis of the Tim Ferriss Show

Boyan Angelov
Towards Data Science

--

First off, I am a huge fan of Tim Ferriss and his work. Several of his books have been life-changing for me. He is also known for his very popular podcast, “The Tim Ferriss Show”, and I always have at least a few episodes downloaded on my phone. So when he decided to upload transcripts of all his episodes, I was excited. As a data scientist (especially one specializing in NLP), I knew I could do something with them. Here is my analysis.

TL;DR: if you want to skip to the results and explore them on your own, navigate to the interactive website: https://boyanangelov.com/materials/lda_vis.html

First I had to scrape the data from his website. Fortunately, that was quite easy, since the HTML was very well structured. I downloaded and parsed data for 200 episodes. For this, I used some very cool open source libraries in Python.

The first step of any web scraping project is to investigate the HTML structure of the website. All modern browsers include developer tools that can help with this; in my case I used Chrome:

Using the Chrome developer tools to inspect the URLs

The next step is to use several typical scraping and NLP (Natural Language Processing) libraries, including BeautifulSoup and NLTK.
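As a rough illustration of the scraping step, here is a minimal sketch using BeautifulSoup. The HTML snippet and the "transcript" URL pattern are made-up stand-ins; the real markup on Tim's site differs, and the original code is in the gist linked at the end.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a transcripts index page;
# the real markup on tim.blog is different.
html = """
<div class="content">
  <a href="/2018/01/01/episode-1-transcript/">Episode 1</a>
  <a href="/2018/01/08/episode-2-transcript/">Episode 2</a>
  <a href="/about/">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only links whose URL looks like a transcript page
transcript_links = [
    a["href"] for a in soup.find_all("a", href=True)
    if "transcript" in a["href"]
]

print(transcript_links)
```

In a real run you would fetch the index page first (e.g. with `requests`) and pass the response body to BeautifulSoup instead of a hardcoded string.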

The most interesting result is the visualization of the LDA topic model. LDA (Latent Dirichlet Allocation) is a common technique for discovering patterns (topics) in text data. The nice pyLDAvis package allows for interactive visualization, and you can see a screenshot below:
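For readers new to LDA, here is a toy sketch of how such a model can be fit. The four mini-documents, the topic count, and the choice of scikit-learn (rather than whichever implementation the gist uses) are all my assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for episode transcripts
docs = [
    "startup business investing money growth",
    "diet nutrition fasting protein sleep",
    "business company product market investing",
    "training sport recovery nutrition exercise",
]

# Bag-of-words counts, then a 2-topic LDA model
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)

# One row per document, one topic weight per column; rows sum to 1
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)
```

pyLDAvis can then turn a fitted model like this into the interactive topic map shown in the screenshot.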

Here you can see several clusters that characterize the different topics Tim and his guests talk about. Most of the topics are related to business, but there are a few outliers, most notably nutrition and sport.

There are other things you could try that I didn’t have time for. For example, it would be interesting to see which names are mentioned most often in the podcasts, or how lexical complexity differs between episodes.
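As a starting point for those follow-up ideas, here is a standard-library-only sketch: a crude capitalized-word count as a proxy for names (a proper analysis would use NLTK's named-entity tools), and a type-token ratio as a simple lexical complexity measure. The sample text is invented:

```python
import re
from collections import Counter

# Invented stand-in for an episode transcript excerpt
text = ("Tim asked Kevin about meditation. Kevin said meditation "
        "changed his mornings, and Tim agreed.")

words = re.findall(r"[A-Za-z']+", text)

# Crude name proxy: count capitalized words, skipping the very first
# word of the text (a real pipeline would use named entity recognition)
names = Counter(w for w in words[1:] if w[0].isupper())

# Lexical complexity via type-token ratio (unique words / total words)
lower = [w.lower() for w in words]
ttr = len(set(lower)) / len(lower)

print(names.most_common(2))
```

Run per episode, the type-token ratio would let you compare how varied the vocabulary is across guests.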

I hope this short analysis was interesting; feel free to borrow the methods for your own web scraping projects. Let me know what you find out!

The code to reproduce the analysis is in a GitHub Gist:
