Web Scraping Craigslist: A Complete Tutorial

Riley Predum
Towards Data Science
7 min read · Feb 6, 2019


I’ve been looking to make a move recently. And what better way to know I’m getting a good price than to sample from the “population” of housing on Craigslist? Sounds like a job for…Python and web scraping!

In this article, I’m going to walk you through my code that scrapes East Bay Area Craigslist for apartments. The code, or more precisely the URL parameters, can be modified to pull from any region, category, property type, etc. Pretty cool, huh?

I’m going to share GitHub gists of each cell in the original Jupyter Notebook. If you’d like to just see the whole code at once, clone the repo. Otherwise, enjoy the read and follow along!

Getting the Data

First things first, I needed the get function from the requests package. Then I defined a variable, response, and assigned it the result of calling get on the base URL. By base URL I mean the URL of the first page you want to pull data from, minus any extra arguments. I went to the apartments section for the East Bay and checked the “Has Picture” filter to narrow the search down a little, so it isn’t a true base URL.

Next, I imported BeautifulSoup from bs4, which is what actually parses the HTML of the web page retrieved from the server. I then checked the type and length of the resulting posts object to make sure it matches the number of posts on the page (there are 120). You can find my import statements and setup code below:
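
The exact search URL depends on your filters, so treat the one in the sketch below as an assumption; swap in whatever URL your browser shows for your own search.

```python
from requests import get
from bs4 import BeautifulSoup

# base URL: East Bay apartments with the "Has Picture" filter applied
# (the query parameter is an assumption; copy the URL from your browser)
response = get('https://sfbay.craigslist.org/search/eby/apa?hasPic=1')

# parse the raw HTML returned by the server
html_soup = BeautifulSoup(response.text, 'html.parser')

# each post lives in an <li class="result-row"> element
posts = html_soup.find_all('li', class_='result-row')

print(type(posts))  # bs4.element.ResultSet
print(len(posts))   # 120 posts per results page
```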

It prints out the length of posts, which is 120, as expected.

Using the find_all method on the newly created html_soup variable in the code above, I found the posts. I needed to examine the website’s structure to find the parent tag of the posts. Looking at the screenshot below, you can see that it’s <li class="result-row">. That is the tag for one single post, and it’s literally the box that contains all the elements I grabbed!

Element inspection with Chrome (Ctrl+Shift+C shortcut!)

To scale this up, work in the following order: grab the first post and all the variables you want from it; make sure you know how to access each of them for one post before you loop over the whole page; and make sure you’ve successfully scraped one page before adding the loop that goes through all the pages.

The bs4.element.ResultSet returned by find_all is indexable, so I looked at the first apartment by indexing posts[0]. Surprise, it’s all the code that belongs to that <li> tag!
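
In code, that looks something like this:

```python
# index into the ResultSet to look at the first post
post_one = posts[0]
print(post_one)  # prints everything inside that post's <li class="result-row"> tag
```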

You should have this output for the first post in posts (posts[0]), assigned to post_one.

The price of the post is easy to grab:
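
Here’s a sketch; the result-price class name comes from Craigslist’s markup at the time, so double-check it against what you see in the inspector.

```python
# the price sits in a <span class="result-price"> tag inside the post
post_one_price = post_one.find('span', class_='result-price').text.strip()
print(post_one_price)  # e.g. '$1,895'
```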

.strip() removes whitespace before and after a string

I grabbed the date and time by specifying the attribute ‘datetime’ on class ‘result-date’. By specifying the ‘datetime’ attribute, I saved a step in data cleaning by making it unnecessary to convert this attribute from a string to a datetime object. This could also be made into a one-liner by placing ['datetime'] at the end of the .find() call, but I split it into two lines for clarity.
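
Something like this, again assuming Craigslist’s result-date class name:

```python
# the posting time lives in the 'datetime' attribute of <time class="result-date">
post_one_time = post_one.find('time', class_='result-date')
post_one_datetime = post_one_time['datetime']  # already a clean 'YYYY-MM-DD HH:MM' string
```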

The URL and post title are easy because the ‘href’ attribute is the link and is pulled by specifying that argument. The title is just the text of that tag.
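
A sketch of both grabs, with the result-title class name again taken from Craigslist’s markup:

```python
# the title link holds both the URL and the title text
post_one_title = post_one.find('a', class_='result-title')
post_one_link = post_one_title['href']        # the 'href' attribute is the link
post_one_title_text = post_one_title.text     # the title is just the tag's text
```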

The number of bedrooms and square footage are in the same tag, so I split these two values and grabbed each one element-wise. The neighborhood is the <span> tag of class “result-hood”, so I grabbed the text of that.
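
Here’s roughly what that looks like, assuming the housing text comes back in the form ‘1br - 550ft2 -’:

```python
# bedrooms and square footage share the <span class="housing"> tag
post_one_housing = post_one.find('span', class_='housing').text.split()
post_one_bedrooms = post_one_housing[0].replace('br', '')   # e.g. '1'
post_one_sqft = post_one_housing[2].replace('ft2', '')      # e.g. '550'

# the neighborhood is the text of the <span class="result-hood"> tag
post_one_hood = post_one.find('span', class_='result-hood').text.strip()
```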

The next block is the loop for all the pages for the East Bay. Since there isn’t always information on square footage and number of bedrooms, I built in a series of if statements embedded within the for loop to handle all cases.

The loop starts on the first page, and for each post in that page, it works through the following logic:

I included some data cleaning steps in the loop, like pulling the ‘datetime’ attribute, removing ‘ft2’ from the square footage value and casting it to an integer, and stripping ‘br’ from the number of bedrooms, since that was scraped as part of the same string. That way, I started the analysis with some of the cleaning already done. Elegant code is the best! I wanted to do more, but the code would become too specific to this region and might not work across areas.
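
The full loop is in the repo; the condensed sketch below (continuing from the setup above) captures the same logic, assuming Craigslist paginates with an s= offset parameter in steps of 120 posts:

```python
from time import sleep
from random import randint
import numpy as np

# one list per variable; each post appends one value to each list
post_timing, post_hoods, post_title_texts = [], [], []
bedroom_counts, sqfts, post_links, post_prices = [], [], [], []

# results-page offsets: 0, 120, 240, ... (adjust the stop value to your search)
pages = np.arange(0, 3000, 120)

for page in pages:
    # request the next results page, passing the offset as the 's' parameter
    response = get('https://sfbay.craigslist.org/search/eby/apa?hasPic=1&s='
                   + str(page))
    sleep(randint(1, 5))  # pause between requests to scrape responsibly

    html_soup = BeautifulSoup(response.text, 'html.parser')
    posts = html_soup.find_all('li', class_='result-row')

    for post in posts:
        # skip posts with no neighborhood tag so every list stays the same length
        if post.find('span', class_='result-hood') is None:
            continue

        # the 'datetime' attribute saves parsing the displayed date text
        post_timing.append(post.find('time', class_='result-date')['datetime'])
        post_hoods.append(post.find('span', class_='result-hood').text.strip())

        title = post.find('a', class_='result-title')
        post_title_texts.append(title.text)
        post_links.append(title['href'])

        price_text = post.find('span', class_='result-price').text.strip()
        post_prices.append(int(price_text.replace('$', '').replace(',', '')))

        # square footage and bedrooms are not always listed
        housing = post.find('span', class_='housing')
        parts = housing.text.split() if housing is not None else []
        beds = [p.replace('br', '') for p in parts if 'br' in p]
        feet = [p.replace('ft2', '') for p in parts if 'ft2' in p]
        bedroom_counts.append(int(beds[0]) if beds else np.nan)
        sqfts.append(int(feet[0]) if feet else np.nan)
```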

The code below creates the dataframe from the lists of values!
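
Something like this, where the column names are my own choices:

```python
import pandas as pd

# assemble the scraped lists into a single dataframe
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'post title': post_title_texts,
                        'number bedrooms': bedroom_counts,
                        'sqft': sqfts,
                        'URL': post_links,
                        'price': post_prices})
eb_apts.head(10)
```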

Awesome! There it is. Admittedly, there is still a little bit of data cleaning to be done. I’ll go through that real quick, and then it’s time to explore the data!

Exploratory Data Analysis

Sadly, after removing the duplicate URLs I saw that there were only 120 instances. These numbers will be different if you run the code, since there will be different posts at different times of scraping. There were also about 20 posts that didn’t have bedrooms or square footage listed. Statistically, this isn’t an incredible data set, but I took note of that and pushed forward.
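
The deduplication and type fixes look roughly like this (the column names follow the dataframe sketch above):

```python
# drop rows that share a URL (the same post scraped more than once)
eb_apts = eb_apts.drop_duplicates(subset='URL')

# give the posting time and the numeric columns proper dtypes
eb_apts['posted'] = pd.to_datetime(eb_apts['posted'])
eb_apts[['number bedrooms', 'sqft', 'price']] = eb_apts[
    ['number bedrooms', 'sqft', 'price']].apply(pd.to_numeric)
```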

Descriptive statistics for the quantitative variables

I wanted to see the distribution of prices for the East Bay, so I plotted it. Calling the .describe() method gave me a more detailed look. The cheapest place is $850, and the most expensive is $4,800.
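
A sketch of that plot and summary, with the plotting choices my own rather than the original gist’s:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# distribution of prices across the East Bay
sns.histplot(eb_apts['price'], kde=True)
plt.title('Distribution of East Bay apartment prices')
plt.show()

# descriptive statistics for the quantitative variables
eb_apts.describe()
```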

The next code block generates a scatter plot, where the points are colored by the number of bedrooms. This shows a clear and understandable stratification: we see layers of points clustered around particular prices and square footages, and as price and square footage increase, so does the number of bedrooms.
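
A sketch of that scatter plot (the column names follow the dataframe above, and the palette is an arbitrary choice):

```python
# price vs. square footage, colored by the number of bedrooms
sns.scatterplot(x='sqft', y='price', hue='number bedrooms',
                data=eb_apts, palette='viridis')
plt.title('Price vs. square footage by number of bedrooms')
plt.show()
```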

Let’s not forget the workhorse of Data Science: linear regression. We can call a regplot() on these two variables to get a regression line with a bootstrap confidence interval calculated about the line and shown as a shaded region with the code below. If you haven’t heard of bootstrap confidence intervals, they are a really cool statistical technique that are worth a read.
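
Here’s the sketch; seaborn’s regplot bootstraps the confidence interval for you by default:

```python
# regression line with a bootstrapped confidence interval (shaded region)
sns.regplot(x='sqft', y='price', data=eb_apts)
plt.title('Price vs. square footage')
plt.show()
```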

It looks like we have an okay fit of the line on these two variables. Let’s check the correlations. I called eb_apts.corr() to get these:

Correlation matrix for our variables

As suspected, correlation is strong between number of bedrooms and square footage. That makes sense since square footage increases as the number of bedrooms increases.

Pricing by Neighborhood

I wanted to get a sense of how location affects price, so I grouped by neighborhood, and aggregated by calculating the mean for each variable.

The following is produced with this single line of code: eb_apts.groupby('neighborhood').mean(), where 'neighborhood' is the by= argument and the aggregating function is the mean.

I noticed that there are two North Oaklands: North Oakland and Oakland North, so I recoded one of them into the other like so:

eb_apts['neighborhood'].replace('North Oakland', 'Oakland North', inplace=True)

Grabbing the price and sorting it in ascending order can show the cheapest and most expensive places to live. The full line of code is now: eb_apts.groupby('neighborhood').mean()['price'].sort_values() and results in the following output:

Average price by neighborhood sorted in ascending order

Lastly, I looked at the spread of each neighborhood in terms of price. By doing this, I saw how prices in neighborhoods can vary, and to what degree.

Here’s the code that produces the plot that follows.
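
(Sketched here with a seaborn boxplot; the figure size is an arbitrary choice.)

```python
# spread of prices within each neighborhood
plt.figure(figsize=(12, 10))
sns.boxplot(x='price', y='neighborhood', data=eb_apts)
plt.title('Price spread by neighborhood')
plt.show()
```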

Berkeley had a huge spread. This is probably because it includes South Berkeley, West Berkeley, and Downtown Berkeley. In a future version of this project, it may be worth narrowing the scope of the neighborhood variable so it better reflects the variability in price between neighborhoods within each city.

Well, there you have it! Take a look at this the next time you’re in the market for housing to see what a good price should be. Feel free to check out the repo and try it for yourself, or fork the project and do it for your city! Let me know what you come up with!

Scrape responsibly.

If you learned something new and would like to pay it forward to the next learner, consider donating any amount you’re comfortable with, thanks!

Happy coding!

Riley
