Automate your job search with Python and GitHub Actions

A real-life example using Scrapy and GitHub Actions

Ioannis Foukarakis

Job hunting is a time-consuming task. A lot of different job search sites exist, but there is no "one size fits all". Job openings are available on job aggregators, on LinkedIn, on the career pages of individual companies, even as tweets or in Git repos. Following all the changes is definitely challenging.

But what if you could build your personal job hunting tool? That is exactly what this post is about. With the help of scraping tools we'll build a small proof-of-concept that helps you keep track of jobs posted to company web sites. The data will be extracted in JSON format. This way you can build your own personalized newsletter or career page. Let's get started!

Understanding the problem

After having a look at a few career pages, a few things become pretty obvious:

  1. Most career pages follow a two-level structure. The first level is a single page which lists available job openings. The second level is a separate page with the details of each job opening. Data from both pages might be needed to get the full job details.
  2. A lot of companies use solutions from third parties such as Workable and Recruitee. Building the components that crawl the career pages of these solutions can give us quick access to the majority of the jobs.
  3. Job openings across different companies share some common fields, but they also contain free text describing the job and its requirements. The expected output can therefore be semi-structured data.

Enter Scrapy

There are a lot of tools available for scraping data from the web. Scrapy is one of the best options out there. You can definitely use a simpler solution (e.g. requests or BeautifulSoup), but Scrapy offers a lot of things out of the box. Some highlights include:

  • A structured way of building spiders.
  • Selectors, an API for extracting information from HTML pages using XPath or CSS expressions.
  • Configuration for managing features like throttling.
  • Tools for building pipelines, separating the individual parts of the process (crawling, parsing, transforming and storing the data) as well as other concerns (throttling, configurable serializers for scraped data, etc.).

Setting up the project

The first thing needed is to install the Scrapy command-line interface. You can install it globally in your Python environment:

pip install Scrapy

or if you have pipsi installed:

pipsi install Scrapy

Creating a new project is really easy. Run the command:

scrapy startproject jobscrapper

Scrapy's CLI creates a directory named jobscrapper, containing a Python package (also named jobscrapper) with the basic project configuration.
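
The generated layout looks roughly like this (it may vary slightly between Scrapy versions):

jobscrapper/
    scrapy.cfg            # deploy/configuration file
    jobscrapper/          # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py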

Depending on the method you used for installing Scrapy, you might want to add a requirements.txt with the Scrapy dependency, as well as create a virtual environment and install the dependencies there. This will also help if you want to automate deployment to a remote server.
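
For example, something along these lines (assuming a Unix-like shell):

python -m venv .venv
source .venv/bin/activate
echo "Scrapy" > requirements.txt   # or pin a specific version
pip install -r requirements.txt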

Defining the expected output

Job information may come from different web sites. Since the data might be used for generating reports or newsletters, having a structured or semi-structured format will be really helpful. If you check a couple of job sites, you'll notice that the following information is commonly shared:

  • Id of the job
  • Name of the company that has posted the job
  • Job title
  • Description & requirements of the job
  • Link to the job’s web page
  • Department
  • Location

Scrapy introduces Items as the abstraction for defining the structure of the output. Spiders create items by processing data from the web, and the Item class defines their format. The following snippet represents the structure of the extracted job opening items:
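
A minimal sketch of such an item, with one scrapy.Field per attribute in the list above (the class lives in jobscrapper/items.py; the actual definition in the repository may differ slightly):

import scrapy


class JobOpeningItem(scrapy.Item):
    """Semi-structured representation of a single job opening."""
    id = scrapy.Field()
    company = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    requirements = scrapy.Field()
    url = scrapy.Field()
    department = scrapy.Field()
    location = scrapy.Field()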

Implementing spiders

The next step is to perform the actual information extraction from the web pages. Scrapy offers the tools for implementing spiders — components that are able to parse specific pages and extract information from them.

A spider is a Python class that subclasses one of Scrapy’s Spider classes. The attributes of the class hold information regarding the URLs to parse, the spider’s name, crawling configuration etc. The parent class also offers some methods that can be overridden in order to add any custom parsing logic. All spiders in this post will use the simplest spider superclass scrapy.Spider.

Some of the most common attributes and methods of a Scrapy Spider are:

  • name: the name of the spider
  • start_urls: a list of URLs to parse
  • start_requests(): this method iterates over the URLs in start_urls, downloads their content and passes each response to the parse method.
  • parse(response): responsible for parsing the response, which is encapsulated in the argument passed to the method. It may yield either scrapy.Item or scrapy.Request objects. In the first case, the item is added to the list of results, while in the latter, a new request is made using the configuration described in the scrapy.Request object.
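
Putting these together, the bare skeleton of a spider looks roughly like this:

import scrapy


class ExampleSpider(scrapy.Spider):
    # Unique name used to refer to the spider
    name = 'example'
    # Pages that the default start_requests() will download
    start_urls = ['https://www.example.com/careers']

    def parse(self, response):
        # Yield items for the extracted data,
        # or scrapy.Request objects for additional pages to crawl
        ...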

Spiders can be built either for specific URLs or for pages following the same structure. Luckily a lot of companies use two great services: Workable and Recruitee. Both generate career pages using customizable templates. Creating crawlers for those two services will enable us to crawl a large number of jobs.

Recruitee — scraping data from HTML pages

Let’s start with Recruitee-backed career pages. If you visit any of these pages, you’ll notice that they have a parent page that contains the list of jobs, as well as links to pages with specific job details.

Back to building the spider, it seems we’ll need to do the following in our code:

  • Create a subclass of scrapy.Spider — let's name it RecruiteeSpider.
  • Set the name attribute to something unique for this spider.
  • Define the list of start_urls.
  • Implement the parse method.

The page however doesn’t include the whole job information. Some fields are available on the web page with the job’s details. The spider will need to:

  • Load the list of jobs from the job list page,
  • perform a request to each job's detail page,
  • parse the results from the individual child pages and
  • finally merge the data from the two pages.

Parsing the job list page

If you visit any Recruitee-backed job list page and right-click -> Inspect on a job opening's title, you'll see something similar to the following image:

Sample job information from Recruitee

Scrapy uses selectors for referring to specific parts of the page. Selectors are strings representing rules, expressed either as XPath expressions or CSS selectors. For example, div.job refers to all div containers with class job. Checking the HTML source of the page, we can see that there is one such container for each job. Looking more carefully, we can see that the available information includes the job title, department, location and a link to the job's page.
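
Based on these observations, the markup of a single opening looks roughly like the following (the class names used here for the department and location are illustrative; inspect the actual page to confirm them):

<div class="job">
  <h5 class="job-title">
    <a href="/o/senior-python-developer">Senior Python Developer</a>
  </h5>
  <ul>
    <li class="job-department">Engineering</li>
    <li class="job-location">Athens, Greece</li>
  </ul>
</div>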

Getting the jobs is straightforward. We just need to iterate over all divs with class job. In order to trigger a new request for each job's detail page, parse will yield one scrapy.Request per job.

Requests can be configured. The most important argument is url. The URLs can be constructed by taking the URL of the parent page and joining it with the path defined in the <a href="..."> element inside the h5 with class job-title. However, when a request specifies only the url parameter, the default method called for processing the response is parse. Job detail pages have a different structure than the parent page, so a new method for parsing job details needs to be added and set as the request's callback.

Parsing job detail pages

Let's call the method for parsing job detail pages parse_job. Its responsibility will be to load job detail pages, merge any information already available from the job list page, and return a JobOpeningItem. Using the same logic as in parse, job details can be extracted using selectors.

The tricky part is how to pass information between requests. Luckily scrapy.Request has a meta argument that can be a dictionary of values. The dictionary will be copied to the response object passed to parse_job. This meta dictionary is ideal for sharing information between the two requests.

Let’s have a look at the final code:
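
The sketch below follows that structure (the actual code lives in the repository linked at the end of the post; the selectors for the job id, department, location and the detail-page sections are assumptions that should be verified against a real Recruitee page):

from urllib.parse import urlparse

import scrapy

from jobscrapper.items import JobOpeningItem


class RecruiteeSpider(scrapy.Spider):
    name = 'recruitee'
    start_urls = [
        # Fictional example; replace with real Recruitee-backed career pages
        'https://examplecompany.recruitee.com/',
    ]

    def parse(self, response):
        # The company name is the sub-domain of the career page URL
        company = urlparse(response.url).netloc.split('.')[0]
        for job in response.css('div.job'):
            url = response.urljoin(job.css('h5.job-title a::attr(href)').get())
            yield scrapy.Request(
                url,
                callback=self.parse_job,
                meta={
                    'id': url.rstrip('/').split('/')[-1],  # illustrative: derive an id from the URL
                    'company': company,
                    'title': job.css('h5.job-title a::text').get(default='').strip(),
                    'department': job.css('li.job-department::text').get(default='').strip(),  # assumed class
                    'location': job.css('li.job-location::text').get(default='').strip(),      # assumed class
                    'url': url,
                },
            )

    def parse_job(self, response):
        meta = response.meta
        yield JobOpeningItem(
            id=meta['id'],
            company=meta['company'],
            title=meta['title'],
            department=meta['department'],
            location=meta['location'],
            url=meta['url'],
            # Keep the raw HTML of these sections; class names are assumptions
            description=response.xpath('//div[contains(@class, "description")]').get(),
            requirements=response.xpath('//div[contains(@class, "requirements")]').get(),
        )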

Some things to notice:

  • start_requests() hasn't been implemented. The spider will just iterate over the items in start_urls.
  • In the first line of parse, the name of the company is read from the URL.
  • parse uses CSS selectors but parse_job uses XPath expressions. Each has its own strengths.
  • Data read from the parent page are passed to parse_job using the meta mechanism.
  • The description & requirements fields store the HTML of these sections rather than a stripped-down version of the text. If the plain text is needed, it is possible to extract it in a post-processing step. Holding the HTML value as it was originally written may add value in the future. It is also nice to keep scraped data in raw form, as this allows experimenting with different transformation mechanisms.

Workable — scraping data from JSON APIs

Job pages powered by Workable use a different approach. Instead of embedding the job information in the HTML page, they use a JSON API to load it after the page has loaded. This is a really common pattern in web pages. You can see it happening by opening Developer tools (in Chrome/Brave) or Web developer tools (Firefox), navigating to Network → XHR and visiting a career page. You will see something similar to this:

The jobs request is the actual request made to the server by the JavaScript code of the page in order to load the list of jobs. The Headers tab contains the details of the request (it's a POST request) as well as its payload. The Preview tab offers a formatted version of the JSON returned by the endpoint. You might notice that there are two requests to the same jobs URL. That's because the list of results is too big to be returned in a single response. Instead, the first request returns the first 10 results, as well as a pointer to the next page. The second request returns the rest of the results.

If you click on a specific job on the web page, you’ll notice that more XHR requests are added. One of them uses the shortcode from the previous request in order to create the URL for getting the job’s details.

Parsing the job list page

As discussed earlier, the list of jobs is not part of the page, but is loaded using a new request. The spider will have to imitate this logic. The simplest solution would be to override start_requests() in order to perform a new JSON request for each job list page.

Since the responses are in JSON format, there's no need to use selectors for extracting information. Instead, standard Python dictionary and list manipulation can be used. The logic is really similar to the previous spider. Here's the first version:
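
A sketch of this first version; the API endpoint patterns and JSON field names below are assumptions based on what the network tab shows and should be double-checked:

import json
from urllib.parse import urlparse

import scrapy


class WorkableSpider(scrapy.Spider):
    name = 'workable'
    start_urls = [
        # Fictional example; replace with real Workable-backed career pages
        'https://apply.workable.com/examplecompany/',
    ]

    # Endpoint patterns as observed in the browser's network tab (assumed here)
    jobs_api = 'https://apply.workable.com/api/v3/accounts/{company}/jobs'
    job_api = 'https://apply.workable.com/api/v2/accounts/{company}/jobs/{shortcode}'

    def start_requests(self):
        for url in self.start_urls:
            # The company name is the first segment of the career page's path
            company = urlparse(url).path.strip('/').split('/')[0]
            yield scrapy.Request(
                self.jobs_api.format(company=company),
                method='POST',
                headers={'Content-Type': 'application/json'},
                body=json.dumps({}),
            )

    def parse(self, response):
        # The company name is the 4th path segment of the (assumed) API URL
        company = urlparse(response.url).path.strip('/').split('/')[3]
        data = json.loads(response.text)
        # The key holding the list of jobs ('results' here) should be confirmed in the network tab
        for job in data.get('results', []):
            yield scrapy.Request(
                self.job_api.format(company=company, shortcode=job['shortcode']),
                callback=self.parse_job,
            )

    def parse_job(self, response):
        # Implemented in the next section
        ...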

Some things to note:

  • Input URLs are the links to the job list pages. The overridden start_requests method extracts the company's name and creates the request to the JSON API.
  • parse just yields a new request for each job's details URL.
  • parse doesn't use selectors. It loads the response's body as JSON and simply iterates over the resulting object.
  • parse also gets the company name from the URL. A different approach could have been to pass it from start_requests using meta.

Parsing job detail pages

Job details are loaded as JSON objects. Creating the JobOpeningItem is pretty straightforward.
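
Continuing the WorkableSpider sketch above, parse_job could look roughly like this (again, the JSON field names and the public job URL pattern are assumptions):

    def parse_job(self, response):
        company = urlparse(response.url).path.strip('/').split('/')[3]
        job = json.loads(response.text)
        yield JobOpeningItem(
            id=job.get('shortcode'),
            company=company,
            title=job.get('title'),
            description=job.get('description'),
            requirements=job.get('requirements'),
            # Link back to the public job page (pattern assumed)
            url=f'https://apply.workable.com/{company}/j/{job.get("shortcode")}/',
            department=job.get('department'),
            location=job.get('location'),
        )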

Paging

The code as-is will work, but it will only load jobs from the first call to the API. Checking the two XHR requests in the developer tools again, we can see that the first response contains a field named nextPage. This value can be added to the payload of the next request in order to indicate which page to fetch. The second request to the jobs API endpoint shows which field should be set to the value of nextPage (it's token).

Pagination can now be implemented by modifying parse. It will now do the following:

  • Load job list.
  • Yield a request for each job in the list, using parse_job as the callback.
  • Yield a request for the next page, if nextPage exists. The request goes again to the jobs endpoint, but with token set to the value of nextPage.

Here’s the code:
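
A sketch of the modified parse, replacing the earlier version:

    def parse(self, response):
        company = urlparse(response.url).path.strip('/').split('/')[3]
        data = json.loads(response.text)

        # One request per job in this page of results
        for job in data.get('results', []):
            yield scrapy.Request(
                self.job_api.format(company=company, shortcode=job['shortcode']),
                callback=self.parse_job,
            )

        # If there is a next page, request it with token set to the value of nextPage
        next_page = data.get('nextPage')
        if next_page:
            yield scrapy.Request(
                response.url,
                method='POST',
                headers={'Content-Type': 'application/json'},
                body=json.dumps({'token': next_page}),
                callback=self.parse,
            )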

You can think of the pagination logic as a recursive call to parse, with each call using the output of the previous one.

Small data cleaning

Data loaded from HTML pages (or even from the API) may contain extra whitespace, usually left over from formatted HTML code. It doesn't make much sense to keep it. A first solution would be to modify each spider to strip the whitespace, but since this is a concern that applies to all spiders, we can do something better.

Scrapy allows creating item pipelines. After an item has been created, it is sent to the pipelines defined in the configuration. The first step for adding an item pipeline is adding it to jobscrapper/pipelines.py. An item pipeline is a class that defines a process_item(self, item, spider) method, which is applied to every item passed to it. Here's how a pipeline that removes multiple spaces could look:
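
A minimal sketch of such a pipeline (the class name CleanWhitespacePipeline is just a suggestion):

import re


class CleanWhitespacePipeline:
    """Collapse runs of whitespace in every string field of an item."""

    def process_item(self, item, spider):
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = re.sub(r'\s+', ' ', value).strip()
        return item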

In order for Scrapy to actually call the pipeline, it needs to be activated in settings.py. Open jobscrapper/settings.py, locate the ITEM_PIPELINES setting and edit it to look like the following:
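
Assuming the pipeline class above is named CleanWhitespacePipeline:

ITEM_PIPELINES = {
    'jobscrapper.pipelines.CleanWhitespacePipeline': 300,
}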

Running the spiders

Running the spiders can be done using Scrapy’s CLI. Simply run one of the following:

# Run Recruitee spider
scrapy runspider openings/spiders/recruitee.py -o jobs.json
# Run Workable spider
scrapy runspider openings/spiders/workable.py -o jobs.json

The -o jobs.json option instructs the CLI tool to append results to the file jobs.json. If it's not specified, results are just printed to the terminal.

The drawback of this approach is that if you run the two spiders, the file will contain multiple JSON documents (rather than a single one), making it difficult to parse.

IMPORTANT NOTE: The URLs used in the code above are fictional. Make sure you replace them with real-world examples! Otherwise, the scraping will fail.

Storing jobs per company

The final step in Python code is to store the jobs as a JSON document in a separate file per company. Scrapy offers a lot of item exporters for storing parsed items. Although none of them does exactly what we want, there's a good example in the documentation that can act as our guide. The idea is to create a pipeline that maintains a separate exporter per company name, and then use the existing JSON item exporter for storing the data. Here's how such a pipeline could look:
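
A sketch of such a pipeline, loosely following the per-item export example from Scrapy's documentation (the JOBS_PATH setting used here is introduced below; details may differ from the actual repository):

import os

from scrapy.exporters import JsonItemExporter


class PerCompanyExportPipeline:
    """Write the openings of each company to a separate JSON file."""

    def __init__(self, path):
        self.path = path
        # Make sure the target directory exists
        os.makedirs(path, exist_ok=True)

    @classmethod
    def from_crawler(cls, crawler):
        # JOBS_PATH is a custom setting defined in settings.py
        return cls(crawler.settings.get('JOBS_PATH', 'data'))

    def open_spider(self, spider):
        # One exporter (and one file) per company
        self.company_to_exporter = {}

    def close_spider(self, spider):
        for exporter, json_file in self.company_to_exporter.values():
            exporter.finish_exporting()
            json_file.close()

    def _exporter_for_item(self, item):
        company = item['company']
        if company not in self.company_to_exporter:
            json_file = open(os.path.join(self.path, f'{company}.json'), 'wb')
            exporter = JsonItemExporter(json_file)
            exporter.start_exporting()
            self.company_to_exporter[company] = (exporter, json_file)
        return self.company_to_exporter[company][0]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item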

Here’s the explanation of how it works:

  • The from_crawler class method defines how to construct a new pipeline using the existing settings. Settings are read from settings.py; the method simply creates a new PerCompanyExportPipeline, passing the path where files will be stored as an argument.
  • The constructor stores the path as a property and ensures that the directory structure exists.
  • open_spider is called when the spider is initialized. It just creates a dictionary mapping each company to an exporter.
  • close_spider performs cleanup whenever crawling stops.
  • _exporter_for_item contains the logic for selecting which exporter to use. The current company is extracted from the item. If there is no exporter for this company yet, a new one is created and added to the company-to-exporter map. Finally, the exporter (either new or existing) is returned.
  • process_item holds the logic for exporting the item.

The final step is to activate the pipeline in settings.py. Update ITEM_PIPELINES to look like the following snippet:
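
Something along these lines (the priority numbers are illustrative; lower values run first):

ITEM_PIPELINES = {
    'jobscrapper.pipelines.CleanWhitespacePipeline': 300,
    'jobscrapper.pipelines.PerCompanyExportPipeline': 400,
}

# Directory where the per-company JSON files will be written
JOBS_PATH = 'data'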

I also defined JOBS_PATH so that all files are stored under the data/ directory.

Spiders can now be executed by running:

# Run Recruitee spider
scrapy runspider openings/spiders/recruitee.py
# Run Workable spider
scrapy runspider openings/spiders/workable.py

Git scraping

The most difficult part is done. But it would be extremely nice if the process could be executed automatically, possibly sending an update. There's a really useful technique called Git scraping that can help. The idea is that the code runs periodically and all output is stored inside the Git repo. The benefit of this approach is that changes in job postings can be tracked using Git.

The following GitHub Actions workflow does exactly this: it runs the spiders once per week and creates a commit if something changed.
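
A sketch of such a workflow, e.g. in .github/workflows/scrape.yml (the schedule, Python version and commit step are illustrative; the actual workflow lives in the repository linked below):

name: Scrape job openings

on:
  schedule:
    - cron: '0 6 * * 1'   # once per week, Monday 06:00 UTC
  workflow_dispatch:      # allow manual runs too

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run spiders
        run: |
          scrapy runspider openings/spiders/recruitee.py
          scrapy runspider openings/spiders/workable.py
      - name: Commit and push changes, if any
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add -A data/
          git diff --staged --quiet || (git commit -m "Update job openings" && git push)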

The flow is now complete! You can see it running at https://github.com/ifoukarakis/jobscrapper/

Improvements

The code is definitely not ready for production, nor does it fully automate the job search. There are a lot of things missing.

Feature ideas

  • Send notification with new jobs as soon as they are detected.
  • Filter jobs based on preferences or some kind of profile.
  • Group jobs by similarity.

Technical debt

  • Loading the list of starting URLs per spider from persistent storage rather than hardcoding them.
  • Storing the openings in a persistent store.
  • Tests.
  • Remove duplicate code and refactor spiders to be reusable.
  • Get openings from individual career pages.
  • Improved notifications on failures, which can happen for many reasons, including network problems, anti-crawler protection, changes in the APIs or in the structure of the pages, etc.

Before you go

This post is intended as a tutorial on how to scrape data from different types of sources available on the Web. However, each site might have its own terms & conditions. Make sure you read and understand them before scraping the data!

If you're looking for the source code of this post, you can find it at https://github.com/ifoukarakis/jobscrapper/

Do you have questions or suggestions? Feel free to add a comment!

Hope you enjoyed reading this post! Happy job hunting!
