Automate your job search with Python and GitHub Actions
A real-life example using Scrapy and GitHub Actions
Job hunting is a time-consuming task. A lot of different sites for job searches exist, but there is not a “one size fits all”. Job openings are available in job aggregators, LinkedIn, career pages of individual companies, even as tweets or in Git repos. Following all the changes is definitely challenging.
But what if you could build your personal job hunting tool? That’s exactly what this post is about. With the help of scraping tools we’ll build a small proof of concept that helps you keep track of jobs posted to company web sites. The data will be extracted in JSON format, so you can use it to build your own personalized newsletter or career page. Let’s get started!
Understanding the problem
After having a look at a few career pages, some things are pretty obvious:
- Most career pages follow a two-level structure. The first level is a single page which lists available job openings. The second level is a separate page with the details of each job opening. Data from both pages might be needed to get the full job details.
- A lot of companies use solutions from third parties such as Workable and Recruitee. Building the components that crawl the career pages of these solutions can give us quick access to the majority of the jobs.
- Job openings across different companies share some common fields, but they also contain free text describing the job and its requirements. The expected output can be semi-structured data.
Enter Scrapy
There are a lot of tools available for scraping data from the web. Scrapy is one of the best options out there. You can definitely use a simpler solution (e.g. requests or BeautifulSoup), but Scrapy offers a lot of things out of the box. Some highlights include:
- A structured way of building spiders.
- Selectors, an API for extracting information from HTML pages using XPath or CSS expressions.
- Configuration for managing features like throttling.
- Tools for building pipelines, separating individual parts of the process (crawling, parsing, transforming and storing the data), as well as other concerns (throttling, configurable serializers for scraped data etc).
Setting up the project
The first thing needed is to install the Scrapy command-line interface. You can install it globally in your Python environment:
pip install Scrapy
or if you have pipsi installed:
pipsi install Scrapy
Creating a new project is really easy. Running the command:
scrapy startproject jobscrapper
Scrapy’s CLI created a directory named jobscrapper. A Python package (also named jobscrapper) is also created inside it, containing basic project configuration.
Depending on the method you used for installing Scrapy, you might want to add a requirements.txt with the scrapy dependency, as well as create a virtual environment and install dependencies. This will also help if you want to automate deployment to a remote server.
Defining the expected output
Job information may come from different web sites. Since the data might be used for generating reports or newsletters, having a structured or semi-structured format will be really helpful. If you check a couple of job sites, you’ll notice that the following information is commonly shared:
- Id of the job
- Name of the company that has posted the job
- Job title
- Description & requirements of the job
- Link to the job’s web page
- Department
- Location
Scrapy introduces Items as the abstraction for defining the structure of the output. Spiders create items by processing data from the web. Scrapy offers the Item class for defining the format of an item. The following snippet represents the structure for the extracted job opening items:
Implementing spiders
The next step is to perform the actual information extraction from the web pages. Scrapy offers the tools for implementing spiders — components that are able to parse specific pages and extract information from them.
A spider is a Python class that subclasses one of Scrapy’s Spider classes. The attributes of the class hold information regarding the URLs to parse, the spider’s name, crawling configuration etc. The parent class also offers some methods that can be overridden in order to add any custom parsing logic. All spiders in this post will use the simplest spider superclass scrapy.Spider.
Some of the most common attributes and methods of a Scrapy Spider are:
- name: the name of the spider.
- start_urls: a list of URLs to parse.
- start_requests(): iterates over the URLs in start_urls, downloads their content and passes each response to the parse method.
- parse(response): responsible for parsing the response, which is encapsulated in the argument passed to this method. It may yield either scrapy.Item or scrapy.Request objects. In the first case, the result will be added to the list of results, while in the latter, a new request will be made using the configuration described in the scrapy.Request object.
Spiders can be built either for specific URLs or for pages following the same structure. Luckily a lot of companies use two great services: Workable and Recruitee. Both generate career pages using customizable templates. Creating crawlers for those two services will enable us to crawl a large number of jobs.
Recruitee — scraping data from HTML pages
Let’s start with Recruitee-backed career pages. If you visit any of these pages, you’ll notice that they have a parent page that contains the list of jobs, as well as links to pages with specific job details.
Back to building the spider, it seems we’ll need to do the following in our code:
- Create a subclass of scrapy.Spider; let’s name it RecruiteeSpider.
- Set the name attribute to something unique for this spider.
- Define the list of start_urls.
- Implement the parse method.
The job list page, however, doesn’t include the full job information. Some fields are only available on the page with the job’s details. The spider will need to:
- Load list of jobs from job list page,
- perform a request to each job’s detail page,
- parse results from individual child pages and
- finally merge the data from the two pages.
Parsing the job list page
If you visit any Recruitee-backed job list page and right-click -> Inspect on a job opening’s title, you’ll see something similar to the following image:
Scrapy uses selectors for referring to specific parts of the page. Selectors are strings representing rules that refer to specific parts of the page. They can be either XPath expressions or CSS selectors. For example, div.job will refer to all div containers with class job. Checking the HTML source of the page, we can see there’s one such container for each job. Looking more carefully, we can see that the available information includes the job title, department, location and a link to the job’s page.
Getting the jobs is straightforward: we just need to iterate over all divs with class job. In order to trigger a new request for each job’s detail page, parse will yield one scrapy.Request per job.

Requests can be configured. The most important argument is url. The URLs can be constructed by taking the URL of the parent page and joining it with the path defined in the <a href="..."> element inside the h5 with class job-title. However, when a request specifies only the url parameter, the default method called for processing the response is parse. Job detail pages have a different structure than the parent page, so a new method for parsing job details needs to be added.
Parsing job detail pages
Let’s call the method for parsing job detail pages parse_job. Its responsibility will be to parse job detail pages, merge in any information already available from the job list page, and return a JobOpeningItem. Using the same logic as in parse, job details can be extracted using selectors.

The tricky part is how to pass information between requests. Luckily, scrapy.Request has a meta argument that accepts a dictionary of values. The dictionary is copied to the response object passed to parse_job. This meta dictionary is ideal for sharing information between the two requests.
Let’s have a look at the final code:
Some things to notice:
- start_requests() hasn’t been implemented. The spider will just iterate over the items in start_urls.
- In the first line of parse, the name of the company is read from the URL.
- parse uses CSS selectors but parse_job uses XPath expressions. Each has its own strengths.
- Data read from the parent page are passed using the meta mechanism.
- The description and requirements fields store the HTML of the fields rather than a stripped-down version of the text. If plain text is needed, it can be extracted in a post-processing step. Keeping the HTML value as originally written may add value in the future. It is also nice to keep scraped data in raw form, as it allows experimenting with different transformation mechanisms.
Workable — scraping data from JSON APIs
Job pages powered by Workable use a different approach. Instead of embedding the job information in the HTML page, they use a JSON API to load job information after the page has loaded. This is a really common pattern in web pages. You can see it happening by opening Developer tools (in Chrome/Brave) or Web developer tools (Firefox), navigating to Network → XHR and visiting a career page. You will see something similar to this:
The jobs request is the actual request made to the server by the page’s JavaScript code in order to load the list of jobs. The Headers tab contains the details of the request (it’s a POST request) as well as its payload. The Preview tab offers a formatted version of the JSON returned by the endpoint. You might notice that there are two requests to the same jobs URL. That’s because the list of results is too big to be returned in a single response. Instead, the first request returns the first 10 results, as well as a pointer to the next page. The second request returns the rest of the results.
If you click on a specific job on the web page, you’ll notice that more XHR requests are added. One of them uses the shortcode from the previous request in order to create the URL for getting the job’s details.
Parsing the job list page
As discussed earlier, the list of jobs is not part of the page, but is loaded using a new request. The spider will have to imitate this logic. The simplest solution would be to override start_requests() in order to perform a new JSON request for each job list page.
Since the responses are in JSON format, there’s no need to use selectors for extracting information. Instead, standard Python dictionary and list manipulation can be used. The logic is really similar to that of the previous spider. Here’s the first version:
Some things to note:
- Input URLs are the links to the job list pages. The overridden start_requests method extracts the company’s name and creates the request to the JSON API.
- parse just yields a new request for each job details URL.
- parse doesn’t use selectors. It just loads the response’s body as JSON and iterates over the resulting object.
- parse also gets the company name from the URL. A different approach could have been to pass it from start_requests using meta.
Parsing job detail pages
Job details are loaded as JSON objects. Creating the JobOpeningItem is pretty straightforward.
Paging
The code as is will work, but it will load jobs only from the first call to the API. Checking the two XHR requests in the developer tools again, we can see that the first response contains a field named nextPage. This value can be added to the payload of the request in order to indicate which page to fetch. The second request to the jobs API endpoint helps us understand which field should be set to the value of nextPage (it’s token).
Pagination can now be implemented by modifying parse. It will now do the following:
- Load the job list.
- Yield a request for each job in the list, using parse_job as the callback.
- Yield a request for the next page, if nextPage exists. The request is again to the jobs endpoint, but with token set to the value of nextPage.
Here’s the code:
You can think of the pagination logic as a recursive call to parse, using the output of the previous result.
Small data cleaning
Data loaded from HTML pages (or even from the API) may contain additional whitespace, often left over from formatted HTML code. It usually doesn’t make sense to keep it. A first solution would be to modify each spider to clean up whitespace, but since this concern applies to all spiders, we can do something better.
Scrapy allows creating item pipelines. After an item has been created, it is sent to the pipelines defined in the configuration. The first step for adding an item pipeline is adding it in jobscrapper/pipelines.py. An item pipeline is a class that needs to define the method process_item(self, item, spider), which performs actions on every item passed to it. Here’s how a pipeline that removes multiple spaces would look:
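A minimal sketch of such a pipeline (the class name is an assumption; any name works as long as it matches the settings):

```python
import re


class CleanWhitespacePipeline:
    """Collapse runs of whitespace in every string field of an item."""

    def process_item(self, item, spider):
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = re.sub(r"\s+", " ", value).strip()
        return item
```

Note that the pipeline returns the item, so any pipelines configured after it keep processing the cleaned version.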
In order for Scrapy to actually call the pipeline, it needs to be activated in settings.py. Open jobscrapper/settings.py, locate the line with ITEM_PIPELINES and edit it to look like the following snippet:
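For example (the module path and class name are assumptions matching the pipeline sketched above; the number controls the order in which pipelines run, lower first):

```python
# jobscrapper/settings.py
ITEM_PIPELINES = {
    "jobscrapper.pipelines.CleanWhitespacePipeline": 100,
}
```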
Running the spiders
Running the spiders can be done using Scrapy’s CLI. Simply run one of the following:
# Run Recruitee spider
scrapy runspider openings/spiders/recruitee.py -o jobs.json

# Run Workable spider
scrapy runspider openings/spiders/workable.py -o jobs.json
The -o jobs.json option instructs the CLI tool to append results to the file jobs.json. If it’s not specified, results will just be printed to the terminal.
The drawback of this approach is that if you run the two spiders, the file will contain multiple JSON documents (rather than a single one), making it difficult to parse.
IMPORTANT NOTE: The URLs used in the code above are fictional. Make sure you replace them with real world examples! Otherwise the scraping will fail.
Storing jobs per company
The final step in Python code is to store the jobs as a JSON document in a separate file per company. Scrapy offers a lot of item exporters for storing parsed items. Although none of them does exactly what we want, there’s a good example that can act as our guide. The idea is to create a pipeline that creates a separate exporter per company name, then use one of the existing JSON item exporters for storing the data. Here’s how such a pipeline would look:
Here’s the explanation of how it works:
- The from_crawler class method defines how to construct a new pipeline using existing settings, which are read from settings.py. It simply creates a new PerCompanyExportPipeline, passing the path to store files under as an argument.
- The constructor stores the path as a property, while also ensuring that the directory structure exists.
- open_spider is called when the spider is initialized. It just creates a dictionary mapping each company to an exporter.
- close_spider performs cleanup whenever crawling stops.
- A helper method contains the logic for selecting which exporter to use. The current company is extracted from the item. If there’s no exporter for this company, a new one is created and added to the company-to-exporter map. Finally the exporter (either new or existing) is returned.
- process_item contains the logic for exporting the item.
The final step is to activate the pipeline in settings.py. Update ITEM_PIPELINES to look like the following snippet:
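A sketch of the updated settings (module paths and priorities are assumptions; lower numbers run first, so cleaning happens before export):

```python
# jobscrapper/settings.py
ITEM_PIPELINES = {
    "jobscrapper.pipelines.CleanWhitespacePipeline": 100,
    "jobscrapper.pipelines.PerCompanyExportPipeline": 200,
}

# Directory where the per-company JSON files are written.
JOBS_PATH = "data"
```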
I also defined JOBS_PATH so that all files are stored under the data/ directory.
Spiders can now be executed by running:
# Run Recruitee spider
scrapy runspider openings/spiders/recruitee.py

# Run Workable spider
scrapy runspider openings/spiders/workable.py
Git scraping
The most difficult part is done. But it would be extremely nice if the process could be executed automatically, possibly sending an update. There’s a really useful technique called Git scraping that can help. The idea is that the code runs periodically and all output is stored inside the Git repo. The benefit of this approach is that changes in job postings can be tracked using Git.
The following GitHub Action does exactly this: it runs the spiders once per week and creates a commit if something changed.
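A sketch of such a workflow; the schedule, action versions and file paths are assumptions, so adapt them to your repository:

```yaml
name: Scrape jobs

on:
  schedule:
    # Every Monday at 06:00 UTC.
    - cron: "0 6 * * 1"
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run spiders
        run: |
          scrapy runspider openings/spiders/recruitee.py
          scrapy runspider openings/spiders/workable.py
      - name: Commit results if something changed
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add data/
          if ! git diff --cached --quiet; then
            git commit -m "Update job data"
            git push
          fi
```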
The flow is finally completed! You can check it running at https://github.com/ifoukarakis/jobscrapper/
Improvements
The code is definitely not ready for production, nor is it something that fully automates job search. There are a lot of things missing.
Feature ideas
- Send notification with new jobs as soon as they are detected.
- Filter jobs based on preferences or some kind of profile.
- Group jobs by similarity.
Technical debt
- Loading the list of starting URLs per spider from persistent storage rather than hardcoding them.
- Storing the openings in a persistent store.
- Tests.
- Remove duplicate code and refactor spiders to be reusable.
- Get openings from individual career pages.
- Improved notifications on failures. Failures can happen for many reasons, including network problems, protection against crawlers, changes in the APIs or the structure of the pages etc.
Before you go
This post is intended as a tutorial on how to scrape data from different types of sources available on the Web. However, each site might have its own terms & conditions. Make sure you read and understand them before scraping the data!
If you’re looking for the source code of this post, you can find it at https://github.com/ifoukarakis/jobscrapper/
Do you have questions or suggestions? Feel free to add a comment!
Hope you enjoyed reading this post! Happy job hunting!