The Web Scraping Template | by Durgesh Samariya

Summary of web scraping steps.

Photo by Marvin Meyer on Unsplash

Web scraping is the process of automatically extracting data from the web: fetching web pages and storing their data in whatever format you want.

In this post, I will share the template I use to save time by not writing the same things again and again. I use the Python programming language for web scraping.

Disclaimer: This template doesn’t work on every website, because not all websites are built the same way. However, it works most of the time. This post is not a tutorial.

TL;DR

If you just want the template, you can find it here.


Load required libraries

The first step is to load all the required libraries. I use the BeautifulSoup library for web scraping, along with pandas and requests. Before importing, make sure they are all installed on your system.

!pip3 install bs4 pandas requests
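With the packages installed, the imports at the top of the notebook look like this:

```python
# requests fetches pages, BeautifulSoup parses the HTML,
# and pandas stores the extracted data.
import requests
from bs4 import BeautifulSoup
import pandas as pd
```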

Parsing

Now all the required libraries are loaded. Requesting the website is the essential first step in scraping: once the request succeeds, the page's entire HTML content is available. We then pass that content to bs4, which parses it so we can search it like plain text.

Simply add the URL that you want to scrape and run the cell.
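This step can be sketched as a small helper (the function name `fetch_soup` and the demo snippet below are my own illustration, not part of the original notebook):

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url: str) -> BeautifulSoup:
    """Request a page and return its parsed HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on a bad status code
    return BeautifulSoup(response.text, "html.parser")

# Offline demo of the same parsing step on an inline HTML snippet:
html = "<html><body><h1>Demo</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)
```

In the template you would call `fetch_soup("https://...")` with your target URL instead of parsing an inline string.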


Extracting the required elements

Now we can use the soup.find() and soup.find_all() methods to search for the required tags on the page. Usually, my target is a table where the data is stored. First, I always search for the headings, which can usually be found in the `<th>` tag. So let’s find them and store them in a Python list.
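A minimal sketch of that step, using a made-up sample table in place of a scraped page (the column names `Name` and `Score` are invented for the demo):

```python
from bs4 import BeautifulSoup

# Sample table standing in for a scraped page.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; .text strips the markup.
headings = [th.text.strip() for th in soup.find_all("th")]
print(headings)
```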

Now our headings are stored in a list named headings. Next, let’s find the table body, which can usually be found in the `<tbody>` tag.

Now we have the headings and the content. It’s time to store them in a DataFrame. Here, I create a DataFrame called data.
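Putting the pieces together, a sketch of the row extraction and DataFrame construction (again on a small sample table, since the real tags depend on the target site):

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

headings = [th.text.strip() for th in soup.find_all("th")]

# Each <tr> after the header row holds one record in its <td> cells.
rows = []
for tr in soup.find_all("tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

data = pd.DataFrame(rows, columns=headings)
print(data)
```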

Finally, the data is ready for future use. I like to perform some data analysis before saving it to CSV.


Data Analysis

It is essential to analyze the data. With pandas, we can inspect it using methods like head(), describe(), and info(). You can also check the column names.
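These inspection calls look like this (the DataFrame here is a stand-in for the scraped one):

```python
import pandas as pd

# Stand-in for the scraped DataFrame from the previous step.
data = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": ["90", "85"]})

print(data.head())             # first few rows, a quick sanity check
data.info()                    # column dtypes and non-null counts
print(data.describe())         # summary statistics per column
print(data.columns.tolist())   # column names
```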

Once you have analyzed the data, you might want to clean it. This step is optional, as it isn’t always required when you are creating a dataset. However, sometimes it is needed.


Data Cleaning

This template includes some data-cleaning steps, such as removing unwanted symbols from the data and renaming columns. You can add more if you need to.
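As a hedged sketch of those two steps, assuming the scraped values arrive as strings with stray symbols (the `%` sign and messy column name below are invented for the demo):

```python
import pandas as pd

# Scraped numbers often arrive as strings with unwanted symbols.
data = pd.DataFrame({"Name ": ["Alice", "Bob"], "Score": ["90%", "85%"]})

# Rename columns: strip whitespace and standardise the casing.
data = data.rename(columns=lambda c: c.strip().lower())

# Remove the unwanted symbol and convert to a numeric dtype.
data["score"] = data["score"].str.replace("%", "", regex=False).astype(int)
print(data)
```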

Now our data is ready to save.


Save Data into CSV

Let’s save the data to a CSV file. You only need to change the file name.
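The saving step is a single call; the file name `output.csv` below is a placeholder to replace with your own:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

# index=False keeps the row index out of the file.
data.to_csv("output.csv", index=False)
```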


That’s it for this post. Scraping data from the web is not limited to this template. There is plenty more you can do, but it all depends on the website. This template covers the few steps you would otherwise repeat, which saves you time. You can find the complete Jupyter Notebook here.

Thanks for reading. Happy web scraping!


If you like my work and want to support me, I’d greatly appreciate it if you follow me on my social media channels:

In case you have missed my Kaggle step-by-step guide:

Getting started with Titanic Kaggle | Part 1

Getting started with Titanic Kaggle | Part 2

In case you have missed my Python series:

I hope you will like my other articles.
