Big Data: What is Web Scraping and how to use it

Vladimir Fedak
5 min read · Feb 9, 2018

What is web scraping? It is essential for gathering the Big Data sets that are the cornerstone of Big Data analytics, Machine Learning (ML) and the training of Artificial Intelligence (AI) algorithms.

The catch is that information is the most valuable commodity in the world (after time, which you cannot buy back), as Michael Douglas put it in the famous “Wall Street” movie long before the Internet era.

This means that the ones who possess information take every possible precaution to protect it from copying. In pre-Internet times this was easy, as copyright legislation is fairly solid in the developed countries. The World Wide Web changed everything: anybody can copy the text on one page and paste it into another, and web scrapers are simply algorithms that do it much faster than humans.

DISCLAIMER: The following is intended for Big Data researchers who comply with the permissions in robots.txt, set a correct User Agent and do not violate the Terms of Service of the sites they scrape.

IT Svit has ample experience scraping websites for our Big Data projects. We believe there are three levels of web scraping complexity, depending on how much JavaScript (JS) you have to tackle:

  1. A lucky loiterer
    a) The web pages you need to scrape have simple and clean markup without any JS. In this case, you simply create “locators” for the data in question; XPath statements are a good example of such locators.
    b) All the URLs to other websites and pages are direct. The main difficulty is finding only the relevant URLs. For example, you can look for the `class` attribute, in which case the XPath will look like this: `//a[@class='Your_target_class']` (see the lxml sketch after this list).
  2. A skilled professional
    a) Partial JS rendering. For example, the search results page has all the information, but it is generated by JS. Typically, if you open a specific result, the full data without JS is there.
    b) Simple pagination. Instead of repeatedly clicking the “next page” button, you can fetch pages simply by constructing the necessary URL, like this: http://somesite.com/data?page=2&limit=10. In the same way you can, for example, increase the number of results returned by a single query (see the pagination sketch after this list).
    c) Simple URL creation rules. The links can be formed by JS, but you can unravel the rule and create them yourself.
  3. A Jedi Knight, may the Force be with you
    a) The page is fully built with JS and there is no way to get the data without running it. In this case you need more sophisticated tools: Selenium or some other WebKit-based tool will get the job done (see the Selenium sketch after this list).
    b) The URLs are formed using JS. The tools from the previous paragraph solve this problem as well, yet processing may slow down because JS rendering takes additional time. Consider splitting the scraper and the spider and performing such slow operations in a separate handler.
    c) CAPTCHA is present. Usually a CAPTCHA does not appear immediately but only after several requests. In this case you can use various proxy services and simply switch IPs when the scraper is stopped by a CAPTCHA. These services can also be useful for emulating access from different locations.
    d) The website has an underlying API with complex data-transfer rules. JS scripts render the pages after calling the back-end, so it may be easier to get the data by querying the back-end directly. To analyze what the scripts do, use the Developer Console in your browser: press F12 and go to the Network tab (see the back-end query sketch after this list).
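
To make the first level concrete, here is a minimal sketch of XPath-based locators using requests and lxml; the URL, class names and XPath expressions are placeholders, not a real target:

```python
import requests
from lxml import html

# Hypothetical page with plain, JS-free markup
response = requests.get("http://somesite.com/catalog", timeout=10)
tree = html.fromstring(response.content)

# XPath "locators" for the data itself and for the relevant links
titles = tree.xpath("//h2[@class='product-title']/text()")
links = tree.xpath("//a[@class='Your_target_class']/@href")

for title, link in zip(titles, links):
    print(title.strip(), link)
```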
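For the second level, simple pagination can often be handled by building the URLs yourself instead of clicking “next page”. A rough sketch with an invented endpoint and parameters:

```python
import requests
from lxml import html

BASE_URL = "http://somesite.com/data"  # hypothetical endpoint

# Walk the pages by constructing ?page=N&limit=M ourselves
for page in range(1, 6):
    response = requests.get(BASE_URL, params={"page": page, "limit": 50}, timeout=10)
    tree = html.fromstring(response.content)
    rows = tree.xpath("//div[@class='result-row']")
    if not rows:  # no more results, stop paginating
        break
    print(f"page {page}: {len(rows)} rows")
```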
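For the third level, when the page is assembled entirely by JS, a headless browser does the rendering for you. A hedged Selenium sketch (the URL and selector are assumptions; Selenium 4 syntax):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # render JS without opening a browser window
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)  # give the scripts time to build the DOM

try:
    driver.get("http://somesite.com/js-heavy-page")  # hypothetical JS-built page
    # Once the scripts have rendered the DOM, locate elements as usual
    items = driver.find_elements(By.CSS_SELECTOR, ".Your_target_class")
    for item in items:
        print(item.text)
finally:
    driver.quit()
```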
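And when the Developer Console shows that the page simply calls a JSON back-end, querying that back-end directly is usually faster than rendering anything. The endpoint and fields below are invented for illustration:

```python
import requests

# Endpoint spotted on the Network tab of the Developer Console (hypothetical)
API_URL = "http://somesite.com/api/v1/products"

response = requests.get(
    API_URL,
    params={"page": 1, "limit": 100},
    headers={"User-Agent": "research-bot/1.0 (contact@example.com)"},
    timeout=10,
)
response.raise_for_status()

# The JSON structure is assumed; inspect the real response before parsing
for item in response.json().get("items", []):
    print(item.get("name"), item.get("price"))
```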

It is also important to understand the difference between web scraping and data mining. In short, data mining can be performed on any data array and can even be done manually, while web scraping or crawling takes place only on web pages and is performed by special robots (crawlers/scrapers). We have also listed 5 success factors for Big Data mining, where finding correct and relevant data sources is the most important foundation of successful analytics.

For example, a manufacturer might want to monitor market trends and uncover actual customer attitudes without relying on the retailers' monthly reports. Using web scraping, the company can collect a huge data set of product descriptions, customer reviews and feedback from the retailers' websites. Analyzing this data helps the manufacturer provide retailers with better product descriptions, list the problems end users face with the product, and apply that feedback to further improve the product and secure the bottom line through bigger sales.

Web scraping tools and methods

Most scrapers are written in Python to ease further processing of the collected data. We write our scrapers using frameworks and libraries for web crawling such as Scrapy, Ghost, lxml, aiohttp or Selenium.
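
To give a feel for how such a framework is used, here is a minimal Scrapy spider; the domain, start URL and selectors are placeholders rather than a real project:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    """Minimal spider: crawl a catalog and yield structured items."""
    name = "products"
    start_urls = ["http://somesite.com/catalog?page=1"]

    def parse(self, response):
        # Extract the fields of interest from each result row
        for row in response.xpath("//div[@class='result-row']"):
            yield {
                "title": row.xpath(".//h2/text()").get(),
                "url": row.xpath(".//a/@href").get(),
            }
        # Follow simple pagination links if they are present
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

A spider like this can be run with `scrapy runspider spider.py -o items.json` to dump the collected items into a file.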

When building scrapers we must be prepared to deal with any level of complexity, from a lucky loiterer to a full Jedi Knight. This is why we validate the data sets before building scrapers for them, in order to allocate sufficient resources. In addition, the conditions might change (and often do) during scraper development, so a skilled data scientist must be prepared to jump to the third level and draw their lightsaber at any given moment.

In the next article of this series, we will describe the measures you can take to protect your website content from unwanted web crawling. Stay tuned for updates, and the best of luck with your web scraping!

Previously, I’ve posted these materials on my company’s blog.
