
A Useful Tool to Collect Data: Web Scraping

Benefits and ethics of using this tool

Photo by Andrey Tikhonovskiy on Unsplash

Section 1: Introduction

The development of computers has produced many useful techniques for building massive databases. One such technique is web scraping, used most commonly by statisticians, data scientists, computer scientists and web developers to accumulate vast amounts of data that can then be processed and analyzed with statistical methods. As the name suggests, web scraping is a way to extract information such as specific numbers, text and tables from the world wide web, using software that can easily store and manage everything that has been downloaded.

Regardless of the web browser that we use, every single web page relies on technologies such as XML/HTML, AJAX, and JSON to represent its information. When a person visits a web page on the Internet, whether it is social media, Wikipedia or a search engine like Google or Bing, the browser is interpreting HTML (Munzert et al., 2014). The information displayed in the browser differs from what appears in the HTML itself; in other words, HTML is the code of the web page, and the browser renders that code into a user-friendly experience. In particular, this article will explain some features of HTML that make it possible to implement a web scraping tool successfully, and how they relate to Python.

The primary purpose of this article is to show the usefulness of web scraping and how statisticians can take advantage of this method. In Section 4, Python code is provided with an explanation that gives an insight into the scope of this technique.

The article is composed of different sections, as follows. First, Section 2 explains why web scraping is useful for statisticians. Section 3 explains why, in some scenarios, web scraping can be challenging to use and what the legal consequences of web scraping are. In Section 4, Python code is provided to explain a simple implementation of web scraping using a financial web page. Finally, conclusions are presented in the last section.


Section 2: The Importance of Web Scraping

Statisticians often find it challenging to obtain data. Data has traditionally been collected through observation, sample surveys, interviews or focus groups, methods that in many cases require considerable time and money. Nowadays, the Internet has become a significant source of information for many professionals and scientists, and extracting information from it to solve academic or industry problems is easy, quick and cheap. For example, governments provide public business cycle data for free through their respective bureaus of statistics; central banks around the world provide economic and financial data; and the International Energy Agency (IEA) provides information on oil and gas, amongst other things.

As an alternative to copying and pasting information from a web browser into a spreadsheet, a computer program can perform this job more quickly and precisely than a human can (Broucke and Baesens, 2018). Moreover, if a person needs to download data from many web pages, the algorithm can automatically arrange all the information collected into one database, ready to be analyzed.

Alternatively, an API is a set of functions provided by the creator of a web page so that a programmer can use those functions to extract information from that specific page. Furthermore, the API allows the program that calls it to communicate with the web page correctly. As Broucke and Baesens (2018) suggest, "Twitter, Facebook, LinkedIn, and Google, for instance, all provide such APIs in order to search and post tweets, get a list of your friends and their likes, see who you’re connected with, and so on".

Unfortunately, a website might offer an API that is expensive or has limited use, among other restrictions; in such cases, web scraping may be the practical choice. The truth remains that a scraper can use a program to access and extract information as long as the data is visible in the web browser. If so, the information can be cleaned, stored and used in any way (Broucke and Baesens, 2018).

As mentioned in the previous section, HTML has some features that can be used to implement a web scraping robot with Python. Moreover, a programmer can install web scraping tools for Python, such as Selenium. The primary function of this scraping tool is to automate browsers so that websites can be loaded, their content retrieved, and operations performed through the browser as a user would. Selenium can be operated from different programming languages, including Python, PHP, C#, and Java (Broucke and Baesens, 2018). For the purposes of this article, Python will be used.

Consider a statistician who has successfully downloaded and installed Selenium in Python and has written a simple piece of code to extract some information from the web. That person runs the code and finds that it does not work; Selenium by itself will not run without a web browser. In particular, it needs a WebDriver. This tool is available for most browsers, such as Chrome, Firefox, Safari and Explorer (Broucke and Baesens, 2018). WebDrivers can be downloaded from the Internet and must match the version of the browser that is going to be used for the scraping. With the correct WebDriver in place, Python, through Selenium, will "speak" the same language as the browser, which enables it to correctly extract information from the web page.
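As a minimal sketch of this setup, the following code launches Chrome through Selenium, loads a page and reads its title. It assumes Selenium 4 or later (which can locate a matching ChromeDriver automatically) and a local Chrome installation; the URL is purely illustrative:

```python
from selenium import webdriver

driver = webdriver.Chrome()              # start Chrome via its WebDriver
driver.get("https://www.example.com")    # load a page as a user would
print(driver.title)                      # confirm the page was rendered
driver.quit()                            # always release the browser when done
```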

So far, we know that we need a specific program to do web scraping, in this case Python; we also need to install Selenium for Python, as well as a WebDriver that lets Selenium communicate successfully with the browser.

There are many ways to scrape a web page. Selenium can extract specific information from a particular page by ID, name, XPath, link text, partial link text, tag name, class name or CSS selector in the HTML code (Broucke and Baesens, 2018). One useful mechanism from this list is the XML Path, or XPath. XPath is a hierarchical addressing mechanism comparable to the one a computer uses to locate a file within many folders (Munzert et al., 2014). In other words, an XPath is a route that points exactly to where a piece of information sits in the HTML code of a web page. It is worth mentioning that Munzert et al. (2014) state that XPath "is simply a very helpful tool for selecting information from marked-up documents such as HTML, XML".
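To illustrate, the sketch below selects an element through its XPath. The expression used here (the page's first h1 heading) and the URL are illustrative assumptions, not part of any real scraping target:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # illustrative URL

# Follow the XPath route through the HTML tree to the first <h1> element
heading = driver.find_element(By.XPATH, "//h1")
print(heading.text)

driver.quit()
```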


Section 3: Common Problems When Using Web Scraping

This section will show real examples and complex situations that can arise when doing web scraping; problems are inevitable and common when scraping the web. It is essential for a web scraper to be able to detect, throw, and manage these errors (Mitchell, 2013). As Mitchell (ibid) states, "the web is messy, and you can never be sure that there is an element, or that a page returns the data you want, or that the server of a website is up and running". Many data analysts struggle with unexpected issues while doing web scraping. Anecdotally, I used to work at an investment bank where much of the data we analyzed came from the Internet and was extracted by scraping financial web pages. Instead of paying Bloomberg or Reuters as the primary source of information, we developed these tools to extract financial time series of stocks, bonds and indexes.

The most frequent issue the team had was the sudden update of a web page's template. In other words, a web scraping tool developed for a specific web page stops working as soon as the structure of that page changes. Code that took a considerable amount of time to develop can become irrelevant in just seconds. One way to mitigate this issue is to make sure the structure of the code is flexible enough that it can be adapted quickly to the new page.

Some websites may not cope well with web scrapers trying to download information. For example, if hundreds of HTTP requests are sent to the same website, some pages, although many are relatively robust, may stop working and become unable to serve the regular flow of users (Broucke and Baesens, 2018). Web pages work because companies run servers that can handle a huge number of users, but if too many requests arrive at the same time, the page may well crash.
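A common courtesy, sketched below, is to space out requests rather than fire them all at once. The list of URLs and the delay range are illustrative assumptions:

```python
import random
import time

from selenium import webdriver

driver = webdriver.Chrome()
urls = ["https://www.example.com/page1",
        "https://www.example.com/page2"]  # illustrative URLs

for url in urls:
    driver.get(url)
    # ... extract whatever is needed from the page here ...
    time.sleep(random.uniform(2, 5))  # pause so the server is not flooded

driver.quit()
```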

That same problem happened with Bidder’s Edge, a company dedicated to scraping data from auction sites to feed its own website. Instead of sending many queries to each of the auction sites, users could search the listings on Bidder’s Edge’s website. The company was sued by eBay, which effectively defended against the use of bots on its website (H. Liu and Davis, 2015).

Furthermore, a new problem arises: the legality of web scraping. After analyzing some real cases involving web scraping, Munzert et al. (2014) affirm that "the lesson to be learned from these disconcerting stories is that it is not clear which actions that can be subsumed under the ‘web scraping’ label are actually illegal and which are not". H. Liu and Davis (2015) state that "the law is cloudy. For the most part, companies have succeeded in stopping unwanted scraping-at least partly. But it is far too early to call this an area of settled law". Additionally, Mitchell (2013) shares a similar point of view, arguing that "the legal precedent for web scraping is scant".

To avoid problems with the law, it is generally accepted that one should follow the terms of use and copyright documents on websites (ibid). Broucke and Baesens (2018) also discuss the legality of web scraping, writing that "what is clear is that the legal landscape around web scraping is still evolving and that many of the laws cited as being violated have not matured well in our digital age". Finally, Gold and Latonero (2018) state that "technology often advances ahead of law and policy. Web crawlers are currently governed almost entirely by social norms and politeness, and neither Congress, the executive branch, nor the courts have promulgated laws". It is evident that a grey area exists when using web scraping tools; even big companies have not won all the cases.

One convention developed to avoid problems with web pages is the robots.txt file. The idea is to indicate, in a text file stored in a website’s root directory, which data robots may or may not access (Munzert et al., 2014). In other words, this text file "tells" web scraping tools what information can and cannot be downloaded from the page. However, the robots.txt file alone has not been shown to be legally binding, although the terms of service often can be (Mitchell, 2013).
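Python's standard library can read this file directly. The sketch below, with an illustrative site and user-agent name, checks whether a given path may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Illustrative site and user-agent name; substitute the real target
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # download and parse the file

allowed = rp.can_fetch("MyScraperBot", "https://www.example.com/some/page")
print("Scraping allowed:", allowed)
```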

Finally, the most common legal issues raised in the United States are: Breach of Terms and Conditions; Copyright or Trademark Infringement; the Computer Fraud and Abuse Act (CFAA); Trespass to Chattels; the Robots Exclusion Protocol; the Digital Millennium Copyright Act (DMCA); and the CAN-SPAM Act (Broucke and Baesens, 2018).


Section 4: An example of Web Scraping

This section will help the reader understand how to implement Python web scraping code for a financial web page. The idea is to give a summary highlighting the most important parts of the code; the complete code can be found in my GitHub repository. Please feel free to check it and use it if you wish. The bot extracts specific features of different Exchange Traded Funds (ETFs) and arranges them into a database (etf.com, 2020). An ETF is a group of securities, such as stocks, commodities, bonds, or a mixture of investment types, whose purpose is to replicate or follow an underlying index. Every ETF has a price, allowing investors to buy and sell them easily (Investopedia, 2020).

First of all, we need to import the libraries that are going to be used throughout the code. As mentioned before, we will be using Selenium, Chrome as the browser, and the Chrome WebDriver (whose version needs to match the version of the browser). Lines 1–5 of the code show which libraries are used and which functions of specific libraries were imported.

For a better understanding: the openpyxl library helps to read and write Excel documents, numpy provides useful mathematical functions, and pandas is a package that helps manipulate data structures (PyPI, 2020). In this sense, line 7 creates a workbook that will store all the data downloaded from the web.
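As a hedged sketch of what those opening lines look like (the exact import list in the repository may differ):

```python
import numpy as np
import openpyxl
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Workbook that will hold the scraped data (mirrors line 7 of the code)
wb = openpyxl.Workbook()
ws = wb.active
```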

Lines 16, 20, 21 and 105 contain specific directories that need to be changed for the code to work on another computer. In particular, line 21 points to the workbook with the list of ETFs that are going to be scraped, and line 20 gives the location of the WebDriver used to do the scraping.

Line 29 shows the variable that contains the path and the preferences for how to display the browser. Line 34 lists the features that are going to be extracted from the web page. Consequently, line 43 shows the command that opens the browser using the variable created in line 29.

XPath syntax can be found in lines 45, 50, 52, 58, 72, 77, 82 and 95. Between the XPath expressions, the code contains try: and except: blocks. These are helpful when the code does not find a given XPath in the HTML; without them, execution would stop with an error.
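The pattern looks roughly like the sketch below; the URL and the XPath expression are illustrative placeholders, not the ones used in the repository:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # illustrative URL

try:
    # Illustrative XPath; the real expressions live in the repository
    value = driver.find_element(By.XPATH, "//div[@id='fundBasics']").text
except NoSuchElementException:
    value = None  # feature missing on this page; record a blank and move on

driver.quit()
```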

On the one hand, the for-loop in line 38 iterates over each ETF read from the workbook of line 21. On the other hand, lines 67 and 92 help to iterate over some features of specific ETFs. Finally, line 103 accumulates all the information downloaded, line 104 saves that information as a .csv file in a specific directory, and line 107 closes the WebDriver. An example of what the code is doing can be seen in the accompanying video.
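Putting these pieces together, the overall flow of the script can be sketched as follows. The tickers, URL pattern, XPath and file path are illustrative stand-ins for the ones in the repository:

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
rows = []

tickers = ["SPY", "QQQ"]  # illustrative; read from the input workbook in practice
for ticker in tickers:
    driver.get(f"https://www.etf.com/{ticker}")  # illustrative URL pattern
    record = {"ticker": ticker}
    try:
        # Illustrative XPath for one feature of the fund
        record["expense_ratio"] = driver.find_element(
            By.XPATH, "//div[@id='expenseRatio']").text
    except NoSuchElementException:
        record["expense_ratio"] = None  # feature not present for this ETF

    rows.append(record)

# Accumulate everything into one table and save it, as in lines 103-104
pd.DataFrame(rows).to_csv("etf_data.csv", index=False)
driver.quit()  # as in line 107: release the browser
```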


Section 5: Conclusions

Web scraping is a powerful tool to extract data from web pages. Depending on the type of analysis the researcher is trying to do, web scraping may replace a survey that would cost money and be harder to implement. If the code is programmed appropriately, the computer can extract and arrange much more information than a human being could.

Even though there are many ways of web scraping, this article explained how to implement a web scraping tool using Python. Moreover, it showed how to use Selenium in Python code and how that relates to HTML. Finally, it demonstrated a way to extract specific information from the HTML code using XPath.

Difficulties arise when doing web scraping; problems are inevitable, but they can usually be solved. This article has shown, drawing on the research of many different authors, that there is no clear path surrounding the legality of this technological tool. Given this uncertainty, it is suggested to first check the terms and conditions, and obtaining official authorization from the website before scraping is recommended. Without correct use, web scrapers may end up facing a lawsuit. Free access does not necessarily mean free data.


References:

Alchin, M. and Browning, J.B. (2019). Pro Python 3: Features and Tools for Professional Development. Apress.

vanden Broucke, S. and Baesens, B. (2018). Practical Web Scraping for Data Science: Best Practices and Examples with Python. Apress.

Munzert, S., Rubba, C., Meissner, P. and Nyhuis, D. (2014). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons.

Mitchell, R. (2013). Instant Web Scraping with Java. Packt Publishing Limited.

Gold, Z. and Latonero, M. (2018). Robots Welcome? Ethical and Legal Considerations For Web Crawling And Scraping. Washington Journal of Law, Technology & Arts, 13(3), pp.277–281.

H. Liu, P. and Davis, M. (2015). Web Scraping-Limits on Free Samples. Landslide, 8(2), November/December 2015, pp.1–5.

etf.com (2020). ETF.com: Find the Right ETF – Tools, Ratings, News. [online] Available at: http://www.etf.com.

Investopedia (2020). Investopedia. [online] Available at: https://www.investopedia.com.

PyPI (2020). PyPI. [online] Available at: https://pypi.org.

