Ethics in Web Scraping

James Densmore
Towards Data Science
3 min read · Jul 23, 2017

We all scrape web data. Well, those of us who work with data do: data scientists, marketers, data journalists, and the data-curious alike. Lately, I’ve been thinking more about the ethics of the practice and have been dissatisfied with the lack of consensus on the topic.

Let me be clear that I’m talking about ethics, not the law. The law regarding scraping web data is complex, fuzzy and ripe for reform, but that’s another matter. It’s not that no one is thinking or writing about the ethics of scraping, but rather that those scraping and those being scraped can’t agree on basic principles.

I’ve been on both sides. I scrape data mostly for personal projects, but I’ve used scraping as a form of data collection on the job as well. On the other side, I’ve wrestled with how to filter out “bots” from my own or my employer’s web logs and analytics in order to focus on real customers. Scraping has been a reality of life for years now, so rather than fighting it, let’s just set some ground rules.

Though I have no illusion that these rules are complete and absolute, they cover the key points of contention I’ve come across over the years.

The Ethical Scraper

I, the web scraper, will live by the following principles:

  • If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping altogether.
  • I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
  • I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
  • I will only save the data I absolutely need from your page. If all I need is OpenGraph metadata, that’s all I’ll keep.
  • I will respect any content I do keep. I’ll never pass it off as my own.
  • I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
  • I will respond in a timely fashion to your outreach and work with you towards a resolution.
  • I will scrape for the purpose of creating new value from the data, not to duplicate it.
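
Two of these principles, a transparent User Agent string and a reasonable request rate, translate fairly directly into code. What follows is only a minimal sketch using Python's standard library, not a definitive implementation. The bot name, contact URL, email address and two-second delay are all placeholder assumptions, and the PoliteFetcher class is a hypothetical helper invented for this illustration.

```python
import time
import urllib.request

# Hypothetical identity: replace with your own project name and contact details.
USER_AGENT = "example-research-bot/0.1 (+https://example.com/bot; contact: me@example.com)"

# Assumed value for a "reasonable rate"; the right number depends on the site.
MIN_DELAY_SECONDS = 2.0


class PoliteFetcher:
    """Fetch pages with a transparent User-Agent and a minimum delay between requests."""

    def __init__(self, user_agent=USER_AGENT, min_delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._clock = clock    # injectable so the pacing logic can be tested offline
        self._sleep = sleep
        self._last_request = None

    def build_request(self, url):
        # Every request carries the same identifying User-Agent header.
        return urllib.request.Request(url, headers={"User-Agent": self.user_agent})

    def wait_for_turn(self):
        # Sleep just long enough that requests are at least min_delay apart.
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url, timeout=10):
        # Pace the request, then download the raw bytes of the page.
        self.wait_for_turn()
        with urllib.request.urlopen(self.build_request(url), timeout=timeout) as response:
            return response.read()
```

Calling PoliteFetcher().fetch(url) in a loop spaces the requests out automatically. A fixed delay is the simplest possible policy; backing off further when the site responds slowly or returns errors would be a natural next step.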

The Ethical Site Owner

I, the site owner, will live by the following principles:

  • I will allow ethical scrapers to access my site as long as they are not a burden on my site’s performance.
  • I will respect transparent User Agent strings rather than blocking them and encouraging use of scrapers masked as human visitors.
  • I will reach out to the owner of the scraper (thanks to their ethical User Agent string) before blocking permanently. A temporary block is acceptable in the case of site performance or ethical concerns.
  • I understand that scrapers are a reality of the open web.
  • I will consider public APIs to provide data as an alternative to scrapers.
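
Reaching out before blocking only works if the contact details in a User Agent string can be found. As a rough sketch, a site owner might pull URLs and email addresses out of the User Agent values in their logs. The contact_hints function and both regular expressions are assumptions made for this illustration; User Agent strings follow no strict format, so any match is a hint rather than a guarantee.

```python
import re

# Deliberately loose patterns: User-Agent strings vary widely in practice.
URL_PATTERN = re.compile(r"https?://[^\s;)]+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def contact_hints(user_agent):
    """Return any URLs and email addresses found in a User-Agent string."""
    return {
        "urls": URL_PATTERN.findall(user_agent),
        "emails": EMAIL_PATTERN.findall(user_agent),
    }
```

A string such as "example-research-bot/0.1 (+https://example.com/bot; contact: me@example.com)" yields one URL and one email address, while a typical browser User Agent yields neither.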

Where Does This Leave Us?

The ease of scraping in Python

The fact is, scraping data is easy. With a few lines of Python and the help of some awesome libraries such as urllib2 (urllib.request in Python 3, or Requests if you prefer) and BeautifulSoup, you can grab and parse the HTML of a page. It’s so easy, in fact, that responsible use is more important than ever.
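
To give a sense of how little code the parsing half takes, here is a small sketch that keeps only a page's OpenGraph metadata and discards everything else. It assumes Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs; BeautifulSoup would make it shorter still. The OpenGraphParser class and extract_open_graph function are names made up for this example.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags and ignore the rest of the page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        # Keep only OpenGraph properties that actually carry a value.
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


def extract_open_graph(html_text):
    """Return a dict of OpenGraph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    parser.close()
    return parser.properties
```

The HTML string passed to extract_open_graph would come from whatever fetching code you already use, for example the decoded body returned by urllib.request.urlopen. Only the resulting dictionary needs to be stored.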

Of course, scraping a few thousand blog posts for a weekend project isn’t the problem. Heck, even scraping for use in business can be done quite ethically, in my opinion. It’s high-volume web scraping for questionable commercial use that gets the most attention and poses the highest risk for those of us who rely on the vast data of the web to innovate, learn and create new value.

With a little respect we can keep a good thing going.

Thanks for reading! You can connect with me or read my other blog posts on my website.
