Categorizing the World Wide Web

Jay Pavagadhi
Towards Data Science
8 min read · Sep 11, 2019


The Internet is the world’s largest library. It’s just that all the books are on the floor.

- John Allen Paulos

Let us sweep the floor and try to stack these books onto the shelves.

Common Crawl Dataset

Instead of crawling the open web, it's a good idea to use the existing Common Crawl dataset, a crawled archive of 2.95 billion webpages with 260 terabytes of total content. Of course, it's not a full representation of the web, but it gives us a pretty good start.

To analyze and categorize such a large corpus, we need a lot of computing power, so we will use hundreds of machines on Amazon Elastic MapReduce (EMR) running Apache Spark. The good news is that the Common Crawl dataset already lives on Amazon S3 as part of the Amazon Public Datasets program, so we should be able to access it efficiently.
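As a quick sanity check that the data is reachable, here is a minimal sketch of listing a few objects from the public commoncrawl bucket with anonymous credentials; the crawl ID used in the prefix is an assumption for illustration.

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# anonymous access to the public Common Crawl bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# the crawl ID below is assumed for illustration
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2019-35/segments/",
    MaxKeys=5,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])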

Extracting Text from Webpages

Common Crawl WARC files contain raw data from the crawl, including the HTTP responses from the websites it has contacted. If we take a look at one such HTML document:

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="profile" href="http://gmpg.org/xfn/11">
...
<div class="cover">
<img src="https://images-eu.ssl-images-amazon.com/images/I/51Mp9K1v9IL.jpg" /></p>
</div>
<div>
<div id="desc-B00UVA0J6E" class="description">
<p>By Richard S. Smith,Simon W. M. John,Patsy M. Nishina,John P. Sundberg</p>
<p>ISBN-10: 084930864X</p>
<p>ISBN-13: 9780849308642</p>
<div> finishing touch of the 1st section of the Human Genome venture has awarded scientists with a mountain of recent info.
...
<footer id="colophon" class="site-footer" role="contentinfo">
<div class="site-info">
...
</body>
</html>

As you can see, HTML documents are complex, often written in a multitude of formats. We will use the Python library Beautiful Soup to extract the text out of these documents.

from bs4 import BeautifulSoup

def get_text(html_content):
    soup = BeautifulSoup(html_content, "lxml")

    # strip all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    return soup.get_text(" ", strip=True)

We create a BeautifulSoup object by passing it the raw HTML content; it holds all the data in a nested structure. All the textual data can then be extracted programmatically by calling get_text on this object. However, get_text would also return embedded JavaScript and CSS code, which we do not want, so we remove those elements prior to extraction. That should give us what we want, just the text:

Download e-book for iPad: Systematic Evaluation of the Mouse Eye: Anatomy, Pathology, by Richard S. Smith,Simon W. M. John,Patsy M. Nishina,John P. - Gdynia Design E-books...finishing touch of the 1st section of the Human Genome venture has awarded scientists with a mountain of recent info. the supply of all human genes and their destinations is fascinating, yet their mechanisms of motion and interplay with different genes ...the booklet then studies and illustrates nearby ocular pathology and correlates it with human eye disease.
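For local experimentation, here is a minimal sketch of pulling HTML payloads out of a downloaded WARC file with warcio and running get_text on each of them; the file name is an assumption.

from warcio.archiveiterator import ArchiveIterator

# the file name is assumed; any locally downloaded Common Crawl WARC file works
with open("CC-MAIN-sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in content_type:
            continue
        html = record.content_stream().read()
        print(get_text(html)[:200])  # first 200 characters of the extracted text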

Classification

To classify this text, we will use scikit-learn, a Python library for machine learning. We will train our classifier on the 20 newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned evenly across 20 different newsgroups. We will map these newsgroups onto eight broader categories (one plausible mapping is sketched below).
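The exact mapping used in the project isn't reproduced here; the following is a plausible sketch, with category names taken from the labels that appear later in this post.

# assumed mapping from 20-newsgroups labels to eight broader categories
CATEGORY_MAP = {
    "comp.graphics": "Computer and Electronics",
    "comp.os.ms-windows.misc": "Computer and Electronics",
    "comp.sys.ibm.pc.hardware": "Computer and Electronics",
    "comp.sys.mac.hardware": "Computer and Electronics",
    "comp.windows.x": "Computer and Electronics",
    "sci.electronics": "Computer and Electronics",
    "sci.crypt": "Computer and Electronics",
    "misc.forsale": "E-Commerce",
    "rec.autos": "Automobile",
    "rec.motorcycles": "Automobile",
    "rec.sport.baseball": "Sports",
    "rec.sport.hockey": "Sports",
    "sci.med": "Medicines",
    "sci.space": "Space",
    "talk.politics.guns": "Politics",
    "talk.politics.mideast": "Politics",
    "talk.politics.misc": "Politics",
    "alt.atheism": "Religion",
    "soc.religion.christian": "Religion",
    "talk.religion.misc": "Religion",
}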

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

class Classifier:

    # train the model
    def __init__(self):
        newsgroups_train = fetch_20newsgroups(subset='train',
                                              remove=('headers', 'footers', 'quotes'))
        self.target_names = newsgroups_train.target_names
        self.vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                          stop_words='english')
        vectors = self.vectorizer.fit_transform(newsgroups_train.data)
        self.clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
        self.clf.fit(vectors, newsgroups_train.target)

    # classify a single document
    def predict(self, document):
        x_test = self.vectorizer.transform([document])
        pred = self.clf.predict(x_test)
        return self.target_names[pred[0]]

The fetch_20newsgroups function downloads the data archive from the original 20 newsgroups website. We configure it to remove headers, signature blocks, and quotation blocks, which have little to do with topic classification. We then convert the text into vectors of numerical values using TfidfVectorizer so that they can be fed to the predictive model. Finally, we train a Linear Support Vector Classification (LinearSVC) model, which scales well on large datasets. Numerous other models could easily be plugged in instead, such as Naïve Bayes, SGDClassifier, k-nearest neighbors, or decision trees, but empirically I found the SVM performing well in our case. Now we can call the predict function on our model to classify a document into one of the eight categories.
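A hypothetical usage of the classifier above: predict returns a 20-newsgroups label, which the mapping sketched earlier would then collapse into one of the eight categories.

classifier = Classifier()
label = classifier.predict("NASA launched a new mission to study the solar wind.")
print(label)                    # a newsgroup label, e.g. 'sci.space'
print(CATEGORY_MAP.get(label))  # the broader category, e.g. 'Space'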

Analyzing Predictions

The classifier seems to do a decent job. Here are some web pages:

Beautiful Kazak carpet of distinctive colours matches with the curtains add lively touch. ...Carpet cleaning is something which can be done at home but you might damage your carpet instead of cleaning it. Often the colours of carpet run as soon as it comes in contact with water. ...Carpet RepairingIf your carpet is torn or there is piece missing, AbeeRugs can help you fix that. We fix it in a way that we also complete the design no matter how intricate it is. ...
Classified as E-Commerce
--------------------------------
Make more time for the moments that matter. Get expertly-curated updates and medical education, instantly.New to EasyWeb?EasyWeb is a free service for United States Healthcare Professionals. ...Specialty Anesthesiology: general Anesthesiology: adult cardiothoracic Anesthesiology: critical care Anesthesiology: pain Anesthesiology: pediatric Dermatology Emergency Medicine Family Medicine ...
Classified as Medicines
--------------------------------
All Films & Events | Arab Film Festival 2014 ...Watch Trailer Watch TrailerA Stone’s Throw from PrisonRaquel Castells / Documentary / Palestine, Spain / 2013 / 65 minsGrowing up in the Occupied Palestinian Territory is not easy. When you leave home for school, your mother can't be sure of when you'll be back. Rami, Ahmed, Mohammed, three among thousands, this documentary is their story, but also that of courageous Israelis and Palestinians working to cut abuses, stop conflict ...
Classified as Politics

By analyzing more predictions, however, it becomes noticeable that the categories are too broad.

Tanzania farmers adopts vegetable farming to improve nutritionThe farmers in Tanzania are encouraged to grow elite varieties of vegetables, enriched with high-value nutritional content, in order to fight malnutrition, hunger and double agricultural productivity and income of smallholdersThe Africa RISING project focuses on the need to take urgent action in achieving Sustainable Development Goals (SDGs) which aim ...
Classified as Medicines

Given the categories we have, the predicted category seems about right, but Agriculture would have been a better fit. Similarly, Aeronautics would have been more suitable for the following sample.

Schofields Flying Club - N1418V Flight ScheduleN1418V Flight ScheduleCessna 172MPerhaps the staple of general aviation flight training, and rightfully so. The 172 has been in production since 1956, with no end in sight! Our 172 is an ideal VFR training platform!! Because of its high wings, 18V provides a great view of the ground, ...
Classified as Space

There are many examples where web pages could fall into multiple categories. For example, the following sample could also be categorized as E-Commerce.

Recently Sold VehiclesHere are some vehicles that we recently sold. It's just a sample of the variety and quality of our vehicle selection.Please feel free to browse our Internet showroom and contact us to make an appointment if you would like to see any of our vehicles in person. ...
Classified as Automobile

And there are also quite a few samples where the classifier seems to get it wrong.

... We all make mistakes. That is a dictum of life. It’s surprising how ignored it is and how discouraged one is to fail in the world. Then again current with this theme of artificial organizing is that the soul is now completely detached from the world it swims in. The title Greenberg centers around […]To the dilettante the thing is the end, while to the professional as such it is the means; and only he who is directly interested in a thing, and occupies himself with it from love of it ...Support Joseph A. Hazani on Patreon! ...
Misclassified as Sports
--------------------------------
... Fast, Reliable EverytimeABC Messenger Local Flower Hill NY Delivery ServiceABC Messenger & Transport, Inc. has the ability to offer rush and pre-scheduled Delivery Service at competitive prices and excellent Flower Hill NY service. Their Delivery Service is recognized as fast and dependable. Through highly experienced and skilled veteran Flower Hill NY dispatchers and seasoned couriers we are able to predict traffic patterns easily and avoid or overcome ...
Misclassified as Medicines

All in all, it gives decent predictions that we can rely on for our starter project.

Language Detection

Since the classifier is trained on an English dataset, we will use the langdetect library to skip non-English web pages.

from langdetect import detect

lang = detect(text)
if lang != 'en':
    # skip non-English pages
    return
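One caveat worth noting: langdetect is non-deterministic by default, so the same text can occasionally come back with a different language tag. Seeding its factory makes the results reproducible.

from langdetect import DetectorFactory

# make language detection deterministic across runs
DetectorFactory.seed = 0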

Profanity Filter

We will also use a profanity filter to ignore adult content.

from profanity_check import predict

if predict([text])[0]:
    # adult content, skip it
    return

Spark on Amazon EMR

As discussed earlier, since the Common Crawl dataset is huge, we will run our application with Spark on Amazon EMR. We will adapt the Common Crawl example code to process the data on Spark.

from sparkcc import CCSparkJob  # from the Common Crawl cc-pyspark examples
from pyspark.sql.types import StructType, StructField, StringType, LongType

class WebClassifier(CCSparkJob):

    name = "WebClassifier"

    def __init__(self):
        CCSparkJob.__init__(self)
        self.output_schema = StructType([
            StructField("topic", StringType(), True),
            StructField("count", LongType(), True)
        ])
        self.classifier = Classifier()

    def process_record(self, record):
        # html record
        if self.is_html(record):
            # extract text from web page
            text = self.get_text(record)
        else:
            return

        # skip non-English pages (see above)
        # filter profane content (see above)

        topic = self.classifier.predict(text)
        yield topic, text

if __name__ == '__main__':
    job = WebClassifier()
    job.run()

The process_record function is called on each record in the dataset, where we classify it into one of the categories listed earlier. The full source is uploaded here.

To speed things up further, we will use WET files instead of WARC files, since they contain pre-extracted plaintext. We will also load a pre-trained, serialized classifier to avoid training it on every machine. I used 100 EC2 instances, and it took about an hour and a half to process 1% of the Common Crawl corpus; it cost me about 25 USD.
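As a sketch of that serialization step (the use of joblib and the file name are assumptions, not necessarily what the project actually used):

import joblib

# train once on the driver (or locally), then serialize the fitted model
classifier = Classifier()
joblib.dump(classifier, "classifier.joblib")

# on each worker, load the pre-trained model instead of re-training it
classifier = joblib.load("classifier.joblib")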

And here's what we got:

[Chart: Web Categories]

We started with approximately 1% of the Common Crawl data, or about 25.6 million webpages. We filtered out 1.4 million web pages that did not contain enough textual information for classification. We then skipped about 13.2 million non-English pages and were able to classify the remaining 10.9 million pages. Apparently, there is plenty of E-Commerce content and a lot of content about Computer and Electronics, followed by Politics, Medicines, Religion, Automobile, and Space. Note that we also filtered out about a quarter million adult webpages.

In conclusion, we made a gentle attempt to categorize the current state of the web. I hope this also serves as a good starting point for further web analysis.
