Industrial Classification of Websites by Machine Learning with hands-on Python

Ridham Dave
Towards Data Science
11 min read · Jul 30, 2018

Hey folks, welcome to my first technical tutorial. In this tutorial, I would like to explain the extraction, cleaning and classification of websites into different categories. I will use a Python environment to run my code for data scraping and a neural network to classify the websites.

Text classification

Text classification is one of the most widely used natural language processing (NLP) tasks across many areas of Data Science. An efficient text classifier can automatically sort data into categories using NLP algorithms.

Text classification is an example of a supervised machine learning task, since a labelled dataset containing text documents and their labels is used to train a classifier.

Some common techniques for text classification are:

  1. Naive Bayes Classifier
  2. Linear Classifier
  3. Support Vector Machine
  4. Bagging Models
  5. Boosting Models
  6. Deep Neural Networks

Web scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Generally, this is done with software that simulates human Web surfing to collect specified bits of information from different websites.

Some techniques that can be used for web scraping are:

  1. Human copy-and-paste
  2. Text pattern matching
  3. HTTP programming
  4. HTML parsing
  5. DOM parsing
  6. Vertical aggregation
  7. Semantic annotation recognizing
  8. Computer vision web-page analysis

In this tutorial, we will implement the complete model in three different modules:

  1. Data Scraping
  2. Classification based on keywords for creating a training data set
  3. Applying a neural network to build and test the actual model

Module 1: Data Scraping

In this module, I will use a Python 3.5 environment to implement my scripts, so follow along for the complete reference.

Step 1: Requesting data from the website

Many different packages are available for extracting web data, but in this tutorial I will use requests.

import requests

url = 'https://medium.com/'
try:
    page = requests.get(url)        # request the page from the website
    html_code = page.content        # extract the HTML code from the page
except Exception as e:
    print(e)

In the above code, the requests.get() method requests the page from the website over HTTPS and loads the response into the object “page”. The next line stores the raw HTML code in the string html_code. So far we have extracted the data from the website, but it is still in HTML format, which is far from the plain text we need.
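
If you want the request to be a little more robust, a minimal sketch along these lines (the timeout value and User-Agent header are my own illustrative additions, not part of the original code) also checks the HTTP status code before using the response:

import requests

url = 'https://medium.com/'
try:
    # a timeout and a User-Agent header make the request a little more robust
    page = requests.get(url, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
    if page.status_code == 200:
        html_code = page.content
    else:
        print("Request failed with status code:", page.status_code)
except requests.exceptions.RequestException as e:
    print(e)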

Step 2: Extracting text from HTML page

To extract the complete text from the HTML page, we have two popular packages, BeautifulSoup and html2text. Using the html_code string obtained in the previous step, we can apply either of the following two methods.

from bs4 import BeautifulSoup

try:
    soup = BeautifulSoup(html_code, 'html.parser')  # parse the HTML code
    texts = soup.findAll(text=True)                 # find all text
    text_from_html = ' '.join(texts)                # join all text
except Exception as e:
    print(e)

In the above snippet, the BeautifulSoup package parses the HTML code and assigns the result to the soup object. The findAll() call finds all text nodes in the code and returns a list of strings, which we store in texts. Finally, we join all the individual strings into a single string using the join() function.
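
One thing to note: findAll(text=True) also returns the contents of script and style tags. If you only want the visible text, a small sketch like the following (an addition on my part, not from the original tutorial) removes those tags before joining the text:

from bs4 import BeautifulSoup

try:
    soup = BeautifulSoup(html_code, 'html.parser')
    # drop script and style tags so only visible text remains
    for tag in soup(['script', 'style']):
        tag.decompose()
    texts = soup.findAll(text=True)
    text_from_html = ' '.join(t.strip() for t in texts if t.strip())
except Exception as e:
    print(e)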

import html2text

h = html2text.HTML2Text()   # initialize the converter object
h.ignore_links = True       # skip hyperlinks in the output
try:
    # handle() expects a string, so decode the bytes returned by requests first
    text = h.handle(html_code.decode('utf-8', errors='ignore'))
    text_from_html = text.replace("\n", " ")   # replace newline characters with spaces
except Exception as e:
    print(e)

In this alternative block, we use the html2text package to parse the HTML code and get the text directly. We also replace newline characters with spaces and finally obtain text_from_html.

Similarly, we can loop over 1000+ URLs, extract the data from those sites as well, and store it in a CSV (Comma-Separated Values) file, which we can then use in the classification module.
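
A minimal sketch of such a loop might look like this; the urls list, the output file name websites.csv, and the column names are illustrative assumptions rather than something fixed by the tutorial:

import requests
import html2text
import pandas as pd

urls = ['https://medium.com/']  # in practice, a list of 1000+ URLs

h = html2text.HTML2Text()
h.ignore_links = True

rows = []
for url in urls:
    try:
        page = requests.get(url, timeout=10)
        text = h.handle(page.content.decode('utf-8', errors='ignore'))
        rows.append({'url': url, 'text': text.replace("\n", " ")})
    except Exception as e:
        print(url, e)

# store the scraped text so the classification module can reuse it
pd.DataFrame(rows).to_csv('websites.csv', index=False)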

Module 2: Classification based on keywords

For any machine learning algorithm, we need a training set and a test set for training the model and testing its accuracy. To create this dataset, we already have the text from different websites; we will simply label it according to keywords, and then use the results in the next module.

In this tutorial, we are going to classify websites into three categories, namely:

  1. Technology, Office, & Education products websites (Class_1)
  2. Consumer products websites (Class_2)
  3. Industrial Tools and Hardware products websites (Class_3)

The approach here is that we have a set of keywords belonging to each category; we match those keywords against the text and pick the class with the maximum Matching_value.

Matching_value = (Number of keywords matched with one industry) / (Total number of keywords matched)

For example, if a page matches 8 keywords in total and 6 of them belong to Class_3, then the Matching_value for Class_3 is 6/8 = 75%.

Accordingly, we have a list of keywords for each category as follows:

Class_1_keywords = ['Office', 'School', 'phone', 'Technology', 'Electronics', 'Cell', 'Business', 'Education', 'Classroom']

Class_2_keywords = ['Restaurant', 'Hospitality', 'Tub', 'Drain', 'Pool', 'Filtration', 'Floor', 'Restroom', 'Consumer', 'Care', 'Bags', 'Disposables']

Class_3_keywords = ['Pull', 'Lifts', 'Pneumatic', 'Emergency', 'Finishing', 'Hydraulic', 'Lockout', 'Towers', 'Drywall', 'Tools', 'Packaging', 'Measure', 'Tag']

keywords = Class_1_keywords + Class_2_keywords + Class_3_keywords

Now, we will use KeywordProcessor to find keywords inside the text received from the URLs.

KeywordProcessor is available in the flashtext package on PyPI.

from flashtext.keyword import KeywordProcessor

kp0 = KeywordProcessor()
for word in keywords:
    kp0.add_keyword(word)

kp1 = KeywordProcessor()
for word in Class_1_keywords:
    kp1.add_keyword(word)

kp2 = KeywordProcessor()
for word in Class_2_keywords:
    kp2.add_keyword(word)

kp3 = KeywordProcessor()
for word in Class_3_keywords:
    kp3.add_keyword(word)

In the above code, we load the KeywordProcessor objects with the keywords, which we will use later to find the matching keywords.
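
As a quick illustration of what these objects do (the sample sentence below is made up for this example), extract_keywords() returns every loaded keyword it finds in a string:

sample = "This Office supplies store sells Electronics for the Classroom"
print(kp1.extract_keywords(sample))   # expected: ['Office', 'Electronics', 'Classroom']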

To compute the Matching_value as a percentage, we define a function percentage1 as follows:

def percentage1(dum0, dumx):
    try:
        ans = float(dumx) / float(dum0)
        ans = ans * 100
    except:
        return 0
    else:
        return ans

We will now use the extract_keywords(string) method to find the keywords present in the text, and take the length of the returned list to count the matching keywords. The following function computes the percentage for each class and selects the class with the maximum percentage.

def find_class(text_from_html):
    x = str(text_from_html)
    y0 = len(kp0.extract_keywords(x))   # total keyword matches
    y1 = len(kp1.extract_keywords(x))   # Class_1 matches
    y2 = len(kp2.extract_keywords(x))   # Class_2 matches
    y3 = len(kp3.extract_keywords(x))   # Class_3 matches
    per1 = float(percentage1(y0, y1))
    per2 = float(percentage1(y0, y2))
    per3 = float(percentage1(y0, y3))
    if y0 == 0:
        Category = 'None'
    else:
        if per1 >= per2 and per1 >= per3:
            Category = 'Class_1'
        elif per2 >= per3 and per2 >= per1:
            Category = 'Class_2'
        else:
            Category = 'Class_3'
    return Category

By calling the above function in a loop (a sketch follows), we can find the category of every website based on its keywords. We save the classified data into a file Data.csv, which we will use in the next module. Our dataset is now ready for neural-network classification.
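
A minimal sketch of that loop, assuming the scraped text sits in the websites.csv file from the earlier sketch (my own naming) with a 'text' column, could look like this:

import pandas as pd

# websites.csv is the scraped-text file assumed in the scraping sketch above
scraped = pd.read_csv('websites.csv')

# label every page with its keyword-based class
scraped['Category'] = scraped['text'].apply(find_class)

scraped.to_csv('Data.csv', index=False)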

Module 3: Applying a neural network

Classification of websites

In the following implementation, we will create a neural network from scratch and use the NLTK word tokenizer for preprocessing. First, we need to import the dataset obtained from the above steps and load it into a list.

import pandas as pd

data = pd.read_csv('Data.csv')
data = data[pd.notnull(data['tokenized_source'])]
data = data[data.Category != 'None']

The above code loads and cleans the classified data; rows with NULL values are removed.

The following code builds a list of dictionaries, each pairing a text with its class.

training_data = []
for index, row in data.iterrows():
    training_data.append({"class": row["Category"], "sentence": row["text"]})

To apply a neural network, we need to convert the words into a mathematical representation that can be used for calculations. We will first form a list of all the words across all the strings.

import nltk
from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()

words = []
classes = []
documents = []
ignore_words = ['?']

# loop through each sentence in our training data
for pattern in training_data:
    # tokenize each word in the sentence
    w = nltk.word_tokenize(pattern['sentence'])
    # add to our words list
    words.extend(w)
    # add to documents in our corpus
    documents.append((w, pattern['class']))
    # add to our classes list
    if pattern['class'] not in classes:
        classes.append(pattern['class'])

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = list(set(words))

# remove duplicates
classes = list(set(classes))

print(len(documents), "documents")
print(len(classes), "classes", classes)
print(len(words), "unique stemmed words", words)

For example, the output will be:

1594 documents
3 classes ['Class_1', 'Class_3', 'Class_2']
40000 unique stemmed words

Now, we will create a list of tokenized words for each pattern and build a bag of words using the NLTK Lancaster stemmer.

from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()

# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)

# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    # create our bag of words array
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)

    training.append(bag)
    # output is a '0' for each tag and '1' for current tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    output.append(output_row)

print("# words", len(words))
print("# classes", len(classes))

Output:

# words 41468
# classes 3

Now, we do the final preprocessing on the data and create some functions.

Sigmoid Function

import numpy as np

def sigmoid(x):
    output = 1 / (1 + np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output * (1 - output)

Cleaning function

def clean_up_sentence(sentence):
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words

Bag Of Words function

def bow(sentence, words, show_details=False):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bag[i] = 1
                if show_details:
                    print("found in bag: %s" % w)

    return np.array(bag)

The final function, which will be used inside the neural network: the think function.

def think(sentence, show_details=False):
    x = bow(sentence.lower(), words, show_details)
    if show_details:
        print("sentence:", sentence, "\n bow:", x)
    # input layer is our bag of words
    l0 = x
    # matrix multiplication of input and hidden layer
    l1 = sigmoid(np.dot(l0, synapse_0))
    # output layer
    l2 = sigmoid(np.dot(l1, synapse_1))
    return l2

Now we are all set to train our neural network model. We are going to implement it from scratch, using a logistic (sigmoid) activation in each neuron. We will train the model with just one hidden layer but 50,000 epochs. The complete training example runs on the CPU.

import datetime
import json

def train(X, y, hidden_neurons=10, alpha=1, epochs=50000, dropout=False, dropout_percent=0.5):

    print("Training with %s neurons, alpha:%s, dropout:%s %s" % (hidden_neurons, str(alpha), dropout, dropout_percent if dropout else ''))
    print("Input matrix: %sx%s Output matrix: %sx%s" % (len(X), len(X[0]), 1, len(classes)))
    np.random.seed(1)

    last_mean_error = 1
    # randomly initialize our weights with mean 0
    synapse_0 = 2 * np.random.random((len(X[0]), hidden_neurons)) - 1
    synapse_1 = 2 * np.random.random((hidden_neurons, len(classes))) - 1

    prev_synapse_0_weight_update = np.zeros_like(synapse_0)
    prev_synapse_1_weight_update = np.zeros_like(synapse_1)

    synapse_0_direction_count = np.zeros_like(synapse_0)
    synapse_1_direction_count = np.zeros_like(synapse_1)

    for j in iter(range(epochs + 1)):

        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0, synapse_0))

        if dropout:
            layer_1 *= np.random.binomial([np.ones((len(X), hidden_neurons))], 1 - dropout_percent)[0] * (1.0 / (1 - dropout_percent))

        layer_2 = sigmoid(np.dot(layer_1, synapse_1))

        # how much did we miss the target value?
        layer_2_error = y - layer_2

        if (j % 10000) == 0 and j > 5000:
            # if this 10k iteration's error is greater than the last iteration, break out
            if np.mean(np.abs(layer_2_error)) < last_mean_error:
                print("delta after " + str(j) + " iterations:" + str(np.mean(np.abs(layer_2_error))))
                last_mean_error = np.mean(np.abs(layer_2_error))
            else:
                print("break:", np.mean(np.abs(layer_2_error)), ">", last_mean_error)
                break

        # in what direction is the target value?
        # were we really sure? if so, don't change too much.
        layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)

        # how much did each l1 value contribute to the l2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)

        # in what direction is the target l1?
        # were we really sure? if so, don't change too much.
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

        synapse_1_weight_update = (layer_1.T.dot(layer_2_delta))
        synapse_0_weight_update = (layer_0.T.dot(layer_1_delta))

        if j > 0:
            synapse_0_direction_count += np.abs(((synapse_0_weight_update > 0) + 0) - ((prev_synapse_0_weight_update > 0) + 0))
            synapse_1_direction_count += np.abs(((synapse_1_weight_update > 0) + 0) - ((prev_synapse_1_weight_update > 0) + 0))

        synapse_1 += alpha * synapse_1_weight_update
        synapse_0 += alpha * synapse_0_weight_update

        prev_synapse_0_weight_update = synapse_0_weight_update
        prev_synapse_1_weight_update = synapse_1_weight_update

    now = datetime.datetime.now()

    # persist synapses
    synapse = {'synapse0': synapse_0.tolist(), 'synapse1': synapse_1.tolist(),
               'datetime': now.strftime("%Y-%m-%d %H:%M"),
               'words': words,
               'classes': classes
               }
    synapse_file = "synapses.json"

    with open(synapse_file, 'w') as outfile:
        json.dump(synapse, outfile, indent=4, sort_keys=True)
    print("saved synapses to:", synapse_file)

And finally, we will train the model:

import time
X = np.array(training)
y = np.array(output)

start_time = time.time()

train(X, y, hidden_neurons=10, alpha=0.1, epochs=50000, dropout=False, dropout_percent=0.2)

elapsed_time = time.time() - start_time
print ("processing time:", elapsed_time, "seconds")

Output:

Training with 10 neurons, alpha:0.1, dropout:False
Input matrix: 1594x41468 Output matrix: 1x3
delta after 10000 iterations:0.0665105275385
delta after 20000 iterations:0.0610711168863
delta after 30000 iterations:0.0561908365355
delta after 40000 iterations:0.0533465919346
delta after 50000 iterations:0.0461560407785
saved synapses to: synapses.json
processing time: 33060.51151227951 seconds

As we can see, it took roughly nine hours to train the model. After such an intensive computation, we are ready to test the data.

The function to test the data:

# probability threshold
ERROR_THRESHOLD = 0.2

# load our calculated synapse values
synapse_file = 'synapses.json'
with open(synapse_file) as data_file:
    synapse = json.load(data_file)
    synapse_0 = np.asarray(synapse['synapse0'])
    synapse_1 = np.asarray(synapse['synapse1'])

def classify(sentence, show_details=False):
    results = think(sentence, show_details)

    results = [[i, r] for i, r in enumerate(results) if r > ERROR_THRESHOLD]
    results.sort(key=lambda x: x[1], reverse=True)
    return_results = [[classes[r[0]], r[1]] for r in results]
    # print("\n classification: %s" % (return_results))
    return return_results

Let's test the model and check its accuracy:

classify("Switchboards Help KA36200 About Us JavaScript seems to be disabled in your browser You must have JavaScript enabled in your browser to utilize the functionality of this website Help Shopping Cart 0 00 You have no items in your shopping cart My Account My Wishlist My Cart My Quote Log In BD Electrical Worldwide Supply Remanufacturing the past SUSTAINING THE FUTURE Hours and Location Michigan Howell")

Output:

[['Class_3', 0.97663437888614435]]

classify("  New Website Testimonial Policies Parts Catalog Contact Support Forum Documentation Themes WordPress Blog Products Spindle Parts Latest News Kennard Parts Suggest Ideas Legal/Disclaimers WordPress Planet News About CDT Home Latest News Testimonial Products Parts Catalog About CDT History Staff Policies Centrum Legal Disclaimers Contact About CDT Custom Drilling Technologies established in 1990 has been providing superior customer service to the printed circuit board industry for almost 20 years We specialize in Excellon Drilling and Routing Equipment Parts and Service Our staff has over sixty years of combined experience in the design building troubleshooting operation programming")

Output:

[['Class_1', 0.9620297535870017]]

As you can see, we are getting high-confidence, correct predictions on these tests. I have tried this model on different data and found similarly high accuracy.
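
If you want a more systematic accuracy number, a small sketch like the one below can help; it assumes you have kept aside a labelled hold-out list called test_data with 'sentence' and 'class' keys, which the tutorial itself does not define:

# test_data is an assumed hold-out list of {"sentence": ..., "class": ...} dicts
correct = 0
for item in test_data:
    predictions = classify(item["sentence"])
    # count a hit when the top-scoring prediction matches the known label
    if predictions and predictions[0][0] == item["class"]:
        correct += 1

print("accuracy: %.2f%%" % (100.0 * correct / len(test_data)))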

An accuracy of around 95% or more is quite good for this type of model with just a single hidden layer. For further experiments with different models, we can use Keras or TensorFlow. To decrease the training time, we can use an NVIDIA GPU.
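
As a pointer in that direction, here is a minimal Keras sketch of an equivalent single-hidden-layer network, reusing the bag-of-words matrix X and one-hot labels y built above; the layer sizes, optimizer, and epoch count are illustrative assumptions, not something prescribed by this tutorial:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# one hidden layer with a sigmoid activation, mirroring the from-scratch network
model = Sequential([
    Dense(10, activation='sigmoid', input_shape=(X.shape[1],)),
    Dense(y.shape[1], activation='softmax')   # one output unit per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Keras trains in mini-batches, so it needs far fewer epochs than the 50,000 used above
model.fit(X, y, epochs=20, batch_size=32)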

And now we can easily scrape data and classify it into categories with the help of a neural network trained using backpropagation.

In further tutorials, I will try to explain how Keras and TensorFlow work, with hands-on examples.

Please share your feedback about this tutorial in the comments section below or via my LinkedIn page: https://www.linkedin.com/in/ridhamdave/. Also feel free to share any doubts regarding this tutorial.
