
COVID-19 Policies Multi-classification with Neural Network

Documenting my learning curve for tuning an ML model that can categorize 20 classes (~ and even more in the future)

Photo by Obi Onyeador on Unsplash

Little intro~

Before getting into the detail of the work, I want to spare some space to share a summary of the motivation behind this project as well as this article.

I have been having a great time at Coronanet.org, an amazing multinational team with more than 500 research assistants (RAs) working together to collect data on COVID-19 policies around the globe. Our end goal is to publish as many datasets on these policies, in their most usable form, as possible, so that they can be used for different kinds of research and journalism. One of the many ways we organize the collected data is to manually read through each policy and classify it into one of 20 different policy types (Social Distancing, Lockdown, etc., you name it) and further subtypes. Because many policies don’t clearly belong to any single policy type, and because of the enormous daily volume of incoming data, it is difficult to double-check every single policy for labeling accuracy. This is where we get to experiment with different ML models in search of an algorithm that can identify miscoding and clean up these errors more effectively than manual work.

There have been, and will be, a lot of changes as the search for the optimal model continues. Thus, I want to document some of the personal progress I make, both successes and limitations, as I keep learning and reaching for a better model, in the hope that I can make meaningful contributions, small or large, to the common goal of the team! So…

Let’s get into it

Data collection

I scraped the data from the public repo of our organization, which can be found here. I used Selenium to collect the datasets from selected countries only (countries with many published policy records). [Note: If you also use Chrome, you’ll need to download chromedriver to your computer first in order to run and open the browser.]

from selenium import webdriver

# point Selenium at the downloaded chromedriver binary and open the repo folder
driver = webdriver.Chrome('/Users/irenechang/Downloads/chromedriver')
driver.get('https://github.com/CoronaNetDataScience/corona_tscs/tree/master/data/CoronaNet/data_country/coronanet_release')

Since the dataset filenames share the same format, I’m going to make a list of all the country names I want to scrape and format them into that filename pattern.

#showing only a few countries
country_names = ['Germany', 'United States of America', 'Spain', 'Australia', 'India']
countries_to_scrape = []
#parse string together
for country in country_names:
    countries_to_scrape.append("coronanet_release_" + country + ".csv")

Here comes the beauty of Selenium that makes all the scraping enjoyable: you can stand back and watch the browser move between pages on its own. We’re going to

(1) create an XPath for each link →

(2) go into each link → navigate to the download button → click on it,

all automated with the following snippets of code:

# get the link for each data file
urls = []
for country in countries_to_scrape:
    xPath = "//a[@title='"+ country + "']"
    print(xPath)
    link = driver.find_element_by_xpath(xPath).get_attribute("href")
    urls.append(link)

[Note: In case you haven’t used it before, XPath is a query language, just like SQL, but used on XML documents, aka the web ‘structure’ that can be viewed in ‘Developer Tools’. Here, we want to extract the links (the href attributes) to the CSVs of these countries, which can be found in <a> tags whose title attribute matches the filename text shown on the page.]

The highlighted part is where we navigate towards
//a[@title='coronanet_release_Germany.csv']
//a[@title='coronanet_release_United States of America.csv']
//a[@title='coronanet_release_Spain.csv']
//a[@title='coronanet_release_Australia.csv']
//a[@title='coronanet_release_India.csv']

☝️ This is the output from print(); these are our XPaths. Next, following a similar process, we move on to the second stage: getting the raw CSV files and concatenating all the resulting dataframes.

import pandas as pd

# visit each file page and grab the link behind the "Raw" button
csv_urls = []
for url in urls:
    driver.get(url)
    raw_csv = driver.find_element_by_id('raw-url').get_attribute("href")
    csv_urls.append(raw_csv)

# read each raw csv into a separate dataframe and stack them into one big frame
dfs = []
for csv in csv_urls:
    dfs.append(pd.read_csv(csv))
big_frame = pd.concat(dfs, ignore_index=True)

Data Preprocessing

From big_frame, I’m only interested in 2 columns, "description" and "type".
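A minimal line for that (assuming df is the working copy my later snippets use):

# keep only the policy description text and its manually coded policy type
df = big_frame[["description", "type"]].dropna().reset_index(drop=True)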

For the rest of the preprocessing steps, I closely followed Miguel’s notebook, which can be found here. His explanations at each step are very clear and helpful, especially for those who, like me, have just started handling text data. Generally, the checklist is (sketches of steps 1–4 and 6–7 follow below):

  1. Remove special characters and punctuation
  2. Lowercase the text
  3. Remove numbers
  4. Stemming and lemmatization
  5. Remove stopwords (country names, region names, and words commonly used in formal documents and policies)
  6. Label encoding
  7. Train-test split
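For reference, here is a minimal sketch of how steps 1–4 might look; the exact regexes and the description_1 column name are assumptions on my part to keep the later snippets consistent:

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                    # 2. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # 1 & 3. drop special characters, punctuation, numbers
    # 4. lemmatize each remaining token
    return " ".join(lemmatizer.lemmatize(token) for token in text.split())

df["description_1"] = df["description"].apply(clean_text)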

Here is a snippet of my step 5:

# stopwords
import re
import nltk
import pycountry
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = list(stopwords.words('english'))

# strip country names from the descriptions
country_text = []
for text in df["description_1"].tolist():
    for c in pycountry.countries:
        if c.name.lower() in text:
            text = re.sub(c.name.lower(), '', text)
    country_text.append(text)
df["description_2"] = country_text

# remove English stopwords, matching on word boundaries
for stop_word in stop_words:
    regex_stopword = r"\b" + stop_word + r"\b"
    df['description_2'] = df['description_2'].str.replace(regex_stopword, '', regex=True)
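Steps 6 and 7 aren’t shown above; here is a minimal sketch of how they can be set up, assuming df2 holds the cleaned text and category_codes is the label-to-integer mapping that the later snippets rely on (the split parameters are illustrative):

from sklearn.model_selection import train_test_split

df2 = df[["description_2", "type"]].copy()

# 6. label encoding: map each policy type to an integer code
category_codes = {label: code for code, label in enumerate(sorted(df2["type"].unique()))}

# 7. train-test split, stratified so that rare policy types show up in both sets
X_train, X_test, y_train, y_test = train_test_split(
    df2["description_2"], df2["type"],
    test_size=0.2, random_state=8, stratify=df2["type"])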

Fit the model

First of all, since the data is imbalanced, we will calculate the class weights for later use in the neural network:

import numpy as np
from sklearn.utils import class_weight

class_weights = list(class_weight.compute_class_weight('balanced',
                                                        classes=np.unique(df2['type']),
                                                        y=df2['type']))
class_weights.sort()
class_weights

# assign labels to these weights
weights = {}
for index, weight in enumerate(class_weights):
    weights[index] = weight

weights

Out: {0: 0.3085932721712538, 1: 0.4814408396946565, 2: 0.49128529698149953, 3: 0.6827469553450609, 4: 0.6949724517906336, 5: 0.7166903409090909, 6: 0.7187321937321938, 7: 0.8298519736842105, 8: 1.0577568134171909, 9: 1.1733720930232558, 10: 1.3105194805194804, 11: 1.4883480825958701, 12: 1.7955516014234876, 13: 2.156196581196581, 14: 2.787569060773481, 15: 2.9505847953216375, 16: 3.1933544303797468, 17: 3.2977124183006534, 18: 10.091, 19: 11.733720930232558}

Next, we transform our train and test sets into tf.data.Dataset objects and print a few examples to take a look:

import tensorflow as tf

dataset_train = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))
dataset_test = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))
for text, target in dataset_train.take(5):
    print('Desc: {}, label: {}'.format(text, target))

Here is the result of the first 5 descriptions with their labels attached:

Desc: b' us embassy montevideo consular section  closed   routine consular services   notice    emergency situations   considered   time ', label: b'Restriction and Regulation of Government Services'
Desc: b' pennsylvania governor signed  senate bill    waives  requirement  schools    session  least  days  provides  continuity  education plans  ensures school employees  paid   closure   provides  secretary  education  authority  waive student teacher  standardized assessments   march  ', label: b'Closure and Regulation of Schools'
Desc: b"dumka   district   n state  jharkhand   defined  government services  would remain operational   lockdown   follows    law  order agencies -  function without  restrictions   officers attendance - compulsory  grade ''  'b' officers  reduced  %  grade 'c'      district administration  treasury officials -   function  restricted staff    wildlife  forest officers -  function  taking necessary precautions ", label: b'Restriction and Regulation of Government Services'
Desc: b'texas     reopening  non-essential businesses starting may      per executive order ga-    hair salons  barber shops  nail salons   tanning salons must maintain mandatory ft distance  patrons    swimming pools  operate  % capacity     may   jail time  removed   enforcement mechanism  update   may   gov  abbott removed jail time   penalty  failing  follow covid- restrictions ', label: b'Restriction and Regulation of Businesses'
Desc: b' public safety curfew   imposed  mobile  alabama  effective   pm  april      remain  effect  april      persons shall remain   places  residence  shall    public places ', label: b'Lockdown'

Our next task is to one-hot encode the 20 labels. To do this, I created a hash table to conveniently look up the values by category codes:

table = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(list(category_codes.keys())),
        values = tf.constant(list(category_codes.values()))
    ),
    default_value=tf.constant(-1),
    name="target_encoding"
)
@tf.function
def target(x):
    return table.lookup(x)

Using these functions, we can encode the labels in our train and test sets into arrays of numbers. Print out the results with next()

def fetch(text, labels):
    return text, tf.one_hot(target(labels), 19)
train_data_fetch = dataset_train.map(fetch)
test_data_fetch = dataset_test.map(fetch)
next(iter(train_data_fetch))

Output (The first tf.Tensor is our train data, and the second one is our transformed train labels):

(<tf.Tensor: shape=(), dtype=string, numpy=b'   april   mha writes  states  ensure smooth harvesting  sowing operations   maintaining social distancing   -day lockdown  fight covid-  -  union ministry  home affairs  mha   sent  advisory   states regarding granting   exception  agricultural operations  lockdown restrictions  fight covid-  keeping  mind  harvesting  sowing season   -   advisory  exceptions   allowed  farming operations  farmers  farm workers  procurement  agricultural productions  operation  mandis  movement  harvesting  sowing related machinery  etc '>,
 <tf.Tensor: shape=(19,), dtype=float32, numpy=
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0.], dtype=float32)>)

Start creating a model

First of all, we need to tokenize our text data before feeding it into our model. I used an already implemented model from Google that will create an embedding layer for us. Briefly, this is how it works:

Google has developed an embedding model, nnlm-en-dim128 which is a token-based text embedding-trained model that uses a three-hidden-layer feed-forward Neural-Net Language Model on the English Google News 200B corpus. This model maps any body of text into 128-dimensional embeddings – Dipanjan (DJ) Sarkar

To use it, simply pass in the link to the model, which can be found on this website [Note: you might want to check to make sure the version you are using is the most up-to-date one, or else it will give you errors]:

import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/nnlm-en-dim128/2"
hub_layer = hub.KerasLayer(embedding, output_shape=[128], input_shape=[], dtype=tf.string,
                           trainable=True)
# train_data is a batch of raw description strings (see the evaluation section below)
hub_layer(train_data[:1])

Sneak peek into what this layer gives us:

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 1.85923278e-01,  3.82673025e-01,  8.69123638e-02,
        -2.36745372e-01, -1.19763926e-01, -5.65516986e-02,
         2.45870352e-01,  5.02816178e-02, -2.10541233e-01,
        -4.42932360e-02,  1.28366366e-01,  1.47269592e-01,
         1.41175740e-04,  4.45434526e-02,  2.13784329e-03,
         1.61750317e-01, -2.32903764e-01, -2.10702419e-01,
        -2.09106982e-01,  1.55449033e-01,  4.53584678e-02,
         4.31233309e-02,  1.48296393e-02, -1.68935359e-01,
         1.12579502e-01, -1.03304483e-01,  1.61703452e-01,
         2.13061482e-01, -4.74388264e-02,  1.27027377e-01,
        -3.04564610e-02, -1.92816645e-01, -3.22420187e-02, ... ]])

A neural network’s pitfall is overfitting, as always, so we usually have to use dropout layers to regularize it. A dropout layer regularizes the network by randomly ignoring a fraction of layer outputs during training, so the sparser network has to adapt and correct mistakes from prior layers (read more here).
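As a quick toy illustration of that behavior (not part of the actual pipeline):

import tensorflow as tf

# dropout only fires during training: ~30% of the activations are zeroed out
# and the survivors are scaled by 1/0.7 so the expected sum stays the same
drop = tf.keras.layers.Dropout(0.3)
x = tf.ones((1, 10))
print(drop(x, training=True))   # some zeros, remaining values ≈ 1.43
print(drop(x, training=False))  # unchanged: dropout is a no-op at inference time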

Without knowing whether my model would be in this situation or not, I trained two models, one with dropout layers and one without, and compared their results. For these models, don’t forget to use early stopping as well.

# build the basic model without dropout
model_wo_dropout = tf.keras.Sequential()
model_wo_dropout.add(hub_layer)
for units in [128, 128, 64, 32]:
    model_wo_dropout.add(tf.keras.layers.Dense(units, activation='relu'))
model_wo_dropout.add(tf.keras.layers.Dense(19, activation='softmax'))
model_wo_dropout.summary()

# compile the model (the softmax layer already outputs probabilities, so the loss takes them as-is)
model_wo_dropout.compile(optimizer='adam',
                         loss=tf.keras.losses.CategoricalCrossentropy(),
                         metrics=['accuracy'])

# shuffle and batch the datasets
train_data_fetch = train_data_fetch.shuffle(70000).batch(512)
test_data_fetch = test_data_fetch.batch(512)

# fit the model with early stopping on the validation loss
from tensorflow.keras import callbacks
earlystopping = callbacks.EarlyStopping(monitor="val_loss", mode="min", patience=5,
                                        restore_best_weights=True, verbose=1)
text_classifier_wo_dropout = model_wo_dropout.fit(train_data_fetch, epochs=25,
                                                  validation_data=test_data_fetch,
                                                  verbose=1, class_weight=weights,
                                                  callbacks=[earlystopping])

Evaluation on our test and train sets~

from sklearn.metrics import classification_report

# training errors
# (train_length mirrors test_length below: the number of examples in that split)
train_data, train_labels = next(iter(dataset_train.map(fetch).batch(train_length)))
y_train_pred = model_wo_dropout.predict(train_data)
print(classification_report(train_labels.numpy().argmax(axis=1), y_train_pred.argmax(axis=1)))

# test errors
test_data, test_labels = next(iter(dataset_test.map(fetch).batch(test_length)))
y_pred = model_wo_dropout.predict(test_data)
print(classification_report(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1)))
Training results
Testing results

Compared to the logistic regression model I tried out as a baseline before building this neural network, the result improved by 10%, which is a really good sign. I will discuss these results further after I finish implementing the model with dropout layers.
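For reference, the baseline I’m comparing against looked roughly like this; this is a sketch, and the exact features and hyperparameters in my notebook may differ:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# TF-IDF features + logistic regression as a simple baseline
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=20000),
    LogisticRegression(max_iter=1000, class_weight='balanced'))
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))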

Model with dropout layers to confirm overfitting

As I said earlier, to confirm whether our model overfits the data or not, I trained a second neural network with several dropout layers and compared the results. We could grid-search for the best dropout rate in the range 0.1–1.0 (a sketch of such a sweep is below); here, I used 0.3 as the parameter.
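A sketch of what that dropout-rate sweep could look like; build_dropout_model is a hypothetical helper that rebuilds and compiles the architecture shown below with a given rate:

# hypothetical sweep over candidate dropout rates
results = {}
for rate in [0.1, 0.2, 0.3, 0.4, 0.5]:
    m = build_dropout_model(rate)  # rebuilds and compiles the Sequential model below
    hist = m.fit(dataset_train.map(fetch).shuffle(70000).batch(512),
                 epochs=25,
                 validation_data=dataset_test.map(fetch).batch(512),
                 class_weight=weights,
                 callbacks=[earlystopping],
                 verbose=0)
    results[rate] = max(hist.history['val_accuracy'])
print(results)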

We have to recreate the train and test sets. The only difference in this model is in the layer-building process; everything from that point onwards is the same as in the model without dropout layers. It’s important to note that we should not place a dropout layer right before the output layer (aka the last layer), because the output layer can no longer correct the errors from the previous layers, so adding another dropout layer there ends up hurting the model’s performance (which was the case for me).

# re-create train, test data for fitting the model
train_data_fetch_dropout = dataset_train.map(fetch)
test_data_fetch_dropout = dataset_test.map(fetch)
# build a similar model but with the dropout layers
model = tf.keras.Sequential()
model.add(hub_layer)
for units in [128, 128, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(19, activation='softmax'))

model.summary()

Here are the results:

Training results
Testing results

~Discussion and extended questions:

  • The performance of the models with and without dropout layers is pretty much the same. The gap between train and test errors doesn’t improve even after using the dropout layers. In fact, with dropout, the model fails to predict the labels for many more categories than the model without it (many have 0% accuracy). However, given the size of the dataset and the fact that regularization doesn’t significantly improve the model, I don’t think overfitting is that big of a problem here. Further analysis could involve fitting models with different dropout values to check whether the performance stalls because we didn’t choose a good dropout rate, or because of the dataset itself.
  • The logistic regression, as well as the other ML models, doesn’t have that big a gap between training and testing errors. It can be said that these ML models don’t suffer from overfitting as much as the neural network; however, they can hardly be improved further, given that we already use their best parameters. Meanwhile, the neural network stands a good chance of improving beyond this point, and since our interest lies in getting more accurate predictions, we will go with the neural network.
  • The overall accuracy of both models goes up after I added more data points (more countries). This makes me wonder whether using all of the available data would increase the accuracy even more.
  • The accuracy fluctuates slightly across fitting attempts (a neural network can get stuck in local minima/maxima), but on average the result falls in the range [0.66–0.71], a ~5–10% increase over the average of the machine learning models I fitted (~59–60%). Looking at the confusion matrix, the metrics within each category also improve.
  • Another factor that might affect the predictive power of the model is that many categories (take Health Resources, Health Monitoring, and Health Testing) share a lot of words in common, so one of the next steps is to identify a collection of stopwords that appear frequently across different categories, remove them to reduce the similarity between categories, and see if this increases the performance even further (a rough sketch of this idea follows the list).
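A rough sketch of that last idea, assuming the cleaned frame df2 from earlier (purely exploratory, not part of the current pipeline):

from collections import Counter

# collect the 100 most frequent words within each policy type
top_words = {}
for label, group in df2.groupby('type'):
    counts = Counter(' '.join(group['description_2']).split())
    top_words[label] = {w for w, _ in counts.most_common(100)}

# words that rank highly in many different categories are candidates
# for an extended, domain-specific stopword list
overlap = Counter(w for words in top_words.values() for w in words)
extra_stopwords = [w for w, n in overlap.items() if n >= 10]
print(extra_stopwords[:30])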

To conclude, this neural network model is pretty simple and there is definitely room for improvement, with an emphasis on the data preprocessing procedure (more stopwords, over-/undersampling), as well as further cross-validation to average out the accuracy. That being said, this has been a very fulfilling learning experience and my first exposure to neural networks! If you have made it this far, thank you so much! I look forward to updating you on future versions of this model.

A full version of this notebook can be found in this Github repo:

irenechang1510/Topic-classification-NN

Feel free to reach out to me via my LinkedIn if you have any questions; I would love to connect!

References:

[1] Dipanjan (DJ) Sarkar, Deep Transfer Learning for Natural Language Processing: Text Classification with Universal Embeddings (2018), Towards Data Science

[2] Miguel Fernández Zafra, Text Classification in Python (2019), Towards Data Science

