I recently participated in an Omdena AI challenge in partnership with Save the Children, a leading humanitarian organization for children, to use natural language processing (NLP) techniques to explore the problem of online sexual violence against children.

Topic Modeling
Topic modeling is an unsupervised machine learning text analysis technique capable of detecting word and phrase patterns within a collection of documents. The detected patterns are then used to cluster the documents into specific topics. It is a frequently used text-mining tool for discovering semantic structures within text data.
In simple terms, a document about a particular topic will tend to contain words characteristic of that topic.
The goal of this task was to collect text data from scientific articles, apply topic modeling to cluster the documents into separate topics, and then create a visualization of this clustering using a network graph.
Data Collection
Before any analysis could be done, one of the first tasks was to collect data on child sexual abuse material (CSAM). Articles were collected from a number of repositories, including CORE, IDEAS, ResearchGate, the Directory of Open Access Journals, and WorldCat. In this article, we focus on the articles collected from CORE.
The data was collected via the CORE API. To make requests, we registered for an account with CORE and received an API key. Each request requires the API key and the search terms. The search terms used were –
#query = "((internet OR online OR technology OR chatroom OR chat) AND (minor OR child OR underage OR children OR teenager) AND (abuse OR exploitation OR grooming OR predation) AND (sex OR sexual OR carnal)) OR (internet OR online OR technology OR chatroom OR chat) AND (predators OR offenders) AND (sex OR sexual OR carnal) OR (child sexual abuse material OR CSAM)"
#query = "(internet OR online OR technology OR chatroom OR chat) AND (predators OR offenders) AND (sex OR sexual OR carnal)"
#query = "(child sexual abuse material OR CSAM)"
Python code to collect data –
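Below is a minimal sketch of the collection step, assuming CORE's v3 search API and a registered API key; the endpoint, parameters, and response fields may differ depending on the API version used, and the data frame column names are illustrative.

import requests
import pandas as pd

API_KEY = "YOUR_CORE_API_KEY"  # obtained after registering with CORE
query = "(child sexual abuse material OR CSAM)"

# NOTE: endpoint and parameter names are assumptions based on CORE's v3 search API;
# adjust them to the API version you registered for (an offset parameter can be used to page through results)
url = "https://api.core.ac.uk/v3/search/works"
headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.get(url, headers=headers, params={"q": query, "limit": 100})
response.raise_for_status()

records = []
for work in response.json().get("results", []):
    records.append({
        "Title": work.get("title"),
        "abstract": work.get("abstract"),
        "DOI": work.get("doi"),
    })

core_df = pd.DataFrame(records)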
Data Cleaning & Preprocessing
As with any type of data analysis, we had to clean the data first. A visual inspection showed that duplicate rows were present. Duplicates were dropped based on a number of columns, including the title, abstract, and DOI, looked at both individually and in combination.
# drop duplicates on Title; similar checks were done on abstract, DOI, and their combinations
core_df = core_df.drop_duplicates(subset=['Title'])
We used the clean-text library, a Python package that combines many cleaning steps, including removing punctuation, numbers, and extra spaces, and converting text to lower case.
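A minimal sketch of this cleaning step, assuming the clean-text package's clean() function (parameter names may vary slightly between versions) and the abstract column from the collection step:

from cleantext import clean

def clean_text_column(text):
    # lower-case, strip punctuation and numbers; clean-text also normalises
    # unicode and collapses extra whitespace
    return clean(
        str(text),
        lower=True,
        no_punct=True,
        no_numbers=True,
        replace_with_number="",
    )

core_df["clean_abstract"] = core_df["abstract"].apply(clean_text_column)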
The data were filtered by checking that certain keywords were present in the abstract and/or text columns.
This manual filtering method, however, dropped many relevant articles, so a different approach was eventually taken. More on this later.
After the preprocessing steps, we had roughly 35,000 rows in the data frame.

TF-IDF Vectorization
In Natural Language Processing, text must first be converted into something a machine can understand. Transforming text into numbers allows us to perform mathematical operations on it. There are many text vectorization methods; in this article, we shall be exploring TF-IDF vectorization.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a technique in which we assign an importance score to each word in a corpus or dataset. The importance is assigned based on two criteria –
Term Frequency – How frequently the word appears in a document, divided by the total number of words in that document.
Inverse Document Frequency – How rare or common the word is across the corpus, calculated as the logarithm of the ratio of the total number of documents to the number of documents that contain the word.
The TF-IDF score is the product of these two terms; the higher the score, the more important the word is to that document.
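A minimal sketch of the vectorization step, assuming scikit-learn's TfidfVectorizer with the parameters described in the next paragraph; the clean_abstract column continues the earlier cleaning sketch.

from sklearn.feature_extraction.text import TfidfVectorizer

# unigrams and bigrams, keeping the 10,000 highest-scoring features
vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
)
tfidf_matrix = vectorizer.fit_transform(core_df["clean_abstract"])
print(tfidf_matrix.shape)  # (number of articles, 10000)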
The parameter max_features of 10000 refers to the maximum number of top features to consider. The ngram_range specifies that we are considering unigrams and bigrams. The output of the vectorizer is a 35,000 × 10,000 sparse matrix, with 35,000 referring to the number of articles and 10,000 to max_features.
Dimensionality Reduction
This is the transformation of data from a high-dimensional space to a low-dimensional space by reducing the number of features, such that the low-dimensional representation still retains the meaningful properties of the original data.
We ran into memory issues when trying to fit K-Means on the full TF-IDF matrix. We therefore opted for dimensionality reduction using Latent Semantic Analysis (LSA), implemented as truncated Singular Value Decomposition (TruncatedSVD in scikit-learn).
As with any dimensionality reduction technique, to decide on the number of components, we tried a range of component counts and selected the one with the highest explained variance ratio.
We chose 2,800 components, which gave an explained variance ratio of 0.89.
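A sketch of the reduction step, assuming scikit-learn's TruncatedSVD; the list of component counts tried here is illustrative.

from sklearn.decomposition import TruncatedSVD

# try several component counts and record the explained variance ratio for each
for n_components in [500, 1000, 2000, 2800]:
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd.fit(tfidf_matrix)
    print(n_components, svd.explained_variance_ratio_.sum())

# 2,800 components retained ~0.89 of the variance, so that was the final choice
svd = TruncatedSVD(n_components=2800, random_state=42)
reduced_matrix = svd.fit_transform(tfidf_matrix)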
K-Means Clustering
K-Means is a popular unsupervised learning technique for grouping data points into non-overlapping subgroups. Clusters are groups of objects that are more similar to one another than they are to objects outside their cluster.
Similarly, we ran K-Means for cluster counts in the range 2–80 and calculated the silhouette score for each. The number of clusters with the best score was chosen.
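A sketch of this model-selection loop, assuming scikit-learn's KMeans and silhouette_score on the SVD-reduced matrix:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 81):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(reduced_matrix)
    # sample_size keeps the silhouette computation tractable on ~35K documents
    score = silhouette_score(reduced_matrix, labels, sample_size=10000, random_state=42)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)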

We chose 72 as the initial number of optimal clusters.
Upon manual inspection of the clusters, we identified a number of clusters to remove: clusters of documents not relevant to the domain, as well as clusters of documents in other languages. The clustering turned out to be a very effective way to filter out irrelevant articles.

The word cloud above shows the most important words of each cluster from the first clustering run. Clusters 4, 10, 14, 18, and 44, for example, are not in English.
Network Graphs

Network graphs are a way to show the interconnectedness between entities. The goal of this sub-task was to visually represent the relationships between the different articles. To achieve this, we chose to highlight two types of relationships: the topics the articles belong to, and a numerical measure of the similarity between the articles. We decided on two techniques to achieve this –
Cosine Similarity
This is a measure of how similar documents are to each other. It is the cosine of the angle between two vectors projected in a multi-dimensional space, which tells us whether the vectors are pointing in more or less the same direction.
Another common approach to measuring the similarity between documents is to count the number of words they have in common. There is a problem with this approach, however: the larger the documents, the more words they are likely to have in common, irrespective of the topic. Cosine similarity works well here because it captures the orientation of the vectors rather than their magnitude.
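A sketch of the similarity computation, assuming scikit-learn's cosine_similarity on the TF-IDF vectors:

from sklearn.metrics.pairwise import cosine_similarity

# pairwise similarity between articles; entry [i, j] is the cosine of the angle
# between the TF-IDF vectors of articles i and j (between 0 and 1, since TF-IDF
# vectors are non-negative)
similarity_matrix = cosine_similarity(tfidf_matrix)

# NOTE: for ~35K articles the full matrix is large; computing it in chunks,
# e.g. cosine_similarity(tfidf_matrix[i:i + 1000], tfidf_matrix), keeps memory in check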
Louvain Clustering
Louvain clustering is an unsupervised machine learning technique for detecting communities in large, complex networks.
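A sketch of the community detection step, assuming a NetworkX graph built from the thresholded similarity matrix and NetworkX's built-in Louvain implementation (available from networkx 2.8; the python-louvain package offers an equivalent best_partition function):

import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

# build a weighted graph: nodes are articles, edges are similarities above a threshold
threshold = 0.3
graph = nx.Graph()
graph.add_nodes_from(range(similarity_matrix.shape[0]))
rows, cols = np.where(np.triu(similarity_matrix, k=1) > threshold)
for i, j in zip(rows, cols):
    graph.add_edge(int(i), int(j), weight=float(similarity_matrix[i, j]))

communities = louvain_communities(graph, weight="weight", seed=42)
print(len(communities))  # number of detected communities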
Topics
From Louvain clustering, we get 5 distinct communities. A manual inspection of the communities suggests the following topics –
- Institutional, Political(legislative) & Social Discourse
- Online Child Protection – Vulnerabilities
- Technological Perspective
- Analysis of Offenders
- Commercial Perspective & Trafficking

Visualization
With cosine similarity calculated and Louvain clustering implemented, we now have what we need to create a network graph. As mentioned, a network graph shows the connections between entities; an entity is represented by a node (or vertex), and connections between nodes are represented by links (or edges).
The cosine similarity output is a number between 0 and 1, ranging from no relationship to a perfect relationship. The more similar the documents, the stronger the link. The nodes are colored based on the topic the article most identifies with.
The visualization was created using D3.js and hosted for free on Netlify (refresh the page if it doesn't load properly the first time); it may be found here.
Hover over the nodes to view the titles of the articles and of other linked articles. The nodes may be dragged for better visibility. To reduce clutter, we set a threshold on link strength: links with a cosine similarity below 0.3 are excluded from the visualization.
Also, to prevent the nodes from straying too far from the visible screen area, we set a bounding border, which forms the square shape around the visualization.
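Continuing the earlier sketches, a rough example of how the node and link data for the D3 graph could be exported from Python; the JSON field names follow the classic D3 force-directed-graph example and are illustrative rather than the exact schema used.

import json

# map each article index to its Louvain community (topic) id
community_of = {}
for topic_id, community in enumerate(communities):
    for node in community:
        community_of[node] = topic_id

nodes = [
    {"id": int(i), "title": str(title), "group": community_of.get(i, -1)}
    for i, title in enumerate(core_df["Title"])
]
# the graph already contains only links with similarity above the 0.3 threshold
links = [
    {"source": int(i), "target": int(j), "value": float(data["weight"])}
    for i, j, data in graph.edges(data=True)
]

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f)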
A similar analysis was done for newspaper articles and the visualization for that may be found here.
Future Work
A number of areas could be improved. TF-IDF has several limitations: it can be computationally intensive for large vocabularies, and it does not capture the semantic meaning of words. "Baffle" and "confuse", for example, are similar in meaning, but TF-IDF will not capture that.
More advanced methods include word embeddings such as word2vec and doc2vec (for documents as opposed to words).
A more state-of-the-art method for measuring document similarity is to use Bidirectional Encoder Representations from Transformers (BERT), an attention-based Transformer model that learns contextual word embeddings.
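As an illustration, here is a sketch of how document embeddings could be produced with a BERT-based model via the sentence-transformers library; the model name is one common choice, not one used in the project.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained model
embeddings = model.encode(core_df["abstract"].fillna("").tolist(), show_progress_bar=True)

# the same cosine-similarity pipeline can then be applied to these embeddings
bert_similarity = cosine_similarity(embeddings)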
One possible application of this would be a search functionality to query articles within particular topics and within some range of similarity to a specific article. This could be achieved with the help of graph databases, which are capable of storing data in a graph structure. The intuitive structure of network graphs allows complex queries that are typically not possible with relational databases.
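A hypothetical sketch of such a query using the official Neo4j Python driver; the Article label, SIMILAR_TO relationship, and property names are assumptions about the schema rather than the actual database used.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# hypothetical schema: (:Article {title, topic})-[:SIMILAR_TO {score}]-(:Article)
query = """
MATCH (a:Article {title: $title})-[s:SIMILAR_TO]-(b:Article)
WHERE b.topic = $topic AND s.score >= $min_score
RETURN b.title AS title, s.score AS score
ORDER BY score DESC
"""

with driver.session() as session:
    result = session.run(query, title="Example article title",
                         topic="Analysis of Offenders", min_score=0.3)
    for record in result:
        print(record["title"], record["score"])

driver.close()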

The data currently loaded is quite simple, with a modest number of nodes, a few properties, and only one relationship type (cosine similarity) defined. However, thanks to the flexibility of graph databases, it would be possible to add more nodes and labels, as well as define more relationships as the need arises. These relationships may be derived using machine learning as well as other non-ML methods.
Neo4j also allows the installation of third-party plugins, including data science/ML plugins that make it possible to apply machine learning algorithms such as Louvain clustering and community detection within the database.