
Text mining with Simone – part 1

The first blog in a series about text mining within an organisation

Introduction

With the digital transformation of, well, the entire world, there has been an explosion of textual information from a wide array of sources. Within this context, textual information refers to unstructured data, such as HTML, XML and document formats like Microsoft Word, Adobe PDF and email.

Along with the world’s digital transformation came the rise of data mining. Organisations have made steps to create value using data and data-driven insights, but these efforts largely remain focused on structured data, not unstructured data. Unstructured data is often scoped out.

Yet the most natural form of storing information is through text. To explain something, you would likely write it out in a document. However, no matter the structure the document is given, for a computer it is an unstructured collection of words.

Even information about structured data is often communicated using documents. There is no denying that textual information plays a huge role in the communication of information. At the end of the day, it is through language that we communicate knowledge. One could even argue that without language, knowledge would not exist at all.

Consumers can be grateful that out-of-the-box tooling such as Microsoft File Explorer and Apple Finder offers enough functionality to make private collections of documents (somewhat) manageable. Consumers also benefit from companies like Google, which make the vast ocean of textual information on the Internet searchable.

Organisations, on the other hand, face a more complex struggle with the management of documents. The majority of organisational data is stored in an unstructured form (about 80%), yet control over all this information remains limited. Not being able to find a specific document on a company share probably doesn’t sound unfamiliar to you either.

Although organisations have recently been putting a lot of effort into the effective management of data, this generally remains limited to structured data. Organisations take up Data Management with goals related to growth, efficiency and compliance. So what about the documents? Why are those gains not cherished? Why is structured data mining more common than text mining?

Quick introduction to text mining

Text mining is a type of data analysis that aims to retrieve valuable insights from textual information. It is part of the field of study referred to as Natural Language Processing (NLP), which sits at the intersection of computational linguistics, computer science and artificial intelligence. NLP is a way for computers to analyse and understand human language. It is commonly used for applications such as machine translation, automated question answering and, of course, text mining.

Due to the ambiguity and complexity of human language, data preparation needs to be carried out before text mining to make sure that the information inside documents is presented in a suitable way. This stage can be referred to as text refinement.

Text refinement consists of cleansing activities such as the removal of stop words and the stemming of the words that remain. Stop words are words such as ‘the’, ‘as’ and ‘a’. The most common word in this blog is ‘the’, but that is not a valuable insight. That is why these essentially meaningless words are removed before starting the analysis. Stemming refers to reducing words that share a root and a meaning to one common form: for example, ‘manager’, ‘manage’ and ‘management’ could all be classified into the same category named ‘managing’. During the text refinement phase, concepts are also predefined and synonyms are identified, because the meaning of a term depends on its context. To illustrate, a ‘good party’ in the context of a contract negotiation means something entirely different from a ‘good party’ in the context of a nightclub. Once the text is refined, it can be analysed. There will likely be a few jumps back and forth between refinement and analysis before achieving your analytical goal.
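The two cleansing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the stop-word list, the suffix list and the function names `naive_stem` and `refine` are all assumptions made for this example, and a real project would more likely rely on an established stemmer from an NLP library. Note that the sketch maps the three example words to the stem ‘manag’ rather than to a readable category name.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "as", "a", "an", "and", "of", "to", "in", "is"}

# Naive suffix list for illustration only; this is not a real stemming algorithm.
SUFFIXES = ("ement", "ing", "er", "es", "ed", "e", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def refine(text: str) -> list[str]:
    """Lowercase, tokenise, drop stop words and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(refine("The manager and the management manage a team"))
# ['manag', 'manag', 'manag', 'team']
```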

Text mining within an organisation

The organisational applications of text mining are twofold: analytics and enterprise search.

Analytics

Descriptive analytics in text mining refers to the automatic retrieval of information from documents (without having to read the documents in full). A word cloud could be created to extract and visually represent the main topics of one or more documents. This can create interesting insights into your information landscape. The application of word clouds is limited, however.

A more value-adding exercise would be topic extraction and named-entity recognition. Topic extraction is the identification of meaningful terms within a document. Named-entity recognition is the extraction of names that fall into predefined categories such as people, organisations and locations. The extracted terms, topics and entities can be attached to the document as metadata. This makes it much easier for end-users to find and understand documents.

Another application within descriptive text analytics is sentiment analysis. A sentiment analysis can be carried out on a body of documents to determine whether the general sentiment of the documents is negative, positive or neutral. This is done by identifying positive and negative terms and counting the number of these terms in each document. An example of an organisational application could be to determine the general sentiment of employee reviews or to assess the success of a company event using social media data. Starbucks, for example, uses real-time text mining and sentiment analysis to identify negative tweets and quickly respond to them.
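The counting approach described above can be sketched as follows. The two word lists and the function name `sentiment` are hypothetical and chosen only for this example; a real analysis would use a curated sentiment lexicon and would also need to deal with negation (‘not good’) and sarcasm, which this sketch ignores.

```python
import re

# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of terms.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment(text: str) -> str:
    """Label a text by comparing its counts of positive and negative terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


print(sentiment("Great event, I love the venue"))
print(sentiment("The service was slow and the coffee was terrible"))
```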

Levels of text mining - Image by author

Predictive text analytics takes things one step further, as documents can also be clustered and classified based on their content. This is done by grouping similar documents based on how frequently specific terms occur within a document compared with how frequently those terms occur in other documents. This weighting is referred to as term frequency-inverse document frequency (tf-idf).
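To make the tf-idf idea concrete, here is a small sketch using only the Python standard library. It uses one common variant of the weighting (relative term frequency multiplied by the natural logarithm of the number of documents divided by the number of documents containing the term); other variants exist, and the function names `tokenize` and `tf_idf` are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenised = [tokenize(doc) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the contract terms", "the invoice amount", "the contract renewal"]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, scores)
```

In this toy corpus ‘the’ occurs in every document and therefore receives a weight of zero, while a term that appears in only one document, such as ‘invoice’, receives the highest weight.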

Knowing that a specific document is of a certain type, we can use text mining technologies to determine which other documents belong to that same category. This is a form of supervised learning that greatly benefits the organisation of textual information. Implementing such text mining technologies in your document landscape can help your organisation make its way towards effective content curation.
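As a rough illustration of this kind of supervised classification, the sketch below builds one term-count profile per known document type and assigns a new document to the type whose profile it resembles most, measured by cosine similarity. This is just one simple technique chosen for the example, not necessarily what a commercial product does; the labels, example texts and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Sum the term counts of all example documents per label."""
    profiles = {}
    for text, label in labelled_docs:
        profiles.setdefault(label, Counter()).update(vectorise(text))
    return profiles


def classify(text, profiles):
    """Return the label whose profile is most similar to the text."""
    vector = vectorise(text)
    return max(profiles, key=lambda label: cosine(vector, profiles[label]))


examples = [
    ("payment due invoice amount total", "invoice"),
    ("agreement between parties contract clause", "contract"),
    ("invoice number and amount payable", "invoice"),
]
profiles = train(examples)
print(classify("please pay the invoice amount", profiles))
print(classify("the parties sign the contract", profiles))
```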

Content curation refers to the process of discovering, gathering and presenting information about a specific topic. This resembles what Netflix does with movies, as it suggests other movies based on the characteristics of the selected one. In a knowledge-driven organisation this can be of great benefit, for example when looking for a subject matter expert or when collecting existing information to write a new contract or proposal.

Using prescriptive text analytics, the computer could predict where a document needs to be saved based on its content. Implementing real-time text mining technologies could allow your system to classify the documents you write on the go. Based on the words you use while writing your document, the system could detect anomalies. For example, the system could give a warning asking if the saving location is appropriate, suggest a different title or automatically generate metadata for the document.

Enterprise Search

In the context of organisational Document Management, the text mining technologies mentioned above all improve the quality of organisational search by making it easier to find information. Organisational search, however, can also be improved upon by implementing text mining technologies within the search itself.

Say an employee wants to find all supplier telephone numbers, but these are ‘hidden’ somewhere in their inbox. Using the predefined ‘telephone number’ category, the search would not look for specific phone numbers one at a time but return all results that match the predefined phone number category. These generalisation techniques make search much more efficient.
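A predefined category like this can be approximated with a pattern. The sketch below is only an illustration: the regular expression is an assumption that loosely matches runs of 9 to 14 digits with optional separators and an optional leading ‘+’, the sample messages are made up, and real phone formats differ per country, so a pattern like this will produce both misses and false positives.

```python
import re

# Illustrative pattern: an optional "+" followed by 9 to 14 digits that may be
# separated by single spaces, dots or hyphens. Real formats vary by country.
PHONE_PATTERN = re.compile(r"(?<![\w+])\+?\d(?:[ .-]?\d){8,13}(?!\w)")


def find_phone_numbers(texts):
    """Return every substring in the given texts that looks like a phone number."""
    results = []
    for text in texts:
        results.extend(PHONE_PATTERN.findall(text))
    return results


inbox = [
    "Call our supplier on +31 20 123 4567 before Friday.",
    "Order 12345 shipped",
    "Reach me at 020-555-0199.",
]
print(find_phone_numbers(inbox))
# ['+31 20 123 4567', '020-555-0199']
```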

Stemming helps with this as well, as text mining technologies can return search results even when the search term does not match exactly. The implementation of such search technologies has the benefit of increasing operational effectiveness, as employees no longer have to click through hundreds of folders to find specific documents.

Such search methods are not only useful for the end-user but also for managers trying to maintain control over the information landscape and remain compliant with laws and regulations. Generalised search categories can be used to monitor whether people are saving information in the right location. For example, if personal information (such as credit card numbers and social security numbers) is used in letters to the customer, these letters can easily be found in bulk and archived appropriately.
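As a sketch of how such a monitoring category might work for credit card numbers, the example below first finds digit sequences of a plausible length and then keeps only those that pass the Luhn checksum, which payment card numbers are designed to satisfy. The pattern, the function names and the sample letter are assumptions for illustration; the number 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
import re

# Candidate pattern (an assumption): 13 to 19 digits, optionally separated by
# single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    checksum = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


def find_card_numbers(text):
    """Return candidate sequences in the text that pass the Luhn check."""
    return [match for match in CARD_CANDIDATE.findall(text) if luhn_valid(match)]


letter = "Card 4111 1111 1111 1111 was charged; reference 1234 5678 9012 3456."
print(find_card_numbers(letter))
# ['4111 1111 1111 1111']
```

The checksum step filters out digit sequences that merely look like card numbers, such as the reference number in the sample letter, which reduces false alarms when scanning documents in bulk.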

Implementation

Naturally, there are many existing solutions available in the market that make use of the functionalities described above. Stay tuned for the next blog in this series to read more about existing market solutions.

