
Text mining with Simone – part 2

The second blog in a series about text mining within an organisation

In the first blog of this series, we dove into the concept of text mining. In part 2 of this series, we continue our quest for control.

Introduction

So it has been decided. Control needs to be gained over that vast amount of unstructured information. Relying on people alone is not feasible: our employees have enough on their plates as it is, and cleaning up terabytes of unstructured data simply does not fit into their busy schedules. Automation is required. So, what are the options?

Naturally, many out-of-the-box solutions exist to help organisations manage their unstructured data. The Enterprise Content Management (ECM) market has evolved and changed a lot over the years. There have been many mergers and takeovers, as the focus of offerings has shifted from ‘document management’ to ‘content management’. Gartner publishes quarterly reports on ECM services and, in their latest release, identified OpenText, Box, Hyland and Microsoft as some of the top solution providers in this field.

The solution that best suits your needs depends on the ins and outs of your organisation. It is very important to keep in mind that technology is there to support you. What I often see in the market is that employees have adjusted their ways of working to comply with what the solution demands, rather than the solution being adjusted to fit their demands.

In this blog, I will focus on the provision of seven different capabilities, of which the first four – creation, formalisation, archiving and destruction – fall in the more traditional document and records management sphere, whereas the other three – monitoring, metadata and search – fall in the content management sphere.

The seven capabilities of content and records management - Image by author

Creation

To start off, an ECM solution should support the end user’s needs. In this digital society, that means that it should be possible to create documents on a number of different devices and that (real-time) document collaboration with colleagues is facilitated. From a content management viewpoint, it is also desired that the solution provides the end users with templates when starting with document creation.

Formalisation

Secondly, there is generally a point at which the document is formalised. This is the moment a document becomes a ‘record’ and it can no longer be adjusted or changed. For a contract, for example, this would be the moment that it is signed by both parties. For other documents this moment may not be as clearly defined. Generally this would be the moment a 1.0 version is created, but even this 1.0 version may later be subject to change. Think about a policy document, for example. To support the formalisation of such documents, a workflow is desired where the head of a department or head of a team can formally approve the document, making it crystal clear which version is the last. For documents that do not require formal approval from a department head, the end users should be able to publish the document as the final version themselves. When formalised, these documents should no longer be confused with the body of active documents such as earlier versions of the same documents, making it clear for the entire organisation which document they should consult when looking for specific information.
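To make the formalisation step concrete, here is a minimal Python sketch of such a workflow. It is an illustration only, not how any particular ECM product implements it: the `ManagedDocument` class, its `"draft"` and `"final"` status values and the `is_department_head` flag are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManagedDocument:
    """Hypothetical document object with a simple draft -> final lifecycle."""
    title: str
    content: str
    requires_approval: bool               # e.g. True for a policy document
    status: str = "draft"                 # assumed statuses: "draft" or "final"
    formalised_by: Optional[str] = None

    def edit(self, new_content: str) -> None:
        # Once a document is a record, changes are refused.
        if self.status == "final":
            raise PermissionError("A formalised record can no longer be changed.")
        self.content = new_content

    def formalise(self, actor: str, is_department_head: bool = False) -> None:
        # Documents that need formal approval can only be finalised by a head
        # of department or team; other documents can be published by the user.
        if self.status == "final":
            raise ValueError("Document is already final.")
        if self.requires_approval and not is_department_head:
            raise PermissionError("Formal approval by a department head is required.")
        self.status = "final"
        self.formalised_by = actor
```

In this sketch, a new 1.0 version of a policy would be modelled as a new `ManagedDocument` rather than as an edit to the formalised one, which keeps the final record untouched.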

Archiving

Thirdly, the solution should have an archiving feature. There must be a moment at which documents leave the active environment to be kept in a secure environment where they can be viewed when needed. These ‘legacy documents’ create unnecessary clutter when kept among active documents, yet need to be saved in case a question is raised regarding the history of documents. Therefore, these two types – legacy and active documents – should be stored in separate environments.

Destruction

Naturally, legacy documents also stack up. What’s more, they are subject to legal requirements such as laws about retention periods. When it comes to personally identifiable information, it may not even be allowed to store these documents at all. That is why a destruction mechanism is an essential part of any ECM solution. For formalised records, this can be automated by having documents destroyed when the retention period is over. Non-formal documents that never obtain a final status, such as notes or drafts, should be cleaned up periodically according to a standardised destruction policy.
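As a rough illustration of the two destruction rules described above, the Python sketch below checks whether a document is due for destruction. The `Document` fields, the status values and the one-year default for drafts are assumptions made for this example. Real retention periods come from legislation and your own destruction policy.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    title: str
    status: str                      # assumed values: "draft" or "final"
    formalised_on: Optional[date]    # None if the document was never formalised
    last_modified: date
    retention_years: Optional[int]   # added as metadata at formalisation


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def is_due_for_destruction(doc: Document, today: date,
                           draft_max_age_days: int = 365) -> bool:
    """Return True when the document may be destroyed under the sketched policy."""
    if doc.status == "final":
        # Formalised records: destroy once the retention period has passed.
        if doc.formalised_on is None or doc.retention_years is None:
            return False  # incomplete metadata, leave for manual review
        return today >= add_years(doc.formalised_on, doc.retention_years)
    # Non-formal documents (notes, drafts): clean up after a fixed idle period.
    return (today - doc.last_modified).days > draft_max_age_days
```

A scheduled job could run this check across the archive and hand the resulting list to an information manager for sign-off before anything is actually deleted.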

Metadata

This brings us to my favourite, the crux of ECM: metadata. Metadata is simply data about data. For a document, this could be the document type, the title, the status, the author or the date of creation. This is where text mining technologies do some of their best work. Standardising and structuring unstructured information is done by means of metadata. During the document’s lifecycle, certain information should be attached to the document. For example, when the document is formalised, it is marked as ‘final’ and a retention period is added. This facilitates the automatic archiving and removal of the document after a certain period of time. For contracts, adding the start and end date as metadata can also facilitate the contract negotiation process, as organisations can quickly identify which contracts need to be renewed without having to sift through a pile of old documents. Ideally, a solution would not ask the document creator to manually add metadata every single time, but would rather generate metadata automatically based on content. Automatic generation is preferable, as manual input can lead to end user frustration and is more prone to error.
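To give a feel for what generating metadata from content can look like, here is a deliberately simple Python sketch. It guesses a document type from keywords and treats the earliest and latest ISO-formatted dates in the text as start and end date. The keyword lists, the date format and that start/end heuristic are assumptions for illustration. Commercial solutions rely on far richer text mining models.

```python
import re
from datetime import date

# Assumption: dates appear in ISO format (YYYY-MM-DD) in the document text.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Hypothetical keyword lists per document type.
TYPE_KEYWORDS = {
    "contract": ("agreement", "hereinafter", "the parties"),
    "policy": ("policy", "applies to all employees"),
    "invoice": ("invoice", "amount due"),
}


def guess_document_type(text):
    """Pick the type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def find_dates(text):
    """Return all valid ISO dates in the text, sorted from earliest to latest."""
    found = []
    for year, month, day in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2019-13-45
    return sorted(found)


def extract_metadata(text):
    """Build a small metadata record from the document content alone."""
    dates = find_dates(text)
    return {
        "document_type": guess_document_type(text),
        # Crude heuristic: earliest date = start, latest date = end.
        "start_date": dates[0] if dates else None,
        "end_date": dates[-1] if len(dates) > 1 else None,
    }
```

With an `end_date` attached to every contract, finding the contracts that need renewal becomes a simple filter on metadata rather than a search through the documents themselves.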

Monitoring

As with any data management initiative, it is essential to implement some kind of monitoring mechanism that ensures that data, once brought to an acceptable quality level, remains at that level. The solution should therefore provide an overview of the information landscape that an information manager can consult to gain insight into the document creation process, the availability of templates and the central management of metadata attributes. Patterns and new document types can be identified through text mining technologies that enable content curation. Content curation refers to the clustering of similar and related documents. It identifies documents that do not fit within existing metadata attributes and hence indicates when a new document type and its relevant retention period need to be defined. Monitoring should also be facilitated on an ‘object’ level, allowing the information manager to identify documents that contain specific or forbidden information. Think of personally identifiable information, for example.
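The two monitoring ideas above can be sketched in a few lines of Python. The first function flags documents that resemble none of the known document types, using plain word counts and cosine similarity as a stand-in for real content curation. The second looks for e-mail addresses as one simple example of personally identifiable information. The threshold, the example profiles and the e-mail pattern are assumptions for this illustration, not a description of how any of the vendors below work.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")
# Simplified e-mail pattern, used here as one example of PII.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def vectorise(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(TOKEN.findall(text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_unfitting(documents, known_types, threshold=0.2):
    """Return names of documents that resemble none of the known types.

    documents:   dict of document name -> text
    known_types: dict of type name -> example text for that type
    """
    profiles = [vectorise(text) for text in known_types.values()]
    unfitting = []
    for name, text in documents.items():
        vector = vectorise(text)
        best = max((cosine(vector, profile) for profile in profiles), default=0.0)
        if best < threshold:
            unfitting.append(name)  # candidate for a new document type
    return unfitting


def find_email_addresses(text):
    """Object-level check: list e-mail addresses found in a document."""
    return EMAIL.findall(text)
```

Documents returned by `find_unfitting` are the ones an information manager would review to decide whether a new document type, with its own retention period, is needed.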

Search

For the end user, this is probably the most essential aspect of any ECM solution. How do I search and, more importantly, find what I am looking for? Here, it is desirable to go far beyond having to remember the title of a document. A solution should scan both document content and metadata, and offer user-friendly filtering options to refine search results: a single access point with all the answers to your questions.
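As a toy illustration of searching across title, content and metadata with filters, consider the Python sketch below. The dictionary layout of a document and the scoring by simple substring counts are assumptions made for this example. Real search engines use indexes and much more refined ranking.

```python
def search(documents, query, **filters):
    """Return titles of matching documents, best match first.

    documents: list of dicts with "title", "content" and "metadata" keys
               (a hypothetical layout chosen for this sketch)
    query:     free-text search terms, separated by spaces
    filters:   metadata values that must match exactly, e.g. status="final"
    """
    terms = query.lower().split()
    results = []
    for doc in documents:
        metadata = doc["metadata"]
        # Apply the metadata filters first to narrow down the candidates.
        if any(metadata.get(key) != value for key, value in filters.items()):
            continue
        # Search title, content and metadata values, not just the title.
        haystack = " ".join(
            [doc["title"], doc["content"], *map(str, metadata.values())]
        ).lower()
        # Score by counting substring occurrences of each search term.
        score = sum(haystack.count(term) for term in terms)
        if score > 0:
            results.append((score, doc["title"]))
    # Highest score first; ties are ordered alphabetically by title.
    results.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in results]
```

A call such as `search(docs, "supplier agreement", status="final")` would then return only formalised documents that mention either term, which is exactly the kind of refinement that metadata makes possible.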

Market solutions

Disclaimer: the list of available market solutions below is not exhaustive. The goal of this blog is not to provide a full list of solutions, but rather to explore various text mining functionalities. The interviews to create this list were carried out in the summer of 2019. Note that all parties have developed their capabilities since. Please contact them individually to find out more about the current features.

Rather than focusing on the top solutions as identified by Gartner, this blog will consider solutions that vary in their offerings. SharePoint, for example, is one of the most popular solutions in the market, as it comes with Office 365 and is therefore often already part of an organisation’s available licences. It covers many of the features discussed above through its Advanced Data Governance capability. The overview considers SharePoint Online (for Business). OpenText is one of the market leaders and can therefore not be left out. The overview below considers their entire product stack: in order to cater to all requested features, a combination of their off-the-shelf offerings must be made. iManage is a lower-cost solution that offers many of the same functionalities as OpenText. INDICA and Index Engines are not traditional ECM solutions, in the sense that they focus on monitoring, metadata and search rather than document creation and storage. They have developed some extremely useful functionalities in this field. These solutions are great if you do not want to change your current architecture (e.g. you would like to keep using your current file shares) but do want to gain control over that landscape. Below is an overview of some of the out-of-the-box features of these solutions, mapped to the focus areas defined above. Note that all of these solutions are continuously being developed and improved upon, and that this overview was created in the summer of 2019.

Capability mapping, Creation - Image by author
Capability mapping, Formalisation - Image by author
Capability mapping, Archiving - Image by author
Capability mapping, Destruction - Image by author
Capability mapping, Monitoring - Image by author
Capability mapping, Metadata - Image by author
Capability mapping, Search - Image by author
Legend for capability mapping - Image by author

This gives an overview of enterprise software solutions that can be leveraged in order to better manage content. Stay tuned for the next blog in this series to find out more about how this works in practice.

