It has been estimated that more than 80% of enterprise data is unstructured (emails, documents, job descriptions, etc.). Yet most decisions are still based purely on structured data.
In this article, I’ll describe how we turned unstructured data into useful information with the help of pre-trained machine learning models and semantic search. The specific use case is the unstructured profile descriptions of our employees and how we can start asking arbitrary questions about our co-workers (e.g. who can I talk to about puppies?).
Choosing the right technology
We started the task of representing unstructured data by researching knowledge graph technologies.
A knowledge graph is a way of representing data by its meaning and context in a graph-structured data model (image below). The idea is attractive because you can search the knowledge graph by the relationships between objects instead of by specific keywords.

While the idea is conceptually and visually attractive, we realized that it brings only minimal value compared to traditional databases because you still need to define all the relationships (e.g. {Mart _has visited_ Louvre} and {Louvre _is in_ France}) yourself.
We didn’t use knowledge graphs because you still need to structure the data yourself.
Next, we turned our attention to pre-trained language representation models. These are machine learning models (e.g. BERT) that have been trained on millions of text documents to learn representations of words and the connections between them in various contexts.
Instead of defining the relationships between people and objects directly (the knowledge graph approach), we can represent each object as a list of numbers. These numbers can’t be interpreted directly, but they carry a latent (hidden) meaning (e.g. the numbers for cats and dogs are closer to each other than the numbers for cats and elephants).
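To make the "cats and dogs are closer than cats and elephants" idea concrete, here is a minimal sketch. The three-number vectors are made up by hand purely for illustration (a real model would produce learned values), and cosine similarity is just one common way to compare such vectors, not necessarily the measure our application uses.

```python
import math

# Hand-made 3-number vectors for illustration only; a real pre-trained
# model would produce hundreds of learned values per word or text.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "elephant": [0.1, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_elephant = cosine_similarity(toy_vectors["cat"], toy_vectors["elephant"])

# A higher value means the two vectors point in a more similar direction.
print(f"cat vs dog:      {cat_dog:.2f}")
print(f"cat vs elephant: {cat_elephant:.2f}")
```

With these toy values, the cat and dog vectors score as more similar than the cat and elephant vectors, which is the kind of hidden structure a trained model is expected to capture on its own.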
The general workflow goes like this:
- First, we gather the unstructured descriptions (texts) of our employees.
- Second, we pass these descriptions to a pre-trained language model.
- Third, we receive a list of numbers for each person that represents that person’s description numerically.
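The three steps above can be sketched in code as follows. Note the loud assumption: `encode` here is a hypothetical placeholder that simply hashes words into buckets, so it carries no meaning at all. It only mimics the interface of a pre-trained model (text in, fixed-length vector out); in a real system this function would wrap the model of your choice. The employee names and descriptions are invented.

```python
import hashlib

VECTOR_SIZE = 300  # same vector length as in our example

def encode(text, size=VECTOR_SIZE):
    """Placeholder encoder: counts words hashed into `size` buckets.

    This is NOT a pre-trained language model and captures no semantics;
    it only imitates the shape of the output (one fixed-length vector).
    """
    vector = [0.0] * size
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % size] += 1.0
    return vector

# Step 1: gather the unstructured descriptions (made-up examples).
descriptions = {
    "employee_a": "I enjoy hiking and investing in shares",
    "employee_b": "I have played guitar in bands for years",
}

# Steps 2 and 3: pass each description to the encoder and keep the
# resulting list of numbers for each person.
employee_vectors = {name: encode(text) for name, text in descriptions.items()}

for name, vector in employee_vectors.items():
    print(name, "->", len(vector), "numbers")
```

Whatever the length of the input text, every person ends up with a vector of the same size, which is what makes the later comparisons possible.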

At this point, it might seem that we haven’t achieved much, because a list of cryptic numbers for each person doesn’t feel much better than unstructured text. But machines love numbers, and since this vector of numbers always has the same length (300 elements in our example), we can easily start making calculations. Think of each vector as a point in space. If we had only 3 numbers for each person, we could plot them on a 3-dimensional chart (image below). The closer the points are to each other, the more similar they are. We can’t draw a chart with 300 dimensions, but the logic remains the same.
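As a small illustration of "the logic remains the same", the sketch below measures the straight-line (Euclidean) distance between points. The coordinates are invented, and Euclidean distance is an assumption chosen for simplicity; other measures, such as cosine distance, are also commonly used with this kind of vector.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three made-up people, each described by only 3 numbers.
person_1 = [1.0, 2.0, 2.0]
person_2 = [1.0, 2.0, 3.0]
person_3 = [4.0, 6.0, 2.0]

print(euclidean_distance(person_1, person_2))  # 1.0 (close, so similar)
print(euclidean_distance(person_1, person_3))  # 5.0 (far, so less similar)

# The same function works unchanged for 300 numbers per person.
point_a = [0.0] * 300
point_b = [1.0] * 300
print(round(euclidean_distance(point_a, point_b), 2))
```

Nothing in the function depends on the number of dimensions, which is why we can reason about 3-dimensional charts and still compute with 300-element vectors.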

Semantic search
Semantic search means the ability to search content by the meaning of the search phrase. So, when I ask the question "who has visited France?", I should also be able to find everyone who has written about their visit to the Eiffel Tower.
At this point, we have our employees encoded (by the pre-trained model), and the semantic search can be carried out as follows:
- Ask a question.
- Pass the question through the pre-trained language representation model.
- Calculate the distance between the question and the employee descriptions.
- Return the employee who has the shortest distance to the question representation.
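A minimal sketch of these four steps is shown below. Since no model is loaded here, both the employee vectors and the question vector are hand-made stand-ins for what the pre-trained model would return; the `search` helper, its `top_k` parameter, and the use of cosine distance are illustrative assumptions rather than a description of our actual application.

```python
import math

# Hypothetical, hand-made vectors standing in for the model's output.
employee_vectors = {
    "employee_a": [0.9, 0.1, 0.0],  # imagine: writes about investing
    "employee_b": [0.1, 0.9, 0.1],  # imagine: writes about music
    "employee_c": [0.0, 0.2, 0.9],  # imagine: writes about travel
}

def cosine_distance(a, b):
    """Return 1 minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def search(question_vector, vectors, top_k=2):
    """Return the names of the top_k employees closest to the question."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_distance(question_vector, vectors[name]),
    )
    return ranked[:top_k]

# Steps 1 and 2: the question would be passed through the same model as
# the descriptions. This vector is an assumed result for "who knows crypto?".
question_vector = [0.8, 0.2, 0.1]

# Steps 3 and 4: compute the distances and return the closest employees.
print(search(question_vector, employee_vectors))
```

Returning the top few matches instead of a single employee is a design choice that makes it easier to eyeball whether the ranking is sensible.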

Demonstration
As a final step, let me demonstrate two examples from our co-worker search application. The real names and the revealing job positions are hidden for privacy reasons.
In the first example, I try to find co-workers who know about cryptocurrencies with the question "who knows crypto?". The first search result is a co-worker who is interested in investing and mentions cryptocurrencies. The second result also talks about investing but doesn’t mention cryptocurrencies. This is exactly what we wanted to achieve: the system understands that our question is closely related to concepts like investing.
Semantic search with the help of pre-trained language models makes it possible to query by meaning, not by specific keywords!

Figure 5. Example of semantic search – the engine understands that "crypto" and investing in shares are similar concepts (Snapshot from our application).
The example below is about finding co-workers to start a band with. Notice that I mention two concepts: 1) having a garage and 2) wanting to start a band. The system is smart enough to understand that the focus here is on band-making. The first person seems to have long experience of playing in bands, and the second person is a music producer (nice 🙂 ). Also, the second result doesn’t mention the word "band", but the system understands that music production and band-making are closely related.

Figure 6. Example of semantic search – the engine understands that 1) the focus is on band-making, not on having a garage, and 2) music production is closely related to band-making.
Next steps
Semantic search of co-workers is just one example of how unstructured data can be used for arbitrary information queries. Other common tasks include sentiment analysis, extractive question answering about a text, table question answering, named entity recognition, and automated text summarization. To learn more, take a look at this article (with descriptions and code examples).