Author: Janna Lipenkova

In recent years, the tech world has seen a surge of Natural Language Processing (NLP) applications in various areas, including adtech, publishing, customer service and market intelligence. According to Gartner’s hype cycle, NLP reached the peak of inflated expectations in 2018. Many businesses see it as a "go-to" solution to generate value from the 80% of business-relevant data that comes in unstructured form. To put it simply – NLP is widely adopted with wildly variable success.
In this article, I share some practical advice for the smooth integration of NLP into your tech stack. The advice summarizes the experience I have accumulated on my journey with NLP – through academia, a number of industry projects, and my own company, which develops NLP-driven applications for international market intelligence. The article does not provide technical details but focuses on organizational factors including hiring, communication and expectation management.
Before starting out on NLP, you should meditate on two questions:
1. Is a unique NLP component critical for the core business of our company?
Example: Imagine you are a hosting company. You want to optimize your customer service by analyzing incoming customer requests with NLP. Most likely, this enhancement will not be part of your critical path activities. By contrast, a business in targeted advertising should make sure it does not fall behind on NLP – this could significantly weaken its competitive position.
2. Do we have the internal competence to develop IP-relevant NLP technology?
Example: You hired and successfully integrated a PhD in Computational Linguistics with the freedom to design new solutions. She will likely be motivated to enrich the IP portfolio of your company. However, if you are hiring mid-level data scientists without a clear focus on language who need to split their time between data science and engineering tasks, don’t expect a unique IP contribution. Most likely, they will fall back on ready-made algorithms because they lack the time and the mastery of the underlying details.
Hint 1: if your answers are "yes" and "no" – you are in trouble! You’d better identify technological differentiators that do match your core competence.
Hint 2: if your answers are "yes" and "yes" – stop reading and get to work. Your NLP roadmap should already be defined by your specialists to achieve the business-specific objectives.
If you are still here, don’t worry – the rest will soon fall into place. There are three levels at which you can "do NLP":
- Black belt level, reaching deep into mathematical and linguistic subtleties
- Training & tuning level, mostly plugging in existing NLP/ML libraries
- Blackbox level, relying on "buying" third-party NLP
The black belt level
Let’s elaborate: the first, fundamental level is our "black belt". This level comes close to computational linguistics, the academic counterpart of NLP. The folks here often split into two camps – the mathematicians and the linguists. The camps might well befriend each other, but the mindsets and the way of doing things will still differ.
The math guys are not afraid of things like matrix calculus and will thrive on the details of the newest optimization and evaluation methods. At the risk of leaving out linguistic details, they will generally take the lead on improving the recall of your algorithms. The linguists were raised either on highly complex generative or constraint-based grammar formalisms, or on alternative frameworks such as cognitive grammar. These give more room to the imagination but also allow for formal vagueness. They will gravitate towards writing syntactic and semantic rules and compiling lexica, often needing their own sandbox and taking care of the precision part. Depending on how you handle communication and integration between the two camps, their collaboration can either block productivity or open up exciting opportunities.
In general, if you can inject a dose of pragmatism into the academic perfectionism, you can create a unique competitive advantage. If you can efficiently combine mathematicians and linguists on your team – even better! But be aware that you have to sell them on an honest vision – and then follow through. Doing hard fundamental work without seeing its impact on the business would be a frustrating and demotivating experience for your team.
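To make the precision/recall split between the two camps a bit more tangible, here is a minimal Python sketch of how both numbers are computed for a single label. The function name and the toy labels are made up for illustration; they are not taken from any particular library or project.

```python
def precision_recall(gold, predicted, positive):
    """Return (precision, recall) for one label of interest.

    gold and predicted are equally long lists of labels;
    positive is the label we care about (hypothetical example: "ORG").
    """
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)

    # Precision: how many of the reported hits are correct?
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: how many of the existing hits were found?
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy data (invented): did the system tag company names correctly?
gold = ["ORG", "O", "ORG", "ORG", "O"]
predicted = ["ORG", "ORG", "O", "ORG", "O"]
precision, recall = precision_recall(gold, predicted, positive="ORG")
```

In the terms used above, the linguists' rules and lexica would mainly aim at the first number, the mathematicians' models at the second.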
The training & tuning level
The second level involves the training and tuning of models using existing algorithms. In practice, most of the time will be spent on data preparation, training data creation and feature engineering. The core tasks – training and tuning – do not require much effort. At this level, your people will be data scientists pushing the boundaries of open-source packages for NLP and/or Machine Learning, such as NLTK, scikit-learn, spaCy and TensorFlow. They will invent new and not always academically justified ways of extending training data, engineering features and applying their intuition for surface-side tweaking. The goal is to train models for well-understood tasks such as named entity recognition (NER), categorization and sentiment analysis, customized to the specific data at your company.
The good thing here is that there are plenty of great open-source packages out there. Most of them will still leave you with enough flexibility to optimize them for your specific use case. The risk is on the side of HR – many roads lead to data science. Data scientists are often self-taught and have a rather interdisciplinary background. Thus, they will not always have the innate academic rigor of level 1 scientists. As deadlines or budgets tighten, your team might start cutting corners on training and evaluation methods, thus accumulating significant technical debt.
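One inexpensive habit that helps against this kind of debt is a reproducible held-out test set that is never used for training or tuning. The sketch below uses only the Python standard library; the function names, the toy reviews and the keyword "model" are placeholders invented for illustration, not a recommended classifier.

```python
import random


def split_train_test(examples, test_ratio=0.2, seed=42):
    """Shuffle with a fixed seed so the held-out set is the same on every run."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_set):
    """Share of held-out (text, label) pairs the model labels correctly."""
    if not test_set:
        return 0.0
    correct = sum(1 for text, label in test_set if model(text) == label)
    return correct / len(test_set)


def keyword_baseline(text):
    # Placeholder "model": a real project would plug in a trained classifier here.
    return "pos" if "good" in text.lower() else "neg"


# Invented toy data, only to make the sketch runnable.
examples = [
    ("Good service", "pos"), ("Really good support", "pos"),
    ("Server was down again", "neg"), ("Slow response", "neg"),
    ("Good uptime", "pos"), ("Billing is confusing", "neg"),
    ("Not good at all", "neg"), ("Good value", "pos"),
    ("Terrible onboarding", "neg"), ("Pretty good overall", "pos"),
]
train, test = split_train_test(examples)
score = accuracy(keyword_baseline, test)
```

Keeping the seed fixed means that two team members, or the same person before and after a deadline, report numbers on the same held-out examples.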
The blackbox level
The third level is a "blackbox" where you buy NLP. Your developers will mostly consume paid APIs that provide the standard algorithm outputs out-of-the-box, such as Rosette, Semantria and Bitext (cf. this post for an extensive review of existing APIs). Ideally, your data scientists will be working alongside business analysts or subject matter experts. For example, if you are doing competitive intelligence, your business analysts will be the ones to design a model which contains your competitors, their technologies and products.
At the blackbox level, make sure you buy NLP only from black belts! With this secured, one of the obvious advantages of outsourcing NLP is that you avoid the risk of diluting your technological focus. The risk is a lack of flexibility – with time, your requirements will get more and more specific. The better your integration policy, the higher the risk that your API will stop satisfying your requirements. It is also advisable to invest in manual quality assurance to make sure the API outputs are of high quality.
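Manual quality assurance does not have to be elaborate. A minimal sketch, assuming you store the API outputs as a list of records and let a reviewer label a random sample by hand, could look like this in Python. The record layout and function names are hypothetical and not tied to any of the vendors mentioned above.

```python
import random


def sample_for_review(api_outputs, sample_size=50, seed=0):
    """Draw a reproducible random sample of API outputs for manual checking."""
    outputs = list(api_outputs)
    k = min(sample_size, len(outputs))
    return random.Random(seed).sample(outputs, k)


def agreement_rate(reviewed):
    """Share of reviewed items where the API label matches the human label.

    reviewed is a list of (api_label, human_label) pairs.
    """
    if not reviewed:
        return 0.0
    return sum(1 for api, human in reviewed if api == human) / len(reviewed)


# Invented example: four reviewed items, one disagreement.
reviewed = [("pos", "pos"), ("neg", "pos"), ("neg", "neg"), ("pos", "pos")]
rate = agreement_rate(reviewed)
```

Tracking this rate over time gives an early signal when the API starts to drift away from your increasingly specific requirements.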
Final Thoughts
So, where do you start? Of course, it depends, but here is some practical advice:
- Talk to your tech folks about your business objectives. Let them research and prototype and start out on level 2 or 3.
- Make sure your team doesn’t get stuck in low-level details of level 1 too early. This might lead to significant slips in time and budget since a huge amount of knowledge and training is required.
- Don’t hesitate – you can always consider a transition between levels 2 and 3 further down the path (by the way, this works in either direction). The transition can be efficiently combined with the generally unavoidable refactoring of your system.
- If you manage to build up a compelling business case with NLP – welcome to the club, you can use it to attract first-class specialists and add to your uniqueness by working on level 1!
About the author: Janna Lipenkova holds a PhD in Computational Linguistics and is the CEO of Anacode, a provider of tech-based solutions for international market intelligence. Find out more about our solution here.
Originally published at anacode.de on November 15, 2018.