Building client routing / semantic search in the wild

Alexander Veysov
Towards Data Science
8 min read · Nov 10, 2018


Main objective — help our clients find what they want, ideally even before they type the whole query. Search should also generalize reasonably well (synonyms / forms of known words; Russian has rich morphology).

TLDR

This is an executive summary of what we managed to do in approximately 2 months in the Profi.ru DS department (Profi.ru is one of the leading online service marketplaces in the CIS region) within the Semantic Factory project.

This article mostly focuses on comparing novel NLP techniques applicable in an applied business setting and may serve as a guide if you plan to do something similar in your company/project.

In a nutshell — we had 2 objectives:

  • Make search / routing within the main search bar on Profi.ru better (supervised task);
  • In doing so — develop unsupervised methods to efficiently search for new services in third-party data (unsupervised task);

We considered many architectures, but in the end a single bi-LSTM (or bi-GRU) with an embedding bag layer instead of plain embeddings (essentially fast-text meets RNN) turned out to be the best for both the supervised and the unsupervised task (see charts below). Remarkably, making the model robust to partial input / input with errors increased its performance. Adding a second objective (fake “service” detection) also slightly boosted accuracy. More sophisticated models like the transformer showed no real benefits in our domain.

Available data

We used internal data from several main sources:

  • Actual client search (30k unique queries per 3–6 months) + manual annotation (which produced our main dev-set);
  • Internal synonym database (30k unique synonyms);
  • Internal marketing SEO databases with annotation (300k);

No one in the company had built ML models before us, so we had to improvise when obtaining the data and managing its annotation. It should also be noted that even though the total number of queries is significant, the unique query count turned out to be very low, which forced us to be creative in obtaining / augmenting data. The quality of the datasets and annotation at hand also varied quite drastically. The best supervised datasets were, as expected, the manually annotated dev-set and the synonym database. We also tried exploring conversion pipelines (i.e. query — session — order) and distilling previous classifier results, which performed poorly, as expected.

As for external data, we used the following sources:

  • Search engine data (data provided by SEO panels of Google and Yandex as well as their statistical services) (11m queries);
  • External SEO services (3m queries);
  • Our competitors’ public data (1.5m queries);

Achieved goals

Business goals:

  1. 88+% (vs 60% with Elasticsearch) accuracy on client routing / intent classification (~5k classes);
  2. Search is agnostic to input quality (misprints / partial input);
  3. The classifier generalizes; the morphological structure of the language is exploited;
  4. To be on the safe side — at least 1,000 new services were found + at least 15,000 synonyms (vs. the current state of 5,000 + ~30,000). I expect this figure to double or even triple;

“Scientific” goals:

  1. We thoroughly compared many modern sentence embedding techniques using a downstream classification task + KNN with a database of service synonyms;
  2. We managed to beat weakly supervised (essentially their classifier is a bag-of-ngrams) Elasticsearch on this benchmark (see details below) using UNSUPERVISED methods;
  3. We developed a novel way of building applied NLP models (a plain vanilla bi-LSTM + bag of embeddings, essentially fast-text meets RNN) — this takes the morphology of the Russian language into consideration and generalizes well;
  4. We demonstrated that our final embedding technique (a bottle-neck layer from the best classifier) combined with state-of-the-art unsupervised algorithms (UMAP + HDBSCAN) can produce stellar clusters on external data;
  5. We demonstrated in practice the possibility, feasibility and usability of: (i) knowledge distillation (ii) augmentations for text data (sic!);
  6. Training text-based classifiers with dynamic augmentations reduced convergence time drastically (10x) compared to generating larger static datasets (i.e. the network learns to generalize from errors while being shown drastically fewer augmented sentences);

What works in NLP now?

A bird’s-eye view of the NLP landscape:

Also, you may know that NLP may be experiencing its ImageNet moment now.

Large scale UMAP hack

When building clusters, we stumbled upon a way/hack to apply UMAP to datasets with 100m+ (or maybe even 1bn) points. Essentially, build a KNN graph with FAISS and then rewrite the main UMAP loop in PyTorch to run on your GPU. We did not need that in the end and abandoned the concept (we had only 10–15m points after all), but please follow this thread for details.
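
For reference, here is a minimal sketch of the first half of that hack (the KNN graph); the GPU rewrite of the main UMAP loop is not shown, and the function and parameter names below are ours. At 100m+ points you would also swap the exact flat index for an IVF/HNSW index and/or its GPU version.

```python
import numpy as np
import faiss  # faiss-cpu or faiss-gpu

def knn_graph(embeddings: np.ndarray, k: int = 15):
    """Build a KNN graph (cosine distance) that UMAP-like methods can consume.

    embeddings: contiguous float32 array of shape (n_points, dim).
    Returns (distances, indices), each of shape (n_points, k).
    """
    _, dim = embeddings.shape
    faiss.normalize_L2(embeddings)     # in-place L2 norm, so inner product == cosine
    index = faiss.IndexFlatIP(dim)     # exact search; use IVF/HNSW at 100m+ scale
    index.add(embeddings)
    sims, idx = index.search(embeddings, k + 1)  # k + 1: each point is its own neighbour
    return 1.0 - sims[:, 1:], idx[:, 1:]         # drop self-matches, similarity -> distance
```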

What works best

  • For supervised classification — fast-text meets RNN (bi-LSTM) + a carefully chosen set of n-grams;
  • Implementation — plain Python for n-grams + a PyTorch EmbeddingBag layer (see the sketch after this list);
  • For clustering — the bottleneck layer of this model + UMAP + HDBSCAN;
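
A minimal PyTorch sketch of this architecture, assuming pre-computed n-gram ids per word; the layer sizes, mean pooling and tanh nonlinearity below are illustrative choices rather than our production configuration:

```python
import torch
import torch.nn as nn

class NGramBagRNN(nn.Module):
    """Each word = mean of its char n-gram embeddings (fast-text style); the word
    vectors go through a bi-LSTM, the pooled state through a bottleneck layer
    (reused later as the sentence embedding) and a classification head."""

    def __init__(self, n_ngrams=1_000_000, emb_dim=200, hidden=256,
                 bottleneck=250, n_classes=5_000, pad_idx=0):
        super().__init__()
        # padding_idx makes padded n-gram slots contribute nothing to the mean
        self.ngram_bag = nn.EmbeddingBag(n_ngrams, emb_dim,
                                         mode="mean", padding_idx=pad_idx)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, bottleneck)
        self.head = nn.Linear(bottleneck, n_classes)

    def forward(self, ngram_ids):
        # ngram_ids: (batch, words, ngrams_per_word), padded with pad_idx
        b, w, g = ngram_ids.shape
        word_vecs = self.ngram_bag(ngram_ids.view(b * w, g)).view(b, w, -1)
        rnn_out, _ = self.rnn(word_vecs)                # (batch, words, 2 * hidden)
        pooled = rnn_out.mean(dim=1)                    # simple mean pooling over words
        sent_emb = torch.tanh(self.bottleneck(pooled))  # reusable sentence embedding
        return self.head(sent_emb), sent_emb
```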

Best classifier benchmarks

Manually annotated dev-set

Performance of the best model (n-gram bag + biLSTM) on manually annotated dev-set

Manually annotated dev-set + 1–3 typos per query

Performance of the best model (n-gram bag + biLSTM) on manually annotated dev-set, where we added 1 to 3 typos to each query

Manually annotated dev-set + partial input

Performance of the best model (n-gram bag + biLSTM) on manually annotated dev-set, where we randomly cut the input sequence to simulate partial input

Large scale corpora / n-gram selection

We collected the largest corpora for the Russian language:

  • We collected a 100m word dictionary using a 1TB crawl;
  • Also, use this hack to download such files faster (overnight);
  • We selected an optimal set of 1m n-grams for our classifier to generalize best (500k most popular n-grams from fast-text trained on Russian Wikipedia + 500k most popular n-grams on our domain data);
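
A rough sketch of how such a selection can be made by simple frequency counting (in our pipeline the Wikipedia half came from a pre-trained fast-text model, so this is an approximation; the iterator names are hypothetical):

```python
from collections import Counter
from typing import Iterable, Set

def char_ngrams(word: str, n_min: int = 3, n_max: int = 7):
    """fast-text style char n-grams, with boundary markers around the word."""
    w = f"<{word}>"
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            yield w[i:i + n]

def top_ngrams(words: Iterable[str], k: int = 500_000) -> Set[str]:
    """Keep the k most frequent char n-grams seen in a stream of words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word))
    return {gram for gram, _ in counts.most_common(k)}

# hypothetical word iterators over the two corpora:
# ngram_vocab = top_ngrams(wikipedia_words) | top_ngrams(domain_words)
```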

Stress test of our 1M n-grams on 100M vocabulary:

Share of n-grams (3–7 grams) in a 100M word vocabulary covered by our n-gram selection

Text augmentations

In a nutshell:

  • Take a large dictionary with errors (e.g. 10–100m unique words);
  • Generate an error (drop a letter, swap a letter using calculated probabilities, insert a random letter, maybe use the keyboard layout, etc.);
  • Check that the new word is in the dictionary;
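
A minimal sketch of this augmentation loop; the neighbour map, probabilities and fallback alphabet below are illustrative stubs rather than our calculated values:

```python
import random

# stub: in practice, fill with per-layout neighbours and calculated error probabilities
KEYBOARD_NEIGHBOURS = {"а": "вп", "о": "лр"}
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def corrupt(word: str, vocab: set, p_drop: float = 0.3, p_swap: float = 0.4,
            max_tries: int = 10) -> str:
    """Generate a plausible typo for `word`, keeping it only if it is attested
    in a large noisy vocabulary; otherwise fall back to the clean word."""
    for _ in range(max_tries):
        w = list(word)
        r, i = random.random(), random.randrange(len(w))
        if r < p_drop and len(w) > 1:
            del w[i]                                   # drop a letter
        elif r < p_drop + p_swap:                      # swap a letter
            w[i] = random.choice(KEYBOARD_NEIGHBOURS.get(w[i], ALPHABET))
        else:                                          # remaining mass: insert a letter
            w.insert(i, random.choice(ALPHABET))
        candidate = "".join(w)
        if candidate != word and candidate in vocab:   # keep attested "errors" only
            return candidate
    return word
```

Applied on the fly inside the training Dataset (rather than pre-generating a large static augmented corpus), this is what gave the roughly 10x faster convergence mentioned above.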

We brute-forced a lot of queries to services like this one (in an attempt to essentially reverse-engineer their dataset), and they have a very small dictionary inside (also this service is powered by a tree classifier with n-gram features). It was kind of funny to see that they covered only 30–50% of the words we had on some corpora.

Our approach is far superior if you have access to a large domain vocabulary.

Best unsupervised / semi-supervised results

KNN is used as a benchmark to compare different embedding methods. This is more of an illustration, but obviously we used these methods for clustering / searching for new services.

List of models tested (embedding vector size in parentheses):

  • (512) Large scale fake sentence detector trained on 200 GB of common crawl data;
  • (300) Fake sentence detector trained to tell a random sentence from Wikipedia from a service;
  • (300) Fast-text obtained from here, pre-trained on araneum corpus;
  • (200) Fast-text trained on our domain data;
  • (300) Fast-text trained on 200GB of Common Crawl data;
  • (300) A Siamese network with triplet loss trained with services / synonyms / random sentences from Wikipedia;
  • (200) First iteration of embedding bag RNN’s embedding layer, a sentence is encoded as a whole bag of embeddings;
  • (200) Same, but first the sentence is split into words, then each word is embedded, then plain average is taken;
  • (300) The same as the whole-bag variant above, but for the final model;
  • (300) The same as the word-average variant above, but for the final model;
  • (250) Bottleneck layer of the final model (250 neurons);
  • Weakly supervised Elasticsearch baseline;

To avoid leaks, all the random sentences were randomly sampled. Their length in words was the same as the length of services / synonyms they were compared with. Also, measures were taken to make sure that models did not just learn by separating vocabularies (embeddings were frozen, Wikipedia was under-sampled to make sure that there was at least one domain word in each Wikipedia sentence).
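
To make the benchmark concrete, here is a minimal sketch of the evaluation loop; the `encode` callable stands in for whichever embedding method is being compared, and all names are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_accuracy(encode, synonyms, synonym_labels, queries, query_labels) -> float:
    """Embed the synonym database with a candidate encoder, retrieve the nearest
    synonym for every dev query and check that its service label matches."""
    index = NearestNeighbors(n_neighbors=1, metric="cosine")
    index.fit(encode(synonyms))                   # (n_synonyms, dim) array
    _, idx = index.kneighbors(encode(queries))    # (n_queries, 1) neighbour indices
    predicted = np.asarray(synonym_labels)[idx[:, 0]]
    return float(np.mean(predicted == np.asarray(query_labels)))
```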

Cluster visualization

An example of how the best embedding technique + UMAP work on one of the external corpora.

3D

2D
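
The pipeline behind these plots is simple to sketch; the parameters below are illustrative defaults, not the values we actually tuned:

```python
import hdbscan  # pip install hdbscan
import umap     # pip install umap-learn

def cluster_embeddings(embeddings, n_neighbors=15, n_components=5, min_cluster_size=30):
    """Reduce bottleneck-layer sentence embeddings with UMAP, then cluster the
    reduced points with HDBSCAN; points labelled -1 are treated as noise."""
    reduced = umap.UMAP(n_neighbors=n_neighbors,
                        n_components=n_components,
                        metric="cosine").fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return reduced, labels
```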

Cluster exploration “interface”

Green — new word / synonym. Gray background — likely new word.

Gray text — existing synonym.

Ablation tests: what works, what we tried and what we did not

  1. See the above charts;
  2. Plain average/tf-idf average of fast-text embeddings — a VERY formidable baseline;
  3. Fast-text > Word2Vec for Russian;
  4. Sentence embedding by fake sentence detection kind of works, but pales in comparison with other methods;
  5. BPE (sentencepiece) showed no improvement on our domain;
  6. Char-level models struggled to generalize, despite the recent paper from Google;
  7. We tried a multi-head transformer (with classifier and language modelling heads), but on the annotation available at hand it performed roughly the same as plain vanilla LSTM-based models. When we migrated to embedding bag approaches we abandoned this line of research due to the transformer’s lower practicality and the impracticality of having an LM head along with an embedding bag layer;
  8. BERT — seems to be overkill, also some people claim that transformers train literally for weeks;
  9. ELMO — using a library like AllenNLP seems counter-productive in my opinion both in research/production and education environments for reasons I will not provide here;

Deploy

Done using:

  • Docker container with a simple web-service;
  • CPU-only for inference is enough;
  • ~2.5 ms per query on CPU, batching not really necessary;
  • ~1GB RAM memory footprint;
  • Almost no dependencies, apart from PyTorch, numpy and pandas (and the web server, of course).
  • We mimic fast-text n-gram generation like this (see the sketch after this list);
  • The embedding bag layer + n-gram indexes are just stored in a dictionary;
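
For illustration, this is roughly what the inference-time n-gram lookup can look like with a plain dict as the n-gram vocabulary (fast-text proper hashes its n-grams; the names and padding scheme here are assumptions):

```python
def word_to_ngram_ids(word, ngram_to_id, n_min=3, n_max=7, pad_id=0, max_ngrams=30):
    """Map a word to the ids of its char n-grams; unknown n-grams are skipped and
    the result is padded to a fixed length for the EmbeddingBag layer."""
    w = f"<{word.lower()}>"
    grams = (w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1))
    ids = [ngram_to_id[g] for g in grams if g in ngram_to_id][:max_ngrams]
    return ids + [pad_id] * (max_ngrams - len(ids))
```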

Conclusion

We have shown that you can apply relatively simple / off-the-shelf tools for solving both supervised and unsupervised challenges within a semantic search framework in a real business.

Further reading

  1. A more detailed presentation in Russian;
  2. Parsing Common Crawl and Wikipedia;
  3. How to mimic fast-text bag of embeddings split in one function;
  4. More pre-processed common crawl;
  5. Fast-text word vectors pre-trained on Wikipedia;
  6. Fast-text and Word2Vec models for the Russian language;
  7. Seminal word embeddings papers: Word2Vec, Fast-Text, further tuning;
  8. Some of the current SOTA CNN-based approaches: InferSent / Generative pre-training of CNNs / ULMFiT / Deep contextualized word representations (ELMo);
  9. Imagenet moment in NLP?
  10. Sentence embedding baselines 1, 2, 3, 4;
  11. Fake Sentence Detection as a Training Task for Sentence Encoding;

Originally published at spark-in.me on November 2, 2018.
