Building a Backend System for Artificial Intelligence

Challenges of building a scalable "artificial intelligence" system

Ondrej Stastny
Towards Data Science


Let's explore the challenges involved in building a backend system to store and retrieve high-dimensional data vectors, typical of modern systems that use "artificial intelligence": image recognition, text comprehension, document search, music recommendations, and so on.

In my last article, I mentioned a system that I wrote to replace a 3rd party API. The new system is composed of three components:
1. Domain logic service ("artificial intelligence")
2. Storage & retrieval service
3. Backing data-store (Redis)

The Domain logic service processes input data and produces a 1024-dimensional normalized vector that represents the data. This vector is then passed on to the Storage & retrieval service to be persisted.
At retrieval time, the Domain logic service produces another vector from the input using the same processing logic, and the Storage & retrieval service is tasked with producing the set of already persisted vectors that are most similar to the given input.

The reason the system is decomposed this way is that storage and retrieval is a challenging problem in itself. To isolate the problem, I placed it in a self-contained service, and the following text is intended to explain why.

Vector Similarity

Because we work with vectors, let’s define “similarity” of two vectors as follows:
Vector a is similar to input vector b if
a) a is identical to b, or
b) the distance between vectors a and b is smaller than the distance between vector b and any other vector in the dataset, and this distance is lower than a given threshold.

There are different methods for calculating distance between vectors:
1. Euclidean distance
2. Cosine distance
3. Manhattan (taxicab) distance

Visual representation of euclidean distance (d) and cosine similarity (θ)

This is a visual representation of Euclidean distance (d) and cosine similarity (θ). While cosine looks at the angle between vectors (thus not taking their weight or magnitude into account), Euclidean distance is similar to using a ruler to actually measure the distance.

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter, which is exactly what we need here: our vectors arrive at the Storage & retrieval service already normalized (all vectors are unit vectors with length 1).
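For unit vectors, cosine similarity reduces to a plain dot product, and cosine distance is simply 1 minus that value. A minimal illustrative sketch in Python/NumPy (the names are mine, not the production code):

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Both vectors are already unit length, so cosine similarity
    # is just their dot product; cosine distance is 1 minus that.
    return 1.0 - float(np.dot(a, b))

# Two random 1024-dimensional unit vectors for illustration
a = np.random.randn(1024)
a = a / np.linalg.norm(a)
b = np.random.randn(1024)
b = b / np.linalg.norm(b)

print(cosine_distance(a, b))   # close to 1.0 for unrelated random vectors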

Now that we established what it means to find “similar” vectors, let’s look at the real challenge — retrieving the set of similar vectors.

The Challenge

A brute-force approach to finding similar vectors would be to take the input and calculate its cosine similarity to every other vector in the data set. Comparing a vector to "all other vectors" does not scale well (it is linear, O(n)), and considering the dimensionality of the vectors, you can see this becomes very inefficient as the data set grows. Not very practical if you need a realtime system that processes a high number of these queries per second…
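To make that baseline concrete, here is an illustrative sketch of the linear scan (Python/NumPy; the threshold value is the one used throughout this article, everything else is just example naming):

import numpy as np

THRESHOLD = 0.36  # cosine distance cut-off used throughout the article

def brute_force_search(query: np.ndarray, database: np.ndarray):
    # Linear O(n) scan: compare the query against every stored vector.
    # `database` is an (n, 1024) matrix of unit vectors, `query` a unit vector.
    distances = 1.0 - database @ query          # cosine distance to every row
    best = int(np.argmin(distances))            # index of the closest vector
    if distances[best] < THRESHOLD:
        return best, float(distances[best])     # best match below threshold
    return None                                 # nothing similar enough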

Locality-Sensitive Hashing

Instead, we are going to use Locality-Sensitive Hashing (LSH).
In computer science, LSH is an algorithmic technique that hashes similar input items into the same "buckets" with high probability (the number of buckets is much smaller than the universe of possible input items).

LSH primarily differs from conventional (e.g. cryptographic) hashing in that cryptographic hashing tries to avoid collisions, while LSH aims to maximize collisions for similar inputs.

With LSH, vectors that are close to each other ("similar") have the same hash value, and vectors that are far apart ("dissimilar") have different hash values.
This offers a potential solution: if we hash our vectors using LSH, we can reduce the number of vector comparisons required to retrieve a match.
We can limit our search to a bucket (or a set of buckets) and dramatically reduce the number of vectors for which we calculate cosine distance.
But the number of vectors to search can still be significant even within a bucket, if we consider datasets with millions of vectors. More on that later.

However, this comes at a cost. The keyword to focus on in the previous LSH definition is "probability". LSH is not an exact algorithm; it is an approximation. There is no guarantee that similar vectors in the dataset will end up in the same bucket. How "probable" it is for similar vectors to end up with the same hash depends on the parameters we provide to the algorithm.

An important thing to note here is that for my use case, the similarity cut-off (threshold) is a cosine distance of 0.36, which is quite wide. That makes the use of LSH more difficult, because you need to consider a much wider set of vectors, and LSH works best on vectors that are very close together.

Luckily, we have one advantage working in our favor: we don't need the single most similar vector. We can approximate, and we can terminate the search within a bucket as soon as we find the first vector with a distance smaller than our threshold. Take facial recognition as an example. There, each vector in the system represents a face, but every person has multiple faces (multiple photos of his or her face) associated with them. Say we are searching for a person using another face as the input. We don't need to retrieve the single closest vector, i.e. the most similar face; we just need a face that is "close" enough (below the specified threshold). By definition, that threshold should guarantee that every such vector below the threshold represents the same person.
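An illustrative sketch of that early-exit idea, assuming the candidate vectors of a bucket are already loaded into memory (the function name is mine):

import numpy as np

THRESHOLD = 0.36

def first_good_enough(query: np.ndarray, candidates: np.ndarray):
    # Return the first candidate below the threshold instead of the best one.
    # Early termination is what makes the approximation cheap: any match
    # below the threshold is acceptable for this kind of workload.
    for i, vec in enumerate(candidates):
        if 1.0 - float(np.dot(query, vec)) < THRESHOLD:
            return i
    return None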

While gathering notes for this article, I decided to put my current implementation to the test to understand the performance of my system, and to see if I could improve upon it. And boy, did I learn a lot!

Baseline

Because my objective was to observe (and hopefully improve) my current implementation, I first had to establish a way to measure the system under observation, and then a baseline to improve upon.

I am lucky because I have been developing and testing my system for a while now, and I have "real-world" data from test users to play with.
To set up my experiments, I hand-picked a dataset of ~250 vectors to seed my system and nearly 3000 vectors to be used for similarity retrieval (i.e. "is any of the 250 seeded vectors similar to each of the 3000 test vectors?"). This is not a huge dataset by production standards, but remember that we want to use it to observe the accuracy of our algorithms, not hardware performance, and for that purpose it should be sufficient. Also worth noting: this dataset must not be confused with the dataset used to train the artificial intelligence sub-system. That is already pre-trained on millions of data points and is out of scope of what we are discussing here today.

To establish an accuracy baseline, I used the simple brute-force approach. That way, I am always guaranteed the most accurate result, at the cost of comparing my input with every single vector in the "database". That's the dreaded O(n) linear performance.

The total number of "matches" was 673, with the following similarity distribution:

Similarity distribution histogram for baseline scenario (no approximation)

To state the obvious, the probability of finding the "right" match in this case is always 1. That is, if a similar vector exists, it will be retrieved. Looking at the probability graph for the different algorithms will be the key to understanding their accuracy.

Probability of retrieving correct match in brute-force approach

Approximation

Next, I decided to measure my currently implemented algorithm. Enter the world of problem approximation with LSH. Ideally, what we would like from LSH is a reduced (better than linear) search complexity while keeping accuracy at, or close to, 100%. To achieve this holy grail, we would need a step function that hashes all vectors below the threshold to the same bucket and all other vectors to a different bucket. If we then search the vectors within that particular bucket, we greatly reduce the search space and are still guaranteed to find similar vectors (provided they exist in our data set).

Ideal case — step function

First, we need to calculate a binary hash from our input vector. To do that, we partition the space with n random hyperplanes. For each plane, the vector ends up in one of the two regions the plane separates: we assign a value of 1 if the vector is on the positive side, or 0 if it is on the negative side. Intuitively, the closer two documents are, the more likely they are to fall in the same region for a random plane.
For example, if two documents are almost on top of each other, they will end up in the same region for most of the planes you can come up with.
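An illustrative sketch of that hashing step (the plane normals must be generated once and then reused for every vector; the constants and names are mine):

import numpy as np

N_PLANES = 8          # number of random hyperplanes = number of hash bits
DIMENSIONS = 1024     # dimensionality of our vectors

rng = np.random.default_rng(42)
# One random plane per hash bit, each defined by its normal vector.
planes = rng.standard_normal((N_PLANES, DIMENSIONS))

def lsh_hash(vector: np.ndarray) -> str:
    # Bit i is 1 if the vector lies on the positive side of plane i, 0 otherwise.
    # Returns an n-bit binary hash such as '01110011'.
    bits = (planes @ vector) >= 0
    return "".join("1" if b else "0" for b in bits)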

For planes drawn at random, it can be mathematically proven that two documents x and y hash to the same value with probability:

P[h(x) = h(y)] = 1 - θ(x, y) / π

where θ(x, y) is the angle between the two vectors. The expression encapsulates our intuition: the closer two documents are (small θ), the greater the chance they end up in the same region.

If we use just one plane to split our space in half, the probability of two vectors having the same hash would be:

Probability of two vectors having same single-bit binary hash

You can interpret this as follows: the probability that two vectors at our threshold (a cosine distance of 0.36) are hashed into the same bucket is roughly 0.72 (72%). That may sound OK at first, but it also means that in roughly 0.28 (28%) of cases, similar vectors will hash to different buckets (and thus we won't find our match). And we only reduced our search space by 50%. That is not a good optimization, considering we are losing a fair bit of accuracy.
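As a quick sanity check, the 0.72 figure can be reproduced from the formula above (a cosine distance of 0.36 corresponds to a cosine similarity of 0.64):

import numpy as np

distance_threshold = 0.36                    # cosine distance at the cut-off
theta = np.arccos(1.0 - distance_threshold)  # angle between the two vectors
p_single_plane = 1.0 - theta / np.pi         # collision probability, 1 plane

print(round(p_single_plane, 2))              # -> 0.72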

Let's add more random planes, to end up with a hash that will look something like this:

01110011

(8-bit hash constructed by dividing space by 8 random hyperplanes)

Now our probability decreases: it becomes very unlikely for two vectors to hash into the same bucket unless they are nearly identical. Which, if you recall, is exactly what we established early on about LSH: the algorithm works best for very close matches (a very small cosine distance between vectors).

Probability of two vectors having same 8bit binary hash
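With independent random planes the per-bit probabilities multiply, so an 8-bit hash collides with probability (1 - θ/π)^8, which at our threshold distance is only about 7%:

import numpy as np

theta = np.arccos(1.0 - 0.36)            # angle at the threshold distance
p_single_plane = 1.0 - theta / np.pi     # ~0.72, as before
p_8bit_hash = p_single_plane ** 8        # all 8 bits have to agree

print(round(p_8bit_hash, 2))             # -> 0.07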

But there are different techniques we can employ to improve our situation.

  • Explore nearby buckets
  • Consider partial hash matches

Nearby buckets

To explore nearby buckets, I added Hamming-distance hash variations.
The idea is simple: once we obtain the LSH hash, we calculate all variations of that hash up to a Hamming distance of n (a code sketch follows the examples below). This allows for error correction for vectors that are similar according to our criteria, but end up in different buckets.

If we take the hash from the previous example, the variations with Hamming distance 1 would be:

11110011
00110011
01010011
01100011
01111011
01110111
01110001
01110010

For Hamming distance 2, we get:

10110011
11010011
11100011
11111011
11110111
11110001
11110010

00010011
00100011
00111011
00110111
00110001
00110010

01000011
01011011
01010111
01010001
01010010

etc.
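An illustrative sketch of generating those variations; for an 8-bit hash and a maximum Hamming distance of 3 it produces the 93 buckets counted below:

from itertools import combinations

def hamming_variations(hash_str: str, max_distance: int) -> list[str]:
    # All bit strings within `max_distance` bit flips of `hash_str`,
    # including the original hash itself.
    variations = []
    for distance in range(max_distance + 1):
        for positions in combinations(range(len(hash_str)), distance):
            bits = list(hash_str)
            for pos in positions:
                bits[pos] = "1" if bits[pos] == "0" else "0"
            variations.append("".join(bits))
    return variations

print(len(hamming_variations("01110011", 3)))  # -> 93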

In my case I chose a Hamming distance of 3, which produces a total of C(8,0) + C(8,1) + C(8,2) + C(8,3) = 1 + 8 + 28 + 56 = 93 variations (the original hash plus every variation at distance 1, 2 and 3). This gives us the following probability of finding the correct match:

8bit hash with Hamming distance 3

Which is much better than the previous case, and the shape of the curve starts to resemble the ideal step function. But it is a clear tradeoff between accuracy (approximation) and performance (the time and resources it takes to get the result).

This configuration gives a satisfactory 93% accuracy compared to the baseline, and it reduces the search space by 63.7% on average. Let me explain: in a hypothetical scenario with 1 million uniformly distributed vectors in our database, we keep ~3906 vectors in each of the 256 buckets (an 8-bit hash = 256 different buckets). We need to search 93 buckets per query, which gives us ~363,281 vectors to search (36.3% of the 1 million stored vectors). But I cannot stress enough that these calculations are average results in an idealized scenario (uniformly distributed vectors); in reality, vectors will be clustered in some buckets, producing widely different results for different queries.

That is a modest improvement, but it also cost us 7% of accuracy. In other words, in 7% of cases, there was a similar vector in the database but we did not find it.

For completeness' sake, let me explain how the actual data persistence is implemented. I use Redis, and every hash simply translates to a key in the data store. The value is a Redis list of the vectors to be searched, along with metadata (for example, in a music recommendation service that could be the name of a song, an artist, …) that is returned in the query response.
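A simplified sketch of that layout using redis-py (the key prefix, serialization format and helper names are illustrative; it reuses the lsh_hash helper sketched earlier):

import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def store_vector(vector: np.ndarray, metadata: dict) -> None:
    # Append the vector and its metadata to the list stored under its LSH hash.
    key = "lsh:" + lsh_hash(vector)                       # e.g. "lsh:01110011"
    r.rpush(key, json.dumps({"v": vector.tolist(), "meta": metadata}))

def load_bucket(hash_str: str) -> list[dict]:
    # Load every candidate stored under one hash bucket.
    return [json.loads(item) for item in r.lrange("lsh:" + hash_str, 0, -1)]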

Partial hash matches

As a last experiment, I tried a slightly different algorithm that I found while doing my research on LSH.

For this experiment, we generate a longer hash but split it into multiple groups, and we search every bucket where at least one hash group matches the corresponding group of our query.

input: 01110011 11001001 11001100

(24 bit hash split in 3 groups)

And if we have two different vectors stored in the system, hashed as

a: 00110010 11011001 10001100

b: 11000011 11001001 11001000

Our input vector will be considered similar to vector b, because the second group of b (11001001) matches the second group of the input.

It was not until later, when I started analyzing my results, that I realized the mistake I had made by using only a 24-bit hash split into 3 groups. This third experiment had no chance of outperforming my current implementation. The reason was staring me right in the face:

24bit hash split in 3 groups

Once I calculated the probability graph, I understood why I only saw 71% accuracy compared to the baseline: the probability of hashing two vectors at the threshold cosine distance of 0.36 into the same bucket falls to ~22%.
The performance was better, though: in our idealized scenario with 1 million vectors, this algorithm only needs to compare ~12k vectors on average, i.e. it reduces the search space by 98.8%.

Because we split the hash into groups and need to consider each group separately, we also need a different mechanism for querying the Redis backend.

For a sample hash 01110011 11001001 11001100, we break it down to 3 Redis keys:

01110011xx
x11001001x
xx11001100

and we store the vector under each of the three keys. This duplicates data, but it allows us to find vectors when only some of the groups that make up the hash match.
We also need to introduce an additional step before returning our results, to filter out the duplicates.
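A simplified sketch of how the group keys and the de-duplication could look with redis-py (key format and helper names are illustrative; the hash here is the 24-bit one, split into three 8-bit groups):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def group_keys(hash_str: str, groups: int = 3) -> list[str]:
    # Split a hash into equal groups and build one key per group, with the
    # other groups replaced by a wildcard placeholder, e.g.
    # '011100111100100111001100' -> 01110011xx, x11001001x, xx11001100
    size = len(hash_str) // groups
    parts = [hash_str[i * size:(i + 1) * size] for i in range(groups)]
    return ["".join(p if j == i else "x" for j, p in enumerate(parts))
            for i in range(groups)]

def store_vector(hash_str: str, vector: list[float], metadata: dict) -> None:
    # Store the same vector under all three group keys (deliberate duplication).
    payload = json.dumps({"v": vector, "meta": metadata})
    for key in group_keys(hash_str):
        r.rpush("lsh:" + key, payload)

def query_candidates(hash_str: str) -> list[dict]:
    # Union of every bucket sharing at least one group with the query,
    # de-duplicated before the final cosine-distance calculations.
    seen, candidates = set(), []
    for key in group_keys(hash_str):
        for item in r.lrange("lsh:" + key, 0, -1):
            if item not in seen:
                seen.add(item)
                candidates.append(json.loads(item))
    return candidates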

Can we do better?

That is something I have been thinking about a lot. And it's not just me: this is fairly cutting-edge stuff, and a lot of people far cleverer than me are spending a lot of time coming up with new algorithms for querying large volumes of high-dimensional data.

I have been playing with different configurations to fine-tune precision vs. performance of the system. But at the end of the day, these are just that: marginal improvements that, while important, don't drastically change the observed behavior or the (at best) sub-linear performance.

I have also tried to look at the problem from a different angle. My system stores a large number of vectors that might be used shortly after being stored, but more often than not are never used at all. Only a small subset of the vectors ends up being returned as the most similar in the majority of cases.

What if we were more aggressive about reducing the overall size of the dataset?

My plan is to implement a time-to-live (TTL), a concept well known from caching, to only keep vectors that were "used" (retrieved) within a certain window of time, and to renew the TTL every time a vector is used, perhaps even with a sliding window: the first TTL would be 1 hour; after the vector is "used", the next TTL would be 1 day, then 1 week, and so on. That, combined with the ability to disable TTL for some vectors, should allow me to purge a large number of vectors from the data set and worry a little less about the performance of the algorithm.
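A rough sketch of that sliding-TTL idea with redis-py; the per-vector key layout and the escalation schedule are hypothetical, not something already in the system:

import redis

r = redis.Redis(host="localhost", port=6379)

# Escalating TTLs: 1 hour -> 1 day -> 1 week -> 30 days (hypothetical schedule)
TTL_STEPS = [3600, 86400, 604800, 2592000]

def touch_vector(vector_id: str) -> None:
    # Renew the TTL of a vector every time it is returned as a match.
    # Assumes each vector also has its own key (e.g. 'vec:<id>') so that
    # Redis EXPIRE can be applied per vector; expired vectors can then be
    # pruned from their hash buckets by a periodic cleanup job.
    step_key = "ttl-step:" + vector_id
    step = int(r.get(step_key) or 0)                 # how far along the ladder
    ttl = TTL_STEPS[min(step, len(TTL_STEPS) - 1)]
    r.expire("vec:" + vector_id, ttl)                # sliding, growing TTL
    r.set(step_key, step + 1, ex=ttl)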

Another thing to consider in the future is parallel processing. We could partition our data set so that it is stored and queried on different nodes, and then combine the partial results into a final result (map-reduce). Perhaps that will be another interesting topic to write about soon?


In his 12+ years in the industry, Ondrej has worked on a variety of products, from search engines to mobile apps.