
Let’s just jump in the deep end right away. It’ll be fun that way. Say you have a corpus of text.
["dog barked at the tree", "cat drank milk"]
Now represent each word/token as a bag-of-n-grams vector. To do that, we first need all the n-grams in all the tokens. Let’s take 1-grams and 2-grams.
[d, o, g, b, a, r, k, e, d, ..., m, i, l, k, do, og, ba, ar, ..., mi, il, lk]
Now represent each word as an n-gram vector
dog ->[1, 1, 1, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 1, 1, 0, 0, ..., 0, 0, 0]
barked->[0, 0, 0, 1, 1, 1, 1, 1, 1, ..., 0, 0, 0, 0, 0, 0, 1, 1, ..., 0, 0, 0]
...
milk ->[0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 1, 1, 1, 1, 0, 0, 0, 0, ..., 1, 1, 1]
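If you want to follow along, here’s a rough sketch of how such a bag-of-n-gram matrix could be built (a minimal version; names like char_ngrams and ngram_index are mine, not from the original code):
import numpy as np

corpus = ["dog barked at the tree", "cat drank milk"]
tokens = sorted({w for sentence in corpus for w in sentence.split()})

def char_ngrams(word, n_values=(1, 2)):
    # All character n-grams (here 1-grams and 2-grams) of a word
    return [word[i:i + n] for n in n_values for i in range(len(word) - n + 1)]

# Every 1-gram and 2-gram appearing in any token
ngram_vocab = sorted({g for w in tokens for g in char_ngrams(w)})
ngram_index = {g: i for i, g in enumerate(ngram_vocab)}

# Multi-hot bag-of-n-gram matrix: one row per token, one column per n-gram
x = np.zeros((len(tokens), len(ngram_vocab)))
for row, w in enumerate(tokens):
    for g in char_ngrams(w):
        x[row, ngram_index[g]] = 1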
Once you have each word as an n-gram vector, compute the following.
w = [xR ; -xR]
Here, x is the n-gram matrix of size (V, d), R is a random normal matrix of size (d, b/2), and ";" denotes concatenation along the feature axis. So we end up with a (V, b) matrix. Here, V is the vocabulary size, d is the number of distinct n-grams, and b is a hyperparameter (the number of hash bins). In other words, we have a vector of size b for each token.
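In numpy, the whole computation is roughly the following (a sketch; the random x below is just a stand-in for the bag-of-n-gram matrix built earlier):
import numpy as np

# Stand-in for the (V, d) bag-of-n-gram matrix; reuse the real x from the earlier sketch if you have it
x = np.random.randint(0, 2, size=(6, 40)).astype(float)

V, d = x.shape
b = 32                                    # number of hash bins (hyperparameter)
R = np.random.normal(size=(d, b // 2))    # random normal projection matrix

h = x @ R                                 # shape (V, b/2)
w = np.concatenate([h, -h], axis=-1)      # shape (V, b): one vector of size b per token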
You can visualize these vectors using a dimensionality reduction technique like t-SNE and you get the following.

Taking a closer look,
watch ['wat.', 'always', 'Wat', 'wat', 'Was']
Free ['free', 'FREE', 'here,', 'fine.', 'alone']
My ['my', 'MY', 'whole', '<DECIMAL>', '<#>']
sending ['Send', 'send', 'end', 'dinner', 'seeing']
May ['may', 'mayb', 'mail', 'well', 'Well']
was ['was', 'WAS', 'office', 'wan', 'wana']
Enjoy ['Enjoy', 'write', 'wit', 'sorry', 'Sorry']
you can see that words with a similar composition end up together.
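As an aside, a 2-D plot like the one above can be produced with something along these lines (a sketch assuming scikit-learn and matplotlib, and that w and tokens come from the earlier sketches, with w computed from the real n-gram matrix):
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the b-dimensional vectors down to 2-D for plotting
coords = TSNE(n_components=2, perplexity=5, init="random").fit_transform(w)

plt.scatter(coords[:, 0], coords[:, 1])
for (px, py), tok in zip(coords, tokens):
    plt.annotate(tok, (px, py))
plt.show()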
insert mind-blown meme
Code: Here
Holup! What the heck just happened?

Fear not! I won’t leave you on a cliff-hanger like the above without an explanation. The core idea behind this possibility lies in locality sensitive hashing (LSH). Now let me tell you the story about how I became aware of this interesting property in the first place.
Prologue
I was reading the Reformer (a variant of the Transformer model) paper a while ago. The feature that stole the spotlight in that paper was the LSH (Locality Sensitive Hashing) algorithm they used to bin queries and keys of the Transformer model. I was amazed by the simplicity and the sheer power of it. The easiest explanation of LSH (I guess) is,
LSH does the opposite of a normal hash function. A hash function attempts to reduce collisions for unique items, whereas an LSH algorithm tries to maximize collisions for similar items (based on some input representation).
But the actual light bulb moment for me was a different one. I was wondering if there’s a way to use the LSH algorithm leveraged in the Reformer to come up with crude yet fast word vectors. The fact that I’ve spent a lot of time with word vectors (for which I can’t give a good reason) might have planted the seed.
Now that we have context, let’s get to the crux of it! First stop, LSH.
Idea behind LSH
LSH stems from the annoyingly common problem of finding approximate nearest neighbors (ANN). Approximate NN (Nearest Neighbors) means finding NN quickly by sacrificing some degree of accuracy. For example, recommendation models thrive on ANN because finding users similar to a given user from a user base of hundreds of millions is not a walk in the park! Don’t forget that this needs to happen in response to a swift user trigger (e.g. clicking a button, scrolling the page, etc.).
Okay back to LSH. There are many different LSH algorithms available, including popular ones like minhash. The idea is to come up with a hash function that will,
increase the probability of a collision (same hash output) for highly similar inputs
Minhash is a tad complex though, and is out of scope for us. The algorithm powering the Reformer under the hood is much, much simpler. It is known as angular LSH (or spherical LSH). In fact, it could be expressed as a one-liner if you shave off the extra fat (thus the click-baity title). Let’s now dive into the specifics of the LSH algorithm used in the Reformer paper.
import numpy as np

bins = 32
# Random projection matrix: one column per half of the hash bins
R = np.random.normal(size=(5, bins // 2))
inputs = np.array([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [0, 0, 1, 1, 0]])
h_x = np.dot(inputs, R)
# The crux of the computation is a single line:
# hash_bins holds a bin index that tends to be the same for similar inputs
hash_bins = np.argmax(np.concatenate([h_x, -h_x], axis=-1), axis=-1)
That’s all you need! Oh yeah, and I guess you want me to believe ice-cream comes from frozen cows too? No really, it works (in the toy example above, the first two inputs are more similar to each other than to the third, so they are more likely to land in the same bin). Here’s why.
Intuition: Spherical LSH
Spherical LSH revolves around a unit sphere (in some dimension D) and a query vector of the same dimension D. The idea is that,
If you have B randomly rotated vertices on this sphere, query vectors with similar representations would get the same vertex as their nearest neighbor
That’s exactly what the above computation does, specifically,

Each column in R (r_j) can be thought of as a randomly rotated vertex of our hyper-sphere (this is a detail you can leave as is – but for the more math-savvy, there’s a small appendix on this). x_i . r_j is simply a dot product, and another name for the dot product is "similarity" (the higher the dot product, the more similar). Then argmax gives the index of the vertex that is the nearest neighbor of the query. This is sometimes interpreted as "assigning to a bin", but it’s the same principle. Yeah, as simple as that!
One important thing to let sink in is the quantity x_i . r_j: it is a scalar measuring how similar a given query is to one arbitrary vertex in space. When you compute it over many random vertices, you end up with a vector for each query, and similar queries end up with similar vectors (i.e. they are placed in close proximity).

How do I tie in LSH to get good word vectors?
I’m glad you asked. For one thing, both LSH and word vectors target similar goals: "finding similar units of input among a large corpus". So the question I am trying to answer here is, can we come up with an LSH mechanism to get some quick word vectors out?
We simply take the argmax out of the equation, because that leaves us with a vector of similarities to each random vertex in our high-dimensional space. And for similar words (with each word represented as a bag of n-grams), these vectors will be similar.
Why wouldn’t we just keep the bag-of-ngram vectors?
So the other question is: we are exploiting a similarity that already exists in our bag-of-n-gram representation, so why do this transformation in the first place? There are two main reasons.
- Bag-of-n-gram vectors can get arbitrarily large, since their size is controlled by the number of distinct n-grams in the corpus, whereas in this method b controls the size of the vectors.
- Forcing high-dimensional n-gram vectors into a lower dimension forces them to encode more similarities. The same principle applies to word vectors: when the dimensionality is small, it forces word vectors to learn about similarities between words.
Performance on a SPAM classification task
A typical method of evaluating word vectors is by using them on a down-stream task. Here I have evaluated the following on a spam-classification dataset.
- Hash2vec vectors
- Bag-of-ngram vectors
- Actual word vectors
You can find the code for the experiments here. The model is a simple logistic regression model on top of the vectors. After 10 trials (each trained for 10 epochs), I get the following results.
- Hash2vec: 82.18 +/- 0.93%
- Multihot: 83.99 +/- 0.28%
- Word2vec: 92.04 +/- 0.36%
We can see that Hash2vec is no oracle, but it performs roughly on par with the Multihot (bag-of-n-grams) approach despite having a smaller size. Here we end our discussion of the method. You can see the code for this experiment: Here.
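If you just want the shape of that evaluation, a minimal sketch with scikit-learn could look like the following. This is an illustration, not the exact experiment code (the original trains for epochs, so its setup differs), and features / labels are placeholder names:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# features: (num_messages, vector_dim) matrix (Hash2vec, multi-hot or averaged word vectors)
# labels:   (num_messages,) array of 0/1 spam labels
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))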
Why and why not Hash2Vec
Why Hash2Vec
- Captures only compositional similarities – It is a good way to make sure words with similar structure (e.g. walk and walked) or words with minor spelling mistakes (e.g. third vs thrid) get mapped to similar vectors
- Simple word vectors with no training – No need for training. A simple matrix multiplication gives the vectors
- Fits in with any model – Unlike word vectors, they don’t need to be in an end-to-end differentiable architecture (e.g. can be used with models like xgboost)
Why not Hash2Vec
- Doesn’t capture semantics – Word vectors capture semantics (e.g. cat vs feline), whereas the above method won’t.
- Hashing can go wrong – There’s no guarantee in LSH that very different things will end up in very different bins. In rare instances, they can be mapped to the same bin. That is why I ran multiple trials in these experiments.
- No vector arithmetic – Unlike Word2vec, you can’t perform vector calculations (e.g. king – man = queen – woman) with Hash2vec, because it doesn’t capture semantics.
- Accuracy – Word2vec is much more performant than this method, so if you’re after accuracy, Word2vec is the more sensible choice.
If you can’t beat ’em, join ’em
These methods (Hash2vec & Word2vec) don’t need to be mutually exclusive. They can empower each other as well. For example,
- You can use Hash2vec to provide a good initialization for word vectors and then train word vectors normally.
- You can use LSH to develop a simple spelling-correction mechanism. Since LSH maps similar words to similar buckets, you can use this property to build a simple spelling-correction lookup dictionary, and then map words in the same bucket to the same vector in a word2vec layer (a rough sketch follows below).
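Here’s a rough sketch of that second idea, reusing the angular LSH from earlier (it assumes x, tokens, ngram_index and char_ngrams from the earlier sketches; names like lsh_bin and candidates are mine):
import numpy as np
from collections import defaultdict

bins = 32
# Rebuild R to match the width of the real bag-of-n-gram matrix x
R = np.random.normal(size=(x.shape[1], bins // 2))

def lsh_bin(ngram_vecs, R):
    # Assign each row (a bag-of-n-gram vector) to an angular-LSH bin
    h = ngram_vecs @ R
    return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

# Bucket -> words lookup over the vocabulary
buckets = defaultdict(list)
for word, bin_id in zip(tokens, lsh_bin(x, R)):
    buckets[bin_id].append(word)

def candidates(word):
    # Words sharing the bucket of a (possibly misspelled) query word
    vec = np.zeros((1, len(ngram_index)))
    for g in char_ngrams(word):
        if g in ngram_index:
            vec[0, ngram_index[g]] = 1
    return buckets[lsh_bin(vec, R)[0]]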
Conclusion
The objective of this article wasn’t to introduce a rival for word2vec, but to show the power of a simple LSH mechanism and cement it with an exercise in word vectors. It gave us a way to come up with low-dimensional word vectors without the need for any training. Furthermore, these vectors can capture the compositional similarities between words. Finally, these methods can benefit from each other as well.
PS: I haven’t seen a technique similar to this being experimented with or used before. That is what drove me to write this article. But given the sheer volume of ML papers published in this space, it is not rare for similar work to go unnoticed. If you see a paper doing work along a similar line, feel free to post about it.
Appendix
Random rotation and matrix multiplication
We have been saying that R in h(x) = argmax([xR ; -xR]) represents a randomly rotated structure in a hyper-dimensional space. To understand why that’s the case, we have to turn our heads to the paper "Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere". In this paper, the actual computation looks like,
h(x_i) = argmax_j x_i (R v_j)
where R is a random rotation matrix (the paper uses A) and v_j represents the j-th vertex of a regular polytope inscribed in the unit hypersphere. So here we’re specifically using a random rotation matrix to rotate the vertices of a regular polytope. So how did we narrow x_i (R v_j) down to xR;-xR (where ";" is the concatenation)? The trick lies in what we use as our shape. Here we are using a special polytope (a generalization of a polygon to higher dimensions) called an "orthoplex". An orthoplex’s vertices are simply all permutations of {+/-1, 0, 0, ..., 0}. For example, in 2D space they are [1, 0], [0, 1], [-1, 0] and [0, -1].

In other words, the vertices are the columns of I and -I, where I is the identity matrix. So by plugging the columns of I and -I in as the v_j’s, x (R v_j) simply picks out the entries of xR and -xR, and stacking them together gives xR;-xR.
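For the math-inclined, the reduction can be written out explicitly (my notation, not the paper’s):
$$ h(x_i) = \arg\max_j \; x_i (R v_j), \qquad v_j \in \{\pm e_1, \ldots, \pm e_{b/2}\} $$
Since R e_k is just the k-th column of R, we have x_i (R e_k) = (x_i R)_k and x_i (R (-e_k)) = -(x_i R)_k. Collecting these b values is exactly the concatenation [x_i R ; -x_i R], hence
$$ h(x_i) = \arg\max \big( [\, x_i R \; ; \; -x_i R \,] \big) $$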