If you are on social media like Twitter or LinkedIn, you have probably noticed that emojis are used creatively in both informal and professional text-based communication. For example, the Rocket emoji is often used on LinkedIn to symbolize high aspirations and ambitious goals, while the Bullseye emoji is used in the context of achieving goals. Despite this growth in creative emoji use, most social media platforms lack a utility that helps users choose the right emoji to communicate their message effectively. I therefore decided to invest some time in a project I called Emojeez, an AI-powered engine for emoji search and retrieval. You can experience Emojeez live through this fun interactive demo.
In this article, I will discuss my experience and explain how I employed advanced natural language processing (NLP) technologies to develop a semantic search engine for emojis. Concretely, I will present a case study on embedding-based semantic search with the following steps:
- How to use LLMs to generate semantically rich emoji descriptions
- How to use Hugging Face Transformers for multilingual embeddings
- How to integrate Qdrant's vector database to perform efficient semantic search
I made the full code for this project available on GitHub.
Inspiration
Every new idea begins with a spark of inspiration. For me, the spark came from Luciano Ramalho's book Fluent Python. It is a fantastic read that I highly recommend for anyone who wants to write truly Pythonic code. In Chapter 4 of the book, Luciano shows how to search over Unicode characters by querying their names in the Unicode standard. He created a Python utility that takes a query like "cat smiling" and retrieves all Unicode characters whose names contain both "cat" and "smiling". For that query, the utility finds exactly three emojis. Pretty cool, right?
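To make the idea concrete, here is a minimal sketch of that kind of keyword search over Unicode names. It is my own illustration in the spirit of Luciano's utility rather than his exact code, and the scanned code-point range is an assumption that covers the main emoji blocks:

import unicodedata

def keyword_search_emojis(query: str) -> list[tuple[str, str]]:
    """Return (character, name) pairs whose Unicode name contains every query term."""
    terms = query.upper().split()
    hits = []
    # Scan the main emoji code-point blocks (an assumption for this sketch)
    for code in range(0x1F300, 0x1FAFF + 1):
        char = chr(code)
        name = unicodedata.name(char, "")
        if name and all(term in name for term in terms):
            hits.append((char, name))
    return hits

print(keyword_search_emojis("cat smiling"))

Because the match is purely lexical, a query term must literally appear in a character's name for that character to be retrieved.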
From there, I started thinking about how modern AI technology could be used to build an even better emoji search utility. By "better," I envisioned a search engine that not only has broader emoji coverage but also supports user queries in multiple languages beyond English.
Limitations of Keyword Search 
If you are an emoji enthusiast, you know that these aren't the only smiley cat emojis out there. Some cat emojis are missing from the results, such as the cat with tears of joy. This is a known limitation of keyword search algorithms, which rely on string matching to retrieve relevant items. Keyword (or lexical) search algorithms are known among information retrieval practitioners to have high precision but low recall. High precision means the retrieved items usually match the user query well. On the other hand, low recall means the algorithm might not retrieve all relevant items. In many cases, the lower recall is due to string matching. For example, the cat with tears of joy emoji does not have "smiling" in its name, so it cannot be retrieved with the query "cat smiling" if we require both terms cat and smiling to appear in the name.
Another issue with lexical search is that it is usually language-specific. In Luciano's Fluent Python example, you can't find emojis using a query in another language because all Unicode characters, including emojis, have English names. To support other languages, we would need to translate each query into English first using machine translation. This would add complexity and might not work well for all languages.
But hey, it's 2024 and AI has come a long way. We now have solutions to address these limitations. In the rest of this article, I will show you how.
Embedding-based Semantic Search 
In recent years, a new search paradigm has emerged alongside the rise of deep neural networks for NLP. In this paradigm, the search algorithm does not look at the strings that make up the items in the search database or the query. Instead, it operates on numerical representations of text known as vector embeddings. In embedding-based search, the search items, whether text documents or images, are first converted into data points in a vector space such that semantically related items end up nearby. Embeddings enable us to perform similarity search based on the meaning of an emoji's description rather than the keywords in its name. Because it retrieves items based on semantic similarity rather than keyword overlap, this approach is known as semantic search.
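As a toy illustration of this paradigm, here is what ranking items by similarity to a query vector looks like. The vectors below are made up purely for illustration; in a real system they would come from an embedding model:

import numpy as np

# Toy "database" of items, already embedded as vectors (made-up numbers).
item_vectors = {
    "cat with tears of joy": np.array([0.9, 0.1, 0.3]),
    "rocket":                np.array([0.1, 0.8, 0.2]),
    "grinning cat":          np.array([0.8, 0.2, 0.4]),
}
query_vector = np.array([0.85, 0.15, 0.35])   # pretend embedding of "cat smiling"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank items by cosine similarity to the query: nearest vectors come first.
ranking = sorted(
    item_vectors.items(),
    key=lambda kv: cosine(query_vector, kv[1]),
    reverse=True,
)
for name, vec in ranking:
    print(f"{cosine(query_vector, vec):.3f}  {name}")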
Using semantic search for emoji retrieval solves two problems:
- We can go beyond keyword matching and use semantic similarity between emoji descriptions and user queries. This improves the coverage of the retrieved emojis, leading to higher recall.
- If we represent emojis as data points in a multilingual embedding space, we can enable user queries written in languages other than English, without needing translation into English. That is very cool, isn't it? Let's see how.
Step 1: Generating Rich Emoji Descriptions using LLMs 
If you use social media, you probably know that many emojis are almost never used literally. For example, the eggplant and peach emojis rarely denote an actual eggplant or peach. Social media users are very creative in assigning meanings to emojis that go beyond their literal interpretation, and this creativity limits the expressiveness of emoji names in the Unicode standard. A notable example is the rainbow emoji, whose Unicode name is simply rainbow, yet it is commonly used in contexts related to diversity, peace, and the LGBTQ+ community.
To build a useful search engine, we need a rich semantic description for each emoji that defines what the emoji represents and what it symbolizes. Given that there are more than 5000 emojis in the current Unicode standards, doing this manually is not feasible. Luckily, we can employ Large Language Models (LLMs) to assist us in generating metadata for each emoji. Since LLMs are trained on the entire web, they have likely seen how each emoji is used in context.
For this task, I used the Llama 3 LLM to generate metadata for each emoji. I wrote a prompt that defines the task and what the LLM is expected to produce. As illustrated in the figure below, the LLM generated a rich semantic description for the Bullseye emoji. These descriptions are more suitable for semantic search than the Unicode names. I released the LLM-generated descriptions as a Hugging Face dataset.
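For illustration, here is a rough sketch of how such descriptions could be generated with an instruction-tuned Llama 3 checkpoint through the Hugging Face transformers text-generation pipeline (recent versions accept chat-style messages). The model ID, prompt wording, and helper function are my own assumptions for this sketch, not the exact setup behind Emojeez:

from transformers import pipeline

# Assumption: an instruction-tuned Llama 3 checkpoint is available via the Hub.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def describe_emoji(emoji_char: str, short_name: str) -> str:
    """Ask the LLM for a rich, search-friendly description of one emoji."""
    messages = [
        {"role": "system",
         "content": "You write rich semantic descriptions of emojis: what they "
                    "depict and what they commonly symbolize in online communication."},
        {"role": "user",
         "content": f"Describe the emoji {emoji_char} (short name: {short_name}) "
                    f"in 2-3 sentences suitable for a search index."},
    ]
    output = generator(messages, max_new_tokens=120, do_sample=False)
    # With chat-style input, the pipeline returns the full conversation;
    # the last message is the assistant's reply.
    return output[0]["generated_text"][-1]["content"]

print(describe_emoji("🎯", "bullseye"))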
Step 2: Representing Emojis as Embeddings using Sentence Transformers 
Now that we have a rich semantic description for each emoji in the Unicode standard, the next step is to represent each emoji as a vector embedding in a multidimensional space that captures the meaning of the emoji description. For this task, I used a multilingual transformer based on the BERT architecture, fine-tuned for sentence similarity across 50 languages. You can find the list of supported languages in the model card on the Hugging Face Hub.
So far, I have only discussed the embedding of emoji descriptions generated by the LLM, which are in English. But how can we support languages other than English?
Well, here's where the magic of multilingual transformers comes in. The multilingual support is enabled through the embedding space itself. This means we can take user queries in any of the 50 supported languages and match them to emojis based on their English descriptions. The multilingual sentence encoder (or embedding model) maps semantically similar text phrases to nearby points in its embedding space. Let me show you what I mean with the following illustration.
In the figure above, we see that semantically similar phrases end up as nearby data points in the embedding space, even when they are expressed in different languages. Multilingual sentence transformers thus enable cross-lingual search applications, where user queries and indexed search items do not have to be expressed in the same language.
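Here is a small sketch of that cross-lingual behavior using the sentence-transformers library. Since the article does not name the exact checkpoint, the model ID below is an assumption; any multilingual sentence-similarity model would do:

from sentence_transformers import SentenceTransformer, util

# Assumption: a multilingual sentence-similarity checkpoint, not necessarily
# the one used in Emojeez.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

description = "A cat face with a wide smile, expressing happiness or playfulness."
queries = [
    "a smiling cat",            # English
    "eine lächelnde Katze",     # German
    "un gato sonriente",        # Spanish
]

# Semantically equivalent phrases land near each other in the embedding space,
# regardless of the language they are written in.
desc_vec = model.encode(description)
for query in queries:
    score = util.cos_sim(desc_vec, model.encode(query)).item()
    print(f"{score:.3f}  {query}")

This cross-lingual alignment is what allows Emojeez to index only the English descriptions while accepting queries in dozens of languages.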
Step 3: Integrating Qdrant's Vector Database
Once we have our emojis represented as vector embeddings, the next step is to build an index over these embeddings in a way that allows for efficient search operations. For this purpose, I chose to use Qdrant, an open-source vector similarity search engine that provides high-performance search capabilities.
Setting up Qdrant for this task is as simple as the code snippet below (you can also check out this Jupyter Notebook).
import pickle
from typing import Any, Dict

import numpy as np
from qdrant_client import QdrantClient, models

# Load the emoji dictionary from a pickle file
with open(file_path, 'rb') as file:
    emoji_dict: Dict[str, Dict[str, Any]] = pickle.load(file)

# Set up the Qdrant client and populate the database
vector_DB_client = QdrantClient(":memory:")
embedding_dict = {
    emoji: np.array(metadata['embedding'])
    for emoji, metadata in emoji_dict.items()
}

# Remove the embeddings from the dictionary so it can be used
# as payload in Qdrant
for emoji in list(emoji_dict):
    del emoji_dict[emoji]['embedding']

embedding_dim: int = next(iter(embedding_dict.values())).shape[0]

# Create a new collection in Qdrant
vector_DB_client.create_collection(
    collection_name="EMOJIS",
    vectors_config=models.VectorParams(
        size=embedding_dim,
        distance=models.Distance.COSINE
    ),
)

# Upload vectors to the collection
vector_DB_client.upload_points(
    collection_name="EMOJIS",
    points=[
        models.PointStruct(
            id=idx,
            vector=embedding_dict[emoji].tolist(),
            payload=emoji_dict[emoji]
        )
        for idx, emoji in enumerate(emoji_dict)
    ],
)
Now the search index vector_DB_client is ready to take queries. All we need to do is transform the incoming user query into a vector embedding using the same embedding model we used to embed the emoji descriptions. This can be done with the function below.
from typing import List

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

def retrieve_relevant_emojis(
        embedding_model: SentenceTransformer,
        vector_DB_client: QdrantClient,
        query: str,
        num_to_retrieve: int) -> List[models.ScoredPoint]:
    """
    Return the emojis most relevant to the query, as scored points
    (payload plus similarity score), using the sentence encoder and Qdrant.
    """
    # Embed the query with the same model used for the emoji descriptions
    query_vector = embedding_model.encode(query).tolist()
    hits = vector_DB_client.search(
        collection_name="EMOJIS",
        query_vector=query_vector,
        limit=num_to_retrieve,
    )
    return hits
To display the retrieved emojis along with their similarity scores and their Unicode names, I wrote the following helper function.
import emoji as em

def show_top_10(query: str) -> None:
    """
    Show the 10 emojis that are most relevant to the query.
    """
    # sentence_encoder and vector_DB_client are the embedding model and
    # Qdrant client created in the previous steps
    emojis = retrieve_relevant_emojis(
        sentence_encoder,
        vector_DB_client,
        query,
        num_to_retrieve=10
    )
    for i, hit in enumerate(emojis, start=1):
        emoji_char = hit.payload['Emoji']
        score = hit.score
        space = len(emoji_char) + 3
        # demojize gives the ':snake_case_name:'; turn it into a readable name
        unicode_desc = ' '.join(
            em.demojize(emoji_char).split('_')
        ).upper()
        print(f"{i:<3} {emoji_char:<{space}}", end='')
        print(f"{score:<7.3f}", end='')
        print(f"{unicode_desc[1:-1]:<55}")
Now everything is set up, and we can look at a few examples. Remember the "cat smiling" query from Luciano's book? Let's see how semantic search differs from keyword search.
>>> show_top_10('cat smiling')
1    0.651   CAT WITH WRY SMILE
2    0.643   GRINNING CAT WITH SMILING EYES
3    0.611   CAT WITH TEARS OF JOY
4    0.603   SMILING CAT WITH HEART-EYES
5    0.596   GRINNING CAT
6    0.522   CAT FACE
7    0.513   CAT
8    0.495   BLACK CAT
9    0.468   KISSING CAT
10   0.452   LEOPARD
Awesome! Not only did we get the three cat emojis that the keyword search retrieved, but we also got the smiley cats it missed, such as the cat with a wry smile and the cat with tears of joy. This showcases the higher recall, or broader coverage of retrieved items, that I mentioned earlier. Indeed, more cats is always better!
The Real Power of Semantic Search 
The previous "cat smiling" example shows how embedding-based semantic search can retrieve a broader and more meaningful set of items, improving the overall search experience. However, I don't think this example truly shows the power of semantic search.
Imagine looking for something but not knowing its name. That happened to me with one particular emoji. Do you know what it's called in English? I sure didn't. But I knew a bit about it: in Middle Eastern and Central Asian cultures, it is believed to protect against the evil eye. So, I knew what it does but not what it's called.
Let's see if we can find this emoji with our search engine by describing it with the query "protect from evil eye".
>>> show_top_10('protect from evil eye')
1    0.409   NAZAR AMULET
2    0.405   GLASSES
3    0.387   GOGGLES
4    0.383   EYE
5    0.382   SUPERVILLAIN LIGHT SKIN TONE
6    0.374   EYES
7    0.370   SUPERVILLAIN DARK SKIN TONE
8    0.369   SHIELD
9    0.366   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10   0.364   MAN SUPERVILLAIN LIGHT SKIN TONE
And voilà! It turns out that this emoji is actually called the Nazar Amulet. I learned something new.
Going Beyond English
One of the features I really wanted this search engine to have is support for as many languages as possible besides English. So far, we have not tested that. Let's test the multilingual capabilities by translating the description of the Nazar Amulet emoji, "protection from evil eyes", into other languages and using the translations as queries, one language at a time. Here are the results for a few languages.
Arabic
>>> show_top_10('يحمي من العين الشريرة') # Arabic
1    0.442   NAZAR AMULET
2    0.430   GLASSES
3    0.414   EYE
4    0.403   GOGGLES
5    0.403   EYES
6    0.398   SUPERVILLAIN LIGHT SKIN TONE
7    0.394   SEE-NO-EVIL MONKEY
8    0.387   FACE WITH PEEKING EYE
9    0.385   VAMPIRE LIGHT SKIN TONE
10   0.383   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
German
>>> show_top_10('Vor dem bösen Blick schützen') # German
1    0.369   FACE WITH MEDICAL MASK
2    0.364   FACE WITH PEEKING EYE
3    0.360   SHIELD
4    0.359   SEE-NO-EVIL MONKEY
5    0.353   EYES
6    0.350   HEAR-NO-EVIL MONKEY
7    0.346   EYE
8    0.345   NAZAR AMULET
9    0.345   WOMAN GUARD DARK SKIN TONE
10   0.345   WOMAN GUARD DARK SKIN TONE
Greek
>>> show_top_10('Προστατέψτε από το κακό μάτι') # Greek
1    0.497   GLASSES
2    0.484   GOGGLES
3    0.452   EYE
4    0.430   SUNGLASSES
5    0.430   SUNGLASSES
6    0.429   EYES
7    0.415   EYE
8    0.411   NAZAR AMULET
9    0.404   FACE WITH PEEKING EYE
10   0.391   FACE WITH MEDICAL MASK
Bulgarian
>>> show_top_10('Защитете от лошото око') # Bulgarian
1    0.475   GLASSES
2    0.452   GOGGLES
3    0.448   EYE
4    0.418   EYES
5    0.412   EYE
6    0.397   FACE WITH PEEKING EYE
7    0.387   SUNGLASSES
8    0.387   SUNGLASSES
9    0.375   SQUINTING FACE WITH TONGUE
10   0.373   NAZAR AMULET
Chinese
>>> show_top_10('防止邪眼') # Chinese
1    0.425   GLASSES
2    0.397   GOGGLES
3    0.392   EYE
4    0.383   NAZAR AMULET
5    0.380   EYES
6    0.370   SEE-NO-EVIL MONKEY
7    0.369   FACE WITH MEDICAL MASK
8    0.363   SUNGLASSES
9    0.363   SUNGLASSES
10   0.360   FACE WITH PEEKING EYE
Japanese
>>> show_top_10('邪眼から守る') # Japanese
1    0.379   SEE-NO-EVIL MONKEY
2    0.379   NAZAR AMULET
3    0.370   HEAR-NO-EVIL MONKEY
4    0.363   FACE WITH MEDICAL MASK
5    0.363   SPEAK-NO-EVIL MONKEY
6    0.355   FACE WITH PEEKING EYE
7    0.355   SHIELD
8    0.351   EYE
9    0.350   SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10   0.350   GLASSES
For languages as diverse as Arabic, German, Greek, Bulgarian, Chinese, and Japanese, the Nazar Amulet emoji always appears in the top 10! This is pretty fascinating given that these languages have very different linguistic features and writing scripts, and it is all thanks to the massive multilinguality of our sentence transformer.
Limits of AI 
The last thing I want to mention is that no technology, no matter how advanced, is perfect. Semantic search is great for improving the recall of information retrieval systems. This means we can retrieve more relevant items even if there is no keyword overlap between the query and the items in the search index. However, this often comes at the expense of precision. Remember from the Nazar Amulet example that in some languages, the emoji we were looking for didn't show up in the top 5 results. For this application, that is not a big problem, since it is not cognitively demanding to quickly scan through emojis to find the one we want, even if it is ranked in the 50th position. But in other cases, such as searching through long documents, users may not have the patience or the resources to skim through dozens of documents. Developers need to keep users' cognitive and resource constraints in mind when building search engines. Some of the design choices I made for the Emojeez search engine may not work as well for other applications.
Another thing to mention is that AI models are known to learn socio-cultural biases from their training data. There is a large volume of documented research showing how modern language technology can amplify gender stereotypes and be unfair to minorities. So, we need to be aware of these issues and do our best to tackle them when deploying AI in the real world. If you notice such unwanted biases or unfair behaviors in Emojeez, please let me know and I will do my best to address them.
Conclusion
Working on the Emojeez project was a fascinating journey that taught me a lot about how modern AI and NLP technologies can be employed to address the limitations of traditional keyword search. By harnessing the power of Large Language Models for enriching emoji metadata, multilingual transformers for creating semantic embeddings, and Qdrant for efficient vector search, I was able to create a search engine that makes emoji search more fun and accessible across 50+ languages. Although this project focuses on emoji search, the underlying technology has potential applications in multimodal search and recommendation systems.
For readers who are proficient in languages other than English, I am particularly interested in your feedback. Does Emojeez perform equally well in English and your native language? Did you notice any differences in quality or accuracy? Please give it a try and let me know what you think. Your insights are invaluable.
Thank you for reading, and I hope you enjoy exploring Emojeez as much as I enjoyed building it.
Happy Emoji search!
Note: Unless otherwise noted, all images are created by the author.