Semantic Search Engine for Emojis in 50+ Languages Using AI 😊🌍🚀

If you are on social media like Twitter or LinkedIn, you have probably noticed that emojis are creatively used in both informal and professional text-based communication. For example, the Rocket emoji is often used on LinkedIn to symbolize high aspirations and ambitious goals, while the Bullseye emoji is used in the context of achieving goals. Despite this growth of creative emoji use, most social media platforms lack a utility that assists users in choosing the right emoji to effectively communicate their message. I therefore decided to invest some time in a project I called Emojeez, an AI-powered engine for emoji search and retrieval. You can experience Emojeez live using this fun interactive demo.

In this article, I will discuss my experience and explain how I employed advanced natural language processing (NLP) technologies to develop a semantic search engine for emojis. Concretely, I will present a case study on embedding-based semantic search with the following steps:

How to use LLMs to generate semantically rich emoji descriptions
How to use Hugging Face Transformers for multilingual embeddings
How to integrate the Qdrant vector database to perform efficient semantic search

I made the full code for this project available on GitHub.

Inspiration

New ideas often begin with a spark of inspiration. For me, the spark came from Luciano Ramalho’s book Fluent Python. It is a fantastic read that I highly recommend for anyone who likes to write truly Pythonic code. In chapter 4 of his book, Luciano shows how to search over Unicode characters by querying their names in the Unicode standard. He created a Python utility that takes a query like "cat smiling" and retrieves all Unicode characters that have both "cat" and "smiling" in their names. Given the query "cat smiling", the utility retrieves three emojis: 😸, 😺, and 😻. Pretty cool, right?
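To make the idea concrete, here is a minimal sketch of name-based keyword search using Python's standard `unicodedata` module. It is written in the spirit of the utility described above, not copied from the book; the function name `find_by_name` and the restriction to the Emoticons block (U+1F600 to U+1F64F) are my own assumptions to keep the example small.

```python
import unicodedata
from typing import List


def find_by_name(*words: str, start: int = 0x1F600, stop: int = 0x1F64F) -> List[str]:
    """Return characters whose Unicode name contains every query word."""
    query = {word.upper() for word in words}
    matches = []
    for code_point in range(start, stop + 1):
        # Unassigned code points have no name; use "" as the fallback.
        name = unicodedata.name(chr(code_point), "")
        # Treat hyphens as spaces so "HEART-SHAPED" splits into two words.
        if name and query.issubset(name.replace("-", " ").split()):
            matches.append(chr(code_point))
    return matches


print(find_by_name("cat", "smiling"))  # ['😸', '😺', '😻']
```

Every word in the query must appear verbatim in the character name, which is exactly why this approach is precise but easy to miss things with.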

From there, I started thinking about how modern AI technology could be used to build an even better emoji search utility. By "better," I envisioned a search engine that not only has better emoji coverage but also supports user queries in multiple languages beyond English.

Limitations of Keyword Search

If you are an emoji enthusiast, you know that 😸, 😺, and 😻 aren’t the only smiley cat emojis out there. Some cat emojis are missing, such as 😹. This is a known limitation of keyword search algorithms, which rely on string matching to retrieve relevant items. Keyword (or lexical) search algorithms are known among information retrieval practitioners to have high precision but low recall. High precision means the retrieved items usually match the user query well. On the other hand, low recall means the algorithm might not retrieve all relevant items. In many cases, the lower recall is due to strict string matching. For example, the 😹 emoji does not have "smiling" in its name, which is "cat with tears of joy". Therefore, it cannot be retrieved with the query "cat smiling" if we require both terms, "cat" and "smiling", to appear in its name.
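As a small worked example of these two metrics, the sketch below computes precision and recall for the "cat smiling" query. The `relevant` set is an assumption made purely for illustration (a hand-picked list of smiling cats), not an official relevance judgment.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall for a single query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# What keyword search returns for "cat smiling".
retrieved = {"😸", "😺", "😻"}
# Assumed relevance judgments, for illustration only.
relevant = {"😸", "😺", "😻", "😹", "😼"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.60
```

Everything that was retrieved is relevant, so precision is perfect, but two of the five relevant emojis were never found.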

Another issue with lexical search is that it is usually language-specific. In Luciano’s Fluent Python example, you can’t find emojis using a query in another language because all Unicode characters, including emojis, have English names. To support other languages, we would need to translate each query into English first using machine translation. This would add more complexity and might not work well for all languages.

But hey, it’s 2024 and AI has come a long way. We now have solutions to address these limitations. In the rest of this article, I will show you how.

Embedding-based Semantic Search

In recent years, a new search paradigm has emerged with the popularity of deep neural networks for NLP. In this paradigm, the search algorithm does not look at the strings that make up the items in the search database or the query. Instead, it operates on numerical representations of text, known as vector embeddings. In embedding-based search algorithms, the search items, whether text documents or visual images, are first converted into data points in a vector space such that semantically relevant items are nearby. Embeddings enable us to perform similarity search based on the meaning of the emoji description rather than the keywords in its name. Because they retrieve items based on semantic similarity rather than keyword similarity, embedding-based search algorithms are known as semantic search.
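The toy example below sketches what "nearby in a vector space" means in practice. The three-dimensional vectors are made up by hand for illustration; real sentence embeddings have hundreds of dimensions and come from a trained model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors standing in for embeddings of emoji descriptions.
toy_index: Dict[str, List[float]] = {
    "smiling cat": [0.9, 0.8, 0.1],
    "cat with tears of joy": [0.8, 0.9, 0.2],
    "rocket": [0.1, 0.0, 0.9],
}

# Pretend this is the embedding of the query "happy kitten".
query_vector = [0.85, 0.85, 0.1]

# Rank items by similarity to the query, most similar first.
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query_vector, toy_index[name]),
    reverse=True,
)
print(ranked)
```

Note that the query shares no words with any item in the index; the ranking depends only on where the vectors point.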

Using semantic search for emoji retrieval solves two problems:

We can go beyond keyword matching and use semantic similarity between emoji descriptions and user queries. This improves the coverage of the retrieved emojis, leading to higher recall.
If we represent emojis as data points in a multilingual embedding space, we can enable user queries written in languages other than English, without needing translation into English. That is very cool, isn’t it? Let’s see how.

Step 1: Generating Rich Emoji Descriptions using LLMs

If you use social media, you probably know that many emojis are almost never used literally. For example, 🍆 and 🍑 rarely denote an eggplant and a peach. Social media users are very creative in assigning meanings to emojis that go beyond their literal interpretation. This creativity limits the expressiveness of emoji names in the Unicode standard. A notable example is the 🌈 emoji, which is described in the Unicode name simply as rainbow, yet it is commonly used in contexts related to diversity, peace, and the LGBTQ+ community.

To build a useful search engine, we need a rich semantic description for each emoji that defines what the emoji represents and what it symbolizes. Given that there are more than 5000 emojis in the current Unicode standard, doing this manually is not feasible. Luckily, we can employ Large Language Models (LLMs) to assist us in generating metadata for each emoji. Since LLMs are trained on large portions of the web, they have likely seen how each emoji is used in context.

For this task, I used the Llama 3 LLM to generate metadata for each emoji. I wrote a prompt to define the task and what the LLM is expected to do. As illustrated in the figure below, the LLM generated a rich semantic description for the Bullseye emoji. These descriptions are more suitable for semantic search compared to Unicode names. I released the LLM-generated descriptions as a Hugging Face dataset.
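For readers who want to try something similar, here is a rough sketch of how such a prompt could be assembled per emoji. The template wording, the `PROMPT_TEMPLATE` constant, and the `build_description_prompt` helper are all hypothetical; they are not the prompt used for Emojeez, and the call to the LLM itself is left out.

```python
# Hypothetical prompt template; the actual Emojeez prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are an expert on how emojis are used online.\n"
    "Emoji: {emoji}\n"
    "Name: {name}\n"
    "Describe what this emoji depicts and what it commonly symbolizes "
    "in social media posts. Answer in two or three sentences."
)


def build_description_prompt(emoji: str, name: str) -> str:
    """Fill the template for one emoji; the result is sent to the LLM."""
    return PROMPT_TEMPLATE.format(emoji=emoji, name=name)


prompt = build_description_prompt("🎯", "bullseye")
print(prompt)
```

Keeping the template separate from the LLM call makes it easy to loop over every emoji and to tweak the wording without touching the rest of the pipeline.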

Step 2: Representing Emojis as Embeddings using Sentence Transformers

Now that we have a rich semantic description for each emoji in the Unicode standard, the next step is to represent each emoji as a vector embedding in a multidimensional space that captures the meaning of the emoji description. For this task, I used a multilingual transformer based on the BERT architecture, fine-tuned for sentence similarity across 50 languages. You can see the supported languages in the model card on the Hugging Face Hub.

So far, I have only discussed the embedding of emoji descriptions generated by the LLM, which are in English. But how can we support languages other than English?

Well, here’s where the magic of multilingual transformers comes in. The multilingual support is enabled through the embedding space itself. This means we can take user queries in any of the 50 supported languages and match them to emojis based on their English descriptions. The multilingual sentence encoder (or embedding model) maps semantically similar text phrases to nearby points in its embedding space. Let me show you what I mean with the following illustration.

In the figure above, we see that semantically similar phrases end up being data points that are nearby in the embedding space, even if they are expressed in different languages. Multilingual sentence Transformers therefore enable cross-lingual search applications, where user queries and indexed search items do not have to be expressed in the same language.
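The sketch below shows one way to wire a multilingual sentence encoder into a ranking function. The model name `paraphrase-multilingual-MiniLM-L12-v2` is an assumption on my part (a publicly available multilingual Sentence Transformers model), not necessarily the one behind Emojeez, and the helper names `load_multilingual_encoder` and `rank_emojis` are made up for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# An encoder is any function that maps a text to a vector.
Encoder = Callable[[str], Sequence[float]]


def load_multilingual_encoder(
    model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
) -> Encoder:
    """Wrap a Sentence Transformers model as a text -> vector function.

    The default model name is an assumption made for illustration.
    """
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda text: model.encode(text, normalize_embeddings=True).tolist()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_emojis(
    encode: Encoder, query: str, descriptions: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Rank emojis by similarity between the query and each description."""
    query_vector = encode(query)
    scored = [
        (emoji, cosine(query_vector, encode(text)))
        for emoji, text in descriptions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the real encoder, a call such as `rank_emojis(load_multilingual_encoder(), "Vor dem bösen Blick schützen", descriptions)` compares a German query directly against English descriptions, with no translation step in between.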

Step 3: Integrating Qdrant’s Vector Database

Once we have our emojis represented as vector embeddings, the next step is to build an index over these embeddings in a way that allows for efficient search operations. For this purpose, I chose to use Qdrant, an open-source vector similarity search engine that provides high-performance search capabilities.

Setting up Qdrant for this task is as simple as the code snippet below (you can also check out this Jupyter Notebook).

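import pickle
from typing import Any, Dict

import numpy as np
from qdrant_client import QdrantClient, models

# Load the emoji dictionary from a pickle file.
# `file_path` must point to the pickled dictionary on your machine.
with open(file_path, 'rb') as file:
    emoji_dict: Dict[str, Dict[str, Any]] = pickle.load(file)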

# Setup the Qdrant client and populate the database
vector_DB_client = QdrantClient(":memory:")

embedding_dict = {
    emoji: np.array(metadata['embedding']) 
    for emoji, metadata in emoji_dict.items()
}

# Remove the embeddings from the dictionary so it can be used 
# as payload in Qdrant
for emoji in list(emoji_dict):
    del emoji_dict[emoji]['embedding']

embedding_dim: int = next(iter(embedding_dict.values())).shape[0]

# Create a new collection in Qdrant
vector_DB_client.create_collection(
    collection_name="EMOJIS",
    vectors_config=models.VectorParams(
        size=embedding_dim, 
        distance=models.Distance.COSINE
    ),
)

# Upload vectors to the collection
vector_DB_client.upload_points( 
    collection_name="EMOJIS",
    points=[
        models.PointStruct(
            id=idx, 
            vector=embedding_dict[emoji].tolist(),
            payload=emoji_dict[emoji]
        )
        for idx, emoji in enumerate(emoji_dict)
    ],
)
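The snippet above assumes that `emoji_dict` maps each emoji to a metadata dictionary that includes an `'embedding'` entry. The toy records below sketch that shape. The `'Emoji'` key matches the payload field read later in this article, while the `'Description'` field, its text, and the three-dimensional vectors are hypothetical stand-ins.

```python
from typing import Any, Dict, List

# Toy records in the assumed shape of the pickled dictionary.
emoji_dict: Dict[str, Dict[str, Any]] = {
    "🎯": {
        "Emoji": "🎯",
        "Description": "hypothetical description of the bullseye emoji",
        "embedding": [0.1, 0.2, 0.3],
    },
    "🚀": {
        "Emoji": "🚀",
        "Description": "hypothetical description of the rocket emoji",
        "embedding": [0.3, 0.1, 0.5],
    },
}

# Vectors go into the index; everything else becomes the payload.
embedding_dict: Dict[str, List[float]] = {
    emoji: metadata["embedding"] for emoji, metadata in emoji_dict.items()
}
payload_dict: Dict[str, Dict[str, Any]] = {
    emoji: {key: value for key, value in metadata.items() if key != "embedding"}
    for emoji, metadata in emoji_dict.items()
}

embedding_dim = len(next(iter(embedding_dict.values())))
print(embedding_dim)  # 3
```

This variant builds a separate payload dictionary instead of deleting keys in place, which leaves the loaded data untouched if you need the embeddings again later.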

Now the search index vector_DB_client is ready to take queries. All we need to do is to transform the incoming user query into a vector embedding using the same embedding model we used to embed the emoji descriptions. This can be done through the function below.

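from typing import List

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer


def retrieve_relevant_emojis(
        embedding_model: SentenceTransformer,
        vector_DB_client: QdrantClient,
        query: str,
        num_to_retrieve: int) -> List[models.ScoredPoint]:
    """
    Return the Qdrant hits (payload and similarity score) most relevant
    to the query, using the sentence encoder and Qdrant.
    """

    # Embed the query with the same model used for the emoji descriptions
    query_vector = embedding_model.encode(query).tolist()

    # Nearest-neighbor search over the indexed emoji embeddings
    hits = vector_DB_client.search(
        collection_name="EMOJIS",
        query_vector=query_vector,
        limit=num_to_retrieve,
    )

    return hits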

To further show the retrieved emojis, their similarity score with the query, and their Unicode names, I wrote the following helper function.

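import emoji as em  # the `emoji` package, used to look up emoji names


def show_top_10(query: str) -> None:
    """
    Show emojis that are most relevant to the query.
    """
    # `sentence_encoder` and `vector_DB_client` are assumed to be defined
    # at module level (the embedding model and the Qdrant client)
    emojis = retrieve_relevant_emojis(
        sentence_encoder,
        vector_DB_client,
        query,
        num_to_retrieve=10
    )

    for i, hit in enumerate(emojis, start=1):

        emoji_char = hit.payload['Emoji']
        score = hit.score

        # Pad the emoji column with three trailing spaces
        space = len(emoji_char) + 3

        # Turn underscores into spaces and upper-case the emoji name
        unicode_desc = ' '.join(
           em.demojize(emoji_char).split('_')
        ).upper()

        print(f"{i:<3} {emoji_char:<{space}}", end='')
        print(f"{score:<7.3f}", end='')
        # Slice off the surrounding colons before printing the name
        print(f"{unicode_desc[1:-1]:<55}")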
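To see what the format specifiers in this helper produce without running the whole pipeline, here is a small standalone sketch. The `format_row` function is a hypothetical extraction of the printing logic, and the shortcode `:cat_with_wry_smile:` is assumed to be what the `emoji` package returns for 😼.

```python
def format_row(rank: int, emoji_char: str, score: float, shortcode: str) -> str:
    """Format one result row using the same layout as show_top_10."""
    # Emoji column: the emoji followed by three spaces of padding.
    space = len(emoji_char) + 3
    # ":cat_with_wry_smile:" -> "CAT WITH WRY SMILE"
    name = " ".join(shortcode.strip(":").split("_")).upper()
    return f"{rank:<3} {emoji_char:<{space}}{score:<7.3f}{name}"


row = format_row(1, "😼", 0.651, ":cat_with_wry_smile:")
print(row)  # 1   😼   0.651  CAT WITH WRY SMILE
```

The rank is left-aligned in three characters and the score is printed with three decimals in a seven-character column, which keeps the names aligned for ranks 1 through 10.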

Now everything is set up, and we can look at a few examples. Remember the "cat smiling" query from Luciano’s book? Let’s see how semantic search differs from keyword search.

>>> show_top_10('cat smiling')
1   😼   0.651  CAT WITH WRY SMILE
2   😸   0.643  GRINNING CAT WITH SMILING EYES
3   😹   0.611  CAT WITH TEARS OF JOY
4   😻   0.603  SMILING CAT WITH HEART-EYES
5   😺   0.596  GRINNING CAT
6   🐱   0.522  CAT FACE
7   🐈   0.513  CAT
8   🐈‍⬛   0.495  BLACK CAT
9   😽   0.468  KISSING CAT
10  🐆   0.452  LEOPARD

Awesome! Not only did we get the cat emojis 😸, 😺, and 😻 that keyword search retrieved, but we also got other smiley cats such as 😼 and 😹. This showcases the higher recall, or higher coverage of the retrieved items, I mentioned earlier. Indeed, more cats is always better!

The Real Power of Semantic Search

The previous "cat smiling" example shows how embedding-based semantic search can retrieve a broader and more meaningful set of items, improving the overall search experience. However, I don’t think this example truly shows the power of semantic search.

Imagine looking for something but not knowing its name. For example, take the 🧿 object. Do you know what it’s called in English? I sure didn’t. But I know a bit about it. In Middle Eastern and Central Asian cultures, the 🧿 is believed to protect against the evil eye. So, I knew what it does but not what it’s called.

Let’s see if we can find the emoji with our search engine by describing it using the query "protect from evil eye".

>>> show_top_10('protect from evil eye')
1   🧿   0.409  NAZAR AMULET
2   👓   0.405  GLASSES
3   🥽   0.387  GOGGLES
4   👁   0.383  EYE
5   🦹🏻   0.382  SUPERVILLAIN LIGHT SKIN TONE
6   👀   0.374  EYES
7   🦹🏿   0.370  SUPERVILLAIN DARK SKIN TONE
8   🛡️   0.369  SHIELD
9   🦹🏼   0.366  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10  🦹🏻‍♂️   0.364  MAN SUPERVILLAIN LIGHT SKIN TONE

And voilà! It turns out that the 🧿 is actually called Nazar Amulet. I learned something new.

Going Beyond English

One of the features I really wanted this search engine to have is support for as many languages besides English as possible. So far, we have not tested that. Let’s test the multilingual capabilities by translating the Nazar Amulet query "protect from evil eye" into other languages and using each translation as a query, one language at a time. Here are the results for some languages.

Arabic

>>> show_top_10('يحمي من العين الشريرة') # Arabic
1   🧿   0.442  NAZAR AMULET
2   👓   0.430  GLASSES
3   👁   0.414  EYE
4   🥽   0.403  GOGGLES
5   👀   0.403  EYES
6   🦹🏻   0.398  SUPERVILLAIN LIGHT SKIN TONE
7   🙈   0.394  SEE-NO-EVIL MONKEY
8   🫣   0.387  FACE WITH PEEKING EYE
9   🧛🏻   0.385  VAMPIRE LIGHT SKIN TONE
10  🦹🏼   0.383  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE

German

>>> show_top_10('Vor dem bösen Blick schützen') # German
1   😷   0.369  FACE WITH MEDICAL MASK
2   🫣   0.364  FACE WITH PEEKING EYE
3   🛡️   0.360  SHIELD
4   🙈   0.359  SEE-NO-EVIL MONKEY
5   👀   0.353  EYES
6   🙉   0.350  HEAR-NO-EVIL MONKEY
7   👁   0.346  EYE
8   🧿   0.345  NAZAR AMULET
9   💂🏿‍♀️   0.345  WOMAN GUARD DARK SKIN TONE
10  💂🏿‍♀️   0.345  WOMAN GUARD DARK SKIN TONE

Greek

>>> show_top_10('Προστατέψτε από το κακό μάτι') # Greek
1   👓   0.497  GLASSES
2   🥽   0.484  GOGGLES
3   👁   0.452  EYE
4   🕶️   0.430  SUNGLASSES
5   🕶   0.430  SUNGLASSES
6   👀   0.429  EYES
7   👁️   0.415  EYE
8   🧿   0.411  NAZAR AMULET
9   🫣   0.404  FACE WITH PEEKING EYE
10  😷   0.391  FACE WITH MEDICAL MASK

Bulgarian

>>> show_top_10('Защитете от лошото око') # Bulgarian
1   👓   0.475  GLASSES
2   🥽   0.452  GOGGLES
3   👁   0.448  EYE
4   👀   0.418  EYES
5   👁️   0.412  EYE
6   🫣   0.397  FACE WITH PEEKING EYE
7   🕶️   0.387  SUNGLASSES
8   🕶   0.387  SUNGLASSES
9   😝   0.375  SQUINTING FACE WITH TONGUE
10  🧿   0.373  NAZAR AMULET

Chinese

>>> show_top_10('防止邪眼') # Chinese
1   👓   0.425  GLASSES
2   🥽   0.397  GOGGLES
3   👁   0.392  EYE
4   🧿   0.383  NAZAR AMULET
5   👀   0.380  EYES
6   🙈   0.370  SEE-NO-EVIL MONKEY
7   😷   0.369  FACE WITH MEDICAL MASK
8   🕶️   0.363  SUNGLASSES
9   🕶   0.363  SUNGLASSES
10  🫣   0.360  FACE WITH PEEKING EYE

Japanese

>>> show_top_10('邪眼から守る') # Japanese
1   🙈   0.379  SEE-NO-EVIL MONKEY
2   🧿   0.379  NAZAR AMULET
3   🙉   0.370  HEAR-NO-EVIL MONKEY
4   😷   0.363  FACE WITH MEDICAL MASK
5   🙊   0.363  SPEAK-NO-EVIL MONKEY
6   🫣   0.355  FACE WITH PEEKING EYE
7   🛡️   0.355  SHIELD
8   👁   0.351  EYE
9   🦹🏼   0.350  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10  👓   0.350  GLASSES

For languages as diverse as Arabic, German, Greek, Bulgarian, Chinese, and Japanese, the 🧿 emoji always appears in the top 10! This is pretty fascinating, since these languages have different linguistic features and writing scripts. We owe this to the massive multilinguality of our sentence Transformer.

Limits of AI

The last thing I want to mention is that no technology, no matter how advanced, is perfect. Semantic search is great for improving the recall of information retrieval systems. This means we can retrieve more relevant items even if there is no keyword overlap between the query and the items in the search index. However, this comes at the expense of precision. Remember from the 🧿 emoji example that in some languages, the emoji we were looking for didn’t show up in the top 5 results. For this application, this is not a big problem since it’s not cognitively demanding to quickly scan through emojis to find the one we desire, even if it’s ranked at the 50th position. But in other cases, such as searching through long documents, users may not have the patience or the resources to skim through dozens of documents. Developers need to keep in mind users’ cognitive and resource constraints when building search engines. Some of the design choices I made for the Emojeez search engine may not work as well for other applications.
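One way to quantify this trade-off is to look at where the target emoji lands in each result list. The sketch below takes the rank of NAZAR AMULET from the seven result lists shown earlier in this article and computes two common ranking measures, hits@k and mean reciprocal rank. The helper names are my own, and seven queries are far too few for a real evaluation.

```python
from typing import Dict

# Rank of NAZAR AMULET in each result list shown in this article.
nazar_rank: Dict[str, int] = {
    "English": 1,
    "Arabic": 1,
    "German": 8,
    "Greek": 8,
    "Bulgarian": 10,
    "Chinese": 4,
    "Japanese": 2,
}


def hits_at_k(ranks: Dict[str, int], k: int) -> int:
    """Count the queries whose target appears within the top k results."""
    return sum(1 for rank in ranks.values() if rank <= k)


def mean_reciprocal_rank(ranks: Dict[str, int]) -> float:
    """Average of 1/rank; equals 1.0 only if the target is always first."""
    return sum(1.0 / rank for rank in ranks.values()) / len(ranks)


print(hits_at_k(nazar_rank, 5), "of", len(nazar_rank))  # 4 of 7
print(round(mean_reciprocal_rank(nazar_rank), 3))  # 0.443
```

The target is always within the top 10, but only four of the seven queries place it in the top 5, which matches the observation above.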

Another thing to mention is that AI models are known to learn socio-cultural biases from their training data. There is a large volume of documented research showing how modern language technology can amplify gender stereotypes and be unfair to minorities. So, we need to be aware of these issues and do our best to tackle them when deploying AI in the real world. If you notice such unwanted biases and unfair behaviors in Emojeez, please let me know and I will do my best to address them.

Conclusion

Working on the Emojeez project was a fascinating journey that taught me a lot about how modern AI and NLP technologies can be employed to address the limitations of traditional keyword search. By harnessing the power of Large Language Models for enriching emoji metadata, multilingual transformers for creating semantic embeddings, and Qdrant for efficient vector search, I was able to create a search engine that makes emoji search more fun and accessible across 50+ languages. Although this project focuses on emoji search, the underlying technology has potential applications in multimodal search and recommendation systems.

For readers who are proficient in languages other than English, I am particularly interested in your feedback. Does Emojeez perform equally well in English and your native language? Did you notice any differences in quality or accuracy? Please give it a try and let me know what you think. Your insights are invaluable.

Thank you for reading, and I hope you enjoy exploring Emojeez as much as I enjoyed building it.

Happy Emoji search!