
Semantic Search Engine for Emojis in 50+ Languages Using AI 😊🌍🚀

Develop an AI-powered semantic search for emojis using Python and open-source NLP libraries

If you are on social media like Twitter or LinkedIn, you have probably noticed that emojis are used creatively in both informal and professional text-based communication. For example, the Rocket emoji 🚀 is often used on LinkedIn to symbolize high aspirations and ambitious goals, while the Bullseye emoji 🎯 is used in the context of achieving goals. Despite this growth of creative emoji use, most social media platforms lack a utility that helps users choose the right emoji to communicate their message effectively. I therefore decided to invest some time in a project I called Emojeez 💎, an AI-powered engine for emoji search and retrieval. You can experience Emojeez 💎 live using this fun interactive demo.

In this article, I will discuss my experience and explain how I employed advanced natural language processing (NLP) technologies to develop a semantic search engine for emojis. Concretely, I will present a case study on embedding-based semantic search with the following steps:

  1. How to use LLMs 🦜 to generate semantically rich emoji descriptions
  2. How to use Hugging Face 🤗 Transformers for multilingual embeddings
  3. How to integrate the Qdrant 🧑🏻‍🚀 vector database to perform efficient semantic search

I made the full code for this project available on GitHub.

Inspiration 💡

New ideas often begin with a spark of inspiration. For me, the spark came from Luciano Ramalho’s book Fluent Python. It is a fantastic read that I highly recommend for anyone who likes to write truly Pythonic code. In chapter 4 of his book, Luciano shows how to search over Unicode characters by querying their names in the Unicode standard. He created a Python utility that takes a query like "cat smiling" and retrieves all Unicode characters that have both "cat" and "smiling" in their names. Given the query "cat smiling", the utility retrieves three emojis: 😻, 😺, and 😸. Pretty cool, right?

From there, I started thinking about how modern AI technology could be used to build an even better emoji search utility. By "better," I envisioned a search engine that not only has better emoji coverage but also supports user queries in multiple languages beyond English.

Limitations of Keyword Search 😓

If you are an emoji enthusiast, you know that 😻, 😺, and 😸 aren’t the only smiley cat emojis out there. Some cat emojis are missing, notably 😼 and 😹. This is a known limitation of keyword search algorithms, which rely on string matching to retrieve relevant items. Keyword (or lexical) search algorithms are known among information retrieval practitioners to have high precision but low recall. High precision means the retrieved items usually match the user query well. On the other hand, low recall means the algorithm might not retrieve all relevant items. In many cases, the lower recall is due to string matching. For example, the emoji 😹 does not have "smiling" in its name, which is "cat with tears of joy". Therefore, it cannot be retrieved with the query "cat smiling" if we search for both terms, "cat" and "smiling", in its name.
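The recall problem is easy to reproduce with Python's standard library. The sketch below is a simplified take on the name-matching idea, not the utility from the book: the function name `find_by_name_words` and the restriction to the Emoticons block (U+1F600 to U+1F64F) are assumptions made to keep the example short.

```python
import unicodedata
from typing import List


def find_by_name_words(query: str, start: int = 0x1F600, end: int = 0x1F64F) -> List[str]:
    """Return characters whose Unicode name contains every word of the query."""
    wanted = set(query.upper().split())
    matches = []
    for code_point in range(start, end + 1):
        # Unassigned code points have no name; fall back to an empty string.
        name = unicodedata.name(chr(code_point), "")
        # Keep the character only if ALL query words appear in its name.
        if wanted <= set(name.split()):
            matches.append(chr(code_point))
    return matches


results = find_by_name_words("cat smiling")
for char in results:
    print(char, unicodedata.name(char))
```

In Python's bundled Unicode database, 😹 is named CAT FACE WITH TEARS OF JOY and 😼 is named CAT FACE WITH WRY SMILE, so neither contains the word SMILING and both are skipped by the all-words match.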

Another issue with lexical search is that it is usually language-specific. In Luciano’s Fluent Python example, you can’t find emojis using a query in another language because all Unicode characters, including emojis, have English names. To support other languages, we would need to translate each query into English first using machine translation. This would add more complexity and might not work well for all languages.

But hey, it’s 2024 and AI has come a long way. We now have solutions to address these limitations. In the rest of this article, I will show you how.

Embedding-based Semantic Search ✨

In recent years, a new search paradigm has emerged with the popularity of deep neural networks for NLP. In this paradigm, the search algorithm does not look at the strings that make up the items in the search database or the query. Instead, it operates on numerical representations of text, known as vector embeddings. In embedding-based search algorithms, the search items, whether text documents or visual images, are first converted into data points in a vector space such that semantically relevant items are nearby. Embeddings enable us to perform similarity search based on the meaning of the emoji description rather than the keywords in its name. Because they retrieve items based on semantic similarity rather than keyword similarity, embedding-based search algorithms are known as semantic search.
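To make "nearby in a vector space" concrete, here is a minimal sketch of cosine similarity, the measure used later in the Qdrant collection. The three-dimensional vectors are hand-made toy values, not real model output, and the phrases are placeholders chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumption: made-up numbers, only for illustration).
toy_embeddings = {
    "cat with a happy face": [0.9, 0.8, 0.1],
    "joyful kitten": [0.8, 0.9, 0.2],
    "rocket launch": [0.1, 0.2, 0.9],
}

query_vector = toy_embeddings["cat with a happy face"]
sim_kitten = cosine_similarity(query_vector, toy_embeddings["joyful kitten"])
sim_rocket = cosine_similarity(query_vector, toy_embeddings["rocket launch"])
print(f"kitten: {sim_kitten:.3f}, rocket: {sim_rocket:.3f}")
```

The two cat phrases share no keywords, yet their toy vectors point in almost the same direction, which is the property a real embedding model is trained to produce.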

Using semantic search for emoji retrieval solves two problems:

  1. We can go beyond keyword matching and use semantic similarity between emoji descriptions and user queries. This improves the coverage of the retrieved emojis, leading to higher recall.
  2. If we represent emojis as data points in a multilingual embedding space, we can enable user queries written in languages other than English, without needing translation into English. That is very cool, isn’t it? Let’s see how 👀

Step 1: Generating Rich Emoji Descriptions using LLMs 🦜

If you use social media, you probably know that many emojis are almost never used literally. For example, 🍆 and 🍑 rarely denote an eggplant and a peach. Social media users are very creative in assigning meanings to emojis that go beyond their literal interpretation. This creativity limits the expressiveness of emoji names in the Unicode standard. A notable example is the 🌈 emoji, which is described in the Unicode name simply as rainbow, yet it is commonly used in contexts related to diversity, peace, and the LGBTQ+ community.

To build a useful search engine, we need a rich semantic description for each emoji that defines what the emoji represents and what it symbolizes. Given that there are more than 5000 emojis in the current Unicode standard, doing this manually is not feasible. Luckily, we can employ Large Language Models (LLMs) to assist us in generating metadata for each emoji. Since LLMs are trained on large amounts of web text, they have likely seen how each emoji is used in context.

For this task, I used the 🦙 Llama 3 LLM to generate metadata for each emoji. I wrote a prompt to define the task and what the LLM is expected to do. As illustrated in the figure below, the LLM generated a rich semantic description for the Bullseye 🎯 emoji. These descriptions are more suitable for semantic search than Unicode names. I released the LLM-generated descriptions as a Hugging Face dataset.
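The exact prompt is not reproduced in this article, so the following is only a hypothetical sketch of what such a prompt template could look like. The function name `build_description_prompt` and the wording of the instructions are assumptions for illustration, and no particular LLM API is called.

```python
def build_description_prompt(emoji_char: str, unicode_name: str) -> str:
    """Assemble a hypothetical instruction prompt for a single emoji."""
    return (
        "You are an expert on how emojis are used on social media.\n"
        f"Emoji: {emoji_char}\n"
        f"Unicode name: {unicode_name}\n"
        "Describe in two or three sentences what this emoji depicts, "
        "what it commonly symbolizes, and the contexts in which people use it."
    )


# The resulting string would be sent to the LLM of your choice.
prompt = build_description_prompt("🎯", "bullseye")
print(prompt)
```

Passing the Unicode name alongside the character gives the model a literal anchor, while the instruction asks it to go beyond that literal meaning.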

Step 2: Representing Emojis as Embeddings using Sentence Transformers 🔄

Now that we have a rich semantic description for each emoji in the Unicode standard, the next step is to represent each emoji as a vector embedding in a multidimensional space that captures the meaning of the emoji description. For this task, I used a multilingual transformer based on the BERT architecture, fine-tuned for sentence similarity across 50 languages. You can see the supported languages in the model card in the Hugging Face 🤗 library.
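A minimal sketch of this embedding step might look as follows. The model name `paraphrase-multilingual-MiniLM-L12-v2` is an assumption (the article does not name the checkpoint here), and the helper names `load_multilingual_encoder` and `embed_descriptions` are hypothetical.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[float]]


def load_multilingual_encoder(
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",  # assumed model
) -> Encoder:
    """Wrap a Sentence Transformers model as a text -> vector function."""
    # Imported lazily so the rest of the sketch works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda text: model.encode(text).tolist()


def embed_descriptions(encode: Encoder, descriptions: Dict[str, str]) -> Dict[str, List[float]]:
    """Map each emoji to the embedding of its (English) description."""
    return {emoji_char: encode(text) for emoji_char, text in descriptions.items()}


# Usage (downloads the model on first call; the description is a placeholder):
# encode = load_multilingual_encoder()
# embeddings = embed_descriptions(encode, {"🎯": "<LLM-generated description>"})
```

Keeping the encoder behind a plain function makes it easy to swap in a different model, or a stub during testing, without touching the rest of the pipeline.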

So far, I have only discussed the embedding of emoji descriptions generated by the LLM, which are in English. But how can we support languages other than English?

Well, here’s where the magic of multilingual transformers comes in. The multilingual support is enabled through the embedding space itself. This means we can take user queries in any of the 50 supported languages and match them to emojis based on their English descriptions. The multilingual sentence encoder (or embedding model) maps semantically similar text phrases to nearby points in its embedding space. Let me show you what I mean with the following illustration.

In the figure above, we see that semantically similar phrases end up as nearby data points in the embedding space, even if they are expressed in different languages. Multilingual sentence Transformers thus enable cross-lingual search applications: user queries and indexed search items do not have to be expressed in the same language.

Step 3: Integrating Qdrant’s Vector Database 🧑🏻‍🚀

Once we have our emojis represented as vector embeddings, the next step is to build an index over these embeddings in a way that allows for efficient search operations. For this purpose, I chose to use Qdrant, an open-source vector similarity search engine that provides high-performance search capabilities.

Setting up Qdrant for this task is as simple as the code snippet below (you can also check out this Jupyter Notebook).

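import pickle
from typing import Any, Dict

import numpy as np
from qdrant_client import QdrantClient, models

# Load the emoji dictionary from a pickle file
# (file_path must point to the pickled dictionary of emoji metadata)
with open(file_path, 'rb') as file:
    emoji_dict: Dict[str, Dict[str, Any]] = pickle.load(file)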

# Setup the Qdrant client and populate the database
vector_DB_client = QdrantClient(":memory:")

embedding_dict = {
    emoji: np.array(metadata['embedding']) 
    for emoji, metadata in emoji_dict.items()
}

# Remove the embeddings from the dictionary so it can be used 
# as payload in Qdrant
for emoji in list(emoji_dict):
    del emoji_dict[emoji]['embedding']

embedding_dim: int = next(iter(embedding_dict.values())).shape[0]

# Create a new collection in Qdrant
vector_DB_client.create_collection(
    collection_name="EMOJIS",
    vectors_config=models.VectorParams(
        size=embedding_dim, 
        distance=models.Distance.COSINE
    ),
)

# Upload vectors to the collection
vector_DB_client.upload_points( 
    collection_name="EMOJIS",
    points=[
        models.PointStruct(
            id=idx, 
            vector=embedding_dict[emoji].tolist(),
            payload=emoji_dict[emoji]
        )
        for idx, emoji in enumerate(emoji_dict)
    ],
)

Now the search index vector_DB_client is ready to take queries. All we need to do is transform the incoming user query into a vector embedding using the same embedding model we used to embed the emoji descriptions. This can be done through the function below.

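from typing import List

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

def retrieve_relevant_emojis(
        embedding_model: SentenceTransformer,
        vector_DB_client: QdrantClient,
        query: str,
        num_to_retrieve: int) -> List[models.ScoredPoint]:
    """
    Return the Qdrant hits for the emojis most relevant to the query,
    using the sentence encoder and Qdrant.
    """

    # Embed the query with the same model used for the emoji descriptions
    query_vector = embedding_model.encode(query).tolist()

    # Retrieve the nearest emoji embeddings by cosine similarity
    hits = vector_DB_client.search(
        collection_name="EMOJIS",
        query_vector=query_vector,
        limit=num_to_retrieve,
    )

    return hits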
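Conceptually, the search call above ranks every stored vector by its cosine similarity to the query vector and returns the best matches. The brute-force sketch below illustrates that idea on a toy index; it is not how Qdrant is implemented (Qdrant uses an HNSW index to avoid scanning every vector on large collections), and the function name `brute_force_search` and the two-dimensional vectors are assumptions for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def brute_force_search(
    query_vector: Sequence[float],
    index: Dict[str, Sequence[float]],
    limit: int,
) -> List[Tuple[str, float]]:
    """Score every item against the query and keep the `limit` best."""
    scored = ((cosine(query_vector, vec), item) for item, vec in index.items())
    top = heapq.nlargest(limit, scored)
    return [(item, score) for score, item in top]


# Toy index with made-up vectors (NOT real embeddings).
toy_index = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top_two = brute_force_search([1.0, 0.0], toy_index, limit=2)
print(top_two)
```

For a collection of a few thousand emojis a full scan like this would also be fast enough; the vector database mainly adds payload storage, filtering, and headroom for larger collections.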

To further show the retrieved emojis, their similarity score with the query, and their Unicode names, I wrote the following helper function.

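import emoji as em

def show_top_10(query: str) -> None:
    """
    Show emojis that are most relevant to the query.
    """
    # sentence_encoder is the SentenceTransformer model used for the
    # descriptions; vector_DB_client is the Qdrant client created above
    emojis = retrieve_relevant_emojis(
        sentence_encoder, 
        vector_DB_client, 
        query, 
        num_to_retrieve=10
    )

    for i, hit in enumerate(emojis, start=1):

        emoji_char = hit.payload['Emoji']
        score = hit.score

        # Pad the emoji column; some emojis span several code points
        space = len(emoji_char) + 3

        # Turn a shortcode like ':cat_with_wry_smile:' into upper-case words
        unicode_desc = ' '.join(
           em.demojize(emoji_char).split('_')
        ).upper()

        print(f"{i:<3} {emoji_char:<{space}}", end='')
        print(f"{score:<7.3f}", end= '')
        # Slice off the surrounding colons before printing the name
        print(f"{unicode_desc[1:-1]:<55}")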
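The name formatting in the helper is the least obvious part, so here it is in isolation. This sketch assumes the shortcode returned for an emoji looks like `:cat_with_wry_smile:` (colon-delimited, underscore-separated); the function name `format_emoji_name` is hypothetical.

```python
def format_emoji_name(shortcode: str) -> str:
    """Turn a ':snake_case:' shortcode into an upper-case display name."""
    # Replace underscores with spaces and upper-case the result ...
    spaced = " ".join(shortcode.split("_")).upper()
    # ... then drop the leading and trailing colon.
    return spaced[1:-1]


display_name = format_emoji_name(":cat_with_wry_smile:")
print(display_name)
```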

Now everything is set up, and we can look at a few examples. Remember the "cat smiling" query from Luciano’s book? Let’s see how semantic search is different from keyword search.

>>> show_top_10('cat smiling')
1   😼  0.651  CAT WITH WRY SMILE
2   😸  0.643  GRINNING CAT WITH SMILING EYES
3   😹  0.611  CAT WITH TEARS OF JOY
4   😻  0.603  SMILING CAT WITH HEART-EYES
5   😺  0.596  GRINNING CAT
6   🐱  0.522  CAT FACE
7   🐈  0.513  CAT
8   🐈‍⬛  0.495  BLACK CAT
9   😽  0.468  KISSING CAT
10  🐆  0.452  LEOPARD

Awesome! Not only did we get the expected cat emojis like 😸, 😺, and 😻, which the keyword search retrieved, but we also got the smiley cats 😼, 😹, 🐱, and 😽. This showcases the higher recall, or higher coverage of the retrieved items, I mentioned earlier. Indeed, more cats is always better!
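We can put rough numbers on this comparison. The sketch below computes precision and recall for both result lists, under the assumption that the relevant set is exactly the seven cats this article treats as smiley cats; a different relevance judgement would give different numbers.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Assumption: these seven emojis are the "relevant" smiley cats.
relevant = ["😸", "😺", "😻", "😼", "😹", "🐱", "😽"]

keyword_results = ["😻", "😺", "😸"]
semantic_results = ["😼", "😸", "😹", "😻", "😺", "🐱", "🐈", "🐈‍⬛", "😽", "🐆"]

keyword_pr = precision_recall(keyword_results, relevant)
semantic_pr = precision_recall(semantic_results, relevant)
print("keyword :", keyword_pr)
print("semantic:", semantic_pr)
```

Under this assumption, keyword search is perfectly precise but finds only three of the seven relevant cats, while the semantic top 10 finds all seven at the cost of three less relevant results.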

The Real Power of Semantic Search 🪄

The previous "cat smiling" example shows how embedding-based semantic search can retrieve a broader and more meaningful set of items, improving the overall search experience. However, I don’t think this example truly shows the power of semantic search.

Imagine looking for something but not knowing its name. For example, take the 🧿 object. Do you know what it’s called in English? I sure didn’t. But I know a bit about it. In Middle Eastern and Central Asian cultures, the 🧿 is believed to protect against the evil eye. So, I knew what it does but not what it’s called.

Let’s see if we can find the emoji 🧿 with our search engine by describing it using the query "protect from evil eye".

>>> show_top_10('protect from evil eye')
1   🧿  0.409  NAZAR AMULET
2   👓  0.405  GLASSES
3   🥽  0.387  GOGGLES
4   👁  0.383  EYE
5   🦹🏻  0.382  SUPERVILLAIN LIGHT SKIN TONE
6   👀  0.374  EYES
7   🦹🏿  0.370  SUPERVILLAIN DARK SKIN TONE
8   🛡️  0.369  SHIELD
9   🦹🏼  0.366  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10  🦹🏻‍♂  0.364  MAN SUPERVILLAIN LIGHT SKIN TONE

And voilà! It turns out that the 🧿 is actually called the Nazar Amulet. I learned something new 😄

Going Beyond English 🌍 🌏 🌎

One feature I really wanted this search engine to have is support for as many languages besides English as possible. So far, we have not tested that. Let’s test the multilingual capabilities using the Nazar Amulet 🧿 emoji: we translate the phrase "protect from evil eye" into other languages and use the translations as queries, one language at a time. Here are the results for some languages.

Arabic

>>> show_top_10('يحمي من العين الشريرة') # Arabic
1   🧿  0.442  NAZAR AMULET
2   👓  0.430  GLASSES
3   👁  0.414  EYE
4   🥽  0.403  GOGGLES
5   👀  0.403  EYES
6   🦹🏻  0.398  SUPERVILLAIN LIGHT SKIN TONE
7   🙈  0.394  SEE-NO-EVIL MONKEY
8   🫣  0.387  FACE WITH PEEKING EYE
9   🧛🏻  0.385  VAMPIRE LIGHT SKIN TONE
10  🦹🏼  0.383  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE

German

>>> show_top_10('Vor dem bösen Blick schützen') # German
1   😷  0.369  FACE WITH MEDICAL MASK
2   🫣  0.364  FACE WITH PEEKING EYE
3   🛡️  0.360  SHIELD
4   🙈  0.359  SEE-NO-EVIL MONKEY
5   👀  0.353  EYES
6   🙉  0.350  HEAR-NO-EVIL MONKEY
7   👁  0.346  EYE
8   🧿  0.345  NAZAR AMULET
9   💂🏿‍♀️  0.345  WOMAN GUARD DARK SKIN TONE
10  💂🏿‍♀  0.345  WOMAN GUARD DARK SKIN TONE

Greek

>>> show_top_10('Προστατέψτε από το κακό μάτι') # Greek
1   👓  0.497  GLASSES
2   🥽  0.484  GOGGLES
3   👁  0.452  EYE
4   🕶️  0.430  SUNGLASSES
5   🕶  0.430  SUNGLASSES
6   👀  0.429  EYES
7   👁️  0.415  EYE
8   🧿  0.411  NAZAR AMULET
9   🫣  0.404  FACE WITH PEEKING EYE
10  😷  0.391  FACE WITH MEDICAL MASK

Bulgarian

>>> show_top_10('Защитете от лошото око') # Bulgarian
1   👓  0.475  GLASSES
2   🥽  0.452  GOGGLES
3   👁  0.448  EYE
4   👀  0.418  EYES
5   👁️  0.412  EYE
6   🫣  0.397  FACE WITH PEEKING EYE
7   🕶️  0.387  SUNGLASSES
8   🕶  0.387  SUNGLASSES
9   😝  0.375  SQUINTING FACE WITH TONGUE
10  🧿  0.373  NAZAR AMULET

Chinese

>>> show_top_10('防止邪眼') # Chinese
1   👓  0.425  GLASSES
2   🥽  0.397  GOGGLES
3   👁  0.392  EYE
4   🧿  0.383  NAZAR AMULET
5   👀  0.380  EYES
6   🙈  0.370  SEE-NO-EVIL MONKEY
7   😷  0.369  FACE WITH MEDICAL MASK
8   🕶️  0.363  SUNGLASSES
9   🕶  0.363  SUNGLASSES
10  🫣  0.360  FACE WITH PEEKING EYE

Japanese

>>> show_top_10('邪眼から守る') # Japanese
1   🙈  0.379  SEE-NO-EVIL MONKEY
2   🧿  0.379  NAZAR AMULET
3   🙉  0.370  HEAR-NO-EVIL MONKEY
4   😷  0.363  FACE WITH MEDICAL MASK
5   🙊  0.363  SPEAK-NO-EVIL MONKEY
6   🫣  0.355  FACE WITH PEEKING EYE
7   🛡️  0.355  SHIELD
8   👁  0.351  EYE
9   🦹🏼  0.350  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE
10  👓  0.350  GLASSES

For languages as diverse as Arabic, German, Greek, Bulgarian, Chinese, and Japanese, the 🧿 emoji always appears in the top 10! This is pretty fascinating since these languages have different linguistic features and writing scripts, and it is possible thanks to the massive multilinguality of our 🤗 sentence Transformer.
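To summarize these six result lists in one number, we can compute the mean reciprocal rank (MRR) of the 🧿 emoji. The ranks below are read directly from the outputs above; the use of MRR is my addition for illustration and is not part of the original evaluation.

```python
# Rank of the Nazar Amulet emoji in each language's top-10 list above.
nazar_rank = {
    "Arabic": 1,
    "German": 8,
    "Greek": 8,
    "Bulgarian": 10,
    "Chinese": 4,
    "Japanese": 2,
}

# Mean reciprocal rank: average of 1/rank over all queries (1.0 is best).
mrr = sum(1 / rank for rank in nazar_rank.values()) / len(nazar_rank)

# Languages where the target emoji made it into the top 5.
in_top_5 = [lang for lang, rank in nazar_rank.items() if rank <= 5]

print(f"MRR: {mrr:.2f}")
print("Top 5 in:", ", ".join(in_top_5))
```

The target emoji lands in the top 5 for only three of the six languages, which shows that ranking quality varies noticeably from one language to another.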

Limits of AI 🙈

The last thing I want to mention is that no technology, no matter how advanced, is perfect. Semantic search is great for improving the recall of information retrieval systems. This means we can retrieve more relevant items even if there is no keyword overlap between the query and the items in the search index. However, this comes at the expense of precision. Remember from the 🧿 emoji example that in some languages, the emoji we were looking for didn’t show up in the top 5 results. For this application, this is not a big problem since it’s not cognitively demanding to quickly scan through emojis to find the one we desire, even if it’s ranked at the 50th position. But in other cases, such as searching through long documents, users may have neither the patience nor the resources to skim through dozens of documents. Developers need to keep users’ cognitive and resource constraints in mind when building search engines. Some of the design choices I made for the Emojeez 💎 search engine may not work as well for other applications.

Another thing to mention is that AI models are known to learn socio-cultural biases from their training data. There is a large volume of documented research showing how modern language technology can amplify gender stereotypes and be unfair to minorities. So, we need to be aware of these issues and do our best to tackle them when deploying AI in the real world. If you notice such unwanted biases and unfair behaviors in Emojeez 💎, please let me know and I will do my best to address them.

Conclusion

Working on the Emojeez 💎 project was a fascinating journey that taught me a lot about how modern AI and NLP technologies can be employed to address the limitations of traditional keyword search. By harnessing the power of Large Language Models for enriching emoji metadata, multilingual transformers for creating semantic embeddings, and Qdrant for efficient vector search, I was able to create a search engine that makes emoji search more fun and accessible across 50+ languages. Although this project focuses on emoji search, the underlying technology has potential applications in multimodal search and recommendation systems.

For readers who are proficient in languages other than English, I am particularly interested in your feedback. Does Emojeez 💎 perform equally well in English and your native language? Did you notice any differences in quality or accuracy? Please give it a try and let me know what you think. Your insights are invaluable.

Thank you for reading, and I hope you enjoy exploring Emojeez 💎 as much as I enjoyed building it.

Happy Emoji search! 📆😊🌍🚀

Note: Unless otherwise noted, all images are created by the author.

