Emojis in Your Data 🤓

Everything you need to know about emoji in your database or dataset.

Jonathan Law
Towards Data Science


Header created by the author

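Let's face it: it is 2021 and emojis are inevitable. You see them everywhere, from chats to product reviews and, in some cases, usernames. Today we will answer three questions: what emojis are, how they are stored in a database, and how important it is to retain them in your dataset or database.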

What are emojis

Everyone knows what emojis are, but what are they really? Are they images or fonts, and why do different systems show the same laughing emoji differently? For starters, an emoji is a glyph; think of it as a character in a font. Behind each laughing face is a hexadecimal code point.

Take the 🤓 “nerd face” emoji as an example: its hexadecimal code point (denoted by U+) is U+1F913, as listed in Emojipedia. Each code point refers to an entry in a universally understood dictionary called Unicode. If the dictionary has the word, you get its definition.
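To make the dictionary analogy concrete, here is a minimal Python sketch that looks up the code point and official Unicode name of a character. The helper name `describe` is made up for illustration, and the result depends on the Unicode version bundled with your Python installation.

```python
import unicodedata


def describe(char: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    code_point = ord(char)  # integer value of the code point
    # Fall back to a placeholder if this Python's Unicode tables lack the name.
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{code_point:04X} {name}"


nerd_face = "\U0001F913"  # the nerd face emoji, written as an escape
print(describe(nerd_face))  # U+1F913 NERD FACE
```

The glyph you see on screen still comes from your system's font; the code point and name are the only parts every system agrees on.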

If you try looking up a Chinese word in an English dictionary, you will not find its definition. The same applies to how Unicode works on your system: if your system does not have a glyph for a code point, it cannot display 🤓. And just as Cambridge and Oxford word the definition of the same English word differently, different systems have their own glyph designs without deviating from the original meaning.

How are emojis stored in a database

Now that you know how a computer system represents emojis, how are they stored in a database? We know that emojis are just hexadecimal code points, so do we need a special Unicode column type, or do we store them as strings and parse them as emojis later?

In most databases, you can tell the engine that a particular block of characters should be interpreted as Unicode code points rather than as literal text; this is called escaping. Below are a few examples of escaping and storing emojis in databases. Each database has slightly different requirements for escaped characters, so read each database's documentation on escaping carefully.

-- POSTGRES
SELECT ('\+01F913'), (U&'\+01F913'), (U&'\d83e\dd13')
-- POSTGRES RESULT: "\+01F913", "🤓", "🤓"
-- BIG QUERY
SELECT ('\U0001F913')
-- BIG QUERY RESULT: "🤓"

As shown in the Postgres example, the first column is stored literally as the string ‘\+01F913’. With the U& prefix in the second column, the database understands that this block of characters is not a regular string but a Unicode escape containing a hexadecimal code point.

Image by author

Notice that we added a 0 in front of 1F913. This is because the Postgres documentation specifies that this form of Unicode escape requires “a backslash followed by a plus sign followed by a six-digit hexadecimal code point number”. Our original code point has only 5 digits, so we zero-pad it to 6 digits without changing its meaning. The third column is an example of a surrogate pair: two 16-bit code units that together make up one code point. As mentioned in the same Postgres documentation, surrogate pairs (16 bit + 16 bit) exist to compose code points larger than U+FFFF (16 bit).

F (hex) = 1111 (binary) = 4 bit
FFFF = 1111 1111 1111 1111 = 16 bit

This increases the number of available code points to over a million. However, that is the least of our concerns, since Postgres combines the surrogate pair into a single code point before storing it. The final column is basically just storing the hexadecimal code in bytes.
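The surrogate pair in the third column can be derived by hand. The sketch below follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogate_pair` is hypothetical, and Python is used only for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F913)
print(f"{high:04X} {low:04X}")  # D83E DD13

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F913".encode("utf-16-be").hex())  # d83edd13

# Unicode code points run from U+0000 to U+10FFFF.
print(0x10FFFF + 1)  # 1114112
```

The result matches the `\d83e\dd13` escape used in the Postgres query, and the last line shows where the "over a million" figure comes from.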

The same goes for BigQuery: its documentation indicates that this Unicode escape requires 8 hexadecimal digits. Therefore, as shown in the example above, we zero-pad the code point to 0001F913.
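If you need to generate these escapes for many emojis, the zero-padding can be automated. This is a minimal sketch: the helper names `postgres_escape` and `bigquery_escape` are made up, and the functions only build the escape text described above. They do not talk to any database.

```python
def postgres_escape(char: str) -> str:
    """Build the six-digit form used inside a Postgres U&'...' literal."""
    return f"\\+{ord(char):06X}"  # backslash, plus sign, 6 hex digits


def bigquery_escape(char: str) -> str:
    """Build the eight-digit \\U form used in a BigQuery string literal."""
    return f"\\U{ord(char):08X}"  # backslash, capital U, 8 hex digits


nerd_face = "\U0001F913"
print(postgres_escape(nerd_face))  # \+01F913
print(bigquery_escape(nerd_face))  # \U0001F913
```

In application code you would normally pass the emoji itself as a bound query parameter rather than splicing escapes into SQL text; the escapes are mostly useful for ad hoc queries like the ones above.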

How important is it to retain emojis in my dataset or database

Image by author

These are two very similar English texts, yet they are very different in terms of expression. As this trend continues and emojis increasingly become the go-to way of expressing feelings, NLP models and datasets should adapt to store and process such information. New emojis are constantly added, and the complexity has increased with the introduction of variants such as skin tones. Adapting to every emoji may not be easy, but it is worth noting that unicode.org publishes a list of emoji frequencies that helps us understand which emojis matter most.
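To see why variants add complexity, consider a thumbs-up with a medium skin tone. It renders as one symbol but is made of two code points: the base emoji followed by a skin tone modifier. A small sketch, with `code_points` as a hypothetical helper name:

```python
from typing import List


def code_points(text: str) -> List[str]:
    """List every code point in the text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Thumbs up (U+1F44D) followed by the medium skin tone modifier (U+1F3FD).
thumbs_up_medium = "\U0001F44D\U0001F3FD"

print(len(thumbs_up_medium))          # 2
print(code_points(thumbs_up_medium))  # ['U+1F44D', 'U+1F3FD']
```

Anything that counts, truncates, or tokenises text one code point at a time can therefore split such an emoji in half, which is worth keeping in mind when sizing columns or building vocabularies.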

No doubt storing emojis in databases has become common practice, and it should be. Google has a whitepaper stating that allowing the largest available character set, which includes emoji, is one of the good practices for storing user passwords. Product names and descriptions on most online shopping platforms contain emojis too, and we definitely would not want to strip them out when users save them or when you perform inference for your business case.

What's next 🤷

Emojis are here to stay, and it is up to developers to manage that extra information, and for data practitioners to make sense out of it.

Image taken from the author’s Telegram message

Emojis have become an integral part of our daily conversations, and in some cases an emoji is the whole answer to a question. How to manage this information, whether by replacing each emoji with its English meaning or by encoding it as another token, is something to think about.
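As a rough illustration of the first option, the sketch below replaces each emoji with a text token built from its official Unicode name. It is a simplification under stated assumptions: it treats every character in the Unicode category "So" (Symbol, other) as an emoji, which also catches symbols such as ©, and it handles modifiers and multi-code-point sequences naively. The function name `emoji_to_text` is hypothetical; dedicated emoji libraries handle these cases more carefully.

```python
import unicodedata


def emoji_to_text(text: str) -> str:
    """Replace symbol characters with a ':unicode_name:' style token."""
    pieces = []
    for ch in text:
        # Heuristic (an assumption): most emojis fall in category "So".
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown symbol")
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            pieces.append(ch)
    return "".join(pieces)


print(emoji_to_text("so smart \U0001F913"))  # so smart :nerd_face:
```

Whether you keep the token as a single vocabulary entry or expand it into plain words depends on the model you feed it to.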

Here are some cool tools that might help you understand Unicode a little better (in no particular order):

Unicode to hex converter

Unicode Code Converter

Unicode and its bytes table

Emojipedia

unicode.org

Unicode surrogate pair understanding and calculator


I am Jonathan Law Hui Hao, a Business Support Specialist in Malaysia. I combine logistics and process improvement with technology.