Hands-on Tutorials

Choosing a Baby Name Using Python

Using phonetic translators in Python to compare how well a first name sounds with a middle and family names in English and Spanish.

Victor Cuspinera
Towards Data Science
8 min read · Aug 26, 2021


Elisa on the keyboard. Image by author.

Choosing a name for a baby is not as trivial as people might think, or at least it was not for my wife and me. We looked for names for our baby girl everywhere and picked some options. However, I started to wonder whether we were looking in the right places and whether there was a way to measure which name would be best for our daughter. So I found three databases with names from Spanish- and English-speaking countries, analyzed trends and the most frequent names, and wrote down a list of all possible names. Finally, I developed a tool that transcribes names into the International Phonetic Alphabet and measures how well a first name sounds with other given names and/or family names in both English and Spanish, returning a score that makes it easier to find possible names for a baby.

⭐️ Click here to see the GitHub repository with the complete analysis of this project.

Introduction

When we were expecting our first baby girl, we spent the last five months of the pregnancy looking for names in different sources: websites and books with popular baby names, websites with international names, recommendations from our friends, and even the names of movie and TV series characters. After months of searching, in early April 2021 we came up with a list of our favourite names for the baby girl. My wife’s choices were Elisa and Macarena. For my part, I had a longer list: Aisha, Amanda, Carlina, Gina and Victoria.

However, we couldn’t agree on a name for the baby. So I started to wonder: how could we make the best choice of name for our baby? How popular are the names we like? Is there a way to measure how well a first name sounds combined with the family name?

Databases

The first step was to look for names. Because we live in Mexico, we were interested in names in Spanish; however, we were also open to names from English-speaking countries.

For names in Spanish, I found data from Spain's National Statistics Institute 🇪🇸 with the 100 most popular names (2002–2019) and the names with a frequency of 20 people or more (2019). On the other hand, for English-speaking countries, I found a couple of databases with popular baby names: one from the U.S. 🇺🇸 published by the Social Security Administration (1880–2019), and another from the province of British Columbia in Canada 🇨🇦, shared by the Government of B.C. (1920–2019).

EDA

When comparing the databases of popular names from Spain, the U.S. and Canada, I observed that the number of distinct names differs greatly among them. These differences can probably be explained by the structure of the databases, that is, by the range of years they cover and their characteristics:

Description of the different databases. Image by author.
Growth of number of occurrences per name for each database. Image by author.

While the growth in the average number of people per name is stable at around 0% for the USA and Canada, this is not the case for Spain, probably because the database from Spain contains only the 100 most popular names, compared with the extensive lists of names in the USA and Canada data.
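As a rough illustration of the quantity discussed here, the following sketch computes the average number of people per distinct name for each year and its year-over-year percentage change. The record layout `(year, name, count)`, the function names and the toy numbers are assumptions made for this example; they are not taken from the three databases or from the project's code.

```python
from collections import defaultdict


def average_per_name_by_year(records):
    """Return {year: average number of people per distinct name}."""
    totals = defaultdict(int)   # total people registered per year
    names = defaultdict(set)    # distinct names seen per year
    for year, name, count in records:
        totals[year] += count
        names[year].add(name)
    return {year: totals[year] / len(names[year]) for year in sorted(totals)}


def yearly_growth(averages):
    """Return {year: percent change versus the previous year in the data}."""
    years = sorted(averages)
    return {
        cur: (averages[cur] / averages[prev] - 1) * 100
        for prev, cur in zip(years, years[1:])
    }


# Made-up toy records, only to show the shape of the calculation.
records = [
    (2018, "victoria", 100), (2018, "elisa", 50),
    (2019, "victoria", 120), (2019, "elisa", 60),
]
averages = average_per_name_by_year(records)
growth = yearly_growth(averages)
print(averages)
print(growth)
```

With a fixed top-100 list, the denominator stays at 100 names every year, so the average simply follows the total number of people, which may be one reason such a series behaves differently from the open-ended lists.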

Descriptive analysis of our favorite names

Spain 🇪🇸

Frequency of our favorite names in the Spain database. Image by author.

Among our favourite names, the database from Spain shows that the most used name is victoria with 59.6 thousand people, followed by elisa with 36.3 thousand, macarena with 14.4 thousand, amanda with 12.9 thousand, gina with 2.3 thousand, aisha with 2.0 thousand, and carlina with only 147 people.

In the database from Spain, we can also find the most frequent compound names in 2019. Looking for the compound names that include one of our seven favourite names, we find 243 compound names that include victoria, 82 with elisa, 31 with macarena, 28 with amanda, 10 with gina, 3 with aisha and 2 with carlina. This database also includes the average age for all single and compound names.
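The counting step described above could be sketched as follows. This is a minimal example under stated assumptions: the compound names are available as plain lowercase-able strings, a favourite counts only when it appears as a whole word, and `count_compounds` and the sample list are hypothetical, not the author's code or data.

```python
from collections import Counter

FAVOURITES = ["victoria", "elisa", "macarena", "amanda", "gina", "aisha", "carlina"]


def count_compounds(compound_names, favourites):
    """Count the compound names that contain each favourite as a whole word."""
    counts = Counter()
    for full_name in compound_names:
        parts = full_name.lower().split()
        if len(parts) < 2:
            continue  # a single name is not a compound name
        for fav in favourites:
            if fav in parts:
                counts[fav] += 1
    return counts


# Made-up sample; "elisabet maria" must not count as a match for "elisa".
sample = ["maria victoria", "victoria maria", "ana elisa", "elisa", "elisabet maria"]
counts = count_compounds(sample, FAVOURITES)
print(counts)
```

Matching on whole words rather than substrings is a design choice for this sketch: a substring test would also count names such as elisabet under elisa.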

Frequency and average age of compound names that include one of our favourite names. Image by author.
Word cloud of compound names that include one of our favourite names. Image by author.

Finally, the word cloud plot presents the compound names in Spain with a frequency of 75 or more observations. The nodes represent the names, the arrows show the order in which names are connected, and the darker the arrow, the stronger the connection between names.

For example, the names maria and victoria are connected in both directions and form the compound names maria victoria and victoria maria.

USA 🇺🇸

In the U.S., we observe that while victoria has been popular in all years, gina became the most popular name in the late 60s and amanda in the 80s. The name elisa has been in the middle range of our favorite names, with around 500 observations per year, while the popularity of aisha increased from around 10 observations in the 60s to 1,000 in the 70s and then held steady at that level. Finally, the names carlina and macarena are the least popular from our list in the U.S. records.

Frequency of our favourite names by year, in the US database. Image by author.

B.C., Canada 🇨🇦

This plot presents the number of newborn babies per year in British Columbia, Canada, who were given any of our favourite names. In this case, the database has information on only five of our seven selected names: victoria, elisa, amanda, gina and aisha.

Similar to the plot from the U.S., here we find that victoria has been popular in all years and amanda became the most popular name in the 80s and 90s. The popularity of gina decreased over time. Finally, the names elisa and aisha have few records.

Frequency of our favourite names by year, in British Columbia (Canada) database. Image by author.

Scoring names using the International Phonetic Alphabet (IPA)

The last step of this analysis is to build a tool, in this case a Python class, that measures how well a first name combines with a middle name and/or family names. For this purpose, I transcribe the names into their Spanish and English phonetic notation using the International Phonetic Alphabet.

For the English transcription, I used the eng_to_ipa library. The following code chunk shows an example of how to use it:

# call the eng_to_ipa library
import eng_to_ipa as ipa
# transform a text from English to IPA
ipa.convert("Maria")
### outcome: 'mərˈiə'

In the case of Spanish, the epitran library was useful to transform the names, and it can be used as follows:

# call the epitran library
import epitran
# select an ISO 639-3 code of the language
epi = epitran.Epitran('spa-Latn')
# transform a text from selected language to IPA
epi.transliterate("Maria")
### outcome: 'maɾja'

After transforming the names to their phonetic notation, a custom function looks for consonant and assonant rhymes between two given strings (names and/or surnames): it compares the last vowel, the last two vowels, the last syllable and the initial letters of each name, returning a boolean for each comparison. For each language, the function computes a score by assigning more weight to the comparisons of the last two vowels and the last syllable and normalizing the result. As an example, this is the formula to get the score from comparisons in Spanish between two names.
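To make the four comparisons more concrete, here is a minimal sketch that works on strings already transcribed to IPA. Everything specific in it is an assumption for illustration: the weights (1, 2, 2, 1), the simplified vowel inventory, the crude "last syllable" rule, the helper names, and the hand-written transcriptions. It is not the author's class, and the weights are not read from the formula image.

```python
# Simplified vowel inventory (assumption); real IPA has many more symbols.
VOWELS = set("aeiouəɪʊɛɔæɑʌ")

# Assumed weights: the last two vowels and the last syllable count double.
WEIGHTS = {"last_vowel": 1, "last_two_vowels": 2, "last_syllable": 2, "initial": 1}


def clean(ipa):
    """Drop stress marks and spaces so they do not affect the comparisons."""
    return "".join(ch for ch in ipa if ch not in "ˈˌ ")


def last_syllable(ipa):
    """Crude approximation: the consonants right before the final vowel,
    the final vowel itself, and anything that follows it."""
    idx = max((i for i, ch in enumerate(ipa) if ch in VOWELS), default=None)
    if idx is None:
        return ipa
    start = idx
    while start > 0 and ipa[start - 1] not in VOWELS:
        start -= 1
    return ipa[start:]


def compare(a, b):
    """Return one boolean per comparison between two IPA strings."""
    a, b = clean(a), clean(b)
    va = [ch for ch in a if ch in VOWELS]
    vb = [ch for ch in b if ch in VOWELS]
    return {
        "last_vowel": bool(va) and bool(vb) and va[-1] == vb[-1],
        "last_two_vowels": len(va) >= 2 and len(vb) >= 2 and va[-2:] == vb[-2:],
        "last_syllable": bool(a) and bool(b) and last_syllable(a) == last_syllable(b),
        "initial": bool(a) and bool(b) and a[0] == b[0],
    }


def score(a, b):
    """Weighted share of matching comparisons, normalized to the range 0..1."""
    checks = compare(a, b)
    return sum(WEIGHTS[k] for k, hit in checks.items() if hit) / sum(WEIGHTS.values())


# Hand-written transcriptions, assumed for this example.
print(compare("makaɾena", "kuspineɾa"))
print(score("makaɾena", "kuspineɾa"))  # 0.5 with these assumed weights
```

In this sketch the pair shares the last vowel and the last two vowels but neither the last syllable nor the initial sound, so 3 of the 6 weight points are earned.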

Formula to get the score. Image by author.

To get the Total score, the function simply averages the final English and Spanish scores.

Finally, for complete names with more than two strings, another custom function compares each string with the other strings of the full name and averages the values for each type of score, returning three final results for the complete name: English, Spanish and Total scores.

For example, in the name maria victoria smith this function gets the score for the comparison between (1) maria with victoria, (2) maria with smith, and (3) victoria with smith.
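The pairwise averaging can be sketched with `itertools.combinations`. The function names and the placeholder scorer below are hypothetical and stand in for the real per-language scoring function; only the structure (all pairs, then an average, then the mean of the two languages) follows the description above.

```python
from itertools import combinations
from statistics import mean


def full_name_score(parts, pair_score):
    """Average a pairwise score over every pair of strings in a full name."""
    pairs = list(combinations(parts, 2))
    if not pairs:
        return 0.0  # a single string has nothing to be compared with
    return mean(pair_score(a, b) for a, b in pairs)


def total_score(english, spanish):
    """Total score as the plain average of the two language scores."""
    return (english + spanish) / 2


def toy_pair_score(a, b):
    """Placeholder scorer (assumption): 1.0 if both strings end in the same letter."""
    return 1.0 if a[-1] == b[-1] else 0.0


parts = "maria victoria smith".split()
pairs = list(combinations(parts, 2))
print(pairs)  # the three comparisons listed in the example above
name_score = full_name_score(parts, toy_pair_score)
print(name_score)
```

A name with four strings, such as a first name, a second name and two family names, yields six pairs, so every string contributes to the final average through three comparisons.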

⭐️ Click here for more details on the custom Python class, and its functions, developed to get these scores by comparing the rhymes of names.

😅 Fun fact: when I finally reached this point of the analysis in mid-April 2021, my wife had already chosen a name for our baby girl: elisa. From this point, it was up to me to decide whether I wanted a compound name for the baby or to give her only one name.

Scores of “elisa cuspinera martinez”. Image by author.

Among the compound female names that include elisa combined with our family names cuspinera and martinez (in Mexico and Latin America we use two family names), the possible second names with the highest scores were: aisa, akira, alisa, ariza, corisa, delisa, elfrida, elissa, elvina, elyria, elysia, erisa, isa, jazeera, lisa, liza, louisa, luisa, macarena, maeda, magdalena, makita, malina, malinda, malvina, marcelia, marcellina, marchita, margarita, marilda, marina, marquita, martina, martita, mathea, matthea, maurita, mayeda, miera, misa, raisa, riera, risa, shakira, viera.

As you may have noticed, one of these names is macarena. So, as in the movie Inception, I felt that my wife had planted the idea of this name in my head and, among the top options of compound names combined with elisa, this one was my favorite, mainly because of the way it sounds when combining macarena and cuspinera. In fact, macarena cuspinera scores 0.5 in Spanish and 0.6667 in English.

Finally, we decided that the name of our baby girl would be elisa macarena cuspinera martinez, which has an even better score than elisa cuspinera martinez.

Scores of “elisa macarena cuspinera martinez”. Image by author.

Final comments

Choosing the name of your baby may not be as easy as people think. The aim of this project was to build a tool that measures how well a complete name sounds, assigning a score that makes it easier to find possible forenames for a baby. The custom function could be a starting point for building a more robust model that considers other characteristics, such as the harmony of the sounds.

On the other hand, as a tool, this model is only one piece of the puzzle, and it could (and should) be complemented with other qualitative variables, for example, the meaning of names, regional customs, family traditions, trends, personal preferences and/or other ideas from the future parents.


Data Scientist with an Actuarial and Economics background, interested in learning, traveling and cooking.