Using data to motivate language preservation
This piece is best experienced on desktop or tablet.

At my high school, all seniors had to produce a final research paper on a social sciences topic of their choice to graduate. My piece on the benefits of bilingual teaching, inspired by my inability to maintain fluency in more than one language as I progressed through formal education, brought me a lasting appreciation for the world’s linguistic diversity.
This week, a random internet rabbit-hole brought me back to the subject. I found that of the ~7,000 languages that exist today, more than 40% fall somewhere in the "Endangered" category. That means that they’re at risk of disappearing from our collective knowledge.
These languages, claimed by indigenous communities throughout the globe, embody the histories, traditions, literatures, and knowledge of their speakers. They are the vehicle for these communities’ awareness of traditional herbal medicines, their oral storytelling practices, and other invaluable cultural and scientific phenomena, as well as unique linguistic constructs (such as the placement of subjects, objects, and verbs within phrases) that are rarely seen in more common tongues. Experts in the field of language preservation have compared losing a language family to losing a subsection of the animal kingdom – imagine a world where all birds or whales were extinct. And methods of communication aren’t the only things that become endangered when languages are classified as such – language loss is rarely voluntary, and can stem from power imbalances that strip their speakers of their rights and identities.
To read more about preserving linguistic diversity, check out this resource from the Endangered Languages Project (ELP) and this paper about a specific intervention for language survival.
So what about the data?
The ELP’s catalogue contains more than 3,000 endangered languages with varying levels of information about each one. This dataset’s size strikes a lucky balance between being small enough to tell the story of each individual point and being big enough to draw tentative conclusions about the whole.
Before I present my visualisation, here is a table summarising the ELP’s classification of endangered languages. Of those that still exist today, the languages labelled "critically endangered" are at the greatest risk of extinction. Other factors not listed that are used by the ELP to classify languages include their number of speakers and trends in speaker numbers.

I chose to present a beeswarm plot of most of the ELP’s catalogued languages (more information about which ones I excluded is in this article’s appendix), with each dot representing one endangered language. You can interact with the final piece below.
The levels of certainty indicate how much information the ELP has about each factor listed above (for example, if only the number of speakers of a language is known, its level of certainty would be 20%).
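To make the 20% example concrete, here is a minimal Python sketch of how such a score could be computed as the weighted share of factors with a known value. The factor names and weights are my own assumptions, chosen only so that knowing the speaker count alone yields 20%; the ELP's actual scheme may differ.

```python
# Hypothetical factor names and weights (assumptions, not the ELP's published
# scheme). They sum to 5, so one known factor of weight 1 gives 20%.
FACTOR_WEIGHTS = {
    "intergenerational_transmission": 2,
    "absolute_speakers": 1,
    "speaker_trends": 1,
    "domains_of_use": 1,
}


def certainty(record, weights=FACTOR_WEIGHTS):
    """Return the weighted share (0-100) of factors with a known value."""
    total = sum(weights.values())
    known = sum(w for name, w in weights.items() if record.get(name) is not None)
    return 100 * known / total


# Only the number of speakers is known for this made-up record.
print(certainty({"absolute_speakers": 120}))  # 20.0
```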
To learn more about each language, hover your cursor over each coloured dot. To explore the languages classified at each level of endangerment in more detail, use the drop-down feature in the top left of the graphic.
The clustering of the points in each endangerment level indicates that absolute speaker numbers strongly influence a language’s categorisation. However, some languages with many speakers are designated as critically endangered, while some with fewer speakers are only vulnerable – this might be because some communities are smaller but retain a thriving knowledge and dissemination of their traditional languages.
A visualisation experiment
After designing this straightforward quantitative approach to visualising the ELP’s catalogue, I asked myself how else I could demonstrate both the urgency and hope of the linguistic preservation cause.
Though not all languages (especially at-risk ones) have written scripts, I assumed that many rely on writing to pass down knowledge and traditions. I then created a series of slides giving basic information about seven critically endangered languages, using the words themselves to communicate how close we are to losing these pockets of tradition.
In revealing the full text of each description, you may notice that revival projects are in place for some of the most threatened languages. Even though languages are disappearing faster than ever, communities are recognising the importance of their linguistic heritage and investing in research and education for future generations. You can read about numerous examples of this work here via the Solutions Journalism Network’s Story Tracker.
The issue of endangered languages is an interdisciplinary one, and it’s an intersectional one too. While I’ve given only an overview of the field here, I hope you’re now curious to find out more.
Thank you for reading!
I am deeply indebted to the Endangered Languages Project for their assistance and support during my drafting of this work.
Data source:
Catalogue of Endangered Languages. 2021. University of Hawaii at Manoa. http://www.endangeredlanguages.com.
Appendix: Notes, caveats, and past iterations
- For my first visualisation, I tried to adhere to statistician Edward Tufte’s principle of maximising the "data-ink" ratio. This is the idea that design elements not directly relevant to the data presented (e.g. axis lines, gridlines, and annotations) should be removed. For more information on his other principles, please see here.
- When a range of absolute speaker numbers was given in the database, I took the average of its two endpoints to generate one value. This is why some languages are shown to have a decimal number of speakers.
- I removed languages listed as "dormant", "awakening", or "at risk". "Dormant" languages generally had zero speakers, while the dataset included only one "awakening" language. "At risk" languages are classified separately from the levels of endangerment described above (in terms of vulnerability, they sit between non-endangered languages and "vulnerable" languages).
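As a rough illustration of the two cleaning steps in the notes above (averaging speaker ranges and dropping the excluded categories), here is a minimal Python sketch. The record layout and the field names `status` and `speakers` are hypothetical, not the ELP's actual export format, and this is not necessarily how I processed the data.

```python
import re

# Categories excluded from the visualisation, as listed in the notes above.
EXCLUDED_STATUSES = {"dormant", "awakening", "at risk"}


def speaker_estimate(raw):
    """Average every whole number found in the text, e.g. '10-25' -> 17.5.

    Thousands separators such as '1,200' are handled; returns None when the
    text contains no digits. Decimal inputs are not supported in this sketch.
    """
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", raw)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def keep(record):
    """True when the record's status is not one of the excluded categories."""
    return record["status"].strip().lower() not in EXCLUDED_STATUSES


# Made-up rows for illustration only.
rows = [
    {"name": "Language A", "status": "Critically Endangered", "speakers": "10-25"},
    {"name": "Language B", "status": "Dormant", "speakers": "0"},
]
cleaned = [{**r, "speakers": speaker_estimate(r["speakers"])} for r in rows if keep(r)]
print(cleaned[0]["speakers"])  # 17.5
```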
Here are the previous iterations of the first visualisation:
Each line here is meant to represent one language, going left to right in alphabetical order. However, there is too much chaos for the viewer to engage meaningfully with the graphic.
- From my first iteration onwards, I used a logarithmic scale for the "number of speakers" axis because of the massive range in values observed.
- I then ordered the languages from most to least spoken, which resulted in a neater plot. However, the result was still hardly appealing to the eye.
- Next, I experimented with circular representations of each language, which resulted in this centipede-like mess.
- The two dots in the bottom left are data cleaning errors.
- In the final iteration of the visualisation before the one presented in the article, I used a bubble chart with categorical values on the y-axis and continuous values on the x-axis for a plot that looked more organised. However, this piece suffered from an unbalanced data-ink ratio – I was using three different visual elements (dot positioning, dot size, and dot colour) to indicate the same variable (number of speakers)! It was also difficult to distinguish the languages.
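The logarithmic "number of speakers" axis mentioned in these notes can be sketched as a simple mapping from speaker counts to horizontal positions. This is a minimal Python illustration rather than the code behind the charts; the axis bounds and pixel width are assumed values.

```python
import math


def log_position(speakers, min_speakers=1, max_speakers=1_000_000, width=800):
    """Map a speaker count to an x position (in pixels) on a log10 axis.

    Counts outside the assumed axis bounds are clamped, so a value of 0
    lands at the left edge instead of raising a math domain error.
    """
    lo, hi = math.log10(min_speakers), math.log10(max_speakers)
    clamped = min(max(speakers, min_speakers), max_speakers)
    return width * (math.log10(clamped) - lo) / (hi - lo)


# Each tenfold increase in speakers moves a dot the same distance to the right.
for count in (1, 10, 100, 1000):
    print(count, round(log_position(count), 1))
```

On a linear axis, languages with a handful of speakers would be crushed together at the left edge; the log scale gives each order of magnitude equal room.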
How can I improve upon both visualisations? Does my second, more abstract representation of endangered languages make sense to you? Please let me know by responding to this article!
