From chatbots to sentiment analysis, we’re seeing an explosion of real-world use cases for textual data. Some of the buzziest innovations in AI revolve around models trained with ever-increasing quantities of text; on the flip side, we can trace many of the challenges the field is facing to limited, unrepresentative, or flat-out biased language datasets.
This week, we share six recent posts that cover data and language through a wide range of topics and approaches—NLP fans will have a blast, but so will programmers, data engineers, and AI enthusiasts. Let’s dive in!
- The wall all large language models run into (for now). GPT-3 and similar generative models can produce text that sounds convincing even when it isn’t factual. Iulia Turc explores the issue of these models’ groundedness (“the ability to ground their statements into reality, or at least attribute them to some external source”) and why it has been so difficult to develop models that come close to human performance.
- Natural language querying is making a splash. Up until recently, humans had to invent (and then learn) complex languages in order to communicate with computers and manipulate digital data. Andreas Martinson discusses the emerging world of NLQ—natural language querying—and how it might transform the work of data professionals for the better, as well as democratize access to databases.
- Choosing the right tools to simplify complex NLP tasks. The difference between clunky and streamlined workflows can sometimes come down to seemingly trivial choices. Kat Li surveys five lesser-known Python libraries, from Pyspellchecker to Next Word Prediction, and explains how they can save time and effort when used in the right NLP context.
- How to translate your text-derived findings into compelling visuals. If you’re in a tinkering mood, you’ll enjoy Petr Korab’s latest tutorial. Going beyond the usual suspects (looking at you, word cloud!), this tutorial walks us through the creation of more advanced, and more polished, visualizations, including chord diagrams and packed bubble charts.
- On algorithms and spelling. In a neat new project, Socret Lee built a Khmer spellchecker as part of a bigger keyboard app. Socret’s writeup patiently explains the process and zooms in on two concepts that proved crucial for implementing the tool: BK-trees and edit distance (a toy sketch follows this list).
- When sentiment analysis, pop culture, and social media collide. A new season of a popular Netflix show provides the perfect opportunity to analyze text (and emoji!) at scale. Amanda Iglesias Moreno’s latest article leverages the Twitter API to study polarity in tweets about the Regency-era-set Bridgerton.
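If you’d like a feel for the two ideas in Socret’s post before reading the full writeup, here is a minimal, illustrative sketch of our own (not Socret’s implementation, and using a toy English word list rather than Khmer): a classic Levenshtein edit-distance function, plus a tiny BK-tree that uses it to retrieve candidate corrections within a distance threshold.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


class BKTree:
    """BK-tree: a tree keyed on edit distance, so a lookup only explores
    branches whose distance to the query can still fall within the threshold."""

    def __init__(self):
        self.root = None  # each node is (word, {distance: child_node})

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word: str, max_dist: int) -> list[tuple[int, str]]:
        results, stack = [], [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = edit_distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # Triangle inequality: only children at distances within
            # [d - max_dist, d + max_dist] can contain matches.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for w in ["hello", "help", "hell", "shell", "smell"]:
    tree.add(w)
print(tree.search("helo", max_dist=1))  # [(1, 'hell'), (1, 'hello'), (1, 'help')]
```

The pruning step is the whole point: because edit distance is a metric, the tree can skip entire branches that provably can’t contain a close-enough word, so the spellchecker avoids comparing the query against every entry in the dictionary.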
There’s always more to explore on TDS, so we hope you still have some time and stamina for a handful of excellent reads on other topics; we just couldn’t not share these with you.
- If you’ve been missing Carolina Bento’s crystal-clear, expertly illustrated tutorials, you’re in luck: a new one just landed on TDS, and it explains RNNs (recurrent neural networks) with a real-life example and ample code snippets.
- Bird-loving data scientists will find Benedict Neo’s new project both fun and interesting: it attempts to classify bird species based on genetic attributes and location.
- Learn about fast and efficient approximate nearest neighbor search by following along with Peggy Chang’s latest tutorial. It covers similarity search through a combination of an inverted file index (IVF) and product quantization (PQ); a minimal example follows this list.
- To end on a dose of practical inspiration, Pau Labarta Bajo shared several hard-earned insights on boosting your ML skills so you can excel in real-world contexts, which are always messier and more complex than what you learned in courses or bootcamps.
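For a quick taste of what IVF and PQ look like in practice, here’s a minimal sketch using Faiss, one common library implementing this index type (our choice for illustration, not necessarily the one Peggy’s tutorial uses; the parameter values below are arbitrary):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64                    # vector dimensionality
nlist = 100               # number of inverted-file cells (coarse clusters)
m = 8                     # PQ sub-vectors per vector (must divide d)
nbits = 8                 # bits per sub-vector code (256 centroids each)

rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype=np.float32)  # database vectors
xq = rng.random((5, d), dtype=np.float32)       # query vectors

# The coarse quantizer routes vectors to IVF cells; PQ then compresses
# each vector into a short m-byte code.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)           # learn coarse centroids and PQ codebooks
index.add(xb)             # encode and store the database vectors

index.nprobe = 10         # visit 10 of the 100 cells per query
distances, ids = index.search(xq, 4)
print(ids)                # approximate nearest-neighbor ids, shape (5, 4)
```

Because each query scans only a handful of cells and compares compressed codes rather than full vectors, the search is far faster and lighter on memory than exhaustive comparison, at the cost of some recall.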
Thank you, as always, for your passion and curiosity. To support the work we publish, consider sharing your favorite article on Twitter or LinkedIn, telling your Data Science colleagues about us, and/or becoming a Medium member.
Until the next Variable,
TDS Editors