With so much of our collective attention still focused on chatbots, it’s easy to forget just how vast and diverse the field of natural language processing (NLP) actually is. From translation to text classification and beyond, data scientists and machine learning engineers have worked (and are still working) on exciting projects whose names don’t start with a C and end with a T.
To help bring some of these subfields back into the limelight, we’re thrilled to share a selection of our recent NLP favorites. They’ll appeal to anyone who enjoys working with textual data—and also to data practitioners who are curious to learn and experiment with it.
- Digging into the key ingredients of an NLP project. Erwin van Crasbeek’s thorough walkthrough is a perfect starting point for beginners and seasoned pros alike. It covers the history and basic concepts of natural language processing, and then moves on to explain the inner workings of Erwin’s Dutch question-answering machine learning model.
- How does language detection work? If you’ve ever used an online translator, you’ve likely experienced that magical split second where the tool recognized which language your input text was in. Katherine Munro brings together Python, the NLTK (Natural Language Toolkit) platform, and a dash of statistics to unpack the process behind the magic.
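To make the statistical idea concrete, here is a minimal toy sketch of one classic approach: compare the character-trigram profile of an input text against profiles built from sample text in each candidate language. This is only an illustration of the general technique, not Katherine's actual implementation, and the tiny sample corpora below are hypothetical stand-ins for real training text.

```python
from collections import Counter

def trigram_profile(text, top_n=300):
    """Build a set of the most frequent character trigrams in a text."""
    text = text.lower()
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    return set(t for t, _ in Counter(trigrams).most_common(top_n))

# Tiny sample corpora (hypothetical; real systems train on far more text).
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
profiles = {lang: trigram_profile(text) for lang, text in samples.items()}

def detect_language(text):
    """Pick the language whose trigram profile overlaps most with the input."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: len(profiles[lang] & target))

print(detect_language("the dog jumps over the fox"))   # english
print(detect_language("de hond springt over de vos"))  # dutch
```

Real language detectors refine this with larger profiles, rank-order statistics, and smoothing, but the core intuition is the same: short character sequences are a surprisingly strong fingerprint of a language.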
- Don’t let the complexity of text classifiers scare you away. Learning how to build a text classifier “can be a bit of a minefield,” says Lucy Dickinson. That’s precisely why Lucy’s 10-step guide is so helpful: it breaks down a potentially unwieldy process into well-defined and clearly illustrated tasks (with all the code you’ll need to start tinkering on your own).
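If you want a taste of what a text classifier does before diving into the full guide, here is a bare-bones naive Bayes sketch in pure Python. It is a toy illustration of the general technique, not Lucy's pipeline, and the four-example training set is a hypothetical stand-in for a real labeled corpus.

```python
import math
from collections import Counter, defaultdict

# Toy labeled dataset (hypothetical; a real project would load a corpus).
training_data = [
    ("great movie loved the acting", "positive"),
    ("wonderful film great story", "positive"),
    ("terrible plot boring acting", "negative"),
    ("awful movie boring story", "negative"),
]

# Count word frequencies and document counts per class.
class_word_counts = defaultdict(Counter)
class_doc_counts = Counter()
for text, label in training_data:
    class_word_counts[label].update(text.split())
    class_doc_counts[label] += 1

vocabulary = {w for counts in class_word_counts.values() for w in counts}

def classify(text):
    """Score each class with its log prior plus Laplace-smoothed word likelihoods."""
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label in class_doc_counts:
        log_prob = math.log(class_doc_counts[label] / total_docs)
        total_words = sum(class_word_counts[label].values())
        for word in text.split():
            count = class_word_counts[label][word]
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

print(classify("great acting and story"))  # positive
```

A production classifier adds the steps Lucy's guide walks through (cleaning, feature extraction, evaluation, and more), but even this sketch shows the core loop: turn text into counts, score each class, pick the winner.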
- The art of finding the right data for specific NLP tasks. In most ML pipelines, the difference between success and failure hinges on the quality of the data you work with. As Benjamin Marie shows, when it comes to machine-translation projects, we have to think not just about quality but also about fit, and it really helps to know how to squeeze as much value as possible from the data we have.
February flew by so quickly; here are a few standouts that we didn’t get a chance to highlight yet, but are well worth your time as we gently tiptoe into March.
- If you’re neither a deep-learning novice nor exactly an expert, don’t miss Leonie Monigatti’s new intermediate-level guide on fine-tuning models.
- For a lucid exploration of language models’ limitations—especially when it comes to their potential to supplant search engines—we warmly recommend Noble Ackerson’s debut TDS article.
- Looking to gain a solid understanding of how the denoising diffusion model works? Wei Yi’s new deep dive is the complete resource you need.
- In case you’re just starting to work with geospatial data, Eugenia Anello’s latest contribution is a hands-on introduction to GIS (and a handy primer on key terms and concepts).
- Tired of using the same packages over and over again? Miriam Santos invites you to give five open source Python packages a try in your visualization and EDA workflows.
Thank you for your time and your support this week! If you enjoy the work we publish (and want to access all of it), consider becoming a Medium member.
Until the next Variable,
TDS Editors