Lately, I’ve been wondering how I would explain my research on natural language generation for automated journalism to my grandfather. I figured I could create a simple robojournalist to write reports on his favourite sport, soccer. Unfortunately, he passed away several years ago, so I’ll show you instead.
What to expect?
In this article, I’ll show you how to create a template-based news automation system with Python. The method itself is very simple, but it does the trick when you want to present repetitive data as written reports and be sure that the output truthfully represents the underlying data.
Pros:
- Easy – No NLP expertise is required; you just need to know your data.
- Fluent – The sentences will be as grammatically correct as you make them.
- Accurate – No black-box components are involved; every sentence comes directly from your templates and your data, so the output stays truthful to the data.
Cons:
- Not transferable – For the next dataset you’ll need to repeat the whole process.
- Laborious – Designing the templates and fine-tuning the output takes time and energy, and you pay that cost again whenever the data has a different shape or comes from a different domain.
- Repetitive – All reports will look the same. This can be tackled by including alternative templates or by adding or replacing words; see my article Robojournalism – How to tackle the repetition.
Step-by-step guide to creating a simple news automation system
Do you have repetitive data coming in that you’d like to present as text for your friends, followers, or anyone else? Follow these six steps, plus an optional seventh, to achieve that.
Step 1: Select the data you want to report
I’ll be using a dataset describing the recent UEFA European Championship, offered by Sports Reference. The data is available as a ready-made CSV export. You can find it here.
Step 2: Load the data in and replace missing values
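A minimal sketch of this step, using only Python's standard library. The two rows below are hand-typed stand-ins for the downloaded file, so the column formatting is an assumption, and the helper name load_matches is hypothetical. Empty cells are replaced with two dashes, the placeholder for missing values used later in this article.

```python
import csv
import io

MISSING = "--"  # placeholder for missing values (two dashes)

# Hand-typed stand-in for the downloaded CSV export. The real file's column
# names and value formatting may differ, so treat these rows as assumptions.
SAMPLE_CSV = """\
Round,Day,Date,Home,Score,Away,Venue,Notes
Quarter-finals,Sat,2021-07-03,Ukraine,0-4,England,Stadio Olimpico,
Final,Sun,2021-07-11,Italy,1-1,England,Wembley Stadium,Italy won on penalty kicks
"""


def load_matches(source):
    """Read CSV rows into dicts, replacing empty cells with MISSING."""
    matches = []
    for row in csv.DictReader(source):
        cleaned = {}
        for column, value in row.items():
            value = (value or "").strip()
            cleaned[column] = value if value else MISSING
        matches.append(cleaned)
    return matches


matches = load_matches(io.StringIO(SAMPLE_CSV))
print(matches[0]["Notes"])  # the quarter-final row has no note
```

With the real export, you would pass an opened file instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where path points to wherever you saved the CSV.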
Step 3: Decide what to report
I’ve chosen the following data fields to report:
- Round
- Day
- Date
- Home (team)
- Score
- Away (team)
- Venue
- Notes
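One way to narrow each loaded row down to the fields listed above is a small filter like the following. It is only a sketch: select_fields is a hypothetical helper, the example row is hand-typed, and "Unused" stands for any extra column the export may contain.

```python
FIELDS = ["Round", "Day", "Date", "Home", "Score", "Away", "Venue", "Notes"]
MISSING = "--"  # placeholder for missing values (two dashes)


def select_fields(row, fields=FIELDS):
    """Return a copy of row holding only the fields we report on."""
    return {name: row.get(name, MISSING) for name in fields}


# Hand-typed example row; "Unused" stands for a column we do not report,
# and "Notes" is absent on purpose to show the placeholder being filled in.
row = {
    "Round": "Final",
    "Day": "Sun",
    "Date": "2021-07-11",
    "Home": "Italy",
    "Score": "1-1",
    "Away": "England",
    "Venue": "Wembley Stadium",
    "Unused": "ignored",
}

report_row = select_fields(row)
print(report_row)
```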
Step 4: Create your templates
My idea for the templates is simple: I make a list of the sentences I want to include. In each sentence template, the place where a data attribute will be inserted is marked as {Column Name}.
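As an illustration, such a template list could look like the sketch below. The wording of the three sentences is my example, not a fixed recipe, and placeholders is a hypothetical helper that uses the standard library's string.Formatter to check that every {Column Name} marker refers to a field we actually have.

```python
from string import Formatter

FIELDS = ["Round", "Day", "Date", "Home", "Score", "Away", "Venue", "Notes"]

# Example wording only; write whatever sentences suit your data.
TEMPLATES = [
    "{Home} and {Away} met at {Venue} on {Day} {Date} ({Round}).",
    "The match ended {Score}.",
    "Note: {Notes}.",
]


def placeholders(template):
    """List the column names referenced as {Column Name} in a template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


# Catch typos in column names before any report is generated.
for template in TEMPLATES:
    unknown = set(placeholders(template)) - set(FIELDS)
    assert not unknown, f"unknown column(s) in template: {unknown}"

print(placeholders(TEMPLATES[0]))
```

Checking the placeholders up front is a design choice: a misspelled column name fails immediately instead of surfacing later as a KeyError in the middle of a batch of reports.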
Step 5: Embed your data to the templates
We have a minor issue: not every template applies to every match. A template that mentions a field which has no value for a given match (for example, Notes) should not produce a sentence for that match. These missing values are represented in our data as two dashes. Solution: any sentence that still contains two dashes after the data is embedded is removed before the report is output.
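The embedding and the two-dash removal rule can be sketched as follows. The templates and rows are hand-typed examples, and write_report is a hypothetical name; the only technique involved is str.format_map followed by a filter on the filled-in sentences.

```python
MISSING = "--"  # how missing values are represented in the data

# Example templates; the wording is illustrative.
TEMPLATES = [
    "{Home} and {Away} met at {Venue} on {Day} {Date} ({Round}).",
    "The match ended {Score}.",
    "Note: {Notes}.",
]


def write_report(row, templates=TEMPLATES):
    """Fill every template with the row, then drop sentences with gaps."""
    sentences = [template.format_map(row) for template in templates]
    return " ".join(s for s in sentences if MISSING not in s)


# Hand-typed example rows; the first one has no note.
rows = [
    {"Round": "Quarter-finals", "Day": "Sat", "Date": "2021-07-03",
     "Home": "Ukraine", "Score": "0-4", "Away": "England",
     "Venue": "Stadio Olimpico", "Notes": MISSING},
    {"Round": "Final", "Day": "Sun", "Date": "2021-07-11",
     "Home": "Italy", "Score": "1-1", "Away": "England",
     "Venue": "Wembley Stadium", "Notes": "Italy won on penalty kicks"},
]

for row in rows:
    print(write_report(row))
```

One caveat with this rule: a sentence is dropped whenever the filled-in text contains two consecutive dashes, so it assumes no legitimate value in your data contains that sequence.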
Step 6: The output
And here we have it: reports written about the UEFA 2021 soccer European Championships. I’ve only included reports about the final five matches.
Step 7: Transform your data to fine-tune the output
With this approach, the output is only as good as the templates and the input data. Thus, by making some simple transformations to your input data, you’ll get nicer output.
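A few transformations of this kind, as a sketch: spelling out the weekday, writing the date in a reader-friendly form, and turning the raw score into a result phrase. It assumes dates formatted as YYYY-MM-DD and scores formatted as two numbers separated by a hyphen or an en dash; check the real export before relying on either. The names transform and describe_result are hypothetical.

```python
import re
from datetime import datetime


def describe_result(home, away, score):
    """Turn a score such as "0-4" into a phrase naming the winner."""
    home_goals, away_goals = (int(part) for part in re.split(r"[-–]", score))
    if home_goals > away_goals:
        return f"{home} beat {away} {home_goals}-{away_goals}"
    if away_goals > home_goals:
        return f"{away} beat {home} {away_goals}-{home_goals}"
    return f"{home} and {away} drew {home_goals}-{away_goals}"


def transform(row):
    """Return a copy of row with friendlier Day and Date plus a Result."""
    match_date = datetime.strptime(row["Date"], "%Y-%m-%d")
    new_row = dict(row)
    new_row["Day"] = match_date.strftime("%A")  # full weekday name
    new_row["Date"] = f"{match_date.day} {match_date:%B %Y}"
    new_row["Result"] = describe_result(row["Home"], row["Away"], row["Score"])
    return new_row


# Hand-typed example row.
row = {"Round": "Final", "Day": "Sun", "Date": "2021-07-11",
       "Home": "Italy", "Score": "1-1", "Away": "England",
       "Venue": "Wembley Stadium", "Notes": "Italy won on penalty kicks"}

print("{Result} on {Day}, {Date}.".format_map(transform(row)))
```

The new Result field can then be used in a template just like any original column.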
Thank you for reading! If you learned something new or enjoyed this article, follow me on Medium. I publish articles about data engineering and data science. You can learn more about me on my web page.
Did I leave you wondering how to make the outputs more vivid? Check out my article Robojournalism – How to tackle the repetition, and stay tuned for step-by-step tutorials on how to apply my ideas to your own NLG systems 😊