Transforming Text Files to Data Tables with Python

A reusable approach to extract information from any text file

Sebastian Guggisberg
Towards Data Science


In this article, I describe how to transform a set of text files into a data table which can be used for natural language processing and machine learning. To showcase my approach I use the raw BBC News Article dataset published by D. Greene and P. Cunningham in 2006.

Before jumping into the IDE and starting to code, I usually follow a process consisting of understanding the data, defining the output, and translating everything into code. I consider the tasks before coding the most important ones, since they help to structure the coding process and follow it more efficiently.

My three-step process for this project

1. Data Understanding

Before being able to extract any information from a text file, we want to know how its information is structured as well as how and where the text files are stored (e.g. name, directory).

Structure

To understand the structure, we take a look at a few of the text files to get a sense of how the information in them is organised.

Claxton hunting first major medal

British hurdler Sarah Claxton is confident she can win her first major medal at next month's European Indoor Championships in Madrid.

The 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage.
...

In the context of news articles, it can safely be assumed that the first and second sections correspond to the title and the subtitle respectively. The following paragraphs represent the article’s text. Looking at the sample data, we also recognise that the sections are separated by new lines, which can be used for splitting the text.
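As a quick illustration (a minimal sketch using a shortened version of the sample above, not the final parser yet), splitting such a raw article on new lines and dropping the empty strings already yields the title, the subtitle, and the paragraphs as separate list entries:

raw_article = (
    "Claxton hunting first major medal\n\n"
    "British hurdler Sarah Claxton is confident she can win her first major medal...\n\n"
    "The 25-year-old has already smashed the British record over 60m hurdles twice this season..."
)

# Split on new lines and drop the empty strings between the sections.
sections = [section for section in raw_article.split('\n') if section]
title, subtitle, paragraphs = sections[0], sections[1], sections[2:]
print(title)  # Claxton hunting first major medal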

Storage

To write a script that automatically runs through every text file, we need to know how the text files are stored. Thus, we are interested in the naming and organisation of the directories. Potentially, we need to restructure things so we can loop through the files more easily.

Naming and organisation of the text files

Luckily for us, the BBC news dataset is already well structured for automating the information extraction. As can be seen in the screenshots above, the text files are stored in directories according to their genre. The file names follow the same pattern for every genre and are made up of the file number padded with leading zeros (if the file number is below 100) and the “.txt” extension.
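As a small sketch (assuming the three-digit zero padding visible in the file names), these names can be generated from a running number:

# Sketch: build zero-padded file names such as 001.txt, 042.txt, 510.txt.
for number in (1, 42, 510):
    print("{:03d}.txt".format(number))
# 001.txt
# 042.txt
# 510.txt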

2. Output Definition

Based on the insights from the data understanding step, we can define what information should be included in the output. To determine it, we have to consider the learnings of the previous step as well as think about potential use cases for the output.

Based on the information we can potentially extract from the text files, I come up with two different use cases for machine learning training:

  • Text classification (genre prediction based on the text)
  • Text generation (title or subtitle generation based on the text)

In order to fulfil the requirements for both potential use cases, I would suggest extracting the following information.

Targeted output of the text file information extraction: genre, title, subtitle, text, and token count for every article.

I would also include the length of the text (in number of tokens) to make it easier to filter for shorter or longer texts later on. To store the extracted data, I would suggest a tab-separated values (.tsv) file, since commas or semicolons could be present in the text column.
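For reference, here is a minimal sketch of the targeted schema as an empty pandas data frame (the column names are my reading of the description above and match the code later in the article):

import pandas as pd

# Sketch of the targeted output columns; the rows get filled by the extraction below.
target_columns = ['genre', 'title', 'subtitle', 'text', 'token_counts']
print(pd.DataFrame(columns=target_columns))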

3. Coding

Thanks to the previous steps, we know the data we are dealing with and what kind of information we want to output at the end of the transformation process. As you might know by now, I like to break tasks into smaller parts, and the coding step is no exception :) Generally, I would split the coding into at least three different parts and wrap them in individual functions:

  • Reading and splitting a file
  • Extracting the information
  • Building the data frame

In order to make this news article extractor reusable, I create a new class that implements the functions.
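As a rough skeleton (just a sketch of the structure; the method bodies follow in the next sections), the class could look like this:

import pandas as pd


class ArticleCSVParser:
    # Parses the BBC news text files and transforms them into a single data frame.

    def read_and_split_file(self, genre: str, file_name: str) -> list:
        ...  # implemented in the next section

    def extract_genre_files(self, genre: str) -> pd.DataFrame:
        ...  # implemented below

    def transform_texts_to_df(self, name, genre_list, delimiter='\t'):
        ...  # implemented below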

Reading and splitting a file

In order to read a file with Python, we need the corresponding path, consisting of the directory and the file name. As we observed in the data understanding step, the files are stored in their corresponding genre’s directory. This means that to access a file, we need the base path (‘data’ in my case), its genre, and its file name.

If the file exists, we want to read it, split it by the new line character (‘\n’), filter out empty strings, and return the remaining text sections as a list. In case the file doesn’t exist (e.g. the file number is larger than the number of available files), we want to return an empty list. I prefer this over working with exceptions or returning None when the file doesn’t exist.

def read_and_split_file(self, genre: str, file_name: str) -> list:
    # Return the non-empty text sections of a file, or an empty list if the file doesn't exist.
    text_data = list()
    current_file = os.path.abspath(os.path.join('data', genre, file_name))
    if os.path.exists(current_file):
        with open(current_file, 'r', encoding="latin-1") as open_file:
            text_data = open_file.read().split('\n')
        text_data = list(filter(None, text_data))
    return text_data

As you can see in the code above, the method uses the os package. Thus, we need to import it:
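import os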

Extracting the information

To extract the information from the text files and prepare it for the next step, I would suggest processing one genre at a time. This means that we loop over every file in the corresponding genre’s directory. By keeping a current_number variable, we can format the file name with the leading zeros and then read and split the file by calling the method implemented above.

If the returned list is empty, we want to stop the loop, since this means that we have reached the end of the directory and there aren’t any files left.

Otherwise, we add the information returned by the reading and splitting function to specific data containers, such as titles, subtitles, and texts. Since I suggested also providing the token count of the text in the final output, we use the nltk package to tokenize the text and add the length of the resulting list of tokens to our token_counts list. Finally, we increment current_number by 1 to continue the extraction process with the next file.

def extract_genre_files(self, genre: str) -> pd.DataFrame:
    # Loop over all files of a genre and collect title, subtitle, text, and token count.
    found = True
    current_number = 1
    titles = list()
    subtitles = list()
    texts = list()
    token_counts = list()
    while found:
        file_name = "{:03d}.txt".format(current_number)
        text_data = self.read_and_split_file(genre, file_name)
        if len(text_data) != 0:
            titles.append(text_data[0])
            subtitles.append(text_data[1])
            article_text = ' '.join(text_data[2:])
            texts.append(article_text)
            token_counts.append(len(nltk.word_tokenize(article_text)))
            current_number += 1
        else:
            found = False

    genres = [genre] * len(titles)
    data = {'genre': genres, 'title': titles, 'subtitle': subtitles, 'text': texts, 'token_counts': token_counts}
    data_frame = pd.DataFrame(data)
    return data_frame

After finishing the loop through the genre’s files, we create a data frame based on the extracted information that was stored in the specific lists. Similar to the previous step, we need to import two packages (nltk and pandas). Please also make sure that you have downloaded the ‘punkt’ data of the nltk package, since it is required to tokenize texts.

import nltk
# nltk.download('punkt')
import pandas as pd

Building the data frame

In a final step, we loop over the existing genres, extract the information per genre by calling the method implemented above, concatenate the output for every genre, and finally save the concatenated data frame as a csv with the desired separator.

def transform_texts_to_df(self, name, genre_list, delimiter='\t'):
    # Extract every genre, concatenate the results, and save them with the desired separator.
    article_df_list = list()
    for genre in genre_list:
        article_df_list.append(self.extract_genre_files(genre))
    df = pd.concat(article_df_list)
    df.to_csv(name, sep=delimiter)
    return df

After implementing the class and its methods, we need to create an instance of the ArticleCSVParser class and call the transform_texts_to_df method, providing the desired name for the resulting csv and a list containing every genre. Et voilà.

if __name__ == "__main__":
    genre_list = ['business', 'entertainment', 'politics', 'sport', 'tech']
    parser = ArticleCSVParser()
    df = parser.transform_texts_to_df('bbc_articles.csv', genre_list)
    print(df.head())

Conclusion

In this article, I showed how to transform text files into a data frame and save it as a csv/tsv file. To reuse the class for a different dataset, just create a new class that inherits from ArticleCSVParser and override the methods that have to be changed, as sketched below.
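As a hypothetical example (the subclass name, base path, and encoding below are made up for illustration), overriding the reading logic could look like this:

class CustomArticleParser(ArticleCSVParser):
    # Hypothetical parser for a dataset stored under a different base path and encoding.

    def read_and_split_file(self, genre: str, file_name: str) -> list:
        text_data = list()
        current_file = os.path.abspath(os.path.join('other_data', genre, file_name))
        if os.path.exists(current_file):
            with open(current_file, 'r', encoding='utf-8') as open_file:
                text_data = open_file.read().split('\n')
            text_data = list(filter(None, text_data))
        return text_data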

You can also find the complete code and dataset in this repository.

I hope you enjoyed it, and happy coding!

