The Way of the Serpent

Pythonic Tips & Tricks: Frequency Extraction

How to construct a DataFrame from a List

Tonichi Edeza

Published in

Towards Data Science

5 min read · Feb 8, 2021

Python is an excellent language for new programmers to learn. Its intuitive nature makes it easy for most people to quickly understand what the code and algorithms are actually doing. In this article and many others in the future, we will go over the many ways one can use Python to solve problems ranging from beginner exercises to more complicated ones. Note that for all these articles I will be using a Jupyter Notebook to show the results of the code.

Let’s begin!

So let’s imagine you were given a list of words and had to extract specific metrics from it. How would you go about that?

The Words

# Adjacent string literals are joined, so the list still holds ONE string.
wordy_list = ['dog cat bear tiger tiger tiger dog dog cat cat bear '
              'tiger tiger panda panda panda iguana iguana']

The first thing you may notice is that the list above does not lend itself very well to data analysis. A key issue is that the data is kept as a single string. Our first step then is to break the string into a list of words. To do that we use Python’s trusty split() method.

list_of_words = wordy_list.split()

Oh no, it seems we have an error: Python raises an AttributeError because lists have no split() method. The reason is simple. The variable is technically a list, but it is a list that contains only a single string as its element. Such issues are quite common in the field of data science, so we must be vigilant and spot them as early as possible. To bypass this error we can simply index into the list and call split() on the string itself.

list_of_words = wordy_list[0].split()
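To make the difference concrete, here is a minimal sketch that reproduces the error and the fix on a smaller, made-up list. The names `sample`, `error_message` and `words` are hypothetical and used only for illustration.

```python
# A one-element list holding a single space-separated string (illustrative data).
sample = ['dog cat bear dog']

try:
    sample.split()  # fails: split() belongs to strings, not lists
except AttributeError as err:
    error_message = str(err)
    print(error_message)

# Index into the list first, then split the string on whitespace.
words = sample[0].split()
print(words)
```

With no arguments, split() breaks the string on any run of whitespace, so stray double spaces or line breaks in the raw text should not produce empty entries.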

Excellent, this list lends itself well to data analysis. So let’s begin with some basic tasks.

Count the number of times each word appears

A fairly simple task that you will encounter many times throughout your data science journey is frequency analysis. The function below returns the distinct elements of a list as a set of tuples of the form (element, frequency):

def word_counter(word_list):
    new_list = []
    for w in word_list:
        number = word_list.count(w)
        datapoint = (w, number)
        new_list.append(datapoint)
    return set(new_list)

counted_list = word_counter(list_of_words)

Great, the task is now complete!

But let us say you want to impress your boss and your colleagues. You should not give them code like this. Though it fulfills its function, there are more aesthetic ways to write it. Below is an example that does the exact same task but is written much more cleanly.

def word_counter(word_list):
    return set([(element, word_list.count(element)) 
               for element in word_list])

counted_list = word_counter(list_of_words)

We can see that the function body has now been reduced to a single expression. Not only does the code retain all its functionality, it does so in a way that looks simple yet elegant.
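One caveat worth knowing: calling count() once per element rescans the whole list each time, which can get slow on long lists. As an alternative sketch (not the approach used in this article), the standard library's collections.Counter tallies everything in a single pass. The function name `word_counter_fast` is hypothetical.

```python
from collections import Counter

def word_counter_fast(word_list):
    # Counter walks the list once and stores element -> frequency.
    counts = Counter(word_list)
    # Return a set of (element, frequency) tuples, matching the earlier output shape.
    return set(counts.items())

words = ('dog cat bear tiger tiger tiger dog dog cat cat bear '
         'tiger tiger panda panda panda iguana iguana').split()
counted = word_counter_fast(words)
print(counted)
```

Because the result is a set, the order in which the tuples print is not guaranteed, just as with the versions above.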

Count the number of times each word appears AND arrange the output from the most frequent to the least frequent

Now your boss wants you to construct the frequency count BUT ALSO wants you to return the data arranged from the most frequent word to the least frequent one. Of course we can simply process the data after returning it, but to save us time let us build this feature into the function. To do this let us make use of the sorted() function as well as a lambda function.

def word_counter(word_list):
    new_list = [(element, word_list.count(element)) 
                for element in set(word_list)]
    return sorted(new_list, key=lambda j: j[1], reverse=True)

counted_list = word_counter(list_of_words)

Excellent, we have now fulfilled the task.
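For comparison, here is a hedged sketch of the same idea using collections.Counter and its most_common() method, which returns (element, count) pairs from the most frequent to the least frequent. This is a swapped-in standard-library technique, not the function built above, and `word_counter_sorted` is a hypothetical name.

```python
from collections import Counter

def word_counter_sorted(word_list):
    # most_common() with no argument returns every (element, count) pair,
    # ordered from highest count to lowest.
    return Counter(word_list).most_common()

words = ('dog cat bear tiger tiger tiger dog dog cat cat bear '
         'tiger tiger panda panda panda iguana iguana').split()
ranked = word_counter_sorted(words)
print(ranked[0])  # the most frequent word and its count
```

Note that neither this sketch nor the sorted() version says anything about alphabetical order among words with equal counts, so ties still need their own rule.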

LAST MINUTE CHANGES!!!

Let’s say your boss calls you at 7am and urgently needs you to make changes to your code.

“What’s the issue?” You may ask.

“We need the data arranged from most frequent to least frequent AND THEN by alphabetical order. We also need it stored as a data frame,” your boss says.

Believe me, such scenarios are quite common in the world of data science. Let us comply with the request. The code below ensures that the output is arranged from the most frequent to the least frequent element AND THEN, among elements with the same frequency, in alphabetical order. Additionally, we load the data into a pandas DataFrame. Note that we must import the pandas module.

import pandas as pd

def word_counter(word_list):
    new_list = [(element, word_list.count(element)) 
                for element in set(word_list)]
    return pd.DataFrame(sorted(new_list, 
                               key=lambda x:(-x[1], x[0])),
                        columns=['element', 'frequency_count'])

df_counted = word_counter(list_of_words)

Great, that should satisfy our boss and other stakeholders.
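If the sort key looks cryptic, the small sketch below isolates it on made-up data (`pairs` and `ordered` are hypothetical names). Python compares tuples item by item, so the key (-count, word) sorts first by descending count, since negating the count flips the order, and then alphabetically whenever counts are tied.

```python
# Illustrative (element, frequency) pairs in no particular order.
pairs = [('cat', 3), ('tiger', 5), ('bear', 2), ('dog', 3), ('iguana', 2)]

# Negated count first (largest count sorts earliest), then the word itself.
ordered = sorted(pairs, key=lambda x: (-x[1], x[0]))
print(ordered)
# [('tiger', 5), ('cat', 3), ('dog', 3), ('bear', 2), ('iguana', 2)]
```

The negation trick only works for numeric fields. It lets a single sorted() call mix a descending field with an ascending one, which reverse=True alone cannot do.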

In Conclusion

We’ve seen some of the ways Python can help facilitate data analysis. For this article, the sample list I used was quite short, but in reality such tasks would involve thousands if not hundreds of thousands of unstructured datapoints. In future lessons we will go over other ways we can use Python to help us conduct data analysis, but for now I hope I was able to give you an idea of how to apply Python to simple tasks.