The world’s leading publication for data science, AI, and ML professionals.

Splitting the text column and getting unique values in Python

A user case for a none-fixed-length text column

Splitting a string column and getting unique values and frequency in Python

With two methods: For loop and Counter container

Photo by Jon Tyson on Unsplash
Photo by Jon Tyson on Unsplash

What the problem and target are:

In order to simplify the problem, I take out two columns from my working file which is a Stackoverflow yearly Survey file, as below:

You can download the file from StackOverflow survey 2017, use the survey_results_public.csv file, and extract two columns: Respondent and CousinEducation to test the following codes.

My target is to get unique values for the column CousinEducation. The data in this column is a string type, separated by semi-column, but how many items (or semi-columns) in one row is none-fixed. I will introduce two methods to do it. Let’s start the journey.

Have a look at the data:

Import the library, read the file, and check the size of the file as below:

Have a look at how the data looks like, with value_counts method, which is a very common method to deal with categorical data.

Very good. It groups according to the answers of CousinEducation, but still they are still far away from what I want. Let’s rename "index" into method and CousinEducation into count to make it much more meaningful.

Method1: use for loop and list(set())

  1. Separate the column from the string using split, and the result is as follows.

Let’s check the type. Making sure the data type can help me to take the right actions, especially, when I am not so sure.

2. Create a list including all of the items, which is separated by semi-column

Use the following code:

Now how df1 looks like:

Great! We get much closer. Now search for the method of getting unique values.

3. Get the unique values

As you know, df1 is a list. We use list(set()) to get the unique value from df1:

It seems that there are some leading spaces for the same content, now delete the space.

Now we need to get a unique value again, use the same method: list_3=list(set(list_2)).

Great! We get unique values.

Everything goes well, let’s write it into a function to make it modularization.

Let’s have a test:

Super, it gets the unique values!

Now let’s try the second method.

Method 2: use of Counter container

Counter is a container that keeps track of how many times equivalent values are added. The value can be accessed by dictionary API.

  1. First, import the library:
  1. Instead of use for loop, we use Concatenate strings in the Series/Index with a given separator at the beginning.
  1. The next two steps: split the string and replace the space as before.
  1. Now use Counter container which keeps track of how many times equivalent values are added.
  1. Use keys() to get the unique values

The result is:

  1. The most exciting way is that we can get the unique list and the frequency through method most_common().

By given i=13, most_common() method will get a list as follows:

  1. Now change the list into DataFrame.
  1. Let’s modulize it.

    The result is:

Now you can see the evolution of the data.

Lesson learned:

In order to get the unique text from the Dataframe which includes multiple texts separated by semi-column, two methods are introduced here:

Method1:

  1. Use two for loops to get the list
  2. Use list(set()) to get the unique value from the list
  3. Use strip() to delete the leading or trailing spaces for the string in the list

Method2:

  1. Use Counter to get the container
  2. Use keys() method to get unique values
  3. Use most_common() method to get the unique values and frequency

Feel free to choose the one you prefer. Of course, you can replace the separator in your situation. With the unique values, depending on your question, you can do a lot of further analysis.

Thanks for your reading and happy coding.


Related Articles