When it comes to data analysis, Pandas is the most used Python library to manipulate and prepare the data for further analyses and Machine Learning.
The reality is that Pandas is a really flexible library and can be used even to convert file formats.
However, even though we use some Pandas features nearly every day, we still spend a lot of time Googling how to do things in Pandas.
I know, I caught you!
But let’s be honest: some features are difficult to remember, maybe because we can reach the same goal with different methods. So, there’s nothing to be ashamed of if we Google the same things every day.
However, saving time is always a good idea. For this reason, in this article, we’ll go through the 7 top features to manipulate Pandas columns. This way you won’t need to Google them anymore: you just need to save this article (maybe, by bookmarking it) and return to it whenever you need it.
This is what you’ll find here:
Table of contents:
How to create a new Pandas column
How to add a new column to a Pandas data frame
How to rename a column in Pandas
How to drop a Pandas column
How to find unique values in a Pandas column
How to transform a Pandas column into a list
How to sort a Pandas data frame by a column
How to create a new Pandas column
First of all, let’s remember that a Pandas column is also called a Pandas series. This means that a Pandas data frame is an ordered collection of Pandas series.
There are a few methods to create a new Pandas column. Let’s see them all.
Create a Pandas column as a Pandas series
The correct way to create a Pandas column that is meant to "live" on its own is through the pd.Series() constructor like so:
import pandas as pd
# Create a Pandas series
series = pd.Series([6, 12, 18, 24])
# Print Pandas series
print(series)
>>>
0 6
1 12
2 18
3 24
dtype: int64
I said "the correct method" because, as we’ve said, a Pandas column is a Pandas series. So, if we just need a single column we should use this method if we’d like to be "formally correct".
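A series created this way has no name. As a minimal sketch (the name "A" is only an example chosen here), we could pass the name parameter to pd.Series() and, if needed, turn the named series into a one-column data frame with to_frame():

```python
import pandas as pd

# Create a series and give it a name (the name "A" is just an example)
series = pd.Series([6, 12, 18, 24], name="A")

# Turn the named series into a one-column data frame
df = series.to_frame()

print(series.name)
print(list(df.columns))
```

The series name becomes the column name of the resulting data frame.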
Create a Pandas column as a Pandas data frame
However, the reality is that we won’t need a column on its own very often.
So, another way to create a Pandas column is by creating a new Pandas data frame with just one column: this way, we can enrich it later with other columns.
We can do it like so:
import pandas as pd
# Create a Pandas column as a Pandas data frame
df = pd.DataFrame({'A': [1, 2, 3, 4]})
# Print Pandas data frame
print(df)
>>>
A
0 1
1 2
2 3
3 4
So, the difference from the previous example is that, in this case, the Pandas column also has a name: we’ve called it "A".
NOTE:
If we take a closer look at what we've done here, we can see that
we create a Pandas data frame from a dictionary.
In fact, "A" is the key and it's separated from a list of values
by a colon. Then, both the keys and the values are inside curly braces.
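To make the dictionary idea more concrete, here is a small sketch with two keys (the column names and values are made up for illustration). Each key becomes a column name and each list becomes the values of that column:

```python
import pandas as pd

# Each key becomes a column name, each list becomes that column's values
data = {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]}
df = pd.DataFrame(data)

print(df)
```

Note that all the lists need to have the same length, otherwise Pandas raises an error.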
Create a Pandas column as a Pandas data frame, starting from a NumPy array
One of the superpowers of Pandas is that it can "accept" NumPy arrays as input values. In other words, we can create a data frame starting from a NumPy array.
In the case of a single column, we can create a one-dimensional array and transform it into a data frame: this results in a data frame with a single column.
We can do it like so:
import numpy as np
import pandas as pd
# Create a NumPy array
values = np.array([5, 10, 15, 20])
# Transform array into Pandas data frame
df = pd.DataFrame(values)
# Print data frame
print(df)
>>>
0
0 5
1 10
2 15
3 20
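As we can see, a data frame built from a bare array gets 0 as its column name. A possible way to name the column right away, sketched here with the example name "A", is the columns parameter of pd.DataFrame():

```python
import numpy as np
import pandas as pd

# Create a NumPy array
values = np.array([5, 10, 15, 20])

# Name the single column "A" instead of the default 0
df = pd.DataFrame(values, columns=["A"])

print(df)
```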
How to add a new column to a Pandas data frame
Adding a new column to a Pandas data frame is closely related to creating a new column.
What I mean here is that we first need to create a Pandas data frame, then a single Pandas column, and then add the column to the data frame.
Also in this case we have multiple ways to do so. Let’s see them all.
Adding a new column to a Pandas data frame: the standard method
The standard method to add a new column to a Pandas data frame is to create the data frame, then create a separate column, then add it to the data frame.
We’ll use this method throughout all the following examples. So, here’s how we can do so:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4]})
# Add a new column by using a list of values
df['B'] = [20, 30, 40, 50]
# Print data frame
print(df)
>>>
A B
0 1 20
1 2 30
2 3 40
3 4 50
So, let’s analyze what we’ve done step by step:
- We’ve created a Pandas data frame with the method pd.DataFrame().
- We’ve created a new column with df['B'], meaning we called this new column "B".
- We’ve assigned the values to the newly created column, with a list of numbers.
So, if we already have a data frame, another way to create a new column is to assign a list of values to it.
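Besides the bracket assignment shown above, Pandas also offers assign() and insert(). The following is just a sketch with made-up column names ("B" and "Z") to show how they differ from each other:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})

# assign() returns a new data frame and leaves df untouched
df_new = df.assign(B=[20, 30, 40, 50])

# insert() modifies df in place and puts the column at position 0
df.insert(0, "Z", [0, 0, 0, 0])

print(list(df_new.columns))
print(list(df.columns))
```

So, assign() is handy when we want to keep the original data frame, while insert() is useful when the position of the new column matters.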
Adding a new column to a Pandas data frame: applying functions
The standard method also lets us create a new column and add it to an existing data frame in one line of code.
For example, say that we want to create two new columns derived from an existing column. We can do so by applying functions to existing columns like so:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4]})
# Create a column doubling the values of column A
df['B'] = df['A'] * 2
# Apply lambda function to column A to create column C
df['C'] = df['A'].apply(lambda x: x ** 2)
# Print data frame
print(df)
>>>
A B C
0 1 2 1
1 2 4 4
2 3 6 9
3 4 8 16
So, here’s what we’ve done:
- We’ve created a Pandas column ("A") as a data frame.
- We’ve created column "B" by doubling the values of column "A".
- We’ve created column "C" by applying a lambda function to column "A". In particular, in this case, we’re squaring the values of column "A".
All of these columns are stored together in a single data frame.
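The same idea works when a new column depends on more than one column, or on a condition. Here’s a minimal sketch, where the column names "total" and "A_is_even" are invented for this example and np.where() is just one possible way to express a condition:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# Combine two existing columns element by element
df["total"] = df["A"] + df["B"]

# Build a column from a condition on another column
df["A_is_even"] = np.where(df["A"] % 2 == 0, "yes", "no")

print(df)
```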
Adding a new column to a Pandas data frame: using Pandas series or single Pandas columns
Of course, we can add columns to a Pandas data frame even when the columns are Pandas series or Pandas data frames.
Here’s how we can do so:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4]})
# Create a new column using pd.Series()
values = pd.Series([5, 10, 15, 20]) # Create series
df['B'] = values # Add series to data frame as a column
# Print data frame
print(df)
>>>
A B
0 1 5
1 2 10
2 3 15
3 4 20
So, in the above case, we’ve created a Pandas series and, then, we’ve added it to the existing data frame by giving it a name.
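One detail worth keeping in mind: when we assign a Pandas series, the values are matched by index label, not by position. The sketch below uses a series with a deliberately shifted index (an assumption made just for this example) to show the effect:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})

# This series has index labels 2, 3, 4, 5 instead of 0, 1, 2, 3
values = pd.Series([5, 10, 15, 20], index=[2, 3, 4, 5])

df["B"] = values             # aligned on index labels: rows 0 and 1 get NaN
df["C"] = values.to_numpy()  # plain array: assigned by position

print(df)
```

So, if the indexes don’t match, converting the series to an array (or a list) first makes the assignment positional.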
In the case of a Pandas column created as a Pandas data frame, we have:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
# Create a Pandas column as a data frame
df['C'] = pd.DataFrame({'C': [9, 10, 11, 12]})
# Print data frame
print(df)
>>>
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
And here we are.
NOTE:
Of course, the same methodology can be applied if we create a column
as a NumPy array. We won't show the method here as "the game" should
now be clear.
How to rename a column in Pandas
Renaming a Pandas column (or more than one) is another typical daily task, but one whose syntax we often can’t remember.
Also in this case, we have different methods to do so. Let’s see them all.
How to rename a Pandas column: the rename() method
We can rename a Pandas column with the rename() method like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Renaming a single column
df = df.rename(columns={'A': 'NewA'})
# Print data frame
print(df)
>>>
NewA B
0 1 4
1 2 5
2 3 6
So, it’s like we’re using a dictionary. Inside the rename() method, in fact, we need to pass the argument columns and specify the current name and the new name inside curly braces, separating them with a colon. Just like we do in dictionaries.
Of course, we can use this method to rename multiple columns like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Rename multiple columns
df = df.rename(columns={'A': 'NewA', 'B': 'NewB'})
# Print data frame
print(df)
>>>
NewA NewB
0 1 4
1 2 5
2 3 6
And, again, it’s as if we were working with dictionaries.
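A thing to be aware of is that rename() silently ignores keys that don’t match any column, so a typo can go unnoticed. As a sketch (the column name "Z" is a deliberate mistake for the sake of the example), the errors parameter lets us turn that into an exception:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# By default, a key that matches no column is silently ignored
unchanged = df.rename(columns={"Z": "NewZ"})
print(list(unchanged.columns))

# With errors="raise", the same typo surfaces as a KeyError
try:
    df.rename(columns={"Z": "NewZ"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```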
How to rename a Pandas column: the columns attribute
To rename one Pandas column (or more than one, as we’ll see) we can use the columns attribute like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Renaming all columns
df.columns = ['NewA', 'NewB']
# Print data frame
print(df)
>>>
NewA NewB
0 1 4
1 2 5
2 3 6
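So, in this case, the columns attribute gives us the possibility to use a list of strings to rename the columns. The list must contain one name for every column, in order, even for the columns we don’t want to rename.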
How to rename a Pandas column: the set_axis() method
To rename one (or more than one) Pandas column we can use the set_axis() method like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Renaming all columns
df = df.set_axis(['NewA', 'NewB'], axis=1)
# Print data frame
print(df)
>>>
NewA NewB
0 1 4
1 2 5
2 3 6
So, even in this case, we use a list of strings to rename the columns, but here we also need to pass axis=1 to tell set_axis() that the new labels are for the columns rather than for the rows. The method returns a new data frame, so we assign the result back to df. Note that older tutorials pass inplace=True here, but that parameter was removed from set_axis() in Pandas 2.0.
How to rename a Pandas column: using lambda functions
When we have to deal with strings, as in the case of Pandas column names, we can use lambda functions to modify the characters of the text.
For example, we may want (or need) to rename the columns by simply converting them to lowercase. We can do it like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'COLUMN_1': [1, 2, 3], 'COLUMN_2': [4, 5, 6]})
# Renaming columns using a lambda function
df = df.rename(columns=lambda x: x.lower()) # Lowercase column names
# Print data frame
print(df)
>>>
column_1 column_2
0 1 4
1 2 5
2 3 6
And here we are.
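As an alternative to a lambda function, the string methods available on df.columns can be chained. This is only a sketch, with messy column names invented for the example, of how a typical cleanup might look:

```python
import pandas as pd

df = pd.DataFrame({" First Name ": ["Ada"], "LAST NAME": ["Lovelace"]})

# Strip surrounding spaces, lowercase, and replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```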
How to drop a Pandas column
Dropping a Pandas column (or more than one) is another task we need to perform very often: maybe because its values are not significant, maybe because its values are all NULL, or for other reasons.
To perform this task we have two methods. Let’s see them both.
How to drop a Pandas column: using the drop() method
The typical way to drop a Pandas column (or more than one) is by using the drop() method.
Here the only thing to keep in mind is whether we want to drop some columns and create a new data frame, or drop them and overwrite the current data frame.
Let me show the difference:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# Drop one column and substitute the current data frame
df = df.drop('A', axis=1)
# Print updated data frame
print(df)
>>>
B C
0 4 7
1 5 8
2 6 9
So, we’ve dropped the column "A" by using the drop() method, specifying the name of the column we wanted to drop and the axis (axis=1 in Pandas refers to the columns; without it, drop() would look for a row labeled "A").
In this case, we’ve decided to overwrite the data frame df. So, at the end of the process, the data frame df no longer has column "A".
Instead, if we want to create another data frame, let’s say we call it df_2, we have to do it like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
'D': [10, 11, 12]})
# Drop two columns and store the result in a new data frame
df_2 = df.drop(['A', 'D'], axis=1)
# Print new data frame
print(df_2)
>>>
B C
0 4 7
1 5 8
2 6 9
So, in this case, we’ve dropped two columns and created a new data frame with just columns "B" and "C".
This may be useful if we think we may need the original data frame df in the future, for further analyses.
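If remembering the axis is annoying, drop() also accepts a columns parameter. The sketch below also shows errors="ignore", using a made-up label ("not_a_column") that doesn’t exist in the data frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# columns= is equivalent to passing the labels together with axis=1
df_2 = df.drop(columns=["A"])

# errors="ignore" skips labels that are not present instead of raising
df_3 = df.drop(columns=["A", "not_a_column"], errors="ignore")

print(list(df_2.columns))
print(list(df_3.columns))
```

In both cases the original df is left untouched, because we assigned the results to new variables.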
How to drop a Pandas column: using the column index
In Pandas, columns can also be singled out via their positional indexes. This means that we can drop them using indexes like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
'D': [10, 11, 12]})
# Drop the first two columns by position and store the result in a new data frame
df_2 = df.drop(df.columns[[0, 1]], axis=1)
# Print new data frame
print(df_2)
>>>
C D
0 7 10
1 8 11
2 9 12
So, in this case, we’ve created a new data frame with just columns "C" and "D" and we’ve deleted columns "A" and "B" by using their indexes.
Remember that in Python we start counting from 0 (so, the first column is at index 0 and is column "A"). This method may not be optimal if we have tens of columns, for a simple reason: we have to find the one (or the ones) we want to drop by counting them, which is error-prone.
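One possible way to avoid counting by hand is to ask Pandas for the position of a column we know by name. The following sketch assumes we want to drop everything that comes before column "C":

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [10, 11, 12]})

# Ask Pandas for the position of a column instead of counting by hand
position = df.columns.get_loc("C")
print(position)

# Drop every column that comes before "C"
df_2 = df.drop(columns=df.columns[:position])
print(list(df_2.columns))
```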
How to find unique values in a Pandas column
Finding unique values in a Pandas column is another task that we may need to perform daily, because duplicated values have to be treated in a particular way.
Also in this case we have a couple of methods to do so: one shows the duplicates in one column and the other removes them.
Let’s see them both.
How to find unique values in a Pandas column: using the value_counts() method to find duplicates
If we want to see whether a Pandas column has duplicated values, and we also want to see how many there are, we can use the value_counts() method like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 1, 3], 'B': [4, 5, 6, 7, 8,],
'C': [7, 8, 9, 10, 11]})
# Count how many times each value appears in column A
counts = df['A'].value_counts()
# Print the counts
print(counts)
>>>
1 2
3 2
2 1
Name: A, dtype: int64
So, the result here tells us that:
- The counted column is "A", and the counts are stored as "int64".
- We have two 1s.
- We have two 3s.
- We have one 2.
So, it shows us the values and tells us how many of them are present in the column of our interest.
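If we only need the distinct values, without the counts, a short sketch with the unique() and nunique() methods could look like this:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 3]})

# unique() returns the distinct values in order of first appearance
distinct = df["A"].unique().tolist()

# nunique() returns how many distinct values there are
n_distinct = df["A"].nunique()

print(distinct, n_distinct)
```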
How to find unique values in a Pandas column: using the drop_duplicates() method to drop duplicates
If we want to drop the duplicate values in a Pandas column (because we know there are duplicates in it) we can use the drop_duplicates() method like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 1, 3], 'B': [4, 5, 6, 7, 8,],
'C': [7, 8, 9, 10, 11]})
# Drop the duplicate values in a Pandas column
unique_values = df['A'].drop_duplicates()
# Print unique values
print(unique_values)
>>>
0 1
1 2
2 3
So, we have removed the duplicates from column "A", creating a new Pandas series called unique_values. Note that the original column in df is left unchanged.
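The same method also exists on data frames, where we often want to drop whole rows based on the duplicates of one column. Here’s a sketch using the subset and keep parameters (the variable names are chosen only for this example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 3], "B": [4, 5, 6, 7, 8]})

# Keep only the first row for each distinct value of column A
first_rows = df.drop_duplicates(subset="A")

# Keep the last row for each distinct value instead
last_rows = df.drop_duplicates(subset="A", keep="last")

print(first_rows)
print(last_rows)
```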
How to find unique values in a Pandas column: studying a data frame
At this point, you may be asking: "Well, if I have a big data frame with tens of columns, how can I know whether it contains duplicates at all?"
Good question! What we can do is first study the whole data frame.
For example, we may want to see if the data frame contains any duplicated rows. We can do it like so:
import pandas as pd
# Creating a DataFrame with duplicates
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 4], 'B': [5, 6, 6, 7, 8, 8]})
# Check if there are duplicates in the DataFrame
has_duplicates = df.duplicated().any()
# Print the result
print(has_duplicates)
>>>
True
So, this code returns "True" if at least one row is an exact duplicate of an earlier row and "False" otherwise. Note that duplicated() compares whole rows, not individual columns.
And what if we want to see which rows are actually duplicated? We can do it like so:
import pandas as pd
# Creating a DataFrame with duplicates
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 4], 'B': [5, 6, 6, 7, 8, 8]})
# Find duplicate rows
duplicate_rows = df.duplicated()
# Print the duplicate rows
print(df[duplicate_rows])
>>>
A B
2 2 6
5 4 8
And, so, the above code shows:
- The index of each duplicated row (the first occurrence of a row is not flagged).
- The values of the duplicated rows.
We can now investigate further with the value_counts() method or drop them with the drop_duplicates() method.
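If what we really want is the names of the columns that contain repeated values, one possible approach (a sketch, not the only way) is to run duplicated() on each column separately with apply():

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 3], "B": [5, 6, 7, 8]})

# For each column, check whether any value appears more than once
has_dupes = df.apply(lambda col: col.duplicated().any())

# Keep only the names of the columns flagged as True
cols_with_dupes = has_dupes[has_dupes].index.tolist()

print(cols_with_dupes)
```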
How to transform a Pandas column into a list
Transforming a Pandas column into a list is a useful feature that can give us the possibility to "isolate" all the values from a Pandas column to put them into a list. Then, we can do whatever we may need with a list, which is easily manageable (iterating, and so on).
We have two possibilities to do this transformation.
How to transform a Pandas column into a list: using the list() function
list() is a built-in Python function that converts an iterable object into a list. We can use it like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 1, 3], 'B': [4, 5, 6, 7, 8,],
'C': [7, 8, 9, 10, 11]})
# Transform Pandas column into a list
column_list = list(df['B'])
# Print list
print(column_list)
>>>
[4, 5, 6, 7, 8]
So, we’ve easily extracted our values and put them into a list.
How to transform a Pandas column into a list: using the to_list() method
To achieve the same result, we can use the to_list() method from Pandas. But take care: it’s an alias of the older tolist() method and has been available since Pandas 0.24, so very old installations may only have tolist().
We can use it like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 1, 3], 'B': [4, 5, 6, 7, 8,],
'C': [7, 8, 9, 10, 11]})
# Transform Pandas column into a list
column_list = df['B'].to_list()
# Print list
print(column_list)
>>>
[4, 5, 6, 7, 8]
And, of course, we’ve obtained the same result as before.
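If we need all the columns as lists at once, a possible shortcut is to_dict() with the "list" orientation. This sketch uses a small data frame made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# One list per column, keyed by column name
columns_as_lists = df.to_dict("list")

print(columns_as_lists)
```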
How to sort a Pandas data frame by a column
There are a lot of situations in which we need to sort our columns. By sorting we mean ordering, so we can choose to order the data in ascending or descending order.
We can reach this goal with the following methods.
How to sort a Pandas data frame by a column: using the sort_values() method
To sort a Pandas data frame by a column, we can use the sort_values() method like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [10, 2, 7, 1, 15], 'B': [4, 2, 6, 28, 8,],
'C': [7, 1, 9, 10, 19]})
# Sort df by A in ascending order
df.sort_values('A', ascending=True, inplace=True)
# Print sorted data frame
print(df)
>>>
A B C
3 1 28 10
1 2 2 1
2 7 6 9
0 10 4 7
4 15 8 19
So, as we can see, the data frame has been sorted with column "A" in ascending order. In fact, if we check:
- In the initial data frame, in column "A" the number 1 is in the 4th position. In column "B", the number 28 is in the fourth position.
- In the sorted data frame, in column "A", the number 1 is in the first position. In column "B", the number 28 is in the first position.
So, we sort the data frame but we don’t lose the relations between the values of the columns.
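The same method can also sort by more than one column, each with its own direction. As a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2, 1], "B": [4, 3, 6, 5]})

# Sort by A ascending, then by B descending within each value of A
sorted_df = df.sort_values(["A", "B"], ascending=[True, False])

print(sorted_df)
```

Here we didn’t use inplace=True, so df keeps its original order and the sorted result lives in sorted_df.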
A very useful feature of this method is that it can sort a column while putting NaNs first, through the na_position parameter. We can do it like so:
import numpy as np
import pandas as pd
# Creating a DataFrame with missing values in column A
df = pd.DataFrame({'A': [10, np.nan, np.nan, 1, 15], 'B': [4, 2, 6, 28, 8,],
'C': [7, 1, 15, 10, 19]})
# Sort by A, putting NaNs at the beginning
df.sort_values('A', ascending=True, na_position='first', inplace=True)
# Print sorted data frame
print(df)
>>>
A B C
1 NaN 2 1
2 NaN 6 15
3 1.0 28 10
0 10.0 4 7
4 15.0 8 19
And here we are.
How to sort a Pandas data frame by a column: using the sort_index() method
We can also sort a data frame by its index values like so:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'A': [10, 2, 7, 1, 15], 'B': [4, 2, 6, 28, 8,],
'C': [7, 1, 9, 10, 19]})
# Sort data frame by index
df.sort_index(inplace=True)
# Print sorted data frame
print(df)
>>>
A B C
0 10 4 7
1 2 2 1
2 7 6 9
3 1 28 10
4 15 8 19
And, as we can see, the indexes are ordered (in ascending order).
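Sorting by index becomes more interesting after the rows have been shuffled by a previous sort. The sketch below (variable names are just examples) shows how sort_index() restores the original order, and how reset_index() renumbers the rows instead:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 2, 7, 1, 15]})

by_value = df.sort_values("A")    # index is now 3, 1, 2, 0, 4
restored = by_value.sort_index()  # back to the original row order

# Keep the sorted order, but renumber the rows from 0
renumbered = by_value.reset_index(drop=True)

print(restored)
print(renumbered)
```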
Conclusions
In this article, we’ve seen the top 7 operations on Pandas columns that we perform nearly every day.
This guide will help you save a lot of time if you bookmark it: we’ve performed the same tasks in different ways, so you won’t need to Google them anymore.
Hi, I’m Federico Trotta and I’m a freelance Technical Writer.