Pandas Basics Cheat Sheet (2021), Python for Data Science
The Pandas library is one of the most powerful libraries in Python. It is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.
Check out the sections below to learn the various functions and tools Pandas offers.
Sections:
1. Pandas Data Structures
2. Dropping
3. Sort & Rank
4. Retrieving Series/DataFrame Information
5. DataFrame Summary
6. Selection
7. Applying Functions
8. Data Alignment
9. In/Out
Pandas Data Structures
There are two main types of data structures that the Pandas library is centered around. The first is a one-dimensional array called a Series, and the second is a two-dimensional table called a Data Frame.
- Series — One dimensional labeled array
>>> s = pd.Series([3, -5, 7, 4], index = ['a','b','c','d'])
a 3
b -5
c 7
d 4
- Data Frame — A two dimensional labeled data structure
>>> data = {'Country':['Belgium','India','Brazil'], 'Capital':['Brussels','New Delhi','Brasilia'], 'Population':['111907','1303021','208476']}>>> df = pd.DataFrame(data, columns = ['Country','Capital','Population']) Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
Dropping
In this section, you’ll learn how to remove specific values from a Series, and how to remove columns or rows from a Data Frame.
s and df in the code below are used as examples of a Series and Data Frame throughout this section.
>>> sa 6
b -5
c 7
d 4>>> df Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Drop values from rows (axis = 0)
>>> s.drop(['a','c']) b -5
d 4
- Drop values from columns (axis = 1)
>>> df.drop('Country', axis = 1) Capital Population
0 Brussels 111907
1 New Delhi 1303021
2 Brasilia 208476
Sort & Rank
In this section, you’ll learn how to sort Data Frames by an index, or column, along with learning how to rank column values.
df in the code below is used as an example Data Frame throughout this section.
>>> df Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Sort by labels along an axis
>>> df.sort_index() Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Sort by values along an axis
>>> df.sort_values(by = 'Country') Country Capital Population
0 Belgium Brussels 111907
2 Brazil Brasilia 208476
1 India New Delhi 1303021
- Assign ranks to entries
>>> df.rank() Country Capital Population
0 1.0 2.0 1.0
1 3.0 3.0 2.0
2 2.0 1.0 3.0
Retrieving Series/DataFrame Information
In this section, you’ll learn how to retrieve info from a Data Frame that includes the dimensions, column names column types, and index range.
df in the code below is used as an example Data Frame throughout this section.
>>> df Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- (rows, columns)
>>> df.shape
(3, 3)
- Describe index
>>> df.index
RangeIndex(start=0, stop=3, step=1)
- Describe DataFrame columns
>>> df.columns
Index(['Country', 'Capital', 'Population'], dtype='object')
- Info on DataFrame
>>> df.info() <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Country 3 non-null object
Capital 3 non-null object
Population 3 non-null object
dtypes: object(3)
memory usage: 152.0+ bytes
- Number of non-NA values
>>> df.count()Country 3
Capital 3
Population 3
DataFrame Summary
In this section, you’ll learn how to retrieve summary statistics of a Data Frame which include the sum of each column, min/max values of each column, mean values of each column, and others.
df in the code below is used as an example of a Data Frame throughout this section.
>>> df Even Odd
0 2 1
1 4 3
2 6 5
- Sum of values
>>> df.sum()
Even 12
Odd 9
- Cumulative sum of values
>>> df.cumsum() Even Odd
0 2 1
1 6 4
2 12 9
- Minimum value
>>> df.min()
Even 2
Odd 1
- Maximum value
>>> df.max()
Even 6
Odd 5
- Summary statistics
>>> df.describe() Even Odd
count 3.0 3.0
mean 4.0 3.0
std 2.0 2.0
min 2.0 1.0
25% 3.0 2.0
50% 4.0 3.0
75% 5.0 4.0
max 6.0 5.0
- Mean of values
>>> df.mean()
Even 4.0
Odd 3.0
- Median of values
>>> df.median()
Even 4.0
Odd 3.0
Selection
In this section, you’ll learn how to retrieve specific values from a Series and Data Frame.
s and df in the code below are used as examples of a Series and Data Frame throughout this section.
>>> sa 6
b -5
c 7
d 4>>> df Country Capital Population
0 Belgium Brussels 111907
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Get one element
>>> s['b']
-5
- Get subset of a DataFrame
>>> df[1:] Country Capital Population
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Select single value by row & column
>>> df.iloc[0,0]
'Belgium'
- Select single value by row and column labels
>>> df.loc[0,'Country']
'Belgium'
- Select single row of subset rows
>>> df.ix[2]Country Brazil
Capital Brasilia
Population 208476
- Select a single column of subset of columns
>>> df.ix[:,'Capital']0 Brussels
1 New Delhi
2 Brasilia
- Select rows and columns
>>> df.ix[1,'Capital']
'New Delhi'
- Use filter to adjust DataFrame
>>> df[df['Population'] > 120000] Country Capital Population
1 India New Delhi 1303021
2 Brazil Brasilia 208476
- Set index a of Series s to 6
>>> s['a'] = 6 a 6
b -5
c 7
d 4
Applying Functions
In this section, you’ll learn how to apply a function to all values of a Data Frame or a specific column.
df in the code below is used as an example of a Data Frame throughout this section.
>>> df Even Odd
0 2 1
1 4 3
2 6 5
- Apply function
>>> df.apply(lambda x: x*2) Even Odd
0 4 2
1 8 6
2 12 10
Data Alignment
In this section, you’ll learn how to add, subtract, and divide two series that have different indexes from one another.
s and s3in the code below are used as examples of Series throughout this section.
>>> sa 6
b -5
c 7
d 4>>> s3a 7
c -2
d 3
- Internal Data Alignment
>>> s + s3a 13.0
b NaN
c 5.0
d 7.0#NA values are introduced in the indices that don't overlap
- Arithmetic Operations with Fill Methods
>>> s.add(s3, fill_value = 0)a 13.0
b -5.0
c 5.0
d 7.0>>> s.sub(s3, fill_value = 2)a -1.0
b -7.0
c 9.0
d 1.0>>> s.div(s3, fill_value = 4)a 0.857143
b -1.250000
c -3.500000
d 1.333333
In/Out
In this section, you’ll learn how to read a CSV file, Excel file, and SQL Query into Python using Pandas. You will also learn how to export a Data Frame from Pandas into a CSV file, Excel file, and SQL Query.
- Read CSV file
>>> pd.read_csv('file.csv')
- Write to CSV file
>>> df.to_csv('myDataFrame.csv')
- Read Excel file
>>> pd.read_excel('file.xlsx')
- Write to Excel file
>>> pd.to_excel('dir/'myDataFrame.xlsx')
- Read multiple sheets from the same file
>>> xlsx = pd.ExcelFile('file.xls')>>> df = pd.read_excel(xlsx, Sheet1')
- Read SQL Query
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.read_sql('SELECT * FROM my_table;', engine)
>>> pd.read_sql_table('my_table', engine)
- Write to SQL Query
>>> pd.to_sql('myDF', engine)
Python is the top dog when it comes to data science for now and in the foreseeable future. Knowledge of Pandas, one of its most powerful libraries is often a requirement for Data Scientists today.
Use this cheat sheet as a guide in the beginning and come back to it when needed, and you’ll be well on your way to mastering the Pandas library.