Introduction to Pandas

Enhance your Data Science skills in Python with the Pandas API.

Photo by Damian Patkowski on Unsplash

Pandas is a Python library used to analyze, organize, and structure data. It is widely adopted in the Python community and is used by many other packages, frameworks, and modules. Pandas has a wide variety of use cases and is flexible for preparing input data for machine learning and deep learning models. Its two core data structures are the Series, a one-dimensional labeled array, and the DataFrame, a two-dimensional labeled table. Both support label-based and position-based indexing and slicing. Pandas is open source and BSD-licensed. So, let’s see how you can structure, manipulate, and organize your data with Pandas.

pandas

My GitHub repo for the code in this article can be found here:

third-eye-cyborg/Intro_to_Pandas

Recommended Prerequisites

A brief history of the Python programming language

Python – Basic Overview

A Complete Beginners Reference Guide to Python

The best IDEs and Text Editors for Python

How to interact with APIs in Python

An Overview of The Anaconda Distribution

An Overview of The PEP 8 Style Guide

Exploring Design Patterns in Python


Installing Pandas

You can install Pandas with pip or through the Anaconda Distribution.

pip install pandas
conda install pandas

Pandas works with many tabular file types by converting them into Pandas DataFrames. A DataFrame in Pandas is a two-dimensional, labeled data table. With it you can manipulate data, handle time series data, plot graphs, reshape data structures, combine data, sort data, and filter data.

You can import Pandas into your Python code as follows:

import pandas as pd
# The 'as pd' alias is optional, but it is the standard convention for importing this library.

There are multiple ways to create Pandas DataFrames. You can convert dictionaries, lists, tabular data, and Pandas Series objects into DataFrames by passing them to the pd.DataFrame() constructor. A Series object is like a one-column DataFrame, so you can think of a DataFrame as a collection of one or more Series that share an index. You can create a Series object using pd.Series().
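As a minimal sketch of the conversions described above, the snippet below builds the same small table once from a dictionary and once from a list of rows. The column names and values are illustrative only.

```python
import pandas as pd

# From a dictionary: keys become column names, values become column data.
df_from_dict = pd.DataFrame(
    {"First": [1, 2, 3, 4], "Second": ["one", "two", "three", "four"]}
)

# From a list of rows: column names are supplied separately.
rows = [[1, "one"], [2, "two"], [3, "three"], [4, "four"]]
df_from_rows = pd.DataFrame(rows, columns=["First", "Second"])

# Selecting a single column returns a Series.
first_column = df_from_dict["First"]
print(type(first_column).__name__)  # Series
print(df_from_rows.shape)           # (4, 2)
```

Both approaches give a table with four rows and two columns; which one is more convenient usually depends on how your source data is already organized.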


Exploring Pandas Series

Basic Pandas Series construction looks like this.

pd.Series([1, 2, 3, 4])

Usually you will want to keep a reference to your Series by assigning it to a variable.

my_series = pd.Series([1, 2, 3, 4])

You can assign named Pandas Series to a DataFrame as follows.

first_series = pd.Series([1, 2, 3, 4])
second_series = pd.Series(['one', 'two', 'three', 'four'])
my_df = pd.DataFrame({ 'First': first_series, 'Second': second_series })
print(my_df)
[out]
   First Second
0      1    one
1      2    two
2      3  three
3      4   four
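A Series can also carry its own index labels and a name. The following is a small sketch of label-based access, position-based access, and filtering; the labels "a" through "d" and the name "counts" are made up for illustration.

```python
import pandas as pd

# A Series with explicit index labels and a name (both illustrative).
counts = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], name="counts")

print(counts["b"])                  # label-based access -> 2
print(counts.iloc[0])               # position-based access -> 1
print(counts[counts > 2].tolist())  # boolean filtering -> [3, 4]

# to_frame() turns the Series into a one-column DataFrame named after it.
frame = counts.to_frame()
print(list(frame.columns))          # ['counts']
```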

Exploring Pandas DataFrames

You can convert many file types to Pandas DataFrames. Here are some of the more common functions used for this.

read_csv(filepath_or_buffer[, sep, ...])
read_excel(*args, **kwargs)
read_json(*args, **kwargs)
read_html(*args, **kwargs)
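As a quick sketch of how these readers are used, the example below calls pd.read_csv on an in-memory string instead of a file on disk, so it runs without any external data. The CSV content is made up for illustration; in practice you would pass a file path or URL.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a file on disk (made-up values).
csv_text = "name,score\nada,90\ngrace,85\n"

# read_csv accepts a path or any file-like object.
scores = pd.read_csv(io.StringIO(csv_text))

print(scores.shape)            # (2, 2)
print(list(scores.columns))    # ['name', 'score']
print(scores["score"].mean())  # 87.5
```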

You can also work with data from specific file types using some of these commonly used classes and functions.

# Excel
ExcelFile.parse([sheet_name, header, names, ...])
ExcelWriter(path[, engine])
# JSON
json_normalize(data[, record_path, meta, ...])
build_table_schema(data[, index, ...])
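To give a feel for one of these helpers, here is a minimal sketch of pd.json_normalize flattening nested JSON-style records into a flat table. The records are invented for the example.

```python
import pandas as pd

# Nested records, as you might get from a JSON API (made-up values).
records = [
    {"id": 1, "user": {"name": "ada", "city": "London"}},
    {"id": 2, "user": {"name": "grace", "city": "New York"}},
]

# Nested keys are flattened into dotted column names.
flat = pd.json_normalize(records)

print(sorted(flat.columns))       # ['id', 'user.city', 'user.name']
print(flat["user.name"].tolist())  # ['ada', 'grace']
```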

DataFrames also provide methods and attributes that help you structure and manipulate your data. I recommend reading the docs to understand everything you can do with DataFrames, but these are some of the most commonly used methods and attributes.

DataFrame – pandas 1.1.3 documentation

# constructor
DataFrame([data, index, columns, dtype, copy])
DataFrame.head([n])
DataFrame.tail([n])
DataFrame.values
DataFrame.dtypes
DataFrame.columns
DataFrame.size
DataFrame.shape
DataFrame.axes
DataFrame.index
DataFrame.loc
DataFrame.iloc
DataFrame.keys()
DataFrame.filter([items, like, regex, axis])
DataFrame.dropna([axis, how, thresh, ...])
DataFrame.fillna([value, method, axis, ...])
DataFrame.sort_values(by[, axis, ascending, ...])
DataFrame.sort_index([axis, level, ...])
DataFrame.append(other[, ignore_index, ...])  # deprecated in pandas 1.4 and removed in 2.0; use pandas.concat instead
DataFrame.join(other[, on, how, lsuffix, ...])
DataFrame.merge(right[, how, on, left_on, ...])
DataFrame.update(other[, join, overwrite, ...])
DataFrame.to_period([freq, axis, copy])
DataFrame.tz_localize(tz[, axis, level, ...])
DataFrame.plot([x, y, kind, ax, ....])
DataFrame.from_dict(data[, orient, dtype, ...])
DataFrame.to_pickle(path[, compression, ...])
DataFrame.to_csv([path_or_buf, sep, na_rep, ...])
DataFrame.to_sql(name, con[, schema, ...])
DataFrame.to_dict([orient, into])
DataFrame.to_excel(excel_writer[, ...])
DataFrame.to_json([path_or_buf, orient, ...])
DataFrame.to_html([buf, columns, col_space, ...])
DataFrame.transpose(*args[, copy])
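The sketch below exercises a handful of the methods and attributes listed above on a tiny DataFrame. The city names and temperatures are illustrative values, not real measurements.

```python
import pandas as pd

# A small DataFrame with one missing value (illustrative data).
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Cairo"], "temp_c": [4.0, None, 30.5]},
    index=["a", "b", "c"],
)

print(df.shape)               # (3, 2)
print(df.head(2))             # first two rows
print(df.loc["c", "temp_c"])  # label-based lookup -> 30.5
print(df.iloc[0, 0])          # position-based lookup -> Oslo

# Replace the missing temperature, then sort from warmest to coldest.
filled = df.fillna({"temp_c": 0.0})
ordered = filled.sort_values(by="temp_c", ascending=False)
print(ordered.index.tolist())  # ['c', 'a', 'b']

# pd.concat stacks DataFrames row-wise.
combined = pd.concat([df, df], ignore_index=True)
print(len(combined))           # 6
```

Note that fillna and sort_values return new DataFrames here, so the original df still contains its missing value.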

The DataFrame constructor takes five parameters: data, index, columns, dtype, and copy.
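Here is a short sketch that passes all five constructor parameters explicitly. The row and column labels are hypothetical, and NumPy is used only to supply the input array.

```python
import numpy as np
import pandas as pd

source = np.array([[1, 2], [3, 4]])

df = pd.DataFrame(
    data=source,             # the values to store
    index=["row1", "row2"],  # row labels
    columns=["x", "y"],      # column labels
    dtype="float64",         # force a single dtype for all columns
    copy=True,               # copy the input instead of sharing memory
)

# Changing the source array afterwards does not affect the DataFrame.
source[0, 0] = 99
print(df.loc["row1", "x"])                 # 1.0
print([str(t) for t in df.dtypes])         # ['float64', 'float64']
```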

DataFrames are used as input for many machine learning and deep learning models. They are also well suited to EDA (Exploratory Data Analysis). Knowing at least the basics of Pandas is a must for most data science projects in Python.


Working with Data in Pandas (Example)

I am going to use the Sunspots dataset listed on Kaggle: Monthly Mean Total Sunspot Number, 1749 – July 2018.

Kaggle: Your Machine Learning and Data Science Community

Sunspots
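A minimal sketch of how a dataset like this might be loaded and summarized follows. It is not the code from the repo. The column names "Date" and "Monthly Mean Total Sunspot Number" are assumptions about the CSV layout, and the in-memory sample uses placeholder numbers, not real observations, so that the snippet runs without the download. With the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Placeholder rows that mimic the assumed layout of the CSV.
# These numbers are NOT real sunspot observations.
sample_csv = """Date,Monthly Mean Total Sunspot Number
2000-01-31,10.0
2000-02-29,20.0
2001-01-31,30.0
2001-02-28,50.0
"""

# Parse the date column and use it as the index.
sunspots = pd.read_csv(
    io.StringIO(sample_csv), parse_dates=["Date"], index_col="Date"
)

# Shorten the long column name (the new name is an arbitrary choice).
sunspots = sunspots.rename(
    columns={"Monthly Mean Total Sunspot Number": "sunspots"}
)

# Average the monthly values within each calendar year.
yearly_mean = sunspots.groupby(sunspots.index.year)["sunspots"].mean()

print(yearly_mean.loc[2000])  # 15.0
print(yearly_mean.loc[2001])  # 40.0
```

Using the dates as the index is a design choice that makes time-based grouping and slicing straightforward; a plain integer index would also work if you only need column-wise statistics.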

Acknowledgement:

SIDC and Quandl.

Database from SIDC – Solar Influences Data Analysis Center – the solar physics research department of the Royal Observatory of Belgium. SIDC website

Creative Commons – CC0 1.0 Universal


Conclusion

The Pandas library is a powerful tool to have in Python. This article only scratches the surface of what you can accomplish with it, but you can begin to see its capabilities as soon as you start working with data in Python. Learning how Pandas works will improve your data science experience in Python by giving you more control over your input data. That means more flexibility and power both when exploring data and when working directly with it to achieve your programmatic, computational, or scientific goals. I hope this helps anyone wanting to learn more about Pandas in Python. Happy coding!

