Pandas is a Python library for analyzing, organizing, and structuring data. It is widely adopted in the Python community and is used by many other packages, frameworks, and modules. Pandas covers a wide variety of use cases and is flexible enough to prepare input data for machine learning, deep learning, and neural network models. With tools like the one-dimensional Series and the two-dimensional DataFrame, you can build labeled collections that support powerful indexing and slicing. Pandas is open source and released under the BSD license. So, let’s see how you can structure, manipulate, and organize your data with Pandas.
My GitHub repo for the code in this article can be found here:
Recommended Prerequisites
A brief history of the Python programming language
A Complete Beginners Reference Guide to Python
The best IDEs and Text Editors for Python
How to interact with APIs in Python
An Overview of The Anaconda Distribution
Table of Contents
- Installing Pandas
- Exploring Pandas Series
- Exploring Pandas DataFrames
- Working with Data in Pandas (Example)
- Conclusion
Installing Pandas
You can install Pandas with pip or through the Anaconda Distribution.
pip install pandas
conda install pandas
Pandas integrates with many tabular file types and works with them by converting them into Pandas DataFrames. A DataFrame in Pandas is a two-dimensional, labeled data table. With it you can manipulate data, handle time series data, plot graphs, reshape data structures, combine data, sort data, and filter data.
You can import Pandas into your Python code as follows:
import pandas as pd
# The 'as pd' alias is optional, but it is the standard convention for importing this library.
There are multiple ways to create Pandas DataFrames. You can convert dictionaries, lists, tabular data, and Pandas Series objects into DataFrames, or you can create them directly with the pd.DataFrame() constructor. A Series object is like a one-column DataFrame, so you can think of a DataFrame as a collection of one or more Series that share an index. You can create a Series object using pd.Series().
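The creation routes mentioned above can be sketched as follows. This is a minimal illustration, not the only way to do it; the column names `id` and `label` and the sample values are hypothetical.

```python
import pandas as pd

# From a dictionary: keys become column names, values become column data.
from_dict = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})

# From a list of rows: column names are supplied separately.
from_rows = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]], columns=["id", "label"])

# From a single Series: to_frame() turns it into a one-column DataFrame,
# and the Series name becomes the column name.
from_series = pd.Series([1, 2, 3], name="id").to_frame()

print(from_dict)
print(from_rows)
print(from_series)
```

All three objects are ordinary DataFrames, so everything that follows in this article applies to them equally.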
Exploring Pandas Series
Basic Pandas Series construction looks like this.
pd.Series([1, 2, 3, 4])
Usually you will want to assign your Series to a variable so that you can refer to it later.
my_series = pd.Series([1, 2, 3, 4])
You can combine Series into a DataFrame by passing them in a dictionary, where each key becomes a column name.
first_series = pd.Series([1, 2, 3, 4])
second_series = pd.Series(['one', 'two', 'three', 'four'])
my_df = pd.DataFrame({ 'First': first_series, 'Second': second_series })
print(my_df)
[out]
   First  Second
0      1     one
1      2     two
2      3   three
3      4    four
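To see the relationship between the two structures in the other direction, here is a small sketch that pulls a Series back out of the DataFrame built above and slices it by position and by label. The variable names other than `my_df` are only illustrative.

```python
import pandas as pd

my_df = pd.DataFrame({
    "First": pd.Series([1, 2, 3, 4]),
    "Second": pd.Series(["one", "two", "three", "four"]),
})

# Selecting a single column returns a Series.
first_column = my_df["First"]
print(type(first_column).__name__)

# Position-based slicing with iloc: rows at positions 1 and 2 (the end is excluded).
middle_rows = my_df.iloc[1:3]
print(middle_rows)

# Label-based selection with loc: the slice end IS included, so this
# returns the rows labelled 0, 1, and 2 of the "Second" column.
first_three_words = my_df.loc[0:2, "Second"]
print(first_three_words.tolist())
```

The difference between `iloc` (end excluded) and `loc` (end included) is a common source of off-by-one surprises, so it is worth trying both on a small frame like this one.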
Exploring Pandas DataFrames
You can convert many file types to Pandas DataFrames. Here are some of the more common functions used for this.
read_csv(filepath_or_buffer[, sep, ...])
read_excel(*args, **kwargs)
read_json(*args, **kwargs)
read_html(*args, **kwargs)
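As a rough sketch of how two of these readers behave, the example below parses the same small table from CSV text and from JSON text. The data is hypothetical, and `io.StringIO` stands in for a real file only so that the snippet runs on its own; normally you would pass a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice you would pass a path such as "data.csv".
csv_text = "name,score\nalice,90\nbob,85\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same records as JSON: a list of objects, one per row.
json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls return a DataFrame with the columns `name` and `score`, which is the point of these readers: whatever the source format, you end up with the same structure to work on.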
You can also work with data from specific file types using some of these commonly used classes and functions.
# Excel
ExcelFile.parse([sheet_name, header, names, ...])
ExcelWriter(path[, engine])
# JSON
json_normalize(data[, record_path, meta, ...])
build_table_schema(data[, index, ...])
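For the JSON helpers, a short sketch of `json_normalize` may help. It flattens nested records into a flat table; the records below are made up for illustration.

```python
import pandas as pd

# Hypothetical nested records, as you might get back from a web API.
records = [
    {"id": 1, "user": {"name": "alice", "lang": "python"}},
    {"id": 2, "user": {"name": "bob", "lang": "go"}},
]

# Nested keys are flattened into dotted column names such as "user.name".
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```

Once the data is flat, each nested field is an ordinary column, so it can be filtered and sorted like any other.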
DataFrames also come with methods that help you structure and manipulate your data correctly. I recommend reading the docs to understand everything you can do with DataFrames, but these are some of the most commonly used methods and attributes.
# constructor
DataFrame([data, index, columns, dtype, copy])
DataFrame.head([n])
DataFrame.tail([n])
DataFrame.values
DataFrame.dtypes
DataFrame.columns
DataFrame.size
DataFrame.shape
DataFrame.axes
DataFrame.index
DataFrame.loc
DataFrame.iloc
DataFrame.keys()
DataFrame.filter([items, like, regex, axis])
DataFrame.dropna([axis, how, thresh, ...])
DataFrame.fillna([value, method, axis, ...])
DataFrame.sort_values(by[, axis, ascending, ...])
DataFrame.sort_index([axis, level, ...])
DataFrame.append(other[, ignore_index, ...])  # deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead
DataFrame.join(other[, on, how, lsuffix, ...])
DataFrame.merge(right[, how, on, left_on, ...])
DataFrame.update(other[, join, overwrite, ...])
DataFrame.to_period([freq, axis, copy])
DataFrame.tz_localize(tz[, axis, level, ...])
DataFrame.plot([x, y, kind, ax, ....])
DataFrame.from_dict(data[, orient, dtype, ...])
DataFrame.to_pickle(path[, compression, ...])
DataFrame.to_csv([path_or_buf, sep, na_rep, ...])
DataFrame.to_sql(name, con[, schema, ...])
DataFrame.to_dict([orient, into])
DataFrame.to_excel(excel_writer[, ...])
DataFrame.to_json([path_or_buf, orient, ...])
DataFrame.to_html([buf, columns, col_space, ...])
DataFrame.transpose(*args[, copy])
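To make a few of the entries in this list concrete, here is a minimal sketch that touches `shape`, `head`, `dropna`, `fillna`, `sort_values`, and `merge` on a small hypothetical table. It uses `pd.concat` for adding rows, since `DataFrame.append` is not available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical data with one missing score.
scores = pd.DataFrame({
    "name": ["alice", "bob", "carol", "dave"],
    "score": [90.0, float("nan"), 75.0, 82.0],
})

print(scores.shape)    # (rows, columns)
print(scores.head(2))  # first two rows

# Handle missing values: either drop the rows or fill them in.
without_missing = scores.dropna()
filled = scores.fillna({"score": 0.0})

# Sort by a column, highest score first.
ranked = without_missing.sort_values(by="score", ascending=False)

# Merge with a second table on a shared key; how="left" keeps every row of scores.
teams = pd.DataFrame({"name": ["alice", "bob", "carol"], "team": ["red", "blue", "red"]})
with_teams = scores.merge(teams, on="name", how="left")

# Add rows by concatenating DataFrames; ignore_index renumbers the result.
extra = pd.DataFrame({"name": ["erin"], "score": [88.0]})
combined = pd.concat([scores, extra], ignore_index=True)

print(ranked)
print(with_teams)
print(combined)
```

None of these calls modify `scores` in place; each one returns a new DataFrame, which is why the results are assigned to new names.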
The DataFrame constructor takes five parameters: data, index, columns, dtype, and copy.
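A short sketch with all five constructor parameters spelled out may make their roles clearer. The row labels, column names, and values are hypothetical.

```python
import pandas as pd

temperatures = pd.DataFrame(
    data=[[20, 30], [21, 29], [19, 31]],  # the values, one inner list per row
    index=["mon", "tue", "wed"],          # row labels
    columns=["low", "high"],              # column labels
    dtype="float64",                      # force a single dtype for all columns
    copy=True,                            # copy the input instead of sharing memory
)

print(temperatures)
print(temperatures.dtypes)
print(temperatures.loc["tue", "high"])
```

In everyday use you will rarely pass all five; `data` alone is often enough, with `index` and `columns` added when the defaults (0, 1, 2, ...) are not meaningful.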
DataFrames are used as input for many machine learning, deep learning, and neural network models. They are also well suited to EDA (Exploratory Data Analysis). Knowing and using at least the basics of Pandas is a must for most data science projects in Python.
Working with Data in Pandas (Example)
I am going to use a Sunspots dataset listed on Kaggle: "Monthly Mean Total Sunspot Number | 1749 – July 2018".
Acknowledgement:
SIDC and Quandl.
Database from SIDC – Solar Influences Data Analysis Center – the solar physics research department of the Royal Observatory of Belgium. SIDC website
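A first pass over a dataset like this could look roughly like the sketch below. It rests on assumptions: the column names `Date` and `Monthly Mean Total Sunspot Number` are a guess at the CSV layout, and the inline rows are made-up placeholder values, not real SIDC measurements. With the actual download you would pass the path of the CSV file to `pd.read_csv` instead of the `io.StringIO` stand-in.

```python
import io
import pandas as pd

# Placeholder rows for illustration only (NOT real sunspot measurements).
# The column names are assumed; check the header of the file you download.
csv_text = """Date,Monthly Mean Total Sunspot Number
2000-01-31,10.0
2000-02-29,20.0
2000-03-31,30.0
2001-01-31,40.0
"""

# Parse the date column and use it as the index so the data is a time series.
sunspots = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Shorten the long column name to make later code easier to read.
sunspots = sunspots.rename(columns={"Monthly Mean Total Sunspot Number": "sunspots"})

# Quick look at the structure of the data.
print(sunspots.head())
print(sunspots.dtypes)

# Average the monthly values within each calendar year.
yearly_mean = sunspots.groupby(sunspots.index.year)["sunspots"].mean()
print(yearly_mean)
```

From here the usual next steps would be checking for missing values with `dropna` or `fillna` and plotting the series with `DataFrame.plot`.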
Conclusion
The Pandas library is an amazing tool to have in Python. This article only scratches the surface of what you can accomplish with the Pandas API, but you can begin to see its true capabilities as soon as you start working with data in Python. Learning how Pandas works will improve your data science experience in Python by giving you more control over your input data. That means more flexibility and power both when exploring data and when working directly with it to achieve your programmatic, computational, or scientific goals. I hope this helps anyone wanting to learn more about the Pandas API in Python. Happy coding!