Be a more efficient data scientist: master pandas with this guide
Python is open source. It’s great, but has the inherent problem of open source: many packages do (or try to do) the same thing. If you’re new to Python, it’s hard to know the best package for a specific task. You need someone who has experience to tell you. And I tell you today: there’s one package you absolutely need to learn for data science, and it’s called pandas.
What’s really interesting about pandas is that it builds on several other packages: it is a core package that exposes features from libraries such as NumPy and matplotlib through a single interface. That’s great, because you can do most of your work using only pandas.
pandas is like Excel in Python: it stores data in tables (called DataFrames) and applies transformations to them. But it can do a lot more.
If you’re already familiar with Python, you can go straight to the 3rd paragraph
Let’s start:
The most elementary functions of pandas
Reading data
The most common functions: read_csv, read_excel
Some other great functions: read_clipboard (which I use way too often, copying data from Excel or from the web), read_sql
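As a minimal sketch of reading data, the snippet below parses a small CSV. The CSV content and column names are made up for illustration, and an in-memory buffer stands in for a real file path so the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path instead.
csv_text = "name,language,age\nAlice,french,31\nBob,english,25\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The other readers follow the same pattern: they take a source and return a DataFrame.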
Writing data
I usually don’t go for the other functions, like .to_excel, .to_json, or .to_pickle, since .to_csv does the job very well and CSV is the most common way to save tables. There’s also .to_clipboard if, like me, you’re an Excel maniac who wants to paste results from Python into Excel.
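Here is a small sketch of writing data with .to_csv. The DataFrame and the file name in the comment are hypothetical; the example returns the CSV as text so that nothing is written to disk.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [31, 25]})

# With no path given, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# Writing to disk would look like this ("my_results.csv" is a hypothetical path):
# df.to_csv("my_results.csv", index=False)
```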
Checking the data
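A quick way to check a table is to look at its size and at summary statistics. The sketch below uses made-up data and shows .shape and .describe(), which are probably the two calls you will reach for first.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "language": ["french", "english", "german", "french"],
        "age": [20, 30, 40, 30],
    }
)

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# describe() computes count, mean, std, min, quartiles and max
# for the numeric columns.
summary = df.describe()
print(summary)
```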
Seeing the data
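To look at the data itself, .head() and .loc are usually enough. The following sketch assumes a made-up table with the default integer index; the column names are placeholders.

```python
import pandas as pd

# Made-up data: ten rows labelled 0 to 9.
df = pd.DataFrame({"column_1": list("abcdefghij"), "column_2": range(10)})

# First three rows.
first_rows = df.head(3)

# A whole row, selected by its index label.
row_8 = df.loc[8]

# A single cell: row label 8, column "column_1".
cell = df.loc[8, "column_1"]

# A slice of rows; with .loc, both ends of a label slice are included.
subset = df.loc[4:5]

print(first_rows)
print(cell)
print(subset)
```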
The basic functions of pandas
Logical operations
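Logical operations let you filter rows with boolean conditions. The sketch below, on made-up data, combines conditions with & (and), inverts one with ~ (not), and tests membership with .isin(). Note that each condition needs its own parentheses when combined.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "column_1": ["french", "english", "german", "french"],
        "column_2": [10, 20, 30, 40],
    }
)

# Rows where a condition holds.
french = df[df["column_1"] == "french"]

# Two conditions combined with & (use | for "or").
french_big = df[(df["column_1"] == "french") & (df["column_2"] > 15)]

# Negation with ~.
not_french = df[~(df["column_1"] == "french")]

# Membership in a list of values.
in_list = df[df["column_1"].isin(["french", "english"])]

print(french_big)
```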
Basic plotting
This feature is made possible thanks to the matplotlib package. As we said in the intro, it’s usable directly in pandas.
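As a rough sketch of basic plotting, the example below draws a line plot and a histogram of one column. The data is made up, matplotlib is assumed to be installed, and the non-interactive "Agg" backend is selected only so the snippet also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"column_1": [1, 3, 2, 5, 4, 6, 8, 7]})

# Two side-by-side axes, one per plot.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot of the column against the index.
df["column_1"].plot(ax=ax_line, title="Line plot")

# Histogram of the column's distribution (10 bins by default).
df["column_1"].hist(ax=ax_hist)
```

In a notebook, calling df["column_1"].plot() on its own is usually enough to display the figure.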
Updating the data
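Updating values goes through .loc as well. This sketch, on made-up data, overwrites a single cell and then every row that matches a condition.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"column_1": ["french", "english", "german", "french"]})

# Overwrite one cell: row label 2, column "column_1".
df.loc[2, "column_1"] = "english"

# Overwrite every row where the condition is true, in one line.
df.loc[df["column_1"] == "french", "column_1"] = "French"

print(df)
```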
Alright, now you can do things that were easily accessible in Excel. Let’s dig into some amazing things that are not doable in Excel.
Medium level functions
Counting occurrences
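Counting occurrences is a one-liner with .value_counts(). The data below is made up; the result is a Series sorted from the most to the least frequent value.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {"column_1": ["french", "english", "french", "german", "french"]}
)

# Number of rows per distinct value, most frequent first.
counts = df["column_1"].value_counts()
print(counts)
```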
Operations on full rows, columns, or all data
The .map() operation applies a function to each element of a column.
.apply() applies a function to each column. Use .apply(func, axis=1) to apply it to each row instead.
.applymap() applies a function to every cell in the table (DataFrame). Since pandas 2.1 it is deprecated in favor of DataFrame.map(), which does the same thing.
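The three operations can be sketched side by side on a made-up table. Because the element-wise DataFrame method is named .map() in recent pandas versions and .applymap() in older ones, the example picks whichever exists; that fallback is my own addition, not something pandas requires.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# .map(): one function call per element of a single column.
squared = df["a"].map(lambda x: x ** 2)

# .apply(): one function call per column (the default, axis=0)...
col_range = df.apply(lambda col: col.max() - col.min())

# ...or one call per row with axis=1.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise over the whole table: DataFrame.map in pandas >= 2.1,
# DataFrame.applymap in older versions.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
doubled = elementwise(lambda x: x * 2)

print(squared.tolist(), col_range.tolist(), row_sum.tolist())
print(doubled)
```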
tqdm, the one and only
When working with large datasets, pandas can take some time to run .map(), .apply(), and .applymap() operations. tqdm is a very useful package that shows a progress bar and estimates when these operations will finish executing (yes, I lied: I said we would use only pandas).
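A minimal sketch of tqdm with pandas, assuming the tqdm package is installed: calling tqdm.pandas() once registers progress_ variants of the usual methods, and progress_map then behaves like .map() while drawing a progress bar. The data is made up.

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_map / progress_apply on pandas objects.
tqdm.pandas()

# Made-up data for illustration.
df = pd.DataFrame({"column_1": range(1000)})

# Same result as .map(), with a progress bar printed while it runs.
df["squared"] = df["column_1"].progress_map(lambda x: x ** 2)

print(df.tail(3))
```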
Correlation and scatter matrices
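For correlations, .corr() returns the matrix of pairwise correlations between numeric columns. The sketch below uses made-up columns chosen so that the result is easy to verify by eye: y grows with x, and z shrinks as x grows.

```python
import pandas as pd

# Made-up data: y = 2 * x, and z decreases as x increases.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "y": [2, 4, 6, 8],
        "z": [4, 3, 2, 1],
    }
)

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
print(corr)
```

The plotting counterpart is pd.plotting.scatter_matrix(df), which draws every pair of columns against each other and requires matplotlib.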
Handling missing values
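Missing values can be counted, filled, or dropped. The sketch below shows one common way to do each on made-up data; the replacement values are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", None, "z"]})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Replace missing values, with a different default per column.
filled = df.fillna({"a": 0.0, "b": "unknown"})

# Or drop every row that contains at least one missing value.
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```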
Advanced operations in pandas
The SQL join
Joining in pandas is remarkably simple.
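As a sketch, here is a join between two made-up tables that share a customer_id column. The how argument plays the role of the SQL join type (inner, left, right, outer).

```python
import pandas as pd

# Made-up tables for illustration.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]}
)
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [50, 20, 70]})

# Inner join: only customers that have at least one order.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: every customer, with NaN where there is no order.
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```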
Grouping
Grouping is not that simple at the beginning: you need to master the syntax first. Once you have, you’ll find yourself using this feature all the time.
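A typical grouping looks like the sketch below: pick the column to group by, pick the column to aggregate, then pick the aggregation. The data and the output column names (mean_age, n_people) are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "language": ["french", "english", "french", "english"],
        "age": [30, 20, 40, 30],
    }
)

# One aggregation: mean age per language.
mean_age = df.groupby("language")["age"].mean()

# Several aggregations at once, each with its own output column name.
summary = df.groupby("language").agg(
    mean_age=("age", "mean"),
    n_people=("age", "count"),
)

print(mean_age)
print(summary)
```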
Iterating over rows
The .iterrows() method yields two values at each step: the index of the row and the row itself (conventionally named i and row).
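A minimal loop over rows might look like this. The data is made up, and the lines list exists only to collect the output.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"column_1": ["french", "english"], "column_2": [10, 20]})

lines = []
for i, row in df.iterrows():
    # i is the index label; row is a Series holding that row's values.
    lines.append(f"{i}: {row['column_1']} / {row['column_2']}")

print("\n".join(lines))
```

Looping this way is convenient but slow on large tables, so prefer the column-wise operations shown earlier when they fit.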
Overall, pandas is one of the reasons why Python is such a great language
There are many other interesting pandas features I could have shown, but it’s already enough to understand why a data scientist cannot do without pandas.
To sum up, pandas is:
- simple to use, hiding all the complex and abstract computations behind the scenes
- (generally) intuitive
- fast, if not the fastest data analysis package (it is highly optimized, with performance-critical parts written in C and Cython)
It is THE tool that helps data scientists quickly read and understand data and be more efficient in their role.
I hope you found this article useful, and if you did, consider giving at least 50 claps :)