Anyone who programs in Python, especially in the field of Data Science, will sooner or later work with the Pandas library. Many applications need data in tabular form, and in Python this is most easily handled with Pandas DataFrames.
Make sure you know the following commands well enough to use them without looking them up, so that you save time and get the most out of Pandas every day.
Example
Throughout this article, we test the commands on a small example DataFrame with two columns and dates as the index. The numerical values are filled in randomly using NumPy.
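A minimal sketch of such an example DataFrame could look like the following. The number of rows (six), the seed, and the exact way the random values are drawn are assumptions made for illustration; only the two columns and the date index come from the description above.

```python
import numpy as np
import pandas as pd

# Assumed setup: six consecutive days as the index, two columns of random values.
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    rng.standard_normal(size=(6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)
print(df)
```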
1. Add a New Column
There are several ways to add new columns to an existing DataFrame. Assigning values to a new column name in square brackets appends the column at the right end of the DataFrame.
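As a short sketch of the square-bracket assignment, the example below rebuilds the assumed six-row example DataFrame and appends a column. The name "column 3" and the choice to fill it with the sum of the other two columns are purely illustrative.

```python
import numpy as np
import pandas as pd

# Assumed example data: six days, two random columns.
dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# Assigning to a name that does not exist yet appends the column on the right.
df["column 3"] = df["column 1"] + df["column 2"]
print(df.columns.tolist())
```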
If we instead want to insert the new column at a specific position, we can use "df.insert()".
The first argument is the position at which the new column is inserted, the second is the name of the column, and the third is the object to be inserted as the column. The last parameter, "allow_duplicates", specifies whether duplicate column names are allowed. If a column with the same name already exists and "allow_duplicates" is set to "False", you will get an error message.
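The following sketch shows both cases: a successful insert at position 1 and a rejected insert of a column name that already exists. The column name "inserted" and the values are hypothetical; the DataFrame is the assumed six-row example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# Insert a new column at position 1, i.e. between the two existing columns.
df.insert(1, "inserted", np.arange(6), allow_duplicates=False)
print(df.columns.tolist())

# Inserting the same name again is rejected while allow_duplicates is False.
duplicate_error = None
try:
    df.insert(0, "inserted", np.zeros(6), allow_duplicates=False)
except ValueError as err:
    duplicate_error = str(err)
    print("Insert rejected:", err)
```

Note that "df.insert()" modifies the DataFrame in place and returns nothing, so its result should not be assigned back to "df".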
2. Add a New Row
If values are to be added later as a new row, you can simply assign them to a new index label. It is important that the list of values you pass matches the number of columns in the DataFrame.
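One way to do this, sketched here under the assumption that the row is added with "df.loc" and a new date as the label, is shown below. The date and the two values are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# Assigning to an index label that does not exist yet appends a new row.
# The list needs exactly one value per column.
new_label = pd.Timestamp("2022-01-07")
df.loc[new_label] = [0.5, -0.5]
print(df.tail(2))
```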
3. Delete a Column
As with any good Pandas command, there are several options for dropping columns. The two easiest are the function "df.drop()", called with the name of the column and "axis=1" to select columns, and the standard Python statement "del", applied to the corresponding column.
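Both options can be sketched as follows, again on the assumed example DataFrame. The main practical difference is that "df.drop()" returns a new DataFrame by default, while "del" changes the existing one.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# Option 1: drop() returns a new DataFrame and leaves df untouched.
dropped = df.drop("column 2", axis=1)
print(dropped.columns.tolist())

# Option 2: del removes the column from df in place.
del df["column 2"]
print(df.columns.tolist())
```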
4. Delete Empty Fields
When we pass empty values to a DataFrame or other Pandas object, they are automatically replaced by NumPy NaNs (Not a Number). Calculations such as averaging skip these fields by default. To demonstrate this, we extend the existing DataFrame with a third column that contains a value only for the index 01/01/2022. The remaining values are then automatically set to NaN.
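A sketch of this setup: the third column is assigned from a Series that only has an entry for the first date, so Pandas aligns it on the index and fills the rest with NaN. The column name "column 3" and the value 1.0 are assumptions.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# The Series has a value only for 2022-01-01; all other rows become NaN.
df["column 3"] = pd.Series([1.0], index=pd.to_datetime(["2022-01-01"]))
print(df)

# NaN entries are skipped, so the mean is computed from the single value.
print(df["column 3"].mean())
```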
If we want to delete all rows that have an empty value in at least one of the columns, we can do that with "df.dropna()".
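As a sketch, the example below recreates the assumed DataFrame with a mostly empty third column and drops every row that contains at least one NaN.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)
df["column 3"] = pd.Series([1.0], index=pd.to_datetime(["2022-01-01"]))

# dropna() returns a new DataFrame without the rows that contain NaN.
cleaned = df.dropna()
print(cleaned)
```

Since only the first row has a value in every column, it is the only row that survives. The original "df" is left unchanged.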
If we want to delete columns with missing values instead, we use the same command and additionally set "axis=1". Alternatively, we can fill the empty fields with a predefined value, for example 0, using "df.fillna()".
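Both variants are sketched below on the assumed example DataFrame with its mostly empty third column.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)
df["column 3"] = pd.Series([1.0], index=pd.to_datetime(["2022-01-01"]))

# Drop every column that contains at least one NaN.
without_incomplete_columns = df.dropna(axis=1)
print(without_incomplete_columns.columns.tolist())

# Or keep all columns and replace the NaN entries with 0.
filled = df.fillna(0)
print(filled)
```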
In some cases, it can also be useful to display the missing values as Boolean values (True/False), for example with "df.isna()". In most cases, however, the DataFrame objects are too large for this to be a useful representation.
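A short sketch of the Boolean view, assuming "df.isna()" is used to produce it. For larger DataFrames, counting the missing values per column is one compact alternative; that step is an addition for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)
df["column 3"] = pd.Series([1.0], index=pd.to_datetime(["2022-01-01"]))

# True marks a missing value, False marks a filled field.
mask = df.isna()
print(mask)

# Summing the Boolean mask counts the missing values in each column.
missing_per_column = mask.sum()
print(missing_per_column)
```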
5. Delete a Row
If we want to delete specific rows from our DataFrame rather than just rows with empty values, there are two ways to do that. First, we can delete a row by using the index of the row we want to delete. In our case, this is a concrete date, such as 01.01.2022.
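A minimal sketch of deleting a row by its index label, using the assumed six-row example DataFrame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# drop() works on index labels by default (axis=0), so passing the date removes that row.
first_day = pd.Timestamp("2022-01-01")
df = df.drop(first_day)
print(df)
```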
By doing this, we have deleted the first row of this object. In most cases, however, we will not know in advance the specific row that we want to delete. We can then filter the DataFrame down to the rows we want to delete and retrieve the index labels of those rows.
In this case, we delete all rows in which "column 1" has a value greater than 0.1. With the random values of our example, this leaves a total of four rows in the "df" object.
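The filter-then-drop pattern can be sketched as follows. Because the values here come from an assumed seed, the number of remaining rows will generally differ from the article's own run.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6, freq="D")
df = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 2)),
    index=dates,
    columns=["column 1", "column 2"],
)

# Step 1: filter to the rows that should go and collect their index labels.
rows_to_drop = df[df["column 1"] > 0.1].index

# Step 2: drop exactly those labels.
df = df.drop(rows_to_drop)
print(df)
```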
6. Merge Pandas Objects
Pandas provides several ways to concatenate Series or DataFrame objects. The "concat" command extends the first-named object by the second-named object if they are of the same type. The command can of course also be executed with more than two data structures.
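A small sketch with Series objects, using made-up values and labels, shows the idea for two and for three inputs:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["c", "d"])
s3 = pd.Series([5], index=["e"])

# concat() takes a list of objects and stacks them in the given order.
combined = pd.concat([s1, s2])
print(combined)

# The list can hold more than two objects.
combined_three = pd.concat([s1, s2, s3])
print(combined_three)
```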
With DataFrames, the line of code looks the same. The additional parameter "ignore_index" assigns a new continuous index instead of keeping the index from the original objects.
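The effect of "ignore_index" can be sketched with two small DataFrames that share the same dates. The values are invented for illustration.

```python
import pandas as pd

dates = pd.date_range("2022-01-01", periods=2, freq="D")
df1 = pd.DataFrame({"column 1": [1, 2], "column 2": [3, 4]}, index=dates)
df2 = pd.DataFrame({"column 1": [5, 6], "column 2": [7, 8]}, index=dates)

# Without ignore_index the original dates are kept, so labels repeat.
kept_index = pd.concat([df1, df2])
print(kept_index)

# With ignore_index=True the result gets a fresh index 0, 1, 2, ...
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```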
Pandas also supports joins via "merge", which most people are probably familiar with from SQL.
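As a sketch of SQL-style joins, the example below merges two hypothetical tables on a shared key. The table and column names ("customers", "orders", "customer_id") are invented for illustration and are not from the article.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [10.0, 20.0, 5.0]})

# Left join: keep every customer, fill missing order data with NaN.
left_join = pd.merge(customers, orders, on="customer_id", how="left")
print(left_join)

# Right join: keep every order, fill missing customer data with NaN.
right_join = pd.merge(customers, orders, on="customer_id", how="right")
print(right_join)
```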
If we want to perform an inner join instead of a left or right join, we can again use the "concat" command and set the parameter "join" to "inner". Only the labels that occur in all of the objects are then kept.
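One possible reading of this, sketched below, places two DataFrames side by side with "axis=1" so that the inner join keeps only the dates present in both. The use of "axis=1" and the sample values are assumptions; with the default "axis=0", the inner join would instead keep only the columns the objects share.

```python
import pandas as pd

left = pd.DataFrame(
    {"column 1": [1, 2, 3]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
)
right = pd.DataFrame(
    {"column 2": [10, 20, 30]},
    index=pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04"]),
)

# join="inner" keeps only the index labels that appear in both DataFrames.
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```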
This is what you should take with you
- Pandas offers many possibilities to deal with missing values. You can either delete the columns/rows in question or replace the fields with a value.
- Pandas supports the join types familiar from SQL, such as left, right, and inner joins.