Pandas: Most Used Functions in Data Science

Most useful functions for data preprocessing

Saloni Mishra
Towards Data Science


From Unsplash by Sid Balachandran

When you are introduced to machine learning, the first step is usually to learn Python, and an early part of learning Python for data work is learning the pandas library. pandas can be installed with pip install pandas. After installation, it has to be imported at the start of every session, conventionally with import pandas as pd. The example data comes from the UCI repository: “https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records”.

1. Read Data

We can read data into a pandas data frame with pd.read_csv(). The two most common input formats are CSV and Excel. For an Excel file, we use pd.read_excel() instead and can pick a sheet with the sheet_name argument. pandas also has readers for other, less frequently used file types.
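As a minimal sketch of the read step, the snippet below parses a small CSV held in memory, so it runs without downloading anything. The three rows are invented for illustration; only the column names are borrowed from the heart failure dataset. The Excel file name and sheet name in the final comment are hypothetical.

```python
import io

import pandas as pd

# Invented sample rows standing in for a CSV file on disk.
csv_text = """age,anaemia,ejection_fraction,DEATH_EVENT
75,0,20,1
55,0,38,1
65,1,20,0
"""

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Reading Excel is analogous, but needs an Excel engine such as openpyxl:
# df = pd.read_excel("heart_failure.xlsx", sheet_name="Sheet1")  # hypothetical names
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV.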

Image by Author

2. Head and Tail

To look at the data frame, we can use df.head(). It returns the first rows; if no argument is given, it shows the first 5 rows. Conversely, df.tail() shows the last rows, also 5 by default.
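A short sketch of both calls, using a small invented data frame (the values are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Invented eight-row frame so the default of 5 rows is visible.
df = pd.DataFrame({
    "age": [75, 55, 65, 50, 65, 90, 75, 60],
    "anaemia": [0, 0, 0, 1, 1, 1, 1, 1],
})

first_five = df.head()    # no argument: first 5 rows
first_three = df.head(3)  # explicit row count
last_two = df.tail(2)     # last 2 rows, original index labels are kept

print(first_five)
print(last_two)
```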

Image by Author

3. Shape, Size and Info

After reading the data, the two most basic things to check are the number of rows and columns and the data type of each variable. df.shape returns a tuple with the number of rows followed by the number of columns. df.size (an attribute, so it takes no parentheses) returns the number of rows times the number of columns. df.info() prints a summary that includes the row range from the RangeIndex, the data columns, the data type of each column, and the non-null count of each column.
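The three inspections can be sketched on a small invented frame as follows. Note that shape and size are attributes, while info() is a method that prints its report.

```python
import pandas as pd

# Invented four-row, three-column frame for illustration.
df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "anaemia": [0, 0, 0, 1],
    "sex": [1, 1, 1, 0],
})

n_rows, n_cols = df.shape  # tuple: (rows, columns)
n_cells = df.size          # rows * columns

print(n_rows, n_cols, n_cells)
df.info()                  # prints index range, dtypes and non-null counts
```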

4. isna

To count the null values in the data, we can use df.isna(), which marks every missing cell as True. Chaining .sum() gives the number of nulls in each column, and a second .sum() gives the total for the whole data frame. To get the nulls for just one variable, we select that column by name first, for example df.age.isna().sum().
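A minimal sketch of the three counts, on an invented frame with a few deliberately missing cells:

```python
import pandas as pd

# Invented values; None becomes NaN in a numeric column.
df = pd.DataFrame({
    "age": [75.0, None, 65.0, 50.0],
    "platelets": [265000.0, 263358.03, None, None],
})

nulls_per_column = df.isna().sum()   # one count per column
total_nulls = df.isna().sum().sum()  # single number for the whole frame
age_nulls = df["age"].isna().sum()   # one column only

print(nulls_per_column)
print(total_nulls, age_nulls)
```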

Image by Author

5. Describe

To understand the basic statistics of the numeric variables, we can use df.describe(). It gives the count, mean, and standard deviation, along with the five-number summary (minimum, 25th percentile, median, 75th percentile, and maximum).
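As a small illustration, describe() on an invented single-column frame returns a data frame whose rows are the statistics:

```python
import pandas as pd

# Invented ages chosen so the summary is easy to check by hand.
df = pd.DataFrame({"age": [40, 50, 60, 70, 80]})

summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label.
mean_age = summary.loc["mean", "age"]
median_age = summary.loc["50%", "age"]
```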

Image by Author

6. Nunique

To get the number of unique values in each variable, we can use df.nunique(). It returns the count of distinct values per column, not the values themselves; calling .unique() on a single column lists the values.
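The difference between counting and listing distinct values can be sketched on an invented frame:

```python
import pandas as pd

# Invented values for illustration.
df = pd.DataFrame({
    "anaemia": [0, 1, 0, 1, 0],
    "age": [75, 55, 65, 55, 75],
})

distinct_counts = df.nunique()     # number of distinct values per column
distinct_ages = df["age"].unique() # the distinct values, in order of appearance

print(distinct_counts)
print(distinct_ages)
```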

Image by Author

7. Value Counts

To get the frequency of each unique value in a single variable, we can use df.anaemia.value_counts(). For demonstration, only variables with boolean values are shown below. value_counts() returns the count of each unique value. The resulting object is sorted in descending order, so the most frequent value comes first. NA values are excluded by default.
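A brief sketch on an invented column with one missing entry, showing the default behaviour and the dropna=False option that also counts missing values:

```python
import pandas as pd

# Invented 0/1 flags with one missing entry.
df = pd.DataFrame({"anaemia": [0, 1, 0, 0, 1, None]})

counts = df["anaemia"].value_counts()               # NaN excluded, most frequent first
counts_with_na = df["anaemia"].value_counts(dropna=False)  # NaN counted too

print(counts)
print(counts_with_na)
```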

Image by Author

If we change the data and want to write it to a comma-separated values (CSV) file, we can use df.to_csv(). The index parameter defaults to True, so the row index is written as the first column unless we pass index=False.
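To illustrate the effect of the index parameter, the sketch below calls to_csv() without a path, in which case pandas returns the CSV text as a string instead of writing a file. The data and the output file name in the final comment are invented.

```python
import pandas as pd

# Invented two-row frame.
df = pd.DataFrame({"age": [75, 55], "anaemia": [0, 1]})

with_index = df.to_csv()                # default index=True: extra first column
without_index = df.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)

# Writing to disk only needs a path:
# df.to_csv("heart_failure_clean.csv", index=False)  # hypothetical file name
```

Passing index=False is usually what we want when the index is just the default row numbering, since otherwise it comes back as an unnamed column on the next read.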

Image by Author

8. Columns

To list the names of all the variables in a data frame, we can use df.columns.
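A minimal sketch, using an invented one-row frame with a few column names borrowed from the dataset:

```python
import pandas as pd

# Invented single row; only the column names matter here.
df = pd.DataFrame({"age": [75], "anaemia": [0], "DEATH_EVENT": [1]})

column_index = df.columns         # a pandas Index object
column_names = df.columns.tolist()  # plain Python list of names

print(column_index)
print(column_names)
```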

Image by Author

CONCLUSION

These functions are commonly used to preprocess data in the initial steps, so they are worth memorizing. pandas offers many other useful functions that can be applied as the situation requires. They can be explored in the pandas documentation: “https://pandas.pydata.org/pandas-docs/stable/reference/frame.html”.

Thanks for reading!
