Pandas: Most Used Functions in Data Science
Most useful functions for data preprocessing
When you get introduced to machine learning, the first step is to learn Python, and a basic part of learning Python for data work is the pandas library. We can install pandas with pip install pandas. After installing, we have to import pandas in every new session, conventionally as import pandas as pd. The data used for the examples is from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
1. Read Data
We can read data into a pandas data frame with read_csv(). The two most used input formats are csv and Excel. When reading an Excel file with read_excel(), we can also pass the sheet name through the sheet_name parameter. Readers for other, less used file types are available too.
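As a minimal sketch of reading data, the example below parses a small in-memory CSV so that it runs without downloading anything. The column names are borrowed from the heart failure dataset, but the three rows are made-up toy values, and the Excel file and sheet names in the comment are hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded CSV file; the values are invented for illustration.
csv_text = """age,anaemia,ejection_fraction,DEATH_EVENT
75,0,20,1
55,0,38,1
65,1,20,0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# For an Excel file, sheet_name selects the sheet to load. The file and sheet
# names here are hypothetical, and an Excel engine such as openpyxl is required:
# df_xlsx = pd.read_excel("heart_failure.xlsx", sheet_name="Sheet1")
```

With a real file on disk, the call would simply take the path, for example pd.read_csv("path/to/file.csv").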
2. Head and Tail
To look at the data frame we can use df.head(). It returns the first rows; if no argument is given, it shows the first 5 rows. To see the last rows instead, we can use df.tail().
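A short sketch of head() and tail() on a toy data frame follows. The eight rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data frame with 8 rows (illustrative values only).
df = pd.DataFrame({
    "age": [75, 55, 65, 50, 65, 90, 75, 60],
    "anaemia": [0, 0, 0, 1, 1, 1, 1, 1],
})

first_rows = df.head()    # first 5 rows by default
first_three = df.head(3)  # first 3 rows
last_rows = df.tail()     # last 5 rows by default
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```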
3. Shape, Size and Info
The two most basic steps after reading data are to find the number of rows and columns and to check the data type of each variable. df.shape gives the number of rows followed by the number of columns. df.size (an attribute, so it is written without parentheses) returns the number of rows times the number of columns. df.info() reports the number of rows from the RangeIndex, the data columns, the data type of each column, and the non-null count of each column.
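The sketch below shows shape, size and info() side by side on a toy data frame with one missing value, so that the non-null counts in info() differ between columns. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame: 4 rows, 3 columns, one missing age (illustrative values only).
df = pd.DataFrame({
    "age": [75.0, 55.0, np.nan, 50.0],
    "anaemia": [0, 0, 1, 1],
    "sex": [1, 1, 0, 1],
})

n_rows, n_cols = df.shape  # attribute: (rows, columns)
n_cells = df.size          # attribute: rows * columns

print(n_rows, n_cols, n_cells)
df.info()                  # method: prints index range, dtypes and non-null counts
```

Note that shape and size are attributes, while info() is a method and needs the parentheses.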
4. isna
To count the null values in the data, we can use df.isna(), which marks every cell as null or not. Chaining .sum() gives the number of null values per column, and a second .sum() gives the total for the whole data frame. To get the null count of a single variable, select that variable by name first, for example df.anaemia.isna().sum().
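A minimal sketch of the three null counts, using a toy data frame in which the missing values were placed by hand (the real dataset may not contain any):

```python
import numpy as np
import pandas as pd

# Toy data frame with hand-placed missing values (illustrative only).
df = pd.DataFrame({
    "age": [75.0, np.nan, 65.0, np.nan],
    "platelets": [265000.0, 263358.03, np.nan, 210000.0],
    "sex": [1, 0, 1, 1],
})

nulls_per_column = df.isna().sum()  # one count per column
total_nulls = df.isna().sum().sum() # single number for the whole data frame
age_nulls = df["age"].isna().sum()  # count for one variable

print(nulls_per_column)
print(total_nulls, age_nulls)
```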
5. Describe
To understand the basic statistics of the variables we can use df.describe(). For numeric columns it gives the count, mean, standard deviation, and the five-number summary (minimum, 25%, 50%, 75% and maximum).
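As a small illustration, describe() on a single toy numeric column returns the eight summary rows listed in the code comment. The ages are invented so that the statistics are easy to check by hand.

```python
import pandas as pd

# Toy numeric column (illustrative values only).
df = pd.DataFrame({"age": [40, 50, 60, 70, 80]})

summary = df.describe()
# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max
print(summary)

mean_age = summary.loc["mean", "age"]
median_age = summary.loc["50%", "age"]
```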
6. Nunique
To get the number of unique values of each variable, we can use df.nunique(). It returns the count of distinct values per column, not the values themselves.
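The difference between counting distinct values and listing them can be sketched as follows. The data is a toy example, and unique() is shown only for contrast; it is not one of the functions covered in this article.

```python
import pandas as pd

# Toy data frame (illustrative values only).
df = pd.DataFrame({
    "age": [75, 55, 65, 50, 65],
    "anaemia": [0, 1, 0, 1, 1],
})

distinct_counts = df.nunique()    # number of distinct values per column
distinct_ages = df["age"].unique()  # the distinct values themselves, for contrast

print(distinct_counts)
print(distinct_ages)
```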
7. Value Counts
To get the unique values of a single variable together with how often each one occurs, we can use df.anaemia.value_counts(). A variable with boolean values, such as anaemia, makes a convenient demonstration. value_counts() returns the count of each unique value. The resulting object is sorted in descending order of count. NA values are excluded by default.
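A minimal sketch of value_counts() on a toy anaemia column follows. One missing value was added by hand to show the default NA handling; passing dropna=False is one way to include it in the result.

```python
import numpy as np
import pandas as pd

# Toy column with one hand-placed missing value (illustrative only).
df = pd.DataFrame({"anaemia": [0, 1, 0, 0, 1, np.nan]})

counts = df["anaemia"].value_counts()                  # NA excluded, sorted by count
counts_with_na = df["anaemia"].value_counts(dropna=False)  # NA counted as well

print(counts)
print(counts_with_na)
```

Here df["anaemia"] and df.anaemia select the same column; the bracket form also works for column names that contain spaces.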
If we make changes to the data and want to write it to a comma-separated values (csv) file, we can use to_csv(). Its index parameter defaults to True, so the row index is written as the first column unless we pass index=False.
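The effect of the index parameter can be sketched without touching the disk: when no path is given, to_csv() returns the CSV text as a string. The data is a toy example and the output file name in the comment is hypothetical.

```python
import pandas as pd

# Toy data frame (illustrative values only).
df = pd.DataFrame({"age": [75, 55], "anaemia": [0, 1]})

csv_with_index = df.to_csv()                # index=True is the default
csv_without_index = df.to_csv(index=False)  # drop the row index column

print(csv_with_index)
print(csv_without_index)

# Writing to a file works the same way (hypothetical file name):
# df.to_csv("heart_failure_cleaned.csv", index=False)
```

Passing index=False is a common choice when the index is just the default row numbering, since otherwise it comes back as an extra unnamed column when the file is read again.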
8. Columns
To know the names of all the variables in a data frame, we can use df.columns.
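As a brief sketch on a toy data frame (values invented, column names taken from the dataset), df.columns returns an Index object that can be converted to a plain list when needed:

```python
import pandas as pd

# Toy data frame (illustrative values only).
df = pd.DataFrame({"age": [75, 55], "anaemia": [0, 1], "DEATH_EVENT": [1, 0]})

column_index = df.columns        # pandas Index object
column_names = list(df.columns)  # plain Python list of names

print(column_index)
print(column_names)
```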
CONCLUSION
These functions are commonly used to preprocess data in the initial steps, so memorizing them is a good idea. There are many other useful functions that can be applied depending on the conditions and requirements of the task. They can be explored in the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
Thanks for reading!