Pandas Index Explained

Pandas is a data scientist's best friend, and the index is the invisible soul behind pandas

Manu Sharma
Towards Data Science

index is like an address

We spend a lot of time with methods like loc, iloc, filtering, stack/unstack, concat, merge, pivot and many more while processing and understanding our data, especially when we work on a new problem. These methods all rely on indexes, and many of the errors we face are index errors. Indexes become even more important with time series data. Visualisations also need good control over the pandas index.

An index is like an address: it is how any data point in a dataframe or series can be accessed. Both rows and columns have indexes. The row index is simply called the index, and the column index is generally the column names.

Machine learning and deep learning algorithms have a limit on prediction accuracy, just like their ancestors, statistical models. So how do you get significantly better accuracy? Feed them better data to learn from. Data processing and feature engineering are the key.

Pandas has three data structures: dataframe, series & panel (panel has been removed in recent pandas versions). We mostly use dataframes and series, and both use indexes, which makes them very convenient to analyse.

Time to take a step back and look at the pandas index. Understanding it helps us become better data scientists.

We will be using the UCI Machine Learning Adult Dataset. The following notebook has the script to download the data.

Business problem: classification (does a person earn more than 50k or not). Target variable: label. Predictors: country, age, education, occupation, marital status, etc.

The following Notebook is very easy to follow and also has small tips and tricks to make daily work a little better.

    import pandas as pd

    # The raw file has no header row, so the column names are supplied here
    adult = pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        names=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
               'marital_status', 'occupation', 'relationship', 'race', 'sex',
               'capital_gain', 'capital_loss', 'hours_per_week',
               'native_country', 'label'],
        index_col=False)
    print("Shape of data: {}".format(adult.shape))
    adult.head()

The dataset has 32561 rows and 15 columns. The leftmost series 0, 1, 2, 3, … is the index. Let’s look at some more information.

adult has a RangeIndex of 32561 entries, an integer sequence from 0 to 32560.
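As a minimal sketch of what a default index looks like, the snippet below builds a tiny hypothetical frame (`toy`, not part of the original notebook) without passing any index, so pandas assigns a RangeIndex just as it does for adult.

```python
import pandas as pd

# Hypothetical toy frame standing in for `adult`; no index is passed,
# so pandas assigns a default RangeIndex.
toy = pd.DataFrame({"age": [39, 50, 38], "label": ["<=50K", "<=50K", ">50K"]})

print(toy.index)   # RangeIndex(start=0, stop=3, step=1)
toy.info()         # the summary reports the index type and its range
```

On the real data, `adult.info()` should report the same kind of index, only with 32561 entries.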

Selection methods

  1. df.loc is for labels/names
  2. df.iloc is for position numbers

For example, let's assume Ram, Sonu & Tony are standing at positions 1, 2 & 3 respectively. If you want to call Ram, you have two options: call him by his name or by his position number. If you call Ram by his name, “Ram”, you use df.loc; if you call him by his position number, you use df.iloc. Note that pandas counts positions from 0, so the first position is df.iloc[0].
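The analogy can be sketched in a few lines. The frame `people` and its `position` column are made up for illustration; the only point is that the same row is reachable by label or by zero-based position.

```python
import pandas as pd

# Hypothetical data for the analogy; the names are the row labels.
people = pd.DataFrame(
    {"position": [1, 2, 3]},
    index=["Ram", "Sonu", "Tony"],
)

by_name = people.loc["Ram"]    # label-based: the row named "Ram"
by_position = people.iloc[0]   # position-based: the first row (counting from 0)

print(by_name.equals(by_position))  # True
```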

Before we understand loc & iloc more, let's take a sample from our data for further analysis. We are taking a sample of 10000 observations, using pandas df.sample method.

Nice, now we have a dataset named df, and the leftmost series is 3, 4, 6, 8, 11… Strange!

This is because the rows carry their original index labels (their old addresses) from the adult dataset.
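A small sketch of this behaviour, using a hypothetical 10-row stand-in for adult. The `random_state` argument is an assumption added here only to make the sketch repeatable; it is not taken from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for `adult`: 10 rows with the default 0..9 index.
adult_toy = pd.DataFrame({"age": range(20, 30)})

# Draw 4 rows at random; each row keeps the label it had in adult_toy.
df_toy = adult_toy.sample(n=4, random_state=0)

print(df_toy.index.tolist())  # four of the original labels, not 0..3
```

Because the labels survive sampling, `adult_toy.loc[df_toy.index]` returns exactly the sampled rows.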

Naming our index will help us a little initially; it holds the indices from the adult dataset.
Look at the row and column indices.

Both rows and columns have indexes, and the name of the row index is ‘index_adult’.
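One way to name the index and inspect both indexes is sketched below. The frame `df_toy` and its values are hypothetical; setting `index.name` is one common way to do this and may differ from what the notebook used.

```python
import pandas as pd

# Hypothetical sample whose row labels came from a larger frame.
df_toy = pd.DataFrame(
    {"age": [49, 37, 52], "label": ["<=50K", "<=50K", ">50K"]},
    index=[3, 4, 6],
)

df_toy.index.name = "index_adult"  # name the row index

print(df_toy.index)    # row index, now carrying name='index_adult'
print(df_toy.columns)  # column index holding the column names
```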

Let's discuss a couple of examples on loc & iloc methods

error on df.loc method

In the first example, .loc gave us an error.

That is because .loc looks rows up by label, and df has no row with the label ‘2’. The row index looks like a number to us, but to .loc it is a name/label.

Try replacing 2 with 3 or 4 and it will work, because ‘3’ and ‘4’ exist as row labels.

In the second example, we try the same with .iloc, which is a position-based method.

“age” is the first column, so we use its position, which is 0.

Positions run 0, 1, 2, 3 up to the last row of df, so 2 is the third row.
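The two examples can be reproduced on a small hypothetical frame whose row labels skip 2, as the sampled df does. The values in `df_toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: row labels 3, 4, 6 (there is no label 2).
df_toy = pd.DataFrame(
    {"age": [49, 37, 52], "education": ["9th", "Masters", "HS-grad"]},
    index=[3, 4, 6],
)

try:
    df_toy.loc[2, "age"]       # label lookup: no row is labelled 2
except KeyError as err:
    print("KeyError:", err)

print(df_toy.loc[3, "age"])    # label 3 exists -> 49
print(df_toy.iloc[2, 0])       # third row, first column -> 52
```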

For our further analysis, let's keep only a few interesting variables.

Sometimes it’s difficult to work with seemingly random numbers in the index. In that case, resetting the index turns the old index into a column and creates a new default index.

`reset_index()` moves the current index into a column and builds a fresh default index every time we run it on the same data.
With the `drop = True` parameter, the old index is not kept as a column in the dataframe. Look at the difference between the following two datasets.

`inplace = True` saves us from assigning the result back to the variable.
We are not using `drop = True`, so df should now have its previous index as a column.

df has another column, index_adult, because of the reset.
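The difference between the variants is sketched below on a hypothetical three-row frame (`df_toy`); the names `kept` and `dropped` are illustrative only.

```python
import pandas as pd

# Hypothetical sample with a named, non-default index.
df_toy = pd.DataFrame({"age": [49, 37, 52]}, index=[3, 4, 6])
df_toy.index.name = "index_adult"

kept = df_toy.reset_index()              # old index becomes a column
dropped = df_toy.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist())     # ['index_adult', 'age']
print(dropped.columns.tolist())  # ['age']

# inplace=True modifies df_toy itself and returns None.
df_toy.reset_index(inplace=True)
print(df_toy.index.tolist())     # [0, 1, 2]
```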

Filtering

Filter on India

After resetting our index and applying a filter for India, we can see that the index carries over from df, just as it did with sampling. Now the row index (4, 312, 637, 902, ...) comes from df, and index_adult holds the indices of these rows in adult.
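A minimal sketch of this filter, assuming a boolean mask on `native_country`. The frame `df_toy` and its values are hypothetical and stand in for df after the reset.

```python
import pandas as pd

# Hypothetical frame after reset_index(): default index plus index_adult.
df_toy = pd.DataFrame({
    "index_adult": [3, 4, 6, 8, 11],
    "native_country": ["India", "Cuba", "India", "Mexico", "India"],
})

# Boolean-mask filter; the surviving rows keep their labels from df_toy.
ind_toy = df_toy[df_toy["native_country"] == "India"]

print(ind_toy.index.tolist())           # [0, 2, 4]
print(ind_toy["index_adult"].tolist())  # [3, 6, 11]
```

On the real file, string values may carry a leading space (for example " India") unless `skipinitialspace=True` is passed to `read_csv`, so the comparison string may need adjusting.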

Let's look at the observations with more than 50k income by gender.

Even this summary dataframe has an index, though it is hard to recognise just by looking at it, and individual items can be accessed by their row and column labels.
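The original summary table is not shown here, so the sketch below assumes it was something like a cross-tabulation of sex against label. `pd.crosstab` is one way to build such a table; the data in `ind_toy` is invented.

```python
import pandas as pd

# Hypothetical subset of people from India.
ind_toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "label": [">50K", "<=50K", ">50K", "<=50K"],
})

# Rows are indexed by the values of "sex", columns by the values of "label".
counts = pd.crosstab(ind_toy["sex"], ind_toy["label"])

print(counts.index.tolist())       # ['Female', 'Male']
print(counts.loc["Male", ">50K"])  # 2
```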

Filter ind dataset for people with income more than 50K

indices are intact with their rows, just like an address

Very impressive, these people are earning really well. Let's find out their work hours. Since we don't have ‘hours_per_week’ in this data, we will bring it in from adult with the help of the indices.

Indices made it easy to bring in more information. The above formula can be understood as filtering adult[‘hours_per_week’] on the addresses stored in index_adult of ind_50.
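A sketch of that lookup on hypothetical data: `adult_toy` stands in for adult and `ind_50_toy` for ind_50. The exact expression in the notebook may differ; this version uses `.loc` on the column with the stored labels.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
adult_toy = pd.DataFrame({"hours_per_week": [40, 13, 40, 45, 60, 50]})

# Hypothetical high earners; index_adult stores each row's label in adult_toy.
ind_50_toy = pd.DataFrame({"index_adult": [1, 4, 5]}, index=[0, 2, 4])

# Use the stored labels as addresses into the original column.
hours = adult_toy["hours_per_week"].loc[ind_50_toy["index_adult"]]

print(hours.tolist())  # [13, 60, 50]
print(hours.mean())    # 41.0
```

The result is indexed by the adult labels, so attaching it as a new column of the filtered frame would need the raw values, for example via `hours.to_numpy()`.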

mean of work hours per week for people who earn more than 50k

Just by using index_adult, we were able to bring in another column's information easily.

Indexes make filtering very easy and also give you room to move forwards and backwards in your data.

one last use of the index for this intro exercise

Filtering a complementary set from the data, just like splitting a total dataset into train and test.

We are slicing the part of ind that is not in ind_50, i.e. people who are earning 50k or less.
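One way to take such a complement is to negate an `isin` test on the index, sketched below with hypothetical frames. The name `ind_rest_toy` is illustrative and not from the notebook.

```python
import pandas as pd

# Hypothetical subset with labels carried over from an earlier filter.
ind_toy = pd.DataFrame(
    {"label": [">50K", "<=50K", ">50K", "<=50K"]},
    index=[0, 2, 4, 7],
)
ind_50_toy = ind_toy[ind_toy["label"] == ">50K"]

# Keep the rows of ind_toy whose index labels are NOT in ind_50_toy.
ind_rest_toy = ind_toy[~ind_toy.index.isin(ind_50_toy.index)]

print(ind_rest_toy.index.tolist())  # [2, 7]
```

When the index labels are unique, `ind_toy.drop(ind_50_toy.index)` gives the same rows.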

Nice! It looks like people who earn more than 50k work more hours per week in this data sample.

The Jupyter notebook can be downloaded from this GitHub repo.

The true capability of the pandas index can be realised only when we drill down into our data with multi-indexing & visualisations. Visit my next exercise on stack/unstack, pivot_table & crosstab.

Thanks for reading. If you have liked this article, you may also like Pandas Pivot & Stacking, Scaling & Transformation When & Where
