India Air Quality Data Analysis

A 2018 report by the Health Effects Institute on air pollution in India found that air pollution was responsible for 1.1 million deaths in India in 2015.

Shubhankar Rawat
Towards Data Science


AIR POLLUTION refers to the release of pollutants into the air that are detrimental to human health and the planet as a whole.

Today, air pollution is one of the most significant problems any nation has to deal with. In South Asia, it is ranked as the sixth most dangerous killer.
One does not realize the harmful effects of a problem without having experienced it first-hand.
Take Delhi, for instance: we have all experienced what it feels like to inhale the ‘deadly’ smog that lingered for about a week after Diwali. Citizens were advised not to leave their homes and were asked to wear masks whenever they went outside. Looking out of the window made me feel like I was living in a gas chamber. Low visibility and a high number of deaths were among the effects of the pollution.

Being a data analysis and data science enthusiast, I decided to analyze the air quality data of my own country to find underlying patterns that might show how severe the problem is, and I must say the results were worth sharing. So I am writing this article to share my approach and findings, and to make people aware of the enormous problem our country is facing.

The Approach

The following data analysis is carried out in Python, and the code can be downloaded from the GitHub repository: Air quality analysis

Data

This data is a cleaner version of the Historical Daily Ambient Air Quality Data released by the Ministry of Environment and Forests and Central Pollution Control Board of India under the National Data Sharing and Accessibility Policy (NDSAP).

The dataset contains the following features :

  1. stn_code : Station code. A code is given to each station that recorded the data.
  2. sampling_date: The date when the data was recorded.
  3. state: The state in which the air quality data was measured.
  4. location: The city in which the air quality data was measured.
  5. agency: Name of the agency that measured the data.
  6. type: The type of area where the measurement was made.
  7. so2: The amount of Sulphur Dioxide measured.
  8. no2: The amount of Nitrogen Dioxide measured
  9. rspm: Respirable Suspended Particulate Matter measured.
  10. spm: Suspended Particulate Matter measured.
  11. location_monitoring_station: It indicates the location of the monitoring area.
  12. pm2_5: The amount of fine particulate matter (PM2.5, particles up to 2.5 micrometres in diameter) measured.
  13. date: The date of recording (a cleaner version of the ‘sampling_date’ feature).

Why these features?

SO₂: Sulphur dioxide is a gas. It is one of the major pollutants present in the air.
It is colourless and has a nasty, sharp smell.
It combines readily with other chemicals to form harmful substances such as sulphuric acid and sulphurous acid.
Sulphur dioxide affects human health when it is inhaled. It irritates the nose, throat, and airways, causing coughing, wheezing, shortness of breath, or a tight feeling around the chest. Those most at risk of developing problems when exposed to sulphur dioxide are people with asthma or similar conditions. The concentration of sulphur dioxide in the atmosphere can also influence habitat suitability for plant communities and animal life.
Inhaling sulphur dioxide is associated with increased respiratory symptoms and disease, difficulty in breathing, and premature death.
It also causes acid rain.

NO₂: Nitrogen Dioxide is a reddish-brown gas with a pungent, acrid odour.
It can cause bronchoconstriction, inflammation, reduced immune response, and may have effects on the heart. Direct exposure to the skin can cause irritations and burns.
The following gives a rough idea of nitrogen dioxide’s impact on health :
10–20 ppm can cause mild irritation of the nose and throat
25–50 ppm can cause oedema leading to bronchitis or pneumonia
Levels above 100 ppm can cause death due to asphyxiation from fluid in the lungs.
High levels of NO₂ can harm vegetation, including leaf damage and reduced growth. It can make vegetation more susceptible to disease and frost damage.
Longer exposures to elevated concentrations of NO₂ may contribute to the development of asthma and potentially increase susceptibility to respiratory infections.

Particulates: These are also known as Atmospheric aerosol particles, atmospheric particulate matter, particulate matter (PM) or suspended particulate matter (SPM).
These are microscopic solid or liquid matter suspended in the atmosphere.
Particulates are the deadliest form of air pollution because they can penetrate deep into the lungs and bloodstream unfiltered, causing permanent DNA mutations, heart attacks, respiratory disease, and premature death.
Worldwide exposure to PM 2.5 contributed to 4.1 million deaths from heart disease and stroke, lung cancer, chronic lung disease, and respiratory infections in 2016. Overall, ambient particulate matter ranks as the sixth leading risk factor for premature death globally.

The harmful effects of the above pollutants are widely documented, which makes them essential factors to analyze when discussing air pollution.

Coming back to the analysis.

DATA EXPLORATION
Let us get some insights about the data — the number of entries in each column, the type of entry in each column, etc.

From the above figure, we see that we have 435742 entries in our dataset.
We also see that we have only two data types: float and object.
There are very few values present for pm2_5.
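
For readers following along without the figures, here is a minimal sketch of how such a summary could be produced with pandas. The tiny DataFrame below is invented purely for illustration (it is not the real dataset), and the file name in the comment is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; all values are made up for illustration.
# With the actual data one would start from something like
# df = pd.read_csv("data.csv")  # hypothetical file name
df = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "location": ["City 1", "City 2", "City 3"],
    "so2": [8.0, 2.0, np.nan],
    "no2": [60.0, 10.0, 12.0],
    "pm2_5": [np.nan, np.nan, 40.0],
})

n_rows, n_cols = df.shape   # number of entries and number of columns
non_null = df.count()       # non-null entries in each column

print(n_rows, n_cols)
print(df.dtypes)            # data type of each column
print(non_null)
```

Calling df.info() prints the same information (entry count, dtypes, and non-null counts) as a single table.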

Now, let us check the null values.

Null values in the dataset

It seems that we have a lot of null values in some columns.
Looking at the figure, we see that pm2_5 has very few non-null values, so it might not be able to contribute much.
stn_code, agency, and spm also contain many null values.
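
A null-value check like the one above can be sketched as follows. The data here is a made-up toy frame, so the counts it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data with gaps in several columns.
df = pd.DataFrame({
    "stn_code": [np.nan, 150.0, np.nan, np.nan],
    "so2": [4.8, np.nan, 6.2, 3.1],
    "spm": [np.nan, np.nan, 210.0, np.nan],
    "pm2_5": [np.nan, np.nan, np.nan, 35.0],
})

# Count the nulls in each column and list the worst columns first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)
```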

Let us stop for a bit and check how helpful the features are.
If I have to analyze the air pollution data of India, do I need to consider the name of the agency that provided the data?
No, because the agency’s name has nothing to do with how polluted a state is. Similarly, stn_code is unnecessary.
The data description says that date is a cleaner representation of the sampling_date attribute, so we will eliminate the redundancy by removing the latter.
The location_monitoring_station attribute is also unnecessary, as it contains the location of the monitoring station, which we do not need for this analysis.

So, to summarize we will delete the following features from our dataset :
agency, stn_code, sampling_date and location_monitoring_station.
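
Dropping these four columns is a one-liner in pandas. The sketch below uses a single invented row just to show the column names before and after; none of the values come from the real data.

```python
import pandas as pd

# One made-up row carrying the columns discussed in this section.
df = pd.DataFrame({
    "stn_code": [1.0],
    "sampling_date": ["01/02/1990"],
    "state": ["State A"],
    "location": ["City 1"],
    "agency": ["Agency X"],
    "type": ["Industrial Area"],
    "so2": [4.8],
    "location_monitoring_station": ["Station Y"],
    "date": ["1990-02-01"],
})

# The four features judged unnecessary for the analysis.
unneeded = ["agency", "stn_code", "sampling_date", "location_monitoring_station"]
df = df.drop(columns=unneeded)

print(list(df.columns))
```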

Let us have a look at what we are left with

Dataset after removing the unnecessary columns

Now let us consider the type feature.
It represents the type of area where the data was recorded like industrial, residential, etc.
Let us see how many types of area were considered :

Different categories in the type attribute

It seems that we have redundant types.
Looking at the above figure, it can be said that a given area can be classified into three classes or types: industrial, residential, other.
So, we must remove this redundancy and make cleaner classes.
I have simplified the type attribute to contain only one of the three above mentioned categories.
type attribute after changing the categories :

Now it looks much better and cleaner. We can visualize the type attribute using cat plots.
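
One possible way to collapse the categories is a simple keyword rule, sketched below. The raw labels in this example are assumed for illustration, and the rule itself is my guess at a reasonable mapping rather than the exact logic used in the repository.

```python
import numpy as np
import pandas as pd

# Assumed examples of redundant raw labels, plus one missing value.
raw = pd.Series([
    "Residential, Rural and other Areas",
    "Industrial Area",
    "Industrial Areas",
    "Sensitive Area",
    "Residential and others",
    np.nan,
], name="type")

def simplify_type(label):
    """Collapse a raw label into industrial / residential / other."""
    if pd.isna(label):
        return label              # keep missing values missing
    text = str(label).lower()
    if "industrial" in text:
        return "industrial"
    if "residential" in text:
        return "residential"
    return "other"

simple = raw.map(simplify_type)
counts = simple.value_counts()    # entries per simplified category
print(counts)
```

The counts per category are what a count plot of the type column would display.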

the plot of Number of entries vs Categories in the type column

Looking at the above figure, we can say that the data was recorded with the main focus near the residential area as it has the highest number of entries. This is quite obvious as one is more concerned about the places where people live.

Now let us consider the null values.
We have a few null values in type, location, and so2, so we will remove the rows that have a null value in any of these three attributes.

Now let us see how our data looks :

Dataset after deleting the rows which had null values in so2, location

We can see that we are left with 396157 rows, so we have removed 39585 rows, roughly 9% of the original 435742.
It is just my personal preference to delete the rows with null values in these three columns rather than impute them. One can impute these null values; however, it would be bad practice to impute columns like pm2_5, which have a high number of null values.
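
As a rough sketch of this step, dropna with a subset removes a row when any of the listed columns is null and ignores nulls elsewhere (such as pm2_5). The rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: only the first and last are complete in all three columns.
df = pd.DataFrame({
    "type": ["residential", None, "industrial", "other", "residential"],
    "location": ["City 1", "City 2", None, "City 4", "City 5"],
    "so2": [9.1, 12.0, 7.5, np.nan, 3.3],
    "pm2_5": [np.nan, np.nan, np.nan, np.nan, np.nan],
})

before = len(df)
# Drop a row if type, location, or so2 is missing; pm2_5 is not considered.
cleaned = df.dropna(subset=["type", "location", "so2"])
removed = before - len(cleaned)

print(len(cleaned), removed)
```

Passing how="all" instead would drop only the rows where all three columns are missing at once.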

DATA VISUALIZATION

Let us plot the concentration of so2 in different states, using bar plots, in descending order.
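
A bar plot of this kind could be built roughly as follows. I am assuming here that each state's bar is the mean of its so2 readings; the state names and values are placeholders, not real measurements.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data, invented for illustration.
df = pd.DataFrame({
    "state": ["State A", "State A", "State B", "State B", "State C"],
    "so2": [10.0, 14.0, 4.0, 6.0, 8.0],
})

# Average so2 per state, highest first (using the mean is an assumption).
state_so2 = df.groupby("state")["so2"].mean().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 3))
state_so2.plot.bar(ax=ax)
ax.set_xlabel("state")
ax.set_ylabel("mean so2")
fig.tight_layout()
```

The same pattern applies to no2, rspm, spm, and pm2_5 by swapping the column name.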

barplot of so2 vs states

From the above figure, we see that so2 level is highest in Uttarakhand and lowest in Chandigarh.
Uttarakhand, Sikkim, Jharkhand, Gujarat, Maharashtra, Chhattisgarh: the government should take action against the growing so2 concentration in these states.
Let us go even deeper by plotting the so2 concentration location-wise (by city) :

barplot of so2 vs location for 50 sites with highest so2 concentrations

The above plot shows 50 places with the highest so2 levels in descending order.
We can see that Dharuhera, located in Haryana, has the highest so2 concentration, followed by Jamshedpur, which is situated in Jharkhand.
Amlai(Madhya Pradesh) and Sindri(Jharkhand), on the other hand, have the lowest so2 concentrations among the top 50 locations.
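
Picking the highest and lowest locations can be sketched with nlargest and nsmallest. The example keeps 2 locations instead of 50 so it stays readable, again assumes the mean as the summary, and uses placeholder names and values.

```python
import pandas as pd

# Placeholder readings for four locations in three states.
df = pd.DataFrame({
    "state": ["S1", "S1", "S2", "S2", "S3", "S3"],
    "location": ["L1", "L1", "L2", "L3", "L4", "L4"],
    "so2": [30.0, 34.0, 12.0, 20.0, 2.0, 4.0],
})

# Mean so2 per location; the state is kept in the index for labelling.
loc_so2 = df.groupby(["location", "state"])["so2"].mean()

top = loc_so2.nlargest(2)      # use 50 for the plots in this article
bottom = loc_so2.nsmallest(2)

print(top)
print(bottom)
```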

Now let us plot the 50 locations with least so2 concentrations :

barplot of so2 vs location for 50 places with least so2 concentrations

We can see that the so2 level is lowest in Kottayam, which is situated in Kerala. This is expected, as the bar plot of so2 vs state showed Kerala among the states with the lowest so2 concentrations.
In fact, the locations Malappuram(Kerala), Konark(Odisha), Dawki(Meghalaya)… Nalagarh(Himachal Pradesh), Naharlagun(Arunachal Pradesh) and Kalyani(West Bengal) have almost the same so2 concentrations.

Let us now look at no2 concentrations :

barplot of no2 vs state

It is clear from the figure that West Bengal has the maximum level of no2, whereas Nagaland has the least.
Delhi(the capital) is ‘ranked’ second, followed by Jharkhand.
It is not surprising as Delhi hit the headlines a couple of times over the past few years, regarding air pollution and specifically no2 concentrations.
Let us go even deeper and see which locations are most affected :

barplot of no2 vs location for 50 sites with highest no2 concentrations

The above plot shows 50 places with the highest no2 level.
We can see that Howrah(West Bengal) has the highest no2 concentration, followed by Badlapur(Maharashtra) and Durgapur(West Bengal).
It is clear that the most polluted city(Howrah) belongs to the most polluted state(West Bengal), in terms of no2.
Now let us see the 50 locations with least no2 concentrations :

barplot of no2 vs location for 50 sites with the least no2 concentrations

We see from the above plot that Rudrapur(Uttarakhand) is the least polluted city in terms of no2, followed by Alappuzha(Kerala) and Kohima(Nagaland).

Let us now look at rspm :

barplot of rspm vs state

From the above plot, we see that Delhi has the highest concentration of rspm, which again is not surprising as it can be read from any news article related to pollution in India over the past few years.
Delhi has been an important target when it comes to air pollution. The significant rise in the pollutants in Delhi has made people suffer a lot and has been responsible for the deaths of thousands.
From the above plot, we also see that Uttar Pradesh(UP) is not far behind Delhi in terms of rspm. UP is the most populated state in the country, and its air is not ‘safe’ to breathe. Dealing with the rise in rspm levels is especially urgent in UP, as it is home to more than 20 crore people. Interestingly, UP and Delhi are neighbours.
Next, we have Punjab. Punjab has always made it to the headlines when it comes to air pollution, primarily because of the farmers. This has been such a significant concern that the government has released so many policies and programs to prevent rice farmers from clearing their fields by burning the stubble that remains once paddy is harvested.

On the other hand, we have Sikkim with the least concentration of rspm, followed by Mizoram and Puducherry.

Let us delve in deeper and see the concentrations of rspm location-wise :

barplot of rspm vs location for 50 locations which have the highest levels of rspm

As seen from the above plot, Kashipur(Uttarakhand) has the highest level of rspm, followed by Ghaziabad(UP) and Allahabad(UP). Most of the top locations belong to UP, which is consistent with UP being the second most polluted state in terms of rspm.

Now let us look at the locations with least concentrations of rspm :

barplot of rspm vs location for few sites which have the least levels of rspm

We see from the above plot that the lowest level of rspm is in Pathanamthitta(Kerala), followed by Nongstoin(Meghalaya) and Champhai(Mizoram). We can see that the states to which these locations belong lie toward the lower end of the barplot of rspm vs states.

Now let us consider spm :

barplot of spm vs states

We see that the pair of Uttar Pradesh(UP) and Delhi again tops the list. UP and Delhi have comparable concentrations of spm as well as rspm.
A news report was found stating :

In 1997, an air quality monitoring station in Lucknow, the capital of Uttar Pradesh, recorded the maximum level of suspended particulate matter (spm) at 2,339 microgrammes per cubic metre (µg/cum), more than 11 times the permissible limit for residential areas and four times the limit for industrial areas. This is as high as the maximum ever recorded in Delhi: 2,340 µg/cum in 1992.

If we look at the plot more carefully, we will observe that the level of spm in Delhi and UP is far higher than in any other state. Rajasthan, at number 3, has concentrations that are significantly lower than those of UP or Delhi.
Goa, on the other hand, has the lowest concentration of spm, followed by Kerala and Meghalaya.
Note that Arunachal Pradesh and Telangana have only null values for spm; this does not mean that the spm concentration is zero in these states.

Let us go more in-depth and look at the locations which have the highest concentration of spm :

barplot of spm vs location for sites with the highest level of spm

We see from the above barplot that Meerut(UP) is the most polluted city in terms of spm followed by Khurja(UP) and Ghaziabad(UP).
In fact the top 7 cities namely, Meerut, Khurja, Ghaziabad, Kanpur, Firozabad, Noida and Allahabad, are situated in UP. This is an alarming situation as the most populated state of the country is most polluted when it comes to spm or rspm.

Now let us consider the last but not the least feature, i.e., pm2_5: the value of particulate matter measured/recorded :

bar plot of pm2_5 vs state

We can see that we have null values for most of the states, which is expected: as discussed above, pm2_5 has the highest number of null values (97.86% of its values are null).
From whatever we have, we see that Delhi again tops the list, followed by West Bengal and Madhya Pradesh.
Not much can be discussed for pm2_5 as it has a large number of null values.
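
A percentage like the one quoted for pm2_5 can be computed as the mean of the null mask times 100. The sketch below uses a made-up toy frame, so its numbers are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data: pm2_5 is missing in three of four rows.
df = pd.DataFrame({
    "so2": [4.0, 5.0, np.nan, 6.0],
    "pm2_5": [np.nan, np.nan, np.nan, 40.0],
})

# isnull() gives True/False per cell; the column mean is the null fraction.
pct_null = df.isnull().mean() * 100
print(pct_null)
```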

Let us plot the barplot between pm2_5 and locations for all the non-null values :

barplot of pm2_5 vs location for all non-null values

We see that Delhi is still on the top, followed by Talcher(Odisha) and Gwalior(Madhya Pradesh).

Statistical Analysis

Now let us do some statistical analysis for the dataset and check whether these features have some relations.

We will start by plotting the scatter plot for each feature :

scatter plot for each column

First things first, I will not comment on the relationship between pm2_5 and any other feature, simply because pm2_5 has tons of null values. Hence, its statistical significance is very low, maybe even negligible.

so2 and no2 values are highly concentrated near the origin, which means that both are low for most of the observations.
We can see that no2 and so2 have a somewhat similar pattern with other features.

It can be said that spm and rspm share a somewhat linear relationship; the remaining features are not clearly related to one another.

For a more in-depth analysis, let us look at the correlation matrix :

Correlation matrix for the dataset

It is clear from the correlation matrix that we have some correlation between spm and rspm, which supports our scatter plot analysis.
There is little correlation between other features.
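
A correlation matrix of this kind can be sketched with DataFrame.corr, which computes Pearson correlation by default. The numbers below are invented so that spm and rspm move together while so2 does not; they say nothing about the real strengths of these relationships.

```python
import pandas as pd

# Invented readings: spm rises with rspm, so2 varies independently.
df = pd.DataFrame({
    "state": ["State A", "State B", "State C", "State D", "State E"],
    "so2":  [5.0, 9.0, 4.0, 8.0, 6.0],
    "rspm": [100.0, 150.0, 200.0, 250.0, 300.0],
    "spm":  [210.0, 290.0, 420.0, 480.0, 610.0],
})

# Select the numeric pollutant columns explicitly before correlating.
corr = df[["so2", "rspm", "spm"]].corr()
print(corr.round(2))
```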

Date Feature

Let us now use the date feature, which we have not touched yet.
The date feature gives the date on which the data was recorded.
Let us derive a new feature (year) from the date feature, because we are interested in the annual effects of air pollution.

The data looks like the following after adding the year column :

dataset after devising year column from date feature

Now that we have created a year column, we can analyze the data annually.
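
Deriving the year can be sketched as below. I am assuming the date column holds strings in YYYY-MM-DD form; the dates themselves are made up.

```python
import pandas as pd

# Made-up dates, including one missing entry.
df = pd.DataFrame({"date": ["1990-02-01", "2004-11-15", None, "2015-12-31"]})

# Parse the strings; anything unparseable or missing becomes NaT.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Extract the calendar year; rows without a date get a missing year.
df["year"] = df["date"].dt.year
print(df)
```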

so2 analysis using the heatmap

Let us plot the heatmap for so2 where
row: state attribute
column: year attribute
value: so2 attribute
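
The row/column/value layout above maps directly onto a pivot table. Here is a minimal sketch with placeholder data, assuming the cell value is the mean so2 for that state and year (the aggregation used in the repository may differ).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder readings, invented for illustration.
df = pd.DataFrame({
    "state": ["State A", "State A", "State A", "State B", "State B"],
    "year":  [1990, 1990, 2000, 1990, 2000],
    "so2":   [20.0, 30.0, 10.0, 6.0, 4.0],
})

# Rows are states, columns are years, cells hold the mean so2 (assumed).
table = df.pivot_table(index="state", columns="year", values="so2", aggfunc="mean")

fig, ax = plt.subplots()
im = ax.imshow(table.to_numpy(), aspect="auto", cmap="viridis")
ax.set_xticks(list(range(table.shape[1])))
ax.set_xticklabels([str(y) for y in table.columns])
ax.set_yticks(list(range(table.shape[0])))
ax.set_yticklabels(list(table.index))
fig.colorbar(im, ax=ax, label="mean so2")
```

The same table can be passed to seaborn.heatmap for a labelled heatmap in one call, and swapping the values column gives the no2, rspm, and spm versions.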

heatmap for so2 with state and year attributes

It is evident from the heatmap that there has been a gradual increase of so2 concentration in Bihar from 1987 to 1999.
Similarly, there was a high concentration of so2 in Gujarat around 1995. In Haryana, the so2 level was high around 1987 and stayed consistently high until 2003.
Karnataka shows a gradual increase in so2 concentration from 1987 to 2000.
Puducherry witnessed a high so2 concentration around 1996.
Rajasthan experienced a high concentration of so2 around 1987.
Uttarakhand has been experiencing a high level of so2 concentration from 2004 till now.
In West Bengal, the so2 concentration has been consistently high from 1987 to 2000.

The above analysis shows that the presence of the pollutant sulphur dioxide has been high from 1980 to 2000 in some states but has decreased in the new century(from 2000).

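I found the following news article related to this conclusion. Note that it describes a CPCB claim of a decrease in SO2 between 2001 and 2010, which is in line with the heatmap, but it also reports that NASA satellite data calls that claim into question, so the apparent decrease should be read with caution.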

Data released by NASA’s Aura satellite calls into question the veracity of Central Pollution Control Board’s (CPCB) claim made in 2012 that the mean sulphur dioxide (SO2) emissions in India decreased in 2010 as compared to 2001 level.

Some of the states like Uttarakhand, Jharkhand, Sikkim, etc. still experience considerably high levels of so2 concentration.
Note that these are the states that were at the top in the barplot of so2 vs states.

The heatmap suggests that, in response to the dangerous levels of so2 concentration, the right steps were taken to decrease it. For example, The Air (Prevention and Control of Pollution) Act was introduced in 1981 and was amended in 1987.
The implementation appears to have shown results, as the level of so2 decreased.

no2 analysis using the heatmap

In the following heatmap, we have
row: state attribute
column: year attribute
value: no2 attribute

heatmap for no2 with state and year attributes

We can see that states like Rajasthan, Bihar, Delhi, Haryana, Jharkhand, Puducherry, West Bengal have experienced a severe level of no2 concentration.

The no2 concentration has decreased annually in some states like Rajasthan, whereas in states like Bihar, Delhi, etc. it has increased.
In other states like West Bengal, Jharkhand the no2 concentrations have remained consistently high.
If we look at the heatmap closely, then we see from 2000 onwards the no2 concentrations have increased(as a whole) throughout the country.
I found the following news article which supports the above conclusion :

The emission of the nitrogen dioxide pollutant has gone up significantly in the South Asia region, including India, during the 2005–2014 period, severely affecting air quality in the process, NASA satellite maps show.

rspm analysis using the heatmap

In the following heatmap, we have
row: state attribute
column: year attribute
value: rspm attribute

heatmap for rspm with state and year attributes

Heatmaps are a vital tool for data analysis because they make patterns easy to see. One can quickly spot changes over time and the level of rspm in each state for a given year.

Here we see that states like Delhi, Punjab, Uttar Pradesh, Haryana, Jharkhand have suffered from high levels of rspm.

spm analysis using the heatmap

In the following heatmap, we have
row: state attribute
column: year attribute
value: spm attribute

heatmap for spm with state and year attributes

Here we see that states like Delhi, Haryana, Punjab, Uttar Pradesh have been the prime sufferers from the high concentration of spm.

Conclusions

From the above analysis, we see that the states most affected by air pollution in India belong to the northern region.
States like Delhi, Punjab, Uttar Pradesh, Haryana are heavily polluted and require immediate action.
We also saw that even if a state had a high level of pollutants, there were some regions in the states that were not polluted.

We also see from the statistical analysis (scatter plots and the correlation matrix) that high rspm concentrations tend to go together with high spm concentrations.

From the heatmaps, we conclude that some states were heavily polluted in the earlier years(1980 to 2000) but improved later.
The reason for the decrease could be greater awareness among citizens and government policies.
For example, The Air (Prevention and Control of Pollution) Act, 1981 was an Act to provide for the prevention, control and abatement of air pollution.
Also, I found a news article which stated the following :

Story of vehicle emission controls began in India when mass emission norms were enforced for the first time for petrol vehicles in 1991 and for diesel vehicles in 1992. Emission norms were further tightened in 1996 with the compulsory fitment of catalytic converters in petrol cars. Bharat Stage emission norms (equivalent to Euro norms for four-wheeled vehicles) were first introduced in 2000. These norms specify the maximum permissible emission limit for carbon monoxide (CO), hydrocarbons (HC), nitrous oxides (NOx) and particulate matter (PM).

The above news article clearly states that the government took necessary steps to counter the growing air pollution (that we saw) from 1990 to 2000.
This could be one of the reasons why we saw in the heatmaps that the concentrations of pollutants in some states started decreasing from 2000.

Another news article stated below supports the above conclusions.

The 1997 White Paper sponsored by the Ministry of Environment and Forests already proposed various measures to bring down pollution caused by traffic, including smoothing the flow of traffic with parking regulations and bringing down total traffic by mandatory limits on driving. City authorities claim to have had some success in bringing down air pollution; for instance, during the bidding process for the 2014 Asian Games, the city’s organizing committee had claimed that “pollution levels had come down drastically in Delhi with the arrival of Metro rail as well as all public transport vehicle being run compulsorily on CNG(Compressed Natural Gas)”.

End Notes

From the above data analysis approach, we conclude that data analysis is a crucial aspect for a better future.

The approach was purely data-driven but was backed by real-life instances(news articles).

It is interesting to see how well the data analysis agrees with day-to-day events and how data analysis can be used to deal with significant problems.

I would recommend following the GitHub repository link(provided at the beginning of this article) to find the code for the above approach and get deeper insight into how data analysis can be implemented in Python.

So, this was my analysis of my country, which is slowly turning into a gas chamber. India is not alone; other countries are also suffering from air pollution.
We must find a cure for this significant problem, as it is slowly killing our nation.

If you have any suggestions or improvements, please write them in the comment section. I will be highly obliged.


I am a data science and machine learning enthusiast, who loves to share knowledge.