Identifying the Sources of Winter Air Pollution in Bangkok Part II

Worasom Kundhikanjana
Towards Data Science
9 min read · Mar 30, 2019


Mae Fah Luang University Campus in March 2019. (Photo by MFU Photoclub with permission)

In the previous post, I looked at winter air pollution in Bangkok. The main pollutant is particles smaller than 2.5 micrometers (PM 2.5 particles). These particles are smaller than the width of a human hair and can easily enter our bodies, even making their way into our blood. Last week (March 17, 2019), many provinces in northern Thailand had the worst Air Quality Index (AQI) in the world due to particle pollution. So far, no long-term solution has been proposed because the source of the PM 2.5 pollution has not been clearly pinpointed. In this notebook, I identify the sources of high PM 2.5 levels in Bangkok using a machine learning model. The code can be found on my GitHub page.

High PM 2.5, Who Are the Culprits?

There are three major theories regarding the source of air pollution in Bangkok: (1) The temperature inversion effect, where cold air along with pollution is trapped close to the surface of the Earth. This theory was proposed by the government at the beginning of the 2019 winter season. The government blamed emissions from old diesel engines for the pollution. (2) Agricultural burning, either locally or in surrounding provinces. During winter, a lot of open agricultural burning occurs throughout the country. Some officials have tried to tackle the air pollution problem by reducing open agricultural burning. (3) Pollution from other provinces or countries. Some NGOs blamed the pollution on nearby power plants.

My analysis procedure is as follows: (1) Build a machine learning (ML) model to predict the air pollution level in Bangkok using environmental factors such as weather, traffic index, and fire maps. (2) Include date-time features, such as local hour and weekday versus weekend, in the model to capture other effects of human activities. (3) Identify the dominant sources of pollution using the feature importance provided by the ML model.

If the source of the pollution is local, then the AQI will depend on factors such as weather patterns (wind speed, humidity, average temperature), local traffic, and hour of day. If the pollution is from agricultural burning, the AQI will depend on active fires with some time lag to account for geographical separation. Fire activities are included based on their distance from Bangkok. On the other hand, if the pollution is not correlated with the fire map, then the model should put more weight on weather patterns, such as wind direction and wind speed.

Here is a list of the features I considered and their data sources:

  • Active fire information from NASA’s FIRMS project
  • Weather pattern: temperature, wind speed, humidity, and rain, scraped from the Weather Underground website
  • Traffic index from Longdo Traffic
  • Date time features: hour of day, time of day, and holiday patterns (explored in the Part I blog post)

Let me first walk through all the features included in the model.

Agricultural Burning is a Major Problem!

Farmers in Southeast Asia pick January through March as their burning season. For the north and northeastern provinces in Thailand, these burning activities are large enough to make these provinces among the most polluted places in the world during this time. For Bangkok, one might argue that because the region is heavily industrial rather than agricultural, it may not be affected as much by agricultural burning. But this is not the case.

Because of the tiny size of PM 2.5 particles, they remain suspended in the atmosphere for prolonged periods and can travel over very long distances. From the weather data, the average wind speed is 10 km/hour. The reported PM 2.5 level is a rolling average over 24 hours. A rough estimate is that the current PM 2.5 reading may be from sources as far as 240 km away. The picture below shows the fire map measured by NASA’s satellites, indicative of agricultural burning, on Jan 8, 2018 and on Feb 8, 2018. The yellow circle indicates the area within 240 km of Bangkok. The number of fires on Jan 8, which has an acceptable level of pollution, is much lower than the number of fires on Feb 8, which has an unhealthy level of pollution.

Fire spots from NASA’s satellites

In fact, the fire pattern closely aligns with the PM 2.5 pattern.

The number of fires aligns with spikes in PM 2.5 levels
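The 240 km estimate above is simply the average wind speed multiplied by the 24-hour averaging window. A minimal sketch of that arithmetic, together with one possible way to count fire hotspots in distance rings around Bangkok, is shown here. The city-center coordinates, the `count_fires_by_ring` helper, and the hotspot list are illustrative assumptions, not the data or code used in this analysis.

```python
import math

WIND_SPEED_KMH = 10    # average wind speed quoted in the text
AVERAGING_HOURS = 24   # PM 2.5 readings are 24-hour rolling averages
reach_km = WIND_SPEED_KMH * AVERAGING_HOURS  # rough travel distance: 240 km

BANGKOK = (13.7563, 100.5018)  # approximate city center (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def count_fires_by_ring(fires, center=BANGKOK, ring_km=reach_km, n_rings=3):
    """Count hotspots in consecutive rings (0-240, 240-480, 480-720 km by default)."""
    counts = [0] * n_rings
    for lat, lon in fires:
        ring = int(haversine_km(center[0], center[1], lat, lon) // ring_km)
        if ring < n_rings:  # hotspots beyond the last ring are ignored
            counts[ring] += 1
    return counts


# Made-up hotspot coordinates, for illustration only.
example_fires = [(14.5, 100.9), (16.8, 100.3), (18.8, 99.0), (20.5, 97.0)]
ring_counts = count_fires_by_ring(example_fires)
print(reach_km, ring_counts)
```

Counting by ring rather than by a single radius lets a model weigh nearby and distant fires separately.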

Weather Patterns

The temperature inversion effect often occurs during winter because the temperature is cooler near the ground. The warmer air above keeps the cool air from rising and dispersing. This stagnant atmospheric condition allows the PM 2.5 particles to remain suspended in the air for longer. On the other hand, higher humidity or rain helps remove particles from the atmosphere. This is one reason why, in the past when the air pollution was high, the government sprayed water into the air. Unfortunately, this mitigation does not appear to be effective, since the volume of water is minuscule compared to actual rain. How much influence do weather patterns have on air pollution? Let’s compare the weather in winter versus other seasons.

Comparison of weather patterns in winter and other seasons

Temperature, wind speed and humidity are all lower in winter, but not by a large amount. Now, let’s look at the relationship of each of these with the PM 2.5 level.

Effect of temperature, wind speed, and humidity on PM 2.5 level in winter

Higher temperature (which disrupts the temperature inversion effect), wind speed and humidity have a negative correlation with the pollution level.

Effect of wind on PM 2.5 level in winter

On windy days, the pollution level is clearly lower. The median of the PM 2.5 distribution is lower on windy days than on days without wind.

In fact, the pollution level also depends on the wind direction, as seen in this plot. I selected only four major wind directions for simplicity.

PM2.5 relationship with the wind direction in winter

On days when the wind comes from the south, the pollution level is lower, likely because the Gulf of Thailand lies to the south of Bangkok. The clean ocean wind improves the air quality. Wind from the other three directions passes over land. However, having any wind is better than the stagnant atmospheric conditions on calm days.
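Reducing wind direction to four major directions can be sketched as below. This assumes the direction is available as a compass bearing in degrees (the direction the wind blows from); the helper name `major_wind_direction` is hypothetical, and the raw weather data may encode direction differently.

```python
def major_wind_direction(degrees):
    """Map a wind bearing in degrees to one of N, E, S, W."""
    sectors = ["N", "E", "S", "W"]
    # Shift by 45 degrees so each 90-degree sector is centered on its compass point.
    index = int(((degrees % 360) + 45) // 90) % 4
    return sectors[index]


print(major_wind_direction(200))  # a south-southwesterly wind falls in "S"
```

Grouping the PM 2.5 readings by the returned label then gives one distribution per direction, which is what a plot like the one above compares.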

The shift in the median PM 2.5 level between rainy days and days with no rain is smaller. There are fewer rainy days during the winter season, so the data is somewhat noisy, but a difference can be observed in the cumulative distribution function.

Effect of rain on PM 2.5 level in winter
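For readers unfamiliar with the comparison, an empirical cumulative distribution function (ECDF) can be computed in a few lines. The sketch below uses invented PM 2.5 readings purely to show the mechanics; the numbers are not from the dataset.

```python
from statistics import median


def ecdf(values):
    """Return (value, fraction of readings <= value) pairs in ascending order."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Invented readings, for illustration only.
pm25_no_rain = [62, 48, 55, 71, 39, 66, 58]
pm25_rain = [51, 44, 60, 37]

print("median, no rain:", median(pm25_no_rain))
print("median, rain:", median(pm25_rain))
for value, fraction in ecdf(pm25_rain):
    print(value, round(fraction, 2))
```

When one group's curve sits to the left of the other's, that group has lower readings across the whole distribution, which can be visible even when the medians are close.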

Traffic Index

One of the sources of PM 2.5 particles is car engine exhaust. While campaigning for more public transportation usage is in general good for the environment, its effectiveness in reducing PM 2.5 pollution is unclear. Here is why.

traffic index and air pollution

We have seen that PM 2.5 levels are related to the time of day. The pollution is lower around 3 pm, but remains high during the night time. When plotted against the traffic data, the relationship with the pollution level is very noisy. There does not seem to be a strong correlation.

PM2.5 relationship with traffic index

Including the time of day and weekday versus weekend information in the model might make the relationship clearer.
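Date-time features of this kind can be derived directly from each timestamp. The sketch below is one simple way to do it; the feature names are assumptions rather than the column names used in the notebook.

```python
from datetime import datetime


def datetime_features(ts):
    """Derive hour-of-day and weekday/weekend indicators from a timestamp."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),         # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5  # Saturday or Sunday
    }


print(datetime_features(datetime(2019, 1, 14, 8)))
```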

Autoregression Process

The current PM 2.5 value can also depend on the previous value. The partial autocorrelation plot below shows a strong correlation at a 1-hour time lag, which means the PM 2.5 level is an autoregressive process. Thus I include the 24-hour average values in the model, with the restriction that the model is only allowed to see the previous value when making future predictions. The importance of this feature should be directly related to how long the particles stay in the atmosphere.
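One way to build such a feature without leaking the value being predicted is to shift the series by one step before averaging. This is a sketch under that interpretation, using pandas and a toy series; the helper name `lagged_average` is hypothetical and the notebook may construct the feature differently.

```python
import pandas as pd


def lagged_average(pm25, window=24):
    """Mean of the `window` readings before each row; the current reading is excluded."""
    # shift(1) moves every reading one step later, so row t only sees t-1 and earlier.
    return pm25.shift(1).rolling(window, min_periods=window).mean()


# Toy hourly series 0, 1, 2, ... standing in for PM 2.5 readings.
pm25 = pd.Series(range(30), dtype=float)
prev_avg = lagged_average(pm25)
print(prev_avg.iloc[22:27])
```

The first 24 rows are left empty because a full window of earlier readings does not exist yet.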

Machine Learning Model

The picture below shows a dendrogram of all input features calculated from the Spearman correlation. The dendrogram helps to identify redundant features that can be removed from the model. The number of fires within various distances and the level of PM 2.5 are closely related. Other features are further away. I ended up using all of these features in the model.

dendrogram of input features
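A dendrogram like this can be built by turning the Spearman correlation matrix into a distance matrix and running hierarchical clustering on it. The sketch below uses SciPy and synthetic data; the feature names, the `1 - |rho|` distance, and the average-linkage choice are assumptions, not necessarily what produced the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 500
# Synthetic features: PM 2.5 is tied to nearby fires, the others are independent.
fires_near = rng.gamma(2.0, 50.0, n)
pm25 = 0.3 * fires_near + rng.normal(0, 3, n)
humidity = rng.uniform(40, 90, n)
traffic = rng.uniform(0, 10, n)
names = ["fires_near", "pm25", "humidity", "traffic"]
X = np.column_stack([fires_near, pm25, humidity, traffic])

rho, _ = spearmanr(X)            # 4 x 4 rank-correlation matrix
dist = 1.0 - np.abs(rho)         # strongly correlated features are "close"
dist = (dist + dist.T) / 2       # remove tiny floating-point asymmetries
np.fill_diagonal(dist, 0.0)      # a feature is at distance 0 from itself

Z = linkage(squareform(dist), method="average")
tree = dendrogram(Z, labels=names, no_plot=True)  # set no_plot=False to draw it
print(tree["ivl"])  # leaf order; features that sit together are the most related
```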

To identify the major contributions to the pollution, I fit a random forest regression model because of its simplicity and ease of interpretation. During hyper-parameter tuning, 25% of the data was allocated to the validation set. The model was then retrained using the entire dataset. The model achieves an R-squared of 0.99 on the training set. Since the purpose of this study is to understand the sources of the air pollution in the past, I focused on the training set. The plot below ranks the importance of each contributing factor. The importance is calculated from the decrease in the R-squared value upon permuting each column, normalized by the sum over all columns.

feature importance
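The permutation approach described above can be sketched as follows: fit a random forest, shuffle one column at a time, and record how much the R-squared drops. The data here is synthetic and the feature names are placeholders; hyper-parameter tuning is omitted, and clipping negative drops to zero before normalizing is my own assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 400
names = ["prev_pm25", "fires_near", "noise"]  # placeholder feature names
X = rng.normal(size=(n, 3))
# Synthetic target: dominated by column 0, weaker effect from column 1, none from column 2.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)


def permutation_importance_r2(model, X, y, rng):
    """Drop in R-squared when each column is shuffled, normalized to sum to 1."""
    base = r2_score(y, model.predict(X))
    drops = []
    for col in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])  # break the link to y
        drops.append(base - r2_score(y, model.predict(X_perm)))
    drops = np.clip(drops, 0.0, None)  # assumption: treat negative drops as zero
    return drops / drops.sum()


importance = permutation_importance_r2(model, X, y, rng)
for name, value in zip(names, importance):
    print(f"{name}: {value:.2f}")
```

As in the article, the importance here is measured on the training data. scikit-learn also provides `sklearn.inspection.permutation_importance`, which repeats the shuffling several times and reports the mean and spread.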

As expected, the previous pollution level is the most important predictor. This is followed by the number of fires, ordered from the closest to the furthest. The number of fires as far away as 720 km has more influence on the air quality than the local humidity, traffic, or even rain. The hour of day is a more important predictor than the traffic index. Among the weather features, humidity is the most important.

The influence of each feature is illustrated below using a tree interpreter for the data on Jan 13, 2019 at 8 am, when the PM 2.5 level was 96.

model interpretation on Jan 23, 2019

We start with the average value of 26. The PM 2.5 level for the previous hour was 62, so the model adds 20. There were 150 fires within a 240 km radius, so the model adds 10 to the pollution level. The value is now 56. There were 1649 fires between 240 and 480 km and 896 fires between 480 and 720 km, for which the model adds 9 and 8 respectively. The low wind speed and the morning rush hour (8 am) together add 8. Starting from the average, these six top factors account for 81 of the 96 predicted for the PM 2.5 level. The remaining features to the right are less important and thus increase the predicted pollution value by smaller amounts.
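The running total in this walkthrough can be checked with a few lines of arithmetic. The numbers are the ones quoted above; the labels are only descriptive.

```python
baseline = 26    # average prediction the interpreter starts from
predicted = 96   # PM 2.5 level for this hour, as quoted in the text

# (feature, contribution) pairs in the order discussed above.
contributions = [
    ("previous-hour PM 2.5 = 62", 20),
    ("150 fires within 240 km", 10),
    ("1649 fires at 240-480 km", 9),
    ("896 fires at 480-720 km", 8),
    ("low wind speed + 8 am hour", 8),
]

running = baseline
for feature, delta in contributions:
    running += delta
    print(f"{feature:<28} +{delta:>2} -> {running}")

remaining = predicted - running  # left for all the less important features
print("explained so far:", running, "| remaining:", remaining)
```

The remainder of 15 is spread over the many features to the right of the chart.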

model interpretation on Feb 2, 2019

On a good day, such as Feb 2, 2019 at 7 pm, the PM 2.5 level was 10. The pollution level in the hour before was low, so the model subtracts 10. There were still a lot of fires in the area, and the model adds 2. The wind speed was high, reducing the value by 2. The weather and traffic were good. The combination of many factors results in a low predicted PM 2.5 level of 10.


The PM 2.5 level has a complex relationship with various factors: number of fires, weather patterns, and traffic. But this analysis confirms the suspicion that many people have: agricultural burning is the root cause of PM 2.5 pollution in Thailand. Burning activities as far as 720 km away from Bangkok, an area which extends into Myanmar, Laos, and Cambodia, can cause air quality problems in Bangkok. Solving this problem will not be easy. It will require a collaborative international effort among the Southeast Asian countries.

I leave you with a fire map from March 17, 2019, one of the worst days ever!

