The world’s leading publication for data science, AI, and ML professionals.

Air pollution analytics using PAMI

Identifying pollution hot spots in Japan through periodic-frequent patterns

Technological advances in the field of Information and Communication Technologies have enabled organizations to collect big data. Useful information that can empower the end-users to achieve socio-economic development lies hidden in these databases. The field of pattern mining emerged to discover interesting patterns (or itemsets) that may exist in big data. This tutorial covers the following topics:

  • What are periodic-frequent Patterns?
  • How to find periodic-frequent patterns in temporal big data?
  • Finding periodic-frequent patterns by employing algorithms available in the PAMI package.
  • What are the merits and limitations of periodic-frequent pattern mining?

What are Periodic-Frequent Patterns?

Periodic-frequent patterns [1] are an important class of regularities that exist in a temporal database. Finding these patterns in real-world applications is of high value as they represent something that is predictable with the database. The objective of periodic-frequent pattern mining is to discover all frequent patterns that were occurring regularly in a database.

Fig 1. Identifying the locations in which people were regularly exposed to harmful levels of PM2.5 in Japan [Image by Author]
Fig 1. Identifying the locations in which people were regularly exposed to harmful levels of PM2.5 in Japan [Image by Author]

Example: Air Pollution is the major cause of many cardio-respiratory problems reported in Japan. On average, 42.6 thousand people are dying every year in Japan due to pollution [2]. To tackle this problem, a nationwide sensor network, called SORAMAME (Atmospheric Environmental Regional Observation System: AEROS) [3], was set up by the Ministry of Environment, Japan, to monitor pollution. Figure 1a shows the spatial locations of these sensors in Japan. The raw data (see Figure 1b) generated by this system for an air pollutant at hourly intervals can be represented as a temporal database (see Figure 1c). Periodic-frequent pattern mining (see Figure 1c) on this database provides the environmentalists and policymakers with useful information regarding the areas (see Figure 1d & 1e) in which people were regularly exposed to harmful levels of air pollution.

Table 1: A sample temporal database [Image by Author]
Table 1: A sample temporal database [Image by Author]

Let {S1, S2, S3, S4, S5, S6} be a set of air pollution measuring sensors. Table 1 shows a hypothetical air pollution temporal database generated by these sensors. This database contains 12 transactions. The initial timestamp of this database, i.e., _tsinitial=0. The final timestamp of this database, i.e., _tsfinal=15. The first transaction in this database provides the information that the sensors S1, S2, and S5 have recorded hazardous levels of PM2.5. Other transactions in this database can be explained in a similar fashion. The set of sensors, S1 and S2, i.e., {S1, S2}, is called a pattern (or an itemset). In Table 1, this pattern occurs at the timestamps of 1, 3, 6, 8, 9, 12, and 14. Thus, the support (or frequency) of {S1, S2} is 7. If the user-specified minimum support, denoted as minSup, is 5, then {S1, S2} is said to be a frequent pattern. The periods of this pattern are: 1 (=1-ts_initial), 2 (=3–1), 3 (=6–3), 2 (=8–6), 1 (=9–8), 3 (=12–9), 2 (=14–12) and 1 (=ts_final-14). The maximum period of this pattern is considered as its periodicity. Thus, the periodicity of {S1, S2}, i.e., per({S1, S2})=max(1,2,3,2,1,3,2,1) = 3. It means the sensors S1 and S2 have simultaneously recorded hazardous levels of PM2.5 at least once every 3 hours. If the user-specified maximum periodicity (maxPer) is 3, then the frequent pattern {S1, S2} is said to be a periodic-frequent pattern. This pattern is expressed as follows:

{S1, S2} [support=7, periodicity=3]

How to find periodic-frequent patterns in temporal big data?

The space of items in an itemset lattice represents the search space of periodic-frequent pattern mining. Thus, the search space of periodic-frequent pattern mining is 2^n-1, where n represents the total number of items in a database. The mining algorithms reduce this huge search space by employing apriori/anti-monotonic/downward-closure property. This property says that "all non-empty subsets of a periodic-frequent pattern are also periodic-frequent patterns." This property makes pattern mining practicable on real-world large databases. The algorithms to discover periodic-frequent patterns are:

  • PFP-growth
  • PFP-growth++
  • PS-growth and
  • PFP-ECLAT.

What is PAMI? How to install it? How to implement a periodic-frequent pattern mining algorithm that is available in PAMI?

PAMI stands for PAttern MIning. It is a Python library containing several pattern mining algorithms. This library is supplied under the GPL V3 license. Users can install this library by simply executing the following command:

pip install pami

The good thing about the PAMI library is that items in the database can be integers or strings. For illustrative purposes, we use the air pollution database as a test case. Click here to download the database. Users can try with other datasets. However, please keep in mind that the temporal database must exist in the following format:

timestamp<sep>item1<sep>item2<sep>...<sep>itemN

The default separator used by the algorithms in PAMI is tab space (default). However, the users can employ other separators, such as comma and space. As mentioned above, PAMI contains several algorithms to find periodic-frequent patterns. However, we employ the PS-growth algorithm [4] in this tutorial for brevity.

The step-by-step implication of the PS-growth algorithm is as follows:

Step 1: Import the PS-growth algorithm from the PAMI library

from PAMI.periodicFrequentPattern.basic import PSGrowth as alg

Step 2: Initialize and execute the mining algorithm by providing the data, minSup, and maxPer parameters. Please note that data can be an HTTP link, a file, or a data frame. Click here to download the dataset.

URL="https://www.u-aizu.ac.jp/~udayrage/datasets/temporalDatabases/temporal_pol_pm2_16.csv"
obj = alg.PSGrowth(iFile=URL,minSup=100,maxPer=24)   #initialization
obj.startMine()                                      #execution 

Step 3: Store the patterns in a file

obj.savePatterns('patterns.txt')

Step 4: Printing the patterns

df = obj.getPatternsAsDataFrame()
df

Step 5: Calculating number of patterns, memory, and runtime

print(len(df))
print(obj.getMemoryRSS())
print(obj.getRuntime())

What are the merits and demerits of the Periodic-Frequent Pattern Mining Algorithm?

Merits:

  • Downward closure property facilitates the mining algorithms to be deployable on large-scale databases
  • Few user-specified (or hyper) parameters

Demerits:

  • Finds only full/perfectly occurring periodic patterns. Thus, missing partial periodically occurring periodic patterns.
  • Too many patterns may be generated from the database.

Conclusions

In this blog, we have introduced the model of periodic-frequent pattern that may exist in a temporal database. The importance of these patterns has also been demonstrated with an example. Finally, we described how our PAMI Python library can be utilized to discover hidden patterns in air pollution data.

References:

[1] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Young-Koo Lee: Discovering Periodic-Frequent Patterns in Transactional Databases. PAKDD 2009: 242–253

[2] Statista: https://www.statista.com/statistics/935022/number-deaths-air-pollution-japan/#:~:text=In%202019%2C%20the%20number%20of,attributable%20to%20air%20pollution%20exposure.

[3] SORAMAME: https://soramame.env.go.jp/ (accessed on 14-Nov-2021)

[4] R. Uday Kiran, Alampally Anirudh, Chennupati Saideep, Masashi Toyoda, P. Krishna Reddy, Masaru Kitsuregawa: Finding Periodic-Frequent Patterns in Temporal Databases using Periodic Summaries. Data Science and Pattern Recognition, Vol. 3, №2, pp. 24–46 (2019)


Related Articles