Pipelining Machine Learning Libraries With PAMI: A Nice Approach to Publish Papers in Top Data Science Conferences

Uday Kiran RAGE
Towards Data Science
6 min read · Mar 10, 2022


This article provides a useful tip for students and researchers who want to publish papers in top data science conferences. The article is organized as follows:

  • About the author
  • “Why are my papers getting rejected in top conferences?” — A common student’s question
  • Strengthening one’s own research work by combining ML libraries + PAMI
  • A brief introduction to PAMI
  • A walkthrough example of rainfall analytics by combining TensorFlow and PAMI

About the author:

I am Prof. RAGE Uday Kiran, working as an associate professor at the University of Aizu, Japan. I am also a visiting researcher at the Kitsuregawa Lab, University of Tokyo, and NICT, Japan.

Research interests: Big Data Analytics, Intelligent Transportation Systems, Air Pollution Analytics, Recommendation Systems, and ICTs for Agriculture and Environment.

Research life:

  • Several publications in top computer science conferences, such as IEEE-FUZZY, IEEE-BIGDATA, EDBT, CIKM, PAKDD, SSDBM, ICONIP, DSAA, DASFAA, and DEXA.
  • Reviewer for IEEE-FUZZ, PAKDD, DEXA, DASFAA, ICONIP, and DSAA.
  • Publicity co-chair for PAKDD 2021 and ICDM 2022
  • Publication co-chair for DASFAA 2022.
  • Book: Periodic pattern mining (Springer)

More details regarding my publications can be found at [1].

“Why are my papers getting rejected in top conferences?” — A common student’s question

Many (data science) students explain their work as follows:

“I am taking the real-world data, pushing it into a Machine Learning library, tuning the hyper-parameters, and getting better results than the state-of-the-art (see Figure 1).”

After the explanation, the common question they raise is as follows:

Why do my papers keep getting rejected by top conferences even though my results are good?

My response is very simple: “your approach lacks novelty.” Simply pushing real-world data into an ML library is no longer treated as a core computer science research problem; it is commonly treated as an applied research problem. In fact, a large portion of papers submitted to top conferences do exactly this, and we often reject them due to lack of novelty. (Some readers may argue that tuning hyper-parameters for a real-world dataset is a non-trivial task. Yes, we accept this claim. However, most of us accept it from an applied computing perspective rather than a core computer science perspective, because there are more real-world datasets than stars in the sky.)

Strengthening one’s own research work by combining ML libraries + PAMI

Upon hearing my answer, some students ask:

“Is there a way to strengthen my work and improve its chance of acceptance?”

My answer is “Yes.” It is possible by going beyond clustering, classification, and prediction. Figure 2 shows an approach to strengthen conventional machine learning research by applying pattern mining techniques: apply pattern mining algorithms to extract hidden patterns from the past and predicted data, then perform various analytics on these patterns to understand concept drift, emerging trends, diminishing trends, and so on.
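
To make the idea concrete, here is a rough sketch under a few assumptions: the past data and the ML model’s predictions have already been converted into transactional databases (as done later in this article), and the file names, the minimum-support value, and the 'Patterns' column name are illustrative placeholders rather than fixed by PAMI.

# A rough sketch of the approach in Figure 2, assuming the past and predicted
# data are already available as transactional databases (file names are placeholders)
from PAMI.frequentPattern.basic import FPGrowth as alg

def minePatterns(databaseFile, minSup):
    # Mine frequent patterns from a transactional database with PAMI's FPGrowth
    obj = alg.FPGrowth(databaseFile, minSup, sep=',')
    obj.startMine()
    return obj.getPatternsAsDataFrame()

minSup = 100  # example minimum-support threshold
pastPatterns = minePatterns('pastDatabase.txt', minSup)
predictedPatterns = minePatterns('predictedDatabase.txt', minSup)

# Pattern-level analytics: patterns appearing only in the predicted data hint at
# new trends, while patterns that disappear hint at diminishing trends
# (assuming the resulting DataFrame has a 'Patterns' column)
newTrends = set(predictedPatterns['Patterns']) - set(pastPatterns['Patterns'])
diminishedTrends = set(pastPatterns['Patterns']) - set(predictedPatterns['Patterns'])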

The rest of this article aims to guide students in strengthening their research work by combining ML libraries with PAMI.

What is PAMI?

PAMI stands for PAttern MIning. PAMI is a Python library, which contains several pattern mining algorithms to discover knowledge hidden in voluminous datasets. Currently, PAMI has 50+ algorithms to find different types of user interest-based patterns. More information on the installation and usage of PAMI can be found in our previous article [2].
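
If you want to reproduce the walkthrough below, PAMI can be installed from PyPI; see [2] for detailed installation and usage instructions.

# Install the PAMI library from PyPI
pip install pami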

A walkthrough example of rainfall analytics by combining TensorFlow and PAMI

  • Download the rainfall data [3] and pixel data [4]
  • Import the necessary libraries
# Import the necessary libraries
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg as AR
from PAMI.extras.DF2DB import DF2DB as df2db
import PAMI.extras.dbStats.transactionalDatabaseStats as tds
import PAMI.extras.graph.plotLineGraphFromDictionary as plt
import databasePrune as dbPrune
  • Implement the auto-regression on the rainfall data
# Reading the input CSV file containing 20 days of global rainfall data
dataframe = pd.read_csv("inputfile.csv")
predictions = {}
for i in dataframe:
    # Training a model for each pixel/column using auto-regression
    column_values = dataframe[i].values
    model = AR(column_values, 1)
    model_fit = model.fit()
    # The forecast start and end periods are passed to the prediction model. Here we
    # learn from 20 days of rainfall data and predict the next 300 (= 320 - 20) days.
    predictions[i] = model_fit.predict(start=20, end=320)
output_df = pd.DataFrame(predictions)
output_df.insert(0, 'tid', output_df.index)
  • Transform the predicted data into a transactional database
# Create a transactional database by considering only those pixels/points whose
# predicted rainfall value is greater than or equal to 500
df2db_conversion = df2db.DF2DB(output_df, 500, '>=', 'dense', 'transactionalDatabase.txt')
  • Understanding the statistical details of the database.
# Statistics of the transactional database (override the default tab separator)
obj = tds.transactionalDatabaseStats("transactionalDatabase.txt", sep=',')
obj.run()
print(f'Database size : {obj.getDatabaseSize()}')
print(f'Total number of items : {obj.getTotalNumberOfItems()}')
print(f'Database sparsity : {obj.getSparsity()}')
print(f'Minimum Transaction Size : {obj.getMinimumTransactionLength()}')
print(f'Average Transaction Size : {obj.getAverageTransactionLength()}')
print(f'Maximum Transaction Size : {obj.getMaximumTransactionLength()}')
print(f'Standard Deviation Transaction Size : {obj.getStandardDeviationTransactionLength()}')
print(f'Variance in Transaction Sizes : {obj.getVarianceTransactionLength()}')
itemFrequencies = obj.getSortedListOfItemFrequencies()
transactionLength = obj.getTransanctionalLengthDistribution()
# Storing the statistical details in files
obj.storeInFile(itemFrequencies, 'itemFrequency.csv')
obj.storeInFile(transactionLength, 'transactionSize.csv')
# Visualizing the distributions of item frequencies and transaction lengths
plt.plotLineGraphFromDictionary(obj.getSortedListOfItemFrequencies(), 50, 'item frequencies', 'item rank', 'frequency')
plt.plotLineGraphFromDictionary(obj.getTransanctionalLengthDistribution(), 100, 'distribution of transactions', 'transaction length', 'frequency')
  • As per Zipf’s law, highly frequent items carry very little information. Thus, we prune the highly frequent items, i.e., items whose frequency is greater than or equal to 275, from the transactional database and create a new transactional database. The code for pruning highly frequent items from a database is provided below. Please save this code in a file named databasePrune.py.
class databasePrune:
    """Removes items whose frequency is greater than or equal to a threshold."""

    def __init__(self, inputFile, outputFile, threshold, sep='\t'):
        self.inputFile = inputFile
        self.outputFile = outputFile
        self.threshold = threshold
        self.sep = sep
        self.database = {}

    def run(self):
        self.readDatabase()

    def readDatabase(self):
        # Read the transactional database: one transaction (a list of items) per line
        numberOfTransaction = 0
        with open(self.inputFile, 'r', encoding='utf-8') as f:
            for line in f:
                numberOfTransaction += 1
                temp = [i.strip() for i in line.strip().split(self.sep)]
                temp = [x for x in temp if x]
                self.database[numberOfTransaction] = temp

    def getItemFrequencies(self):
        # Count the frequency of every item and return the highly frequent ones
        self.itemFrequencies = {}
        for tid in self.database:
            for item in self.database[tid]:
                self.itemFrequencies[item] = self.itemFrequencies.get(item, 0) + 1
        myDict = {key: val for key, val in self.itemFrequencies.items() if val >= self.threshold}
        return myDict.keys()

    def dbpruning(self, itemFrequencies):
        # Remove the highly frequent items from every transaction and store the result
        self.dict = {}
        with open(self.outputFile, 'w') as f:
            for tid in self.database:
                set_diff = set(self.database[tid]) - set(itemFrequencies)
                self.dict[tid] = list(map(int, set_diff))
                if self.dict[tid]:
                    f.write(','.join(map(str, self.dict[tid])) + '\n')
        return self.dict


if __name__ == '__main__':
    obj = databasePrune('transactionalDatabase.txt', 'newTransactionalDatabase.txt', 275, sep=',')
    obj.run()
    itemFrequencies = obj.getItemFrequencies()
    items = obj.dbpruning(itemFrequencies)
  • Apply frequent pattern mining on the newly generated transactional database
from PAMI.frequentPattern.basic import FPGrowth as alg
obj = alg.FPGrowth('newTransactionalDatabase.txt', 273, sep=',')
obj.startMine()
obj.savePatterns('patterns.txt')
df = obj.getPatternsAsDataFrame()
# Print the runtime and memory consumed by the mining process
print('Runtime: ' + str(obj.getRuntime()))
print('Memory: ' + str(obj.getMemoryRSS()))
  • Mapping the items in a transactional database to pixels/points
import re
import pandas as pd

# Reading the location of every pixel/point from the file
df = pd.read_csv('point_location.txt', header=None, sep='\t', usecols=[0])
with open('pattern_loc.txt', 'w') as w:
    with open('patterns.txt') as f:
        # Mapping every item in a pattern to its respective location point
        for line in f:
            freq = line.split(":")[-1]
            values = list(line.split())[:-1]
            location_values = []
            for i in values:
                a = "POINT" + "(" + str(df.iloc[int(i)].values[0]) + ")"
                location_values.append(a)
            result = re.sub(r"[',]", "", str(location_values).replace('[', '').replace(']', ''))
            w.write(result + " :" + freq)
  • Visualizing the generated patterns
import PAMI.extras.plotPointOnMap as plt
obj = plt.plotPointOnMap('pattern_loc.txt')
mmap = obj.plotPointInMap()
mmap

Conclusions

A fundamental problem faced by many students and researchers working in data science is “How do I publish papers in top conferences?” This article addresses the problem by suggesting an approach: transform the output (or predictions) derived from existing machine learning libraries into a database and apply pattern mining algorithms to discover the hidden patterns within it. Finally, we illustrated the approach with a toy example using real-world global rainfall data.

Acknowledgments

I thank my graduate students, Ms. E. Raashika and Ms. P. Likitha, for their support in preparing the necessary code and data for the toy experiment.

References

  1. DBLP https://dblp.org/pid/11/1466.html
  2. Introduction to PAMI https://towardsdatascience.com/hello-i-am-pami-937439c7984d
  3. Sample rainfall data, https://www.u-aizu.ac.jp/~udayrage/datasets/medium/rainfall/inputfile.csv
  4. Point information for each item in the transactional database, https://www.u-aizu.ac.jp/~udayrage/datasets/medium/rainfall/point_location.txt
