Introduction to kNN algorithm by experiment on Khmer Handwriting classification using Java 8

Engleang Sam
Towards Data Science
3 min readJul 20, 2019

--

Machine learning is the popular field currently and being discussed everywhere. There are many implementations on it using some functional or statistical programming languages such as Python or R. However Java , the popular OOP programming language, still can also be implemented for almost all of machine learning algorithms. Being a Java developer , I would like to introduce kNN ( K Nearest Neighbors ) algorithm since it’s easy to implement and can be understood without requiring a lot of mathematic or statistic backgrounds.

1. Solution to be applied

Handwriting can be one of the problems that solved by kNN algorithm and I will use Khmer script characters as the sample problem for solving. Khmer is one of the Southeast Asia languages which spoken by Cambodian people. It contains a lot of consonants , dependent or independent vowels as well as numerals. Below image is the numeral scripts and we will write some codes to determine whether it’s look like the expected number of not or in other words , to classify it.

Khmer numerals
Khmer numerals : source from Wikipedia

2. Data preparation

In order to conduct test on it , it’s required to convert the handwriting image to be text with contain bit ( 0, 1) representing the pixel being drawn on the image. Each machine learning has to have training data set as well as test data set for experiment. The process of converting image to be such text would be mentioned in another article.

Text represents Khmer handwriting numeral script in bit(0,1)

3. Flowchart of kNN

To have precise idea what we are going to write let see below flowchart:

kNN flowchart
kNN flowchart by https://draw.io

4. Java code implementation using Java 8

Firstly DataLoader class need to be generated in order to load the text file to memory. Map is being used since grouping by folder is preferred.

Data loader class

DistanceCalculator is for computing distance using Euclidean formula:

Some classes also required to generate for holding some data and value as below :

KnnClassifer is the primary logic for this demonstration . Following the above flowchart below is the Java code:

Finally we create the main class to combine all the logics into one application.

If nothing wrong running the Application class should display below result as in picture. The error rate is 0.34 , and the time is 0.572 second ( can be different on your PC depend on hardware spec ). We can modify the K_CONSTANT, NUMBER_OF_THREAD or modify some parallelStream part to normal stream to see the different time of classification.

kNN classification result

5. Conclusion

kNN can be experimented using Java and optimized by multi Threading , Parallel stream and many other Java 8 features. The applications of kNN can be found on may fields including recommender system , handwriting , and other classifications . Uber prediction system is one of the real world example of applied kNN algorithm system. However, it required much computation ,scaled system and well prepared for training set to avoid performance issue. Full code and data set of this demonstration can be found in https://github.com/engleangs/knnclassifier.

I hope this basic introduction can be useful for you. Let me know if any feedback or suggestion , Cheer !!.

--

--

Sr. Digital Business Development @MJQE , Like to keep up with technologies such as algorithms , backend development , solution architecture.