Kuzushiji-MNIST - Japanese Literature Alternative Dataset for Deep Learning Tasks

Plus our VGG-ResNet ensemble model with state-of-the-art results

Rani Horev
Towards Data Science

--

MNIST, a dataset of 70,000 labeled images of handwritten digits, has been one of the most popular benchmarks for image processing and classification for over twenty years. Despite its popularity, contemporary deep learning algorithms handle it easily, often surpassing 99.5% accuracy. A new paper introduces Kuzushiji-MNIST, a drop-in alternative that is more difficult than MNIST, along with two larger, more challenging Kuzushiji datasets.

Interestingly, the Kuzushiji datasets may also help revive an almost-lost script: cursive Kuzushiji. Kuzushiji is a cursive Japanese writing style that was used for over 1,000 years without common standards, sometimes with dozens of styles and formats for the same word. In the 19th century, Japan reformed and standardized its official language and writing system, and over time Kuzushiji fell out of use, leaving millions of documents of Japanese culture and history inaccessible to most people.

The Dataset

Source: Clanuwat et al

The Japanese language can be divided into two types of systems:

  • Logographic systems, where each character represents a word or a phrase (with thousands of characters). A prominent logographic system is Kanji, which is based on the Chinese writing system.
  • Syllabary systems, where words are constructed from syllables (similar to an alphabet). A prominent syllabary system is Hiragana, with 49 characters (hence Kuzushiji-49), each of which had several cursive representations prior to the standardization of the writing system.

The Kuzushiji dataset includes characters in both Kanji and Hiragana, based on pre-processed images of characters from 35 books from the 18th century. It includes 3 parts:

  • Kuzushiji-MNIST, with 10 classes of Hiragana characters in the original MNIST format of 28x28 grayscale images.
  • Kuzushiji-49, with 49 classes of Hiragana characters.
  • Kuzushiji-Kanji, with 3,832 classes of Kanji characters.

Kuzushiji-MNIST is similar in format to the original MNIST but is a harder classification task because of the multiple variations of each character. The other two datasets include more classes, reflect the frequency with which characters appear in the books, and are imbalanced: some classes include only a few samples.

Source: Clanuwat et al

The dataset can be downloaded from here. The authors plan to expand the datasets and add more samples.
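The official release distributes each split as NumPy `.npz` archives (one for images, one for labels). A minimal loading sketch is below; the `"arr_0"` key and the normalization step follow the common convention for these archives, and the file paths are placeholders you would point at the downloaded files. The demonstration uses a small synthetic stand-in so the snippet runs without the real data.

```python
import numpy as np
import os
import tempfile

def load_kmnist(imgs_path, labels_path):
    """Load a Kuzushiji-MNIST split from .npz archives into arrays.

    Assumes the arrays are stored under the default "arr_0" key,
    as in the official release.
    """
    imgs = np.load(imgs_path)["arr_0"]      # shape: (N, 28, 28), uint8
    labels = np.load(labels_path)["arr_0"]  # shape: (N,), class ids
    # Scale pixel values from [0, 255] to [0, 1] for training.
    return imgs.astype(np.float32) / 255.0, labels

# Demonstration with a synthetic stand-in for the downloaded files.
tmp = tempfile.mkdtemp()
np.savez(os.path.join(tmp, "imgs.npz"),
         np.random.randint(0, 256, (5, 28, 28), dtype=np.uint8))
np.savez(os.path.join(tmp, "labels.npz"),
         np.random.randint(0, 10, 5).astype(np.uint8))
x, y = load_kmnist(os.path.join(tmp, "imgs.npz"),
                   os.path.join(tmp, "labels.npz"))
print(x.shape, y.shape)  # (5, 28, 28) (5,)
```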

Results

The dataset's authors established a baseline by training several classification algorithms and comparing their results to MNIST. The best algorithm (PreActResNet-18) achieved 99.56% on MNIST, but only 98.83% and 97.33% on Kuzushiji-MNIST and Kuzushiji-49 respectively.

Additional Analysis

To evaluate Kuzushiji-MNIST, we compared several architectures: VGG, ResNet-18, Capsule Networks, and ensembles of these architectures. The best result was achieved with an ensemble of VGG and ResNet: 98.9% accuracy on the test set, a state-of-the-art result on the new dataset.
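The article does not spell out how the ensemble combines the two networks; one common choice is to average their per-class softmax outputs and take the argmax, which the sketch below assumes. The probability matrices here are hypothetical stand-ins for what a trained VGG and ResNet would produce on three test images over 10 classes.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Combine several models by averaging their per-class probability
    matrices (shape: n_samples x n_classes) and taking the argmax.
    Averaging is an assumption; the article does not state the rule."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

# Hypothetical softmax outputs for 3 samples over 10 classes.
vgg_probs = np.array([[0.7] + [0.3 / 9] * 9,
                      [0.1, 0.6] + [0.3 / 8] * 8,
                      [0.4, 0.5] + [0.1 / 8] * 8])
resnet_probs = np.array([[0.8] + [0.2 / 9] * 9,
                         [0.2, 0.5] + [0.3 / 8] * 8,
                         [0.6, 0.3] + [0.1 / 8] * 8])
print(ensemble_predict([vgg_probs, resnet_probs]))  # [0 1 0]
```

Note that on the third sample the two models disagree (VGG slightly prefers class 1, ResNet prefers class 0), and the averaged probabilities settle the tie; this is exactly where ensembling tends to help.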

The code can be found here.

Conclusion

The Kuzushiji datasets can serve as a new benchmark for classification algorithms, presenting more of a challenge than the traditional MNIST. As shown, architectures such as ResNet and VGG can achieve excellent results on Kuzushiji-MNIST, but there remains room for improvement on the other datasets. Finally, work on the Kuzushiji datasets may also help restore access to millions of books written in this lost script.
