
How I Used Python Code to Improve My Korean

"Hello world!" / "안녕하세요 세계!"

Photo by Valery Rabchenyuk on Unsplash

Two important things to me are being efficient and being effective.

Early in 2020, I decided that I was going to teach myself to speak Korean. I was intrigued by the culture and drawn to the challenge of learning a language that wasn’t European. I didn’t want to fall victim to the usual trap of it feeling like a chore, so I started using a diverse range of resources including Duolingo, LingoDeer, Talktomeinkorean.com, flash cards, Korean dramas and – obviously – several thousand hours’ worth of BTS albums.

The Korean language, written in its own alphabet, Hangul, is very logical, interesting, and significantly different to English in many key ways. It took months to get to grips with topic markers, honorifics, and the different speech levels, but eventually I reached a point where I was familiar enough with the structure of the language to focus on picking up more vocabulary.

However – as many learners will agree – I have found that Google Translate isn’t always useful. Not only are translations often slightly off the mark, but because Korean sentences are structured subject-object-verb (rather than S-V-O), use particle markers, and deploy descriptors differently, a translated passage doesn’t easily show me which words are the verbs, nouns, and adjectives.

Enter 파이썬! (…that means ‘python’ in Korean)

I’ve been working on several different courses recently, all of which involve a lot of Python. I therefore thought, "there has to be a way that I can bring these two great loves together", and after a lot of research, the following plan was born:

The basic idea was that I wanted to be able to pass a passage of Korean text (e.g. a news article, recipe, song lyrics) through python and have it tell me clearly which parts are nouns, verbs etc. I would then translate each part and use a dataframe to print a vocab sheet. Spoiler alert: I did it.

The good news is that, as always, almost all of the hard work had already been done by some great developers. I decided to use the KoNLPy module and the Google Translate API to keep it relatively simple. [KoNLPy is a Python package for natural language processing of Korean, and it is seriously cool.]

Setting up KoNLPy

For all the details on setting up the module, a very clear walk-through can be found here. All I needed to do was install the Java Development Kit, a correctly versioned JPype wheel, and KoNLPy at the command line before importing the package into my Python environment. There are several classes available, but I’m using Okt, a wrapper for "Open Korean Text":

from konlpy.tag import Okt
from konlpy.utils import pprint

Testing the Module

I set my text to the sentence "I am eating cake", initialised the class as an object, and passed the tagged output to the pprint utility. I’ve included parameters that normalise the sentence and reduce each word to its root form:

text = '나는 케이크를 먹고있다'
okt = Okt()
pprint(okt.pos(text, norm=True, stem=True, join=True))

The output was exactly what I wanted, showing me the nouns "I" and "cake", the two particles (the topic and object markers) and the verb "eating" reduced to its stem, "to eat".

['나/Noun', '는/Josa', '케이크/Noun', '를/Josa', '먹다/Verb']
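Since the `join=True` output is just a list of "word/Tag" strings, it can be pulled apart with plain Python. A minimal sketch using the output shown above (so no KoNLPy call is needed):

```python
# Each tagged token from okt.pos(..., join=True) is a 'word/Tag' string.
# Splitting on the final '/' recovers the word and its part of speech.
tagged = ['나/Noun', '는/Josa', '케이크/Noun', '를/Josa', '먹다/Verb']
pairs = [token.rsplit('/', 1) for token in tagged]
nouns = [word for word, tag in pairs if tag == 'Noun']

print(pairs[0])  # ['나', 'Noun']
print(nouns)     # ['나', '케이크']
```

This word/tag split is the same idea the full script relies on when it builds its vocab columns.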

Now on to the fun bit.


Building the prototype

The final script makes use of just three libraries:

  • KoNLPy
  • Google Translate
  • Pandas

Putting it all together was fairly straightforward. This time I took a paragraph of text, and made a dataframe with the results after using KoNLPy as above:

from konlpy.tag import Okt
from googletrans import Translator
from pandas import DataFrame

text = '내가 뭐랬어 이길 거랬잖아 믿지 못했어 (정말) 이길 수 있을까 이 기적 아닌 기적을 우리가 만든 걸까 (No) 난 여기 있었고 니가 내게 다가와준 거야'
okt = Okt()

# use the okt.pos function and make a dataframe
ttrans = okt.pos(text, norm=True, stem=True, join=True)
koreanlist = DataFrame(ttrans, columns=['Korean'])

Next, I split the results into two columns (the word and its type), removed any duplicate rows or punctuation, and sorted the values:

# split each 'word/Type' string into two columns
koreanlist[['OriginalWord', 'Type']] = koreanlist['Korean'].str.rsplit('/', n=1, expand=True)

# remove punctuation and sort based on word type
koreanlist.drop(koreanlist[koreanlist.Type == "Punctuation"].index, inplace=True)
koreanlist.drop_duplicates(subset="OriginalWord", keep=False, inplace=True)
koreanlist = koreanlist.sort_values(by="Type")
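The clean-up steps can be sketched end-to-end on a small hand-made sample (the tagged tokens below are illustrative, not real KoNLPy output):

```python
import pandas as pd

# Illustrative 'word/Tag' tokens, including a duplicate word and punctuation
tagged = ['나/Noun', '는/Josa', '케이크/Noun', '를/Josa', '먹다/Verb',
          '!/Punctuation', '케이크/Noun']
df = pd.DataFrame(tagged, columns=['Korean'])

# split into word and type, then clean up
df[['OriginalWord', 'Type']] = df['Korean'].str.rsplit('/', n=1, expand=True)
df = df[df['Type'] != 'Punctuation']                        # drop punctuation rows
df = df.drop_duplicates(subset='OriginalWord', keep=False)  # keep=False removes every copy
df = df.sort_values(by='Type')

print(df[['OriginalWord', 'Type']])
```

Note that `keep=False` removes all copies of a repeated word ('케이크' disappears entirely here), which keeps the vocab sheet to words that appeared once.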

Then, I set up the translator and added a column of translated results to the dataframe. I dropped the original joined column and printed the dataframe as a markdown table:

# set up translate
translator = Translator()

# translate by adding a column
koreanlist['English'] = koreanlist['OriginalWord'].apply(translator.translate).apply(getattr, args=('text',))

# format
del koreanlist['Korean']
print(koreanlist.to_markdown())
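Because the live `googletrans` call needs a network connection, here is the same `apply` pattern sketched with a tiny hand-made glossary standing in for the translator (the words and translations are illustrative):

```python
import pandas as pd

# Stand-in for the live Google Translate call: a small hardcoded glossary.
# (The real script applies googletrans' Translator.translate to each word.)
glossary = {'나': 'I', '케이크': 'cake', '먹다': 'to eat'}

df = pd.DataFrame({'OriginalWord': ['나', '케이크', '먹다'],
                   'Type': ['Noun', 'Noun', 'Verb']})

# add the English column by applying a per-word lookup
df['English'] = df['OriginalWord'].apply(lambda w: glossary.get(w, w))

print(df)
```

Swapping the lambda for the real translator call gives the behaviour in the script above, one API request per word.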

Finally, I set two options: to either export to_csv as a .txt file, or to save to_html as .html:

# optional save as a tab-separated text file
koreanlist.to_csv('songtranslation.txt', sep='\t', index=False)

# optional save as html
koreanlist.to_html('koreanlist.html')

And there we have it. You can see here that the HTML file looks like a proper vocab chart, which I think is pretty effective!

HTML vocab chart made with python

If you made it this far – thanks very much! I’d love to know what you think. This is just a humble starting point with a lot of potential, but once K-pop twitter gets hold of it, I’m sure it’ll be a great success :-).

Check out KoNLPy: https://konlpy.org/en/latest/

Eunjeong L. Park and Sungzoon Cho, "KoNLPy: Korean natural language processing in Python", Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, Chuncheon, Korea, Oct 2014.

