Pre-data collection

Before collecting and analyzing text data, because it is impossible to analyze all Korean words, I analyzed the words which people feel confused most because of similar or same pronounce. Orthographic confusion indicates that the language usage of people is being different with existing Korean orthography. In addition, it is not difficult to collect the examples because there are a lot of information sources which provide the confusing orthography. For pre-data collection, two of twitter accounts (https://twitter.com/grammar_korea, https://twitter.com/korgrammar_bot ) and one of newspaper (http://www.newsen.com/news_view.php?uid=201412311410464710) are investigated. There are a few restrictions to select pre-data to make this project more practical. First of all, the words break evidently the rules of Korean spelling are excepted because if they are considered to be modified, the other words follow the rules are affected even they do not get through any confusion to write. Secondly, the words from Chinese words are excluded, too. For example, ‘폭발’ means explosion in Korean and the word is from Chinese, but it is written ‘폭팔’ frequently because the pronounce is more natural when they write as ‘폭팔’. However, there is a settled pronounce for Chinese character even it is used in Korean, so changing the orthography is impossible. All of the word pairs are checked whether it is right by checking spelling check program provided by National academy of the Korean language because the Korean language standardization has been updated. (http://164.125.7.61/speller/lib/check.asp) As a result, there are 48 pairs of a wrong word and a right word. Right word is the word which follows existing Korean orthography and Wrong word is the word which is used frequently by people even does not follow existing Korean orthography



Data collection, analysis and visualizing


Fig. 3. Process of data

is a schematized process of data for this project. First of all, text data from Twitter are collected by using the pre-data as keyword and by using Python. Python is one of the programming language can be used to this process. The process of collecting data from web is called ‘crawling’. The collected data are saved on PostgreSQL because they are somewhat big data. After then, the collected Korean texts are analyzed as morpheme by using JAVA because there are many variations of basic form in Korean. To be specific, Komoran2.0 (http://shineware.tistory.com/53), java open source for analyzing Korean text morpheme because its performance is better than any Korean morpheme analyzers’ performances. Morpheme is the smallest grammatical unit in a language in a language. In other words, it is the smallest meaningful unit of a language. The followed represents an example of analyzed written Korean.


Fig. 4. Example of analyzed Hangul sentence


Finally, the Analyzed data are presented in time series to figure out the changing pattern of the words’ used frequency as a time goes. For this, Google motion chart (https://developers.google.com/chart/interactive/docs/gallery/motionchart?hl=en) is used.

Evaluation

To evaluate the usefulness of the project, the evaluation on existing changes of the rule will be conducted. In 2011, there was an extensive modification on the rule – 39 words were changed. Among them, the most widely known change was ‘짜장면’. Before 2011, the word ‘자장면’ was standard word in Korean orthography, but people had used the word ‘짜장면’ more than ‘자장면’. As a result, ‘짜장면’ was modified to standard word from wrong word. This example will be used to make a criteria for modification of wrong word. In other word, the usage pattern of ‘짜장면’ will be analyzed by applying referred method (3.2) and the frequency of ‘자장면’ and ‘짜장면’ will be used to calculate the value which will be the criteria for pre-collected pre-data. In addition, to figure out how the prediction is like to the real usage of the words, the project has been published on the web. There is a little survey section where the people who visit the web-site (http://ybigdata.info/juyoung) can cast a vote to the real word they use. It can be used to evaluate the project because the goal of the project is to figure out real usage of the words in reality.

go to survey