Pre-data collection
Before collecting and analyzing text data, because it is impossible to analyze all Korean words, I analyzed the words which people feel confused most because of similar or same pronounce. Orthographic confusion indicates that the language usage of people is being different with existing Korean orthography. In addition, it is not difficult to collect the examples because there are a lot of information sources which provide the confusing orthography. For pre-data collection, two of twitter accounts (https://twitter.com/grammar_korea, https://twitter.com/korgrammar_bot ) and one of newspaper (http://www.newsen.com/news_view.php?uid=201412311410464710) are investigated. There are a few restrictions to select pre-data to make this project more practical. First of all, the words break evidently the rules of Korean spelling are excepted because if they are considered to be modified, the other words follow the rules are affected even they do not get through any confusion to write. Secondly, the words from Chinese words are excluded, too. For example, ‘폭발’ means explosion in Korean and the word is from Chinese, but it is written ‘폭팔’ frequently because the pronounce is more natural when they write as ‘폭팔’. However, there is a settled pronounce for Chinese character even it is used in Korean, so changing the orthography is impossible. All of the word pairs are checked whether it is right by checking spelling check program provided by National academy of the Korean language because the Korean language standardization has been updated. (http://164.125.7.61/speller/lib/check.asp) As a result, there are 48 pairs of a wrong word and a right word. Right word is the word which follows existing Korean orthography and Wrong word is the word which is used frequently by people even does not follow existing Korean orthography
Data collection, analysis and visualizing
Fig. 3. Process of data
Fig. 4. Example of analyzed Hangul sentence
Finally, the Analyzed data are presented in time series to figure out the changing pattern of the words’ used frequency as a time goes. For this, Google motion chart (https://developers.google.com/chart/interactive/docs/gallery/motionchart?hl=en) is used.
Evaluation
To evaluate the usefulness of the project, the evaluation on existing changes of the rule will be conducted. In 2011, there was an extensive modification on the rule – 39 words were changed. Among them, the most widely known change was ‘짜장면’. Before 2011, the word ‘자장면’ was standard word in Korean orthography, but people had used the word ‘짜장면’ more than ‘자장면’. As a result, ‘짜장면’ was modified to standard word from wrong word. This example will be used to make a criteria for modification of wrong word. In other word, the usage pattern of ‘짜장면’ will be analyzed by applying referred method (3.2) and the frequency of ‘자장면’ and ‘짜장면’ will be used to calculate the value which will be the criteria for pre-collected pre-data. In addition, to figure out how the prediction is like to the real usage of the words, the project has been published on the web. There is a little survey section where the people who visit the web-site (http://ybigdata.info/juyoung) can cast a vote to the real word they use. It can be used to evaluate the project because the goal of the project is to figure out real usage of the words in reality. go to survey