KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser Sangwon Park January 12, 2011.


1 KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 12, 2011

2 Research Goal
The goal of this research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis. KKAP is intended to be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, and other components.

3 Contents
1. Introduction to Korean Morphological Analysis
2. HanNanum Korean Morphological Analyzer & POS Tagger
3. Extension to KKAP (KAIST Korean Analysis Platform)

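4 Features of Korean morphological analysis
A single eojeol can show both ambiguity of morpheme segmentation and ambiguity of part-of-speech. Example: 가시는
–가시/noun + 는/josa (thorn, prickle)
–가시/verb + 는/eomi (leave, disappear)
–가/verb + 시/eomi + 는/eomi (go)
–갈/verb + 시/eomi + 는/eomi (grind, sharpen)
Example sentences, one per reading:
–그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.)
–물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
–할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
–아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening a knife is impressive.)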
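The four readings of 가시는 can be written down as plain data to make the two kinds of ambiguity visible. The following is a minimal Python sketch, not HanNanum's actual data structure: the `ANALYSES` table and both helper functions are hypothetical names introduced here, and the coarse tags (noun, josa, verb, eomi) are the ones used on the slide.

```python
from collections import defaultdict

# Hypothetical toy lexicon: the four analyses of 가시는 listed on the slide.
# Each analysis is a list of (morpheme, coarse tag) pairs.
ANALYSES = {
    "가시는": [
        [("가시", "noun"), ("는", "josa")],               # thorn, prickle
        [("가시", "verb"), ("는", "eomi")],               # leave, disappear
        [("가", "verb"), ("시", "eomi"), ("는", "eomi")],  # go
        [("갈", "verb"), ("시", "eomi"), ("는", "eomi")],  # grind, sharpen
    ]
}

def candidates(eojeol):
    """Return every stored analysis of an eojeol (empty list if unknown)."""
    return ANALYSES.get(eojeol, [])

def group_by_segmentation(eojeol):
    """Map each morpheme sequence to the tag sequences it can carry."""
    groups = defaultdict(list)
    for analysis in candidates(eojeol):
        morphemes = tuple(m for m, _ in analysis)
        tags = tuple(t for _, t in analysis)
        groups[morphemes].append(tags)
    return dict(groups)

groups = group_by_segmentation("가시는")
# Several keys in `groups` means segmentation ambiguity;
# several tag sequences under one key means part-of-speech ambiguity.
for morphemes, tag_sequences in groups.items():
    print(morphemes, tag_sequences)
```

Grouping by morpheme sequence separates the two problems: the analyzer has to propose every key, and the tagger has to choose among all tag sequences across all keys.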

5 HanNanum Korean Morphological Analyzer
–Under development since the 1990s
–Written in the C programming language
–Module-based architecture
–Based on the KAIST morphologically analyzed corpus
–HMM-based and Maximum Entropy-based POS taggers

6 HanNanum Architecture
[Architecture diagram. Labels: INPUT, Sentence Divisor, Code Conversion, Morphological Analyzer, Dictionary Search, Connection Check, Phoneme Restoration, Tag Mapper, Tagger, Computation, OUTPUT; Tag Set, Tag Set Table, Connection Info. Table, System Dictionary, User Dictionary, Number Dictionary, Frequency Dictionary, Bigram Info.; (Trie), Segment Position, Inverse Segment Position, Morpheme Chart (lattice form)]

7 HMM-based POS Tagger
Probabilities in the model:
–Transition probability between word-phrase tags
–Transition probability between morpheme tags within a word phrase
–Probability of occurrence of a morpheme and its POS
Reference: Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 389-394, 1994.
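To illustrate how such probabilities can select one analysis per eojeol, here is a simplified Viterbi sketch in Python. It is not the formulation of the cited paper, which the slide does not spell out. As an assumption, the link between neighbouring eojeols is approximated by the transition from the last morpheme tag of one eojeol to the first morpheme tag of the next, rather than by separate word-phrase tags. All probabilities, the tag names `josa_gen` and `josa_obj`, and the function names are invented for the example.

```python
import math

FLOOR = 1e-6  # assumed probability for unseen events

def logp(table, key):
    """Log probability with a small floor for missing entries."""
    return math.log(table.get(key, FLOOR))

def score_inside(analysis, trans, emit):
    """Emissions plus morpheme-tag transitions inside one eojeol."""
    score = 0.0
    for i, (morph, tag) in enumerate(analysis):
        score += logp(emit, (morph, tag))
        if i > 0:
            score += logp(trans, (analysis[i - 1][1], tag))
    return score

def viterbi(lattice, trans, emit):
    """Choose one candidate analysis per eojeol with the best total score."""
    # best[j] = (score, path) for paths ending in candidate j
    best = [(score_inside(c, trans, emit), [c]) for c in lattice[0]]
    for cands in lattice[1:]:
        new_best = []
        for c in cands:
            inside = score_inside(c, trans, emit)
            options = [
                (s + logp(trans, (path[-1][-1][1], c[0][1])) + inside, path + [c])
                for s, path in best
            ]
            new_best.append(max(options, key=lambda o: o[0]))
        best = new_best
    return max(best, key=lambda o: o[0])[1]

# Invented toy parameters, for illustration only.
TRANS = {
    ("noun", "josa_gen"): 0.3, ("noun", "josa_obj"): 0.3, ("noun", "josa"): 0.5,
    ("josa_gen", "noun"): 0.8, ("josa_gen", "verb"): 0.05,
    ("josa_obj", "noun"): 0.01, ("josa_obj", "verb"): 0.8,
    ("verb", "eomi"): 0.6, ("eomi", "eomi"): 0.3,
}
EMIT = {
    ("선인장", "noun"): 0.01, ("의", "josa_gen"): 0.9,
    ("칼", "noun"): 0.01, ("을", "josa_obj"): 0.9,
    ("가시", "noun"): 0.01, ("는", "josa"): 0.3,
    ("갈", "verb"): 0.01, ("시", "eomi"): 0.2, ("는", "eomi"): 0.2,
}
NOUN_READING = [("가시", "noun"), ("는", "josa")]
VERB_READING = [("갈", "verb"), ("시", "eomi"), ("는", "eomi")]

cactus = [[[("선인장", "noun"), ("의", "josa_gen")]], [NOUN_READING, VERB_READING]]
knife = [[[("칼", "noun"), ("을", "josa_obj")]], [NOUN_READING, VERB_READING]]
print(viterbi(cactus, TRANS, EMIT)[1])
print(viterbi(knife, TRANS, EMIT)[1])
```

With these toy numbers the preceding particle decides the reading of 가시는: the noun reading after 선인장의 and the verb reading 갈 + 시 + 는 after 칼을. A real tagger estimates the tables from a tagged corpus.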

8 Analysis Example
- POS-tagged Dictionary
- Check Connection rule
- Phoneme Restoration
- HMM-based Tagger: find the most suitable result among the candidates

9 Plug-In Component-based System
–Each function of Korean morphological analysis is implemented as a plug-in.
–A user can set up a workflow from existing plug-ins to suit their own goal.
[Diagram. Plug-In Pool: Corpus-base Morph Analyzer, Chart-base Morph Analyzer, HMM POS Tagger, CRF POS Tagger, Sentence Splitter, Auto Spacing, Input Filter, Unknown Noun Proc., Noun Extractor, Tag Mapper, Transliteration, … Workflow slots: Phase1 Supplement Plugin; Phase2 Morphological Analyzer and Phase2 Supplement Plugin; Phase3 POS Tagger and Phase3 Supplement Plugin]
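The plug-in idea can be sketched in a few lines of Python. This is a hypothetical illustration of a three-phase workflow assembled from interchangeable plug-ins, not the jhannanum API: the `Workflow` class, its methods, and the three stand-in plug-ins are all invented here, and the stand-ins do no real Korean analysis.

```python
from typing import Callable, List

Plugin = Callable[[object], object]

class Workflow:
    """Runs the plug-ins of phase 1, then phase 2, then phase 3, in order."""

    def __init__(self) -> None:
        self.phases: List[List[Plugin]] = [[], [], []]

    def append(self, phase: int, plugin: Plugin) -> "Workflow":
        # phase is 1-based, matching the slide's Phase1/Phase2/Phase3
        self.phases[phase - 1].append(plugin)
        return self

    def run(self, data):
        for phase in self.phases:
            for plugin in phase:
                data = plugin(data)
        return data

def split_sentences(text):
    # Phase 1 stand-in: split on periods and drop empty pieces
    return [s.strip() for s in text.split(".") if s.strip()]

def analyze(sentences):
    # Phase 2 stand-in: whitespace tokens instead of real morphemes
    return [s.split() for s in sentences]

def tag(sentences):
    # Phase 3 stand-in: attach a placeholder tag to every token
    return [[(tok, "unk") for tok in s] for s in sentences]

wf = Workflow().append(1, split_sentences).append(2, analyze).append(3, tag)
result = wf.run("a b. c")
print(result)
```

Because each plug-in only agrees on the data passed between phases, swapping the tagger or adding a supplement plug-in means changing one `append` call and nothing else.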

10 Flexible Workflow
- Analysis of Announcement on Web
[Diagram. Stages: Plain Text Processor, Morphological Analyzer, Morpheme Processor, POS Tagger. Plug-ins: Sentence Splitter, Informal Input Filter, Auto Spacing, Chart-based Morphological Analyzer, Unknown Processor, HMM-based POS Tagger]
Input: $$$$$ 장소 $$$$$ 서울코엑스 3층 $$$$$ (Place: Seoul COEX, 3rd floor)
Output:
$$$$$ $/su+$/su+$/su+$/su+$/su
장소 장소/ncn
$$$$$ $/su+$/su+$/su+$/su+$/su
서울 서울/nq
코엑스 코엑스/ncn
3층 3/nnc+층/nbu
- Indexing of News Articles
[Diagram. Stages: Plain Text Processor, Morphological Analyzer, Morpheme Processor. Plug-ins: Sentence Splitter, Chart-based Morphological Analyzer, Noun Extractor]
Input: 지난 9월 거제도에서 열린 축제 … (The festival held on Geoje Island last September …)
Output: 9월/n 거제도/ncn 축제/ncn
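The tagged output on this slide writes each morpheme as morpheme/tag and joins the morphemes of one eojeol with +. A small Python sketch for reading that notation follows. The function names are hypothetical, and treating tags that start with `nc` or `nq` as the nouns to index is an assumption made for this example, not a rule stated on the slide.

```python
def parse_eojeol(tagged):
    """Split 'm1/t1+m2/t2' into [(m1, t1), (m2, t2)].

    Assumes every piece contains '/' and that '+' occurs only as the
    separator; a literal '+' morpheme would need extra handling.
    """
    pairs = []
    for piece in tagged.split("+"):
        # rpartition splits on the last '/', so a morpheme may contain '/'
        morpheme, _, tag = piece.rpartition("/")
        pairs.append((morpheme, tag))
    return pairs

def extract_nouns(tagged_eojeols, noun_prefixes=("nc", "nq")):
    """Collect morphemes whose tag starts with one of the assumed noun prefixes."""
    nouns = []
    for tagged in tagged_eojeols:
        for morpheme, tag in parse_eojeol(tagged):
            if tag.startswith(noun_prefixes):
                nouns.append(morpheme)
    return nouns

announcement = ["장소/ncn", "서울/nq", "코엑스/ncn", "3/nnc+층/nbu"]
print(parse_eojeol("3/nnc+층/nbu"))
print(extract_nouns(announcement))
```

Passing a different `noun_prefixes` tuple changes which tags count as index terms without touching the parser.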

11 HanNanum Korean Morphological Analyzer
Workflow for Morphological Analysis: extract the part-of-speech information from Korean text
–Phase 1. Text Preprocessing
–Phase 2. Morphological Analysis
–Phase 3. POS Tagging
[Diagram. Each phase is filled with plug-ins from the pool: a major plug-in for Phases 2 and 3, plus supplement plug-ins. Plugin Pool (Phase 1, Phase 2 and Phase 3 plug-ins): Sentence Segmentation, Input Filter, Auto Spacing, Chart-base Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction, Tag Mapper]
Input (Korean document): 7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, due to be announced on the evening of the 7th. AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un, together with the Syrian poet Adonis, was the name most often mentioned as a likely winner of this year's prize. …)
Output: 7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca ….

12 Open Source Project
–http://kldp.net/projects/hannanum/
–jhannanum 0.8.2 was released on 2011.01.10

13 GUI Demo
[Screenshot. Labeled areas: Plug-in Pool, Workflow, Information of a plug-in, Workflow control, Input & Output]

14 KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis: from a Korean document to an analyzed Korean document
–Phase 1. Text Preprocessing
–Phase 2. Morphological Analysis
–Phase 3. POS Tagging
–Phase 4. Parsing
[Diagram. Each phase is filled with major and supplement plug-ins from the pool. Plugin Pool (Phase 1 to Phase 4 plug-ins): Sentence Segmentation, Input Filter, Auto Spacing, Chart-base Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, Noun Extraction, Tag Mapper, Chart Parser, Verb Phrase Extractor, Noun Phrase Extractor]
Input and POS-tagged sample: the same Korean news text and tagged output as on slide 11.
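The slide names a Chart Parser plug-in for Phase 4 but gives neither its algorithm nor its grammar. As a generic illustration of chart parsing over POS tags, here is a CKY recognizer in Python. CKY is a stand-in chosen for brevity, and the three grammar rules and coarse tags are invented for the example; none of this is taken from KKAP.

```python
def cky_recognize(tags, binary_rules, start="S"):
    """Return True if the tag sequence can be derived from `start`.

    binary_rules maps (left, right) symbol pairs to the set of parent symbols.
    chart[i][k] holds every symbol that spans positions i..k of the input.
    """
    n = len(tags)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):
        chart[i][i + 1] = {tag}
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):  # split point between the two children
                for left in chart[i][j]:
                    for right in chart[j][k]:
                        chart[i][k] |= binary_rules.get((left, right), set())
    return start in chart[0][n]

# Invented toy grammar over coarse tags.
RULES = {
    ("noun", "josa"): {"NP"},
    ("verb", "eomi"): {"VP"},
    ("NP", "VP"): {"S"},
}

print(cky_recognize(["noun", "josa", "verb", "eomi"], RULES))
print(cky_recognize(["verb", "eomi", "noun", "josa"], RULES))
```

The chart stores every partial constituent once, so shared sub-spans are never re-derived. A phrase extractor could read noun phrases and verb phrases directly from the cells that contain NP or VP.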

15 Korean Syntactic Tree Tagged Corpus
Registered at BoRA (Bank of Resource for Language and Annotation)
–http://bora.or.kr
–Corpus 5. Manual sentence analysis corpus
–31,091 sentences from 97 different sources
–Length: 1 to 33 eojeols, average 11.35 eojeols
Related documents
–Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean)
–Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421-429, 1997 (in Korean)

16 Questions & Comments
