KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 12, 2011


Research Goal
The goal of this research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis. KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, and other components.

Contents
1. Introduction to Korean Morphological Analysis
2. HanNanum Korean Morphological Analyzer & POS Tagger
3. Extension to KKAP (KAIST Korean Analysis Platform)

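Features of Korean morphological analysis
Two kinds of ambiguity arise: ambiguity of part-of-speech and ambiguity of morpheme segmentation.
The word phrase 가시는 has four possible analyses:
- 가시/noun + 는/josa (thorn, prickle)
- 가시/verb + 는/eomi (leave, disappear)
- 가/verb + 시/eomi + 는/eomi (go)
- 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
Example sentences:
- 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.)
- 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
- 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
- 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)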
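The two kinds of ambiguity can be reproduced with a small sketch. It is only an illustration, not HanNanum's algorithm: the lexicon `LEXICON`, the connection table `ALLOWED`, and the function `analyze` are hypothetical names, and listing 갈 under the surface form 가 is a crude stand-in for phoneme restoration.

```python
from typing import Dict, List, Tuple

# Toy lexicon (hypothetical): surface string -> possible (morpheme, tag) readings.
# "갈" is listed under the surface "가" as a stand-in for phoneme restoration.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Toy connection rules (hypothetical): tag pairs that may be adjacent.
ALLOWED = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}


def analyze(word: str) -> List[str]:
    """Enumerate every segmentation of `word` licensed by the toy tables."""
    results: List[str] = []

    def walk(pos: int, path: List[Tuple[str, str]]) -> None:
        if pos == len(word):
            results.append(" + ".join(f"{m}/{t}" for m, t in path))
            return
        # Try every substring starting at `pos` that is in the lexicon.
        for end in range(pos + 1, len(word) + 1):
            for morpheme, tag in LEXICON.get(word[pos:end], []):
                # Connection check against the previous morpheme's tag.
                if not path or (path[-1][1], tag) in ALLOWED:
                    walk(end, path + [(morpheme, tag)])

    walk(0, [])
    return results


candidates = analyze("가시는")
for candidate in candidates:
    print(candidate)
```

With these toy tables the search yields the four analyses listed above; a later step still has to choose among them using context.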

HanNanum Korean Morphological Analyzer
- HanNanum has been under development since the 1990s.
- Written in the C programming language
- Module-based architecture
- Based on the KAIST morphologically analyzed corpus
- HMM-based and Maximum Entropy-based POS taggers

HanNanum Architecture
[Figure: HanNanum architecture diagram, from INPUT to OUTPUT. Labels recovered from the diagram:]
- Processing modules: Sentence Divisor, Code Conversion, Morphological Analyzer (Dictionary Search, Connection Check, Phoneme Restoration), Tagger (Computation), Tag Mapper
- Resources: Tag Set, Tag Set Table, Connection Info. Table, System Dictionary, User Dictionary, Number Dictionary, Frequency Dictionary, Bigram Info., (Trie)
- Data structures: Segment Position, Inverse Segment Position, Morpheme Chart (lattice form)

HMM-based POS Tagger
The tagger uses three kinds of probability:
- Transition probability between word phrase tags
- Transition probability between morpheme tags within a word phrase
- Probability of occurrence of a morpheme and its POS
Reference: Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing.
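How the three probabilities can combine to pick one analysis per word phrase is sketched below. This is a simplified Viterbi search, not the model from the cited paper: the transition between word phrases is approximated by the last tag of one phrase and the first tag of the next, and every number in the tables is made up for illustration. The names `best_path`, `outer`, `inner`, and `emit` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Morph = Tuple[str, str]        # (morpheme, tag)
Candidate = List[Morph]        # one candidate analysis of one word phrase
Table = Dict[Tuple[str, str], float]


def logp(table: Table, key: Tuple[str, str], floor: float = 1e-6) -> float:
    """Log probability with a small floor for unseen events."""
    return math.log(table.get(key, floor))


def inner_score(cand: Candidate, inner: Table, emit: Table) -> float:
    """Morpheme/tag probabilities plus tag transitions inside one word phrase."""
    score = 0.0
    for i, (morpheme, tag) in enumerate(cand):
        score += logp(emit, (morpheme, tag))
        if i > 0:
            score += logp(inner, (cand[i - 1][1], tag))
    return score


def best_path(lattice: List[List[Candidate]], outer: Table,
              inner: Table, emit: Table) -> List[Candidate]:
    """Viterbi over a lattice holding several candidates per word phrase."""
    best = [(inner_score(c, inner, emit), [c]) for c in lattice[0]]
    for cands in lattice[1:]:
        new_best = []
        for c in cands:
            # Assumption: the word-phrase transition depends only on the last
            # tag of the previous phrase and the first tag of this one.
            score, path = max(
                ((s + logp(outer, (p[-1][-1][1], c[0][1])), p) for s, p in best),
                key=lambda item: item[0],
            )
            new_best.append((score + inner_score(c, inner, emit), path + [c]))
        best = new_best
    return max(best, key=lambda item: item[0])[1]


# Toy tables (all values invented for this sketch).
emit = {("갈증", "noun"): 0.01, ("이", "josa"): 0.2,
        ("가시", "noun"): 0.01, ("가시", "verb"): 0.02,
        ("가", "verb"): 0.02, ("갈", "verb"): 0.001,
        ("시", "eomi"): 0.05, ("는", "eomi"): 0.1, ("는", "josa"): 0.2}
inner = {("noun", "josa"): 0.6, ("verb", "eomi"): 0.9, ("eomi", "eomi"): 0.3}
outer = {("josa", "noun"): 0.5, ("josa", "verb"): 0.4}

lattice = [
    [[("갈증", "noun"), ("이", "josa")]],
    [[("가시", "noun"), ("는", "josa")],
     [("가시", "verb"), ("는", "eomi")],
     [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
     [("갈", "verb"), ("시", "eomi"), ("는", "eomi")]],
]
result = best_path(lattice, outer, inner, emit)
print(result)
```

With these invented numbers the search selects 가시/verb + 는/eomi for the second word phrase; real behavior depends entirely on probabilities estimated from a tagged corpus.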

Analysis Example
- POS-tagged Dictionary
- Check Connection Rule
- Phoneme Restoration
- HMM-based Tagger: find the most suitable result among the candidates

Plug-In Component-based System
–Each function of Korean morphological analysis is implemented as a plug-in.
–A user can set up a workflow from existing plug-ins to suit their own goal.
[Figure: plug-in pool and workflow phases. Labels recovered from the diagram:]
–Workflow slots: Phase1 Supplement Plugin, Phase2 Morphological Analyzer, Phase2 Supplement Plugin, Phase3 POS Tagger, Phase3 Supplement Plugin
–Plug-In Pool: Input Filter, Sentence Splitter, Auto Spacing, Chart-base Morph Analyzer, Corpus-base Morph Analyzer, Unknown Noun Proc., HMM POS Tagger, CRF POS Tagger, Noun Extractor, Tag Mapper, Transliteration, …
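The plug-in idea can be sketched in a few lines. This is not the jhannanum API: `Workflow`, its `add` and `run` methods, and the three toy plug-ins are hypothetical names chosen for illustration, and each plug-in is reduced to a plain function that transforms the output of the previous one.

```python
from typing import Callable, Dict, List, Tuple

Plugin = Callable[[object], object]


class Workflow:
    """Runs plug-ins phase by phase, in the order they were registered."""

    def __init__(self) -> None:
        # Phase 1: text preprocessing, 2: morphological analysis, 3: POS tagging.
        self.phases: Dict[int, List[Plugin]] = {1: [], 2: [], 3: []}

    def add(self, phase: int, plugin: Plugin) -> "Workflow":
        self.phases[phase].append(plugin)
        return self  # allow chaining

    def run(self, text: str) -> object:
        data: object = text
        for phase in sorted(self.phases):
            for plugin in self.phases[phase]:
                data = plugin(data)
        return data


# Toy plug-ins (placeholders, not real analyzers).
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def sentence_splitter(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def whitespace_analyzer(sentences: List[str]) -> List[List[str]]:
    return [s.split() for s in sentences]


def constant_tagger(sentences: List[List[str]]) -> List[List[Tuple[str, str]]]:
    return [[(token, "unk") for token in sentence] for sentence in sentences]


workflow = (Workflow()
            .add(1, collapse_spaces)
            .add(1, sentence_splitter)
            .add(2, whitespace_analyzer)
            .add(3, constant_tagger))
output = workflow.run("a  b. c d.")
print(output)
```

Swapping `constant_tagger` for another function with the same input and output shape changes the tagger without touching the rest of the workflow, which is the property the plug-in pool is meant to provide.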

Flexible Workflow

Example 1: Analysis of Announcement on Web
- Stages: Plain Text Processor, Morphological Analyzer, Morpheme Processor, POS Tagger
- Plug-ins: Sentence Splitter, Informal Input Filter, Auto Spacing, Chart-based Morphological Analyzer, Unknown Processor, HMM-based POS Tagger
- Input: $$$$$ 장소 $$$$$ 서울코엑스 3층 $$$$$ (Venue: Seoul COEX, 3rd floor)
- Output:
  $$$$$  $/su+$/su+$/su+$/su+$/su
  장소  장소/ncn
  $$$$$  $/su+$/su+$/su+$/su+$/su
  서울  서울/nq
  코엑스  코엑스/ncn
  3층  3/nnc+층/nbu

Example 2: Indexing of News Articles
- Stages: Plain Text Processor, Morphological Analyzer, Morpheme Processor
- Plug-ins: Sentence Splitter, Chart-based Morphological Analyzer, Noun Extractor
- Input: 지난 9월 거제도에서 열린 축제 … (The festival held on Geoje Island last September …)
- Output: 9월/n 거제도/ncn 축제/ncn

HanNanum Korean Morphological Analyzer
Workflow for Morphological Analysis: extract the part-of-speech information from Korean text
- Phase 1. Text Preprocessing
- Phase 2. Morphological Analysis
- Phase 3. POS Tagging
Each phase is filled from the Plugin Pool with Major Plugins and Supplement Plugins.
Plugin Pool: Sentence Segmentation, Input Filter, Auto Spacing, Chart-base Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction, Tag Mapper

Input (Korean document):
7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(Poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. AP reported that, among Nobel Prize observers in Sweden, Korean poet Ko Un was mentioned most often, together with the Syrian poet Adonis, as a candidate likely to win this year's prize. …)

Output (analysis):
7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca ….
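Tagged output of the form morpheme/tag+morpheme/tag is easy to post-process. The sketch below parses such a line and keeps the nouns, roughly what a noun-extraction step would do. It is an illustration under assumptions: the helper names are hypothetical, and treating tags that start with "nc" or "nq" as common and proper nouns is this sketch's reading of the tags in the example, not a rule stated on the slides. Morphemes that themselves contain "+" are not handled.

```python
from typing import List, Tuple

# Assumption: tags beginning with these prefixes mark common and proper nouns.
NOUN_PREFIXES = ("nc", "nq")


def parse_word_phrase(analysis: str) -> List[Tuple[str, str]]:
    """Split 'm1/t1+m2/t2' into [(m1, t1), (m2, t2)]."""
    morphs = []
    for part in analysis.split("+"):
        # rpartition keeps any '/' inside the morpheme on the left side.
        morpheme, _, tag = part.rpartition("/")
        morphs.append((morpheme, tag))
    return morphs


def extract_nouns(tagged_line: str) -> List[str]:
    """Return the morphemes whose tag starts with a noun prefix."""
    nouns = []
    for word_phrase in tagged_line.split():
        for morpheme, tag in parse_word_phrase(word_phrase):
            if tag.startswith(NOUN_PREFIXES):
                nouns.append(morpheme)
    return nouns


line = ("7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm "
        "유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc")
nouns = extract_nouns(line)
print(nouns)
```

Numerals (nnc) and unit nouns (nbu) are excluded here on purpose; widening `NOUN_PREFIXES` to "n" would include them.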

Open Source Project
jhannanum was released.

GUI Demo
[Screenshot of the GUI. Labeled areas: Plug-in Pool, Workflow, Information of a plug-in, Workflow control, Input & Output]

KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis: Korean Document, Analysis, Analyzed Korean Document
- Phase 1. Text Preprocessing
- Phase 2. Morphological Analysis
- Phase 3. POS Tagging
- Phase 4. Parsing
Each phase is filled from the Plugin Pool with Major Plugins and Supplement Plugins.
Plugin Pool: Sentence Segmentation, Input Filter, Auto Spacing, Chart-base Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, Noun Extraction, Tag Mapper, Chart Parser, Noun Phrase Extractor, Verb Phrase Extractor
The example document and its tagged output are the same as on the HanNanum workflow slide.

Korean Syntactic Tree Tagged Corpus
Registered at BoRA (Bank of Resource for Language and Annotation)
–Corpus 5. Manual sentence analysis corpus
–31,091 sentences from 97 different sources
–Length: 1 ~ 33 Eojeols
–Average Eojeols
Related documents
–Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR, 1997 (in Korean)
–Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421~429, 1997 (in Korean)

Questions & Comments