Robust Extraction of Named Entity Including Unfamiliar Word Masatoshi Tsuchiya, Shinya Hida & Seiichi Nakagawa Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi.

Robust Extraction of Named Entity Including Unfamiliar Word
Masatoshi Tsuchiya, Shinya Hida & Seiichi Nakagawa
Toyohashi University of Technology, Japan
ACL 2008
Reporter: Chia-Ying Lee
Advisor: Prof. Hsin-Hsi Chen

Introduction
Named entity recognition is an important problem in NLP.
It is difficult to obtain a large annotated corpus.
The number of named entities keeps increasing.
This paper proposes a novel method of extracting named entities that contain unfamiliar morphemes, using a large unannotated corpus.

Related Work on Japanese NER
Machine-learning-based approaches to named entity recognition: maximum entropy (Uchimoto et al., 2000); decision lists (Sassano and Utsuro, 2000; Isozaki, 2001); support vector machines (Yamada et al., 2002; Isozaki and Kazawa, 2002).
Rule-based approaches to named entity recognition: hand-crafted rules, NExT (Masui et al., 2002).

Method
1. Assign a similar, familiar morpheme to each unfamiliar morpheme.
2. Chunk named entities.
3. Apply machine learning using both the features of the original morphemes and the features of their similar morphemes.

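Method - Assigning the Similar Morpheme
Each morpheme is represented by a vector of unigram and bigram frequencies.
$M \equiv \{m_0, m_1, \ldots, m_N\}$ is the set of all morphemes in the unannotated corpus.
An unfamiliar morpheme is $m_u \in M \cap \overline{M_F}$, where $M_F$ is the set of familiar morphemes.
The cosine function is used as the similarity function.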
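The assignment step can be illustrated with a minimal sketch. The slide only states that the vectors hold unigram and bigram frequencies and that cosine is the similarity function; the exact context definition used here (the one and two morphemes immediately to the left and right), and the function names `context_vector`, `cosine`, and `most_similar_familiar`, are assumptions made for illustration.

```python
import math
from collections import Counter


def context_vector(target, sentences):
    """Count unigram and bigram contexts adjacent to `target`.

    Assumption (not from the slides): a unigram context is the neighboring
    morpheme on either side, a bigram context is the pair of morphemes on
    either side.
    """
    vec = Counter()
    for sent in sentences:
        for i, m in enumerate(sent):
            if m != target:
                continue
            if i >= 1:
                vec[("L1", sent[i - 1])] += 1
            if i + 1 < len(sent):
                vec[("R1", sent[i + 1])] += 1
            if i >= 2:
                vec[("L2", sent[i - 2], sent[i - 1])] += 1
            if i + 2 < len(sent):
                vec[("R2", sent[i + 1], sent[i + 2])] += 1
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_familiar(unfamiliar, familiar, sentences):
    """Pick the familiar morpheme whose context vector is closest."""
    target_vec = context_vector(unfamiliar, sentences)
    return max(
        familiar,
        key=lambda m: cosine(target_vec, context_vector(m, sentences)),
    )


# Toy corpus (hypothetical data): "Osaka" plays the unfamiliar morpheme.
sentences = [
    ["I", "visited", "Tokyo", "yesterday"],
    ["I", "visited", "Osaka", "yesterday"],
    ["he", "ate", "sushi", "today"],
]
best = most_similar_familiar("Osaka", ["Tokyo", "sushi"], sentences)
```

In this toy corpus "Osaka" and "Tokyo" share all of their contexts, so "Tokyo" is selected. On a corpus the size of the one in the evaluation, the context vectors would be computed once and cached rather than rebuilt for every comparison.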

Method - Chunking
IOB2 representation (Tjong Kim Sang, 1999):
B: the current token is the beginning of a chunk.
I: the current token is in the middle or at the end of a chunk consisting of more than one token.
O: the current token is outside any chunk.
16 types for the label B and 8 types for the label I.
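As a small illustration of the IOB2 scheme described above, the sketch below converts chunk spans into per-token labels. The function name `to_iob2`, the span format (start, exclusive end, type), and the English example tokens are assumptions made for illustration; the slides do not show an implementation.

```python
def to_iob2(tokens, chunks):
    """Convert chunk spans to IOB2 labels.

    `chunks` is a list of (start, end, ne_type) tuples with `end` exclusive.
    The first token of a chunk gets B-<type>, the remaining tokens of the
    same chunk get I-<type>, and every other token gets O.
    """
    labels = ["O"] * len(tokens)
    for start, end, ne_type in chunks:
        labels[start] = "B-" + ne_type
        for i in range(start + 1, end):
            labels[i] = "I-" + ne_type
    return labels


tokens = ["Toyohashi", "University", "of", "Technology", "is", "in", "Japan"]
chunks = [(0, 4, "ORGANIZATION"), (6, 7, "LOCATION")]
labels = to_iob2(tokens, chunks)
# labels == ["B-ORGANIZATION", "I-ORGANIZATION", "I-ORGANIZATION",
#            "I-ORGANIZATION", "O", "O", "B-LOCATION"]
```

A single-token chunk such as "Japan" receives only a B label, which is what distinguishes IOB2 from the original IOB scheme.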

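Method - Features for Machine Learning (1)
Three features are defined for each morpheme $m_i$: the morpheme feature $MF(m_i)$, the similar morpheme feature $SF(m_i)$, and the character type feature $CF(m_i)$.
$MF(m_i)$ is the surface string and the part of speech of $m_i$.
$CF(m_i)$ is a set of flags for kanji, hiragana, katakana, and English letters.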
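A character type feature of this kind could be computed as in the following sketch. The Unicode ranges used to detect each script, the flag names, and the function name `char_type_flags` are assumptions made for illustration; the slides only name the four character classes.

```python
def char_type_flags(surface):
    """Return one boolean flag per character class found in `surface`.

    Assumed ranges: CJK Unified Ideographs (U+4E00-U+9FFF) for kanji,
    U+3040-U+309F for hiragana, U+30A0-U+30FF for katakana, and ASCII
    letters for English letters.
    """
    return {
        "kanji": any("\u4e00" <= ch <= "\u9fff" for ch in surface),
        "hiragana": any("\u3040" <= ch <= "\u309f" for ch in surface),
        "katakana": any("\u30a0" <= ch <= "\u30ff" for ch in surface),
        "alphabet": any(ch.isascii() and ch.isalpha() for ch in surface),
    }


flags = char_type_flags("食べる")  # a kanji followed by two hiragana
```

A morpheme that mixes scripts sets several flags at once, so the feature is a set of independent flags rather than a single category.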

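Method - Features for Machine Learning (2)
The chunk label $C_i$ of the current morpheme is estimated from the features of the current morpheme and of the two morphemes before and after it, $F_{i-2}, F_{i-1}, F_i, F_{i+1}, F_{i+2}$, together with the two preceding chunk labels, $C_{i-2}$ and $C_{i-1}$.
$CF(m_i)$ is a set of flags for kanji, hiragana, katakana, and English letters.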
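The feature window can be sketched as follows. The padding symbol `<PAD>`, the key naming scheme, and the function name `window_features` are assumptions made for illustration; the slides only specify which positions contribute features and labels.

```python
def window_features(morpheme_feats, labels, i):
    """Collect features for position i.

    Uses the per-morpheme features at offsets -2..+2 and the chunk labels
    already assigned at offsets -2 and -1. Positions that fall outside the
    sentence are filled with a padding symbol (an assumption).
    """
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < len(morpheme_feats)
        feats[f"F[{offset:+d}]"] = morpheme_feats[j] if inside else "<PAD>"
    for offset in (-2, -1):
        j = i + offset
        feats[f"C[{offset:+d}]"] = labels[j] if j >= 0 else "<PAD>"
    return feats


# Hypothetical per-morpheme features and previously predicted labels.
morpheme_feats = ["a", "b", "c", "d"]
previous_labels = ["O", "B-PERSON"]
feats = window_features(morpheme_feats, previous_labels, 2)
```

Because the labels at positions i-2 and i-1 are inputs, a chunker using this window has to label the sentence from left to right.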

Evaluation - Setup
IREX corpus (annotated): 1,174 newspaper articles containing 18,677 NEs.
Familiar morpheme: a morpheme that occurs 5 or more times in the IREX corpus.
Mainichi Newspaper Corpus: 3.5M sentences (140M words), used as the unannotated corpus for computing context vectors.
A statistical NE chunker is trained with either Conditional Random Fields (CRF) (Lafferty et al., 2001) or a Support Vector Machine (SVM) (Cristianini and Shawe-Taylor, 2000).
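The familiar-morpheme criterion above amounts to a frequency threshold over the annotated corpus. A minimal sketch, assuming the corpus is already segmented into lists of morpheme strings; the function name `familiar_morphemes` and the toy data are hypothetical:

```python
from collections import Counter


def familiar_morphemes(annotated_sentences, min_count=5):
    """Return the morphemes occurring at least `min_count` times."""
    counts = Counter(m for sent in annotated_sentences for m in sent)
    return {m for m, c in counts.items() if c >= min_count}


# Toy data: "a" and "b" occur 5 times each, "c" only 4 times.
toy_corpus = [["a", "b"]] * 5 + [["c"]] * 4
familiar = familiar_morphemes(toy_corpus)
```

Any morpheme outside the returned set would be treated as unfamiliar and assigned a similar familiar morpheme.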

Evaluation - IREX

Evaluation - NHK

Conclusion and Future Work
This paper proposes a novel method for extracting NEs that include unfamiliar morphemes, using a large unannotated corpus.
The similar morpheme feature (SF) is effective for robustly extracting NEs that consist of unfamiliar morphemes.
Future work: include further effective features for NE extraction, such as N-best morpheme sequences and features of surrounding phrases.

Thank you!
