IKTA-27/2000 Development of a Part-of-Speech (POS) Tagging Method for Hungarian Using Machine Learning Algorithms. Project duration: July 1, 2000 - June 30, 2002.


IKTA-27/2000: Development of a Part-of-Speech (POS) Tagging Method for Hungarian Using Machine Learning Algorithms
Project duration: July 1, 2000 - June 30, 2002
Consortium: University of Szeged, Department of Computer Science; MorphoLogic Ltd., Budapest
Coordinator: Tibor Gyimóthy, PhD, University of Szeged

Contributors
University of Szeged: Zoltán Alexin, Károly Bibok, János Csirik, Tibor Gyimóthy (coordinator), linguistics students
MorphoLogic Ltd.: Gábor Prószéky, László Tihanyi

Hungarian Part-of-Speech Tags
- The word „házaitokkal” ("with your houses") is a noun, plural, in the instrumental case, with a 2nd person plural owner.
- The word „fognak” ("(they) will do something", or "something that belongs to a tooth") may have several POS tags, e.g. auxiliary verb (future tense), 3rd person plural; noun in the genitive case; or noun in the dative case.
- The attributes of a particular word (number, person, case, owner number, owner person, degree, …) are encoded, and words can be labeled with all of their possible codes.
- The task of disambiguation is to segment continuous text into sentences and words, then assign to each word its contextually correct POS tag.

Motivation
- Ambiguous words cause problems in many natural language processing systems.
- Promising results have been published on part-of-speech tagging for European languages.
- To encourage IT research and development related to the Hungarian language. POS tagging and morphological parsing are only two of the open problems; others include NP recognition (NP chunking), shallow parsing, syntactic parsing, semantic encoding, text understanding and the co-reference problem.
- To apply the newest tools of artificial intelligence (machine learning algorithms), trying existing algorithms or designing new ones.
- To establish a learning database for further studies.
- To develop real disambiguation software (a POS tagger) based on the prototype elaborated in the current project.
- To enhance existing software versions and test them against the learning database.

Work phases
- Work phase 1: July 1, 2000 - December 31, 2000
- Work phase 2: January 1, 2001 - December 31, 2001
- Work phase 3: January 1, 2002 - June 30, 2002

The goals of the complete project
- Establish a medium-sized (at least 1 million words) manually annotated Hungarian learning corpus that can also be used for natural language processing tasks beyond the scope of this project.
- Build a separate POS-tagger module that can be integrated into different applications and can efficiently disambiguate texts from different domains.
- Write a continuously adapting system that can follow changes in the language over time.
- The chosen technology should be applicable to other European languages.
- The system should use a general encoding scheme that covers the properties of all European languages; the encoding should also support the rare features of Hungarian, a Finno-Ugric language.
- The per-word accuracy of the solution, i.e. the ratio of correctly annotated words to all words, should reach at least 97-98%.
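The per-word accuracy target can be made concrete with a short Python sketch. The function name and the toy tag sequences are illustrative assumptions, not project code or project data; only the definition (correctly annotated words divided by all words) comes from the slide.

```python
def per_word_accuracy(predicted_tags, gold_tags):
    """Ratio of correctly annotated words to all words."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)


# Toy sequences (assumed for illustration): one of four words is mis-tagged.
gold = ["Tf", "Nc-sn", "Ccsp", "Rx"]
predicted = ["Tf", "Nc-sn", "Ccsw", "Rx"]
accuracy = per_word_accuracy(predicted, gold)
```

On a real test set the same ratio would be compared against the 97-98% goal.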

Results of Work Phase 1
- The text of the 1-million-word corpus has been collected.
- Both MSD and HuMor codes are used to encode the morpho-syntactic features of words.
- The corpus (the text augmented with annotations carrying the disambiguation information) is stored in an XML database, using the TEIXLite DTD.
- The software needed for Work Phase 2 is ready.
- A case study was written that discusses the application of machine learning algorithms in detail.

Current State of the Project
- Collection of texts (completed)
- Setting up a lexicon containing all words and their possible encodings (completed)
- Manual disambiguation of the corpus (by the end of 2001)
- Part-of-speech tagger prototype (by summer 2002)

The Hungarian Learning Corpus
- Build a better training data set than the so-called Orwell corpus (1997).
- Conformance to the „Multext-East” Hungarian MSD specification, i.e. complete subclassing of pronouns, numerals and adverbs (which is missing from the Orwell corpus).
- Conformance to the Hungarian Academic Dictionary.

Comparison of the IKTA and the Orwell corpora
IKTA corpus:
- Size: […] words (plus punctuation)
- Selected modern, recent texts
- XML technology
- Full MSD encoding
Orwell corpus:
- Size: […] tokens (words and punctuation)
- A single novel (special written language)
- SGML technology
- Partial implementation of the Hungarian MSD

MSD Lexical Encoding I.
- Can represent the morpho-syntactic features of many European languages, such as English, German and French, as well as Hungarian, Slovak, Czech, Romanian, Russian, etc.
- Each language is an instance of the general encoding scheme.
- Codes are strings whose first character denotes the main category; the individual attributes (features) are encoded by letters at fixed positions within the code string. Unused attributes are denoted by ‘-’.
- The main categories: A - adjective, C - conjunction, D - determiner, I - interjection, M - numeral, N - noun, P - pronoun, Q - particle, R - adverb, S - adposition, T - article, V - verb, X - residual (unknown), Y - abbreviation.
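As a rough illustration of this scheme, the Python sketch below reads the main category from the first character of a code and lists which attribute positions are actually used. The category table follows the slide; the function names are assumptions made for this example, and the meaning of each position is deliberately left uninterpreted here.

```python
# Main categories as listed on the slide.
MSD_MAIN_CATEGORIES = {
    "A": "adjective", "C": "conjunction", "D": "determiner",
    "I": "interjection", "M": "numeral", "N": "noun", "P": "pronoun",
    "Q": "particle", "R": "adverb", "S": "adposition", "T": "article",
    "V": "verb", "X": "residual (unknown)", "Y": "abbreviation",
}


def main_category(msd_code):
    """Return the main category named by the first character of the code."""
    if not msd_code:
        raise ValueError("empty MSD code")
    return MSD_MAIN_CATEGORIES[msd_code[0]]


def used_positions(msd_code):
    """Return (position, letter) pairs for attributes that are not '-'."""
    return [(i, ch) for i, ch in enumerate(msd_code[1:], start=1) if ch != "-"]


category = main_category("Nc-sn")
positions = used_positions("Nc-sn")
```

For the code `Nc-sn`, position 2 is skipped because it holds the `-` placeholder for an unused attribute.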

MSD Lexical Encoding II.
- The code of the word ablakokban is Nc-p2: Noun, common, - (gender), plural, inessive.
- The code of the word legeslegnagyobbaké is Afe-pn------s: Adjective, type, degree, - (gender), plural, nominative, - (definiteness), - (clitic), - (animate), - (formation), - (owner number), - (owner person), owned number.
- The morphological parsing software converts HuMor codes to MSD using a conversion table.
- Information not encoded in MSD (combinations of words, suffixes) is lost.
- For Hungarian there are about 10,000 different MSD codes, but only a fraction of them occur in everyday written language; a large portion of the codes virtually never occurs.
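The two examples can be decoded position by position. The following Python sketch is only an illustration: the attribute layouts are copied from the order in which the slide lists the attributes for these two codes, and they are an assumption about the layout rather than the full MSD specification (the noun layout in particular stops at the last position the slide shows). The names `LAYOUTS` and `decode_msd` are hypothetical.

```python
# Attribute order per main category, as read off the two slide examples.
# Assumption: these layouts are incomplete relative to the full specification.
LAYOUTS = {
    "N": ["type", "gender", "number", "case"],
    "A": ["type", "degree", "gender", "number", "case", "definiteness",
          "clitic", "animate", "formation", "owner number", "owner person",
          "owned number"],
}


def decode_msd(msd_code):
    """Map each used position of an MSD code to its attribute name."""
    layout = LAYOUTS[msd_code[0]]
    values = msd_code[1:]
    if len(values) > len(layout):
        raise ValueError("code is longer than the known layout")
    # '-' marks an unused attribute, so it is skipped.
    return {name: value for name, value in zip(layout, values) if value != "-"}


noun = decode_msd("Nc-p2")
adjective = decode_msd("Afe-pn------s")
```

Here `noun` holds the type, number and case letters of ablakokban, and `adjective` holds the five attributes of legeslegnagyobbaké that are not `-`.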

Some Data
- The lexicon contained 162,332 words. Preliminary morphological parsing was done by the HuMor (Hungarian Morphology) parser; the labels were then individually checked and corrected where necessary.
- A 200,000-word part of the corpus was separated, and preliminary studies and statistics were carried out on this part.
- The number of ambiguous words increased dramatically: about 57% of the words proved to be ambiguous (115,542 of 202,604), whereas in the Orwell corpus the ratio was 25,526 of 80,708 (31.62%).
- Ambiguous words have 2.92 labels on average.
- 1,499 ambiguity classes were found (versus 566 classes in the Orwell corpus).
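The ambiguity ratios follow directly from the counts quoted on the slide. A minimal Python sketch of the arithmetic (the function and variable names are assumptions):

```python
def ambiguity_ratio(ambiguous_words, total_words):
    """Share of words that have more than one possible label."""
    return ambiguous_words / total_words


# Counts as quoted on the slide.
ikta_ratio = ambiguity_ratio(115_542, 202_604)
orwell_ratio = ambiguity_ratio(25_526, 80_708)

print(f"IKTA sample:   {ikta_ratio:.1%}")
print(f"Orwell corpus: {orwell_ratio:.1%}")
```

The IKTA sample comes out at roughly 57% and the Orwell corpus at just under 32%, which is the gap the slide calls a dramatic increase.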

Hungarian POS-tagger
- It is written in Prolog.
- For each ambiguity class, a separate learning task is generated. (Words that have the same set of morphological annotations belong to the same ambiguity class.)
- The learned rule sets are tested and refined.
- Rule sets obtained from different learning tools can be combined.
- It is based on context rules of the form: choose_C1_…_Cn(-Predicted, +CurrentWord, +LeftContext, +RightContext)
- Ambiguities within a sentence are resolved non-deterministically.
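The tagger itself is written in Prolog; the Python sketch below only mirrors the idea under stated assumptions. It groups lexicon entries into ambiguity classes (one learning task per class) and shows the shape of one context rule. The toy lexicon reuses label sets reported elsewhere in this deck, while the rule body ("the next word can be a noun, so choose the article reading") is an invented heuristic for illustration, not a rule learned in the project.

```python
from collections import defaultdict


def group_by_ambiguity_class(lexicon):
    """Map each set of possible tags to the words that share it."""
    classes = defaultdict(list)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].append(word)
    return dict(classes)


def choose_Pd3sn_Tf(current_word, left_context, right_context):
    """Hypothetical context rule for the class {'Pd3-sn', 'Tf'}.

    Contexts are lists of (word, possible_tags) pairs. Returns a predicted
    tag, or None when the rule does not apply.
    """
    if right_context:
        _next_word, next_tags = right_context[0]
        # Invented heuristic: a following possible noun suggests an article.
        if any(tag.startswith("N") for tag in next_tags):
            return "Tf"
    return None


# Toy lexicon (assumed for illustration).
lexicon = {
    "A": ["I", "Nc-sn", "Pd3-sn", "Tf"],
    "a": ["I", "Nc-sn", "Pd3-sn", "Tf"],
    "Az": ["Pd3-sn", "Tf"],
    "az": ["Pd3-sn", "Tf"],
}
classes = group_by_ambiguity_class(lexicon)
prediction = choose_Pd3sn_Tf("az", [], [("ablak", ["Nc-sn"])])
```

Returning `None` when a rule does not fire leaves room for combining rule sets from several learning tools, as the slide describes.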

Software components
- Lexicon maintenance utility for the HuMor parser
- Text-XML conversion utilities
- XSLT scripts for maintaining the XML databases
- Morphological parser program (HuMor: Hungarian Morphology)
- Disambiguation program
- Learning task generator programs
- Parsers that read back the output of the learning tools and convert the results to a standardized Prolog rule set
- Programs for maintaining rule sets, POS taggers, test utilities

A Sample Screen of the Disambiguation Program

Sample Data from the First 200,000-Word Section of the Corpus
- Number of ambiguity classes: 1,499
- Most frequent ambiguity class: A, a - ['I', 'Nc-sn', 'Pd3-sn', 'Tf'], although the correct label was ‘Tf’ in 12,185 of the 12,194 occurrences.
- Some other frequent classes: Az, az - ['Pd3-sn', 'Tf'] (annotated ‘Tf’ in 5,447 of 6,207 cases); Mármint, Márpedig - ['Ccsp', 'Ccsw'] (annotated ‘Ccsp’ in 3,978 of 5,548 occurrences).
- The most problematic class was Aztán, Csakhogy - ['Ccsp', 'Ccsw', 'Rx'], where ‘Ccsp’ was correct in 2,457 of 4,853 cases. With a probabilistic tagger that always chooses the most probable tag, this class alone causes 2,396 errors, i.e. about 2.08% of the roughly 115,000 ambiguous words.
- For 855 classes only one label was ever chosen; 736 classes occurred fewer than 10 times; 430 classes occurred only once in the text.
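The error count for a tagger that always picks the most probable tag can be reproduced from the figures above: for each class, the errors are the occurrences that do not carry the majority label. A small Python sketch (the names are assumptions; the counts are those quoted on the slide):

```python
def most_frequent_tag_errors(total_occurrences, majority_tag_count):
    """Errors made by always choosing the most frequent tag of a class."""
    return total_occurrences - majority_tag_count


# (total occurrences, occurrences of the most frequent tag) per class.
class_counts = {
    "I/Nc-sn/Pd3-sn/Tf": (12_194, 12_185),
    "Pd3-sn/Tf": (6_207, 5_447),
    "Ccsp/Ccsw": (5_548, 3_978),
    "Ccsp/Ccsw/Rx": (4_853, 2_457),
}

errors = {
    name: most_frequent_tag_errors(total, majority)
    for name, (total, majority) in class_counts.items()
}
```

The most frequent class contributes only 9 errors, while the three-way `Ccsp/Ccsw/Rx` class contributes 2,396, which is why that class is the natural target for learned context rules.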