A Text Processing Tool for the Romanian Language
  Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa
  David Nadeau, Institute for Information Technology, National Research Council of Canada
Outline
  BALIE System
  RO-BALIE
  Capabilities
  Improvements
  Evaluation & Results
  Future Work
BALIE - BaseLine Information Extraction
  Multilingual information extraction system
  Language identification
  Tokenization
  Sentence boundary detection
  Part-of-speech tagging for English, French, German, Spanish
  Java trainable open source system
  Uses WEKA, a machine learning tool
  Uses QTag, a language-independent probabilistic part-of-speech tagger
BALIE - BaseLine Information Extraction (cont.)
  Input example:
  1. Introduction
  Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.
BALIE - BaseLine Information Extraction (cont.)
  Output:
  1. Introduction Information …
RO-BALIE Improvements
  Easier manipulation of the input and output texts
  A new tag set that maps the numerical tag set used internally by BALIE
  More information in the output provided by the system
  Available at: Balie/RO-Balie.html
RO-BALIE Language Identification
  Features: 2-grams (sequences of 2 characters)
  Classifier: Naïve Bayes
  Overall accuracy: 99.25%
  Per-language results table. Columns: Language, Training files, Test files, Correctly classified, Accuracy (%). Rows: English, French, Spanish, German, Romanian. (Cell values not preserved in this transcript.)
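The idea of classifying languages from character 2-grams with Naïve Bayes can be illustrated with a minimal sketch. It is written in plain Python rather than BALIE's Java/WEKA setup, and everything in it is an assumption made for illustration: the class name `BigramNaiveBayes`, the add-one smoothing, the lowercasing, and the two toy training sentences, which are not the corpus used for RO-BALIE.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Return the overlapping 2-character sequences of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (add-one smoothing assumed)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> bigram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all bigrams seen in training

    def train(self, language, text):
        grams = bigrams(text)
        self.counts[language].update(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = bigrams(text)
        total_docs = sum(self.docs.values())
        best_language, best_score = None, -math.inf
        for language, counts in self.counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each bigram.
            score = math.log(self.docs[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data, for illustration only.
model = BigramNaiveBayes()
model.train("English", "the quick brown fox jumps over the lazy dog and this is the house")
model.train("Romanian", "aceasta este o casa mare si frumoasa iar el merge la scoala in fiecare zi")

print(model.classify("this is the dog"))
print(model.classify("el este la scoala"))
```

With realistic amounts of training text per language, the same scoring loop applies unchanged; only the counts grow.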
RO-BALIE (cont.)
  Tokenization
  Split each compound word on "-" and "/"
  Examples: iat-o, socio-economic
  Tokenization results:
    Tokens: (value not preserved)   Precision: (value not preserved)   Recall: 98.7%
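The splitting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `tokenize` is hypothetical, whitespace is assumed to separate words, and keeping the separator as its own token is an assumption, since the slide does not say what happens to "-" and "/" after the split.

```python
import re


def tokenize(text):
    """Split on whitespace, then split each word on '-' and '/'.

    Assumption: the separators are kept as tokens of their own.
    """
    tokens = []
    for word in text.split():
        # The capture group makes re.split return the separators too.
        parts = re.split(r"([-/])", word)
        tokens.extend(part for part in parts if part)
    return tokens


print(tokenize("iat-o socio-economic"))
```

Other punctuation is left attached to words here; a full tokenizer would need further rules for it.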
RO-BALIE (cont.)
  Sentence Boundary Detection
  Training: 106 hand-tagged English sentences
  Classifier: decision tree
  Features:
    Beginning of the sentence (first token)
    Previous token
    Current token
    Next token
RO-BALIE (cont.)
  Sentence Boundary Detection (cont.)
  Feature values: Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
  A list of Romanian abbreviations (510 entries)
  Evaluation on Orwell's novel 1984:
    Text       Accuracy   Precision   Recall
    Romanian   97%        92%         71%
    English    97.5%      96.5%       82%
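The four features and their values can be made concrete with a small feature-extraction sketch in Python. It covers only the step that turns tokens into the categorical values a decision tree would consume, not the classifier itself. The names `token_class` and `boundary_features`, the quote character sets, the `Other` and `None` values, and the four-entry abbreviation set (a stand-in for the 510-entry list) are all assumptions made for illustration.

```python
# Hypothetical stand-in for the 510-entry Romanian abbreviation list.
ABBREVIATIONS = {"dl.", "dna.", "nr.", "etc."}
OPEN_QUOTES = {"„", "«", "“"}
CLOSE_QUOTES = {"”", "»"}


def token_class(token):
    """Map a token to one of the categorical feature values."""
    if token == "\n":
        return "NewLine"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token == ".":
        return "Period"
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"


def boundary_features(tokens, i, sentence_start):
    """Features for deciding whether tokens[i] ends the current sentence."""
    return {
        "first": token_class(tokens[sentence_start]),
        "previous": token_class(tokens[i - 1]) if i > 0 else "None",
        "current": token_class(tokens[i]),
        "next": token_class(tokens[i + 1]) if i + 1 < len(tokens) else "None",
    }


tokens = ["Dl.", "Popescu", "a", "venit", ".", "Apoi", "a", "plecat", "."]
print(boundary_features(tokens, 4, 0))
```

Each candidate position yields one such dictionary, which would be paired with a hand-tagged yes/no boundary label for training.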
RO-BALIE (cont.)
  Part-of-speech tagging with the QTag tagger
  Source corpus: 40 million words of articles from Romanian newspapers, covering a 3-year period
  The training corpus is 98% accurate
  The system has a tagset of 14 tags for parts of speech and 30 tags for punctuation
  Results:
    Train corpus    Test corpus                   Accuracy
    2.5 mil words   (size not preserved) words    95.3%
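To give a feel for how a probabilistic tagger picks tags, here is a minimal Python sketch of a bigram hidden Markov model tagger with Viterbi decoding. It is a simpler stand-in chosen for illustration and is not QTag's actual algorithm. The tag names (NOUN, ADJ, CONJ), the add-one smoothing, and the two toy training sentences are assumptions; the real system uses its own 14-tag set and a 2.5-million-word training corpus.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the start of a sentence


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> next tag counts
    emissions = defaultdict(Counter)    # tag -> word counts
    vocab = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word.lower()] += 1
            vocab.add(word.lower())
            previous = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for words."""
    if not words:
        return []
    transitions, emissions, vocab = model
    tags = list(emissions)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unknown words
    best = []  # best[i][tag] = (score of best path ending in tag, previous tag)
    for i, word in enumerate(words):
        column = {}
        for tag in tags:
            emit = log_prob(emissions[tag], word.lower(), n_words)
            if i == 0:
                column[tag] = (log_prob(transitions[START], tag, n_tags) + emit, None)
            else:
                def path_score(prev):
                    return best[i - 1][prev][0] + log_prob(transitions[prev], tag, n_tags)
                prev = max(tags, key=path_score)
                column[tag] = (path_score(prev) + emit, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy training data with hypothetical tags, for illustration only.
model = train([
    [("apel", "NOUN"), ("tirziu", "ADJ"), ("si", "CONJ"), ("inutil", "ADJ")],
    [("apel", "NOUN"), ("inutil", "ADJ")],
])
print(viterbi(["Apel", "tirziu", "si", "inutil"], model))
```

Tagging accuracy such as the 95.3% reported above is then the fraction of test tokens whose predicted tag matches the reference tag.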
RO-BALIE (cont.)
  Output for: Apel tirziu si inutil NISTORESCU.
  Apel tirziu si inutil NISTORESCU.
RO-BALIE (cont.)
  Future Work
  Use machine learning for the tokenization task
  Add new services: morphological analysis, named entity recognition, etc.
  Add more specific information for each supported language