Shallow Parsing for South Asian Languages - Himanshu Agrawal


Shallow Parsing

Parts-of-speech tagging: assigning grammatical classes to the words of a natural language sentence.
Text chunking: dividing the text into syntactically correlated groups of words.

Example: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
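To make the two steps concrete, here is a minimal illustrative sketch using NLTK. The toy grammar and tag patterns are my own and assume NLTK's default English tagger data is installed; this is not the system described in these slides.

```python
# Illustrative sketch of shallow parsing: POS tagging followed by chunking.
# Assumes NLTK's default English tagger model has been downloaded.
import nltk

sentence = "He reckons the current account deficit will narrow to only # 1.8 billion in September ."

# Step 1: POS tagging assigns a grammatical class to every token.
tagged = nltk.pos_tag(sentence.split())

# Step 2: chunking groups the tagged tokens into flat, non-overlapping phrases.
grammar = r"""
  NP: {<DT|PRP|JJ|CD|NN.*>+}     # very rough noun phrases
  VP: {<MD|VB.*>+}               # verb groups
  PP: {<IN|TO>}                  # prepositions
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
```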

Applications

Direct applications:
- Automatic spell-checking software
- Grammar suggestions (MS Word pop-ups)
- Full parsing

Indirect applications:
- Machine translation systems
- Web search

Nature of the Problem of Shallow Parsing

A classic problem of classifying input tokens into given classes.

The sequence aspect:
- the sequence of best classes (each token decided on its own), versus
- the best sequence of classes (the whole tag sequence decided jointly).

Typically, the classifying information is the linguistic context of the word under consideration.
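A toy sketch (not from the slides, with invented scores) contrasting the two readings of the sequence aspect: picking the locally best class per token versus the globally best tag sequence.

```python
# Toy contrast between the "sequence of best classes" (greedy, per token)
# and the "best sequence of classes" (joint decoding). All scores invented.
import itertools

TAGS = ["NN", "VB"]
WORDS = ["book", "flights"]

# Hypothetical per-token scores, roughly P(tag | word).
EMIT = {("book", "NN"): 0.55, ("book", "VB"): 0.45,
        ("flights", "NN"): 0.9,  ("flights", "VB"): 0.1}

# Hypothetical tag-bigram scores, roughly P(tag_i | tag_{i-1}).
TRANS = {("NN", "NN"): 0.2, ("NN", "VB"): 0.8,
         ("VB", "NN"): 0.9, ("VB", "VB"): 0.1}

def greedy():
    # Sequence of best classes: each token decided independently.
    return [max(TAGS, key=lambda t: EMIT[(w, t)]) for w in WORDS]

def best_sequence():
    # Best sequence of classes: score every tag sequence jointly.
    def score(seq):
        s = EMIT[(WORDS[0], seq[0])]
        for i in range(1, len(seq)):
            s *= TRANS[(seq[i - 1], seq[i])] * EMIT[(WORDS[i], seq[i])]
        return s
    return max(itertools.product(TAGS, repeat=len(WORDS)), key=score)

print(greedy())         # ['NN', 'NN']  - locally best tags
print(best_sequence())  # ('VB', 'NN')  - jointly best sequence
```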

Shallow Parsing for English

The problem has been studied extensively for English, and very efficient systems exist. Examples:
- Brill's tagger ('95): transformation-based learning.
- Adwait Ratnaparkhi ('99): parsing with maximum entropy.

These had a significant effect on the development of MT systems for European languages.

Shallow Parsing for South Asian Languages

Portability of shallow parsing systems across languages? Not good! The main reason is the inflectional richness of the languages.

POS tagging accuracy*:

                                           English    Hindi
Brill's transformation-based learning        87%       79%
Ratnaparkhi's maximum-entropy learning       89%       81%

* Training on 22,000 words and testing on 5,000 words; POS tagging only.

Challenges with Indian Languages

- Poor disambiguation between certain POS categories, for example:
  NNP vs. NNC (Error Type 1)
  JJ vs. NN (Error Type 2)
- Inflectional richness of the language.
- Absence of markers such as the capitalization of proper nouns. (Example: "Is that Raj?")

Improving the Performance for Hindi and Other South Asian Languages

There are two broad directions:
- Improving the classifying information, by using better features, language-specific information, or both.
- Improving the learning, through better training and better inference.

A. POS Tagging

For better training and inference:
- Approach 1: training on a hierarchical structure of tags.
- Approach 2: building a knowledge database from raw (un-annotated) text, to be used as a look-up.

Approach 1: Training on a Hierarchical Tagset

Training is done in steps, over a hierarchical structure of classes: training level 1 first (the coarse family classes), then level 2 (the fine-grained tags within each family).

A small data-preparation sketch follows.
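A sketch of how data for such two-level training could be prepared; the family mapping and the VM/VAUX tag names are hypothetical placeholders, not necessarily the tagset actually used.

```python
# Sketch (under assumed tag names) of preparing data for two-step training
# on a hierarchical tagset: level 1 predicts the coarse family class,
# level 2 refines it to the fine-grained tag within that family.

# Hypothetical mapping from fine-grained tags to coarse family classes.
FAMILY = {"NN": "N", "NNP": "N", "NNC": "N",
          "VM": "V", "VAUX": "V",
          "JJ": "ADJ"}

# A toy annotated sentence: (word, fine_tag) pairs.
sentence = [("raam", "NNP"), ("ghar", "NN"), ("gayaa", "VM")]

# Level-1 training data: words labelled with family classes only.
level1 = [(w, FAMILY[t]) for w, t in sentence]

# Level-2 training data: the family class (predicted in step 1, gold here)
# becomes an extra feature, and the target is the fine-grained tag.
level2 = [((w, fam), t) for (w, t), (_, fam) in zip(sentence, level1)]

print(level1)  # [('raam', 'N'), ('ghar', 'N'), ('gayaa', 'V')]
print(level2)  # [(('raam', 'N'), 'NNP'), ...]
```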

Approach 1: Training on a Hierarchical Tagset (contd.)

The approach was devised to minimize the number of errors made within a family class.

Results: %

Reasons:
- No mechanism to correct errors made in part 1 of the training.
- Jittered language constructs while training in part 2.

Approach 2: Building a Knowledge Database for Look-up

The knowledge database consists of words and the POS tags each word is known to have occurred with.

Why is it important? Inflectional richness vs. per-class ambiguity: the language has many inflected word forms, but each individual form tends to occur with only a few tags, so a word-to-tag look-up carries useful information.

Building the Knowledge Database

- Add words and their POS tags from the training data.
- Train on 22,000 words with gold-standard POS tags, creating training model `A`.
- Use model `A` to annotate raw text consisting of 2 lakh (200,000) words.
- Extract the word/POS-tag pairs of words tagged with a very high confidence measure, and add them to the database.

A sketch of this procedure is given below.
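A hedged sketch of the procedure; `tag_with_confidence` and the 0.95 threshold are stand-ins for model `A`'s high-confidence decisions, since the slides do not give the exact confidence measure.

```python
# Hedged sketch of building the word -> {tags} knowledge database.
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.95  # assumed value; the slides only say "very high"

def tag_with_confidence(sentence):
    # Hypothetical stand-in for model `A`: (word, best_tag, confidence) per token.
    demo = {"raam": ("NNP", 0.98), "ghar": ("NN", 0.97), "to": ("RP", 0.60)}
    return [(w, *demo.get(w, ("NN", 0.50))) for w in sentence]

def build_database(gold_data, raw_sentences):
    db = defaultdict(set)
    # 1. Seed the database from the gold-annotated training data.
    for word, tag in gold_data:
        db[word].add(tag)
    # 2. Tag the raw corpus with model `A` and keep only confident decisions.
    for sentence in raw_sentences:
        for word, tag, conf in tag_with_confidence(sentence):
            if conf >= CONFIDENCE_THRESHOLD:
                db[word].add(tag)
    return db

db = build_database([("ghar", "NN")], [["raam", "ghar", "to"]])
print(dict(db))  # {'ghar': {'NN'}, 'raam': {'NNP'}}
```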

Using the Knowledge Database

For the final tagging:
- Use model `A` to obtain the probability of each tag for a word, i.e. P(tag_i | word), for every tag and every word in the test data.
- If a word is found in the database, choose, from the tags in its entry, the one with the highest probability.
- If a word is not found, keep the tag predicted in the first run unchanged.

A sketch of this re-tagging step follows.
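A sketch of the look-up-constrained re-tagging step under the same assumptions; the `marginals` below stand in for model `A`'s per-word tag probabilities P(tag_i | word), with invented numbers.

```python
# Hedged sketch of re-tagging constrained by the knowledge database.

def retag(words, marginals, db):
    """For each word, restrict to the tags in its database entry, if any."""
    output = []
    for word, probs in zip(words, marginals):
        first_run_tag = max(probs, key=probs.get)        # model A's own choice
        allowed = db.get(word)
        if allowed:
            # Choose the allowed tag with the highest model probability.
            output.append(max(allowed, key=lambda t: probs.get(t, 0.0)))
        else:
            output.append(first_run_tag)                 # leave unchanged
    return output

words = ["raam", "ghar"]
marginals = [{"NN": 0.5, "NNP": 0.4, "JJ": 0.1},         # invented numbers
             {"NN": 0.7, "NNP": 0.2, "JJ": 0.1}]
db = {"raam": {"NNP"}}                                    # DB entry overrides NN
print(retag(words, marginals, db))                        # ['NNP', 'NN']
```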

Approach 2 Results: %

Training Model `A`

We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al.).

We use simple, language-independent features:
- Word window [-2, 2].
- Suffix information: the last 2, 3 and 4 characters.
- Presence of special characters.
- Word length.

A sketch of such a feature extractor follows.
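A sketch of a feature extractor implementing the list above; the feature names are my own, and the original CRF templates are not given in the slides.

```python
# Hedged sketch of the language-independent feature set described above.
import re

def token_features(words, i):
    feats = {}
    # Word window [-2, 2]: the word itself and up to two neighbours on each side.
    for offset in range(-2, 3):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    word = words[i]
    # Suffix information: the last 2, 3 and 4 characters.
    for n in (2, 3, 4):
        feats[f"suffix{n}"] = word[-n:]
    # Presence of special (non-alphanumeric) characters, and word length.
    feats["has_special"] = bool(re.search(r"[^\w]", word))
    feats["length"] = len(word)
    return feats

sentence = ["raam", "ghar", "gayaa"]
print(token_features(sentence, 1))
```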

B. Chunking

We follow the approach of Agrawal and Mani '06 (NWAI), with 2-step training:
- Step 1: training on a boundary-label scheme for extracting chunk labels.
- Step 2: training on boundaries, with the chunk labels added as information.

A small illustration of the boundary-label decomposition follows.
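A small illustration of the boundary-label decomposition assumed here; the B-/I- tag format and the chunk labels are illustrative, not necessarily the original scheme.

```python
# Hedged sketch: splitting combined boundary-label chunk tags ("B-NP") into
# a boundary part ("B") and a chunk-label part ("NP") for two-step training.
tokens = ["raam", "ne", "ghar", "dekhaa"]
combined = ["B-NP", "I-NP", "B-NP", "B-VG"]   # illustrative tags only

pairs = [tag.split("-", 1) for tag in combined]
boundaries = [b for b, _ in pairs]
labels = [lab for _, lab in pairs]

# Step 1 targets: chunk labels.  Step 2 targets: boundaries, with the
# step-1 labels available as added information.
print(list(zip(tokens, labels)))
print(list(zip(tokens, labels, boundaries)))
```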

Chunking (contd.)

Training for identifying chunk tags is also done using a linear-chain implementation of CRF.

Features:
- Word window [-2, 2].
- POS-tag window [-2, 2].
- Chunk labels, for chunk boundary identification, in a [-2, 0] window.

A sketch of such a feature extractor follows.
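A sketch of a chunking feature extractor for the list above, under the same illustrative tag names as before.

```python
# Hedged sketch of the chunking features: word window, POS-tag window,
# and previously assigned chunk labels in a [-2, 0] window.

def chunk_features(words, pos_tags, chunk_labels, i):
    feats = {}
    # Word and POS-tag windows of [-2, 2].
    for offset in range(-2, 3):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
        feats[f"pos[{offset}]"] = pos_tags[j] if 0 <= j < len(pos_tags) else "<PAD>"
    # Chunk labels in a [-2, 0] window (labels from the first training step
    # are available as features for boundary identification).
    for offset in (-2, -1, 0):
        j = i + offset
        feats[f"chunk[{offset}]"] = chunk_labels[j] if j >= 0 else "<PAD>"
    return feats

words = ["raam", "ghar", "gayaa"]
pos = ["NNP", "NN", "VM"]            # illustrative tags
chunks = ["B-NP", "B-NP", "B-VG"]    # illustrative boundary-label tags
print(chunk_features(words, pos, chunks, 2))
```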

Chunking Results: %

Consolidated Results

The results below are calculated on the development data.

                Hindi     Telugu     Bengali
POS Tagging       %       71.22 %    81.09 %
Chunking          %       91.77 %    94.90 %

Conclusions

- Train on a tag-set that is optimal for capturing the language patterns.
- If training is done in more than one step, especially such that the tags in a subsequent step depend directly on the tags of the present step, then it is important that there exist a way to re-tag the mis-tagged tokens.

References

- Charles Sutton. An Introduction to Conditional Random Fields for Relational Learning.
- Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Dissertation in Computer and Information Science, University of Pennsylvania.
- Akshay Singh, Sushma Bendre, Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.
- Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 224-231.
- Himanshu Agrawal, Anirudh Mani. 2006. Part Of Speech Tagging and Chunking Using Conditional Random Fields. Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.