Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011.

1 Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011

2 (Syntactic) Treebank
Sentences annotated with syntactic structure (dependency structure or phrase structure)
– 1960s: Brown Corpus
– Early 1990s: The English Penn Treebank
– Late 1990s: Prague Dependency Treebank
– 1990s – now: Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Latin, Norwegian, Polish, Spanish, Turkish, etc.

3 PS and DS
Sentence: John loves Mary.
Phrase structure (PS): (S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.)
Dependency structure (DS): loves/VBP is the root; its dependents are John/NNP, Mary/NNP, and ./.
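The relation between the two representations can be illustrated with a minimal sketch that reads a dependency structure off a phrase-structure tree by picking a head child in each phrase. The nested-tuple encoding, the `HEAD_CHILD` table, and the function names are assumptions made for this illustration; they are not taken from the slides or from any treebank tool.

```python
# Toy phrase-structure tree for "John loves Mary ." as nested tuples:
# (label, child, ...) for phrases and (tag, word) for leaves.
PS = ("S",
      ("NP", ("NNP", "John")),
      ("VP", ("VBP", "loves"), ("NP", ("NNP", "Mary"))),
      (".", "."))

# Hypothetical head rules: the child label that heads each phrase type.
HEAD_CHILD = {"S": "VP", "VP": "VBP", "NP": "NNP"}


def is_leaf(node):
    """A leaf is a (tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def to_dependencies(node, deps):
    """Return the lexical head of `node` and append (head, dependent) pairs to `deps`."""
    if is_leaf(node):
        return f"{node[1]}/{node[0]}"
    label, children = node[0], node[1:]
    heads = [to_dependencies(child, deps) for child in children]
    head_index = next(i for i, child in enumerate(children)
                      if child[0] == HEAD_CHILD[label])
    for i, head in enumerate(heads):
        if i != head_index:
            deps.append((heads[head_index], head))
    return heads[head_index]


deps = []
root = to_dependencies(PS, deps)
print("root:", root)
for head, dependent in deps:
    print(f"{head} -> {dependent}")
```

On this one sentence the sketch yields loves/VBP as the root with John/NNP, Mary/NNP, and ./. as its dependents, matching the DS shown on the slide. Real head rules are considerably richer than this three-entry table.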

4 Proposition Bank (PropBank)
Sentences annotated with predicate argument structure
Ex: John loves Mary
– “loves” is the predicate
– “John” is Arg0 (“Agent”)
– “Mary” is Arg1 (“Theme”)
2000s: The English PropBank, followed by the PropBanks for Chinese, Arabic, Hindi/Urdu, etc.
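As a rough illustration, the example annotation above could be held in a small record like the one below. The `Proposition` class and the `ROLE_GLOSS` table are hypothetical names invented for this sketch; actual PropBank data uses its own file formats and per-predicate frame files.

```python
from dataclasses import dataclass, field

# Glosses for the two numbered arguments in the slide's example only.
ROLE_GLOSS = {"Arg0": "Agent", "Arg1": "Theme"}


@dataclass
class Proposition:
    """One predicate with its numbered arguments (illustrative structure)."""
    predicate: str
    args: dict = field(default_factory=dict)

    def describe(self):
        parts = [f'"{self.predicate}" is the predicate']
        for label, filler in self.args.items():
            parts.append(f'"{filler}" is {label} ("{ROLE_GLOSS[label]}")')
        return "; ".join(parts)


prop = Proposition("loves", {"Arg0": "John", "Arg1": "Mary"})
print(prop.describe())
```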

5 Why do we need treebanks?
Computational linguistics:
– To build and evaluate NLP tools (e.g., word segmenters, part-of-speech taggers, parsers, semantic role labelers)
– This has led to significant progress in the CL field
Theoretical linguistics:
– Annotation guidelines are like a grammar book, with more detail and coverage
– As a discovery tool
– One can test linguistic theories and collect statistics by searching treebanks.

6 The Hindi-Urdu Treebank (HUTB)
Traditional approach:
– Syntactic treebank: PS or DS, but not both
– Layers are added one-by-one
Our approach:
– Syntactic treebank: both DS and PS
– DS, PS, and PB are developed at the same time
– Automatic conversion from DS+PB to PS
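To give a feel for what converting a dependency structure into a phrase structure involves, here is a deliberately naive sketch in which every word projects one phrase that contains the word and the projections of its dependents, in surface order. This is not the HUTB conversion procedure, which also draws on PropBank information and targets a linguistically motivated PS; the token layout, the `PHRASE` table, and `project` are assumptions made for illustration.

```python
# Tokens in surface order: (word, tag, head_index); head_index -1 marks the root.
TOKENS = [
    ("John", "NNP", 1),
    ("loves", "VBP", -1),
    ("Mary", "NNP", 1),
]

# Hypothetical projection labels: the phrase type each tag projects.
PHRASE = {"NNP": "NP", "VBP": "VP"}


def project(i):
    """Bracket token i together with the projections of its dependents."""
    word, tag, _ = TOKENS[i]
    parts = []
    for j, (_, _, head) in enumerate(TOKENS):
        if j == i:
            parts.append(f"{word}/{tag}")   # the head word itself
        elif head == i:
            parts.append(project(j))        # a dependent's whole phrase
    return f"({PHRASE[tag]} {' '.join(parts)})"


root_index = next(i for i, token in enumerate(TOKENS) if token[2] == -1)
print(project(root_index))
```

The result is a flat bracketing, (VP (NP John/NNP) loves/VBP (NP Mary/NNP)), with no S node and no binary VP. Closing the gap between such a flat projection and a theoretically motivated phrase structure is where a real conversion needs additional rules and information.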

7 Motivation 1: Two Representations
Both phrase-structure treebanks and dependency treebanks are used in NLP
– Collins/Charniak/Bikel parsers for PS
– CoNLL task on dependency parsing
Problem: currently few (no?) treebanks have PS and DS that are independently motivated
→ Our project: build a treebank for Hindi/Urdu in which PS and DS are linguistically motivated from the outset
– Dependency: Paninian grammar (Panini 400 BC)
– Phrase structure: variant of Minimalism (Chomsky 1995)

8 Motivation 2: Two Content Levels
Everyone (?) wants syntax
Recent popularity of PropBank (Palmer et al. 2002): lexical predicate-argument structure; “semantics as surfacy as it gets”
Recent experience: PropBank may inform some treebanking decisions
→ Build a treebank with all levels from the outset
→ Annotating them together allows us to study the relation between DS/PB/PS and reduce annotation time

9 Goals
Hindi/Urdu Treebank:
– DS, PB, and PS for 400K-word Hindi and 150K-word Urdu
– Unified annotation guidelines
– Frame files for PropBank
Better understanding of the relation between DS, PB, and PS.

10 Where we are now
Guidelines are almost complete.
Annotation:
– DS annotation: 354K-word Hindi, 60K-word Urdu
– PB annotation: 40K-word Hindi
Automatic conversion from DS + PropBank to PS is in progress.
Preliminary releases in 2009 and 2010

11 The HUTB team
– IIIT, India (DS team): Dipti Sharma, Samar Husain, Rahul Aggarwal, etc.
– Univ. of Colorado at Boulder (PB team): Martha Palmer, Bhuvana Narasimhan, Ashwini Vaidya, Archna Bhatia, etc.
– UMass (PS team): Rajesh Bhatt, Annahita Farudi
– Columbia Univ. (PS team): Owen Rambow
– Univ. of Washington (Conversion): Fei Xia, Michael Tepper

12 Some Sample Structures
Guideline Sentences
– transitive (25), causatives (4), AP predicate (10), clausal extraposition + unaccusative (21), participial adjunct (35), complex predicate (1)
Corpus Sentences
