
1 Using Corpora for Language Research
COGS 523 - Lecture 5
METU Turkish Corpus and METU-Turkish Sabancı Treebank - A Developer's Perspective
07.06.2016, COGS 523 - Bilge Say

2 Related Readings
Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. Development of a Corpus and a Treebank for Present-day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeillé (editor), Kluwer Academic Publishers, 2003.
Nart B. Atalay, Kemal Oflazer, Bilge Say. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC), April 13-14, 2003, Budapest, Hungary.

3 Acknowledgements
Funding: METU-BAP, TÜBİTAK
METU-Sabancı Treebank: joint work with Prof. Kemal Oflazer
Main contributors: Umut Özge and Nart Bedin Atalay, METU; around 5 research assistants and 13 student annotators and trainees at various phases of the project
Various faculty members contributed ideas, especially at the initial stages
Agreements with 14 publishers (including 3 newspapers and 4 magazines)

4 Requirements for Corpora for Turkish?
Incorporating many registers representatively
Diachronic and synchronic
Electronic
Annotated with standard practices (typographically, morphosyntactically, semantically, prosodically...)
Respecting copyright laws
Accessible (free availability, support, etc.)
Searchable

5 What is METU Turkish Corpus?
A synchronic (1990+) corpus of written Turkish
2,000,000 words from 201 books, 87 journal issues and issues of 3 daily newspapers, totaling 999 samples
Various kinds of annotation (creation of a treebank as a separate subproject)
Project: 1999-2003

6 Other Features of METU Turkish Corpus
Permissions for each sample obtained from the publishers
Opportunistic representativeness!
Platform-independent; XML and TEI-compliant annotation
Accompanying query software
Free for academic research purposes on signature of a user agreement
http://www.ii.metu.edu.tr/~corpus/

7 Building the Corpus
Text compilation (permissions, scanning if necessary, control)
Computer-aided annotation (TEI-XCES for general-typographic; XML-compliant in-house scheme for the treebank)
Control
Query Workbench development

8 Distribution of Text Types

9 Annotation of the Corpus
Text Encoding Initiative (TEI) compliant
XCES (XML-based Corpus Encoding Standard) compliant; XCES is a TEI application
Compatible with major current corpora such as the British National Corpus

10 The TEI Structure - 1
teiCorpus
  teiHeader
  TEI.2
    teiHeader
    text
      front
      body
      back
(Burnard, 2001)
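The nesting named on this slide can be sketched as an XML skeleton. This is only an illustration: it assumes that teiCorpus holds a corpus-level teiHeader plus one TEI.2 element per text, and that each TEI.2 holds its own teiHeader and a text with front, body and back. The function name build_tei_skeleton is made up for this sketch and is not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET


def build_tei_skeleton() -> ET.Element:
    """Build an empty teiCorpus tree with one TEI.2 document (assumed nesting)."""
    corpus = ET.Element("teiCorpus")
    ET.SubElement(corpus, "teiHeader")      # corpus-level header
    doc = ET.SubElement(corpus, "TEI.2")    # one sample text
    ET.SubElement(doc, "teiHeader")         # header of this sample
    text = ET.SubElement(doc, "text")
    for part in ("front", "body", "back"):  # the three parts of a text
        ET.SubElement(text, part)
    return corpus


skeleton = build_tei_skeleton()
print(ET.tostring(skeleton, encoding="unicode"))
```

A real header and body would of course carry content and attributes; the sketch only shows which element contains which.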

11 The TEI Structure - 2
front, body, back
components, e.g. …
divisions, e.g. …
phrase-level, e.g. …
(Burnard, 2001)

12 A Typical Header
00017113 2008 17929...

13 A Typical Header (cont.)
Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan BAYTOP
Nalân MAHSERECİ
Bilim ve Ütopya
Mart 2000
İstanbul
1301 - 6717

14 A Typical Header (cont.)
Makale
12.10.2000
Sedef
The header part was changed.

15 A Typical Body
Oktay biraz önce, Hadi biz de Sitem'in yanına gidelim, demişti. Sitem'in, kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp gidivermemden korkan Oktay beni oyalamak için geçen yaz Giray Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı. O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...

16 Entering XCES Annotations - 1

17 Entering XCES Annotations - 2

18 METU-Sabancı Treebank Project
Annotation of morphological and (surface) syntactic features in a dependency-inspired manner
A subcorpus containing 7,300 annotated sentences and 65,000 words; initially, whole samples were selected from the main corpus (another version contains 5,600 sentences)
Genre distribution is proportional to that of the METU Corpus

19 Building the Treebank
Morphological analysis of selected samples from the corpus
Preprocessing of the collocations
(Manual) disambiguation of the morphological parses
Annotating with the dependency structure
Control

20 Annotation - Lexical Level
A word can be seen as a sequence of inflectional groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
Example: evinizdekilerden (from the ones at your house)
ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl
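A minimal sketch of how an analysis string in this notation can be split into its lemma and inflectional groups. It assumes that ^DB marks each derivation boundary and + separates features, as in the example on the slide; the function name parse_analysis is hypothetical and is not taken from the treebank tools.

```python
from typing import List, Tuple


def parse_analysis(analysis: str) -> Tuple[str, List[List[str]]]:
    """Split 'Lemma+Infl1^DB+Infl2...' into (lemma, list of IG feature lists)."""
    chunks = analysis.split("^DB+")            # one chunk per inflectional group
    lemma, *first_ig = chunks[0].split("+")    # the first chunk starts with the lemma
    groups = [first_ig] + [chunk.split("+") for chunk in chunks[1:]]
    return lemma, groups


lemma, groups = parse_analysis("ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl")
print(lemma)        # ev
for ig in groups:
    print(ig)
# ['Noun', 'A3sg', 'P2pl', 'Loc']
# ['Adj']
# ['Noun', 'A3pl', 'Pnon', 'Abl']
```

The three groups correspond to the noun "ev" with its locative marking, the derived adjective, and the derived plural ablative noun.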

21 Annotation - Syntactic Level
Word:   Bu          çocuk     okuldan            erken      geldi
Gloss:  This        child     school+Abl         early      come+Past+3sg
Label:  Determiner  Subject   Ablative Adjunct   Modifier
This child came from the school early.
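The example sentence can be written down as a small dependency structure. The relation labels come from the slide; the head of each word is an assumption here, since the arcs themselves were drawn in the figure (Bu attaches to çocuk, and the other three words attach to the verb geldi). Marking the verb with the Sentence tag is likewise an assumption. The sketch links whole words for simplicity, not inflectional groups, and the Token type is hypothetical.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    index: int            # 1-based position in the sentence
    form: str
    head: Optional[int]   # index of the head word, None for the root
    relation: str


# Heads are assumed; labels are the ones shown on the slide.
sentence = [
    Token(1, "Bu", 2, "Determiner"),
    Token(2, "çocuk", 5, "Subject"),
    Token(3, "okuldan", 5, "Ablative Adjunct"),
    Token(4, "erken", 5, "Modifier"),
    Token(5, "geldi", None, "Sentence"),
]


def dependents_of(tokens: List[Token], head_index: int) -> List[str]:
    """Return the forms of all words attached to the given head."""
    return [t.form for t in tokens if t.head == head_index]


print(dependents_of(sentence, 5))   # ['çocuk', 'okuldan', 'erken']
# In this sentence every dependent precedes its head.
print(all(t.head is None or t.head > t.index for t in sentence))   # True
```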

22 Annotation - Syntactic Level
Total of 20 syntactic tags:
Sentence, Object, Subject, Intensifier, Modifier, Determiner, Question-Particle, Relativizer, Coordination, Possessor, Classifier, Ablative Adjunct, Dative Adjunct, Locative Adjunct, Instrumental Adjunct, ...

23 Morphosyntactic Processing
Tokenized text is annotated (ambiguously) with all possible morphological analyses for each token; this also involves unknown-word processing
A constraint-based disambiguation module performs limited morphological disambiguation
Recognition and morphological annotation of collocations

24 Automatic Dependency Annotation
Try to get most of the "easy" relations right automatically, to help and speed up the human annotator
The human annotator can override the selected dependency relation if it is not right
Pilot work was done, but this was not put into practice in the METU-Sabancı treebank

25 Automatic Dependency Annotation
A set of heuristic rules tentatively attaches some of the relations automatically:
Appropriately case-marked nouns attach to the immediately following unambiguous postposition as objects
Indefinite nominative nouns attach to the first verb to the right as objects
Adverbs and adjuncts attach to the first verb to the right as modifiers and adjuncts
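A rough sketch of two of these heuristics: attaching indefinite nominative nouns, adverbs, and case-marked adjuncts to the first verb to the right. It is not the pilot implementation. The token fields (pos, case, definite), the mapping from case to adjunct label, and the function name attach_easy_relations are assumptions made for illustration, and the postposition rule is left out because it needs to know which case each postposition governs.

```python
from typing import Dict, List, Optional, Tuple

# Assumed mapping from case tag to adjunct relation label.
ADJUNCT_LABELS = {
    "Abl": "Ablative Adjunct",
    "Dat": "Dative Adjunct",
    "Loc": "Locative Adjunct",
    "Ins": "Instrumental Adjunct",
}


def first_verb_to_right(tokens: List[dict], start: int) -> Optional[int]:
    """Index of the first verb after position start, or None."""
    for j in range(start + 1, len(tokens)):
        if tokens[j]["pos"] == "Verb":
            return j
    return None


def attach_easy_relations(tokens: List[dict]) -> Dict[int, Tuple[int, str]]:
    """Return {dependent index: (head index, relation)} for the 'easy' cases only."""
    arcs: Dict[int, Tuple[int, str]] = {}
    for i, tok in enumerate(tokens):
        verb = first_verb_to_right(tokens, i)
        if verb is None:
            continue  # nothing to attach to; leave for the annotator
        if tok["pos"] == "Adverb":
            arcs[i] = (verb, "Modifier")
        elif tok["pos"] == "Noun" and tok.get("case") in ADJUNCT_LABELS:
            arcs[i] = (verb, ADJUNCT_LABELS[tok["case"]])
        elif tok["pos"] == "Noun" and tok.get("case") == "Nom" and not tok.get("definite"):
            arcs[i] = (verb, "Object")
    return arcs


tokens = [
    {"form": "Bu", "pos": "Det"},
    {"form": "çocuk", "pos": "Noun", "case": "Nom", "definite": True},
    {"form": "okuldan", "pos": "Noun", "case": "Abl"},
    {"form": "erken", "pos": "Adverb"},
    {"form": "geldi", "pos": "Verb"},
]
print(attach_easy_relations(tokens))
# {2: (4, 'Ablative Adjunct'), 3: (4, 'Modifier')}
```

Words that no rule covers (here the determiner and the definite subject) stay unattached, which matches the idea that the human annotator completes and corrects the tentative attachments.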

26 The Annotation Tool
The text thus processed can now be further annotated with an annotation tool
Visualization
Review selections (morphology/dependency) and override (for morphology) or annotate (for dependency)
The output of the program is morphologically disambiguated and annotated text, encoded according to the XML document and Turkish Treebank formats

27 Annotating the Treebank - 1

28 Annotating the Treebank - 2

29 Corpus Query Workbench
A user-friendly query engine for linguists
Organization through sessions
Boolean or regular expression queries
Filtering queries through bibliographic constraints such as author, genre, year
Treebank entries viewed through a graphical interface
Printing and saving options for outputs and session queries
Implemented in Java SE 1.4.1, compatible with Windows XP/Linux
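The combination of a regular expression query with bibliographic filtering can be sketched as follows. This is not the workbench's code, which was written in Java. The sample records, their identifiers and metadata values, and the function name search are all hypothetical; only the two sentences are taken from the slides.

```python
import re
from typing import Dict, List, Tuple

# Toy records with made-up ids and metadata.
samples: List[Dict] = [
    {"id": "s1", "genre": "novel", "year": 1995,
     "text": "Bu çocuk okuldan erken geldi."},
    {"id": "s2", "genre": "article", "year": 2000,
     "text": "Ben omuz silkmekle yetindim, Oktay da üstelemedi."},
]


def search(records: List[Dict], pattern: str, **constraints) -> List[Tuple[str, str]]:
    """Return (sample id, matched string) pairs from records meeting all constraints."""
    regex = re.compile(pattern)
    hits = []
    for rec in records:
        # Bibliographic filter: every given field must match exactly.
        if all(rec.get(field) == value for field, value in constraints.items()):
            hits.extend((rec["id"], m.group(0)) for m in regex.finditer(rec["text"]))
    return hits


# Words ending in -dan/-den, a crude pattern for ablative forms.
print(search(samples, r"\w+d[ae]n\b"))                   # [('s1', 'okuldan')]
print(search(samples, r"\w+d[ae]n\b", genre="article"))  # []
print(search(samples, r"\bOktay\b", year=2000))          # [('s2', 'Oktay')]
```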


32 Post-project Developments
About 100 user forms received
Some uses (from a recent survey):
  Word sense disambiguation
  Coherence in Turkish texts
  Subcategorization frame acquisition
  Teaching Turkish or NLP
CoNLL dependency task for METU-Sabancı Treebank (~5000 sentences)
Frequency lists available (due to Umut Özge and Serge Sharoff)

33 What would we have done differently?
More funding, more interdisciplinary organization, less turnover...
Approaching a corpus development project like a software engineering project...
Doing a pilot project
Better quality control, version control and documentation control processes
More and better automatic text capture and annotation

34 Requests from Users
Extend the size and variety of the corpus
POS tag the whole corpus
Enable users to load their own corpora into the query tool
Add statistical features to the query tools
Add semantic annotation
Treebank-specific ones:
  10,000; 7,000 or 5,000 sentences?
  Detailed stylebook
  LEM and MORPH fields
  Better versioning; some entries do not conform to XML

35 Requirements for future generations of Turkish corpora
Turkish National Corpus (like ANC, BNC, or CNC)
Spoken part
Automatic tools
Diachronic part
Linguistically motivated morphological and syntactic annotation
Some motivation for text providers
Well-funded, well-organized project
Comparable corpora of Turkic languages

36 Lecture 6
Bernardini et al. A Wacky Introduction.
April 14: your tool evaluation presentations and reports. Only two weeks left!

