Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.

1 Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra

2 Outline  TiGer Treebank  TiGer Search

3 The TiGer Treebank
● TIGER: LinguisTic Interpretation of a GERman Corpus
● Institute of Natural Language Processing (IMS) in Stuttgart, Institut für Germanistik in Potsdam, Department of Computational Linguistics and Phonetics in Saarbrücken
● German treebanks: Verbmobil Corpus (only spoken language), NEGRA Corpus and Tuebingen Treebank (only 20,000 sentences)
● The need for a large and comprehensive German treebank:
– Data for the testing and training of statistically based methods in natural language processing
– Basis for empirical language research
● TIGER Corpus:
– First release (mid 2003): 40,000 sentences of newspaper text (Frankfurter Rundschau, full articles)
– Second release (Christmas 2005): 50,000 sentences
– Together with the 20,000 NEGRA sentences, comparable in size to the Penn Treebank (1.5 million words)

4 TiGer: Levels of annotation
● Annotation on word level: part-of-speech, morphology, lemmata
● Node labels: phrase categories
● Edge labels: syntactic functions
● Crossing branches for discontinuous constituency types
[Figure: annotated syntax tree for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." Node labels: S, VP, NP, PP. Edge labels: HD, SB, OC, OA, MO, AC, NK.]
Word-level annotation shown in the figure (word, part-of-speech, morphology, lemma):
  Im           APPRART  Dat              in
  nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
  Jahr         NN       Dat.Pl.Neut      Jahr
  will         VMFIN    3.Sg.Pres.Ind    wollen
  die          ART      Nom.Sg.Fem       die
  Regierung    NN       Nom.Sg.Fem       Regierung
  ihre         PPOSAT   Acc.Pl.Masc      ihr
  Reformpläne  NN       Acc.Pl.Masc      Plan
  umsetzen     VVINF    Inf              umsetzen
  .            $.

5 TiGer: Annotation method
● Interactive tagging and parsing
– Tagging: TnT (97% reliable)
– Parsing: Cascaded Markov Models (71% reliable)
– Morphology: TigerMorph
● Independent annotation by 2 different annotators, followed by comparison => consistency of corpus + improvement of annotation scheme
● Annotation time: 10 minutes per sentence

6 TiGer: Annotation formats
  #BOS 37 3 863207489 1
  % word        tag    morph            edge  parent
  Ausgerechnet  ADJD   --               MO    502
  Iggy          NE     Masc.Nom.Sg      PNC   500
  Pop           NE     *.Nom.Sg         PNC   500
  verkörpert    VVFIN  3.Sg.Pres.Ind    HD    503
  gesanglich    ADJD   Pos              MO    503
  den           ART    Def.Masc.Akk.Sg  NK    501
  Staatsanwalt  NN     Masc.Akk.Sg.*    NK    501
  .             $.     --               --    0
  #500          MPN    --               NK    502
  #501          NP     --               OA    503
  #502          NP     --               SB    503
  #503          S      --               --    0
  #EOS 37
● Corpus annotation and storage on the basis of a MySQL database
● TIGER export format in a line-oriented and ASCII based format
● Separate columns for words, part-of-speech tags, morphological information, edge labels and parent labels
● Encoded meta-information on date, source etc.
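The column layout of the export format can be read with a few lines of code. The following is a minimal Python sketch, not the project's own tooling: it assumes whitespace-separated columns exactly as on the slide (word, tag, morph, edge, parent), and the names `Node` and `parse_sentence` are hypothetical.

```python
from dataclasses import dataclass

# Sentence 37 from the slide, assumed to be whitespace-separated.
SAMPLE = """\
#BOS 37 3 863207489 1
% word tag morph edge parent
Ausgerechnet ADJD -- MO 502
Iggy NE Masc.Nom.Sg PNC 500
Pop NE *.Nom.Sg PNC 500
verkörpert VVFIN 3.Sg.Pres.Ind HD 503
gesanglich ADJD Pos MO 503
den ART Def.Masc.Akk.Sg NK 501
Staatsanwalt NN Masc.Akk.Sg.* NK 501
. $. -- -- 0
#500 MPN -- NK 502
#501 NP -- OA 503
#502 NP -- SB 503
#503 S -- -- 0
#EOS 37
"""


@dataclass
class Node:
    form: str          # word form for terminals, "#<id>" for nonterminals
    label: str         # part-of-speech tag or phrase category
    morph: str         # morphological tag, "--" if absent
    edge: str          # syntactic function of the edge to the parent
    parent: int        # id of the parent node, 0 for the virtual root
    is_terminal: bool


def parse_sentence(text):
    """Parse one #BOS ... #EOS block into a list of Node records."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip empty lines, sentence delimiters and the column comment line.
        if not fields or fields[0] in ("#BOS", "#EOS") or fields[0].startswith("%"):
            continue
        form, label, morph, edge, parent = fields
        is_terminal = not (form.startswith("#") and form[1:].isdigit())
        nodes.append(Node(form, label, morph, edge, int(parent), is_terminal))
    return nodes


nodes = parse_sentence(SAMPLE)
words = [n.form for n in nodes if n.is_terminal]
children_of_s = [n.form for n in nodes if n.parent == 503]
print(words)
print(children_of_s)
```

Because every line carries only a parent id, the tree is recovered by grouping nodes on `parent`; here the S node (#503) directly dominates the verb, the adverbial and the two NP nodes.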

7 TiGer: Annotation formats
● A TIGER XML document is split up into header and body
● Header contains meta-information on corpus name, date, author etc. and an annotation grammar
● Body: directed acyclic graphs are used as the underlying data model to encode the linguistic annotation
● The terminals element carries the following attributes: word, part-of-speech, morphological tag
● The nonterminals element carries information on phrase categories and syntactic functions
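To make the terminals/nonterminals split concrete, here is a small Python sketch that reads a hand-written fragment in the style of TIGER XML. The fragment is illustrative only: the element and attribute names (`t`, `nt`, `edge`, `idref`, `cat`, `pos`, `morph`) and the id scheme are assumptions written from memory of the format, not copied from the slides, and the header is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical body fragment for the NP "den Staatsanwalt" (sentence 37).
FRAGMENT = """
<s id="s37">
  <graph root="s37_501">
    <terminals>
      <t id="s37_6" word="den" pos="ART" morph="Def.Masc.Akk.Sg"/>
      <t id="s37_7" word="Staatsanwalt" pos="NN" morph="Masc.Akk.Sg.*"/>
    </terminals>
    <nonterminals>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""

root = ET.fromstring(FRAGMENT.strip())

# Terminals: map node id -> word form.
words = {t.get("id"): t.get("word") for t in root.iter("t")}

# Nonterminals: phrase category plus labelled edges to the children.
phrases = []
for nt in root.iter("nt"):
    children = [
        (edge.get("label"), words.get(edge.get("idref"), edge.get("idref")))
        for edge in nt.iter("edge")
    ]
    phrases.append((nt.get("cat"), children))

print(phrases)
```

The phrase category sits on the nonterminal node and the syntactic function on each outgoing edge, which mirrors the node-label/edge-label distinction of the annotation scheme.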

8 TiGer: Annotation scheme
● Uses a hybrid framework which combines advantages of dependency grammar and phrase structure grammar
● Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities (e.g. the distinction between arguments and adjuncts is not expressed in the constituent structure, but encoded by means of syntactic functions)
● Based on the NEGRA annotation scheme
● Changes in TIGER:
– improvement of linguistic adequacy
– extension of linguistic inventory
● Cross-fertilization of corpus and annotation scheme: annotation and comparison → discrepancy between annotation scheme and data → changes in annotation scheme, test for operationalization

9 TiGer: Query tool
● TIGERSearch: query tool for treebanks using the TIGER Query Language
● TIGERRegistry: format conversions into TIGER XML and indexing of the annotated corpus
● TIGER Graph Viewer: visualization of query results
● TIGERin: Graphical User Interface to simplify complex queries and to improve accessibility of the query language

10 TiGer: Query language

11 Node level:
● Nodes can be described by Boolean expressions over feature-value pairs
● Query: [word="lacht" & pos="VVFIN"]
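A node description of this kind amounts to a predicate over a node's features. The Python sketch below illustrates the idea with small predicate combinators; it is not how TIGERSearch is implemented, the helper names (`feat`, `both`, `either`, `negate`) are hypothetical, and the token list is made up for the example.

```python
def feat(name, value):
    """Predicate: the node has feature `name` with exactly this value."""
    return lambda node: node.get(name) == value


def both(p, q):
    """Boolean conjunction (& in the query language)."""
    return lambda node: p(node) and q(node)


def either(p, q):
    """Boolean disjunction."""
    return lambda node: p(node) or q(node)


def negate(p):
    """Boolean negation."""
    return lambda node: not p(node)


# Hypothetical terminal nodes, each a feature-value mapping.
tokens = [
    {"word": "Sie", "pos": "PPER"},
    {"word": "lacht", "pos": "VVFIN"},
    {"word": "will", "pos": "VMFIN"},
    {"word": "umsetzen", "pos": "VVINF"},
]

# Counterpart of [word="lacht" & pos="VVFIN"].
query = both(feat("word", "lacht"), feat("pos", "VVFIN"))
hits = [tok for tok in tokens if query(tok)]

# A disjunction over part-of-speech values, for comparison.
finite_verbs = [
    tok["word"]
    for tok in tokens
    if either(feat("pos", "VVFIN"), feat("pos", "VMFIN"))(tok)
]

print(hits)
print(finite_verbs)
```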

12 TiGer: Query language
Node relation level:
● Descriptions of two or more nodes are combined by a relation
● Query: [cat="NP"] >RC [cat="S"]

13 TiGer: Query language
Graph description level:
● Boolean expressions over node relations are allowed (without negation)
● Query: ([cat="S"] > [pos="PRELS"]) & ([cat="S"] > [pos="VVFIN"])
● Variables can be used to express coreference of nodes or feature values
● Query: (#n:[cat="S"] > [pos="PRELS"]) & (#n > [pos="VVFIN"])
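The effect of the variable #n can be illustrated on a toy graph. The Python sketch below is an assumption-laden illustration, not TIGERSearch itself: the three S nodes, their children and the helper `dominating` are invented for the example, and `>` is read as direct dominance.

```python
# Hypothetical graph: node id -> features, and parent id -> child ids.
nodes = {
    "s1": {"cat": "S"},
    "s2": {"cat": "S"},
    "s3": {"cat": "S"},
    "t1": {"word": "der", "pos": "PRELS"},
    "t2": {"word": "lacht", "pos": "VVFIN"},
    "t3": {"word": "kommt", "pos": "VVFIN"},
    "t4": {"word": "die", "pos": "PRELS"},
}
children = {"s1": ["t1", "t2"], "s2": ["t3"], "s3": ["t4"]}


def satisfies(node_id, feats):
    """True if the node carries every feature-value pair in `feats`."""
    return all(nodes[node_id].get(k) == v for k, v in feats.items())


def dominating(parent_feats, child_feats):
    """Ids of nodes matching parent_feats that directly dominate
    at least one node matching child_feats."""
    return {
        parent
        for parent, kids in children.items()
        if satisfies(parent, parent_feats)
        and any(satisfies(kid, child_feats) for kid in kids)
    }


with_prels = dominating({"cat": "S"}, {"pos": "PRELS"})
with_vvfin = dominating({"cat": "S"}, {"pos": "VVFIN"})

# Without a variable, the two S descriptions are matched independently,
# so any pair of S nodes (one per conjunct) satisfies the query.
independent = sorted((a, b) for a in with_prels for b in with_vvfin)

# With #n, both conjuncts must hold for the same S node.
same_node = sorted(with_prels & with_vvfin)

print(independent)
print(same_node)
```

In this toy graph only s1 dominates both a relative pronoun and a finite verb, so the query with the variable narrows four independent pairings down to a single node.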

14 For further information (downloads, papers etc.): http://www.coli.uni-sb.de/cl/projects/tiger

