1 Corpus Annotation II
Martin Volk, Universität Zürich / Eurospider Information Technology AG

2 Overview
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition

3 Part-of-Speech Tagging
Tagging was done with the Tree-Tagger (Helmut Schmid, IMS Stuttgart). The Tree-Tagger
- is a statistical tagger,
- uses the STTS tag set (50 PoS tags and 3 tags for punctuation),
- assigns exactly one tag to each word,
- preserves pre-set tags.
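A minimal sketch of calling the tagger from Python, assuming the third-party treetaggerwrapper package and a local Tree-Tagger installation; the sample sentence is illustrative:

```python
# Tag a German sentence with the Tree-Tagger via 'treetaggerwrapper'.
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')        # German parameter file
raw = tagger.tag_text("Die Daten können überführt werden.")
for tag in treetaggerwrapper.make_tags(raw):
    print(tag.word, tag.pos, tag.lemma)                    # word, STTS tag, lemma
```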

4 Tagging Correction
Correction of observed tagger problems:
- Sentence-initial adjectives are often tagged as noun (NN): '...liche[nr]' or '...ische[nr]' --> ADJA
- Verb group patterns (sketched below):
  - the verb in front of 'worden' must be a perfect participle: VVXXX + 'worden' --> VVPP
  - a verb followed by a modal verb must be an infinitive: VVXXX + VMYYY --> VVINF
- Unknown prepositions (a, via, innert, ennet)
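A minimal sketch (not the original implementation) of the two verb-group rules, applied to a list of (word, STTS-tag) pairs:

```python
# Apply the two verb-group correction rules from the slide.
def correct_tags(tokens):
    out = list(tokens)
    for i, (word, tag) in enumerate(out):
        nxt_word, nxt_tag = out[i + 1] if i + 1 < len(out) else ("", "")
        # VVXXX + 'worden' --> VVPP: the verb must be a perfect participle
        if tag.startswith("VV") and nxt_word == "worden":
            out[i] = (word, "VVPP")
        # VVXXX + VMYYY --> VVINF: a full verb before a modal is an infinitive
        elif tag.startswith("VV") and nxt_tag.startswith("VM"):
            out[i] = (word, "VVINF")
    return out

print(correct_tags([("überführt", "VVFIN"), ("worden", "VAPP")]))
# [('überführt', 'VVPP'), ('worden', 'VAPP')]
```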

5 Correction of sentence boundaries
E.g.: a suspected ordinal number followed by a capitalized determiner, pronoun, preposition, or adverb --> insert a sentence boundary.
Open question: could all sentence boundary detection be done after PoS tagging?
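A hedged sketch of the rule above; TRIGGERS is an illustrative stand-in for the real lexicon of capitalized determiners, pronouns, prepositions, and adverbs:

```python
import re

# Illustrative stand-in for the trigger-word lexicon.
TRIGGERS = {"Der", "Die", "Das", "Er", "Im", "Am", "Dann"}

def boundary_after_ordinal(token, next_token):
    # a suspected ordinal ('3.') followed by a capitalized trigger word
    return bool(re.fullmatch(r"\d+\.", token)) and next_token in TRIGGERS

print(boundary_after_ordinal("3.", "Dann"))   # True --> insert a boundary
```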

6 Lemmatisation
Lemmatisation was done with Gertwol (by Lingsoft Oy, Helsinki) for adjectives, nouns, prepositions, and verbs. Gertwol
- is a two-level morphology analyzer for German,
- is lexicon-based,
- returns all possible interpretations for each word form,
- segments compound words dynamically,
- analyzes hyphenated compounds only if all parts are known (e.g. Software-Aktien but not Informix-Aktien) --> feed the last element to Gertwol (sketched below).
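A minimal sketch of that last-element fallback; gertwol_lemma is a hypothetical stand-in for a call to the real analyzer (returning None for unknown words):

```python
# Fall back to the last element of an unanalyzable hyphenated compound.
def lemmatize_hyphenated(word, gertwol_lemma):
    lemma = gertwol_lemma(word)
    if lemma is None and "-" in word:
        head, _, last = word.rpartition("-")
        last_lemma = gertwol_lemma(last)      # analyze the last element only
        if last_lemma is not None:
            return head + "-" + last_lemma    # Informix-Aktien --> Informix-Aktie
    return lemma
```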

7 Lemma Filtering (a project by Julian Käser)
After lemmatisation: merging of Gertwol and tagger information.
- Case 1: the lemma was prespecified during proper name recognition (IBMs --> IBM).
- Case 2: Gertwol does not find a lemma --> insert the word form as lemma (marked with '?').

8 Lemma Filtering
- Case 3: Gertwol finds exactly one lemma for the given PoS --> insert the lemma.
- Case 4: Gertwol finds multiple lemmas for the given PoS --> disambiguate and insert the best lemma.
Disambiguation weights the segmentation symbols:
- strong compound segment boundary (#): 4 points
- weak compound segment boundary (|): 2 points
- derivational segment boundary (~): 1 point
The lemma with the lowest score wins!
Example: Abteilungen --> Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
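A worked sketch of this scoring: each segmentation symbol contributes its weight, and the lowest-scoring analysis wins:

```python
# Score Gertwol segmentations; the lowest total wins.
WEIGHTS = {"#": 4, "|": 2, "~": 1}   # strong, weak, derivational boundary

def score(analysis):
    return sum(WEIGHTS.get(ch, 0) for ch in analysis)

candidates = ["Abt~ei#lunge", "Ab|teil~ung"]   # analyses of 'Abteilungen'
print([(a, score(a)) for a in candidates])  # [('Abt~ei#lunge', 5), ('Ab|teil~ung', 3)]
print(min(candidates, key=score))           # Ab|teil~ung
```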

9 Lemma Filtering
- Case 5: Gertwol finds a lemma, but not for the given PoS --> this indicates a tagger error (Gertwol is more reliable than the tagger).
  - Case 5.1: Gertwol finds a lemma for exactly one PoS --> insert the lemma and exchange the PoS tag.
  - Case 5.2: Gertwol finds lemmas for more than one PoS --> find the closest PoS tag, or guess.

10 Lemma Filtering
0.74% of all PoS tags were exchanged (2% of the Adj, N, and V tags). In other words, ~14'000 tags per annual volume of the ComputerZeitung were exchanged. 85% are cases with exactly one Gertwol tag; 15% are guesses.

11 Limitations of Gertwol
Compounds are lemmatized only if all parts are known.
Idea: use the corpus to lemmatize the remaining compounds (e.g. kaputtreden, Waferfabriken).
Solution: if the first part occurs standing alone AND the second part occurs standing alone with a lemma, then segment and lemmatize, and store the first part as a lemma (of itself)!
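A minimal sketch of this corpus-based idea. corpus_lemmas is a hypothetical map from word forms seen standing alone to their lemmas, and the brute-force split search and the '|' join are simplifications for illustration:

```python
# Lemmatize an unknown compound from corpus evidence about its parts.
def lemmatize_compound(word, corpus_lemmas):
    for i in range(2, len(word) - 1):          # try every split point
        first, second = word[:i], word[i:]
        second_cap = second.capitalize()        # noun heads are capitalized
        if first.capitalize() in corpus_lemmas and corpus_lemmas.get(second_cap):
            # store the first part as its own lemma, lemmatize the head
            return first.capitalize() + "|" + corpus_lemmas[second_cap]
    return None

lemmas = {"Wafer": "Wafer", "Fabriken": "Fabrik"}
print(lemmatize_compound("Waferfabriken", lemmas))   # Wafer|Fabrik
```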

12 NP/PP Chunk Recognition (a project by Dominik A. Merz)
A pattern matcher with patterns over PoS tags.
Example patterns:
ADV ADJA --> AP
APPR ART ADJA NN --> PP
APPR ART AP NN --> PP
Note: the morphological information provided by Gertwol (e.g. grammatical case, number, gender) was not used!
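A minimal sketch (not the original implementation) of such a matcher as cascaded regular-expression matching over the tag sequence; AP patterns run first so that PP patterns can match over AP chunks:

```python
import re

# Cascade of (tag pattern, chunk label) pairs, mirroring the slide.
CASCADE = [
    (r"ADV ADJA", "AP"),
    (r"APPR ART ADJA NN", "PP"),
    (r"APPR ART \[AP[^\]]*\] NN", "PP"),   # PP over an already-built AP
]

def chunk(tags):
    text = " ".join(tags)
    for pattern, label in CASCADE:
        text = re.sub(pattern, lambda m: f"[{label} {m.group(0)}]", text)
    return text

print(chunk(["APPR", "ART", "ADV", "ADJA", "NN"]))
# [PP APPR ART [AP ADV ADJA] NN]
```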

13 Representation Format
The NEGRA export format
- is a line-based format,
- works with pointers for the tree structure,
- comprises node labels (constituents) and edge labels (grammatical functions),
- has no provision for semantic information. Therefore: we use the comment field.
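A simplified, illustrative fragment in the spirit of the export format (columns abbreviated to word, PoS tag, morphology, edge label, parent node; morphology left empty here):

```
#BOS 1
Nur      ADV   --   MO   500
ein      ART   --   NK   500
Projekt  NN    --   NK   500
#500     NP    --   SB   0    %% comment field, e.g. a semantic label
#EOS 1
```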

14 Recognition of temporal PPs (a project by Stefan Höfler)
A second step towards semantic annotation.
Starting point:
- Prepositions (3) that always introduce a temporal PP: binnen, während, zeit
- Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... (+ additional evidence)
Additional evidence:
- a temporal adverb in the PP: heute, niemals, wann, ...
- a temporal noun in the PP: Minute, Stunde, Jahr, Anfang, ...
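A minimal sketch of this decision; the word lists are abbreviated samples from the slide, not the full resources:

```python
# Decide whether a PP is temporal from its preposition and contents.
ALWAYS_TEMPORAL  = {"binnen", "während", "zeit"}
MAYBE_TEMPORAL   = {"ab", "an", "auf", "bis"}
TEMPORAL_ADVERBS = {"heute", "niemals", "wann"}
TEMPORAL_NOUNS   = {"Minute", "Stunde", "Jahr", "Anfang"}

def is_temporal_pp(preposition, pp_tokens):
    if preposition in ALWAYS_TEMPORAL:
        return True
    if preposition in MAYBE_TEMPORAL:
        # ambiguous preposition: require additional evidence inside the PP
        return any(t in TEMPORAL_ADVERBS or t in TEMPORAL_NOUNS
                   for t in pp_tokens)
    return False

print(is_temporal_pp("ab", ["nächstem", "Jahr"]))   # True
```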

15 Recognition of temporal PPs
Evaluation corpus: 990 sentences with 263 manually checked temporal PPs.
Results: Precision: 81%, Recall: 76%

16 Recognition of local PPs
Starting point:
- Prepositions (3) that always introduce a local PP: fern, oberhalb, südlich von
- Prepositions (30) that may introduce a local PP: ab, auf, bei, ... (+ additional evidence)
Additional evidence:
- a local adverb in the PP: dort, hier, oben, rechts, ...
- a local noun in the PP: Strasse, Quartier, Land, Norden, <GEO>, ...

17 Recognition of temporal and local PPs
[slide figure not preserved in the transcript]

18 A Word on Recall and Precision
The focus varies with the application! For my project, precision is more important than recall.
Idea: if I annotate something, then I want to be 'sure' that it is correct.
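For reference, the standard definitions, where TP, FP, and FN count true positives, false positives, and false negatives against a gold standard:

```latex
\[
  \mathrm{Precision} = \frac{TP}{TP + FP},
  \qquad
  \mathrm{Recall} = \frac{TP}{TP + FN}
\]
```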

19 Clause Boundary Recognition (a project by Gaudenz Lügstenmann)
Definition: a clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts. A sentence consists of one or more clauses, and a clause consists of one or more phrases.
Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

20 Clause Boundary Recognition
Exceptions to the definition: clauses with more than one verb:
- coordinated verbs (e.g. Daten können überführt und verarbeitet werden)
- perception verb + infinitive (= AcI) (e.g. die den Markt wachsen sehen)
- 'lassen' + infinitive (e.g. lässt die Handbücher übertragen)

21 Clause Boundary Recognition
Exceptions to the definition: clauses without a verb:
- elliptical clauses (e.g. in coordinated structures)
Example: Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.

22 Clause Boundary Recognition
The CB recognizer is realized as a pattern matcher over PoS tags (34 patterns).
Example patterns: comma + relative pronoun; finite verb + conjunction + finite verb.
Most difficult: a CB without an overt punctuation symbol or trigger word.
Example: Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.
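A minimal sketch (one pattern, not the original 34) of boundary insertion over (word, STTS-tag) pairs:

```python
# Insert <CB> marks: a comma followed by a relative pronoun (PRELS)
# opens a new clause.
def insert_cbs(tokens):
    out = []
    for i, (word, tag) in enumerate(tokens):
        out.append((word, tag))
        if word == "," and i + 1 < len(tokens) and tokens[i + 1][1] == "PRELS":
            out.append(("<CB>", "CB"))
    return out

sample = [("Präsident", "NN"), (",", "$,"), ("der", "PRELS"), ("kannte", "VVFIN")]
print([w for w, _ in insert_cbs(sample)])
# ['Präsident', ',', '<CB>', 'der', 'kannte']
```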

23 Clause Boundary Recognition
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Results (counting all CBs): Precision: 95.8%, Recall: 84.9%
Results (counting only intra-sentential CBs): Precision: 90.5%, Recall: 61.1%

24 Using a PoS Tagger for Clause Boundary Recognition
A CB recognizer can be seen as a disambiguator over commas and CB trigger tokens (if we disregard the CBs without a trigger). A tagger may serve the same purpose.
Example:
... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.
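A hedged sketch of the idea using NLTK's transformation-based tagger (the project used the original Brill tagger). Commas are treated as ordinary tokens whose 'tag' is Co or CB; the two training sentences are toy data standing in for the 75% training split:

```python
# Train a Brill-style tagger to disambiguate commas as Co vs. CB.
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train = [
    [("schrieb", "VVFIN"), ("der", "ART"), ("Präsident", "NN"),
     (",", "Co"), ("Michael", "NE"), ("Eisner", "NE"), (",", "Co")],
    [("Präsident", "NN"), (",", "CB"), ("der", "PRELS"),
     ("Michael", "NE"), ("Eisner", "NE"), ("kannte", "VVFIN"), (",", "CB")],
]

baseline = UnigramTagger(train, backoff=DefaultTagger("Co"))
tagger = BrillTaggerTrainer(baseline, fntbl37()).train(train, max_rules=10)
print(tagger.tag(["Präsident", ",", "der", "Michael", "Eisner", "kannte"]))
```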

25 Using a PoS Tagger for Clause Boundary Recognition
Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
Training the Brill tagger on 75% of the corpus and applying it to the remaining 25%.
Results: Precision: 93%, Recall: 91%
Caution: a very small evaluation corpus!

26 Clause Boundary Recognition vs. Clause Recognition
CB recognition marks only the boundaries; it does not identify discontinuous parts of clauses.
Example:
Nur ein Projekt der Volkswagen AG, <CB> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, <CB> stößt in ähnliche Dimensionen vor.
<C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, </C> stößt in ähnliche Dimensionen vor. </C>
Clause recognition should be done with a recursive parsing approach because of clause nesting.

