IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes Algorithmische Syntax 21.12.2004.

1 IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes Algorithmische Syntax 21.12.2004

2 IMS Universität Stuttgart 2 Motivation: maintenance of consistency and completeness within lexica; computer-assisted methods; lexical engineering; scalable lexicographic work process; processes reproducible on large amounts of text; statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus-linguistic research; full parsers are not robust enough; need for analysis tools that meet the specific needs of corpus-linguistic studies

3 IMS Universität Stuttgart 3 Information needed syntactic information subcategorization patterns semantic information selectional preferences, collocations synonyms multi-word units lexical classes morphological information case, number, gender compounding and derivation

4 IMS Universität Stuttgart 4 Requirements for the tool it has to work on unrestricted text shortcomings in the grammar should not lead to a complete failure to parse no manual checking should be required should provide a clearly defined interface annotation should follow linguistic standards

5 IMS Universität Stuttgart 5 Requirements for the annotation head lemma morpho-syntactic information lexical-semantic information structural and textual information hierarchical representation

6 IMS Universität Stuttgart 6 A corpus linguistic approach

7 IMS Universität Stuttgart 7 Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

8 IMS Universität Stuttgart 8 Three different dimensions type of grammar symbolic grammar probabilistic grammar type of grammar development hand-written grammar learning methods depth of analysis analysis on token level only full parsing partial parsing

9 IMS Universität Stuttgart 9 Classical chunk definition Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head

10 IMS Universität Stuttgart 10 Problems for extraction Kübler and Hinrichs (2001) focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances.

11 IMS Universität Stuttgart 11 An example
1. [PC mit kleinen], [PC über die Köpfe] [NC der Apostel] [NC gesetzten Flammen]
   gloss: with small | above the heads | the apostles | set flames
2. [PP mit [NP [AP kleinen], [AP über [NP die Köpfe [NP der Apostel]] gesetzten] Flammen]]
   gloss: with small above the heads the apostles set flames
   'with small flames set above the heads of the apostles'

12 IMS Universität Stuttgart 12 Problems for extraction
four NCs instead of only one NP
AN-pair: (+) gesetzten + Flammen; (-) kleine + Flammen
NN-pair: Köpfe + Apostel; needs agreement information
VN-pair: setzen + Flammen; needs information about the deverbal character of gesetzten
a more complex analysis is needed
PCs and NCs need to be combined

13 IMS Universität Stuttgart 13 Simple solution: PP → PC (PC|NC)*; theoretical motivation?; the rule covers this particular example, other examples might need additional rules; the rule is vague and largely underspecified; not very reliable; internal structure is mainly left opaque

14 IMS Universität Stuttgart 14 Complex solution
1. NP → NC NC_gen
2. PP → preposition NP
3. AP → PP adjective
4. NP → AP* noun

15 IMS Universität Stuttgart 15 Complex solution solution for this particular example only large number of rules needed rules have to be repeated for every instance of a complex phrase in order to support extractions, the classic chunk concept has to be extended

16 IMS Universität Stuttgart 16 Conclusion
Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output
Full Parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output
YAC

17 IMS Universität Stuttgart 17 A recursive chunker for unrestricted German text recursive chunker for unrestricted German text fully automatic analysis main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora

18 IMS Universität Stuttgart 18 General aspects: based on a symbolic regular expression grammar; grammar rules written in CQP; basis: tokenization, PoS-tagging, lemmatization (TreeTagger) and agreement information (IMSLex)

19 IMS Universität Stuttgart 19 A typical chunker robust – works on unrestricted text works fully automatically does not provide full but partial analysis of text no highly ambiguous attachment decisions are made

20 IMS Universität Stuttgart 20 YAC goes beyond extends the chunk definition of Abney 1.recursive embedding 2.post-head embedding provides additional information about annotated chunks 1.head lemma 2.agreement information 3.lexical-semantic and structural properties

21 IMS Universität Stuttgart 21 Extended chunk definition: A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.

22 IMS Universität Stuttgart 22 Technical Framework corpus Perl-Scripts grammar rules lexicon rule application annotation of results post- processing

23 IMS Universität Stuttgart 23 Output formats CQP format, used for: interactive grammar development parsing extraction an XML format, used for: hierarchy building extraction data exchange

24 IMS Universität Stuttgart 24 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

25 IMS Universität Stuttgart 25 Linguistic coverage Adverbial phrases (AdvP) a)schön stark (beautifully strong) b)daher (from there); irgendwoher (from anywhere) c)heim (home); querfeldein (cross-country) d)innen (inside); überall (everywhere) e)"sehr bald" (very soon) f)jetzt (now); damals (at that time)

26 IMS Universität Stuttgart 26 Linguistic coverage Adjectival phrases (AP) a)möglich (possible) b)schreiend lila (screamingly purple) c)rund zwei Meter hohe around two meter high d)über die Köpfe der Apostel gesetzten above the heads of the apostles set 'set above the heads of the apostles'

27 IMS Universität Stuttgart 27 Linguistic coverage Noun phrases (NP) a)Oktober (October); er (he) b)4,9 Milliarden Euro 4.9 billion Euros c)"Frankensteins Fluch" "Frankenstein's curse" d)kleine, über die Köpfe der Apostel gesetzten small, above the heads of the apostles set Flammen flames 'small flames set above the heads of the apostles'

28 IMS Universität Stuttgart 28 Linguistic coverage Prepositional phrases (PP)
a) davon (thereof)
b) zwischen Basel und St. Moritz (between Basel and St. Moritz)
c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen
   gloss: with small, above the heads of the apostles set flames
   'with small flames set above the heads of the apostles'

29 IMS Universität Stuttgart 29 Linguistic coverage Verbal complexes (VC)
a) gemunkelt (rumored)
b) muß gerechnet werden; gloss: has counted to be; 'has to be counted'
c) zu bekommen; 'to get'
d) bekommen zu haben; gloss: gotten to have; 'to have gotten'

30 IMS Universität Stuttgart 30 Linguistic coverage Clauses (CL) a)…, daß selbst Ravel sich amüsiert hätte. …, that even Ravel himself enjoyed had. '…, that even Ravel would have enjoyed.' b)…, die man in der griechischen Tragödie findet. …, which one in the Greek tragedy finds. '…, which one finds in the Greek tragedy.'

31 IMS Universität Stuttgart 31 Linguistic coverage Clauses (CL) a)…, Instrumente selbst zu bauen. …, instruments oneself to build. ' …, to build instruments oneself.' b)…, um einen Kaffee zu trinken. …, in order a coffee to drink. '…, in order to drink a coffee.'

32 IMS Universität Stuttgart 32 Feature annotation head lemma morpho-syntactic information lexical-semantic properties

33 IMS Universität Stuttgart 33 Feature annotation (feature value by chunk category; columns: AdvP, AP, NP, PP, VC, CL)
lexical-semantic: X X X X X X
head lemma: X X X X X X
agreement info: X X X
verbal head lemma: X

34 IMS Universität Stuttgart 34 Head lemma: lemma attribute at the head position; normally a single token; multi-word proper nouns have a multi-token head lemma; a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive); head lemma of PP: preposition:noun
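The two composition rules on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the system's actual code: the function names and the über:Kopf example are hypothetical, and prefix attachment is simplified to plain string concatenation.

```python
def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a verbal complex: a separated prefix is re-attached."""
    # Simplification: the prefix is simply prepended to the verb lemma.
    return (separated_prefix or "") + verb_lemma


def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{preposition_lemma}:{noun_lemma}"


print(vc_head_lemma("kommen", "an"))   # kommt ... an -> ankommen
print(pp_head_lemma("über", "Kopf"))   # hypothetical PP example -> über:Kopf
```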

35 IMS Universität Stuttgart 35 Morpho-syntactic information intersection of the morpho-syntactic information of relevant elements invariant elements are not considered no guessing involved to solve ambiguities

36 IMS Universität Stuttgart 36 Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| vierten Platz
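The intersection step described for morpho-syntactic information can be sketched as below. This is a minimal illustration, not the system's implementation. Only the readings of den are taken from the slide; the readings for vierten and Platz are assumed for the example, and the comparison is restricted to case, gender, and number (dropping the definiteness field) so that determiner, adjective, and noun readings are comparable.

```python
def readings(tag_string):
    """Turn '|Akk:M:Sg:Def|Dat:F:Pl:Def|' into {('Akk', 'M', 'Sg'), ...}."""
    return {
        tuple(reading.split(":")[:3])  # keep case, gender, number only
        for reading in tag_string.strip("|").split("|")
        if reading
    }


def chunk_agreement(tag_strings):
    """Intersect the readings of all relevant tokens; no guessing involved."""
    sets = [readings(t) for t in tag_strings]
    return set.intersection(*sets) if sets else set()


DEN = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"  # from the slide
VIERTEN = "|Akk:M:Sg:Def|Dat:M:Pl:Def|Gen:M:Sg:Def|"            # assumed readings
PLATZ = "|Nom:M:Sg|Akk:M:Sg|Dat:M:Sg|"                          # assumed readings

print(chunk_agreement([DEN, VIERTEN, PLATZ]))  # {('Akk', 'M', 'Sg')}
```

If the intersection still contains several readings, all of them are kept, which matches the "no guessing" policy stated on the previous slide.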

40 IMS Universität Stuttgart 40 Lexical-semantic properties: important for parsing as well as for extraction; properties can be triggers for specific internal structures, functions, and usages; properties inherent in the corpus: PoS-tags, e.g. Johann/NE Sebastian/NE Bach/NE; text markers, e.g. "Wilhelm/NE Meisters/NN Lehrjahre/NN"

41 IMS Universität Stuttgart 41 Lexical-semantic properties properties determined by external knowledge sources (lexica, ontologies, word lists) locality: hier (here); dort (there); Stuttgart temporality: Jahr (year); damals (at that time) derivation: gesetzten (set) deverbal adjective

42 IMS Universität Stuttgart 42 Lexical-semantic properties structural information complex embeddings [ AP [ PP über die Köpfe der Apostel ] gesetzten ] above the heads of the apostles set ' set above the heads of the apostles' [ AP [ NP der "Inkatha"-Partei ] angehörenden ] to the Inkatha-party belonging 'belonging to the Inkatha-party'

43 IMS Universität Stuttgart 43 Some properties of NPs
card: cardinal noun
meas: measure noun
ne: named entity
quot: NP in quotation marks
street: street address
temp: temporal noun
date
pron: pronominal NP

44 IMS Universität Stuttgart 44 Other lexical-semantic properties
pref: VC with separated prefix, e.g. Er kommt an (he arrives)
fus: PP with contracted preposition and article, e.g. am Bahnhof (at the station)
pp: complex APs embedding PPs, e.g. über die Köpfe der Apostel gesetzten (gloss: above the heads of the apostles set; 'set above the heads of the apostles')
vder: AP with deverbal adjectives

45 IMS Universität Stuttgart 45 Chunking process (diagram; labels: Corpus, Lexicon, First Level, Second Level, Third Level, Corpus)

46 IMS Universität Stuttgart 46 First level: basic (non-recursive) chunks; chunks with specific internal structure: a) Ende September (end of September) b) Jahre später (years later) c) 21. Juli 2003 d) Johann Sebastian Bach; lexical information is introduced: within the rules themselves; within the Perl scripts

47 IMS Universität Stuttgart 47 Advantages specific rules do not interact with main parsing rules additional (e.g. domain specific) rules can be included easily main parsing rules can be kept simple number of main parsing rules can be kept small

48 IMS Universität Stuttgart 48 Second level: main parsing level; relatively simple and general rules:
a) AP → AdvP? (PP|NP)* AC
b) NP → Determiner? Cardinal? AP* NC
c) PP → Preposition (NP|AdvP)
complex (recursive) structures are built in several iterations

49 IMS Universität Stuttgart 49 Rule blocks

50 IMS Universität Stuttgart 50 Complexity of phrases -complexity of phrases is achieved by the embedding of complex structures rather than by complex rules a)[ NP eine [ AP verständliche ] Sprache ] an understandable language b)[ NP eine [ AP für den Anwender verständliche ] Sprache ] a for the user understandable language 'a language understandable for the user'
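How simple rules can yield an embedded structure over several iterations can be sketched as follows. This is a simplified illustration, not the CQP grammar itself. It rewrites a sequence of category labels destructively, whereas the system described here annotates structures and later keeps the largest one. The label names (Det, Card, Prep), the rule order, and the choice to let an existing NP serve as the head of a larger NP are all assumptions of the sketch.

```python
import re

# Second-level-style rules over space-terminated category labels.
# Assumptions: label names, rule order, and NP being allowed as an NP head.
RULES = [
    ("NP", re.compile(r"(?<!\S)(?:Det )?(?:Card )?(?:AP )*(?:NC|NP) ")),
    ("PP", re.compile(r"(?<!\S)Prep (?:NP|AdvP) ")),
    ("AP", re.compile(r"(?<!\S)(?:AdvP )?(?:(?:PP|NP) )*AC ")),
]


def apply_rule(label, pattern, nodes):
    """Replace every match of `pattern` in the label sequence by one chunk."""
    seq = "".join(node[0] + " " for node in nodes)
    out, pos, changed = [], 0, False
    for m in pattern.finditer(seq):
        start = seq.count(" ", 0, m.start())  # token index of match start
        end = seq.count(" ", 0, m.end())      # token index after match end
        if end - start == 1 and nodes[start][0] == label:
            continue  # re-wrapping a lone chunk of the same type changes nothing
        children = nodes[start:end]
        head = children[-1]
        if head[0] == label and isinstance(head[1], list):
            # Same head, larger chunk: absorb the smaller chunk's children.
            children = children[:-1] + head[1]
        out.extend(nodes[pos:start])
        out.append((label, children))
        pos, changed = end, True
    out.extend(nodes[pos:])
    return out, changed


def chunk(nodes):
    """Apply all rules repeatedly until no rule changes the sequence."""
    changed = True
    while changed:
        changed = False
        for label, pattern in RULES:
            nodes, did_change = apply_rule(label, pattern, nodes)
            changed = changed or did_change
    return nodes


def show(node):
    """Render a node as a bracketed string; leaves print as bare words."""
    label, body = node
    if isinstance(body, str):
        return body
    return "[" + label + " " + " ".join(show(child) for child in body) + "]"


EXAMPLE = [("Det", "eine"), ("Prep", "für"), ("Det", "den"),
           ("NC", "Anwender"), ("AC", "verständliche"), ("NC", "Sprache")]

print(" ".join(show(node) for node in chunk(EXAMPLE)))
```

On the example phrase the first pass builds the inner NP, the PP, and the AP; the second pass wraps them into the outer NP. No rule mentions nesting explicitly: the embedding arises because each rule may consume chunks produced earlier.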

51 IMS Universität Stuttgart 51 Complexity of phrases
a) [PP auf [NP dem Giebel]]
   gloss: on top of the gable
b) [PP auf [NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses]]
   gloss: on top of the westwards pointed gable | of the today in baroque garment appearing | Lord's house

52 IMS Universität Stuttgart 52 Third level: chunks of related but different categories can be subsumed under one category: NPs with determiner (NP), NPs without determiner (NCC), base noun chunks (NC) → NP; coordination of maximal chunks; decisions are made which need full recursive chunks: adverbially and predicatively used adjectives can only be differentiated by the actual usage; adverbially used AP → AdvP

53 IMS Universität Stuttgart 53 Hierarchy building resulting structures of all parsing stages are collected and stored in XML-files after the parsing process collected structures are combined into a hierarchical structure only the largest instance of a structure (sharing the same head) is taken into account

54 IMS Universität Stuttgart 54 Hierarchy building a)[ NP Faszination ] fascination b)[ NP gewisse Faszination des Schattens ] certain fascination of the shadow c)[ NP eine gewisse Faszination des Schattens ] a certain fascination of the shadow d)[ NP des Schattens ] of the shadow e)[ NP eine gewisse Faszination [ NP des Schattens ] ] a certain fascination of the shadow
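The selection step (keep only the largest structure per head, then nest what remains) can be sketched as below. This is an illustration under assumptions, not the system's XML-based implementation: chunks are modeled as hypothetical (start, end, label, head position) tuples over token indices, and the surviving spans are assumed to be properly nested.

```python
def largest_per_head(chunks):
    """Keep only the longest chunk for each (label, head position) pair."""
    best = {}
    for start, end, label, head in chunks:
        key = (label, head)
        if key not in best or end - start > best[key][1] - best[key][0]:
            best[key] = (start, end, label, head)
    # Outer chunks first: sort by start, then by decreasing end.
    return sorted(best.values(), key=lambda c: (c[0], -c[1]))


def render(tokens, chunks):
    """Render properly nested chunks as a bracketed string."""
    opens, closes = {}, {}
    for start, end, label, _head in chunks:
        opens.setdefault(start, []).append("[" + label)
        closes[end - 1] = closes.get(end - 1, 0) + 1
    parts = []
    for i, token in enumerate(tokens):
        parts.extend(opens.get(i, []))
        parts.append(token + "]" * closes.get(i, 0))
    return " ".join(parts)


TOKENS = ["eine", "gewisse", "Faszination", "des", "Schattens"]
COLLECTED = [
    (2, 3, "NP", 2),  # a) Faszination
    (1, 5, "NP", 2),  # b) gewisse Faszination des Schattens
    (0, 5, "NP", 2),  # c) eine gewisse Faszination des Schattens
    (3, 5, "NP", 4),  # d) des Schattens
]

print(render(TOKENS, largest_per_head(COLLECTED)))
# [NP eine gewisse Faszination [NP des Schattens]]
```

Structures a) to c) share the head Faszination, so only c) survives; d) has a different head and is kept as an embedded NP, giving structure e).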

55 IMS Universität Stuttgart 55 Evaluation on automatic PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    89.93      91.67      89.43      91.68
PP    94.05      89.67      94.04      89.65
AP    84.24      89.25      83.67      89.59
VC    -          -          97.72      96.62

56 IMS Universität Stuttgart 56 Evaluation on ideal PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    96.36      96.51      95.55      96.47
PP    98.08      96.51      98.07      96.50
AP    96.39      97.50      96.12      97.45
VC    -          -          99.01      98.59

57 IMS Universität Stuttgart 57 Extraction Advantage of the system Goal Sample Extraction

58 IMS Universität Stuttgart 58 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

59 IMS Universität Stuttgart 59 Goal provide a fine-grained syntactic classification of the extracted data at the level of subcategorization scrambling adjectives subcategorizing clauses combinatory preferences with verbs syntactic behavior

60 IMS Universität Stuttgart 60 Target data
predicative(-like) constructions: Es war klar, daß ... (It was clear, that ...)
... with adverbial pronoun: Er ist davon überzeugt, daß ... (He is of it convinced, that ...)
... with reflexive pronoun: Es zeigt sich deutlich, daß ... (It shows itself clear, that ...)

61 IMS Universität Stuttgart 61 Target data
... with infinitival clauses: Es ist möglich, ihn zu besuchen. (It is possible, him to visit.)
... with clause in topicalized position: Daß ..., ist klar. (That ..., is clear.) Ihn zu besuchen, ist möglich. (Him to visit, is possible.)

62 IMS Universität Stuttgart 62 Sample query adjective + verb + finite clause VC AP CL

63 IMS Universität Stuttgart 63 Sample query adjective + verb + finite clause VC AP pred CL fin

64 IMS Universität Stuttgart 64 Sample query adjective + verb + finite clause VC Adjuncts* AP pred CL fin

65 IMS Universität Stuttgart 65 Sample query adjective + verb + finite clause VC (AdvP|PP|NP temp |CL rel )* AP pred CL fin
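How such a query could be run over chunk-annotated sentences is sketched below. This is not CQP syntax and not the system's code. It assumes, hypothetically, that each sentence is available as a list of (label, head lemma) pairs, with properties such as pred, fin, temp, and rel appended to the label after a colon; the three example sentences are invented.

```python
import re
from collections import Counter

# VC (AdvP|PP|NP temp|CL rel)* AP pred CL fin, over space-terminated labels.
QUERY = re.compile(
    r"(?<!\S)VC (?:(?:AdvP|PP|NP:temp|CL:rel) )*AP:pred CL:fin "
)


def extract(sentence):
    """Yield (adjective lemma, verb lemma) for each match of QUERY."""
    seq = "".join(label + " " for label, _lemma in sentence)
    for m in QUERY.finditer(seq):
        start = seq.count(" ", 0, m.start())  # index of the VC
        end = seq.count(" ", 0, m.end())      # index after the finite clause
        verb = sentence[start][1]
        adjective = sentence[end - 2][1]      # AP directly precedes the clause
        yield adjective, verb


# Invented example sentences (hypothetical annotation format).
SENTENCES = [
    [("NP", "es"), ("VC", "sein"), ("AP:pred", "klar"), ("CL:fin", "daß")],
    [("NP", "es"), ("VC", "bleiben"), ("AdvP", "weiterhin"),
     ("AP:pred", "unklar"), ("CL:fin", "ob")],
    [("NP", "er"), ("VC", "sein"), ("AP:pred", "müde")],  # no finite clause
]

counts = Counter(pair for s in SENTENCES for pair in extract(s))
for (adjective, verb), n in sorted(counts.items()):
    print(adjective, verb, n)
```

Counting the resulting pairs per adjective and verb gives a frequency table of the same kind as the ones on the following slides.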

66 IMS Universität Stuttgart 66 adjective + verb + finite clause (columns: sein, bleiben, machen, werden) fraglich326343 unklar320103 klar2254130 offen22840 möglich160302 wichtig1802 deutlich59734 total150017716875

68 IMS Universität Stuttgart 68 Topicalized finite clause adjective + verb + finite clause CL fin VC (AdvP|PP|NP temp |CL rel )* AP pred

69 IMS Universität Stuttgart 69 adjective + verb + finite clause
           fincl_ex  fincl_top  total
fraglich   91        335        426
unklar     13        413        426
klar       221       159        380
offen      19        266        285
möglich    207       4          211
wichtig    192       9          201
deutlich   139       22         161

71 IMS Universität Stuttgart 71 adjective + verb + infinitival clause (columns: sein, fallen, haben, werden, machen) bereit43146 schwer1622211083326 möglich5324035 schwierig2459312 leicht1205931816 nötig1124827 erforderlich102115 total1708280195183111

73 IMS Universität Stuttgart 73 low freq adj + verb + infin clause (columns: stehen, bringen, haben, sein) frei354 satt1910 fertig241

74 IMS Universität Stuttgart 74 low freq adj + verb + clause (columns: stehen, bringen, haben, sein) frei376 satt2711 fertig261

75 IMS Universität Stuttgart 75 adjective subcategorization APs with PP complements embedded in NPs Die [ AP dafür erforderlichen] 300 000 Mark The for this needed 300 000 Marks The 300 000 Marks needed for this Der [ AP auf Sport spezialisierte] Journalist The on sports specialised journalist The journalist specialising in sports

76 IMS Universität Stuttgart 76 multiword units and abbreviations chunks/phrases in brackets or quotes multiword units Teenage Mutant Hero Turtle (FC Italia Frankfurt) abbreviations Deutscher Aktienindex (Dax) Stickstoffdioxyd (NO2)

77 IMS Universität Stuttgart 77 Conclusion recursive chunking workable compromise between depth of analysis and robustness extracted data show correlation between collocational preference subcategorization frames semantic classes of adjectives to a certain extent distributional preferences

