1 IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

2 IMS Universität Stuttgart 2 Introduction Motivation computational lexicography corpus linguistics Approaches to text analysis symbolic vs. probabilistic approaches hand-written vs. learned on-line queries vs. chunking vs. full parsing Requirements for the extraction tool for the corpus annotation classical chunking

3 IMS Universität Stuttgart 3 Motivation maintenance of consistency and completeness within lexica computer assisted methods lexical engineering scalable lexicographic work process processes reproducible on large amounts of text statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research full parsers are not robust enough need for analyzing tools that meet the specific needs of corpus linguistic studies

4 IMS Universität Stuttgart 4 Dictionaries for human use printed monolingual dictionaries electronic dictionaries machine readable dictionaries for NLP applications

5 IMS Universität Stuttgart 5 Printed monolingual dictionaries intend to cover most important semantic and syntactic aspects maintenance of consistency and completeness is a problem: information is missing entries are incomplete information is not consistent language changes have to be covered

6 IMS Universität Stuttgart 6 Electronic dictionaries enormous amounts of information can be stored in a compact format search engines allow for easy and fast access to desired data users can choose how much and what kind of information they are interested in reference corpus as additional knowledge source

7 IMS Universität Stuttgart 7 Machine readable dictionaries NLP applications need detailed and consistent information about words detailed morphological information subcategorization frames of verbs, adjectives, nouns specific syntactic information selectional preferences collocations idiomatic usage

8 IMS Universität Stuttgart 8 Information needed syntactic information subcategorization patterns semantic information selectional preferences, collocations synonyms multi-word units lexical classes morphological information case, number, gender compounding and derivation

9 IMS Universität Stuttgart 9 Requirements for the tool it has to work on unrestricted text shortcomings in the grammar should not lead to a complete failure to parse no manual checking should be required should provide a clearly defined interface annotation should follow linguistic standards

10 IMS Universität Stuttgart 10 Requirements for the annotation head lemma morpho-syntactic information lexical-semantic information structural and textual information hierarchical representation

11 IMS Universität Stuttgart 11 A corpus linguistic approach

12 IMS Universität Stuttgart 12 Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

13 IMS Universität Stuttgart 13 Three different dimensions type of grammar symbolic grammar probabilistic grammar type of grammar development hand-written grammar learning methods depth of analysis analysis on token level only full parsing partial parsing

14 IMS Universität Stuttgart 14 Symbolic approaches +precise rules can be formulated +lexical knowledge can be included +results can be predicted and controlled -sometimes not sufficient to solve ambiguities -only phenomena which are explicit in the grammar can be dealt with

15 IMS Universität Stuttgart 15 Unification-based grammars usually complex grammars model the hierarchical structure of language handle attachment ambiguities determine relations among constituents and their grammatical function extensive use of lexical information richness and complexity of rules do not only solve ambiguities, but produce them as well usually large number of possible analyses

16 IMS Universität Stuttgart 16 Context-free Grammars (CFG) formal grammars consisting of a set of recursive rewriting rules small and modular grammar minimal interaction among rules parsing process usually fast covers only basic aspects of language robustness rules are used to overcome shortcomings in the grammar

17 IMS Universität Stuttgart 17 Probabilistic approaches +supervised or unsupervised training of rules +all possible analyses are produced +no need for comprehensive lexical or linguistic knowledge +rules can be left underspecified -depend on the training corpus -highly frequent phenomena are preferred over low frequent phenomena

18 IMS Universität Stuttgart 18 Probabilistic context-free grammar CFG rules enriched by probability make use of underspecification not as fast as CFG special case: head lexicalized context-free grammar unsupervised grammar rules are indexed by the lemma of the syntactic head extraction is performed on the rule set rather than on the annotated corpus

19 IMS Universität Stuttgart 19 Hand-written rules +good control of the rule system +negative evidence can be taken into account -depends heavily on the expertise of the grammar writer

20 IMS Universität Stuttgart 20 Learning grammar rules +infer grammar from text corpora +extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules) +optimal or suboptimal training data +new resources in the form of text corpora can be exploited +more or less independent of the knowledge of the grammar developer -depends heavily on the learning corpus -needs an annotated, well-balanced corpus

21 IMS Universität Stuttgart 21 memory based learning special case of learning most prominent is the data oriented parsing (DOP) fragments are stored and as such replace the grammar language generation and analysis is performed by combining the memorized fragments needs structurally annotated corpus the training corpus has great impact on the performance of the system highly sensitive to suboptimal data needs large storage capacity

22 IMS Universität Stuttgart 22 Annotation on token level +usually a form of pattern matching +completely flexible +does not depend on previous syntactic analysis +easily adaptable to different text types -full syntactic analysis has to be performed by extraction queries -queries can become rather complex -often restricted to simple contexts

23 IMS Universität Stuttgart 23 Full Parsing +provides rich and detailed information about structures, relations and functions +extraction queries simply have to collect the annotated information -slow parsing speed -lack of robustness -depend heavily on prerequisite lexical information -ambiguous output

24 IMS Universität Stuttgart 24 Chunking +relatively simple grammar rules +no need for extensive linguistic and lexicographic information +robust -usually non-hierarchical and non-recursive structures -annotated structures are simple and convey less information

25 IMS Universität Stuttgart 25 Classical chunk definition Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head

26 IMS Universität Stuttgart 26 State-of-the-art systems CASS parser finite-state cascades flat, non-recursive structures small lexicon (tag-fixes) information about the head is given as an attribute Conexor symbolic constraint grammar parser full-fledged grammar for English (ENGCG) German: simple, non-recursive structure no lexical information available head lemma indicated by a special tag

27 IMS Universität Stuttgart 27 State-of-the-art systems KaRoParse top-down bottom-up parser includes recursion internal structure is flat and non-hierarchical no agreement or lexical information Schiehlen's chunker symbolic context free grammar recursion no head lemma or lexical-semantic information needs optimally tokenized text (including MWL recognition)

28 IMS Universität Stuttgart 28 State-of-the-art systems Chunkie uses TnT-tagger to assign tree fragments to sequences of PoS-tags recursion in pre-head position (maximal depth of three) head lemma information, yet no agreement or lexical information Cascaded Markov Models stochastic context free grammar rules several layers, each layer serving as input to the next hierarchical phrases, including complex recursion head lemma information, yet no agreement or lexical information

29 IMS Universität Stuttgart 29 Problems for extraction Kübler and Hinrichs (2001) focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances.

30 IMS Universität Stuttgart 30 An example
1. [ PC mit kleinen ], [ PC über die Köpfe ] [ NC der Apostel ] [ NC gesetzten Flammen ]
   with small above the heads the apostles set flames
2. [ PP mit [ NP [ AP kleinen ], [ AP über [ NP die Köpfe [ NP der Apostel ] ] gesetzten ] Flammen ] ]
   with small above the heads the apostles set flames
   'with small flames set above the heads of the apostles'

31 IMS Universität Stuttgart 31 Problems for extraction
four NCs instead of only one NP
AN-pair: (+) gesetzten + Flammen, (-) kleine + Flammen
NN-pair: Köpfe + Apostel needs agreement information
VN-pair: setzen + Flammen needs information about the deverbal character of gesetzten
a more complex analysis is needed
PCs and NCs need to be combined

32 IMS Universität Stuttgart 32 Simple solution PP → PC (PC|NC)* theoretical motivation? rule covers this particular example, other examples might need additional rules rule is vague and largely underspecified not very reliable internal structure is mainly left opaque

33 IMS Universität Stuttgart 33 Complex solution 1. NP → NC NC_gen 2. PP → preposition NP 3. AP → PP adjective 4. NP → AP* noun

34 IMS Universität Stuttgart 34 Complex solution solution for this particular example only large number of rules needed rules have to be repeated for every instance of a complex phrase in order to support extractions, the classic chunk concept has to be extended

35 IMS Universität Stuttgart 35 Conclusion
Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output
Full Parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output
YAC

36 IMS Universität Stuttgart 36 Conclusion recursive chunking workable compromise between depth of analysis and robustness extracted data show correlation between collocational preference subcategorization frames semantic classes of adjectives to a certain extent distributional preferences

37 IMS Universität Stuttgart 37 General Concept a recursive chunker for unrestricted German text technical framework CWB CQP output formats advantages of the architecture general framework of YAC linguistic coverage feature annotation chunking process

38 IMS Universität Stuttgart 38 A recursive chunker for unrestricted German text recursive chunker for unrestricted German text fully automatic analysis main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora

39 IMS Universität Stuttgart 39 General aspects based on a symbolic regular expression grammar; grammar rules written in CQP; basis: tokenization, PoS-tagging, lemmatization, agreement information (Tree Tagger, IMSLex)

40 IMS Universität Stuttgart 40 A typical chunker robust – works on unrestricted text works fully automatically does not provide full but partial analysis of text no highly ambiguous attachment decisions are made

41 IMS Universität Stuttgart 41 YAC goes beyond extends the chunk definition of Abney 1.recursive embedding 2.post-head embedding provides additional information about annotated chunks 1.head lemma 2.agreement information 3.lexical-semantic and structural properties

42 IMS Universität Stuttgart 42 Extended chunk definition A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers but no PP-attachment, or sentential elements.

43 IMS Universität Stuttgart 43 Technical Framework [Diagram with the components: corpus, Perl-Scripts, grammar rules, lexicon, rule application, annotation of results, post-processing]

44 IMS Universität Stuttgart 44 Technical framework - CQP regular expression matching on token and annotation strings tests for membership in user specific word lists feature set operations constraints to specify dependencies

45 IMS Universität Stuttgart 45 Perl-Scripts invocation of CQP processing of the results annotation of the results into the corpus

46 IMS Universität Stuttgart 46 Postprocessing values can be checked values can be changed values can be compared range of structures can be changed

47 IMS Universität Stuttgart 47 Output formats CQP format, used for: interactive grammar development parsing extraction an XML format, used for: hierarchy building extraction data exchange

48 IMS Universität Stuttgart 48 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

49 IMS Universität Stuttgart 49 Linguistic coverage Adverbial phrases (AdvP)
a) schön stark (beautifully strong)
b) daher (from there); irgendwoher (from anywhere)
c) heim (home); querfeldein (cross-country)
d) innen (inside); überall (everywhere)
e) "sehr bald" (very soon)
f) jetzt (now); damals (at that time)

50 IMS Universität Stuttgart 50 Linguistic coverage Adjectival phrases (AP)
a) möglich (possible)
b) schreiend lila (screamingly purple)
c) rund zwei Meter hohe (around two meter high)
d) über die Köpfe der Apostel gesetzten (above the heads of the apostles set; 'set above the heads of the apostles')

51 IMS Universität Stuttgart 51 Linguistic coverage Noun phrases (NP)
a) Oktober (October); er (he)
b) 4,9 Milliarden Euro (4.9 billion Euros)
c) "Frankensteins Fluch" ("Frankenstein's curse")
d) kleine, über die Köpfe der Apostel gesetzten Flammen (small, above the heads of the apostles set flames; 'small flames set above the heads of the apostles')

52 IMS Universität Stuttgart 52 Linguistic coverage Prepositional phrases (PP)
a) davon (thereof)
b) zwischen Basel und St. Moritz (between Basel and St. Moritz)
c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen (with small, above the heads of the apostles set flames; 'with small flames set above the heads of the apostles')

53 IMS Universität Stuttgart 53 Linguistic coverage Verbal complexes (VC)
a) gemunkelt (rumored)
b) muß gerechnet werden (has counted to be; 'has to be counted')
c) zu bekommen (to get)
d) bekommen zu haben (gotten to have; 'to have gotten')

54 IMS Universität Stuttgart 54 Linguistic coverage Clauses (CL)
a) …, daß selbst Ravel sich amüsiert hätte. (…, that even Ravel himself enjoyed had. '…, that even Ravel would have enjoyed.')
b) …, die man in der griechischen Tragödie findet. (…, which one in the Greek tragedy finds. '…, which one finds in the Greek tragedy.')

55 IMS Universität Stuttgart 55 Linguistic coverage Clauses (CL)
a) …, Instrumente selbst zu bauen. (…, instruments oneself to build. '…, to build instruments oneself.')
b) …, um einen Kaffee zu trinken. (…, in order a coffee to drink. '…, in order to drink a coffee.')

56 IMS Universität Stuttgart 56 Feature annotation head lemma morpho-syntactic information lexical-semantic properties

57 IMS Universität Stuttgart 57 Feature annotation
feature value: AdvP AP NP PP VC CL
lexical-semantic: X X X X X X
head lemma: X X X X X X
agreement info: X X X
verbal head lemma: X

58 IMS Universität Stuttgart 58 Head lemma lemma attribute at the head position normally a single token multi-word proper nouns have a multi-token head lemma a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive) head lemma of PP: preposition:noun

59 IMS Universität Stuttgart 59 Morpho-syntactic information intersection of the morpho-syntactic information of relevant elements invariant elements are not considered no guessing involved to solve ambiguities

60 IMS Universität Stuttgart 60 Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| vierten Platz
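
The intersection described on the morpho-syntactic information slide can be sketched in a few lines of Python. This is only an illustration, not the actual YAC code: the readings for "den" are copied from the slide, while the readings for "vierten" and "Platz", the function names, and the decision to compare only case, gender and number are assumptions made for the example.

```python
def agreement_key(reading):
    """Reduce a reading such as 'Akk:M:Sg:Def' to (case, gender, number)."""
    case, gender, number = reading.split(":")[:3]
    return (case, gender, number)


def intersect_agreement(token_readings):
    """Intersect the readings of all inflecting tokens in a chunk.

    Tokens with an empty reading list are treated as invariant and skipped.
    """
    sets = [{agreement_key(r) for r in readings}
            for readings in token_readings if readings]
    if not sets:
        return set()
    return set.intersection(*sets)


# Readings for "den" are taken from the slide.
den = ["Akk:M:Sg:Def", "Dat:F:Pl:Def", "Dat:M:Pl:Def", "Dat:N:Pl:Def"]
# Readings below are assumed for illustration only.
vierten = ["Akk:M:Sg", "Dat:M:Sg", "Dat:F:Pl", "Dat:M:Pl", "Dat:N:Pl"]
platz = ["Nom:M:Sg", "Akk:M:Sg", "Dat:M:Sg"]

chunk_agreement = intersect_agreement([den, vierten, platz])
print(chunk_agreement)
```

Under these assumed readings only the accusative masculine singular reading survives for the whole chunk. No reading is guessed: if several readings remained, all of them would be kept.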

64 IMS Universität Stuttgart 64 Lexical-semantic properties important for parsing as well as for extraction properties can be triggers for specific internal structures, functions, and usages properties inherent in the corpus PoS-tags Johann Sebastian Bach NE NE NE text markers "Wilhelm Meisters Lehrjahre" NE NN NN

65 IMS Universität Stuttgart 65 Lexical-semantic properties properties determined by external knowledge sources (lexica, ontologies, word lists) locality: hier (here); dort (there); Stuttgart temporality: Jahr (year); damals (at that time) derivation: gesetzten (set) deverbal adjective

66 IMS Universität Stuttgart 66 Lexical-semantic properties structural information complex embeddings [ AP [ PP über die Köpfe der Apostel ] gesetzten ] above the heads of the apostles set ' set above the heads of the apostles' [ AP [ NP der "Inkatha"-Partei ] angehörenden ] to the Inkatha-party belonging 'belonging to the Inkatha-party'

67 IMS Universität Stuttgart 67 Some properties of NPs card: cardinal noun; meas: measure noun; ne: named entity; quot: NP in quotation marks; street: street address; temp: temporal noun; date; pron: pronominal NP

68 IMS Universität Stuttgart 68 Other lexical-semantic properties VC with separated prefix: pref Er kommt an (he arrives) PP with contracted preposition and article: fus am Bahnhof (at the station) complex APs embedding PPs: pp über die Köpfe der Apostel gesetzten above the heads of the apostles set 'set above the heads of the apostles' AP with deverbal adjectives: vder

69 IMS Universität Stuttgart 69 Chunking process [Diagram with the components: Corpus, Lexicon, First Level, Second Level, Third Level]

70 IMS Universität Stuttgart 70 First level basic (non-recursive) chunks chunks with specific internal structure a)Ende September (end of September) b)Jahre später (years later) c)21. Juli 2003 d)Johann Sebastian Bach lexical information is introduced within the rules themselves within the Perl-scripts

71 IMS Universität Stuttgart 71 Advantages specific rules do not interact with main parsing rules additional (e.g. domain specific) rules can be included easily main parsing rules can be kept simple number of main parsing rules can be kept small

72 IMS Universität Stuttgart 72 Second level main parsing level relatively simple and general rules a) AP → AdvP? (PP|NP)* AC b) NP → Determiner? Cardinal? AP* NC c) PP → Preposition (NP|AdvP) complex (recursive) structures are built in several iterations
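
How a handful of simple rules can build recursive structures over several iterations can be sketched as follows. This is a hypothetical Python illustration, not the CQP grammar used in YAC: the rule encoding, the tag names `Det`, `Card`, `Prep`, and the chart representation are assumptions. Each iteration matches the three second-level rules against the tokens and against the chunks found so far, and it stops when no new chunk is added.

```python
MANY = 99  # stand-in for an unbounded repetition

# Each rule: (category, [(allowed labels, min repeats, max repeats), ...])
RULES = [
    ("AP", [({"AdvP"}, 0, 1), ({"PP", "NP"}, 0, MANY), ({"AC"}, 1, 1)]),
    ("NP", [({"Det"}, 0, 1), ({"Card"}, 0, 1), ({"AP"}, 0, MANY), ({"NC"}, 1, 1)]),
    ("PP", [({"Prep"}, 1, 1), ({"NP", "AdvP"}, 1, 1)]),
]


def match_ends(chart, pos, elements):
    """Return every end position reachable by matching elements from pos."""
    if not elements:
        return {pos}
    (labels, lo, hi), rest = elements[0], elements[1:]
    ends = set()
    if lo == 0:  # the element may be skipped
        ends |= match_ends(chart, pos, rest)
    if hi > 0:   # the element may consume one more token or chunk
        for label, start, end in chart:
            if start == pos and label in labels:
                remaining = [(labels, max(lo - 1, 0), hi - 1)] + rest
                ends |= match_ends(chart, end, remaining)
    return ends


def chunk(tags, max_iterations=10):
    """Apply all rules repeatedly until no new chunk is found."""
    chart = {(tag, i, i + 1) for i, tag in enumerate(tags)}
    for _ in range(max_iterations):
        snapshot = frozenset(chart)
        for category, elements in RULES:
            for pos in range(len(tags)):
                for end in match_ends(snapshot, pos, elements):
                    chart.add((category, pos, end))
        if len(chart) == len(snapshot):
            break
    return chart


# eine für den Anwender verständliche Sprache
tags = ["Det", "Prep", "Det", "NC", "AC", "NC"]
chart = chunk(tags)
print(("NP", 0, 6) in chart)  # the full NP spanning all six tokens
```

The sketch deliberately overgenerates: it also records smaller and linguistically implausible spans, because it models neither agreement checks nor lexical constraints. Selecting the largest structure per head is a separate step.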

73 IMS Universität Stuttgart 73 Rule blocks

74 IMS Universität Stuttgart 74 Second level complexity of phrases is achieved by the embedding of complex structures rather than by complex rules
a) [ NP eine [ AP verständliche ] Sprache ]
   an understandable language
b) [ NP eine [ AP für den Anwender verständliche ] Sprache ]
   a for the user understandable language
   'a language understandable for the user'

75 IMS Universität Stuttgart 75 Second level
a) [ PP auf [ NP dem Giebel ] ]
   on top of the gable
b) [ PP auf [ NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses ] ]
   on top of the westwards pointed gable of the today in baroque garment appearing Lord's house

76 IMS Universität Stuttgart 76 Third level chunks of related but different categories can be subsumed under one category: NPs with determiner (NP), NPs without determiner (NCC), base noun chunks (NC) → NP; coordination of maximal chunks; decisions are made which need full recursive chunks: adverbially and predicatively used adjectives can only be differentiated by the actual usage (adverbially used AP → AdvP)

77 IMS Universität Stuttgart 77 Hierarchy building resulting structures of all parsing stages are collected and stored in XML-files after the parsing process collected structures are combined into a hierarchical structure only the largest instance of a structure (sharing the same head) is taken into account

78 IMS Universität Stuttgart 78 Hierarchy building a)[ NP Faszination ] fascination b)[ NP gewisse Faszination des Schattens ] certain fascination of the shadow c)[ NP eine gewisse Faszination des Schattens ] a certain fascination of the shadow d)[ NP des Schattens ] of the shadow e)[ NP eine gewisse Faszination [ NP des Schattens ] ] a certain fascination of the shadow
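
The selection and nesting step shown in this example can be sketched in Python. This is a minimal illustration under stated assumptions: chunks are represented as hypothetical `(category, start, end, head)` tuples over token positions, the head index is assumed to be known from the head lemma annotation, and the collected chunks are assumed not to cross each other. The real system works on XML files, which the sketch does not model.

```python
def keep_largest(chunks):
    """Keep only the largest chunk for each (category, head) pair."""
    best = {}
    for category, start, end, head in chunks:
        key = (category, head)
        if key not in best or (end - start) > (best[key][2] - best[key][1]):
            best[key] = (category, start, end, head)
    # Outer chunks first: sort by start, then by descending end.
    return sorted(best.values(), key=lambda c: (c[1], -c[2]))


def to_brackets(tokens, chunks):
    """Render the selected chunks as a nested bracket string."""
    opens = {}   # token index -> categories opening here, outermost first
    closes = {}  # token index -> number of chunks closing after this token
    for category, start, end, _head in keep_largest(chunks):
        opens.setdefault(start, []).append(category)
        closes[end - 1] = closes.get(end - 1, 0) + 1
    parts = []
    for i, token in enumerate(tokens):
        parts.extend("[" + category for category in opens.get(i, []))
        parts.append(token)
        parts.extend("]" for _ in range(closes.get(i, 0)))
    return " ".join(parts)


tokens = ["eine", "gewisse", "Faszination", "des", "Schattens"]
collected = [
    ("NP", 2, 3, 2),  # Faszination
    ("NP", 1, 5, 2),  # gewisse Faszination des Schattens
    ("NP", 0, 5, 2),  # eine gewisse Faszination des Schattens
    ("NP", 3, 5, 4),  # des Schattens
]
print(to_brackets(tokens, collected))
# [NP eine gewisse Faszination [NP des Schattens ] ]
```

The three NPs headed by "Faszination" collapse into the largest one, and the NP headed by "Schattens" is kept as an embedded chunk, which corresponds to structure e) on the slide.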

79 IMS Universität Stuttgart 79 Evaluation on automatic PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    89.93      91.67      89.43      91.68
PP    94.05      89.67      94.04      89.65
AP    84.24      89.25      83.67      89.59
VC    -          -          97.72      96.62

80 IMS Universität Stuttgart 80 Evaluation on ideal PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    96.36      96.51      95.55      96.47
PP    98.08      96.51      98.07      96.50
AP    96.39      97.50      96.12      97.45
VC    -          -          99.01      98.59
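
For readers unfamiliar with the two measures, the following Python sketch shows one common way to compute precision and recall for chunks. The slides do not say how the reported figures were obtained, so the exact-match criterion (same category, same start, same end) and the toy data are assumptions.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (category, start, end) chunks."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {("NP", 0, 3), ("PP", 3, 6), ("NP", 4, 6)}
predicted = {("NP", 0, 3), ("PP", 3, 6), ("NP", 5, 6), ("AP", 1, 2)}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four predicted chunks are correct and two of the three gold chunks are found.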

81 IMS Universität Stuttgart 81 Extraction Advantage of the system Goal Sample Extraction

82 IMS Universität Stuttgart 82 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

83 IMS Universität Stuttgart 83 Goal provide a fine-grained syntactic classification of the extracted data at the level of subcategorization scrambling adjectives subcategorizing clauses combinatory preferences with verbs syntactic behavior

84 IMS Universität Stuttgart 84 Target data
predicative(-like) constructions: Es war klar, daß... (It was clear, that...)
... with adverbial pronoun: Er ist davon überzeugt, daß... (He is of it convinced, that...)
... with reflexive pronoun: Es zeigt sich deutlich, daß... (It shows itself clear, that...)

85 IMS Universität Stuttgart 85 Target data
... with infinite clauses: Es ist möglich, ihn zu besuchen. (It is possible, him to visit.)
... with clause in topicalized position: Daß..., ist klar. (That..., is clear.) Ihn zu besuchen, ist möglich. (Him to visit, is possible.)

86 IMS Universität Stuttgart 86 Sample query adjective + verb + finite clause VC AP CL

87 IMS Universität Stuttgart 87 Sample query adjective + verb + finite clause VC AP_pred CL_fin

88 IMS Universität Stuttgart 88 Sample query adjective + verb + finite clause VC Adjuncts* AP_pred CL_fin

89 IMS Universität Stuttgart 89 Sample query adjective + verb + finite clause VC (AdvP|PP|NP_temp|CL_rel)* AP_pred CL_fin
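
The final form of the sample query, a verbal complex followed by optional adjuncts, a predicative AP and a finite clause, can be imitated with an ordinary regular expression over a sequence of chunk labels. The sketch below uses Python's `re` module rather than CQP syntax, and the label names with underscores (`AP_pred`, `CL_fin`, `NP_temp`, `CL_rel`) are an assumed encoding of the subscripted categories on the slide.

```python
import re

# VC (AdvP|PP|NP_temp|CL_rel)* AP_pred CL_fin, one space after every label
QUERY = re.compile(
    r"(?:^|(?<= ))VC (?:(?:AdvP|PP|NP_temp|CL_rel) )*AP_pred CL_fin "
)


def find_matches(labels):
    """Return (start, end) label positions of every match of the query."""
    text = "".join(label + " " for label in labels)
    spans = []
    for match in QUERY.finditer(text):
        # Every label ends with one space, so counting spaces gives positions.
        start = text.count(" ", 0, match.start())
        end = text.count(" ", 0, match.end())
        spans.append((start, end))
    return spans


# Es | war | gestern | klar | , daß ...
labels = ["NP", "VC", "AdvP", "AP_pred", "CL_fin"]
print(find_matches(labels))
```

A plain `NP` between the verbal complex and the adjective blocks the match, since only temporal NPs are admitted as adjuncts in the query.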

90 IMS Universität Stuttgart 90 adjective + verb + finite clause sein bleiben machen werden fraglich326343 unklar320103 klar2254130 offen22840 möglich160302 wichtig1802 deutlich59734 total150017716875

92 IMS Universität Stuttgart 92 Topicalized finite clause adjective + verb + finite clause CL_fin VC (AdvP|PP|NP_temp|CL_rel)* AP_pred

93 IMS Universität Stuttgart 93 adjective + verb + finite clause
           fincl_ex  fincl_top  total
fraglich   91        335        426
unklar     13        413        426
klar       221       159        380
offen      19        266        285
möglich    207       4          211
wichtig    192       9          201
deutlich   139       22         161

95 IMS Universität Stuttgart 95 adjective + verb + infinite clause sein fallen haben werden machen bereit43146 schwer1622211083326 möglich5324035 schwierig2459312 leicht1205931816 nötig1124827 erforderlich102115 total1708280195183111

97 IMS Universität Stuttgart 97 low freq adj + verb + infin clause stehen bringen haben sein frei354 satt1910 fertig241

98 IMS Universität Stuttgart 98 low freq adj + verb + clause stehen bringen haben sein frei376 satt2711 fertig261

99 IMS Universität Stuttgart 99 adjective subcategorization APs with PP complements embedded in NPs
Die [ AP dafür erforderlichen ] 300 000 Mark
The for this needed 300 000 Marks
'The 300 000 Marks needed for this'
Der [ AP auf Sport spezialisierte ] Journalist
The on sports specialised journalist
'The journalist specialising in sports'

100 IMS Universität Stuttgart 100 multiword units and abbreviations chunks/phrases in brackets or quotes multiword units Teenage Mutant Hero Turtle (FC Italia Frankfurt) abbreviations Deutscher Aktienindex (Dax) Stickstoffdioxyd (NO2)

101 IMS Universität Stuttgart 101 Conclusion recursive chunking workable compromise between depth of analysis and robustness extracted data show correlation between collocational preference subcategorization frames semantic classes of adjectives to a certain extent distributional preferences

