IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes.

Presentation transcript:

IMS Universität Stuttgart 2 Introduction Motivation computational lexicography corpus linguistics Approaches to text analysis symbolic vs. probabilistic approaches hand-written vs. learned on-line queries vs. chunking vs. full parsing Requirements for the extraction tool for the corpus annotation classical chunking

IMS Universität Stuttgart 3 Motivation: maintenance of consistency and completeness within lexica; computer-assisted methods; lexical engineering; scalable lexicographic work process; processes reproducible on large amounts of text; statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research; full parsers are not robust enough; need for analysis tools that meet the specific needs of corpus linguistic studies

IMS Universität Stuttgart 4 Dictionaries: for human use: printed monolingual dictionaries, electronic dictionaries; machine readable dictionaries for NLP applications

IMS Universität Stuttgart 5 Printed monolingual dictionaries intend to cover most important semantic and syntactic aspects maintenance of consistency and completeness is a problem: information is missing entries are incomplete information is not consistent language changes have to be covered

IMS Universität Stuttgart 6 Electronic dictionaries enormous amounts of information can be stored in a compact format search engines allow for easy and fast access to desired data users can choose how much and what kind of information they are interested in reference corpus as additional knowledge source

IMS Universität Stuttgart 7 Machine readable dictionaries NLP applications need detailed and consistent information about words detailed morphological information subcategorization frames of verbs, adjectives, nouns specific syntactic information selectional preferences collocations idiomatic usage

IMS Universität Stuttgart 8 Information needed syntactic information subcategorization patterns semantic information selectional preferences, collocations synonyms multi-word units lexical classes morphological information case, number, gender compounding and derivation

IMS Universität Stuttgart 9 Requirements for the tool it has to work on unrestricted text shortcomings in the grammar should not lead to a complete failure to parse no manual checking should be required should provide a clearly defined interface annotation should follow linguistic standards

IMS Universität Stuttgart 10 Requirements for the annotation head lemma morpho-syntactic information lexical-semantic information structural and textual information hierarchical representation

IMS Universität Stuttgart 11 A corpus linguistic approach

IMS Universität Stuttgart 12 Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

IMS Universität Stuttgart 13 Three different dimensions type of grammar symbolic grammar probabilistic grammar type of grammar development hand-written grammar learning methods depth of analysis analysis on token level only full parsing partial parsing

IMS Universität Stuttgart 14 Symbolic approaches: + precise rules can be formulated; + lexical knowledge can be included; + results can be predicted and controlled; - sometimes not sufficient to solve ambiguities; - only phenomena which are explicit in the grammar can be dealt with

IMS Universität Stuttgart 15 Unification-based grammars: usually complex grammars; model the hierarchical structure of language; handle attachment ambiguities; determine relations among constituents and their grammatical function; extensive use of lexical information; richness and complexity of rules not only resolve ambiguities but produce them as well; usually a large number of possible analyses

IMS Universität Stuttgart 16 Context-free Grammars (CFG) formal grammars consisting of a set of recursive rewriting rules small and modular grammar minimal interaction among rules parsing process usually fast covers only basic aspects of language robustness rules are used to overcome shortcomings in the grammar

IMS Universität Stuttgart 17 Probabilistic approaches: + supervised or unsupervised training of rules; + all possible analyses are produced; + no need for comprehensive lexical or linguistic knowledge; + rules can be left underspecified; - depend on the training corpus; - highly frequent phenomena are preferred over low-frequency phenomena

IMS Universität Stuttgart 18 Probabilistic context-free grammar CFG rules enriched by probability make use of underspecification not as fast as CFG special case: head lexicalized context-free grammar unsupervised grammar rules are indexed by the lemma of the syntactic head extraction is performed on the rule set rather than on the annotated corpus
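
As a rough illustration of "CFG rules enriched by probability", the sketch below scores one derivation as the product of the probabilities of the rules it uses. The toy grammar, its probability values, and the function name `derivation_probability` are assumptions made for this example; none of them come from the slides.

```python
from math import prod

# Hypothetical toy PCFG: for each left-hand side the rule probabilities sum to 1.
PCFG = {
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("PP", ("P", "NP")): 1.0,
}

def derivation_probability(rules):
    """Multiply the probabilities of all rules used in one derivation."""
    return prod(PCFG[rule] for rule in rules)

# Derivation of a PP such as "auf dem Giebel": PP -> P NP, then NP -> Det N.
p = derivation_probability([("PP", ("P", "NP")), ("NP", ("Det", "N"))])
print(p)
```

When a grammar licenses several derivations, a probabilistic parser of this kind would rank them by this score, which is why frequent patterns in the training corpus tend to win over rare ones.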

IMS Universität Stuttgart 19 Hand-written rules: + good control of the rule system; + negative evidence can be taken into account; - depends heavily on the expertise of the grammar writer

IMS Universität Stuttgart 20 Learning grammar rules: + infer grammar from text corpora; + extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules); + optimal or suboptimal training data; + new resources in the form of text corpora can be exploited; + more or less independent of the knowledge of the grammar developer; - depends heavily on the learning corpus; - needs an annotated, well-balanced corpus

IMS Universität Stuttgart 21 memory based learning special case of learning most prominent is the data oriented parsing (DOP) fragments are stored and as such replace the grammar language generation and analysis is performed by combining the memorized fragments needs structurally annotated corpus the training corpus has great impact on the performance of the system highly sensitive to suboptimal data needs large storage capacity

IMS Universität Stuttgart 22 Annotation on token level: + usually a form of pattern matching; + completely flexible; + does not depend on previous syntactic analysis; + easily adaptable to different text types; - full syntactic analysis has to be performed by extraction queries; - queries can become rather complex; - often restricted to simple contexts

IMS Universität Stuttgart 23 Full Parsing: + provides rich and detailed information about structures, relations and functions; + extraction queries simply have to collect the annotated information; - slow parsing speed; - lack of robustness; - depends heavily on prerequisite lexical information; - ambiguous output

IMS Universität Stuttgart 24 Chunking: + relatively simple grammar rules; + no need for extensive linguistic and lexicographic information; + robust; - usually non-hierarchical and non-recursive structures; - annotated structures are simple and convey less information

IMS Universität Stuttgart 25 Classical chunk definition Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head

IMS Universität Stuttgart 26 State-of-the-art systems: CASS parser: finite-state cascades; flat, non-recursive structures; small lexicon (tag-fixes); information about the head is given as an attribute. Conexor: symbolic constraint grammar parser; full-fledged grammar for English (ENGCG); German: simple, non-recursive structure; no lexical information available; head lemma indicated by a special tag

IMS Universität Stuttgart 27 State-of-the-art systems: KaRoParse: top-down bottom-up parser; includes recursion; internal structure is flat and non-hierarchical; no agreement or lexical information. Schiehlen's chunker: symbolic context-free grammar; recursion; no head lemma or lexical-semantic information; needs optimally tokenized text (including MWL recognition)

IMS Universität Stuttgart 28 State-of-the-art systems: Chunkie: uses TnT-tagger to assign tree fragments to sequences of PoS-tags; recursion in pre-head position (maximal depth of three); head lemma information, yet no agreement or lexical information. Cascaded Markov Models: stochastic context-free grammar rules; several layers, each layer serving as input to the next; hierarchical phrases, including complex recursion; head lemma information, yet no agreement or lexical information

IMS Universität Stuttgart 29 Problems for extraction: Kübler and Hinrichs (2001): "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."

IMS Universität Stuttgart 30 An example: 1. [PC mit kleinen], [PC über die Köpfe] [NC der Apostel] [NC gesetzten Flammen] (gloss: with small, above the heads the apostles set flames); 2. [PP mit [NP [AP kleinen], [AP über [NP die Köpfe [NP der Apostel]] gesetzten] Flammen]] (gloss: with small, above the heads the apostles set flames) 'with small flames set above the heads of the apostles'

IMS Universität Stuttgart 31 Problems for extraction: four NCs instead of only one NP; AN-pair: + gesetzten + Flammen, - kleine + Flammen; NN-pair Köpfe + Apostel: needs agreement information; VN-pair setzen + Flammen: needs information about the deverbal character of gesetzten; a more complex analysis is needed: PCs and NCs need to be combined

IMS Universität Stuttgart 32 Simple solution: PP → PC (PC|NC)*; theoretical motivation?; rule covers this particular example, other examples might need additional rules; rule is vague and largely underspecified; not very reliable; internal structure is mainly left opaque

IMS Universität Stuttgart 33 Complex solution: 1. NP → NC NC_gen; 2. PP → preposition NP; 3. AP → PP adjective; 4. NP → AP* noun

IMS Universität Stuttgart 34 Complex solution solution for this particular example only large number of rules needed rules have to be repeated for every instance of a complex phrase in order to support extractions, the classic chunk concept has to be extended

IMS Universität Stuttgart 35 Conclusion: Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output. Full Parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output. YAC

IMS Universität Stuttgart 36 Conclusion recursive chunking workable compromise between depth of analysis and robustness extracted data show correlation between collocational preference subcategorization frames semantic classes of adjectives to a certain extent distributional preferences

IMS Universität Stuttgart 37 General Concept a recursive chunker for unrestricted German text technical framework CWB CQP output formats advantages of the architecture general framework of YAC linguistic coverage feature annotation chunking process

IMS Universität Stuttgart 38 A recursive chunker for unrestricted German text recursive chunker for unrestricted German text fully automatic analysis main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora

IMS Universität Stuttgart 39 General aspects: based on a symbolic regular expression grammar; grammar rules written in CQP; basis: tokenization, PoS-tagging, lemmatization, agreement information (Tree Tagger, IMSLex)

IMS Universität Stuttgart 40 A typical chunker robust – works on unrestricted text works fully automatically does not provide full but partial analysis of text no highly ambiguous attachment decisions are made

IMS Universität Stuttgart 41 YAC goes beyond extends the chunk definition of Abney 1.recursive embedding 2.post-head embedding provides additional information about annotated chunks 1.head lemma 2.agreement information 3.lexical-semantic and structural properties

IMS Universität Stuttgart 42 Extended chunk definition A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers but no PP- attachment, or sentential elements.

IMS Universität Stuttgart 43 Technical Framework (diagram): corpus; Perl-Scripts; grammar rules; lexicon; rule application; annotation of results; post-processing

IMS Universität Stuttgart 44 Technical framework - CQP regular expression matching on token and annotation strings tests for membership in user specific word lists feature set operations constraints to specify dependencies

IMS Universität Stuttgart 45 Perl-Scripts invocation of CQP processing of the results annotation of the results into the corpus

IMS Universität Stuttgart 46 Postprocessing values can be checked values can be changed values can be compared range of structures can be changed

IMS Universität Stuttgart 47 Output formats CQP format, used for: interactive grammar development parsing extraction an XML format, used for: hierarchy building extraction data exchange

IMS Universität Stuttgart 48 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

IMS Universität Stuttgart 49 Linguistic coverage Adverbial phrases (AdvP) a)schön stark (beautifully strong) b)daher (from there); irgendwoher (from anywhere) c)heim (home); querfeldein (cross-country) d)innen (inside); überall (everywhere) e)"sehr bald" (very soon) f)jetzt (now); damals (at that time)

IMS Universität Stuttgart 50 Linguistic coverage: Adjectival phrases (AP): a) möglich (possible); b) schreiend lila (screamingly purple); c) rund zwei Meter hohe (around two meters high); d) über die Köpfe der Apostel gesetzten (gloss: above the heads of the apostles set) 'set above the heads of the apostles'

IMS Universität Stuttgart 51 Linguistic coverage: Noun phrases (NP): a) Oktober (October); er (he); b) 4,9 Milliarden Euro (4.9 billion Euros); c) "Frankensteins Fluch" ("Frankenstein's curse"); d) kleine, über die Köpfe der Apostel gesetzten Flammen (gloss: small, above the heads of the apostles set flames) 'small flames set above the heads of the apostles'

IMS Universität Stuttgart 52 Linguistic coverage: Prepositional phrases (PP): a) davon (thereof); b) zwischen Basel und St. Moritz (between Basel and St. Moritz); c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen (gloss: with small, above the heads of the apostles set flames) 'with small flames set above the heads of the apostles'

IMS Universität Stuttgart 53 Linguistic coverage: Verbal complexes (VC): a) gemunkelt (rumored); b) muß gerechnet werden (gloss: has counted to be) 'has to be counted'; c) zu bekommen (to get); d) bekommen zu haben (gloss: gotten to have) 'to have gotten'

IMS Universität Stuttgart 54 Linguistic coverage Clauses (CL) a)…, daß selbst Ravel sich amüsiert hätte. …, that even Ravel himself enjoyed had. '…, that even Ravel would have enjoyed.' b)…, die man in der griechischen Tragödie findet. …, which one in the Greek tragedy finds. '…, which one finds in the Greek tragedy.'

IMS Universität Stuttgart 55 Linguistic coverage Clauses (CL) a)…, Instrumente selbst zu bauen. …, instruments oneself to build. ' …, to build instruments oneself.' b)…, um einen Kaffee zu trinken. …, in order a coffee to drink. '…, in order to drink a coffee.'

IMS Universität Stuttgart 56 Feature annotation head lemma morpho-syntactic information lexical-semantic properties

IMS Universität Stuttgart 57 Feature annotation (table; columns: AdvP, AP, NP, PP, VC, CL): lexical-semantic: X X X X X X; head lemma: X X X X X X; agreement info: X X X; verbal head lemma: X

IMS Universität Stuttgart 58 Head lemma: lemma attribute at the head position; normally a single token; multi-word proper nouns have a multi-token head lemma; a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive); head lemma of PP: preposition:noun

IMS Universität Stuttgart 59 Morpho-syntactic information intersection of the morpho-syntactic information of relevant elements invariant elements are not considered no guessing involved to solve ambiguities

IMS Universität Stuttgart 60 Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| vierten Platz
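
The intersection described on the previous slide can be sketched as plain set intersection over morpho-syntactic readings. In the sketch below, the readings of "den" are the ones shown on the slide; the readings of "Platz", the omission of "vierten", dropping the definiteness value, and the names `parse_readings` and `agreement` are all assumptions made for illustration.

```python
def parse_readings(annotation):
    """Turn a '|Akk:M:Sg|Dat:F:Pl|' style string into a set of (case, gender, number)."""
    readings = set()
    for reading in annotation.strip("|").split("|"):
        case, gender, number = reading.split(":")[:3]  # ignore anything after number
        readings.add((case, gender, number))
    return readings

def agreement(*annotations):
    """Intersect the readings of all elements; nothing is guessed, so several may remain."""
    return set.intersection(*(parse_readings(a) for a in annotations))

# Readings of "den" as given on the slide.
den = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Assumed readings of "Platz" (not taken from the slides).
platz = "|Nom:M:Sg|Akk:M:Sg|Dat:M:Sg|"

result = agreement(den, platz)
print(result)
```

With these assumed inputs only the accusative masculine singular reading survives. If the intersection left more than one reading, all of them would be kept, in line with "no guessing involved to solve ambiguities".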

IMS Universität Stuttgart 64 Lexical-semantic properties: important for parsing as well as for extraction; properties can be triggers for specific internal structures, functions, and usages; properties inherent in the corpus: PoS-tags: Johann Sebastian Bach (NE NE NE); text markers: "Wilhelm Meisters Lehrjahre" (NE NN NN)

IMS Universität Stuttgart 65 Lexical-semantic properties properties determined by external knowledge sources (lexica, ontologies, word lists) locality: hier (here); dort (there); Stuttgart temporality: Jahr (year); damals (at that time) derivation: gesetzten (set) deverbal adjective

IMS Universität Stuttgart 66 Lexical-semantic properties structural information complex embeddings [ AP [ PP über die Köpfe der Apostel ] gesetzten ] above the heads of the apostles set ' set above the heads of the apostles' [ AP [ NP der "Inkatha"-Partei ] angehörenden ] to the Inkatha-party belonging 'belonging to the Inkatha-party'

IMS Universität Stuttgart 67 Some properties of NPs: card: cardinal noun; meas: measure noun; ne: named entity; quot: NP in quotation marks; street: street address; temp: temporal noun; date; pron: pronominal NP

IMS Universität Stuttgart 68 Other lexical-semantic properties: VC with separated prefix: pref, e.g. Er kommt an (he arrives); PP with contracted preposition and article: fus, e.g. am Bahnhof (at the station); complex APs embedding PPs: pp, e.g. über die Köpfe der Apostel gesetzten (gloss: above the heads of the apostles set) 'set above the heads of the apostles'; AP with deverbal adjectives: vder

IMS Universität Stuttgart 69 Chunking process (diagram with the elements Corpus, Lexicon, First Level, Second Level, Third Level)

IMS Universität Stuttgart 70 First level: basic (non-recursive) chunks; chunks with specific internal structure: a) Ende September (end of September); b) Jahre später (years later); c) 21. Juli 2003; d) Johann Sebastian Bach; lexical information is introduced within the rules themselves and within the Perl-scripts

IMS Universität Stuttgart 71 Advantages specific rules do not interact with main parsing rules additional (e.g. domain specific) rules can be included easily main parsing rules can be kept simple number of main parsing rules can be kept small

IMS Universität Stuttgart 72 Second level: main parsing level; relatively simple and general rules: a) AP → AdvP? (PP|NP)* AC; b) NP → Determiner? Cardinal? AP* NC; c) PP → Preposition (NP|AdvP); complex (recursive) structures are built in several iterations
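
To make "built in several iterations" concrete, here is a minimal simulation of the three second-level rules. It is not YAC's CQP-based implementation: the one-letter category codes, the leftmost-match strategy, the function name `chunk`, and the choice to treat each word as a pre-labelled item are assumptions of this sketch.

```python
import re

# One character per category so that regex offsets equal positions in the node list.
CODE = {"Det": "d", "Card": "c", "Prep": "p", "AdvP": "v", "AC": "a",
        "NC": "n", "AP": "A", "NP": "N", "PP": "P"}

# The three rules from the slide, as regexes over category codes.
RULES = [
    ("PP", re.compile("p[Nv]")),     # PP -> Preposition (NP|AdvP)
    ("AP", re.compile("v?[PN]*a")),  # AP -> AdvP? (PP|NP)* AC
    ("NP", re.compile("d?c?A*n")),   # NP -> Determiner? Cardinal? AP* NC
]

def chunk(nodes):
    """nodes: list of (category, text). Apply one rule per iteration until none matches."""
    nodes = list(nodes)
    while True:
        codes = "".join(CODE[cat] for cat, _ in nodes)
        candidates = []
        for order, (cat, rx) in enumerate(RULES):
            m = rx.search(codes)
            if m:
                candidates.append((m.start(), order, cat, m.end()))
        if not candidates:
            return nodes
        # Assumption of this sketch: the leftmost match wins, ties go to the earlier rule.
        start, _, cat, end = min(candidates)
        inner = " ".join(text for _, text in nodes[start:end])
        nodes[start:end] = [(cat, f"[{cat} {inner}]")]

# "eine für den Anwender verständliche Sprache" with hand-assigned categories.
words = [("Det", "eine"), ("Prep", "für"), ("Det", "den"),
         ("NC", "Anwender"), ("AC", "verständliche"), ("NC", "Sprache")]
result = chunk(words)
print(result[0][1])
```

On this input the NP "den Anwender" is built first, then the PP, then the AP that embeds the PP, and finally the outer NP, so the complexity comes from embedding rather than from complex rules.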

IMS Universität Stuttgart 73 Rule blocks

IMS Universität Stuttgart 74 Second level -complexity of phrases is achieved by the embedding of complex structures rather than by complex rules a)[ NP eine [ AP verständliche ] Sprache ] an understandable language b)[ NP eine [ AP für den Anwender verständliche ] Sprache ] a for the user understandable language 'a language understandable for the user'

IMS Universität Stuttgart 75 Second level a)[ PP auf [ NP dem Giebel ] ] on top of the gable b)[ PP auf [ NP dem westwärts gerichteten Giebel on top of the westwards pointed gable des heute im barocken Gewande erscheinenden of the today in baroque garment appearing Gotteshauses ] ] Lord's house

IMS Universität Stuttgart 76 Third level: chunks of related but different categories can be subsumed under one category: NPs with determiner (NP), NPs without determiner (NCC), base noun chunks (NC) → NP; coordination of maximal chunks; decisions are made which need full recursive chunks: adverbially and predicatively used adjectives can only be differentiated by the actual usage; adverbially used AP → AdvP

IMS Universität Stuttgart 77 Hierarchy building resulting structures of all parsing stages are collected and stored in XML-files after the parsing process collected structures are combined into a hierarchical structure only the largest instance of a structure (sharing the same head) is taken into account

IMS Universität Stuttgart 78 Hierarchy building a)[ NP Faszination ] fascination b)[ NP gewisse Faszination des Schattens ] certain fascination of the shadow c)[ NP eine gewisse Faszination des Schattens ] a certain fascination of the shadow d)[ NP des Schattens ] of the shadow e)[ NP eine gewisse Faszination [ NP des Schattens ] ] a certain fascination of the shadow
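
The selection step ("only the largest instance of a structure sharing the same head") can be sketched as follows. The span representation, identifying the head by token position, and the function name `build_hierarchy` are assumptions of this example, and the sketch expects properly nested spans.

```python
def build_hierarchy(tokens, spans):
    """spans: (start, end, category, head position), with end exclusive."""
    # Keep only the largest span for each (category, head) pair.
    largest = {}
    for start, end, cat, head in spans:
        key = (cat, head)
        if key not in largest or end - start > largest[key][1] - largest[key][0]:
            largest[key] = (start, end, cat)
    # Place opening brackets before the first token and closing ones after the last.
    opening = {i: [] for i in range(len(tokens))}
    closing = {i: 0 for i in range(len(tokens))}
    for start, end, cat in sorted(largest.values(), key=lambda s: (s[0], -s[1])):
        opening[start].append(f"[{cat}")
        closing[end - 1] += 1
    out = []
    for i, tok in enumerate(tokens):
        out.extend(opening[i])
        out.append(tok + " ]" * closing[i])
    return " ".join(out)

tokens = ["eine", "gewisse", "Faszination", "des", "Schattens"]
spans = [
    (2, 3, "NP", 2),  # a) Faszination
    (1, 5, "NP", 2),  # b) gewisse Faszination des Schattens
    (0, 5, "NP", 2),  # c) eine gewisse Faszination des Schattens
    (3, 5, "NP", 4),  # d) des Schattens
]
tree = build_hierarchy(tokens, spans)
print(tree)
```

Spans a) and b) share their head with c) and are dropped, while d) has a different head and is embedded, which reproduces structure e) from the slide.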

IMS Universität Stuttgart 79 Evaluation on automatic PoS-tags (table: precision and recall for all chunks and for maximal chunks; rows: NP, PP, AP, VC)

IMS Universität Stuttgart 80 Evaluation on ideal PoS-tags (table: precision and recall for all chunks and for maximal chunks; rows: NP, PP, AP, VC)
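
For reference, precision and recall over chunks can be computed as in the sketch below. The slides do not say how matches were counted, so exact match of span and category is an assumption here, and the gold and predicted chunks are made-up data rather than the evaluation results.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (start, end, category) chunks."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical chunks for one sentence.
gold = [(0, 5, "NP"), (3, 5, "NP"), (5, 6, "VC")]
predicted = [(0, 5, "NP"), (3, 5, "NP"), (5, 6, "VC"), (6, 8, "PP")]

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Restricting both lists to the outermost spans would give the "maximal chunks" variant of the same measure.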

IMS Universität Stuttgart 81 Extraction Advantage of the system Goal Sample Extraction

IMS Universität Stuttgart 82 Advantages of the system efficient work even with large corpora modular query language interactive grammar development powerful post-processing of rules

IMS Universität Stuttgart 83 Goal provide a fine-grained syntactic classification of the extracted data at the level of subcategorization scrambling adjectives subcategorizing clauses combinatory preferences with verbs syntactic behavior

IMS Universität Stuttgart 84 Target data: predicative(-like) constructions: Es war klar, daß... (It was clear, that...); with adverbial pronoun: Er ist davon überzeugt, daß... (gloss: He is of it convinced, that...); with reflexive pronoun: Es zeigt sich deutlich, daß... (gloss: It shows itself clear, that...)

IMS Universität Stuttgart 85 Target data: ... with infinitival clauses: Es ist möglich, ihn zu besuchen. (gloss: It is possible, him to visit.); ... with clause in topicalized position: Daß..., ist klar. (That..., is clear.); Ihn zu besuchen, ist möglich. (gloss: Him to visit, is possible.)

IMS Universität Stuttgart 86 Sample query: adjective + verb + finite clause: VC AP CL

IMS Universität Stuttgart 87 Sample query: adjective + verb + finite clause: VC AP_pred CL_fin

IMS Universität Stuttgart 88 Sample query: adjective + verb + finite clause: VC Adjuncts* AP_pred CL_fin

IMS Universität Stuttgart 89 Sample query: adjective + verb + finite clause: VC (AdvP|PP|NP_temp|CL_rel)* AP_pred CL_fin
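
The final query pattern can be tried out on a sequence of chunk labels. The sketch below is a Python stand-in, not CQP syntax: the label spellings (`AP_pred`, `CL_fin`, `NP_temp`, `CL_rel`), the (label, head lemma) representation, the function name `extract_pairs`, and the example sentences are assumptions made for illustration.

```python
import re

# VC (AdvP|PP|NP_temp|CL_rel)* AP_pred CL_fin over space-separated chunk labels.
QUERY = re.compile(
    r"(?<!\S)VC(?: (?:AdvP|PP|NP_temp|CL_rel))* AP_pred CL_fin(?!\S)"
)

def extract_pairs(chunks):
    """chunks: list of (label, head lemma). Return (adjective, verb) lemma pairs."""
    labels = " ".join(label for label, _ in chunks)
    pairs = []
    for m in QUERY.finditer(labels):
        first = labels[:m.start()].count(" ")  # index of the VC chunk
        last = first + m.group().count(" ")    # index of the CL_fin chunk
        pairs.append((chunks[last - 1][1], chunks[first][1]))
    return pairs

# "Es war klar, daß ..." with hand-assigned chunk labels and head lemmas.
plain = [("NP", "es"), ("VC", "sein"), ("AP_pred", "klar"), ("CL_fin", "daß")]
# Same construction with an intervening adverbial adjunct.
adjunct = [("NP", "es"), ("VC", "sein"), ("AdvP", "gestern"),
           ("AP_pred", "klar"), ("CL_fin", "daß")]

print(extract_pairs(plain), extract_pairs(adjunct))
```

Because the chunks already carry head lemmas and properties, the query only has to collect annotated information. Counting the resulting pairs per adjective and verb would give tables of the kind shown on the following slides.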

IMS Universität Stuttgart 90 adjective + verb + finite clause (table; columns: sein, bleiben, machen, werden; rows: fraglich, unklar, klar, offen, möglich, wichtig, deutlich, total; remaining digit runs: offen 22840; wichtig 1802; deutlich 59734)

IMS Universität Stuttgart 92 Topicalized finite clause: adjective + verb + finite clause: CL_fin VC (AdvP|PP|NP_temp|CL_rel)* AP_pred

IMS Universität Stuttgart 93 adjective + verb + finite clause (table; columns: fincl_ex, fincl_top, total; rows: fraglich, unklar, klar, offen, möglich, wichtig, deutlich)

IMS Universität Stuttgart 95 adjective + verb + infinitival clause (table; columns: sein, fallen, haben, werden, machen; rows: bereit, schwer, möglich, schwierig, leicht, nötig, erforderlich, total; remaining digit run: bereit 43146)

IMS Universität Stuttgart 97 low-frequency adjective + verb + infinitival clause (table; columns: stehen, bringen, haben, sein; rows with remaining digit runs: frei 354; satt 1910; fertig 241)

IMS Universität Stuttgart 98 low-frequency adjective + verb + clause (table; columns: stehen, bringen, haben, sein; rows with remaining digit runs: frei 376; satt 2711; fertig 261)

IMS Universität Stuttgart 99 adjective subcategorization: APs with PP complements embedded in NPs: Die [AP dafür erforderlichen] Mark (gloss: The for this needed Marks) 'The Marks needed for this'; Der [AP auf Sport spezialisierte] Journalist (gloss: The on sports specialised journalist) 'The journalist specialising in sports'

IMS Universität Stuttgart 100 multiword units and abbreviations chunks/phrases in brackets or quotes multiword units Teenage Mutant Hero Turtle (FC Italia Frankfurt) abbreviations Deutscher Aktienindex (Dax) Stickstoffdioxyd (NO2)

IMS Universität Stuttgart 101 Conclusion recursive chunking workable compromise between depth of analysis and robustness extracted data show correlation between collocational preference subcategorization frames semantic classes of adjectives to a certain extent distributional preferences