Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.

Ondřej Bojar, obo@cuni.cz, Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University, Prague

Conclusion

Designed and implemented AX, a new scripting language for selecting sentences based on linguistically motivated criteria.

Prepared an AX script (15 filters and 21 rules) to demonstrate its utility. The script selects sentences suitable for the extraction of Czech verb frames.

The extraction of frames itself has not been performed yet, but the utility of the selection can be illustrated by the improvement in the accuracy of the Czech Collins parser (measured by the number of verb occurrences with correctly assigned daughters).

Combining the linguistically motivated selection with a filter that keeps only short sentences yields an accuracy of 73.2 %. Approximately 10 % of input sentences pass both filters.

Observed verbs with all the daughters recognized correctly:

                                 In all the sentences    In the selected sentences only
Sentences of any length          55 %                    65 %
Sentences with up to 10 words    68 %                    73 %

An AX script is an arbitrary sequence of filters and rulesets.

The input for the script is a sentence represented as a sequence of feature structures that correspond one to one to the input word forms. A word form with ambiguous morphological information is still represented as a single feature structure. For example, the Czech word form má can serve either as a possessive pronoun or as a finite verb. A single feature structure can hold both variants:

[cat-pron, lemma-"můj", case-nom, gend-fem, num-sg | cat-verb, lemma-"mít", tense-pres, person-third, gend-masc, num-sg]

Filters are used to strike out sentences not suitable for further analysis or for extracting the lexico-syntactic information. Early filters decide according to simple criteria, such as too many punctuation marks.
Later filters make use of the results of partial syntactic analysis and reject sentences after a more sophisticated decision, such as discovering noun phrases ordered in a manner where syntactic ambiguity is very common and would spoil the observed verb frame. Filters are expressed as regular expressions over feature structures.

Rulesets are used to perform partial syntactic analysis of the sentence. A ruleset may produce several possible "readings" of the sentence. Some of the readings may be rejected by subsequent filters. (Formally, a reading is a sequence of feature structures.)

AX Overall Scheme

[Figure: Sentences 1, 2, and 3 pass through a filter that rejects sentences with strange symbols and then through a ruleset that combines auxiliary and main verbs. This step might be ambiguous, so several readings can be generated. Sentence 2: reading 1 rejected, reading 2 accepted. Sentence 3: readings 1, 2, and 3 rejected.]

Sentence 1 was rejected by the first filter.

Sentence 2 was accepted: one reading passed the last filter, and the final sequence of feature structures will be printed out.

Sentence 3 was rejected because none of the readings passed the second filter.

Syntactic Analysis Needs Lexicons

[Figure labels: A Dependency Treebank; A Big Corpus (with morphological information only); AX; Subcorpus of Nice Examples; Automatic Extraction of Lexico-Syntactic Data; Lexicon of Syntactic Behavior]

Treebanks contain the required syntactic information but cover too few lexemes in too few situations. For instance, the Prague Dependency Treebank (PDT; 1.5 million tokens in 98,263 sentences) covers only 5,400 Czech verbs out of an estimated total of 40,000. Only 500 verbs occur in the PDT more than 50 times. Therefore the PDT is not sufficient as a source of valencies of Czech verbs.

Corpora without syntactic annotation contain enough examples, but many of the available sentences are too complex for the syntactic information to be extracted automatically.
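The scheme of filters and rulesets over readings described above can be illustrated with a minimal Python sketch. This is not AX syntax and not the author's implementation: the names run_script, few_punctuation, split_first_ambiguity, and has_unambiguous_verb, as well as the feature names and the punctuation limit, are hypothetical and chosen only for illustration. A feature structure is modelled as a tuple of variants, so an ambiguous word form such as má stays a single item.

```python
from typing import Callable, Dict, List, Tuple, Union

Variant = Dict[str, str]                 # one morphological analysis
FeatureStructure = Tuple[Variant, ...]   # all variants of one word form
Reading = List[FeatureStructure]         # a reading is a sequence of feature structures
Step = Tuple[str, Union[Callable[[Reading], bool], Callable[[Reading], List[Reading]]]]


def run_script(sentence: Reading, steps: List[Step]) -> List[Reading]:
    """Apply filters and rulesets in order; return the surviving readings."""
    readings = [sentence]
    for kind, step in steps:
        if kind == "filter":
            # A filter keeps or rejects each reading.
            readings = [r for r in readings if step(r)]
        else:
            # A ruleset may turn one reading into several.
            readings = [out for r in readings for out in step(r)]
        if not readings:
            return []  # the sentence is rejected
    return readings


def is_cat(fs: FeatureStructure, cat: str) -> bool:
    """True if every variant of the feature structure has the given category."""
    return all(v.get("cat") == cat for v in fs)


def few_punctuation(reading: Reading, limit: int = 2) -> bool:
    """Early filter (assumed limit): reject readings with too many punctuation marks."""
    return sum(is_cat(fs, "punct") for fs in reading) <= limit


def split_first_ambiguity(reading: Reading) -> List[Reading]:
    """Toy ruleset: one output reading per variant of the first ambiguous word."""
    for i, fs in enumerate(reading):
        if len(fs) > 1:
            return [reading[:i] + [(v,)] + reading[i + 1:] for v in fs]
    return [reading]


def has_unambiguous_verb(reading: Reading) -> bool:
    """Later filter: keep only readings that contain an unambiguous verb."""
    return any(is_cat(fs, "verb") for fs in reading)


# Illustrative input: "má" is ambiguous between a pronoun and a verb.
ma = ({"cat": "pron", "lemma": "můj"}, {"cat": "verb", "lemma": "mít"})
kniha = ({"cat": "noun", "lemma": "kniha"},)
script = [
    ("filter", few_punctuation),
    ("ruleset", split_first_ambiguity),
    ("filter", has_unambiguous_verb),
]
accepted = run_script([ma, kniha], script)
```

In this toy run the ruleset produces two readings of the sentence and the last filter keeps only the one in which má is a verb, mirroring the way a sentence is accepted when at least one reading survives all filters.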
Proposed solution: first "pick nice examples", then extract the lexico-syntactic information using traditional methods.

AX (automatic extraction) is a new scripting language designed to make the following tasks easy:

Dealing with (ambiguous) morphological information.

Partial parsing and grammatically consistent simplification of sentences (if needed to check for more complex phenomena).

Selection of sentences to keep, based on linguistic criteria (both morphological and syntactic ones).

Print-out of the simplified version of selected sentences. If the script was prepared carefully, this can already be the desired lexico-syntactic information.

Input: arbitrary texts augmented with morphological information (not necessarily disambiguated).

AX rule "combine main and aux. verb parts":

    combined \gap [cat-trace] -->
        aux {gap: ![cat-verb]*} main
      | main {gap: ![cat-verb]*} aux
    ::
      # unification requirements follow
      aux = [cat-verb, lemma-"být"],
      main = [cat-verb_participle],
      aux.person = main.person,
      aux.number = main.number,
      combined = [cat-complex_verb],
      combined.lemma = main.lemma,
      combined.person = main.person,
      combined.number = main.number
    end

The parts of the rule work as follows:

Find a subsequence in the input reading that matches the given regular expression. Assign words to the variables aux and main and mark the region between them with the label gap.

Restrict which words can be assigned to the input variables aux and main.

Ensure grammatical agreement between the words assigned to the variables aux and main.

Fill (restrict by unifying) the output variable combined with the features relevant for further analysis.

Replace the matching region with the content of the variable combined, the region labelled gap, and an extra feature structure trace marking the former location of the second part of the complex verb (if useful in further analysis).

A filter rejects sentences with two main verbs.
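The behaviour of the sample rule can be approximated in Python as follows. This is a sketch under simplifying assumptions, not the AX interpreter: every word is treated as a single unambiguous feature structure (a plain dict), a feature missing on one side is treated as compatible (a rough stand-in for unification), and participles and complex verbs are assumed to count as verbs when checking the gap. The names combine_aux_main, agree, and VERBAL are hypothetical.

```python
from typing import Dict, List

FS = Dict[str, str]  # simplified, unambiguous feature structure

# Assumption: categories that may not appear inside the gap.
VERBAL = {"verb", "verb_participle", "complex_verb"}


def is_aux(fs: FS) -> bool:
    return fs.get("cat") == "verb" and fs.get("lemma") == "být"


def is_main(fs: FS) -> bool:
    return fs.get("cat") == "verb_participle"


def agree(a: FS, b: FS, name: str) -> bool:
    """Features agree if they are equal or if one side leaves the feature open."""
    x, y = a.get(name), b.get(name)
    return x is None or y is None or x == y


def combine_aux_main(reading: List[FS]) -> List[List[FS]]:
    """Return one new reading per place where the rule applies."""
    results = []
    for i, first in enumerate(reading):
        for j in range(i + 1, len(reading)):
            second = reading[j]
            if is_aux(first) and is_main(second):
                aux, main = first, second
            elif is_main(first) and is_aux(second):
                aux, main = second, first
            else:
                aux = main = None
            if aux is not None and agree(aux, main, "person") and agree(aux, main, "number"):
                combined = {"cat": "complex_verb"}
                for name in ("lemma", "person", "number"):
                    # lemma comes from main; person and number are shared by aux and main
                    value = main.get(name) or (aux.get(name) if name != "lemma" else None)
                    if value:
                        combined[name] = value
                gap = reading[i + 1:j]
                # Output: combined, then the gap, then a trace at the old position.
                results.append(
                    reading[:i] + [combined] + gap + [{"cat": "trace"}] + reading[j + 1:]
                )
            if second.get("cat") in VERBAL:
                break  # the gap may contain only non-verbs
    return results


# Illustrative sentence: "Já jsem včera přišel" ("I came yesterday").
sentence = [
    {"cat": "pron", "lemma": "já"},
    {"cat": "verb", "lemma": "být", "person": "first", "number": "sg"},
    {"cat": "adv", "lemma": "včera"},
    {"cat": "verb_participle", "lemma": "přijít", "number": "sg"},
]
new_readings = combine_aux_main(sentence)
```

For the illustrative sentence the auxiliary jsem and the participle přišel are merged into one complex verb that carries the lemma of the participle, the adverb from the gap is kept, and a trace marks the former position of the participle.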
[Figure: Sample AX Script and AX Rule]

Full text, acknowledgements, and the list of references appear in the proceedings of the ESSLLI Student Session, Vienna, 2003.
