# Conclusion There is no single best strategy to extract an optimal set of candidate data from a corpus. You need to know a least some structural and distributional.

## Presentation on theme: "Conclusion There is no single best strategy to extract an optimal set of candidate data from a corpus. You need to know a least some structural and distributional."— Presentation transcript:

Conclusion There is no single best strategy to extract an optimal set of candidate data from a corpus. You need to know a least some structural and distributional properties of the phenomena you are searching for. Preparation of candidate data influences distributions. Distributional properties determine the outcome of AMs. Know the distributional assumptions underlying the AMs you use.

Adj-N Hello, I am called Richard Herring and this is my web-site.
As for Java being slow as molasses, that's often a red herring. In the first place, the ‘less genes, more behavioural flexibility’ argument is a total red herring. Smadja gives the example that the probability that any two adjacent words in a text will be red herring is greater than the probability of red times the probability of herring. We can, therefore, use statistical methods to find collocations [Smadja, 1993]. Okay, I think I misled you by introducing a red herring, for which I am very sorry. Home page for the British Red Cross, includes information on activities, donations, branch addresses,... Welcome to Red Pepper, independent magazine of the green and radical left. Hotel Accommodations by Red Roof Inn where you will find affordable Hotel Rates. Web site of the Royal Air Force aerobatic team, the Red Arrows.

Extraction of Coocurrence Data

Basic Assumptions The collocates of a collocation cooccur more frequently within text than arbitrary word combinations. (Recurrence) Stricter control of cooccurrence data leads to more meaningful results in collocation extraction.

Word (Co)occurrence Distribution of words and word combinations in text approximately described by Zipf’s law. Distribution of combinations is more “extreme” than that of individual words.

Word (Co)occurrence Zipf 's law: nm is the number of different words occurring m times i.e., there is a large number of low-frequency words, and few high-frequency ones

An Example corpus size: 8 million words from the Frankfurter Rundschau corpus 569,310 PNV-combinations (types) have been selected from the extraction corpus including main verbs, modals and auxiliaries. Considering only combinations with main verbs, the number of PNV-types reduces to 372,212 (full forms). Corresponding to 454,088 instances

Distribution of PNV types according to frequency (372,212 types)

Distribution of PNV types according to frequency (10,430 types with f>2)

Word (Co)occurrence Collocations will be preferably found among highly recurrent word combinations extracted from text. Large amounts of text need to be processed to obtain sufficient number of high-frequency combinations.

Control of Candidate Data
Extract collocations from relational bigrams Syntactic homogeneity of candidate data (Grammatical) cleanness of candidates e.g. N+V pairs: Subject+V vs. Object+V Text type, domain, and size of source corpus influence the outcome of collocation extraction

Terminology Extraction corpus tokenized, pos-tagged or syntactically analysed text Base data list of bigrams found in corpus Cooccurrence data bigrams with contingency tables Collocation candidates ranked bigrams

Frequency counts (from corpora)
Types and Tokens Frequency counts (from corpora) identify labelled units (tokens), e.g. words, NPs, Adj-N pairs set of different labels (types) type frequency = number of tokens labelled with this type example: ... what the black box does ...

Frequency counts (from corpora)
Types and Tokens Frequency counts (from corpora) identify labelled units (tokens), e.g. words, NPs, Adj-N pairs set of different labels (types) type frequency = number of tokens labelled with this type example: ... what the black box does ... box

Counting cooccurrences
Types and Tokens Counting cooccurrences bigram tokens = pairs of word tokens bigram types = pairs of word types contingency table = four-way classification of bigram tokens according to their components

Contingency Tables contingency table for pair type (u,v)

Collocation Extraction: Processing Steps
Corpus preprocessing tokenization (orthographic words) pos-tagging morphological analysis / lemmatization partial parsing (full parsing)

Collocation Extraction: Processing Steps
Extraction of base data from corpus adjacent word pairs Adj-N pairs from NP chunks Object-V & Subject-V from parse trees Calculation of cooccurrence data compute contingency table for each pair type (u,v)

Collocation Extraction: Processing Steps
Ranking of cooccurrence data by "association scores" measure statistical association between types u and v true collocations should obtain high scores using association measures (AM) N-best list = listing of N highest-ranked collocation candidates

Base Data: How to get? Adj-N adjacency data numerical span NP chunking (lemmatized)

Base Data: How to get? V-N adjacency data sentence window (partial) parsing identification of grammatical relations (lemmatized)

Base Data: How to get? PP-V adjacency data PP chunking separable verb particles (in German) (full syntactic analysis) (lemmatization?)

Adj-N In the first place, the ‘less genes, more behavioural flexibility’ argument is a total red herring. In/PRP the/ART first/ORD place/N ,/\$, the/ART ‘/\$’ less/ADJ genes/N ,/\$, more/ADJ behavioural/ADJ flexibility/N ’/\$’ argument/N is/V a/ART total/ADJ red/ADJ herring/N ./\$.

Adj-N: poswi = N span size 1 (adjacency) wj, j= -1 first/ORD place/N

Adj-N: poswi = N span size 2 wj, j = -2, -1 more/ADJ flexibility/N

span size 2 wj, j = -2, -1 first/ORD place/N the/ART place/N less/ADJ genes/N ‘/\$’ genes/N behavioural/ADJ flexibility/N more/ADJ flexibility/N ’/\$’ argument/N flexibility/N argument/N red/ADJ herring/N total/ADJ herring/N

Adj-N (S (PP In/PRP (NP the/ART first/ORD place/N ) ,/\$,

Adj-N (S (PP-mod In/PRP (NP the/ART first/ORD place/N ) ,/\$,

Adj-N: NP chunks NP chunks (NP the/ART first/ORD place/N )

N-V: Object-VERB spill the beans bury the hatchet
Good for you for guessing the puzzle but from the beans Mike spilled to me, I think those kind of twists are more maddening than fun. bury the hatchet Paul McCartney has buried the hatchet with Yoko Ono after a dispute over the songwriting credits of some of the best-known Beatles songs.

N-V: Object-Mod-VERB keep <one‘s> nose to the grindstone
I'm very impressed with you for having kept your nose to the grindstone, I'd like to offer you a managerial position. We’ve learned from experience and kept our nose to the grindstone to make sure our future remains a bright one. She keeps her nose to the grindstone.

keep <one‘s> nose to the grindstone
N-V: Object-Mod-VERB keep <one‘s> nose to the grindstone (VP {kept, keeps, ...} {(NP-obj your nose), (NP-obj our nose), (NP-obj her nose), ... } (PP-mod to the grindstone) )

PN-V: P-Object-VERB zur Verfügung stellen (make available) Peter stellt sein Auto Maria zur Verfügung (Peter makes his car available to Maria) in Frage stellen (question) Peter stellt Marias Loyalität in Frage (Peter questions Maria’s loyalty) in Verbindung setzen (to contact) Peter setzt sich mit Maria in Verbindung (Peter contacts Maria)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) pair type: (u,v) = (black, box)

Contingency Tables for Relational Cooccurrences
(big, dog) (black, box) (black, dog) (small, cat) (small, box) (old, box) (tabby, cat) f(u,v) = 2 f1(u) = 3 f2(v) = 4 N = 8

Contingency Tables for Relational Cooccurrences
f(u,v) = 2 f1(u) = 3 f2(v) = 4 N = 8

Contingency Tables for Relational Cooccurrences
f(u,v) = 2 f1(u) = 3 f2(v) = 4 N = 8

Contingency Tables for Relational Cooccurrences

Contingency Tables in Perl
%F = (); %F1 = (); %F2 = (); \$N = 0; while ((\$u, \$v)=get_pair()) { \$F{"\$u,\$v"}++; \$F1{\$u}++; \$F2{\$v}++; \$N++; }

Contingency Tables in Perl
foreach \$pair (keys %F) { (\$u,\$v) = split /,/, \$pair; \$f = \$F{\$pair}; \$f1 = \$F1{\$u}; \$f2 = \$F2{\$v}; \$O11 = \$f; \$O12 = \$f1 - \$f; \$O21 = \$f2 - \$f; \$O22 = \$N - \$f1 - \$f2 - \$f; # ... }

Reminder: Contingency Table with Row and Column Sums

Why are Positional Cooccurrences Different?
adjectives and nous cooccurring within sentences "I saw a black dog"  (black, dog) f(black, dog)=1, f1(black)=1, f2(dog)=1 "The old man with the silly brown hat saw a black dog"  (old, dog), (silly, dog), (brown, dog), (black, dog), ... , (black, man), (black, hat) f(black, dog)=1, f1(black)=3, f2(dog)=4

Why are Positional Cooccurrences Different?
"wrong" combinations could be considered as extraction noise ( association measures distinguish noise from recurrent combinations) but: very large amount of noise statistical models assume that noise is completely random but: marginal frequencies often increase in large steps

Contingency Tables for Segment-Based Cooccurrences
within pre-determined segments (e.g. sentences) components of cooccurring pairs may be syntactically restricted (e.g. adj-noun, nounSg-verb3.Sg) for given pair type (u,v), set of all sentences is classified into four categories

Contingency Tables for Segment-Based Cooccurrences
u  S = at least one occurrence of u in sentence S u  S = no occurrences of u in sentence S v  S = at least one occurrence of v in sentence S v  S = no occurrences of v in sentence S

Contingency Tables for Segment-Based Cooccurrences
fS(u,v) = number of sentences containing both u and v fS(u) = number of sentences containing u fS(v) = number of sentences containing v NS = total number of sentences

Frequency Counts for Segment-Based Cooccurrences
adjectives and nous cooccurring within sentences "I saw a black dog"  (black, dog) fS(black, dog)=1, fS(black)=1, fS(dog)=1 "The old man with the silly brown hat saw a black dog"  (old, dog), (silly, dog), (brown, dog), (black, dog), ... , (black, man), (black, hat) fS(black, dog)=1, fS(black)=1, fS(dog)=1

Segment-Based Cooccurrences in Perl
foreach \$S { %words = map {\$_ => 1} words(\$S); %pairs = map {\$_ => 1} pairs(\$S); foreach \$w (keys %words) { \$FS_w{\$w}++; } foreach \$p (keys %pairs) { \$FS_p{\$p}++; \$NS++;

Contingency Tables for Distance-Based Cooccurrences
problems are similar to segment-based cooccurrence data but: no pre-defined segments accurate counting is difficult here: sketched for special case all orthographic words numerical span: nL left, nR right no stop word lists

Contingency Tables for Distance-Based Cooccurrences
nL = 3, nR = 2

Contingency Tables for Distance-Based Cooccurrences
nL = 3, nR = 2 occurrences of v

Contingency Tables for Distance-Based Cooccurrences
nL = 3, nR = 2 occurrences of v window W(v) around them

Contingency Tables for Distance-Based Cooccurrences
nL = 3, nR = 2 occurrences of v window W(v) around them occurrences of u

Contingency Tables for Distance-Based Cooccurrences
nL = 3, nR = 2 occurrences of v window W(v) around them occurrences of u cross-classify occurrences of u against window W(v)

Contingency Tables for Distance-Based Cooccurrences

Contingency Tables for Distance-Based Cooccurrences

N-V: Subject-Verb Beispiele

Cooccurrence Data Token-Type Distinction

Extraction Strategies
required: PoS-tagging basic phrase chunking infinitives with zu (to) are treated like single words, separated verb prefixes are reattached to the verb

Extraction Strategies
Full forms or base forms ? depends on language and collocation type required: morphological analysis

3 Extraction Strategies
Strategy 1: Retrieval of n-grams from word forms only (wi) Strategy 2: Retrieval of n-grams from part-of-speech annotated word forms (wti) Strategy 3: Retrieval of n-grams from word forms with particular parts-of-speech, at particular positions in syntactic structure (wticj )

Spans tested wi wi+1 wi wi+1 wi+2 wi wi+2 wi+3 wi wi+3 wi+4

Results of Strategy 1 Retrieval of PP-verb collocations from word forms only is clearly inappropriate as function words like articles, prepositions, conjunctions, pronouns, etc. outnumber content words such as nouns, adjectives and verbs. Blunt use of stop word lists leads to the loss of collocation-relevant information, as accessibility of prepositions and determiners may be crucial for the distinction of collocational and noncollocational word combinations.

most useful/informative span: wi wi+1 wi+2 examples
Results of Strategy 1 most useful/informative span: wi wi+1 wi+2 examples bis & & Uhr FRANKFURT & A. & M. 949 in & diesem & Jahr um & & Uhr Di. & bis & Fr 10 & bis & Tips & und & Termine 597 in & der & Nacht 582

useful/informative span size is language specific
we have learned useful/informative span size is language specific we find a number of different constructions e.g. NP, PP, ... names, time phrases, conventionalized constructions, ...

Results of Strategy 2 wti wti+1 with preposition ti and noun ti+1
PPs with arbitrary preposition-noun co-occurrences such as am Samstag (on Saturday), am Wochenende (at the weekend), für Kinder (for children) Fixed/conventionalized? PPs such as zum Beispiel (for example)

Results of Strategy 2 wti wti+1 with preposition ti and noun ti+1
PPs with a strong tendency for particular continuation such as nach Angaben + NPgen (`according to'), im Jahr + Card (in the year). Potential PP-collocates of verb-object collocations such as zur Verfügung (at the disposal)

Results of Strategy 2 wti wti+2 with preposition ti and noun ti+1
typically cover PPs with pre-nominal modification Cardinal, for instance, is the most probable modifier category co-occurring with bis ... Uhr (until o’clock) Adjective is the predominant modifier category related to im ... Jahr (1272 of 1276 cases total), vergangenen (Adj, last, 466 instances)

Results of Strategy 2 wti wti+3 with preposition ti and noun ti+1
typically exceeds phrase boundaries im Jahres (indat yeargen), for instance, originates from PP NPgen e.g. im September dieses Jahres (in the September of this year)

Results of Strategy 2 wti wti+1 wti+2 with preposition ti and noun ti+1 and verb ti+2
Frequent preposition-noun-participle or -infinitive sequences are good indicators for PP-verb collocations, especially for collocations that function as predicates such as support-verb constructions and a number of figurative expressions. zur Verfügung gestellt (made available) in Frage gestellt (questioned) in Verbindung setzen (to contact)

Results of Strategy 2 wti wti+2 wti+3 wti wti+3 wti+4 with preposition ti and noun ti+2 and verb ti+3 with preposition ti and noun ti+3 and verb ti+4 a variety of PPs with prenominal modification are covered but also phrase boundaries are more likely to be exceeded durch Frauen helfen  durch X (Y) Frauen helfen

Results of Strategy 3 wtick wtjck wtlcm
PP-Collocate V-Collocate Right Neighbour Co-occurring Main Verb zur Verfügung stehen 189 404 stellen 240 457 in Kraft treten 99 126 setzen 12 23 bleiben 5

Conclusion There is no single best strategy to extract an optimal set of candidate data from a corpus. You need to know a least some structural and distributional properties of the phenomena you are searching for. Preparation of candidate data influences distributions. Distributional properties determine the outcome of AMs. Know the distributional assumptions underlying the AMs you use.

Download ppt "Conclusion There is no single best strategy to extract an optimal set of candidate data from a corpus. You need to know a least some structural and distributional."

Similar presentations