09.06.2016 COGS 523 - Bilge Say
Using Corpora for Language Research
COGS 523 - Lecture 8: Collocations


Related Readings
Manning and Schutze (1999). Foundations of Statistical Natural Language Processing. Chapter 5 on Collocations.
Optional: Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58. Mouton de Gruyter, Berlin. [extended manuscript: http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf] and his web site http://www.collocations.de/

Collocations
A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things. Collocations are characterized by limited compositionality: they are not fully compositional in that there is usually an element of meaning added to the combination. ex. strong tea

Idioms are the most extreme examples of non-compositionality; ex. kick the bucket
Most collocations exhibit milder forms of non-compositionality; ex. international best practice

Collocations are important for a number of applications: natural language generation, computational lexicography, parsing, and corpus linguistic research. They are also relevant to sociolinguistics. ex. strong tea; not powerful tea

Manning and Schutze Example
Corpus for the following analyses: New York Times (August to November 1990), 115 MB of text, 14 million words

Approaches to finding collocations:
- Frequency
- Mean and variance
- Hypothesis testing
- Likelihood ratios
- Mutual Information (pointwise)

Frequency
If two words occur together a lot, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.
Heuristic: pass the candidate phrases through a part-of-speech filter.

Most frequent bigrams before filtering:

C(w1, w2)   w1     w2
80871       of     the
58841       in     the
26430       to     the
21842       on     the
21839       for    the
13899       in     a
13689       of     a
8753        has    been

Part-of-speech tag patterns used as the filter (A = adjective, N = noun, P = preposition):

Tag Pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom

Most frequent bigrams after filtering:

C(w1, w2)   w1        w2
11487       New       York
7261        United    States
5412        Los       Angeles
3301        last      year
3191        Saudi     Arabia
2699        last      week
2514        vice      President
2378        Persian   Gulf

(Manning and Schutze, 1999)
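The frequency-plus-filter heuristic can be sketched in a few lines of Python. This is a minimal illustration, not Manning and Schutze's implementation: the hand-tagged toy input, the extra tag O for "anything else", and the helper name `filtered_bigram_counts` are assumptions made for the example, and only the two-word patterns A N and N N are checked.

```python
from collections import Counter

# Two-word tag patterns from the filter table (A = adjective, N = noun).
BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}


def filtered_bigram_counts(tagged_tokens):
    """Count adjacent word pairs whose tag sequence matches an allowed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in BIGRAM_PATTERNS:
            counts[(w1, w2)] += 1
    return counts


# Hypothetical hand-tagged input; "O" marks tags the filter ignores.
tagged = [
    ("the", "O"), ("linear", "A"), ("function", "N"), ("of", "P"),
    ("the", "O"), ("regression", "N"), ("coefficients", "N"),
    ("is", "O"), ("a", "O"), ("linear", "A"), ("function", "N"),
]

counts = filtered_bigram_counts(tagged)
for bigram, freq in counts.most_common():
    print(freq, *bigram)
```

Function-word pairs such as "of the" never reach the counter, which is why the filtered list is dominated by noun phrases.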

w            C(strong, w)      w            C(powerful, w)
support      50                force        13
safety       22                computers    10
sales        21                position     8
opposition   19                men          8
showing      18                computer     8
sense        18                man          7
message      15                symbol       6
defense      14                military     6

(Manning and Schutze, 1999)

Mean and Variance
The frequency-based approach works well for fixed phrases. But many collocations consist of two words that stand in a more flexible relationship to one another:
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door

The mean is simply the average offset. For the example, the mean offset between knocked and door is 4.0.
Variance measures how much the individual offsets deviate from the mean. The sample standard deviation is the square root of the variance. For the example, the standard deviation of the offsets between knocked and door is 1.15.
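The two figures for knocked and door can be checked with Python's standard library. The offsets below are an assumption about how the four example sentences are tokenized (in particular, "Donaldson's" is counted as three tokens so that the last two offsets are 5); the variable names are illustrative only.

```python
from statistics import mean, stdev

# Offsets (position of "door" minus position of "knocked") for the four
# example sentences, under the tokenization assumption stated above.
offsets = [3, 3, 5, 5]

mean_offset = mean(offsets)   # average distance between the two words
std_offset = stdev(offsets)   # sample standard deviation (n - 1 denominator)

print(f"mean = {mean_offset:.2f}, s = {std_offset:.2f}")
```

Note that `stdev` uses the n - 1 denominator, which is what gives 1.15 rather than 1.0 for these four offsets.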

We can use this information to discover collocations by looking for pairs with low deviation. A low deviation means that the two words usually occur at about the same distance. Zero deviation means that the two words always occur at exactly the same distance.

[Figure omitted in transcript] (Manning and Schutze, 1999)

s = sample deviation of the offsets, d = sample mean of the offsets

s      d      Count   Word 1        Word 2
0.43   0.97   11657   New           York
0.48   1.83   24      previous      games
0.15   2.98   46      minus         points
0.49   3.87   131     hundreds      dollars
4.03   0.44   36      editorial     Atlanta
4.03   0.00   78      ring          New
3.96   0.19   119     point         hundredth
3.96   0.29   106     subscribers   by
1.07   1.45   80      strong        support
1.13   2.57   7       powerful      organizations
1.01   2.00   112     Richard       Nixon
1.05   0.00   10      Garrison      said

(Manning and Schutze, 1999)

Hypothesis testing
High frequency and low variance can be accidental. If the two constituent words of a frequent bigram like new companies are themselves frequent words (as new and companies are), then we expect the two words to co-occur a lot just by chance.

What we really want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics.

How can we apply the methodology of hypothesis testing to the problem of finding collocations? We first formulate a null hypothesis that states what should be true if two words do not form a collocation, namely that they occur independently:
$P(w_1 w_2) = P(w_1) P(w_2)$

The t test
Now we need a statistical test that tells us how probable or improbable it is that a certain constellation will occur. A test that has been widely used for collocation discovery is the t test.
$t = \frac{\bar{x} - \eta}{\sqrt{s^2 / N}}$
where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size, and $\eta$ is the mean of the distribution.

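Example: new companies
$P(\text{new}) = 15828 / 14307668$
$P(\text{companies}) = 4675 / 14307668$
Null hypothesis: $P(\text{new companies}) = P(\text{new}) \cdot P(\text{companies}) \approx 3.615 \times 10^{-7}$
$C(\text{new companies}) = 8$, so $\bar{x} = 8 / 14307668 \approx 5.591 \times 10^{-7}$
Each bigram position is treated as a Bernoulli trial, so $s^2 = p(1 - p) \approx p$ for small $p$, and $\bar{x}$ is used as the estimate of $p$.
$t = \frac{\bar{x} - \eta}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932$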
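The calculation for new companies can be reproduced with a short function. This is a sketch under the same simplifications as the slide (variance approximated by the observed bigram probability, corpus size used as the number of bigrams); the function name `t_score` and its parameter names are assumptions for the example.

```python
import math

N = 14307668  # corpus size in tokens, also used as the number of bigrams


def t_score(c1, c2, c12, n=N):
    """t statistic for the bigram w1 w2 against the independence hypothesis."""
    x_bar = c12 / n            # observed bigram probability (sample mean)
    eta = (c1 / n) * (c2 / n)  # expected probability if w1 and w2 are independent
    s2 = x_bar                 # Bernoulli variance p(1 - p), approximated by p
    return (x_bar - eta) / math.sqrt(s2 / n)


t_new_companies = t_score(c1=15828, c2=4675, c12=8)
print(round(t_new_companies, 6))
```

The same function can be applied to any row of a count table with columns C(w1), C(w2) and C(w1, w2), which is how a ranking of candidate bigrams would be produced.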

The value t ≈ 0.999932 is not larger than 2.576, the critical value for α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.

t        C(w1)   C(w2)   C(w1, w2)   w1              w2
4.4721   42      20      20          Ayatollah       Ruhollah
4.4721   41      27      20          Bette           Midler
4.4720   30      117     20          Agatha          Christie
4.4720   77      59      20          videocassette   recorder
4.4720   24      320     20          unsalted        butter
2.3714   14907   9017    20          first           made
2.2446   13484   10570   20          over            many
1.3685   14734   13478   20          into            them
1.2176   14093   14776   20          like            people
0.8036   15019   15629   20          time            last

(Manning and Schutze, 1999)

It turns out that most bigrams attested in a corpus occur significantly more often than chance. Language is very regular, so very few completely unpredictable events happen. The t test and other statistical tests are therefore most useful as a method for ranking collocations.

Hypothesis testing of differences
The t test can also be used for a slightly different collocation discovery problem: to find words whose co-occurrence patterns best distinguish between two words. ex. to find words that best differentiate the meanings of strong and powerful.

t        C(w)   C(strong w)   C(powerful w)   Word
3.1622   933    0             10              computers
2.8284   2337   0             8               computer
2.4494   289    0             6               symbol
2.4494   588    0             6               machines
2.2360   2266   0             5               Germany
7.0710   3685   50            0               support
6.3257   3616   58            7               enough
4.6904   986    22            0               safety
4.5825   3741   21            0               sales
4.0249   1093   19            1               opposition

(Manning and Schutze, 1999)
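The t values in this table are consistent with the approximation given in Manning and Schutze's chapter, $t \approx (C(v_1 w) - C(v_2 w)) / \sqrt{C(v_1 w) + C(v_2 w)}$, which the slide itself does not spell out. The sketch below assumes that approximation and takes the absolute value so that both halves of the table come out positive; the function name `t_difference` is hypothetical.

```python
import math


def t_difference(c_v1_w, c_v2_w):
    """Approximate t score for how strongly w prefers v1 over v2 (or vice versa).

    Assumes at least one of the two counts is non-zero.
    """
    return abs(c_v1_w - c_v2_w) / math.sqrt(c_v1_w + c_v2_w)


# (word, C(strong w), C(powerful w)) rows taken from the table
rows = [("computers", 0, 10), ("support", 50, 0),
        ("enough", 58, 7), ("opposition", 19, 1)]

for word, c_strong, c_powerful in rows:
    print(f"{word:12s} t = {t_difference(c_strong, c_powerful):.4f}")
```

Because the approximation depends only on the two co-occurrence counts, the C(w) column plays no role in the score.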

Pearson's chi-square test
The t test assumes that probabilities are approximately normally distributed, which is not true in general. The essence of the $X^2$ test is to compare the observed frequencies in a table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

Contingency table for new companies:

                  w1 = new                    w1 ≠ new
w2 = companies    8 (new companies)           4667 (e.g. old companies)
w2 ≠ companies    15820 (e.g. new machines)   14287181 (e.g. old machines)

(Manning and Schutze, 1999)

$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
Expected frequency for the new companies cell: $E_{11} = \frac{8 + 4667}{N} \times \frac{8 + 15820}{N} \times N \approx 5.2$
$X^2 \approx 1.55$, which is not larger than 3.841, the critical value for $\alpha = 0.05$. We cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
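For a 2-by-2 table the sum over cells has a well-known closed form, which makes the value 1.55 easy to check. The sketch below uses the cell counts for new companies, with 15820 as the count of bigrams that start with new but do not end in companies (the value that appears in the expected-frequency expression); the function name `chi_square_2x2` is an assumption for the example.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's X^2 for a 2x2 contingency table, closed form.

    Algebraically equivalent to summing (O - E)^2 / E over the four cells.
    """
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


x2 = chi_square_2x2(o11=8, o12=4667, o21=15820, o22=14287181)
print(round(x2, 2))
```

The result is compared against the critical value for one degree of freedom, since a 2-by-2 table has (2 - 1)(2 - 1) = 1 degree of freedom.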

Likelihood ratios
Likelihood ratios are more appropriate for sparse data than the $X^2$ test, and a likelihood ratio is more interpretable than the $X^2$ statistic.
Two alternative explanations for the occurrence frequency of a bigram w1 w2:
Hypothesis 1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
Hypothesis 2: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$
Hypothesis 1 is a formalization of independence. Hypothesis 2 is a formalization of dependence, which is good evidence for an interesting collocation.

-2logλ    C(w1)   C(w2)   C(w1, w2)   w1             w2
1291.42   12593   932     150         most           powerful
99.31     379     932     10          politically    powerful
82.96     932     934     10          powerful       computers
80.39     932     3424    13          powerful       force
57.27     932     291     6           powerful       symbol
51.66     932     10      4           powerful       lobbies
51.52     171     932     5           economically   powerful
51.05     932     43      4           powerful       magnet
34.15     932     3       2           powerful       cudgels

(Manning and Schutze, 1999)
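A score of this kind can be computed from the three counts and the corpus size. The sketch below assumes binomial likelihoods for both hypotheses, as in Manning and Schutze's chapter, with maximum likelihood estimates p = C(w2)/N, p1 = C(w1, w2)/C(w1) and p2 = (C(w2) - C(w1, w2))/(N - C(w1)); the function names are hypothetical. Expect a value in the same range as the tabulated 82.96 for powerful computers rather than an exact match to the last digit.

```python
import math

N = 14307668  # corpus size, taken from the example corpus


def log_binom_likelihood(k, n, x):
    """log of x^k (1 - x)^(n - k); the binomial coefficient cancels in the ratio."""
    return k * math.log(x) + (n - k) * math.log1p(-x)


def neg2_log_lambda(c1, c2, c12, n=N):
    """-2 log(lambda) for bigram w1 w2. Assumes 0 < c12 < c1 and c12 < c2."""
    p = c2 / n                     # Hypothesis 1: one shared probability
    p1 = c12 / c1                  # Hypothesis 2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)     # Hypothesis 2: P(w2 | not w1)
    log_lambda = (log_binom_likelihood(c12, c1, p)
                  + log_binom_likelihood(c2 - c12, n - c1, p)
                  - log_binom_likelihood(c12, c1, p1)
                  - log_binom_likelihood(c2 - c12, n - c1, p2))
    return -2 * log_lambda


score = neg2_log_lambda(c1=932, c2=934, c12=10)  # powerful computers
print(round(score, 2))
```

Working in log space avoids underflow, since the raw likelihoods are far too small to represent as floating-point numbers.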

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is $e^{0.5 \times 82.96} \approx 1.0 \times 10^{18}$ times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest.

If λ is a likelihood ratio of a particular form, then the quantity $-2 \log \lambda$ is asymptotically $X^2$ distributed. We can therefore use tables of $X^2$ to test $H_1$ against $H_2$. E.g. with the value 34.15 for powerful cudgels, we can reject $H_1$ for this bigram at the significance level α = 0.005.

Relative Frequency Ratios
Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of one corpus when compared to the others.
e.g. Karim Obeid occurs 2 times in the 1990 corpus and 68 times in the 1989 corpus, so the relative frequency ratio r is
$r = \frac{2 / 14307668}{68 / 11731561} \approx 0.0241$
Relative frequency ratios are useful for finding subject-specific collocations. The application proposed is to compare a general text with a subject-specific text.
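The ratio for Karim Obeid can be reproduced directly from the two counts and the two corpus sizes quoted above. This is a minimal sketch; the function and parameter names are assumptions for the example.

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Relative frequency in corpus A divided by relative frequency in corpus B."""
    return (count_a / size_a) / (count_b / size_b)


# Karim Obeid: 2 occurrences in the 1990 corpus, 68 in the 1989 corpus.
r = relative_frequency_ratio(2, 14307668, 68, 11731561)
print(round(r, 4))
```

Normalizing by corpus size matters here because the two corpora differ in length; a raw count ratio of 2/68 would give a slightly different value.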

Ratio    1990   1989   w1       w2
0.0241   2      68     Karim    Obeid
0.0372   2      44     East     Berliners
0.0372   2      44     Miss     Manners
0.0399   2      41     17       earthquake
0.0409   2      40     HUD      officials
0.0482   2      34     EAST     GERMANS
0.0496   2      33     Muslim   cleric
0.0496   2      33     John     Le
0.0512   2      32     Prague   Spring
0.0529   2      31     Among    individual

(Manning and Schutze, 1999)

Mutual Information

Symbol: $I(x, y)$
Definition: $\log \frac{p(x, y)}{p(x) p(y)}$
Current use: pointwise mutual information
Fano: mutual information

Symbol: $I(X; Y)$
Definition: $E \log \frac{p(X, Y)}{p(X) p(Y)}$
Current use: mutual information
Fano: average MI / expectation of MI
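Pointwise mutual information follows directly from the definition once probabilities are estimated from counts. The sketch below assumes maximum likelihood estimates and a base-2 logarithm (the definition above leaves the base open), and uses the counts for Ayatollah Ruhollah from the t test table with the corpus size of the example corpus; the function name `pmi` is hypothetical.

```python
import math


def pmi(c1, c2, c12, n):
    """Pointwise mutual information I(w1, w2) in bits, from raw counts."""
    p_xy = c12 / n
    p_x = c1 / n
    p_y = c2 / n
    return math.log2(p_xy / (p_x * p_y))


# C(Ayatollah) = 42, C(Ruhollah) = 20, C(Ayatollah Ruhollah) = 20
pmi_value = pmi(42, 20, 20, 14307668)
print(round(pmi_value, 2))
```

When one word occurs only inside the bigram, as here, the score reduces to log2(N / C(w1)), which is why rare words produce very high values.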

I_1000   C(w1)   C(w2)   C(w1, w2)   Bigram
16.95    5       1       1           Schwartz eschews
15.02    1       19      1           fewest visits
13.78    5       9       1           FIND GARDEN
12.00    5       31      1           Indonesian pieces
9.82     26      27      1           Reds survived
9.21     13      82      1           marijuana growing
7.37     24      159     1           doubt whether
6.68     687     9       1           new converts
6.00     661     15      1           like offensive
3.81     159     283     1           must think

I_23000   C(w1)   C(w2)   C(w1, w2)   Bigram
14.46     106     6       1           Schwartz eschews
13.06     76      22      1           fewest visits
11.25     22      267     1           FIND GARDEN
8.97      43      663     1           Indonesian pieces
8.04      170     1917    6           Reds survived
5.73      15828   51      3           marijuana growing
5.26      680     3846    7           doubt whether
4.76      739     713     1           new converts
1.95      3549    6276    6           like offensive
0.41      14093   762     1           must think

(Manning and Schutze, 1999)

Next Week
Biber et al., Register and Discourse Variations chapter.
