# How dominant is the commonest sense of a word? Adam Kilgarriff Lexicography MasterClass Univ of Brighton.


How dominant is the commonest sense of a word? Adam Kilgarriff Lexicography MasterClass Univ of Brighton

What do you think? (zero-freq senses don’t count)

The WSD task: select the correct sense in context. The sense inventory is given in a dictionary. An old problem; corpus methods are best.

Lower bound (Gale, Church and Yarowsky 1992). Baseline system: always choose the commonest sense. Around 70%, but only a small sample was available. SEMCOR: bigger sample, still too small. SENSEVAL: big problem.
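The baseline on this slide can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind the talk: the function name `mfs_baseline_accuracy` and the toy sense-tagged data are hypothetical, and the commonest sense is found and scored on the same data, which flatters the baseline.

```python
from collections import Counter


def mfs_baseline_accuracy(tagged):
    """Accuracy of always choosing each word's commonest sense.

    `tagged` is a list of (word, sense) pairs from a sense-tagged sample.
    """
    senses_by_word = {}
    for word, sense in tagged:
        senses_by_word.setdefault(word, Counter())[sense] += 1
    # For each word, the baseline gets exactly the commonest sense's tokens right.
    correct = sum(max(counts.values()) for counts in senses_by_word.values())
    return correct / len(tagged)


# Hypothetical toy sample: 'pike' split 3:1, 'bank' split 4:2.
toy_sample = (
    [("pike", "fish")] * 3
    + [("pike", "weapon")] * 1
    + [("bank", "money")] * 4
    + [("bank", "river")] * 2
)
toy_accuracy = mfs_baseline_accuracy(toy_sample)
```

On this toy sample the baseline gets 7 of 10 tokens right. A fair evaluation would take the commonest sense from training data and score on held-out data.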

Overview: mathematical model; evaluation (against SEMCOR); implications for WSD evaluation.

Model: assumptions. Meanings are unrelated. The word sense frequency distribution is the same as the word frequency distribution.

Model. Put all k word senses in a bag. Randomly select 2 to form a 2-sense word. There are k(k-1)/2 possible 2-sense words.

Set the frequency. For a 2-sense word with freq 101, possibilities include a 100:1 split (how many times?) and a 50:51 split (how many times?).

Words to model word senses. Use Brown or the BNC. Count how many types there are for each frequency. Smooth the counts so that they decrease monotonically.

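Number of words having each frequency (Freq), raw and smoothed, for Brown and the BNC:

| Freq | Brown raw | Brown smooth | BNC raw | BNC smooth |
|---|---|---|---|---|
| 1 | 16278 | | 486507 | |
| 2 | 6097 | | 123633 | |
| … | | | | |
| 50 | 43 | 43.13 | 742 | 700.45 |
| 51 | 47 | 41.86 | 688 | 679.45 |
| … | | | | |
| 100 | 10 | 11.03 | 262 | 244.37 |
| … | | | | |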

Using Brown frequencies. 100:1 split, how many times? 16278 × 11.03 = 179,546. 50:51 split, how many times? 43.13 × 41.86 = 1805. Ratio 179,546:1805 ≈ 99, so a 100:1 split is 99 times likelier than a 50:51 split.
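The arithmetic on this slide can be checked directly. The sketch below uses only the four Brown counts quoted in the talk; the names `brown_counts` and `pair_count` are introduced here for illustration.

```python
# Number of Brown word types at each frequency, as quoted on the slides
# (smoothed values for frequencies 50, 51 and 100).
brown_counts = {1: 16278, 50: 43.13, 51: 41.86, 100: 11.03}


def pair_count(counts, freq_a, freq_b):
    """Number of ways to pick one sense of freq_a and one of freq_b (freq_a != freq_b)."""
    return counts[freq_a] * counts[freq_b]


skewed = pair_count(brown_counts, 100, 1)     # 100:1 splits of a freq-101 word
balanced = pair_count(brown_counts, 50, 51)   # 50:51 splits of a freq-101 word
ratio = skewed / balanced

print(f"100:1 pairs: {skewed:,.0f}")
print(f"50:51 pairs: {balanced:,.0f}")
print(f"ratio: {ratio:.1f}")
```

The ratio comes out just above 99, matching the slide.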

Generalising. For a 2-sense word with fr = n: select the ‘commonest’ sense with fr = m, where n/2 < m < n; select the other sense from the subset where fr = n-m; find all possible selections. Calculate the average ratio, commonest:other. This answers the title question.
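The generalisation can be sketched as a weighted average over all admissible splits. This is one reading of the slide, not the talk's own code: the result is expressed as the commonest sense's share m/n, the even split m = n/2 is left out (following the strict inequality above), and `toy_spectrum` is a made-up Zipf-like stand-in for the smoothed Brown or BNC counts.

```python
def commonest_share(n, spectrum):
    """Expected share of the commonest sense for a 2-sense 'word' of frequency n.

    `spectrum(f)` returns the (smoothed) number of items having frequency f.
    Each split m:(n-m) is weighted by the number of ways it can be drawn.
    """
    if n < 3:
        raise ValueError("need n >= 3 so that some m satisfies n/2 < m < n")
    total_weight = 0.0
    weighted_share = 0.0
    for m in range(n // 2 + 1, n):            # n/2 < m < n
        weight = spectrum(m) * spectrum(n - m)
        total_weight += weight
        weighted_share += weight * (m / n)
    return weighted_share / total_weight


def toy_spectrum(freq):
    """Hypothetical smoothed spectrum: counts fall off as 1/freq^2 (an assumption)."""
    return 1_000_000 / freq ** 2


for n in (10, 25, 50, 100):
    print(n, round(100 * commonest_share(n, toy_spectrum), 1))
```

With a real corpus, `spectrum` would look up the smoothed count for each frequency instead of using a formula.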

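Model: answers (BNC). Percentage of occurrences taken by the commonest sense, for ‘words’ of frequency n:

| n | 2-sense ‘words’ | 3-sense ‘words’ | 4-sense ‘words’ |
|---|---|---|---|
| 10 | 83.2 | 58.9 | 40.0 |
| 25 | 88.9 | 74.2 | 58.2 |
| 50 | 92.3 | 81.8 | 69.1 |
| 100 | 94.6 | 87.0 | 77.1 |
| 200 | 96.2 | 90.7 | 83.1 |
| 500 | 97.6 | 94.2 | 89.1 |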

SEMCOR: a 250,000-word corpus, manually sense-tagged with WordNet senses.

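Evaluate model against SEMCOR. # = number of SEMCOR words in the class; % = observed share of the commonest sense; BNC = model prediction.

| n | 2-sense words: # | % | BNC | 3-sense words: # | % | BNC |
|---|---|---|---|---|---|---|
| 10 | 55 | 73.6 | 83.2 | 41 | 64.3 | 58.9 |
| 25-class | 96 | 79.8 | 88.9 | 70 | 68.1 | 74.2 |
| 50-class | 45 | 83.1 | 92.3 | 59 | 72.4 | 81.8 |
| 100-class | 16 | 79.4 | 94.6 | 24 | 77.8 | 87.0 |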

Discussion. Same trend. The assumption (meanings unrelated) is untrue. SFIP principle: a reading must be sufficiently frequent and insufficiently predictable to get into a dictionary. generous vs pike. generous: donation/person/helping. pike: fish or weapon or hill or turnpike.

Discussion. More data, more meanings (without end): not changing ratios for known senses, but the addition of new senses. The model fits pike, not generous. Dominated by singletons.

SENSEVAL: evaluation exercise for WSD (1998; 2001; 2004). Two task types. Lexical sample: choose a small sample of words and disambiguate multiple instances of each. All-words: choose a text or two and disambiguate all words.

Lower bound and SENSEVAL. All-words: samples too small to see the extent of skew (freq of a 2-sense word = 3: lower bound = 67%). Lexical sample: skew in manual sample selection; “good” candidate words show “balance” (amazing). Are systems better than baseline? SENSEVAL-3: systems scarcely beat baseline. Not proven (and not likely).
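The 67% figure follows from counting alone: with 3 occurrences and 2 attested senses the only possible split is 2:1. A minimal sketch of that floor for any sample size, where the function name `min_commonest_share` is introduced here and every sense is assumed to occur at least once (zero-frequency senses do not count):

```python
def min_commonest_share(n, k):
    """Smallest possible share of the commonest sense.

    n: occurrences of the word in the sample.
    k: senses attested in the sample (each at least once, so n >= k).
    The most balanced split gives the commonest sense ceil(n / k) occurrences.
    """
    if k < 1 or n < k:
        raise ValueError("need 1 <= k <= n")
    commonest = -(-n // k)  # integer ceiling of n / k
    return commonest / n


for n in (3, 5, 11, 101):
    print(n, f"{100 * min_commonest_share(n, 2):.0f}%")
```

For small samples this floor sits well above 50%, so an observed baseline score from such a sample says little about how skewed the underlying sense distribution is.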

What is the commonest sense? Varies with domain. More mileage than disambiguation; cf. the default strategy in commercial MT. McCarthy, Koeling, Weeds and Carroll (ACL-04). A 3-sentence window does not allow domain-identification methods. The domain-id task is more interesting and worthwhile than WSD.

Thank you
