1 1 Chinese WordSketch Engine Online, corpus-based summaries of word usage

2 2 Designers of the original WordSketch Engine for English Adam Kilgarriff, Lexical Computing, UK David Tugwell, Tech University Budapest Pavel Rychly, Brno University

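3 3 Facing the problem: lexical choice
“You shall know a word by the company it keeps” (Firth, 1957)
The meaning of face depends on the collocation (詞語搭配)
– 學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of word choice)
– 許多種動物正在面臨絕種的問題 (Many animal species are facing the problem of extinction)
Similarly with save
– Save money
– Save life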

4 4 Look in a dictionary? A corpus? Some modern English dictionaries give some collocation ( 詞語搭配 ) information –Chinese dictionaries give very limited help Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available

5 5 Pre-computer corpus! Oxford English Dictionary: 20 million index cards

6 6 KWIC Concordance
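A KWIC concordance lists every occurrence of a keyword with a few words of context on each side. The following is a minimal sketch only, not the engine's implementation: the function name `kwic`, the `width` parameter and the toy sentence are all hypothetical, and the text is assumed to be already tokenized on whitespace.

```python
def kwic(tokens, keyword, width=4):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy text (invented for illustration), tokenized on whitespace.
text = "the party voted to save money while the other party tried to save jobs"
tokens = text.split()

for left, kw, right in kwic(tokens, "save", width=3):
    # Right-align the left context so the keyword forms a vertical column.
    print(f"{left:>25} [{kw}] {right}")
```

Sorting the triples by the right or left context groups similar usages together, which is what makes a concordance readable at a glance.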

7 7 1 political association 4 person in an agreement/dispute 2 social event 5 to be party to something... 3 group of people The coloured pens method

8 8 Limitation of KWIC analysis
As corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, takes a long time
– 5000 lines: no
Instead, create a statistical summary of word usage
– Show most salient (最有顯著性) collocates (Mutual Information)

9 9 Mutual Information (Church and Hanks 1989)
MI: how much more often does a word pair occur than one would expect by chance: $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
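As a rough illustration of the measure, the sketch below estimates each probability as a count divided by corpus size and takes the base-2 log of the ratio P(x, y) / (P(x) P(y)). The function name and the counts are invented for the example; they are not figures from the slides.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information from raw counts.

    f_xy: co-occurrence count of x and y
    f_x, f_y: individual counts of x and y
    n: corpus size in tokens
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    # (f_xy / n) / ((f_x / n) * (f_y / n)) simplifies to f_xy * n / (f_x * f_y)
    return math.log2(f_xy * n / (f_x * f_y))


# Hypothetical counts: the pair occurs 16 times more often than chance predicts.
print(mutual_information(f_xy=8, f_x=16, f_y=32, n=1024))   # 4.0

# A pair that co-occurs exactly as often as chance predicts scores 0.
print(mutual_information(f_xy=1, f_x=32, f_y=32, n=1024))   # 0.0
```

Positive scores mean the pair occurs together more often than independence would predict; scores near zero mean the pairing is unremarkable.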

10 10 Collocation listing
For right collocates of save (>5 hits)

word        f(x+y)  f(y)
forests          6   170
$1.2             6   180
lives           37  1697
enormous         6   301
annually         7   447
jobs            20  2001
money           64  6776
life            36  4875
dollars          8  1668
costs            7  1719
thousands        6  1481
face             9  2590
estimated        6  2387
your             7  3141
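A listing of this kind can be produced by counting, for each word y, how often it follows the keyword (the f(x+y) column) and how often it occurs overall (the f(y) column). The sketch below is one possible reading, assuming "right collocate" means the immediately following token; the function name, the `min_hits` parameter and the toy token list are hypothetical.

```python
from collections import Counter


def right_collocates(tokens, keyword, min_hits=1):
    """Return (word, f_xy, f_y) rows for words directly following keyword."""
    f_y = Counter(tokens)
    f_xy = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == keyword)
    rows = [(w, c, f_y[w]) for w, c in f_xy.items() if c >= min_hits]
    # Most frequent pairs first; ties broken alphabetically.
    return sorted(rows, key=lambda r: (-r[1], r[0]))


# Toy corpus, invented for illustration.
tokens = "save money save money save lives spend money save your lives".split()

for word, pair_count, word_count in right_collocates(tokens, "save"):
    print(f"{word:<8}{pair_count:>4}{word_count:>6}")
```

Setting `min_hits=6` would correspond to the ">5 hits" cut-off used for the listing.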

11 11 Limitations of collocation listing
Some items are not genuine collocates
– your appears only because it is adjacent to save
The collocates can belong to any part of speech
– It would be better if they were classified by POS
– and by the role they play in the sentence
Thus, for arrest in “The police were quick to arrest a number of suspects on the spot”, we would like to see
– Keyword: arrest
– Subject: police
– Object: suspect(s)
– Modifier: on the spot
We would not be especially interested in to, a and number
– These non-collocates happen to be close to the keyword

12 12 Wordsketch Attempts to meet these requirements A corpus-derived one-page summary of a word’s grammatical and collocational behaviour Implemented for English and Czech Chinese and Irish implementations in progress

13 13 Grammatical relations Salience is calculated for a keyword and its collocates in particular grammatical configurations –Police and arrest as subject and verb –Suspect and arrest as object and verb –Bank and post office in a coordinative relationship (and/or) –And others: altogether 27, for English

14 14 The corpus: Chinese Gigaword A Linguistic Data Consortium corpus –Very large: over 1 billion characters –Compiled by David Graff & Ke Chen in 2003 –Minimally tagged 286 newswire stories, half from each of: –CNA Taiwan (740 million traditional characters) –Xinhua PRC (380 million simplified characters) Corpus was segmented and tagged using Academia Sinica tools

15 15 http://corpora.fi.muni.cz/chinese/ Example keywords: 逮捕 (arrest), 教 (teach), 學習 (learn), 銀行 (bank), 捉 (catch)

22 22 Grammar: examples of current constraints
Object relation
    1:trans_verb particle? any_noun? "DE"? any_noun? 2:common_noun
And/or relation
    1:any_noun listcomma 2:any_noun & 1.tag = 2.tag
    1:any_noun conj 2:any_noun & 1.tag = 2.tag
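To make the constraint notation concrete, here is a small sketch of how the second and/or pattern (a noun, a conjunction, then a noun with the same tag) might be matched against a POS-tagged sentence. It is an illustration, not the engine's matcher: the function name is hypothetical, and the tag names (noun tags starting with "N", "Caa" for a coordinating conjunction) are assumptions about the tagset.

```python
def and_or_pairs(tagged, conj_tags=("Caa",)):
    """Find (noun1, noun2) pairs matching: 1:any_noun conj 2:any_noun & 1.tag = 2.tag.

    tagged: list of (word, tag) tuples for one sentence.
    Assumption: any tag starting with "N" counts as a noun tag.
    """
    pairs = []
    # Slide a three-token window over the sentence.
    for (w1, t1), (_, tc), (w2, t2) in zip(tagged, tagged[1:], tagged[2:]):
        both_nouns = t1.startswith("N") and t2.startswith("N")
        if both_nouns and tc in conj_tags and t1 == t2:
            pairs.append((w1, w2))
    return pairs


# "Banks and post offices are both closed" (tags are illustrative assumptions).
sentence = [("銀行", "Nc"), ("和", "Caa"), ("郵局", "Nc"), ("都", "D"), ("關門", "VA")]
print(and_or_pairs(sentence))

# Nouns with different tags do not satisfy the 1.tag = 2.tag condition.
mixed = [("銀行", "Nc"), ("和", "Caa"), ("警察", "Na")]
print(and_or_pairs(mixed))
```

The optional elements in the object relation (`particle?`, `any_noun?`) would need a real pattern matcher rather than a fixed window; this sketch covers only the fixed-length case.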

23 23 Functions KWIC concordance –Sorting, filtering etc Word sketch Automatic thesaurus Sketch difference –discriminate near-synonyms

25 25 Further work
Annotate Gigaword corpus with AS semantic tagset
Improve grammatical relations, especially sentence objects, to account for
– topicalization (啤酒, 葡萄酒, 他都愛喝 “Beer, wine, he loves to drink them all”)
– 把 fronting (請把啤酒喝完 “Please finish the beer”)
Create “Dr Eye” style interface, to show common collocations online, in a text.

