1 Chinese WordSketch Engine Online, corpus-based summaries of word usage.

Presentation transcript:

1 Chinese WordSketch Engine Online, corpus-based summaries of word usage

2 Designers of the original WordSketch Engine for English
Adam Kilgarriff, Lexical Computing, UK
David Tugwell, Tech University Budapest
Pavel Rychly, Brno University

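3 Facing the problem: lexical choice
“You shall know a word by the company it keeps” (Firth, 1957)
The meaning of face depends on the collocation (詞語搭配)
– 學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of word choice)
– 許多種動物正在面臨絕種的問題 (Many animal species are facing the problem of extinction)
Similarly with save
– Save money
– Save life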

4 Look in a dictionary? A corpus?
Some modern English dictionaries give some collocation (詞語搭配) information
– Chinese dictionaries give very limited help
Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available

5 Pre-computer corpus! Oxford English Dictionary: 20 million index cards

6 KWIC Concordance
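
The concordance shown on this slide did not survive in the transcript. As a rough illustration of what a KWIC display does, here is a minimal Python sketch. The function name `kwic`, the `width` parameter and the sample sentence are assumptions made for this example and are not taken from the Sketch Engine.

```python
def kwic(tokens, keyword, width=4):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy text, invented for the example.
text = ("we must save money this year because efforts to save "
        "lives cost money")
rows = kwic(text.split(), "save", width=3)

# Right-align the left context so the keywords line up in one column.
for left, kw, right in rows:
    print(f"{left:>25}  {kw}  {right}")
```

Aligning the keyword in a fixed column is what makes recurring neighbours easy to spot by eye, which is also why the approach stops scaling once a word has thousands of hits.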

7 The coloured pens method
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

8 Limitation of KWIC analysis
As corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, takes a long time
– 5000 lines: no
Instead, create a statistical summary of word usage
– Show the most salient (最有顯著性) collocates (Mutual Information)

9 Mutual Information
Church and Hanks 1989
MI: how much more often a word pair occurs than one would expect by chance
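
The formula itself is missing from the transcript. Church and Hanks (1989) define the measure as I(x, y) = log2( P(x, y) / (P(x) P(y)) ), with the probabilities estimated from corpus counts. The sketch below computes it that way; the function name and the counts in the example are invented for illustration and are not the figures behind the slides.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information in bits.

    f_xy: how often x and y occur together
    f_x, f_y: how often each word occurs on its own
    n: corpus size, in the same units as the counts
    """
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


# Toy counts: the pair occurs 16 times more often than chance predicts.
associated = mutual_information(f_xy=32, f_x=1000, f_y=2000, n=1_000_000)

# Toy counts: the pair occurs exactly as often as chance predicts.
independent = mutual_information(f_xy=2, f_x=1000, f_y=2000, n=1_000_000)
```

A score near 0 means the two words co-occur about as often as chance predicts, and each additional bit means a doubling of the observed rate over the expected rate.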

10 Collocation listing
For right collocates of save (>5 hits)
word        f(x+y)   f(y)
forests     6        170
$
lives       37       1697
enormous    6        301
annually    7        447
jobs        20       2001
money       64       6776
life
dollars     8        1668
costs       7        1719
thousands   6        1481
face        9        2590
estimated   6        2387
your        7        3141

11 Limitations of collocation listing
Some items are not genuine collocates
– your appears only because it is adjacent to save
The collocates can belong to any part of speech
– It would be better if they were classified by POS
– and by the role they play in the sentence
Thus, for arrest in “The police were quick to arrest a number of suspects on the spot”, we would like to see
– Keyword: arrest
– Subject: police
– Object: suspect(s)
– Modifier: on the spot
We would not be especially interested in to, a and number
– These non-collocates just happen to be close to the keyword

12 Wordsketch
Attempts to meet these requirements
A corpus-derived one-page summary of a word’s grammatical and collocational behaviour
Implemented for English and Czech
Chinese and Irish implementations in progress

13 Grammatical relations
Salience is calculated for a keyword and its collocates in particular grammatical configurations
– Police and arrest as subject and verb
– Suspect and arrest as object and verb
– Bank and post office in a coordinative relationship (and/or)
– And others: altogether 27, for English
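
To make the idea concrete, the sketch below scores (relation, keyword, collocate) triples separately within each grammatical relation. It assumes plain mutual information as the salience statistic, following the earlier slides; the statistic actually used by the system may differ. The function name and the triples are invented for illustration.

```python
import math
from collections import Counter


def salience_by_relation(triples):
    """Score every (relation, keyword, collocate) triple with MI,
    estimating all probabilities inside that triple's relation only."""
    triple_f = Counter(triples)
    head_f = Counter((r, k) for r, k, _ in triples)
    coll_f = Counter((r, c) for r, _, c in triples)
    rel_f = Counter(r for r, _, _ in triples)
    scores = {}
    for (r, k, c), f in triple_f.items():
        # Count expected if keyword and collocate were independent.
        expected = head_f[(r, k)] * coll_f[(r, c)] / rel_f[r]
        scores[(r, k, c)] = math.log2(f / expected)
    return scores


# Toy triples, as a parser or pattern grammar might produce them.
triples = (
    [("subject", "arrest", "police")] * 3
    + [("subject", "arrest", "officer")]
    + [("subject", "say", "police")]
    + [("subject", "say", "minister")] * 3
    + [("object", "arrest", "suspect")] * 2
)
scores = salience_by_relation(triples)
```

Because counts are kept per relation, police as the subject of arrest is scored independently of anything that appears as its object, which is what lets a word sketch list subjects, objects and modifiers in separate columns.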

14 The corpus: Chinese Gigaword
A Linguistic Data Consortium corpus
– Very large: over 1 billion characters
– Compiled by David Graff & Ke Chen in 2003
– Minimally tagged
286 newswire stories, half from each of:
– CNA Taiwan (740 million traditional characters)
– Xinhua PRC (380 million simplified characters)
Corpus was segmented and tagged using Academia Sinica tools

15 逮捕 (arrest), 教 (teach), 學習 (learn), 銀行 (bank), 捉 (catch)

22 Grammar: examples of current constraints
Object relation
– 1:trans_verb particle? any_noun? "DE"? any_noun? 2:common_noun
And/or relation
– 1:any_noun listcomma 2:any_noun & 1.tag = 2.tag
– 1:any_noun conj 2:any_noun & 1.tag = 2.tag
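
One way to read the object relation above is as a regular expression over the sequence of POS tags, where 1: and 2: mark the two words that enter the relation. The sketch below applies that reading to a tagged sentence. The tag names, the one-letter codes and the sample sentence are assumptions made for this example; the real grammar runs over the Academia Sinica tagset and is not reproduced here.

```python
import re

# Hypothetical tag names mapped to one-letter codes, so that a regex
# position corresponds directly to a token position.
CODE = {"trans_verb": "V", "particle": "P", "DE": "D",
        "common_noun": "n", "proper_noun": "N"}

# 1:trans_verb particle? any_noun? "DE"? any_noun? 2:common_noun
# any_noun is taken to mean common or proper noun (an assumption).
OBJECT_RE = re.compile(r"(V)P?[nN]?D?[nN]?(n)")


def object_pairs(tagged):
    """Return (verb, object) word pairs matched by the object pattern."""
    codes = "".join(CODE.get(tag, "x") for _, tag in tagged)
    pairs = []
    for m in OBJECT_RE.finditer(codes):
        pairs.append((tagged[m.start(1)][0], tagged[m.start(2)][0]))
    return pairs


# "The police arrested the bank's clerk" (toy sentence, hand-tagged).
sentence = [("警方", "common_noun"), ("逮捕", "trans_verb"),
            ("了", "particle"), ("銀行", "common_noun"),
            ("的", "DE"), ("職員", "common_noun")]
pairs = object_pairs(sentence)
```

The optional slots let the pattern skip over a modifying noun and 的, so the verb is paired with the head noun 職員 rather than with 銀行.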

23 Functions
KWIC concordance
– Sorting, filtering etc
Word sketch
Automatic thesaurus
Sketch difference
– discriminate near-synonyms
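
The slides do not say how a sketch difference is computed. A simple way to picture it, assuming each word sketch has been reduced to a set of salient collocates for one grammatical relation, is to separate the collocates two near-synonyms share from those found with only one of them. The function name and the collocate sets below are made up for illustration; they are not corpus results.

```python
def sketch_difference(collocates_a, collocates_b):
    """Split collocates into shared, only-with-a and only-with-b lists."""
    a, b = set(collocates_a), set(collocates_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)


# Toy object collocates for two verbs that both translate "face".
mian_dui = {"問題", "現實"}   # 面對: problem, reality
mian_lin = {"問題", "絕種"}   # 面臨: problem, extinction

shared, only_dui, only_lin = sketch_difference(mian_dui, mian_lin)
```

The collocates that appear on only one side are the ones a learner needs in order to choose between the two words.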

25 Further work
Annotate Gigaword corpus with the Academia Sinica (AS) semantic tagset
Improve grammatical relations, especially sentence objects, to account for
– topicalization (啤酒, 葡萄酒, 他都愛喝: "Beer, wine, he loves to drink them all")
– 把 fronting (請把啤酒喝完: "Please finish the beer")
Create “Dr Eye” style interface, to show common collocations online, in a text