Chinese WordSketch Online, corpus-based summaries of word usage.

Presentation transcript:

1 Chinese WordSketch Online, corpus-based summaries of word usage

2 Participants
Adam Kilgarriff, Lexical Computing, UK
David Tugwell, Tech University Budapest
Pavel Rychly, Brno University
Simon Smith, Ming Chuan University (Academia Sinica)
黃居仁 (Chu-Ren Huang), Academia Sinica
巫宜靜, Tsing Hua University (Academia Sinica)

3 Facing the problem: lexical choice
“You shall know a word by the company it keeps” (Firth, 1957)
The meaning of face depends on the collocation (詞語搭配)
  – 學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of lexical choice)
  – 許多種動物正在面臨絕種 (Many animal species are facing extinction)
Similarly with save
  – Save money
  – Save life
  – Save a seat for me

4 Look in a dictionary? A corpus?
Some modern English dictionaries give some collocation (詞語搭配) information
  – Chinese dictionaries give very limited help
Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available

5 Pre-computer corpus! Oxford English Dictionary: 20 million index cards

6 KWIC Concordance
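As an illustration of what a KWIC concordance shows, here is a minimal Python sketch. It assumes a whitespace-tokenized English sentence; the function name `kwic` and the sample text are made up for this example and are not part of the Sketch Engine.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            # Take up to `width` tokens on either side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, for illustration only.
text = "we must save money now so that we can save a seat for her"
for left, kw, right in kwic(text.split(), "save"):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20}  {kw}  {right}")
```

Aligning the keyword in a single column is what lets a reader scan the left and right contexts for recurring patterns.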

7 The coloured pens method
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

8 Limitation of KWIC analysis
As corpora get bigger: too much data
  – 50 lines for a word: read all
  – 500 lines: could read all, takes a long time
  – 5000 lines: no
Instead, create a statistical summary of word usage
  – Show the most salient (最有顯著性) collocates (Mutual Information)

9 Mutual Information
Church and Hanks 1989
MI: how much more often does a word pair occur than one might expect by chance:
I(x, y) = log2( P(x, y) / (P(x) P(y)) )
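A minimal Python sketch of the Church and Hanks measure computed from corpus counts, assuming each probability is estimated as a frequency divided by the corpus size N. The function name and the counts in the usage lines are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """log2 of the observed pair probability over the chance expectation.

    f_xy: co-occurrence count of the pair (x, y)
    f_x, f_y: counts of x and y on their own
    n: corpus size in tokens
    """
    # P(x, y) / (P(x) * P(y)) with P estimated as count / n
    return math.log2((f_xy * n) / (f_x * f_y))


# Hypothetical counts: the pair is seen 8 times, each word 16 times,
# in a 64-token corpus. By chance we would expect 16 * 16 / 64 = 4
# co-occurrences, so the pair occurs twice as often as expected.
print(mutual_information(8, 16, 16, 64))
```

A score of 0 means the pair occurs exactly as often as chance predicts; positive scores mark candidate collocations.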

10 Collocation listing
For right collocates of save (>5 hits)

word       f(x+y)  f(y)     word       f(x+y)  f(y)
forests    6       170      life
$                           dollars    8       1668
lives      37      1697     costs      7       1719
enormous   6       301      thousands  6       1481
annually   7       447      face       9       2590
jobs       20      2001     estimated  6       2387
money      64      6776     your       7       3141
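A listing like this one can be produced by counting which words follow the node word. The sketch below is an assumption about the general approach, not the procedure used for the slide: it takes a pre-tokenized corpus, counts only the immediately adjacent right-hand word, and uses a made-up toy corpus. The slide's ">5 hits" cutoff would correspond to `min_hits=6`.

```python
from collections import Counter


def right_collocates(tokens, node, min_hits=1):
    """List (word, f_xy, f_y) for words that directly follow `node`.

    f_xy: how often the word follows `node`
    f_y: how often the word occurs in the corpus overall
    """
    unigrams = Counter(tokens)
    # Count the token immediately to the right of each occurrence of `node`.
    pairs = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)
    return [
        (word, f_xy, unigrams[word])
        for word, f_xy in pairs.most_common()
        if f_xy >= min_hits
    ]


# Hypothetical toy corpus, for illustration only.
corpus = "save money save money save face money".split()
for word, f_xy, f_y in right_collocates(corpus, "save"):
    print(word, f_xy, f_y)
```

Because the count is purely positional, any word that happens to sit next to the node is listed, whatever its grammatical role.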

11 Limitations of collocation listing
Some items are not genuine collocates
  – your appears only because it is adjacent to save
The collocates can belong to any part of speech
  – It would be better if they were classified by POS
  – and by the role they play in the sentence
Thus, for arrest in “The police were quick to arrest a number of suspects on the spot”, we would like to see
  – Keyword: arrest
  – Subject: police
  – Object: suspect(s)
  – Modifier: on the spot

12 Wordsketch
Attempts to meet these requirements
A corpus-derived one-page summary of a word’s grammatical and collocational behaviour
Implemented for English and Czech
Chinese and Irish implementations in progress

13 The corpus: Chinese Gigaword
A Linguistic Data Consortium corpus
  – Very large: over 1 billion characters
  – Compiled by David Graff & Ke Chen in 2003
  – Minimally tagged
286 newswire stories, half from each of:
  – CNA Taiwan (740 million traditional characters)
  – Xinhua PRC (380 million simplified characters)
Corpus was segmented and tagged using Academia Sinica tools

14 逮捕 (arrest), 教 (teach), 學習 (learn), 銀行 (bank), 捉 (catch)

21 Functions
KWIC concordance
  – Sorting, filtering etc
Word sketch
Automatic thesaurus
Sketch difference
  – discriminate near-synonyms
In development
  – key words in a subcorpus / text type
  – how word varies with text type

23 Grammar writing
Uses CQL (Corpus Query Language)
  – Christ and Schulze, U. Stuttgart, 1994
Defining an object:
  v (adj|n|det|num|adv)* n
Rewriting in CQL with BNC/CLAWS-5 tags:
  [tag="VV.*"] [tag="(A[JTV]|D|O).*"]* [tag="NN.*"]
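To make the object pattern concrete, here is a minimal Python sketch that emulates the three-part query over a list of (word, tag) pairs. It is not a CQL engine: the function name `find_objects` and the tagged sample sentence are hypothetical, and `re.fullmatch` is used on the assumption that each CQL regular expression must match the whole tag.

```python
import re

# One regex per position, in the same order as the query on the slide.
VERB = re.compile(r"VV.*")
FILLER = re.compile(r"(A[JTV]|D|O).*")
NOUN = re.compile(r"NN.*")


def find_objects(tagged):
    """Return (verb, noun) pairs matching VERB FILLER* NOUN."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if not VERB.fullmatch(tag):
            continue
        j = i + 1
        # Skip any run of adjectives, articles, adverbs, determiners, ordinals.
        while j < len(tagged) and FILLER.fullmatch(tagged[j][1]):
            j += 1
        # The first token after the run must be a noun.
        if j < len(tagged) and NOUN.fullmatch(tagged[j][1]):
            pairs.append((word, tagged[j][0]))
    return pairs


# Hypothetical CLAWS-5-tagged sentence, for illustration only.
sentence = [("They", "PNP"), ("saved", "VVD"), ("the", "AT0"),
            ("old", "AJ0"), ("forests", "NN2")]
print(find_objects(sentence))
```

The filler tags and the noun tags cannot overlap, so consuming the filler run greedily finds the same matches as the starred CQL expression.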

24 Further work
Improve grammatical relations, especially sentence objects, to account for
  – topicalization (啤酒, 葡萄酒, 他都愛喝: “Beer, wine, he loves to drink them all”)
  – 把 fronting (請把啤酒喝完: “Please finish the beer”)
Create “Dr Eye” style interface, to show common collocations online, in a text

25 English version available
For personal use
  – You are welcome to register and make good use of it!