Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia.

1 Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia January 19, 2001, Tokyo, Japan

2 Languages of Concern -- Modern Mandarin Chinese -- Archaic, Ancient, and Near Modern Chinese (the diachronic record of three thousand years of Chinese) -- Formosan Languages (endangered; one of the richest branches of the Austronesian languages)

3 Sharable Resources for Chinese Computational Linguistics Corpora Lexicons Procedures http://rocling.iis.sinica.edu.tw/ROCLING/

4 Sharable Resources for Chinese Computational Linguistics--Corpora -Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus) -Sinica Treebank -Standard Segmentation Corpus -ROCLING Corpus -Mandarin-Across-Taiwan (MAT) Speech Database

5 Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus) 5 million words, segmented and tagged Direct WWW Access - http://www.sinica.edu.tw/~tibe/2-words/modern-words/index.html OR - http://www.sinica.edu.tw/ftms-bin/kiwi.sh License Information - http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm

6 Sinica Treebank 1.0 38,725 Trees 239,532 Words Direct WWW Access (1000 sample trees) http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm License Information http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm

7 Mandarin-Across-Taiwan (MAT) Speech Database Speech files are collected through telephone networks. The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences). MAT-160 (160 speakers) - http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm MAT-2000 - http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm

8 A Database of Chinese Characters (i.e. Kanji) For each character: the Component Composition (部件組成) information is important. Over 10,000 Components (部件) have been identified for Chinese, roughly 2,000 of them productive. http://www.dmpo.sinica.edu.tw:8000/~words/sou/sou.html -- optional: radicals, number of strokes, variants
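
As a rough illustration of the kind of record the slide above describes, here is a minimal Python sketch of a character entry with a required component composition and optional radical, stroke count, and variants. The class name, field names, and the three sample entries are hypothetical and chosen only for illustration; they are not taken from the Academia Sinica database or its actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CharacterEntry:
    """Hypothetical record layout; not the actual database schema."""
    character: str
    components: list[str]              # component composition (部件組成)
    radical: Optional[str] = None      # optional field
    strokes: Optional[int] = None      # optional field
    variants: list[str] = field(default_factory=list)  # optional field


# Illustrative sample entries only.
ENTRIES = [
    CharacterEntry("明", ["日", "月"], radical="日", strokes=8),
    CharacterEntry("好", ["女", "子"], radical="女", strokes=6),
    CharacterEntry("林", ["木", "木"], radical="木", strokes=8),
]


def build_component_index(entries):
    """Map each component to the characters that contain it."""
    index = {}
    for entry in entries:
        # set() so a repeated component (木 in 林) lists the character once.
        for component in set(entry.components):
            index.setdefault(component, []).append(entry.character)
    return index


INDEX = build_component_index(ENTRIES)
print(INDEX["日"])
```

An inverted index of this kind is one simple way to look up all characters that share a given component.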

9 Sharable Resources for Chinese Computational Linguistics -- Procedures Segmentation Standard for Chinese Language Processing Segmentation Standard http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm Standard Segmentation Corpus (2 million words, segmented) http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segcorp_E.htm Standard Segmentation Lexicon (42,138 entries, w/ frequency) http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segdic_E.htm Segmentation Program (free download) http://godel.iis.sinica.edu.tw/CKIP/ws/
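
To show what lexicon-based word segmentation looks like in practice, the following is a minimal Python sketch of greedy forward maximum matching. It is a generic textbook technique, not the method used by the Segmentation Program listed above, and it ignores word frequencies. The tiny lexicon is a made-up stand-in for the Standard Segmentation Lexicon.

```python
# Hypothetical toy lexicon; the real Standard Segmentation Lexicon
# has 42,138 entries with frequencies, which this sketch does not use.
LEXICON = {"語言", "資源", "語言資源", "台灣", "的"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon word that matches;
    a character with no match becomes a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


print(segment("台灣的語言資源"))
```

Greedy matching prefers the longest entry at each position, so "語言資源" is kept as one word here rather than split into "語言" and "資源"; a segmentation standard is what decides which of those analyses counts as correct.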

10 Sharable Resources in Languages Other than Modern Mandarin Classical Chinese Corpora http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html Corpus of Formosan Austronesian Languages (under construction, part of the National Digital Archive Initiative) Lexical Databases of other Sino-Tibetan and Tibeto-Burman Languages

11 Synchronic and Diachronic Chinese Corpora Three Projects Sponsored by the CCK Foundation (1990-1995) Chu-Ren Huang, Keh-jiann Chen and Pei-chuan Wei, Academia Sinica Paul Thompson, SOAS, University of London Chaofen Sun, Stanford University

12 Mechanisms for Scholarly Exchange and Collaboration Department of International Programs, NSC http://www.nsc.gov.tw/int/2_cooperation/index_02.html Canada: NRC France: CNRS Japan: EAACST Germany: DFG, DAAD, DKFG Netherlands: NWO, IIAS USA: NSF, NIH UK: Royal Society of London, ETC

13 Other Resources in our area: Singapore (K.T. Lua) Consortium of Asian Language Resources http://cslp.comp.nus.edu.sg/cslp/index.htm -- Last updated Oct. 1999 -- Contains detailed information on about 50 (mostly Chinese) linguistic resources, including comprehensive reviews as well as license information

14 Other Resources: HowNet: An attribute-based Semantic Network (Dong Zhengdong) http://www.keenage.com

15 Future 1. Linguistic Ontology: Wordnets --Bi- or Multi-lingual Wordnets in EuroWordNet style --Collaboration among Chinese speaking communities (Academia Sinica, City University of Hong Kong, Peking University)

16 Future 2. Language Archives under the Digital Archive National Project -- Digital Archive Initiatives started in 2001 -- The Language Resource Project (PI: Huang) includes 3 corpus projects: 20th Century Taiwan Mandarin; Near Modern Chinese (17th-18th Century); pilot project on Formosan language corpora -- Expected to become a National Project in 2002

17 Future 3. A universal and sharable scheme for encoding Chinese characters 4. Join the Open Language Archives Community (OLAC) http://www.language-archives.org 5. Participation in and conformance to International Standards for Language Engineering (ISLE)

