Jing-Shin Chang National Chi Nan University, TAIWAN @ IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities (2011-2013) & Text Corpora
ACLCLP: The Association for Computational Linguistics and Chinese Language Processing. Annual meeting: once a year, in September. Journal: International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP), four issues per year. Membership: 222 individual members, 10 corporate members.
2013 Academic Activities: PACLIC 27 Workshop on Computer-Assisted Language Learning, Taipei, Taiwan, November 21, 2013; ROCLING 2013: The 25th Conference on Computational Linguistics and Speech Processing, Kaohsiung, Taiwan, October 4-5, 2013; Seminar on Next Generation Automatic Speech Recognition (NGASR), Taipei, Taiwan, September 6, 2013; 2013 Speech Signal Processing Workshop, Chung-Li, Taiwan, June 30, 2013; Information Retrieval Workshop 2013, Taipei, Taiwan, December 2013 (tentative).
2012 Academic Activities: ROCLING 2012: The 24th Conference on Computational Linguistics and Speech Processing, Chung-Li, Taiwan, September 21-22, 2012; Seminar on an NLP Day, Taipei, Taiwan, March 13, 2012; 2012 Speech Signal Processing Workshop, Hsinchu, Taiwan, June 30, 2012.
2011 Academic Activities: ROCLING 2011: The 23rd Conference on Computational Linguistics and Speech Processing, Taipei, Taiwan, September 8-9, 2011; Workshop on Corpus and Translation, Taipei, Taiwan, October 29, 2011; Oriental COCOSDA 2011: 2011 International Conference on Speech Database and Assessments, Hsinchu, Taiwan, October 26-28, 2011; 2011 Short Course on Digital Speech Processing and Applications, Taipei, Taiwan, June 28 & June 30, 2011; 2011 Speech Signal Processing Workshop: Workshop on Speech Processing Technology and Application, Taipei, Taiwan, June 29, 2011; Seminar on Next Generation Automatic Speech Recognition (NGASR), Taipei, Taiwan, January 27, 2011, and Taipei, Taiwan, July 12, 2011; Information Retrieval Workshop 2011, Taipei, Taiwan, December 28, 2011.
Corpus Program The Corpus Program is developed and maintained by the CKIP group at Academia Sinica; the News Corpus includes 14 million words. The CKIP group began collecting Chinese texts in 1990, mainly from newspapers and magazines. Over the years, this project has been funded by the CCK Foundation for International Scholarly Exchange, the National Science Council of R.O.C., and Academia Sinica at various stages.
Academia Sinica Balanced Corpus The Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 4.0) is open to the research community through the WWW (http://www.sinica.edu.tw/SinicaCorpus/). The size of this corpus is ten million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are assigned in a hierarchy. Subcorpora can be defined with a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society. Each segmented word is tagged with its part-of-speech. Linguistic patterns and language structures can be extracted from the tagged corpus via a corpus inspection program that can filter the data, generate statistics, sort, and identify collocations.
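The attribute-based subcorpus selection and collocation counting described above can be sketched roughly as follows. This is only an illustration, not the actual corpus inspection program: the `Text` class, the `subcorpus` and `bigram_counts` helpers, the attribute values, and the sample tokens and tags are all invented for this example, and raw adjacent-pair counts stand in for whatever collocation measure the real tool uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Text:
    # The five classification criteria named on the slide.
    # All values used below are invented placeholders.
    genre: str
    style: str
    mode: str
    topic: str
    source: str
    tokens: list  # list of (word, part_of_speech) pairs


def subcorpus(texts, **attrs):
    """Keep only the texts whose attributes match every given key/value."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in attrs.items())]


def bigram_counts(texts):
    """Count adjacent word pairs within each text (a crude collocation proxy)."""
    counts = Counter()
    for t in texts:
        words = [word for word, _pos in t.tokens]
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical sample data; the tags are illustrative only.
texts = [
    Text("news", "narrative", "written", "politics", "newspaper",
         [("我們", "Nh"), ("討論", "VE"), ("語料庫", "Na")]),
    Text("essay", "narrative", "written", "literature", "magazine",
         [("我們", "Nh"), ("喜歡", "VK"), ("語料庫", "Na")]),
    Text("news", "narrative", "written", "sports", "newspaper",
         [("我們", "Nh"), ("討論", "VE"), ("比賽", "Na")]),
]

news = subcorpus(texts, genre="news", mode="written")
counts = bigram_counts(news)
```

Here `news` holds the two texts marked as written news, and `counts.most_common()` lists the word pairs in that subcorpus by frequency.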
Chinese Electronic Dictionary The CKIP Electronic Dictionary is an electronic lexicon for Mandarin Chinese containing 88,000 entries. Each entry contains: print form (Chinese characters), word frequency (based on a 5-million-word corpus), pronunciation (National Phonetic Alphabet, Zhu4yin1fu2hao4, and Chinese Phonetic Alphabet, Han4Yu3Pin1Yin1), syntactic category (based on the CKIP classification of 198 categories), and semantic feature (based on the CKIP classification of 123 concept nodes for nouns).
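As a rough picture of what one such entry holds, the fields listed above could be modeled as a small record type. This is a sketch under assumptions: the class name, field names, and every value in the sample entry are invented for illustration and do not reflect the dictionary's real file format or contents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    print_form: str                  # Chinese characters
    frequency: int                   # count in the reference corpus
    zhuyin: str                      # National Phonetic Alphabet
    pinyin: str                      # Hanyu Pinyin with tone numbers
    syntactic_category: str          # one of the CKIP categories
    semantic_feature: Optional[str]  # concept node; given for nouns only


# Hypothetical entry; the frequency, category, and feature are made up.
entry = LexiconEntry(
    print_form="語言",
    frequency=1234,
    zhuyin="ㄩˇ ㄧㄢˊ",
    pinyin="yu3yan2",
    syntactic_category="Na",
    semantic_feature="abstract",
)

# A lexicon can then be a plain mapping from print form to entry.
lexicon = {entry.print_form: entry}
```

Making `semantic_feature` optional reflects the slide's note that semantic features are classified for nouns; entries of other categories would leave it as `None`.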
Sinica Treebank Sinica Treebank 3.0 contains 6 files, 61,087 syntactic tree structures, and 361,834 words. The tree structures were extracted from the Sinica Corpus, and every structure is segmented and parsed. Each segmented word of a tree structure is tagged with its part-of-speech and argument. Sinica Treebank 3.0 is provided free on the website for syntactic and semantic research use. 1,000 syntactic tree structures are available.
CoNLL X Shared Task Chinese Data The CoNLL-X Shared Task Chinese Data is derived from the Sinica Treebank and is in dependency format. It is divided into two parts: the test data is directly available, while the training data must be requested from ACLCLP (since Nov. 2011).
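For readers unfamiliar with the dependency format, the CoNLL-X layout is one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. A minimal reader might look like the sketch below. The `Token` type and `parse_conllx` function are hypothetical names, and the sample sentence with its tags and relation labels is invented rather than taken from the distributed data.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int        # token position within the sentence, starting at 1
    form: str      # word form
    lemma: str
    cpostag: str   # coarse-grained part-of-speech
    postag: str    # fine-grained part-of-speech
    feats: str
    head: int      # ID of the head token; 0 marks the root
    deprel: str    # dependency relation to the head
    phead: str
    pdeprel: str


def parse_conllx(text):
    """Split CoNLL-X text into sentences, each a list of Token records."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                             cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    if current:
        sentences.append(current)
    return sentences


# Invented three-token sentence; tags and relation labels are placeholders.
rows = [
    ["1", "我們", "_", "N", "Nh", "_", "2", "agent", "_", "_"],
    ["2", "討論", "_", "V", "VE", "_", "0", "ROOT", "_", "_"],
    ["3", "語料庫", "_", "N", "Na", "_", "2", "goal", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n\n"

sentences = parse_conllx(sample)
root = next(tok for tok in sentences[0] if tok.head == 0)
```

In this sample, `root.form` is the verb that the other two tokens attach to.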
Sinica Bilingual Ontological Database The Academia Sinica Bilingual Ontological Database (http://bow.sinica.edu.tw/) is a Chinese-English bilingual ontological database covering 110,000 Chinese terms. The database is built on the framework of SUMO (Suggested Upper Merged Ontology, http://www.ontologyportal.org, owned by IEEE) and on language usage in Taiwan. It provides the Chinese-English bilingual ontological structure and concept content, together with the integration of Chinese language information and the ontology. Drawing on language usage and lexical information, the database provides an infrastructure for knowledge planning and management and enables information from different sources to become inter-operable. The copyright of all original English content in SUMO is owned and authorized by the Institute of Electrical and Electronics Engineers, Inc. All original Chinese content was developed under the project Linguistic Anchoring of Digital Archives: Reference Resources Construction and Implementation Services (NSC94-2422-H-001-009-), sponsored by the National Science Council and executed by Academia Sinica. The copyright of the Academia Sinica Bilingual Ontological Database belongs to Academia Sinica. All open source data are stored in both plain-text and XML formats.
CIRB030 An information retrieval (IR) test collection is used to evaluate the performance of IR systems. It is a helpful and powerful tool for investigating both systems under development and finished systems. The CIRB030 (Chinese Information Retrieval Benchmark, version 3.0) test collection is one such collection, designed to be used for evaluation of Chinese document retrieval