GSK: Development and Distribution of Resources. Hitoshi ISAHARA, GSK: Gengo Shigen Kyokai (Language Resource Association)

1 GSK: Development and Distribution of Resources. Hitoshi ISAHARA, GSK: Gengo Shigen Kyokai (Language Resource Association); National Institute of Information and Communications Technology (NICT). Licensing and Distribution of Resources and Applications.

2 Regional Conference on Localized ICT Development and Dissemination across Asia, Jan. 15, Vientiane, Laos. Organizing Creation & Utilization of Language Corpora: Creating language corpora is costly, and using them requires a system for distributing them. Such activities started in the early 1990s: LDC in the U.S.A. (1992) and ELRA in Europe (1995).

3 Japanese Activities: GSK: Gengo Shigen Kyokai (Language Resource Association). Launched in 1999 and reorganized as an NPO in 2003; a three-year project was accepted in 2005. Text corpora are its main concern at present. NII-SRC distributes speech corpora.

4 GSK and NII-SRC. Language Resource Association (GSK): a nonprofit organization collecting and distributing text and speech corpora. http://www.gsk.or.jp/ NII-Speech Resources Consortium (NII-SRC): collects and distributes most major speech corpora. http://research.nii.ac.jp/src/eng/ These two organizations aim to play central roles in collecting and distributing speech and language corpora in Japan.

5 [Diagram of related organizations. Labels: Knowledge Information Processing Technologies Committee; Language Resource Sub-committee; JEITA (Japan Electronics and Information Technology Industries Association); Natural Language Processing Portal Site; SHACHI: Language Resource Metadata DB; NICT: National Institute of Information and Communications Technology; GSK; NII-SRC; TCL; NII: National Institute of Informatics]

6 Purpose of GSK: Collection, distribution, investigation, research, and standardization of electronic data and software tools necessary for the promotion of science, technology, education and industry concerning natural language.

7 GSK Organization: a president, two vice presidents, 11 board members, and 25 steering committee members. All are volunteers.

8 No-fee Distribution. [Diagram labels: Provider, User, GSK, Agreement, Distribution permission, Corpus, Payment] As a rule, the user bears the cost of handling the corpus, though the corpus itself is free of charge.

9 Agency Commission. [Diagram labels: GSK, Request Form, Payment, Agreement, Provider, User] Corpus providers entrust GSK with handling the requests received from users. GSK mediates between users and providers.

10 Advertising. [Diagram labels: Provider, User, GSK, Ad request, Ad rate, Payment, Agreement, Publicity] Corpus providers entrust GSK with advertising useful information about their data or corpora.

11 Some Examples of GSK Corpora: JEITA Multimodal Corpus; Japanese Web N-gram Version 1; CICC Multilingual Dictionary; IPAL Lexicon of Basic Japanese.

12 JEITA Multimodal Corpus: A corpus of person-to-person task-oriented dialogues. It includes 80 min. of video for 9 conversations on the topics of “faces” and “travel”. The speech data are transcribed and annotated with morphemes, dialogue structure, and prosody. Contained in 1 DVD-R (800 MB).

13 Japanese Web N-gram Version 1: N-grams extracted from publicly available Japanese web pages crawled by Google. Pages that require special permission to browse or are marked noarchive/noindex are not included. N-grams (1- to 7-grams) with frequency greater than 20 were extracted from approximately 20 billion sentences. Contained in 6 DVD-Rs (26 GB after gzip compression).
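
The extraction described on this slide (counting 1- to 7-grams and keeping those above a frequency threshold) can be illustrated with a minimal sketch. This is not the procedure actually used to build the corpus: the function names `count_ngrams` and `filter_by_frequency`, the romanized toy tokens, and the assumption that sentences are already tokenized are all hypothetical, and the threshold is lowered to 1 so the toy data produces output.

```python
from collections import Counter

def count_ngrams(sentences, max_n=7):
    """Count all n-grams of length 1..max_n over pre-tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def filter_by_frequency(counts, threshold):
    """Keep only n-grams whose frequency is strictly greater than threshold."""
    return {gram: c for gram, c in counts.items() if c > threshold}

# Toy input (hypothetical): three already-tokenized "sentences".
sentences = [
    ["gengo", "shigen", "kyokai"],
    ["gengo", "shigen"],
    ["gengo"],
]

counts = count_ngrams(sentences, max_n=7)
# The slide states a threshold of 20; 1 is used here only for the toy data.
filtered = filter_by_frequency(counts, threshold=1)
print(filtered)
```

At the scale the slide mentions (about 20 billion sentences), an in-memory counter like this would not be practical, so a real pipeline would presumably rely on distributed or disk-based counting.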

14 CICC Multilingual Dictionary: A collection of Malay, Indonesian, Chinese, and Thai dictionaries containing 50,000 basic words with POS tags; some contain English translations. A technical term dictionary for each language is also available. Contained in 1 CD-ROM for each language. CICC: Center of the International Cooperation for Computerization.

15 IPAL Lexicon of Basic Japanese: Contains 861 verbs, 136 adjectives, and 1,081 nouns, along with a glossary. English translations are also provided for the nouns contained in the glossary. Contained in 1 CD-ROM.

16 Summary: 1. There are several distributors of language resources in Japan. 2. GSK is the only language resource consortium qualified as an NPO in Japan. 3. GSK plans to collaborate with the Language Grid Project.