Indian Language Initiatives at LDC
Denise DiPersio


Tamil Internet Conference 2011, Philadelphia, PA, 17 June

Overview
• Introduction to LDC
• Tamil Projects/Resources
• Indian Language Projects/Resources

LDC: Origin and Model
• Linguistic Data Consortium established in 1992
  - Via open, competitive government solicitation, won by U. Penn
  - Initial 5-year funding followed by self-sufficiency through membership fees, data licenses
  - Power of the collective
• Language resource distributor/archive
  - Centralized distribution, archiving, licensing
  - Resources from donations, funded projects, community initiatives, LDC initiatives
• Membership
  - Members support the consortium through fees, data, services
  - Ongoing rights to data published in membership years
  - Reduced fees on older corpora, extra copies

LDC: Roles
• Data collection
• Language resource (LR) production, including quality control
• LR distribution and archiving
• Intellectual property rights management and license management
• Human subjects protocol management
• Annotation, lexicon building
• Creation of tools, specifications, best practices
• Knowledge transfer: documentation, metadata, consulting, training
• Corpus creation research and academic publication
• Resource coordination in large multisite programs
• Serving multiple research communities
  - Funding panelists, workshop participants, oversight committee members

LDC: Data Collection
• News text
• Web text (newsgroups, blogs, chatrooms, Twitter)
• Biomedical texts and abstracts
• Printed, handwritten and hybrid documents
• Broadcast programming (news, conversation)
• Conversational telephone speech
• Lectures, meetings, interviews
• Read and prompted speech
• Role play
• Video (broadcast, web)
• Animal vocalizations

LDC: Annotation
• Data scouting, selection, triage
• Audio-audio alignment: bandwidth, signal quality, language, dialect, program, speaker
• Quick and careful transcription, aligned at turn, sentence, word level
• Phonetic, dialect, sociolinguistic feature, supralexical
• Tokenization, tagging of morphology, part-of-speech, gloss
• Syntactic, semantic, discourse functions, disfluency, sense disambiguation
• Identification/classification of entities, relations, events and coreference
• Translation, alignment of translated text
• Identification/classification of entities/events in video
• Document zoning

LDC: Distribution
• Since 1992, LDC has distributed nearly 75,000 copies of 1300 titles to more than 3000 organizations in over 65 countries
• Approximately 8000 scholars and research groups receive LDC’s monthly newsletter
• Non-exclusive distribution of donated data
• LDC research communities span human language technologies, computer science, social sciences
• Uniform licensing within and across research communities
• Stable infrastructure
  - LRs permanently accessible, ongoing access to data
  - Standardized, simple terms of use and distribution methods

LDC: Data Scholarships
• Formalizes LDC’s long practice of $0 distribution of data to students without the means to otherwise license it
• Competitive process
  - Student submits application that contains:
    - Data set requested, proposed need and use of data
    - Description of research agenda
    - Demonstration of high probability of success for work
  - Letter of support from department chair/advisor, including statement of financial need
• Two cycles completed; next will be Fall
• Recipients: Argentina, China, India, Indonesia, Mexico, UK, USA
• ~USD40,000 in data awarded

Tamil Projects: REFLEX/LCTL 1/3
• REFLEX-LCTL (Less Commonly Taught Languages)
  - Goal: to create human language technologies for the target languages, especially machine translation, information extraction
  - Language selection criteria
    - Large population of native speakers
    - Relatively few language resources (electronic text, intentional difficulty variation in LR creation)
    - Linguistic and geographic diversity
    - Include some related languages
    - Make use of existing collaborations
  - Thirteen languages: Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu, Uzbek, Yoruba
    - Bengali, Panjabi, Urdu – related languages

Tamil Projects: REFLEX/LCTL 2/3
LDC created language packs for each language, consisting of:
• a monolingual news text corpus (500k words)
• a parallel text corpus (250k words)
• a lexicon (10k entries)
• a grammatical sketch
• an encoding converter
• a sentence segmenter
• a tokenizer
• a name transliterator
• a part of speech tagger and tagged text
• a named entity tagger and tagged text
• a morphological analyzer and tagged text

Tamil Projects: REFLEX/LCTL 3/3
• Resources identified through individual scouting, “Harvest Festivals”, native speakers
• Tamil Language Pack
  - Text sources included websites (for monolingual and parallel text)
  - Collaboration with Harold Schiffman, Vasu Renganathan
    - Tamil lexicon – An English Dictionary of the Tamil Verb
    - Consulted on encoding conversion
  - Project sponsor has not yet released pack for publication; potential use in ongoing technology evaluations
  - Will be published in LDC catalog when cleared for distribution

Tamil Projects: Language Resource Wiki
• Language Resource (LR) Wiki designed to be
  - Publicly accessible, world-readable
  - Portal of found resources “harvested” in REFLEX-LCTL project
  - Editable by authenticated others outside LDC
• Pages for seven languages, including Tamil
  - Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu
  - Breton, Ewe pages in progress
  - Language summary, linguistic resources, encoding and fonts, data sources, portals, tools and other natural language processing resources

Tamil Projects: CALLFRIEND
• CALLFRIEND project supported the development of language identification technology
• LDC recruited native speakers in the target languages to make telephone calls to other native speakers
• Calls were unscripted and lasted between 5 and 30 minutes
• Target languages: American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese
• CALLFRIEND Tamil LDC96S59
  - 60 telephone conversations
  - Demographic data: sex, age, education
  - Call information: channel quality, number of speakers
  - Calls originated inside the continental United States and Canada

Tamil Resources
• An English Dictionary of the Tamil Verb, Second Edition, LDC2009L01
  - Harold Schiffman, Vasu Renganathan (U Penn, Department of South Asia Studies)
  - Translations for 6597 English verbs and definitions for 9716 Tamil verbs
  - Associated sound files for pronunciation; example sentences
  - Windows search and browse application
  - Complimentary copy in conference packet

Indian Language Projects/Resources: Hindi
• Hindi Surprise Language Exercise (2003)
  - Goal: to assemble found resources under timed conditions
  - LDC collected newswire, web data, some parallel text
  - Not all resources can be released due to intellectual property, license restraints
  - Further work needed for public release
• Hindi WordNet LDC2008L02
  - Joint distribution with IIT Bombay
  - First WordNet for an Indian language
• CALLFRIEND Hindi LDC96S52

Indian Language Resources: POS Tagsets
• Indian Language Part of Speech Tagsets (IL-POST)
  - Developed by Microsoft Research India; Anna University, Chennai; Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi; Tamil University, Tamilnadu
  - Goal: to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across languages
  - LDC currently distributes three IL-POST sets at no cost: Bengali, Hindi, Sanskrit
    - IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE corpus (parallel newswire)
    - IL-POST Hindi LDC2010T24 – 98k words from web text
    - IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories
  - More languages planned, Tamil among them

LDC: Need to Know
• LDC website
• The LDC Corpus Catalog
• Submitting Corpora and Other Resources to LDC
• LDC Online
• Member Resources

Questions?
Thank you!