

© C-DAC Noida, 2004
Efforts in Language & Speech Technology
Natural Language Processing Lab
Centre for Development of Advanced Computing (Ministry of Communications & Information Technology)
‘Anusandhan Bhawan’, C 56/1 Sector 62, Noida – , India

Translation Support System
- Technology: Angla Bharati (rule-based), developed by IIT Kanpur
- System developed jointly by IIT Kanpur and C-DAC Noida
- Operating system support: Linux / Windows
- Performance: 85% correct parsing, 60% correct translation
- Embedded text editor, pre-processor and post-editor
- Lexicon: 25,000 root words

Translation Support System (English to Hindi)
[Block diagram. Components: English Sentence, Morphological Analyzer, Lexical Dictionary, Pattern Directed Parsing, Corpus, Rule Base, Pseudo Language Output, Hindi Text Generator, Post Editor]


Test suite for Translation Support Systems

Knowledge Management: Parallel Corpus & Tools

Gyan Nidhi: Parallel Corpus
‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 12 languages (English and 11 Indian languages). The project is sponsored by TDIL, DIT, MC&IT, Govt. of India.

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus

What is it?
A multilingual parallel text corpus contains the same text translated into more than one language.

What does Gyan Nidhi contain?
The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitize 1 million pages altogether, with at least 50,000 pages in each Indian language and in English.

Sources for the parallel corpus:
- National Book Trust India
- Sahitya Akademi
- Navjivan Publishing House
- Publications Division
- SABDA, Pondicherry

GyanNidhi Block Diagram

Gyan Nidhi: Multi-Lingual Aligned Parallel Corpus
- Platform: Windows
- Data encoding: XML, Unicode
- Portability of data: data in XML format supports various platforms

Applications of GyanNidhi:
- Automatic dictionary extraction
- Creation of translation memory
- Example Based Machine Translation (EBMT)
- Language research study and analysis
- Language modeling
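One of the listed applications, creation of a translation memory, can be illustrated with a minimal sketch. It assumes the corpus has already been reduced to aligned (source, target) paragraph pairs; the function names `build_memory` and `lookup`, the 0.7 threshold, and the sample sentences are all hypothetical and not taken from GyanNidhi. Fuzzy matching here uses Python's standard `difflib`, which is only a stand-in for whatever matching a real system would use.

```python
from difflib import SequenceMatcher

def build_memory(aligned_pairs):
    """Keep non-empty (source, target) pairs as the translation memory."""
    return [(src.strip(), tgt.strip())
            for src, tgt in aligned_pairs
            if src.strip() and tgt.strip()]

def lookup(memory, query, threshold=0.7):
    """Return (score, source, target) for the best fuzzy match, or None."""
    best = None
    for src, tgt in memory:
        score = SequenceMatcher(None, query.lower(), src.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, src, tgt)
    return best

# Hypothetical aligned pairs, for illustration only.
pairs = [
    ("Knowledge is power.", "ज्ञान ही शक्ति है।"),
    ("The book is on the table.", "किताब मेज पर है।"),
]
memory = build_memory(pairs)
match = lookup(memory, "The book is on the table")
if match is not None:
    print(f"{match[0]:.2f}  {match[1]}  ->  {match[2]}")
```

A linear scan is fine for a sketch; a corpus of the size described would need an index rather than a pass over every pair.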

Tools: Prabandhika: Corpus Manager
- Categorisation of corpus data in various user-defined domains
- Addition/deletion/modification of any Indian language data files in HTML / RTF / TXT / XML format
- Selection of languages for viewing the parallel corpus, with data aligned up to paragraph level
- Automatic selection and viewing of parallel paragraphs in multiple languages
- Abstract and metadata
- Printing and saving parallel data in Unicode format
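Viewing paragraphs aligned across selected languages could look roughly like the sketch below. The XML layout (`corpus`, `para`, `text` elements with a `lang` attribute) is an assumption made for illustration; the slides do not describe the actual GyanNidhi schema, and `parallel_paragraphs` is a hypothetical helper, not part of Prabandhika.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <para> per aligned paragraph,
# one <text lang="..."> child per language.
SAMPLE = """<corpus>
  <para id="1">
    <text lang="en">Knowledge is power.</text>
    <text lang="hi">ज्ञान ही शक्ति है।</text>
  </para>
  <para id="2">
    <text lang="en">The book is on the table.</text>
    <text lang="hi">किताब मेज पर है।</text>
  </para>
</corpus>"""

def parallel_paragraphs(xml_text, languages):
    """Return one tuple per paragraph, ordered like `languages`.

    Paragraphs missing any requested language are skipped.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for para in root.iter("para"):
        by_lang = {t.get("lang"): (t.text or "").strip()
                   for t in para.iter("text")}
        if all(lang in by_lang for lang in languages):
            rows.append(tuple(by_lang[lang] for lang in languages))
    return rows

rows = parallel_paragraphs(SAMPLE, ["en", "hi"])
for english, hindi in rows:
    print(english, "|", hindi)
```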

Sample Screen Shot: Prabandhika

Tools: Vishleshika: Statistical Text Analyzer
- Vishleshika is a tool for statistical text analysis of Hindi, extendible to text in other Indian languages.
- It examines input text and generates various statistics, e.g.:
  - Sentence statistics
  - Word statistics
  - Character statistics
- The Text Analyzer presents its analysis in textual as well as graphical form.
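The three kinds of statistics named above can be sketched in a few lines. This is not Vishleshika's implementation: `text_statistics` is a hypothetical function, sentences are assumed to end at the danda (।) or at `.`, `?`, `!`, words are split on whitespace, and characters are counted as Unicode code points (so a consonant and its vowel sign count separately).

```python
import re
from collections import Counter

def text_statistics(text):
    """Compute simple sentence, word and character statistics."""
    # Assumption: the danda and ASCII end marks terminate sentences.
    sentences = [s.strip() for s in re.split(r"[।?!.]+", text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    chars = Counter(ch for ch in text if not ch.isspace())
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "char_counts": chars,
    }

sample = "राम घर गया। सीता बाहर गई।"
stats = text_statistics(sample)
print(stats["sentence_count"], stats["word_count"])
print(stats["char_counts"].most_common(3))
```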

Sample output: Character statistics
[Graph: character distribution in the Hindi and Nepali sample text. Panels: most frequent consonants in Hindi; most frequent consonants in Nepali.]
- The graph shows that the distribution is almost equal in Hindi and Nepali in the sample text.
- The results also show that these six consonants constitute more than 50% of consonant usage.
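A figure such as "the top six consonants account for more than 50% of consonant usage" can be computed as in the sketch below. The function names are hypothetical, and consonants are assumed to be the Devanagari code points U+0915 (क) through U+0939 (ह); the separately encoded nukta consonants (U+0958 to U+095F) are ignored for simplicity. The sketch only shows the calculation and does not reproduce the slide's results.

```python
from collections import Counter

def is_consonant(ch):
    """Assumption: Devanagari consonants are U+0915 (KA) .. U+0939 (HA)."""
    return "\u0915" <= ch <= "\u0939"

def top_consonant_share(text, n=6):
    """Return the n most frequent consonants and their share of all consonants."""
    counts = Counter(ch for ch in text if is_consonant(ch))
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(c for _, c in top) / total if total else 0.0
    return top, share

top, share = top_consonant_share("कमल कमरा", n=2)
print(top, round(share, 2))
```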

Vishleshika: Word and Sentence Statistics

Speech Technology and Tools

Annotated Speech Corpora for Hindi, Punjabi and Marathi languages
[Flow diagram. Boxes: Vishleshika Statistical Analysis Tool; Gyan Nidhi Corpus; Phonetically Rich sentence set; Manual Verification and Editing; Studio Recording by Professionals; Segmentation and labeling using Praat / Emulabel; XML Meta Data Creation]
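The slide does not say how the phonetically rich sentence set is chosen. One common approach, shown here purely as an assumption, is greedy selection: repeatedly pick the sentence that adds the most not-yet-covered units. The sketch uses within-word character bigrams as a crude proxy for phonetic units; a real pipeline would work on phonemes or syllables. `units` and `select_rich_sentences` are hypothetical names.

```python
def units(sentence):
    """Character bigrams inside each word (a rough proxy for phonetic units)."""
    found = set()
    for word in sentence.split():
        found.update(a + b for a, b in zip(word, word[1:]))
    return found

def select_rich_sentences(sentences, max_sentences):
    """Greedily choose sentences that add the most uncovered units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered

chosen, covered = select_rich_sentences(["ab cd", "ab", "cd ef gh"], 3)
print(chosen, sorted(covered))
```

The greedy choice does not guarantee the smallest possible set, but it is simple and leaves the final say to the manual verification step listed on the slide.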


Modules under TTS

| Module | Description |
|---|---|
| TTS Shell | A multi-threaded interface that calls the different TTS modules and returns messages that the user can process to generate different events. |
| Voice Builder | A utility that helps in building the syllable database. It reduces space utilization and helps in performing fast search. |
| Query Tool for Voice Builder | A tool for reading a voice file and retrieving information about a “UNIT” from the file, i.e. its wave data. |
| Text Parser | Breaks the normalized text into logical units such as sentences, words and syllables. |
| Prosody Matching & Syllable Concatenation | The “PSOLA” technique is used for smooth joining of speech samples. |
| Synthesizer | Writes wave data directly to a sound card or a wave file. |
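The Text Parser's job of breaking normalized text into sentences, words and syllables can be approximated as below. This is a simplified sketch, not the module described in the table: it splits Devanagari words into orthographic syllables (aksharas) with a regular expression and ignores pronunciation effects such as schwa deletion, which a real TTS front end has to handle. The pattern and the names `syllables` and `parse` are assumptions made for illustration.

```python
import re

# Consonant (optionally followed by a nukta).
CONS = "[\u0915-\u0939\u0958-\u095F]\u093C?"
# Akshara: consonant cluster joined by virama, optional vowel sign,
# optional anusvara/chandrabindu/visarga, optional final virama;
# or an independent vowel with an optional modifier.
SYLLABLE = re.compile(
    "(?:" + CONS + "\u094D)*" + CONS + "[\u093E-\u094C]?[\u0901-\u0903]?\u094D?"
    "|[\u0904-\u0914][\u0901-\u0903]?"
)

def syllables(word):
    """Split one Devanagari word into orthographic syllables."""
    return SYLLABLE.findall(word)

def parse(text):
    """Return sentences as lists of words, each word a list of syllables."""
    result = []
    for sentence in re.split(r"[।?!.]+", text):
        words = sentence.split()
        if words:
            result.append([syllables(w) for w in words])
    return result

for sentence in parse("भारत महान है।"):
    print(sentence)
```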

Other Areas of Expertise
- OCR for Devanagari script
- Digital library for Indian languages
- Word processing tools such as spell checker, transliteration, terminology development, document analysis and font converters
- Indian language eContent creation

Areas for Future Work

Machine Translation
- Standardization, Lexware database design
- Working on the global approach ‘BhashaSetu’, an amalgamation of different approaches that draws on the best of each
- Development of a translation system test bed

Knowledge Management
- Automatic text summarization tool for Hindi and other Indian languages
- Standardization of a parts-of-speech tagset for Hindi, extendible to other Indian languages
- Parts-of-speech tagger development for Indian languages
- Automated terminology development tools
- Sentence alignment tool for Indian languages
- Development of a manually tagged parallel corpus up to word level

Speech Technology
- Speech-to-speech translation system
- Development of semi-automated speech annotation tools

Thank You