Role of Speech Technology in Enhancing Human-Computer Interactions in Tamil
G. Anushiya Rachel
Project Officer, Speech Lab, SSN College of Engineering

Introduction
- Human-computer interaction
  - Graphical user interface
  - Voice user interface
    - Communicate with computers/machines through speech, as with a human being
    - Personal assistants such as Siri, Google Assistant, Cortana, and Alexa
- Requirements of the computer
  - Recognize the user’s voice/speech (speaker/speech recognition)
  - Respond appropriately through speech (speech synthesis)
- Possible applications of speech technology: speech recognition/synthesis, speaker verification/identification, language identification, emotion recognition/synthesis
- Presently developed for Tamil: text-to-speech (TTS) synthesis systems, a restricted-vocabulary speech recognition system, and a speech-enabled enquiry system

Text-to-Speech Synthesis
- Converts any given text in a language to speech
- Basic components
  - Text pre-processing
  - Text to phonetic-prosodic translation
  - Signal processing component to generate speech
- Language-dependent and language-independent modules
- Restricted-domain and unrestricted-domain synthesizers

Unit Selection Synthesis (USS)
- Waveform concatenation approach
- Pre-recorded speech units are combined based on the given text such that the target and concatenation costs are minimized
- Speech units could be words or sub-word units (e.g., phonemes, CV units, syllables)
- Synthesized speech
  - Natural
  - Contains glitches at the concatenation points
  - Larger speech unit – better quality
- Large footprint size (on the order of GBs)
- Larger speech unit – more data required

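The cost-based selection described in the Unit Selection Synthesis slide can be sketched as a Viterbi (dynamic programming) search over candidate units. This is a minimal illustration, not the system's actual implementation: the `Unit` tuple, the pitch/duration features, and both cost functions are assumptions chosen only to make the idea concrete.

```python
from typing import List, Tuple

# Hypothetical unit description: (unit id, mean pitch in Hz, duration in ms).
Unit = Tuple[str, float, float]
# Hypothetical target specification: (desired pitch in Hz, desired duration in ms).
Target = Tuple[float, float]


def target_cost(target: Target, unit: Unit) -> float:
    """Mismatch between the desired pitch/duration and a candidate unit."""
    return abs(target[0] - unit[1]) / 100.0 + abs(target[1] - unit[2]) / 100.0


def concat_cost(prev: Unit, cur: Unit) -> float:
    """Pitch discontinuity at the join between two consecutive units."""
    return abs(prev[1] - cur[1]) / 100.0


def select_units(targets: List[Target],
                 candidates: List[List[Unit]]) -> Tuple[List[Unit], float]:
    """Return the unit sequence with the lowest total target + concatenation cost."""
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            row.append(min(
                (best[i - 1][k][0] + concat_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [(120.0, 80.0), (120.0, 100.0)]
    candidates = [
        [("a_1", 120.0, 80.0), ("a_2", 180.0, 80.0)],
        [("b_1", 200.0, 100.0), ("b_2", 125.0, 100.0)],
    ]
    path, total = select_units(targets, candidates)
    print([u[0] for u in path], round(total, 3))
```

A real synthesizer would compute these costs from spectral, pitch, and duration features of recorded units; the search structure stays the same.
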
HMM-Based Speech Synthesis System (HTS)
- Statistical parametric approach
- Uses a source-filter model to synthesize speech
- Synthesized speech
  - Highly intelligible
  - Slightly less natural
- Small footprint size (on the order of a few kB)

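The source-filter idea mentioned in the HTS slide can be illustrated with a toy example: an excitation signal (the source) is passed through a filter that shapes its spectrum. This sketch assumes a simple all-pole filter with hand-picked coefficients; HTS itself generates mel-cepstral parameters from the HMMs and uses a different synthesis filter, so the function names and coefficient values here are purely illustrative.

```python
import random
from typing import List


def excitation(n_samples: int, f0_hz: float,
               sample_rate: int = 16000, seed: int = 0) -> List[float]:
    """Source signal: impulse train if voiced (f0 > 0), white noise if unvoiced."""
    if f0_hz <= 0:
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    period = round(sample_rate / f0_hz)  # samples between pitch pulses
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(x: List[float], a: List[float]) -> List[float]:
    """Filter: y[n] = x[n] - sum_k a[k] * y[n-1-k] (a simple vocal-tract stand-in)."""
    y: List[float] = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


if __name__ == "__main__":
    # 20 ms of voiced speech at 100 Hz, shaped by an assumed 2-pole filter.
    source = excitation(320, f0_hz=100.0)
    speech = all_pole_filter(source, a=[-1.2, 0.5])
    print(len(speech), speech[:4])
```

Changing `f0_hz` alters the pitch without touching the filter, which is why parametric synthesizers can modify prosody independently of the spectral envelope.
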
Requirements
- Text data
  - Domain-specific or unrestricted
- Speech data
  - Recorded in a quiet/studio environment
- Amount of data
  - Basic unit could be a phone, diphone, syllable, etc.
  - The larger the unit, the greater the amount of training data required
- Letter-to-sound rules of the language
- Time-aligned transcriptions

Letter-to-Sound Rules
Time-Aligned Transcriptions
HMM-Based Speech Synthesis
- Effect of context information on quality
  - Monophone – context independent (/aa/ /g/ /aa/ /y/ /a/ /m/)
  - Triphone – left and right contexts (/x-aa+g/ /aa-g+aa/ /g-aa+y/ ….)
  - Pentaphone – two phones of context on each side
  - Pentaphone with additional features
  - Web demo: http://speech.ssn.edu.in
- Prosody modification
  - To improve the naturalness of speech, the pitch contour can be modified
  - Emotions can also be incorporated

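The context-dependent labels listed in the HMM-Based Speech Synthesis slide can be derived mechanically from a monophone sequence. The sketch below reproduces the slide's triphone notation (`left-centre+right`, with `x` marking a missing neighbour) and, as an assumption, writes pentaphones in the HTS-style `ll^l-c+r=rr` form; the function names are hypothetical.

```python
from typing import List


def to_triphones(phones: List[str], boundary: str = "x") -> List[str]:
    """Attach one phone of left and right context to each phone."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def to_pentaphones(phones: List[str], boundary: str = "x") -> List[str]:
    """Attach two phones of left and right context to each phone."""
    padded = [boundary] * 2 + list(phones) + [boundary] * 2
    return [
        f"{padded[i - 2]}^{padded[i - 1]}-{padded[i]}+{padded[i + 1]}={padded[i + 2]}"
        for i in range(2, len(padded) - 2)
    ]


if __name__ == "__main__":
    monophones = ["aa", "g", "aa", "y", "a", "m"]
    print(to_triphones(monophones))
    # ['x-aa+g', 'aa-g+aa', 'g-aa+y', 'aa-y+a', 'y-a+m', 'a-m+x']
    print(to_pentaphones(monophones)[:2])
    # ['x^x-aa+g=aa', 'x^aa-g+aa=y']
```

"Pentaphone with additional features" would append further fields (for example, position or stress information) to each label; those are omitted here because the slide does not specify them.
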
Polyglot HTS
- Bilingual synthesizers for Tamil and Indian English
  - Tamil phonemes mapped to similar Indian English phonemes
  - Separate synthesizers for Tamil and English
  - Perceptually similar phonemes merged
  - Acoustically similar phonemes merged
- Polyglot synthesizers for Tamil, Hindi, Malayalam, and Telugu
  - GMM-based voice conversion used
  - Characteristics of each speaker adapted to the desired speaker’s characteristics

Mobile Application and Screen Reader
- An Android mobile application allows the user to type the desired text and synthesizes it.
- The Tamil TTS system is integrated with the “TalkBack” feature in Android phones, which serves as a screen reader.
- A Linux-based screen reader that synthesizes selected text has also been developed.

Speech-Enabled Interactive Enquiry System
- Communicates with the user entirely through speech
- Consists of three components: a speech recognition system, a TTS synthesis system, and a database containing relevant information

Speech-Enabled Interactive Enquiry System
- Developed to provide information on agriculture, specifically paddy, sugarcane, and ragi
- Obtains the user’s query through a series of questions
- Questions are formulated to elicit one- to three-word responses
- Garbage models are used to reject out-of-vocabulary words
- The recognized result is verified with the user when in doubt
- Information relevant to the user’s query is fetched from a database and synthesized by the TTS system

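The dialogue flow in the enquiry system slide (ask, reject out-of-vocabulary input, confirm when in doubt, look up, respond) can be sketched as follows. Everything here is an assumption for illustration: the recognizer and confirmation step are passed in as plain callables, the `<garbage>` token and the 0.7 confidence threshold are invented, and the database holds placeholder strings where the real system would hold agricultural information to be passed to the TTS system.

```python
from typing import Callable, Dict, List, Optional, Tuple

GARBAGE = "<garbage>"  # hypothetical label emitted when a garbage model wins

Recognizer = Callable[[str, List[str]], Tuple[str, float]]  # -> (word, confidence)
Confirmer = Callable[[str], bool]                           # "Did you say X?"


def ask(prompt: str, vocabulary: List[str], recognize: Recognizer,
        confirm: Confirmer, threshold: float = 0.7,
        max_attempts: int = 3) -> Optional[str]:
    """Ask one question; return an accepted in-vocabulary answer or None."""
    for _ in range(max_attempts):
        word, confidence = recognize(prompt, vocabulary)
        if word == GARBAGE or word not in vocabulary:
            continue  # out-of-vocabulary: repeat the question
        if confidence < threshold and not confirm(word):
            continue  # doubtful result that the user rejected
        return word
    return None


def run_enquiry(recognize: Recognizer, confirm: Confirmer,
                database: Dict[Tuple[str, str], str]) -> str:
    """Collect crop and topic through short questions, then fetch the answer text."""
    fallback = "Sorry, I could not understand."
    crops = sorted({crop for crop, _ in database})
    crop = ask("Which crop?", crops, recognize, confirm)
    if crop is None:
        return fallback
    topics = sorted({topic for c, topic in database if c == crop})
    topic = ask("Which topic?", topics, recognize, confirm)
    if topic is None:
        return fallback
    return database[(crop, topic)]  # text that would be handed to the TTS system


if __name__ == "__main__":
    db = {("paddy", "irrigation"): "<information on paddy irrigation>",
          ("ragi", "sowing"): "<information on ragi sowing>"}
    scripted = iter([(GARBAGE, 0.9), ("paddy", 0.5), ("irrigation", 0.95)])
    print(run_enquiry(lambda p, v: next(scripted), lambda w: True, db))
```

Keeping each question's vocabulary small is what makes a restricted-vocabulary recognizer sufficient for this kind of system.
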
Future Directions
- Development of an unrestricted-vocabulary speech recognition system for Tamil
- Identification of emotions from speech
- Synthesis of emotional speech

Demo