Acknowledgements Prof. Mctear, Natural Language Processing, University of Ulster.

VoiceXML: SSML (Speech Synthesis Markup Language) Recorded speech and audio

Overview
Speech Synthesis Markup Language (SSML)
Phases of text-to-speech synthesis:
  Structure analysis
  Text normalisation
  Text-to-phoneme conversion
  Prosody analysis
  Waveform production
Recorded speech

SSML: Speech Synthesis Markup Language
Enables developers to override the synthesizer's default specifications at each stage of synthesis.
Stages:
  Structure analysis
  Text normalisation
  Text-to-phoneme conversion
  Prosody analysis
  Waveform production

Structure Analysis
Division of text into basic elements, e.g. sentences and paragraphs, to support more natural phrasing.
<s> - sentence
<p> - paragraph
Structure is inferred from punctuation and formatting, but abbreviations that end in a full stop make this unreliable:
  Dr. Lewis works at the clinic on Sunset Dr. in western Portland.
  Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.
Explicit sentence markup removes the ambiguity:
  <s>Dr. Smith lives at 214 Elm Dr.</s>
  <s>He weighs 214 lb.</s>
  <s>He plays bass guitar.</s>
  <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
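A minimal sketch of how the sentence markup above might sit inside a complete VoiceXML document. The <p> wrapper and the surrounding <form> and <block> are assumptions added for illustration; the slide itself shows only the <s> elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- One paragraph, with each sentence boundary marked explicitly -->
        <p>
          <s>Dr. Smith lives at 214 Elm Dr.</s>
          <s>He weighs 214 lb.</s>
          <s>He plays bass guitar.</s>
          <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
        </p>
      </prompt>
    </block>
  </form>
</vxml>
```

Marking the boundaries only fixes phrasing; the abbreviations themselves still need text normalisation.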

Text Normalisation
Annotation of text so that it is spoken correctly.
Ambiguous examples:
  1/2 - may be spoken as "half", "January second", "February first", or "one of two".
  Dr. - may be "doctor" or "drive", e.g. "Dr. John Dr." is rewritten as "Doctor John Drive".
  St. - may be "saint" or "street", e.g. "St. John St." is rewritten as "Saint John Street".
  Acronyms - some, e.g. ACM or IEEE, should be spelled out; others, e.g. RAM or ROM, are pronounced as words.
  E-mail addresses - is the first part "Cat Azman", "C.A.Tazman", or "C. Atazman"? Is the last part "Bee dot com" or "B.E.E. dot com"?

<sub>
New in VoiceXML 2.0. Speech Synthesis Markup.
Syntax: <sub alias="SpokenText">OriginalText</sub>
Description: SSML element whose alias attribute provides substitute text to be spoken instead of the contained text. This allows the document to contain both a written and a spoken form for a string.

Example text: Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.
Here "Dr." and "lb." each need a spoken form that differs from the written form.
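One way the example text could be annotated with the SSML <sub> element, whose alias attribute carries the spoken form. The alias values below are assumptions chosen for illustration; the slide's own markup did not survive in this transcript.

```xml
<prompt>
  <!-- alias gives the spoken form; the element content keeps the written form -->
  <sub alias="Doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="Drive">Dr.</sub>
  He weighs 214 <sub alias="pounds">lb.</sub>
  He plays bass guitar.
  He also likes to fish; last week he caught a 20 <sub alias="pound">lb.</sub> bass.
</prompt>
```

The two pronunciations of "bass" are not a normalisation problem and are left for the text-to-phoneme stage.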

<say-as>
Speaks the enclosed text in the given style. Implemented (with limitations) on some platforms.
Example: numbers. The contained text can be interpreted as a number. The allowed number formats are ordinal, cardinal, and digits.
  <say-as type="number:ordinal">12</say-as> is spoken as "twelfth".
  <say-as type="number:digits">12</say-as> is spoken as "one two".
Other types: acronyms, currency, time, date, duration, measures, telephone, spell-out, names, and net.
BeVocal provides a set of extended tags for items such as: airline, equity, street, city, state, citystate, address.
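The type="number:ordinal" form above follows the earlier draft syntax. The W3C SSML 1.0 Recommendation instead uses an interpret-as attribute and leaves its permitted values to the platform. A sketch of the same two readings in that style, assuming the platform accepts the values "ordinal" and "digits" (check your browser vendor's documentation):

```xml
<prompt>
  <!-- "ordinal" and "digits" are assumed, platform-dependent values -->
  You are the <say-as interpret-as="ordinal">12</say-as> caller in the queue.
  Your code is <say-as interpret-as="digits">12</say-as>.
</prompt>
```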

Text to phoneme conversion
Specifies the pronunciation of words that are ambiguous or difficult to pronounce, e.g.
  read = "reed" / "red"
  wind: Wind the watch when you face into the wind.
<phoneme> - uses the standard phonetic alphabet, the International Phonetic Alphabet (IPA).
  He plays <phoneme alphabet="ipa" ph="U0062 U0258 U0073">bass</phoneme> guitar. He also likes to fish; last week he caught a 20 lb. <phoneme alphabet="ipa" ph="U0062 U00E6 U0073">bass</phoneme>.
The ph values are written here as Unicode numbers, one per IPA symbol.
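In an actual XML document the Unicode numbers are normally written as numeric character references (or as the IPA characters themselves) rather than in the "U0062" notation. A sketch under that assumption; the transcriptions /beɪs/ (the instrument) and /bæs/ (the fish) are supplied here for illustration and differ slightly from the code points shown on the slide:

```xml
<prompt>
  <!-- &#x26A; is the IPA symbol for the second half of the vowel in "bass" (guitar) -->
  He plays <phoneme alphabet="ipa" ph="be&#x26A;s">bass</phoneme> guitar.
  <!-- &#xE6; is the IPA vowel in "bass" (fish) -->
  Last week he caught a 20 lb. <phoneme alphabet="ipa" ph="b&#xE6;s">bass</phoneme>.
</prompt>
```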

Attributes of <phoneme>
alphabet - The phonetic alphabet used to specify the pronunciation of the word contained in the <phoneme> element. The only valid values for this attribute are alphabet="ipa" and vendor-defined strings of the form alphabet="x-organization" or alphabet="x-organization-alphabet".
ph - The phonetic spelling of this word, expressed using the chosen alphabet.
Using the IPA requires some linguistic training. Useful online resources include: a tutorial on the IPA symbols and sounds; an overview of the IPA with a full chart of symbols; an illustrated list of the sounds used in English with their IPA symbols, where each sound can be heard by clicking the word that contains it; and a symbol chart where moving the cursor over an IPA symbol shows its Unicode value.

Prosody analysis
Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.
Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed.
Synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory.
Prosody rules and algorithms are not perfect and are a topic of ongoing research.
Prosody rules for different spoken languages and national varieties may be quite different. For example, the prosody of American, British, Indian, and Jamaican English is different.

<prosody> : pitch
Pitch refers to the "highness or lowness" of speech, measured by the frequency (Hz, vibrations per second) of the sound (currently not implemented in BeVocal Cafe).
It can be specified with:
  A number followed by "Hz"
  A relative change expressed as a percentage, e.g. "+18.2%" or "-10.3%"
  A relative change as a relative number, e.g. "+10" or "-8.7"
  One of the following words: "x-high", "high", "medium", "low", "x-low", or "default"

<prosody> : range
Range specifies the variability of the pitch, using the same options as pitch (currently not implemented in BeVocal Cafe), e.g.
  <prosody pitch="medium" range="x-low">

<prosody> : contour
Contour describes the actual pitch contour for the text (currently not implemented in BeVocal Cafe).
It is a set of time segments with a target pitch specified for each time segment. Each time segment is defined as a percentage of the total time for speaking the contained text, e.g. (25%, 25%, 25%, 25%) would speak the contained text in four equal segments.
An interpolation algorithm smooths the transitions between the time segments.
For example, a contour can be used to describe the increase in pitch at the end of a question as follows:
  <prosody contour="(90%, medium) (10%, high)"> You said what? </prosody>
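For comparison, the W3C SSML 1.0 Recommendation defines each contour pair as (time position, target), where the percentage is a position within the contained text rather than a segment length. A sketch of the same rising question ending under that reading; the specific offsets in Hz are assumptions chosen for illustration:

```xml
<prompt>
  <!-- pitch stays level until 90% of the utterance, then rises to the end -->
  <prosody contour="(0%,+0Hz) (90%,+0Hz) (100%,+40Hz)">
    You said what?
  </prosody>
</prompt>
```

Which reading a given platform follows, if it supports contour at all, should be confirmed in the vendor documentation.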

<prosody> : rate, duration
Rate - the speaking rate expressed in words per minute (currently not implemented in BeVocal Cafe), specified using any of the following:
  A number
  A relative change expressed as a percentage, e.g. "+18.2%" or "-10.3%"
  A relative change as a relative number, e.g. "+10" or "-8.7"
  One of the following words: "x-fast", "fast", "medium", "slow", "x-slow", or "default"
  The student's name is <prosody rate="-10%"> John Scott </prosody>
Duration - a value in seconds or milliseconds for the desired time to read the element contents, e.g.
  <prosody duration="10s">

<prosody> : volume
Volume specifies how loudly or quietly the words are spoken, specified by:
  A number in the range from 0.0 to 100.0
  A relative change expressed as a percentage, e.g. "+18.2%" or "-10.3%"
  A relative change as a relative number, e.g. "+10" or "-8.7"
  One of the following words: "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default"
  <prosody volume="loud"> text to be spoken </prosody>

<emphasis> (formerly <emph>)
level - values are "strong", "moderate", "none", and "reduced". "none" is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize.
  <emphasis level="strong">help</emphasis>

<break> specifies where to insert silence (a pause) in the text
strength - the strength of the prosodic break. Values are "none", "x-small", "small", "medium" (the default value), "large", or "x-large".
time - the length of the pause, e.g. "250ms", "3s".
  Welcome to the Student System <break time="250ms"/> Please say one of the following: …
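A small sketch combining <break>, <emphasis>, and <prosody> in one prompt inside a complete VoiceXML document. The menu options "grades" and "timetable" are hypothetical, as the slide elides them, and support for each prosody attribute varies by platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        Welcome to the Student System
        <break time="250ms"/>
        Please say one of the following:
        <!-- hypothetical menu options, each set off by a short pause -->
        <emphasis level="strong">grades</emphasis>
        <break time="250ms"/>
        <prosody rate="-10%">timetable</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```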

Waveform Production
The process of converting a textual representation into acoustic sounds which humans hear and interpret as human-like speech.
<voice> - uses a different voice from the default specified for TTS
  <voice age="3" gender="female"> text to speak </voice>
<audio> - specifies what audio to present to the user
<desc> - specifies text-only output describing the audio output (e.g. dog barking)
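A sketch showing the three elements together. The file name dog.wav and the spoken wording are hypothetical. In SSML, <desc> is a child of <audio>, and any other text inside <audio> is spoken only if the file cannot be played.

```xml
<prompt>
  <!-- switch away from the default TTS voice for one sentence -->
  <voice gender="female">Here is the sound you asked for.</voice>
  <audio src="dog.wav">
    <!-- description for text-only output -->
    <desc>dog barking</desc>
    <!-- fallback speech if dog.wav is unavailable -->
    A dog barks.
  </audio>
</prompt>
```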

Other SSML elements
<speak> - defines a container for a speech synthesis document. Not required when SSML tags are used in PCDATA within VoiceXML.
<lexicon> - specifies a pronunciation lexicon document which the speech synthesis engine uses to generate the pronunciation of words. The format is not yet defined; see the documentation of your VoiceXML browser vendor.
<mark> - places a marker into the text to be processed by the speech synthesis engine. When the marker is encountered, the speech synthesis pauses and throws an event referencing the marker name. A built-in event handler processes the event and causes the speech synthesis engine to resume.

<audio>: playing prerecorded audio files
Output can consist of a combination of prerecorded files, audio streams, or synthesised speech, e.g.
  <prompt> Welcome to the Student System <audio src="AudioSample.wav"/> How can I help you? </prompt>
<audio> can have alternative content in case the audio sample is not available, e.g.
  <audio src="welcome.wav"> Welcome to the Student System </audio>

Recording speech input using <record>
<record> is a form element similar to <field>.
It is used to collect a recording from the user that can be played back or submitted to a server.
It has a <prompt> element and can have a <filled> element.
It can have a grammar for a spoken command to terminate the recording.

Attributes of <record>
name - The name of a variable that holds the value of the recorded item.
expr - The value of the recorded item variable.
beep - There are two possible values: beep="true" and beep="false". If true, a beep tone is presented to the user just before the recording begins. The default is false.
maxtime - The maximum duration of the recording, beginning when the recording starts. For example, maxtime="10s", where "10s" means 10 seconds.
finalsilence - The interval of silence indicating the end of speech. For example, finalsilence="3s" (not implemented in IBM Voice Server SDK).
dtmfterm - There are two possible values: dtmfterm="true" and dtmfterm="false". If true, then any DTMF key press not matched by an active grammar will terminate the input. The default is true.
type - Media format of the resulting recording. A media type is a file format written in the form type/subtype. For audio files, the type is always audio.

Example using <record>
<form>
  <record name="msg" beep="true" maxtime="5s" finalsilence="5000ms" dtmfterm="true" type="audio/x-wav">
    <prompt timeout="5s"> Record your message after the beep. </prompt>
  </record>
  <filled>
    <!-- when recording is completed, replay recorded message -->
    <prompt> You said <audio expr="msg"/> </prompt>
  </filled>
</form>

Submitting recording to the server
In this example, a recording has been stored in the variable "msg" and the system asks the user to confirm whether to keep it:
<field name="confirm" type="boolean">
  <prompt> Your message is <audio expr="msg"/>. </prompt>
  <prompt> To keep it, say yes. To discard it, say no. </prompt>
  <filled>
    <if cond="confirm">
      <submit next="save_message.jsp" enctype="multipart/form-data" method="post" namelist="msg"/>
    </if>
    <clear/>
  </filled>
</field>

<record> shadow variables (1)
NB: "name" stands for the name of the form item variable.
name$.duration - The duration of the recording in milliseconds.
name$.size - The size of the recording in bytes.
name$.termchar - The DTMF key used by the caller to terminate the recording. This variable is undefined if a key was not used to terminate the audio.
name$.maxtime - true indicates the recording was terminated because the maxtime duration was reached; false indicates the recording was not terminated due to maxtime.

<record> shadow variables (2)
name$.utterance - The string of words spoken by the user if the recording was terminated by speech recognition input. This shadow variable is undefined if the recording was not terminated by speech recognition input.
name$.confidence - The confidence level (0.0 - 1.0) if the recording was terminated by speech. This shadow variable is undefined if the recording was not terminated by speech recognition input. The confidence level is the speech recognizer's estimate of the accuracy of its results, in this case the accuracy of the contents of name$.utterance.
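A sketch of how the shadow variables might be read once a recording completes. The form id and prompt wording are hypothetical; the variable name msg and the attribute values follow the earlier <record> slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="leave_message">
    <record name="msg" beep="true" maxtime="10s" finalsilence="3s"
            dtmfterm="true" type="audio/x-wav">
      <prompt>Record your message after the beep.</prompt>
      <filled>
        <!-- msg$.maxtime is true only if the time limit ended the recording -->
        <if cond="msg$.maxtime">
          <prompt>You reached the time limit.</prompt>
        <!-- msg$.termchar is undefined unless a DTMF key ended the recording -->
        <elseif cond="msg$.termchar != undefined"/>
          <prompt>You ended the recording by pressing <value expr="msg$.termchar"/>.</prompt>
        </if>
        <!-- msg$.duration is in milliseconds -->
        <prompt>Your message is about <value expr="Math.round(msg$.duration / 1000)"/> seconds long.</prompt>
      </filled>
    </record>
  </form>
</vxml>
```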

Dealing with user hang up during recording
When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown.
Audio recorded up until the hangup is available through the <record> variable, e.g.
  <catch event="connection.disconnect.hangup"> … action such as submit recording to server … </catch>
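A sketch of one possible action for the handler: submitting the partial recording when the caller hangs up. It reuses save_message.jsp and the variable msg from the earlier submit slide; placing the <catch> inside <record> and the form id are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="leave_message">
    <record name="msg" beep="true" maxtime="10s" type="audio/x-wav">
      <prompt>Record your message after the beep.</prompt>
      <!-- audio captured before the hangup is still held in msg -->
      <catch event="connection.disconnect.hangup">
        <submit next="save_message.jsp" enctype="multipart/form-data"
                method="post" namelist="msg"/>
      </catch>
    </record>
  </form>
</vxml>
```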

Exercise: SSML markup
Create a file using some SSML markup for TTS.
Examples:
  He drove his new car, <prosody pitch="-10%" range="-20%" volume="-20%">not his ugly old car</prosody>, because he wanted to seem more <emphasis level="strong"> impressive </emphasis>
  My user number is <say-as interpret-as="digits"> </say-as>
Sample file: tts.vxml

Exercise: recording and using audio files
Create a simple application that includes a field in which you ask the user to speak some information, such as name and address, that is recorded by the system for later playback. Play back a pre-recorded file (music to be played as introduction)
