Text to speech to text: a third orality? Lawrie Hunter Kochi University of Technology


1 Text to speech to text: a third orality? Lawrie Hunter Kochi University of Technology

2 Current state:
-Fragmentation of knowledge as a result of the ongoing creation of research niches
-A voracious, yet protective and covetous knowledge industry

3 Current state: Isn’t CALL just a subset of User Experience (UX)?

4 URGENT: Just-in-time learner sociology. URGENT: Near-instant learner profiling. Upgrade: Learner => USER. User Experience (UX) practice, e.g. UZANTO’s MindCanvas:
-user profiling for a large target group in a matter of hours
-RUMM: rapid user mental modelling
-GEMS: game emulation
This may be very fruitfully adapted to the foundation explorations leading to CALL decision-making. Hunter (2006)*: Learners are evolving. (The expanding palette: Emergent CALL paradigms. Invited virtual presentation, Antwerp CALL 2006.)

5 Now text-to-speech and speech-to-text (T2S2T) software has become truly usable in a very practical sense. This blurs the line between speech and text in a very immediate way.

6 Usable T2S2T No more typing. No more reading. No more hands. Composition by speaking...ooh! Information acquisition by listening...ahh! If we do this, we will be in a new orality.

7 What?? Audio is lame: VIDEO is the game. We are in the YouTube era. Get a Second Life!

8 T2S will be fully usable in 2 (or x) years; we must assume the future and shift our place of work there.

9 QUESTION: For second language learning systems development, is audio going out?

10 TODAY: A search for principles governing the use of voice in CALL

11 Investigation of voice and cognition

12 Walter Ong, 1982 Orality and Literacy: The Technologizing of the Word PRIMARY ORAL cultures (no system of writing) think differently from CHIROGRAPHIC cultures

13 Walter Ong, 1982 Orality and Literacy: The Technologizing of the Word: “Electronic media (e.g. telephone, radio and television) brought about a second orality” [paraphrase] “Both primary and secondary oralities afford a strong sense of membership in a group.” [paraphrase]

14 BUT Secondary orality is "essentially a more deliberate and self-conscious orality, based permanently on the use of writing and print," and produces much larger groups. Walter Ong, 1982 Orality and Literacy: The Technologizing of the Word: “Electronic media (e.g. telephone, radio and television) brought about a second orality” [paraphrase] “Both primary and secondary oralities afford a strong sense of membership in a group.” [paraphrase]

15 Kathleen Welch* rejects claims that Ong posits a mutually exclusive, competitive, reductive orality-literacy divide. Welch argues that Ong emphasizes:
-a mingling of these types of consciousness
-the tenacity of established forms as new ones appear
*Welch, K. (1999) Electric Rhetoric: Classical Rhetoric, Oralism, and a New Literacy. MIT Press. p. 59

16 Welch argues that TV's ubiquity has resulted in a new, electronic literacy. We shall not go there today.

17 Workable T2S2T promises to change the nature of cognitive load constraints in text production/decoding, and hence in language learning tasks.

18 Workable T2S2T There is now S2T (Dragon Voice) for Indian English*, British English... but not for Japanese English yet. (Ever?) *

19 Workable T2S2T There is now S2T (Dragon Voice) for Indian English*, British English... but not for Japanese English yet. (Ever?) * So the tech is there for computers to decode human speech better than humans can...?

20 HOWEVER we don’t know much about how orality works. Perhaps that is because orality is so ingrained in us.

21 Walter Ong, 1982 Orality and Literacy: The Technologizing of the Word The three stages of consciousness:
-Primary orality: 200,000 years
-Literacy: 2,800 years (invention of the phonetic alphabet in the 8th century BCE*)
-Secondary orality: 163 years (since telegraphy, USA, 1844)
*Rhys Carpenter (1933) The antiquity of the Greek alphabet. American Journal of Archaeology 37:

22 WIRED FOR SPEECH* Orality has been part of human life for a long time. After 200,000 years of evolution: “...humans have become voice-activated, with brains that are wired to equate voices with people and to act quickly on that information.” *Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press.

23 Writing can never exist without orality. (Ong, p. 8) Speeches that were studied as rhetoric could only be studied if they were transcribed. Writing: ‘a secondary modelling system’ Lotman, J., trans. R. Vroon (1977) The Structure of the Artistic Text. Michigan Slavic Studies 7. Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge.

24 “...to this day no concepts have yet been formed for effectively, let alone gracefully, conceiving of oral art as such without reference, conscious or unconscious, to writing.” (Ong, p. 10) Writing: ‘a secondary modelling system’ Lotman, J., trans. R. Vroon (1977) The Structure of the Artistic Text. Michigan Slavic Studies 7. Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge.

25 Psychodynamics of orality “...you know what you can recall.” Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge.

26 Psychodynamics of orality: Pythagoras and the acousmatics The term acousmatic dates back to Pythagoras, who is believed to have tutored his students from behind a screen so as not to let his presence distract them from the content of his lectures. (wikipedia.org, May 20, 2007; edited from Chion, M. (1994) Audio-Vision: Sound on Screen. Columbia University Press.)

27 Psychodynamics of orality: Pythagoras and the acousmatics In cinema, acousmatic sound is sound one hears without seeing an originating cause - an invisible sound source. Radio, phonograph and telephone, all of which transmit sounds without showing the source cause, are acousmatic media. (wikipedia.org, May 20, 2007; edited from Chion, M. (1994) Audio-Vision: Sound on Screen. Columbia University Press.)

28 Psychodynamics of orality Acousmatic sound is ubiquitous in CALL. Aren’t there situations where acousmatic sound is appropriate, and situations where it is not?

29 Orality and writing production Kellogg: Sentence Production Demands: Verbal Working Memory
“Orthographic as well as phonological representations must be activated for written spelling.”
-Bonin, Fayol, & Gombert (1997)
“Verbal WM is necessary to maintain representations during grammatical, phonological, and orthographic encoding.”
-Levy & Marek (1999)
-Chenoweth & Hayes (2001)
-Kellogg, Olive, & Piolat (2006)
Kellogg, R. (2006) Training writing skills: A cognitive developmental perspective. EARLI SIGWriting 2006, Antwerp.

30 Audio sources in life John Thackara* tells of Ivan Illich’s finding that in the 1930s, 9 out of 10 words a man heard by age 20 were spoken directly to him; in the 1970s, 9 out of 10 words a man heard by age 20 were spoken through a loudspeaker. Illich (1982): “Computers are doing to communication what fences did to pastures and what cars did to streets.” * book: In the Bubble blog:

31 Human beings “can quickly distinguish one person’s voice from another.” p. 3 *we know these things from differing heartbeat responses Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. We are innately orate

32 Human beings “can quickly distinguish one person’s voice from another.” p. 3 -even in the womb we can distinguish our mother’s voice from that of another.* *we know these things from differing heartbeat responses Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. We are innately orate

33 Human beings “can quickly distinguish one person’s voice from another.” p. 3 -even in the womb we can distinguish our mother’s voice from that of another.* -a few days after birth, newborns prefer their mother’s voice to that of others, and can distinguish one unfamiliar voice from another.* *we know these things from differing heartbeat responses Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. We are innately orate

34 Human beings “can quickly distinguish one person’s voice from another.” p. 3 -even in the womb we can distinguish our mother’s voice from that of another.* -a few days after birth, newborns prefer their mother’s voice to that of others, and can distinguish one unfamiliar voice from another.* -by 8 months of age we can attend to one voice even when another is speaking at the same time. *we know these things from differing heartbeat responses Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. We are innately orate

35 Word choice carries social information. UX work makes choices such as blame assignment: 1. “Speak up.” 2. “I’m sorry, I didn’t catch that.” 3. “We seem to have a bad connection. Could you please repeat that?” Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. Humans: experts at extracting social information from speech
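The three repair prompts above differ in where they place blame: on the user, on the system, or on the channel. As a design aid, the escalation might be sketched as a simple retry policy; the three prompts are quoted from the slide, while the function name and the escalation order are illustrative assumptions, not from Nass and Brave:

```python
# Illustrative sketch: choosing a repair prompt by retry count.
# The prompts come from the slide above; the escalation policy
# itself is an assumption for illustration.

REPAIR_PROMPTS = [
    "Speak up.",  # blames the user
    "I'm sorry, I didn't catch that.",  # blames the system
    "We seem to have a bad connection. Could you please repeat that?",  # blames the channel
]

def repair_prompt(retry_count: int) -> str:
    """Return the prompt for the given retry, sticking at the last one."""
    return REPAIR_PROMPTS[min(retry_count, len(REPAIR_PROMPTS) - 1)]
```

A real voice interface would likely start with the least face-threatening option; the ordering here simply follows the slide’s numbering.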

36 Word choice carries social information. UX work makes choices such as voice quality: Booming deep voice: “Could I possibly ask you if you wouldn’t mind doing a tiny favor?” High-pitched, soft voice: “Pick up that shovel and start digging!” Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. Humans: experts at extracting social information from speech

37 “...the conscious knowledge that speech can have a non-human origin is not enough for the brain to overcome the historically appropriate activation of social relationships by voice [even when voice quality is low and speech understanding is poor].” Nass, C. & Brave, S. (2005) Wired for Speech. MIT Press. Humans: automatically react socially to ‘voice’

38 “...in an oral noetic economy, mnemonic serviceability is sine qua non...” p. 70 In other words, oral information must be arranged in a certain way [a visual way] if it is to be remembered. Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge. Interiority of sound

39 The eye cannot perceive interiority, only surfaces. Taste and smell are not much help in registering interiority/exteriority. Touch can detect interiority but in the process damages it. Hearing can register interiority without violating it. Sight isolates, sound incorporates. Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge. Incorporating interiority


41 Oral memory In primary oral cultures, need for an aide-mémoire:
-heavily rhythmic speech
-balanced patterns
-epithetic expressions
-formulary expressions
-standard thematic settings
Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge. p. 33

42 Oral memory In primary oral cultures, thought and expression are additive rather than subordinative. Ong, W. (1982) Orality and literacy: The technologizing of the word. Reprint: Routledge. p. 37 ff.

43 Tentative observations based on the exploratory hands-on experience of second language users. PhD technical writing class, KUT, May 24, 2007 Innisfree 1 Innisfree 2 Innisfree 3 Coney Island 1 Coney Island 2 Coney Island 3

44 PhD technical writing class, KUT, May 24, 2007 Tentative observations based on the exploratory hands-on experience of second language users.


46 PhD technical writing class, KUT, May 24, 2007 Self-reported estimates of comprehension of samples. Tentative observations based on the exploratory hands-on experience of second language users.


48 How might language learning support systems be influenced by the new T2S2T technological reality?

49 Articulation at the phrase level In the learner’s awareness:
-S2T software foregrounds articulation
-T2S foregrounds intonation, blending, pausing

50 Articulation at the phrase level Can S2T be used to improve pronunciation? Mitra, S., Tooley, J., Inamdar, P. and Dixon, P. (2003) Improving English Pronunciation: An Automated Instructional Approach. Information Technologies and International Development 1(1), Fall 2003, 75–84. Massachusetts Institute of Technology.

51 Mitra, S., Tooley, J., Inamdar, P. and Dixon, P. (2003) Improving English Pronunciation: An Automated Instructional Approach. Information Technologies and International Development 1(1), Fall 2003, 82. Massachusetts Institute of Technology. Articulation at the phrase level Can S2T be used to improve pronunciation?

52 The looming prospect of a text-reduced world: Specificity as a foreign language e.g. University X web site: Japanese interview => English web site

53 T2S2T brings richness to materials design. T2S2T should imply that there will be a broad, instantaneous choice of interface with data. Aside from tangible choices of medium, other parameters demand attention:
-input density (number of communication objects per signal)
-input complexity (degree of text reduction)
-visual field richness
-number of simultaneous signals
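For concreteness, the parameters listed above could be recorded per task as a small data structure during materials design. A minimal sketch, in which the class and field names are my assumptions paraphrasing the slide:

```python
from dataclasses import dataclass

@dataclass
class SignalProfile:
    """One materials-design profile, paraphrasing the slide's parameters.
    All names here are illustrative assumptions."""
    medium: str                 # tangible choice of medium, e.g. "audio", "text"
    input_density: int          # number of communication objects per signal
    text_reduction: float       # degree of text reduction, 0.0 (none) to 1.0 (fully reduced)
    visual_field_richness: int  # rough count of elements in the visual field
    simultaneous_signals: int   # number of simultaneous signals

# Example: an audio-only listening task with heavily reduced text support.
audio_task = SignalProfile(
    medium="audio",
    input_density=3,
    text_reduction=0.8,
    visual_field_richness=0,
    simultaneous_signals=1,
)
```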

54 Sometimes signals are:
1. complementary, e.g. Chang’s* sound track supplies one of many possible intonations for a hypertext
2. conflicting, e.g. phone user in a movie theater
3. mutually irrelevant, e.g. Muzak vs. supermarket sale signs
4. channel competing, e.g. PowerPoint text and speech; mosquito buzz vs. TV images
5. internal-external conflicting, e.g. on-screen text back-checking during S2T writing
*http://www.yhchang.com
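The five signal-pair relationships above might be kept as a lookup for a task-design checklist. A sketch in which the dict structure and helper are my assumptions, while the categories and examples come from the slide:

```python
# Signal-pair relationships from the slide, as a task-design checklist.
SIGNAL_RELATIONSHIPS = {
    "complementary": "a sound track supplies one of many possible intonations for a hypertext",
    "conflicting": "a phone user in a movie theater",
    "mutually irrelevant": "Muzak vs. supermarket sale signs",
    "channel competing": "PowerPoint text and speech; mosquito buzz vs. TV images",
    "internal-external conflicting": "on-screen text back-checking during S2T writing",
}

def describe(relationship: str) -> str:
    """Look up the slide's example for a given relationship."""
    return SIGNAL_RELATIONSHIPS[relationship]
```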

55 A cubist look at text and attention: Chang, Young-Hae, NIPPON.html* Is the text here so reduced as to be iconic? How is this parallel to sound objects? *http://www.yhchang.com

56 A marvel in this age of niche books: many answers from one source: Nass and Brave, Wired for Speech.

57 Improving voice interfaces by applying knowledge of human speech:
-gender choice
-gender stereotyping
-voice personalities
-accent, race, ethnicity
-user emotion / voice emotion
-voice and content emotion
-synthetic vs. recorded
-variation of synthetic voice character
-assignment of humanity
-input type
-error and blame
Nass and Brave, Wired for Speech.

58 Improving voice interfaces by applying knowledge of human speech:
-gender choice
-gender stereotyping
-voice personalities
-accent, race, ethnicity
-user emotion / voice emotion
-voice and content emotion
-synthetic vs. recorded
-variation of synthetic voice character
-assignment of humanity
-input type
-error and blame
Emotion can direct users towards or away from an aspect of an interface. Emotion affects cognition, e.g. in vehicle driving support software. Finding: people find it easier and more natural to attend to voice emotions consistent with their own present emotions. p. 77 Nass and Brave, Wired for Speech.

59 A promising task design tool: Baddeley and Hitch’s 1986 model of working memory, with its 3 components. The three-component model of working memory assumes an attentional controller, the central executive, aided by two subsidiary systems: 1. the phonological loop, capable of holding speech-based information, and 2. the visuospatial sketchpad, which performs a similar function for visual information. The two subsidiary systems form active stores that are capable of combining information from sensory input and from the central executive. Hence a memory trace in the phonological store might stem either from a direct auditory input, or from the subvocal articulation of a visually presented item such as a letter. Please read this on Hunter’s web site.
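The routing described above (direct auditory input feeds the phonological loop, while a visually presented item such as a letter can also reach the phonological store via subvocal articulation) can be caricatured in code. A toy sketch only; all class and method names are my own assumptions:

```python
class WorkingMemory:
    """Toy caricature of the three-component model described above."""

    def __init__(self):
        self.phonological_loop = []       # speech-based traces
        self.visuospatial_sketchpad = []  # visual traces

    def hear(self, item: str) -> None:
        """Direct auditory input leaves a trace in the phonological store."""
        self.phonological_loop.append(item)

    def see(self, item: str, subvocalize: bool = False) -> None:
        """Visual input enters the sketchpad; subvocal articulation of a
        visually presented item additionally recodes it into the
        phonological store, as the model allows."""
        self.visuospatial_sketchpad.append(item)
        if subvocalize:
            self.phonological_loop.append(item)

wm = WorkingMemory()
wm.hear("ba")                  # auditory route
wm.see("B", subvocalize=True)  # visual route plus subvocal recoding
```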

60 Working memory model extended Phonological loop:
-Important for short-term storage
-ALSO for long-term phonological learning
Associated with:
-development of vocabulary in children
-speed of FLA in adults
[Diagram: central executive; phonological loop; visuo-spatial sketchpad; visual semantics; episodic LTM; language]
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11)

61 Working memory model extended Phonological loop effects:
1. Phonological similarity
2. Word-length
3. Articulatory suppression
4. Code transfer
5. Central rehearsal code, not operation
[Diagram: central executive; phonological loop; visuo-spatial sketchpad; visual semantics; episodic LTM; language]
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11)
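The word-length effect (item 2 above) is commonly glossed as: memory span is limited to roughly what can be rehearsed in about two seconds. A toy numeric illustration of that gloss; the two-second window and the per-syllable articulation rate are illustrative assumptions, not measured values:

```python
# Toy illustration of the word-length effect: span is roughly the
# number of items articulable within a fixed rehearsal window.
REHEARSAL_WINDOW_S = 2.0     # assumed rehearsal window
SECONDS_PER_SYLLABLE = 0.25  # assumed articulation rate

def estimated_span(syllables_per_word: int) -> int:
    """Estimate how many words fit in one rehearsal window."""
    seconds_per_word = syllables_per_word * SECONDS_PER_SYLLABLE
    return int(REHEARSAL_WINDOW_S / seconds_per_word)

# Shorter words yield a larger estimated span: the word-length effect.
# estimated_span(1) -> 8, estimated_span(4) -> 2
```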

62 A most promising task design tool: Baddeley’s model of working memory, with its (since 2000) 4 components.
[Diagram: central executive; phonological loop; visuo-spatial sketchpad; episodic buffer; visual semantics; episodic LTM; language]
The episodic buffer:
-assumed capable of storing information in a multi-dimensional code
-thus provides a temporary interface between the slave systems and LTM
-assumed to be controlled by the central executive
-serves as a modelling space that is separate from LTM, but which forms an important stage in long-term episodic learning
Shaded areas: ‘crystallized’ cognitive systems capable of accumulating long-term knowledge. Unshaded areas: ‘fluid’ capacities (such as attention and temporary storage), themselves unchanged by learning.
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11)

63 Current state: Isn’t CALL just a subset of User Experience (UX)?

64 Thank you for your kind attention. Don’t hesitate to write to me. Lawrie Hunter Kochi University of Technology

