MIDOS Architecture Meeting Aaron Adler March 2, 2009.


MIDOS: Multimodal Interactive DialOgue System
Architecture and code overview
~45k lines of Java code in the package
~6k lines of C# code
Feel free to ask questions

Outline
Quick motivation
High level architecture
More details
Code layout

Multimodal Interaction

Motivation Many unspecified parameters – Large cognitive effort to specify all parameters Multimodal interaction – Symmetric Dynamic dialogue – Driven by qualitative physics

Motivation Sketching is great for communicating designs – Some things are hard to sketch Use speech – Some details are uncertain Have a conversation Multimodal dialogue – speech and sketching – makes the computer more of a partner

User Study 1: Multimodal Device Descriptions Empirical study of informal speech Key results – Disfluencies – Pauses -> topic change – Talking and sketching about the same topics

User Study 2: Human Multimodal Dialogue Experimenter, participant Speech, sketching recording, shared sketching surface Key results: – Pen color: refer back, new topic, real world – Disfluent speech – Concurrent speech/sketching about same topics – Simple questions

Project Sketch

Color Correspondence

Color Paths

Sketch Modification

Architecture
C#: sketch input, sketch output, speech recognition, speech synthesis
Messages between C# and Java: commands (open, save, etc.), sketching (himetric units), speech
Java: physics calculations, question selection, speech / sketch generation, file opening / saving

Information Request Flow
[Diagram: the physics simulator generates Information Requests; one request is selected and a Question is generated from it; the system asks the Question and receives a Reply; the ink that was used in the question and reply is erased; the reply produces a Statement and Velocity Updates, and further Information Requests follow.]

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Speech Input Sketch Input Sketch Output Speech Output

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Speech Input Sketch Input Sketch Output Speech Output
Analyze shapes: motion paths, conflicting motion, spring directions

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Speech Input Sketch Input Sketch Output Speech Output
Analyze physics: collisions, missing information, under-specification

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Speech Input Sketch Input
1) Pick request 2) Form question 3) Ask the question

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Sketch Output Speech Output
1) User reply 2) Compare to expectations 3) Determine top score 4) Update physics

Physics Simulator Qualitative physics – Don’t have exact numbers, do have angles Modest – Won’t be a complete simulation Generate sensible questions – If something is ambiguous can ask the user Goal is to generate dialogues with the user

Simulation Method Two possible approaches: – Calculation approach: calculate how far things move directly – Step approach: use small, incremental movements Using the calculation approach

Motion Objects can have translational motion OR rotational motion, but not both Translational motion – Velocity angle, or set of angles Rotational motion – Clockwise, counter clockwise, none, unknown Stores velocity and acceleration for each – Acceleration merged with velocity at each time step – Gravity is just a downward acceleration

Torque Possibilities Counterclockwise Clockwise Ambiguous – Ask the user
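The three outcomes above can be sketched as a sign-style combination of qualitative torques. This is only an illustration, not the MIDOS physics code: the class `QualitativeTorque`, the enum `Rotation`, and the methods `combine` and `shouldAskUser` are hypothetical names, and the rule "opposing torques without magnitudes are ambiguous" is an assumption consistent with the slides.

```java
import java.util.List;

// Hypothetical sketch: combine qualitative torque directions acting on one body.
final class QualitativeTorque {
    enum Rotation { CLOCKWISE, COUNTERCLOCKWISE, NONE, AMBIGUOUS }

    // With no magnitudes available, opposing torques cannot be resolved,
    // so the net result is AMBIGUOUS (assumption).
    static Rotation combine(List<Rotation> torques) {
        boolean cw = false;
        boolean ccw = false;
        for (Rotation t : torques) {
            if (t == Rotation.AMBIGUOUS) {
                return Rotation.AMBIGUOUS; // one unknown torque makes the net unknown
            }
            if (t == Rotation.CLOCKWISE) cw = true;
            if (t == Rotation.COUNTERCLOCKWISE) ccw = true;
        }
        if (cw && ccw) return Rotation.AMBIGUOUS;
        if (cw) return Rotation.CLOCKWISE;
        if (ccw) return Rotation.COUNTERCLOCKWISE;
        return Rotation.NONE;
    }

    // An ambiguous net torque is what would trigger a question to the user.
    static boolean shouldAskUser(Rotation net) {
        return net == Rotation.AMBIGUOUS;
    }
}
```

For example, a body with one clockwise and one counterclockwise torque combines to `AMBIGUOUS`, which is the case where the system would ask whether the body is initially balanced.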

Shapes All the objects are turned into polygons – Used for the collision detection and calculations Even ellipses are approximated as polygons Uses polylines for ropes, springs Seems like a reasonable approximation Reduces the number of special cases

Generating Questions Physics engine determines what information is unknown or incomplete – Information Requests are generated Rank and pick the Information Request to process Generate a question from the Information Request – Generate speech and sketching – Questions have expected speech and sketch responses

What Question to Ask? Question applicability governed by: – Current arrangement/state of the shapes – Facts that are currently known – Previous questions (to follow up on conflicts, etc.) Some questions are dependent on facts from answer to previous question – If we know that a collision occurs – Then we can ask about which block hits the other one (the specifics of the collision)

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Speech Input Sketch Input

Multimodal Output Coordinate speech and sketching outputs – AT&T Natural Voices speech synthesizer – Computer generated strokes Simple language to generate output timing – Allow strokes to be drawn during a particular part of the speech utterance

Multimodal Language Examples
What direction does (this shape) rotate in? [(): draw one stroke]
Which of {these directions} does this shape move in? [{}: draw one or more strokes]

Multimodal Language Examples

(These two) (bodies) collide (here.) Where on (this) body does the contact occur? Pause Clear all the strokes

Multimodal Output Coordinate speech and sketching outputs – Complex to do by hand – Generate human-like sketch output No low-level access to the speech synthesizer Calculate word and phrase timings ahead of time Language to generate output timing

Multimodal Output Language (): match one stroke {}: match one or more strokes automatic: adjust for number agreement Allows the computer to pause where needed to keep speech and sketching synchronized
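A minimal sketch of how such a template might be split into speech segments, assuming only the two markers described above. The class `OutputTemplate`, the enum `StrokeSpec`, and the `Segment` type are hypothetical names for illustration; number agreement and timing are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split an output template into segments.
// "(...)" marks words paired with one stroke, "{...}" with one or more strokes.
final class OutputTemplate {
    enum StrokeSpec { NONE, ONE, ONE_OR_MORE }

    static final class Segment {
        final String words;
        final StrokeSpec strokes;

        Segment(String words, StrokeSpec strokes) {
            this.words = words;
            this.strokes = strokes;
        }
    }

    static List<Segment> parse(String template) {
        List<Segment> out = new ArrayList<>();
        StringBuilder plain = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            char c = template.charAt(i);
            if (c == '(' || c == '{') {
                char close = (c == '(') ? ')' : '}';
                int end = template.indexOf(close, i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("Unclosed '" + c + "' at index " + i);
                }
                flushPlain(plain, out);
                StrokeSpec spec = (c == '(') ? StrokeSpec.ONE : StrokeSpec.ONE_OR_MORE;
                out.add(new Segment(template.substring(i + 1, end).trim(), spec));
                i = end + 1;
            } else {
                plain.append(c);
                i++;
            }
        }
        flushPlain(plain, out);
        return out;
    }

    // Emit accumulated speech-only text, skipping whitespace-only runs.
    private static void flushPlain(StringBuilder plain, List<Segment> out) {
        String words = plain.toString().trim();
        if (!words.isEmpty()) {
            out.add(new Segment(words, StrokeSpec.NONE));
        }
        plain.setLength(0);
    }
}
```

Parsing "What direction does (this shape) rotate in?" yields three segments, with only the middle one ("this shape") tied to a stroke. A timing pass could then insert pauses around the stroke-bearing segments.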

Multimodal Output How to time the outgoing speech and sketching Previously used hand coded timings Too hard to do with complicated or dynamic output

Multimodal Output No low-level access to the speech synthesizer – Don’t know when it’s done speaking Calculate word and phrase timings ahead of time – Can’t even just add word timing together Build a little language to help

Underestimate or Overestimate Speech Times? Best to underestimate the word timing – why? – If you overestimate, possibly start speaking again too soon – If you underestimate, speech won’t overlap May try to start talking sooner, but it can’t output more than one thing at once

Overestimating Speech
[Timeline figure: estimated speech, actual speech, sketching]

Underestimating Speech
[Timeline figure: estimated speech, actual speech, sketching]

Speech Synthesizer AT&T Natural Voices Synthesizer – Easy to integrate – Sounds reasonable Java and C# interfaces – Using C# interface Has documentation and samples

Sketch Output Timed display of strokes and speech – Should allow strokes to be drawn during a particular part of the speech utterance Computer-generated strokes – Can be drawn in a specified duration – Circle objects – Points moved so that strokes appear more like human-drawn strokes
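One way to picture "drawn in a specified duration" with slightly moved points is the sketch below. It is an assumption-laden illustration, not the C# stroke generator: `StrokeTimer`, `TimedPoint`, and `schedule` are hypothetical names, the even spacing of timestamps is assumed, and the uniform random offset stands in for whatever perturbation the real system applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: give each point of a generated stroke a display time
// and a small random offset so the stroke looks less mechanical.
final class StrokeTimer {
    static final class TimedPoint {
        final double x;
        final double y;
        final long tMillis;

        TimedPoint(double x, double y, long tMillis) {
            this.x = x;
            this.y = y;
            this.tMillis = tMillis;
        }
    }

    // points: array of {x, y}; timestamps are spread evenly over durationMillis.
    // jitter: maximum offset per coordinate, in the same units as the points.
    static List<TimedPoint> schedule(double[][] points, long durationMillis,
                                     double jitter, Random rng) {
        List<TimedPoint> out = new ArrayList<>();
        int n = points.length;
        for (int i = 0; i < n; i++) {
            long t = (n == 1) ? 0 : Math.round((double) i * durationMillis / (n - 1));
            double dx = (rng.nextDouble() * 2 - 1) * jitter;
            double dy = (rng.nextDouble() * 2 - 1) * jitter;
            out.add(new TimedPoint(points[i][0] + dx, points[i][1] + dy, t));
        }
        return out;
    }
}
```

With five points and a duration of 1000 ms, the points are scheduled at 0, 250, 500, 750, and 1000 ms, so the stroke can be aligned with a part of the utterance whose length was estimated ahead of time.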

System Components Dialogue Manager Interpret and Score Question Generation Qualitative Physics Simulator Information Request Generator Sketch Output Speech Output

Multimodal Input Sketching Input – N-best interpretations (line, arc, polyline, etc.) – Use context to determine type – Compare to expected stroke Speech Input – Microsoft speech recognizer in dictation mode – N-best results list – Compare to expected speech

Processing Steps
1. Ask the user a question
2. Match and score the user speech against expected speech
3. Match and score the user sketching against the expected sketching
4. Pick the best scoring combination
5. Evaluate the best scoring combination
   1. If successful, continue
   2. If not, ask a follow-up question, return to first step
6. Generate statements about the new information and update the system state
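Steps 2 through 5 can be sketched roughly as follows. The names `ReplySelector`, `Scored`, `Combination`, `pickBest`, and `needsFollowUp` are hypothetical, and both the additive combination of the two scores and the follow-up threshold are assumptions; the slides do not say how the modalities are weighted.

```java
import java.util.List;

// Hypothetical sketch: choose the best speech/sketch pairing from scored candidates.
final class ReplySelector {
    static final class Scored {
        final String label;
        final int score;

        Scored(String label, int score) {
            this.label = label;
            this.score = score;
        }
    }

    static final class Combination {
        final Scored speech;
        final Scored sketch;

        Combination(Scored speech, Scored sketch) {
            this.speech = speech;
            this.sketch = sketch;
        }

        // Assumption: the combined score is a plain sum of the two modality scores.
        int total() {
            return speech.score + sketch.score;
        }
    }

    // Returns the highest-scoring pair, or null if either list is empty.
    static Combination pickBest(List<Scored> speech, List<Scored> sketch) {
        Combination best = null;
        for (Scored sp : speech) {
            for (Scored sk : sketch) {
                Combination c = new Combination(sp, sk);
                if (best == null || c.total() > best.total()) {
                    best = c;
                }
            }
        }
        return best;
    }

    // Step 5: a missing or weak combination leads to a follow-up question.
    static boolean needsFollowUp(Combination best, int threshold) {
        return best == null || best.total() < threshold;
    }
}
```

Enumerating pairs looks redundant with a plain sum, but it leaves room for a joint score, for example one that penalizes speech and strokes that are far apart in time.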

Speech Recognizer Tried various recognizers Microsoft recognizer proved easiest to get running – Reasonable results – N-best list of results – Easy API from C#

Sketch Input Using C# allows the capture of pen pressure data – C# handles the rendering of the strokes Passing the data to Java allows the reuse of code – Saving to XML – Existing sketch recognition code (Simple Classifier)

Scoring the Reply, Speech
Uses n-best speech results
– Decreasing score as you move down the n-best list and decreasing score as you match less of the expected speech
Expected speech: Yes, Yeah

N-best result   Rank score   Score after matching
Yes it is       100          100
yes it is       90           90
this is         80           0
as it is        70           0
U.S. is         60           0
yes is          50           50
U.S. news       40           0
U.S. in his     30           0
yes in his      20           20
as in his       10           0
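The numbers on this slide are consistent with a simple rule: each n-best hypothesis gets a rank score (100, 90, ...), and keeps it only if it contains one of the expected words. The sketch below reproduces that rule. `SpeechReplyScorer` and its methods are hypothetical names, and the all-or-nothing match is a simplification; the slide also mentions lowering the score for partial matches, which is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: score n-best speech hypotheses against expected words.
final class SpeechReplyScorer {
    // 100 for the top hypothesis, 10 less for each position below it, never negative.
    static int rankScore(int rank) {
        return Math.max(0, 100 - 10 * rank);
    }

    // True if any whitespace-separated word of the hypothesis is an expected word.
    // Expected words are assumed to be lowercase.
    static boolean containsExpected(String hypothesis, Set<String> expected) {
        for (String word : hypothesis.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (expected.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Hypotheses that match keep their rank score; the rest score zero.
    static List<Integer> score(List<String> nBest, Set<String> expected) {
        List<Integer> scores = new ArrayList<>();
        for (int i = 0; i < nBest.size(); i++) {
            scores.add(containsExpected(nBest.get(i), expected) ? rankScore(i) : 0);
        }
        return scores;
    }
}
```

Applied to the ten hypotheses on the slide with expected words "yes" and "yeah", this gives 100, 90, 0, 0, 0, 50, 0, 0, 20, 0.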

Scoring the Reply, Sketching
Uses scores from the sketch recognition results
– Decreasing score for worse matches
Not expecting particular shapes; expecting a selection, path, or location interpretation
Sketch recognition results: Polyline 100, Line 80, Arc 60, Complex 55, Ellipse 0
Interpretations: Path 90, Path 70, Location 60, Selection 20, …

Interesting Complications – Optional or additional speech and sketching User provides additional, redundant, extraneous, or conflicting information – Responses that answer multiple questions or provide extra information For example, “This shape moves in this direction, and so does this one.” or “Yes these two shapes collide, and this shape collides into the other one.”

Combining Input Modalities Pick best scoring inputs Input timing is critical Recognize conflicting or missing information Update physics or ask for clarification

Spring Direction Input Example “It moves in this direction” “It contracts” “It expands”

Yes/No Input Table

Multimodal Input Table Details Possible operators: – Identity, Not, Param, Star Two lists of operators: – Required, Optional Possible results: – Success, Conflict, Insufficient, None
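To make the four possible results concrete, here is a small sketch using the spring direction replies ("It moves in this direction", "It contracts", "It expands"). It does not model the Identity, Not, Param, and Star operators or the required/optional lists, because the slides do not define their semantics. The class `SpringReplyCheck`, its enums, and every rule in `evaluate` are assumptions made for illustration.

```java
// Hypothetical sketch: classify a spring-direction reply from speech and sketch input.
final class SpringReplyCheck {
    enum Result { SUCCESS, CONFLICT, INSUFFICIENT, NONE }
    enum Speech { CONTRACTS, EXPANDS, THIS_DIRECTION, NOTHING }
    enum Stroke { CONTRACTS, EXPANDS, NOTHING }

    static Result evaluate(Speech speech, Stroke stroke) {
        if (speech == Speech.NOTHING && stroke == Stroke.NOTHING) {
            return Result.NONE; // no usable input in either modality
        }
        if (speech == Speech.THIS_DIRECTION) {
            // "this direction" only means something together with a stroke
            return stroke == Stroke.NOTHING ? Result.INSUFFICIENT : Result.SUCCESS;
        }
        if (speech == Speech.NOTHING || stroke == Stroke.NOTHING) {
            // Assumption: one modality alone is enough when it names a direction
            return Result.SUCCESS;
        }
        boolean agree = (speech == Speech.CONTRACTS && stroke == Stroke.CONTRACTS)
                || (speech == Speech.EXPANDS && stroke == Stroke.EXPANDS);
        return agree ? Result.SUCCESS : Result.CONFLICT;
    }
}
```

Under these assumed rules, saying "It contracts" while drawing a stroke in the expanding direction gives `CONFLICT`, and saying "It moves in this direction" without drawing anything gives `INSUFFICIENT`. Either result would prompt a clarifying question instead of a physics update.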

Multimodal Input Table Revisiting user input handling Need to: – Handle different multimodal combinations of input – Recognize conflicting or missing information – Look for input that matches expectations – Look at timing of input Generalizes code and makes it easier to add new questions

Spring Direction Input Table

Dynamic Dialogue Computer driven interaction – No predetermined conversation – Questions determined by physics – Differs from other systems where user asks questions

Possible Additions Fine tuning modality combination – How long to leave strokes on the screen? Focus the user’s attention Length of the user’s reply Speech lattice? Dialogue management – Picking questions, colors, gestures to use? Initial sketch input (LADDER?)

Code
In “converse” package
Backend: core system, lots of I/O code here
Dialogue: core system
Display: Java debugging UI
Information: the main information request code
– Table: code for the input checking table
– Requests: the specific information request classes
Physics: the qualitative physics simulator; hopefully you don’t need too much of this
Sketch: sketch-related code
Speech: speech-related code

Code Data all stored in MultimodalActionHistory – Can technically store the audio wav files too? – Stores all the sketching – Could be replayed, haven’t tried though – Could back up to a previous state as all states are stored, but haven't done that either Initializer (Java) times all the speech output and stores to a file – If you add new outgoing speech you’ll get an error if you haven’t run this

Code Multimodal output alignment Input speech / sketching in chunks Matching is greedy, can’t match strokes after a {} Chunking gets around this problem. New step: determine number / singular / plural word forms (separate file) Timing is a separate step – Takes into account the time the strokes will take to draw etc., how long the speech will take to speak

Java Code
display.DesignAssistant: main class, starts backend
backend: processes incoming and outgoing messages; sends messages to C#, gets messages back
dialogue: speech and stroke scoring, output alignment, dialogue management
display:
– “fact” display (really statements; the code still calls them facts)
– Java editing operations, coordinate conversion

Java Code
information: types of input, types of responses
information.QuestionGenerator
– actually generates the questions
– stores the text for the questions
– Pie areas, strokes, offset speech and sketching classes help with timing
information.Reply: class for the user reply
information.requests: all the information requests
information.table: table operators (identity, not, param, star)

Java Code (physics) physics – lots of angle stuff, rotation, translation, directions – constraints physics.facts – these might need to get moved – perhaps made more general physics.filters – various sorts for looking at more global properties, collisions, etc physics.modifiers – updating physics physics.projectors – predicting the next state

Java Code Sketch – sketched shapes, etc. – Some domain specific stuff in here Speech – recognized / chosen speech

C# Code
Converse UI solution
Communications: file IO / sketch / speech communication with Java
ConverseUI: the UI itself
– DesignAssistant (main class)
DataProcessing
– Queues, Catalog, MUGPoint, MUGStroke, Projection, SolidDraw, etc.
– Also has the code that does the stroke generation
– Catalog keeps track of what should be displayed and the different types of displayed objects
PointConverter: getting the right timing on incoming stroke points
Speech: interface with the MS speech recognizer

Egg Cracker Walkthrough

S: U: Hi. Motivation: Get the user used to talking / hearing the system. Verify that it has started up.

S: Are all these bodies the same? U: No, just these two. Motivation: Asking about similar objects – detect objects of same size and shape or same components

S: Are any of these fixed? U: Only this one. Motivation: Asking if stationary/fixed – Objects are not attached to anything

S: ?? I’ve updated the diagram to show that the body is anchored Action: Updated screen to reflect that the block is anchored.

S: So is this body touching anything? U: Yes, this one and that one. S: And this one? U: No.

S: Is this touching these bodies? U: Yes it is. Motivation: Asking about touching objects – unconnected object, not fixed, close to other objects.

S: I’ve updated the position of the body. Action: Update the position of the block so that it is touching the other blocks.

S: The block has a pivot and torques in both directions. Is it initially balanced? U: Yes. [Could also ask this later on in the conversation, need to try it out]

S: How does this work? U: First this block is pulled up. Motivation: Signaling that the preliminary questions are over; time to talk about the rest of it.

S: How far does it move? U: Until the bottom is here. S: Ok. Motivation: motion indicated with no objects blocking indicated path. Need to clarify motion range

S: I’ve updated the position of the body. Does the body move this way or that way? Action: Update block position. Motivation: Change of contact from last update

U: This way.

S: So the body hits this body pushing it to the right, does it hit this surface? U: Maybe – it doesn’t matter. It goes down and hits this platform. Motivation: Confirming assumption derived from physics which was indicated by the collision. Then asking about a surface that is in the path of motion.

Note: Round ball going off edge – not sure how the DF change there would work

S: So after the collision, it rotates counterclockwise? U: Yes.

S: Do these two collide? U: Yes Motivation: Verify collisions

S: What direction does this rotate in initially? U: It is balanced to start. And so is this one. Motivation: Unknown rotation/balance

S: The collision causes this to rotate. How far does it rotate? U: It rotates at least this far so that the block falls into here. This rotates too. Motivation: Unknown rotation/balance

S: Does this body collide with this one? U: Yes. Motivation: Verify predicted collision

S: Does the body bounce or stop after the collision? U: It stops. Motivation: Ambiguous collision

S: What about this body? Where does it end up? U: It doesn’t matter. Motivation: Ambiguous collision

S: Is there anything else? U: No that’s it. Action: Moved body off screen
