1 July 14, 2005, National E-Science Centre
Searching Speech: A Research Agenda
Douglas W. Oard
College of Information Studies and Institute for Advanced Computer Studies
University of Maryland, College Park

2 Some Grid Use at Maryland
Global Land Cover Facility
– 13 TB of raw and derived data from 5 satellites
Digital archives
– Preserving the meaning of metadata structure
Access grid
– No-operator information studies classroom

3 Expanding the Search Space Scanned Docs Identity: Harriet … Later, I learned that John had not heard …

4 Indexable Speech
What if we could collect everything?
– 1 billion users of speech-enabled devices
– Each producing >10K words per day
– Much of it not worth finding
Comparison case: Web search
– Google indexes ~10 billion Web pages
– Perhaps averaging ~1K words each
– Much of it not worth finding
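The comparison on this slide can be checked with a few lines of arithmetic. The sketch below uses only the round figures quoted on the slide; the variable names are illustrative, and ">10K words per day" is treated as a lower bound of exactly 10,000.

```python
# Back-of-envelope figures from the slide; variable names are illustrative.
speech_users = 1_000_000_000       # 1 billion users of speech-enabled devices
words_per_user_per_day = 10_000    # ">10K words per day", used as a lower bound

web_pages = 10_000_000_000         # ~10 billion indexed Web pages
words_per_page = 1_000             # ~1K words per page on average

speech_words_per_day = speech_users * words_per_user_per_day
web_index_words = web_pages * words_per_page

# How many "Web indexes" worth of words would be spoken each day.
ratio = speech_words_per_day / web_index_words

print(f"Speech per day: {speech_words_per_day:.1e} words")
print(f"Web index:      {web_index_words:.1e} words")
print(f"Ratio:          {ratio:.1f}")
```

Under these assumptions, one day of captured speech would be about the same size, in words, as the whole Web index cited on the slide.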

5 A Web of Speech?
                                              Web in 1995             Speech in 2004
Storage (words per $)                         300K                    1.5M
Internet Backbone (simultaneous users)        250K                    30M
Last Mile (Download time)                     1 second (no graphics)  Streaming
Display Capability (Computers/US population)  10%                     100%
Search Systems                                Lycos, Yahoo

6 The Need for Scalable Solutions

7 Some Spoken Word Collections
Broadcast programming
– News, interview, talk radio, sports, entertainment
Storytelling
– Books on tape, oral history, folklore
Incidental recording
– Speeches, courtrooms, meetings, phone calls

8 Indexing Options
Transcript-based (e.g., NASA)
– Manual transcription, editing by interviewee
Thesaurus-based (e.g., Shoah Foundation)
– Manually assign descriptors to points in an interview
Catalog-based (e.g., British Library)
– Catalog record created from interviewer's notes
Speech-based (MALACH)
– Create access points with speech processing

9 Supporting Intellectual Access
[Diagram labels: Source Selection, Query Formulation, Query, Search, Search System, Ranked List, Selection, Recording, Examination, Recording, Delivery; feedback loops: Query Reformulation and Relevance Feedback, Source Reselection]
Fields shown: Speech Processing, Computational Linguistics, Information Retrieval, Information Seeking, Human-Computer Interaction, Digital Libraries

10 Some Technical Challenges
Fast ASR systems are way too slow
– 6 orders of magnitude slower than tokenization
Situational sublanguage induces variability
– Impedes interactive vocabulary acquisition
Knee in the WER/MAP curve comes early
– 30-40% for broadcast news
– Somewhere below 30% for conversations
Skimmable summaries from imperfect ASR
– Particularly important for linear media
Classic IR measures focus on documents
– Conversational boundaries are ambiguous
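The percentages on this slide are word error rates (WER). As a reminder of what the measure counts, here is a minimal sketch of the usual definition: word-level edit distance (substitutions, insertions, deletions) divided by the reference length. The function name and the example strings are illustrative; real scoring tools also normalize text before alignment, which this sketch does not.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: one substituted word out of five reference words.
wer = word_error_rate("from the radio and press", "from the radio and dress")
print(f"WER = {wer:.0%}")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.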

11 Start Time Error Cost

12 Shoah Foundation Collection
Substantial scale
– 116,000 hours; 52,000 interviews; 32 languages
Spontaneous conversational speech
– Accents, elderly, emotional, …
Accessible
– $100 million collection and digitization investment
Manually indexed (10,000 hours)
– Segmented, thesaurus terms, people, summaries
Users
– A department working full time on dissemination

13 Interview Excerpt
Audio characteristics
– Accented (this one is unusually clear)
– Separate channels for interviewer / interviewee
Dialog structure
– Interviewers have different styles
Content characteristics
– Domain-specific terms
– Named entity mentions and relationships

14 MALACH Languages
Testimonies (average 2.25 hours each), as of January 31, 2004
            English   Czech   Russian   Slovak   Polish
Collected   24,874    573     7,080     573      1,400
Cataloged   22,820    531     7,016     464      989
Indexed     22,820    2270100
Digitized   13,735    374     3,052     427      835
Completed   11,464    2228700

15 Observational Studies
8 independent searchers
– Holocaust studies (2)
– German Studies
– History/Political Science
– Ethnography
– Sociology
– Documentary producer
– High school teacher
8 teamed searchers
– All high school teachers
Thesaurus-based search
Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture
Qualitative analysis
– Theory-guided coding
– Abductive reasoning

16 Relevance Criteria
Number of mentions
Criterion                All (N=703)  Think-Aloud Relevance Judgment (N=300)  Query Form. (N=248)
Topicality               535 (76%)    219                                     234
Richness                 39 (5.5%)    14                                      0
Emotion                  24 (3.4%)    7                                       0
Audio/Visual Expression  16 (2.3%)    5                                       0
Comprehensibility        14 (2%)      1                                       10
Duration                 11 (1.6%)    9                                       0
Novelty                  10 (1.4%)    4                                       2
6 Scholars, 1 teacher, 1 film producer, working individually

17 Topicality Total mentions 6 Scholars, 1 teacher, 1 movie producer, working individually

18 Automatic Search Boundary Detection Interactive Selection Content Tagging Speech Recognition Query Formulation Test Collection Design

19 [Diagram labels: Interviews, Speech Recognition, Boundary Detection, Content Tagging, Query Formulation, Topic Statements, Automatic Search, Ranked Lists, Evaluation, Relevance Judgments, Mean Average Precision]
Manual: Topic boundaries
Automatic: Topic boundaries
Manual: ~5 Thesaurus labels, 3-sentence summaries
Automatic: Thesaurus labels
Automatic: 35% interview-tuned, 40% domain-tuned
Training: 38 existing
Evaluation: 25 new
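The evaluation measure named on this slide, mean average precision (MAP), can be sketched as follows. This is the standard uninterpolated definition over binary relevance judgments, not code from the evaluation itself; the function names, topic IDs, and segment IDs are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at the rank of each relevant item retrieved,
    divided by the total number of relevant items for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, segment_id in enumerate(ranked, start=1):
        if segment_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average the per-topic scores over topics that have relevant items."""
    topics = [t for t, rel in qrels.items() if rel]
    if not topics:
        return 0.0
    return sum(average_precision(runs.get(t, []), qrels[t]) for t in topics) / len(topics)


# Hypothetical ranked lists and binary judgments for two topics.
runs = {"topic_a": ["s1", "s2", "s3", "s4"], "topic_b": ["s5", "s6"]}
qrels = {"topic_a": {"s1", "s3"}, "topic_b": {"s6"}}
map_score = mean_average_precision(runs, qrels)
print(f"MAP = {map_score:.4f}")
```

Relevant segments that a run never retrieves still count in the denominator, so missing them lowers the score.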

20 CLEF-2005 CL-SR Track
Test collection distributed by ELDA
– ~7,800 segments from ~300 English interviews
  Hand segmented / known boundaries
– 63 topics (title/description/narrative)
  38 for training, 25 for blind evaluation
  5 languages (EN, SP, CZ, DE, FR)
– Relevance judgments
  Search-guided + post-hoc judgment pools
5 participating teams
– DCU, Maryland, Pitt, Toronto/Waterloo, UNED
One required cross-site baseline run
– ASR segments / English TD topics

21 Additional Resources
Thesaurus
– ~3,000 core concepts
  Plus alternate vocabulary + standard combinations
– ~30,000 location-time pairs, with lat/long
– Both is-a and part-whole relationships
In-domain expansion collection
– 186,000 3-sentence summaries
Indexers' scratchpad notes
Digitized speech
– .mp2 or .mp3

22 English ASR Training: 200 hours from 800 speakers ASR2003A ASR2004A

23 VHF 00017 -062567.005
Warsaw (Poland), Poland 1935 (May 13) - 1939 (August 31), awareness of political or military events, schools
Sophie P[…], Henry H[…]
AH talks about the college she attended before the war. She mentions meeting her husband. She discusses young peoples' awareness of the political events that preceded the outbreak of war.
graduated HS, went to college 1 year, professional college hotel management; met future husband, knew that they'd end up together; sister also in college, nice social life, lots of company, not too serious; already got news from Czechoslovakia, Sudeten, knew that Poland would be next but what could they do about it, very passive; just heard info from radio and press
no no no they did no not not uh i know there was no place to go we didn't have family in a in other countries so we were not financially at the at extremely went so that was never at plano of my family it is so and so that was the atmosphere in the in the country prior to the to the war i graduate take the high school i had one year of college which was a profession and that because that was already did the practical trends f so that was a study for whatever management that eh eh education and this i i had only one that here all that at that time i met my future husband and that to me about any we knew it that way we were in and out together so and i was quite county there was so whatever i did that and this so that was the person that lived my sister was it here is first year of of colleagues and and also she had a very strongly this antisemitic trend and our parents there was a nice social life young students that we had open house always pleasant we had a lot of that company here and and we were not too serious about that she we got there we were getting the they already did knew he knew so from czechoslovakia from they saw that from other part and we knew the in that that he is uhhuh the hitler spicy we go into this year this direction that eh poland will be the next country but there was nothing that we would do it at that time so he was a very very he says belong to any any organizations especially that the so we just take information from the radio and from the dress

24 Segment duration (s)
Min.      1st Qu.  Median  Mean    3rd Qu.  Max.       NA's
-2044.00  54.01    224.90  391.70  326.00   287400.00  75031.00
44.5% ??

25 Keywords vs. Segment duration

26 Nodes descending from parents of leaves

27 Years spoken in ASR

28 Spoken dates in release ASR
Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
0.0000  0.0000   0.0000  0.6575  1.0000   13.0000

29 Current classifier performance: MAP: .2374, even post-mixing of scratchpad/summary from 20NN, remixed with time-label densities estimated w/ Gaussian kernel at 5x def. bandwidth
46,601 (1,175); 3,610 (169); 1,437 (168); 613 (47)

30 An Example English Topic
Number: 1148
Title: Jewish resistance in Europe
Description: Provide testimonies or describe actions of Jewish resistance in Europe before and during the war.
Narrative: The relevant material should describe actions of only- or mostly Jewish resistance in Europe. Both individual and group-based actions are relevant. Type of actions may include survival (fleeing, hiding, saving children), testifying (alerting the outside world, writing, hiding testimonies), fighting (partisans, uprising, political security). Information about undifferentiated resistance groups is not relevant.

31 5-level Relevance Judgments
Classic relevance (to food in Auschwitz)
– Direct: Knew food was sometimes withheld
– Indirect: Saw undernourished people
Additional relevance types
– Context: Intensity of manual labor
– Comparison: Food situation in a different camp
– Pointer: Mention of a study on the subject
Binary qrels

32 Title queries, adjudicated judgments +Persons Comparing Index Terms

33 Searching Manual Transcripts
Title queries, adjudicated judgments
[Chart labels: jewish kapo(s); fort ontario refugee camp]

34 Category Expansion Spoken Words (hand transcribed) Thesaurus Terms 3,199 Training segments Spoken Words (ASR transcript) Thesaurus Terms test segments kNN Categorization Index F=0.19 (microaveraged) Title queries, linear score combination, adjudicated judgments ASR Words Thesaurus Terms
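The F=0.19 figure on this slide is microaveraged. For reference, here is a minimal sketch of microaveraged F1 for multi-label assignment: true positives, false positives, and false negatives are pooled over all segments before precision and recall are computed. The function name and the example labels are illustrative, and the slide does not say which F variant or label set was used.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Pool label-level counts over all segments, then compute F1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same segments")
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        tp += len(gold_labels & predicted_labels)
        fp += len(predicted_labels - gold_labels)
        fn += len(gold_labels - predicted_labels)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical thesaurus labels for two segments.
gold = [{"schools", "family life"}, {"relocation"}]
predicted = [{"schools"}, {"relocation", "employment"}]
score = micro_f1(gold, predicted)
print(f"micro-F1 = {score:.3f}")
```

Pooling the counts means frequent labels dominate the score, unlike macroaveraging, which weights every label equally.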

35 ASR-Based Search Mean Average Precision Title queries, adjudicated judgments +27% Average of 3.4 relevant segments in top 20

36 Rethinking the Problem
Segment-then-label models planned speech well
– Producers assemble stories to create programs
– Stories typically have a dominant theme
The structure of natural speech is different
– Creation: digressions, asides, clarification, …
– Use: intended use may affect desired granularity
  Documentary film: brief snippet to illustrate a point
  Classroom teacher: longer self-contextualizing story

37 Activation Matrix Time

38 Training Data: 196,000 Segments
(segments in order of interview time)
Location-Time   Subject                          Person
Berlin-1939     Employment                       Josef Stein
Berlin-1939     Family life                      Gretchen Stein, Anna Stein
Dresden-1939    Schooling                        Gunter Wendt, Maria
Dresden-1939    Relocation, Transportation-rail
+ Segment summaries
+ Indexers' notes

39 Preprocessing Training Data
Normalize labeled categories?
– Food in hiding -> food AND hiding
Develop class models
– Existing hierarchy, types of personal relationships
Determine the extent for each label and class
– Merge the extent of repeated labels
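One way to read "merge the extent of repeated labels" is interval merging: if the same label is attached to adjacent or overlapping segments, its extents are fused into one longer span. The sketch below assumes segments come as (start, end, labels) tuples in seconds. That representation, the function name, and the max_gap parameter are assumptions for illustration, not the project's actual preprocessing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Extent = Tuple[float, float]
Segment = Tuple[float, float, Iterable[str]]


def merge_label_extents(segments: Iterable[Segment],
                        max_gap: float = 0.0) -> Dict[str, List[Extent]]:
    """Collect each label's segment extents and merge those that touch,
    overlap, or lie within max_gap seconds of each other."""
    by_label: Dict[str, List[Extent]] = defaultdict(list)
    for start, end, labels in segments:
        for label in labels:
            by_label[label].append((start, end))

    merged: Dict[str, List[Extent]] = {}
    for label, extents in by_label.items():
        extents.sort()
        result = [extents[0]]
        for start, end in extents[1:]:
            last_start, last_end = result[-1]
            if start <= last_end + max_gap:
                # Same label continues: extend the current extent.
                result[-1] = (last_start, max(last_end, end))
            else:
                result.append((start, end))
        merged[label] = result
    return merged


# Hypothetical segment times; labels echo the training-data slide.
segments = [
    (0.0, 120.0, ["Berlin-1939", "Employment"]),
    (120.0, 300.0, ["Berlin-1939", "Family life"]),
    (300.0, 450.0, ["Dresden-1939", "Schooling"]),
    (450.0, 600.0, ["Dresden-1939", "Relocation"]),
]
extents = merge_label_extents(segments)
print(extents["Berlin-1939"])
```

With these inputs the two Berlin-1939 segments collapse into a single extent, while each subject label keeps its own shorter span.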

40 Characteristics of the Problem
Clear dependencies
– Correlated assignment of applications
– Living in Dresden negates living in Berlin
Heuristic basis for class models
– Persons, based on type of relationship
– Date/Time, based on part-whole relationship
– Topics, based on a defined hierarchy
Heuristic basis for guessing without training
– Text similarity between labels and spoken words
Heuristic basis for smoothing
– Sub-sentence retrieval granularity is unlikely

41 Modeling Location
[Diagram labels: Berlin, Dresden, Germany]
Presence in a new location negates presence in the prior location
Location granularity varies (inclusion relationships are known)
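The two rules on this slide can be illustrated with a small state-tracking sketch: each newly mentioned location replaces the previous one, and known inclusion relationships keep the enclosing region active. The PART_OF map, the function names, and the mention times are hypothetical. The sketch assumes the part-whole map is acyclic, and it is only one simple reading of the slide, not the project's model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical part-whole map: location -> enclosing location.
PART_OF: Dict[str, str] = {"Berlin": "Germany", "Dresden": "Germany"}


def enclosing_locations(location: str, part_of: Dict[str, str]) -> Set[str]:
    """Walk up the (assumed acyclic) part-whole chain from a location."""
    found: Set[str] = set()
    while location in part_of:
        location = part_of[location]
        found.add(location)
    return found


def active_locations(mentions: List[Tuple[float, str]],
                     part_of: Dict[str, str]) -> List[Tuple[float, Set[str]]]:
    """For each mention time, return the locations considered active.

    A new location replaces the prior one; its enclosing regions stay
    active by inclusion."""
    timeline = []
    for time, location in sorted(mentions):
        timeline.append((time, {location} | enclosing_locations(location, part_of)))
    return timeline


# Hypothetical mention times in seconds.
timeline = active_locations([(0.0, "Berlin"), (300.0, "Dresden")], PART_OF)
for time, active in timeline:
    print(time, sorted(active))
```

In this example Germany stays active throughout, while Berlin is dropped once Dresden is mentioned.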

42 A Class Model for People
[Diagram labels: father, mother, sister, friend, nobody]
Several people may be discussed simultaneously
Small inventory of relationship types
Relationship type is known for most people that are mentioned

43 Search
Compute a score at each time based on:
– How likely is each descriptor? (~TF)
– How selective is each descriptor? (~IDF)
– What related descriptors are active? (~expansion)
Determine passage start time based on:
– Score trajectory (sequence of scores)
– Additional heuristics (e.g., pause, speaker turn)
Rank passages based on score trajectory
– e.g., by peak score within the passage
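The three steps on this slide can be sketched under simplifying assumptions. Here the activation matrix is a mapping from descriptor to per-time-step likelihoods, selectivity weights are supplied by the caller, a passage is a contiguous run of time steps scoring at or above a threshold, and passages are ranked by peak score. Query expansion and the pause or speaker-turn heuristics are left out. All names, the threshold, and the numbers are illustrative, not the system's actual design.

```python
from typing import Dict, List, Sequence, Tuple

Passage = Tuple[int, int, float]  # (start step, end step, peak score)


def score_trajectory(activation: Dict[str, Sequence[float]],
                     query: Sequence[str],
                     selectivity: Dict[str, float]) -> List[float]:
    """Score each time step as the sum over query descriptors of
    likelihood (~TF) times selectivity (~IDF)."""
    if not activation:
        return []
    n_steps = len(next(iter(activation.values())))
    return [
        sum(activation[d][t] * selectivity.get(d, 1.0)
            for d in query if d in activation)
        for t in range(n_steps)
    ]


def rank_passages(scores: Sequence[float], threshold: float) -> List[Passage]:
    """Find contiguous runs scoring at or above threshold; rank by peak."""
    passages: List[Passage] = []
    start = None
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t  # passage start time: first step above threshold
        elif score < threshold and start is not None:
            passages.append((start, t - 1, max(scores[start:t])))
            start = None
    if start is not None:
        passages.append((start, len(scores) - 1, max(scores[start:])))
    return sorted(passages, key=lambda p: p[2], reverse=True)


# Hypothetical activations over six time steps.
activation = {
    "Berlin-1939": [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    "Schooling":   [0.0, 0.1, 0.2, 0.9, 0.7, 0.1],
}
selectivity = {"Berlin-1939": 1.0, "Schooling": 2.0}
scores = score_trajectory(activation, ["Berlin-1939", "Schooling"], selectivity)
ranked = rank_passages(scores, threshold=0.8)
print(ranked)
```

Returning a start step rather than a fixed segment fits the deck's later question of whether it is good enough to know where to start.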

44 Timelines for the whole interview text

45 Some Open Issues
Is the expressive power of a lattice needed?
– An activation matrix is an unrolled lattice
What states do we need to represent?
– Balance fidelity, accuracy, and complexity
How to integrate manual onset marks?
How much training data do we need?
– Annotating new data costs ~$100/hour
How will people use the system we build?

46 Non-English ASR Systems
[Chart: WER [%] (axis 30 to 70) over time (10/01 to 10/06) for Czech, Russian, Slovak, Polish, and Hungarian systems. Annotations: 45h + LM Tr; 84h + LM Tr; + LM Tr+TC; + standard.; + adapt.; 20h + LM Tr; 100h + LM Tr; + stand.+LM Tr+TC. WER values shown: 57.92%, 45.91%, 41.15%, 38.57%, 35.51%, 66.07%, 50.82%, 34.49%, 45.75%, 40.69%]

47 Planning for the Future
Tentative CLEF-2006 CL-SR Plans:
– Adding a Czech collection
– Larger English collection (~900 hours)
  Adding word lattice as standard data
– No-boundary evaluation design
– ASR training data (by special arrangement)
  Transcripts, pronunciation lexicon, language model
Possible CLEF-2007 CL-SR Options:
– Add a Russian or Slovak collection?
– Much larger English collection (~5,000 hours)?

48 The CLEF CL-SR Team
USA
Shoah Foundation
– Sam Gustman
IBM TJ Watson
– Bhuvana Ramabhadran
– Martin Franz
U. Maryland
– Doug Oard
– Dagobert Soergel
Johns Hopkins
– Zak Schefrin
Europe
U. Cambridge (UK)
– Bill Byrne
Charles University (CZ)
– Jan Hajic
– Pavel Pecina
U. West Bohemia (CZ)
– Josef Psutka
– Pavel Ircing
UNED (ES)
– Fernando López-Ostenero

49 More Things to Think About
Privacy protection
– Working with real data has real consequences
Are fixed segments the right retrieval unit?
– Or is it good enough to know where to start?
What will it cost to tailor an ASR system?
– $100K to $1 million per application?
Do we need to change what we collect?
– Speaker enrollment, metadata standards, …

50 Final Thoughts
The moving hand, having writ, moves on
– Ephemeral webcasting
– Forgone acquisition opportunities

51 For More Information The MALACH project – CLEF-2005 evaluation – NSF/DELOS Spoken Word Access Group –

