Data Mining, Information Extraction and Search in Spoken Documents

Presentation on theme: "Data Mining, Information Extraction and Search in Spoken Documents"— Presentation transcript:

1 Data Mining, Information Extraction and Search in Spoken Documents
Julia Hirschberg CS 4706 12/2/2018

2 Today
Data mining from text
Searching audio data instead of text
Information extraction from spoken documents
Speech data mining

3 Data Mining
Discovery of trends and patterns across very large datasets, usually for decision-making purposes
  Fraud detection in banking, telephony
  Stock market
  Indications of demographic disasters
  New causes of diseases
…finding things you don’t know you’re looking for
Information retrieval vs. ‘mining for nuggets’

4 Data Mining in Computational Linguistics
Finding lexical co-occurrence information
Finding parallel text corpora on the web for MT
Finding ‘new’ topics in news stories
  TDT (Topic Detection and Tracking) task
Exploring citation links: Networks of influence
Information extraction, e.g. find mutual acquaintances

5 Snowball (Agichtein et al ’01):
Seed set of patterns, e.g.
  Norman Mailer, 59 → <firstname> <lastname>, <age>
  the 59-year-old Mailer → the <age>-year-old <lastname>
Find more patterns by looking for e.g. Mailer close to 59
  Mailer turned 59 last week.
  Though Mailer is 59…
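The bootstrapping idea on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Snowball system itself: the two regular expressions are one possible encoding of the seed patterns, and `extract_name_age` and `contexts_between` are hypothetical helper names.

```python
import re

# Seed patterns from the slide, encoded as regular expressions.
# (Assumption: the real Snowball system represents patterns differently.)
SEED_PATTERNS = [
    # <firstname> <lastname>, <age>
    re.compile(r"\b[A-Z][a-z]+ (?P<last>[A-Z][a-z]+), (?P<age>\d{1,3})\b"),
    # the <age>-year-old <lastname>
    re.compile(r"\bthe (?P<age>\d{1,3})-year-old (?P<last>[A-Z][a-z]+)\b"),
]


def extract_name_age(text):
    """Return the set of (lastname, age) pairs matched by the seed patterns."""
    pairs = set()
    for pattern in SEED_PATTERNS:
        for m in pattern.finditer(text):
            pairs.add((m.group("last"), m.group("age")))
    return pairs


def contexts_between(text, last, age, max_gap=30):
    """Return the words found between a known name and a known age.

    Each returned context (e.g. "turned", "is") is a candidate new pattern.
    """
    gap = re.compile(
        re.escape(last) + r"(.{1,%d}?)" % max_gap + re.escape(age) + r"\b"
    )
    return [m.group(1).strip() for m in gap.finditer(text)]
```

Running `extract_name_age` on seed text yields pairs such as `("Mailer", "59")`; feeding those pairs to `contexts_between` on new text surfaces contexts like "turned" that can be generalized into further patterns.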

6 But Searching Audio Data is Harder
Large amounts of audio data available: on the web, in company archives, in our homes
We have tools supporting random access to text – but for audio we’re limited to serial search
How can we develop methods to search audio as easily as text?

7 Applications
Searching online TV and radio news and archives
  Library of Congress
Searching a/v archives, movies
Searching trial recordings and legislative sessions
Searching meetings, customer care exchanges, focus groups
Telephone calls and voicemail

8 Current Approach
Train/adapt a speech recognizer for the corpus
Produce an ASR transcript
Segment spoken ‘documents’ into sentences, turns, topics
Index (errorful) transcripts for Information Retrieval and link to audio via timestamps
Enables audio search by content
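The indexing step can be illustrated with a toy inverted index that keeps a timestamp for every word, so a text hit can be mapped back to a position in the audio. This is a sketch under assumptions: the input format (word, start time in seconds) and the function names `build_index` and `search` are hypothetical, not taken from any system named in these slides.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each word to the (doc_id, start_time) pairs where it occurs.

    `transcripts` is a dict: doc_id -> list of (word, start_time_seconds),
    i.e. a time-aligned word sequence of the kind an ASR system can produce.
    """
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index


def search(index, query):
    """Return hits for a single-word query, sorted by document and time."""
    return sorted(index.get(query.lower(), []))
```

Each hit gives a document and an offset, which is enough to start audio playback near the matched word instead of listening serially.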

9 Some Examples
SpeechBot searching internet broadcasts
Google Voice Search: search audio by voice (not yet)
SCANMail searching voicemail

10 Information Extraction and QA from Speech
DARPA GALE project: improve information gathering from text, speech, translations
Current Domain: newswire and news broadcasts in English, Arabic, and Mandarin
3 competing teams
ASR/MT bakeoffs
‘Distillation’ evaluations
  QA
  User studies
Requires identification and annotation of information and ‘formatting’ in speech

11 Sample Distillation Questions
List facts about <event>
Find people who are mutual acquaintances of <person1> and <person2>
Identify persons arrested from <organization> and give their name and role in that organization
Produce a biography of <person>
Provide information on <organization>
Find statements made by or attributed to <person> about <topic>
How did <country> react to <event>

12 Nightingale Architecture
[Architecture diagram. Labeled components: Automatic Annotation, Distillation, Speaker modeling, Information assimilation, ASR, MT, Audio diarization, Prosodic metadata, Target Language, Source Language, Punctuation, Capitalization, Info repository, Linguistic structure, Prosodic analysis, Names, Relations, Intelligence delivery, Topic modeling]

13 Information Annotation
Spoken documents …
  Lack many cues found in text documents
    Format (sentences, turns, paragraphs)
  Include spontaneous speech phenomena which are difficult for ASR and NLP technologies to handle
    Disfluencies, fragments
  Contain errors
Annotation can turn a weakness into a strength

14 From an ASR Transcript
aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

15 To Speaker Segmentation (Diarization)
Speaker: 0 - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston
Speaker: 1 - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

16 Add Speaker Role Labels
Anchor - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston
Reporter - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

17 Perform Sentence Detection and Punctuation
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

18 Detect Story Boundaries
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

19 Detect Disfluencies (and Keep/Remove)
Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.
Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.
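A rule-based sketch of the detect-then-keep-or-remove step, using two disfluency types visible in the transcript above: the filled pause "uh" and the immediate repetition "was was". This is an illustration only; the slides do not say which detector was used, and the filler list and the function names `detect_disfluencies` and `remove_disfluencies` are assumptions.

```python
FILLED_PAUSES = {"uh", "um", "er"}  # illustrative list, not from the slides


def detect_disfluencies(tokens):
    """Label each token as 'filler', 'repeat', or 'ok'."""
    labels = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word in FILLED_PAUSES:
            labels.append("filler")
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == word:
            # First copy of an immediate word repetition.
            labels.append("repeat")
        else:
            labels.append("ok")
    return labels


def remove_disfluencies(tokens, drop=("filler", "repeat")):
    """Drop tokens whose label is in `drop`; pass fewer labels to keep more."""
    labels = detect_disfluencies(tokens)
    return [t for t, lab in zip(tokens, labels) if lab not in drop]
```

Separating detection from removal mirrors the "Keep/Remove" choice in the slide title: calling `remove_disfluencies(tokens, drop=("filler",))` removes "uh" but leaves repetitions in place.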

20 Detect Named Entities
Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.
Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush. Claire Shipman is covering the vice president Claire you begin tonight please.

21 Resolve References
Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.
Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush [Governor George W. Bush]. Claire Shipman is covering the vice president Claire [Claire Shipman] you begin tonight please.

22 Speech Data Mining
How does it differ from text data mining?
  Must handle errorful transcription
  Lacks (reliable) formatting
  Contains spontaneous speech phenomena
We need to bring additional sources to bear on the problem

23 Maskey et al 2004: Improving Proper Name Transcription in Voicemail
How can we improve transcription of proper names without increasing the size of the ASR lexicon?
Use meta-data available at runtime to hypothesize caller’s and callee’s names
  Caller ID string – “cname”
  Name of mailbox owner – “mname”

24 Corpus
Scanmail corpus
  100 hours of voicemail messages from 140 employees of AT&T
  Manually transcribed with “cname” and “mname” tags
  Gender balanced
  ~12% non-native speakers
  238 random messages for testing, rest (~10,000 messages) for training
The corpus, called Scanmail, consists of 100 hours of voicemail messages collected from the voicemail boxes of 140 employees at AT&T. The corpus is approximately gender balanced, and has 12% of messages by non-native speakers. The corpus was manually transcribed, with caller ID and mailbox owner parts bracketed. 238 messages were selected randomly for testing, and the rest for training. In the test set, 317 word tokens were caller names, and 219 were mailbox owner names.

25 Approach
Create a class-based language model
Create a name network to give instances for the classes of the model
At runtime, replace the classes in the language model with the appropriate name networks, identified from the cname and mname of the call
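The runtime substitution can be sketched with a toy class-based bigram model in which a name's probability is the probability of entering the class times the probability of the name within that class. This is a sketch under assumptions: plain dictionaries stand in for the language model and the name networks, and `word_prob` and its parameters are hypothetical names, not the implementation from Maskey et al.

```python
def word_prob(word, prev, bigram, class_members):
    """P(word | prev) under a toy class-based bigram model.

    bigram:        dict mapping (prev_token, token) -> probability, where a
                   token is an ordinary word or a class label such as
                   "<mname>" or "<cname>".
    class_members: dict mapping class label -> {name: P(name | class)}.
                   This is the part swapped in per call at runtime.
    """
    # Ordinary word: read the probability straight from the bigram table.
    p = bigram.get((prev, word), 0.0)
    # Name: P(class | prev) * P(name | class), summed over classes.
    for label, members in class_members.items():
        p += bigram.get((prev, label), 0.0) * members.get(word, 0.0)
    return p
```

Because the names live outside the bigram table, the same trained model can score "alex" for one mailbox and "julia" for another simply by passing a different `class_members` dictionary, without growing the base lexicon.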

26 Name Network
To get values for “mname” and “cname”, an internal AT&T employee directory listing (~40,000 people) was used
“cname” created from variations of static titles (Miss, Mr), full first names and nicknames (Alexander, Alex), and last names (Jones)
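A small sketch of how such variations might be enumerated for one directory entry. The exact set of combinations used in the original work is not given on the slide, so the combinations below (first name alone, first name plus last name, title plus last name) and the function name `name_variants` are assumptions for illustration.

```python
def name_variants(titles, first_names, last_name):
    """Enumerate plausible spoken forms of one person's name.

    titles:      e.g. ["Miss", "Mr"]
    first_names: full first name plus nicknames, e.g. ["Alexander", "Alex"]
    last_name:   e.g. "Jones"
    """
    variants = set()
    for first in first_names:
        variants.add(first)                   # "Alex"
        variants.add(f"{first} {last_name}")  # "Alex Jones"
    for title in titles:
        variants.add(f"{title} {last_name}")  # "Mr Jones"
    return sorted(variants)
```

For the slide's example, `name_variants(["Mr"], ["Alexander", "Alex"], "Jones")` yields five forms, each of which would become one path in the name network for that caller.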

27 Name Network Probability within class – training corpus
Probability within first names – AT&T directory listing

28 Experimental Results
Word Error Rate (WER) improvement small
  Absolute reduction of 0.6%
Named Error Rate (NER) improvement significant
  Absolute reduction of 20%
Large reduction in NER important:
  Getting a name right is important to business users
  Scanmail users expressed a strong desire for the system to recognize their own names correctly
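For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below implements that standard definition; the function name `word_error_rate` is an assumption. The slide does not define how the name error rate is computed, so only WER is shown.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

A single misrecognized name in a four-word message gives a WER of 0.25, which illustrates why a large gain on names can show up as only a small change in overall WER when names are a small share of all words.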

29 Next Class
HTK Toolkit and HW5 (Fadi Biadsy)

