Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of


2 Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)

5 Degree of difficulty
Some queries are harder than others:
– More bits? (Long tail: long/infrequent)
– Or fewer bits? (Big fat head: short/frequent)
Query refinement / coping mechanisms:
– If the query doesn't work, should you make it longer or shorter?
Solitaire → Multi-Player Game:
– Readers, writers, market makers, etc.
– Good guys & bad guys: advertisers & spammers

6 Spoken Web Search is Easy Compared to Dictation

7 Entropy of Search Logs: How Big is the Web? How Hard is Search? With Personalization? With Backoff?
Qiaozhu Mei (University of Illinois at Urbana-Champaign) and Kenneth Church (Microsoft Research)

8 How Big is the Web? 5B? 20B? More? Less?
What if a small cache of millions of pages could capture much of the value of billions?
Could a big bet on a cluster in the clouds turn into a big liability?
Examples of big bets:
– Computer centers & clusters: capital (hardware), expense (power), dev (MapReduce, GFS, Big Table, etc.)
– Sales & Marketing >> Production & Distribution: goal is to maximize sales (not inventory)

9 Millions (Not Billions)

10 Population Bound
With all the talk about the Long Tail, you'd think that the Web was astronomical (Carl Sagan: billions and billions…).
Lower distribution costs → sell less of more. But there are limits to this process:
– Netflix: 55k movies (not even millions)
– Amazon: 8M products
– Vanity searches: infinite??? Personal home pages << phone book < population; business home pages << Yellow Pages < population
Millions, not billions (until the market saturates).

11 It Will Take Decades to Reach the Population Bound
Most people (and products) don't have a web page (yet).
Currently, I can find famous people (and academics) but not my neighbors.
– There aren't that many famous people (and academics)…
– Millions, not billions (for the foreseeable future)

12 Equilibrium: Supply = Demand
If there is a page on the web, and no one sees it, did it make a sound?
How big is the web? Should we count "silent" pages that don't make a sound?
How many products are there? Do we count "silent" flops that no one buys?

13 Demand-Side Accounting
Consumers have limited time:
– Telephone usage: 1 hour per line per day
– TV: 4 hours per day
– Web: ??? hours per day
Suppliers will post as many pages as consumers can consume (and no more).
Size of Web: O(Consumers)

14 How Big is the Web?
Related questions come up in language: how big is English?
– Dictionary marketing
– Education (testing of vocabulary size)
– Psychology, statistics, linguistics
Two very different answers:
– Chomsky: language is infinite
– Shannon: 1.25 bits per character
How many words do people know? (What is a word? A person? To know?)

15 Chomskian Argument: The Web is Infinite
One could write a malicious spider trap:
– http://successor.aspx?x=0 → http://successor.aspx?x=1 → http://successor.aspx?x=2 → …
Not just an academic exercise; the Web is full of benign examples like http://calendar.duke.edu/:
– Infinitely many months
– Each month has a link to the next
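A minimal sketch of such a trap, assuming a hypothetical page generator (the endpoint name and parameter mirror the slide's example URL): every page's only link points to its successor, so a naive crawler that follows links never runs out of URLs.

```python
def successor_page(x: int) -> str:
    """Return an HTML page whose only link points to page x + 1 (the spider trap)."""
    return (
        f"<html><body>"
        f"<p>This is page {x}.</p>"
        f'<a href="/successor.aspx?x={x + 1}">next page</a>'
        f"</body></html>"
    )

if __name__ == "__main__":
    # A crawler following links sees an unbounded chain: 0 -> 1 -> 2 -> ...
    for x in range(3):
        print(successor_page(x))
```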

16 How Big is the Web? 5B? 20B? More? Less?
More (Chomsky): http://successor?x=0 → …
Less (Shannon): the more practical answer
Entropy (H) of an MSN Search Log (1 month → ×18):
– Query: 21.1 → 22.9
– URL: 22.1 → 22.4
– IP: 22.1 → 22.6
– All But IP: 23.9
– All But URL: 26.0
– All But Query: 27.1
– All Three: 27.2
Millions (not billions). Cluster in Cloud → Desktop → Flash; Comp Ctr ($$$$) → Walk in the Park ($)

17 Entropy (H)
– Size of the search space; difficulty of a task
– H = 20 → 1 million items distributed uniformly
A powerful tool for sizing challenges and opportunities:
– How hard is search?
– How much does personalization help?
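The arithmetic behind "H = 20 → 1 million items" is just the entropy of a uniform distribution over N items (a restatement of the slide, not a new result):

```latex
H = -\sum_{i=1}^{N} \frac{1}{N}\log_2\frac{1}{N} = \log_2 N,
\qquad
H = 20 \;\Rightarrow\; N = 2^{20} \approx 10^6 .
```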

18 How Hard Is Search? Millions, not Billions
Traditional search: H(URL | Query) = 2.8 bits (= 23.9 − 21.1)
Personalized search (IP): H(URL | Query, IP) = 1.2 bits (= 27.2 − 26.0)
Entropy (H): Query 21.1; URL 22.1; IP 22.1; All But IP 23.9; All But URL 26.0; All But Query 27.1; All Three 27.2
Personalization cuts H in half!
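As a worked restatement, the numbers above follow from the chain rule for entropy applied to the joint entropies in the table:

```latex
\begin{aligned}
H(\mathrm{URL}\mid\mathrm{Query}) &= H(\mathrm{URL},\mathrm{Query}) - H(\mathrm{Query}) = 23.9 - 21.1 = 2.8 \text{ bits}\\
H(\mathrm{URL}\mid\mathrm{Query},\mathrm{IP}) &= H(\mathrm{URL},\mathrm{Query},\mathrm{IP}) - H(\mathrm{Query},\mathrm{IP}) = 27.2 - 26.0 = 1.2 \text{ bits}
\end{aligned}
```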

19 How Hard are Query Suggestions? The Wild Thing? C* Rice → Condoleezza Rice
Traditional suggestions: H(Query) = 21 bits
Personalized (IP): H(Query | IP) = 5 bits (= 26 − 21)
Entropy (H): Query 21.1; URL 22.1; IP 22.1; All But IP 23.9; All But URL 26.0; All But Query 27.1; All Three 27.2
Personalization cuts H in half! (Twice.)

20 Conclusions: Millions (not Billions)
How big is the Web?
– Upper bound: O(Population); not billions, not infinite
Shannon >> Chomsky:
– How hard is search? Query suggestions? Personalization?
Cluster in Cloud ($$$$) → Walk-in-the-Park ($)
Goal: maximize sales (not inventory).
Entropy is a great hammer.

21 Search companies have massive resources (logs)
"Unfair" advantage: logs. Logs are the crown jewels.
Search companies care so much about data collection that…
– Toolbar: the business case is all about data collection
The $64B question: what (if anything) can we do without resources?

25 Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)

26 Definitions
Towards:
– Not there yet
Zero Resources:
– No nothing (no knowledge of the language/domain)
– The next crisis will be where we are least prepared
– No training data, no dictionaries, no models, no linguistics
Motivate a new task: Spoken Term Discovery
– Spoken Term Detection (word spotting) is the standard task: find instances of a spoken phrase in a spoken document. Input: spoken phrase + spoken document.
– Spoken Term Discovery is the non-standard task: input: spoken document (without a spoken phrase); output: spoken phrases (interesting intervals in the document).

27 The next crisis will be where we are least prepared

29 RECIPE FOR REVOLUTION
– 2 cups of a long-standing leader
– 1 cup of an aggravated population
– 1 cup of police brutality
– ½ teaspoon of video
Bake in YouTube for 2-3 months; flambé with Facebook and sprinkle on some Twitter.

33 Longer is Easier (figure: /iy/ vs. "encyclopedias", male vs. female)

34 What makes an interval of speech interesting?
Cues from text processing:
– Long (~1 sec, such as "University of Pennsylvania")
– Repeated
– Bursty (tf * IDF): tf = lots of repetitions within a document; IDF = relatively few repetitions across documents (see the sketch below)
Cues unique to speech processing:
– Given-New: the first mention is articulated more carefully than subsequent mentions
– Dialog between two parties (A & B) is a multi-player game: A utters an important phrase; B: "what?"; A repeats the important phrase
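A minimal sketch of the tf * IDF cue, assuming repeated intervals have already been grouped into pseudo-terms and counted per document; the term names and counts below are invented for illustration.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of Counter, each mapping pseudo-term -> count in one spoken document.
    Returns per-document tf*idf scores; bursty terms (frequent in few documents) score high."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each pseudo-term
    for doc in docs:
        df.update(doc.keys())
    scores = []
    for doc in docs:
        total = sum(doc.values()) or 1
        scores.append({term: (count / total) * math.log(n_docs / df[term])
                       for term, count in doc.items()})
    return scores

# Toy example: "university_of_pennsylvania" repeats inside one document only,
# while "you_know" shows up everywhere, so it scores (close to) zero.
docs = [Counter({"university_of_pennsylvania": 5, "you_know": 7}),
        Counter({"you_know": 9, "budget": 2})]
print(tfidf_scores(docs))
```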

43 n² Time & Space (Update: ASRU-2011)
But the constants are attractive.
Sparsity: redesigned algorithms to take advantage of sparsity
– Median filtering
– Hough transform
– Line segment search
Approximate nearest neighbors (ASRU-2012)
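To make the n² issue concrete, here is a rough sketch of the underlying acoustic dotplot: compare frame-level features pairwise and keep only the pairs above a similarity threshold. This only illustrates why the dotplot is sparse; it is not the median-filter / Hough / line-segment pipeline itself, and a real system would avoid materializing the dense matrix (e.g. with approximate nearest neighbors). The feature array and threshold are placeholders.

```python
import numpy as np

def sparse_dotplot(frames, threshold=0.9):
    """frames: (n, d) array of per-frame acoustic features (e.g. MFCCs or posteriors).
    Returns the frame pairs (i, j), i < j, whose cosine similarity exceeds the threshold.
    The dense comparison is O(n^2) in time and space, but the surviving set is tiny."""
    normed = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T              # n x n cosine similarities (the dense dotplot)
    i, j = np.where(np.triu(sim, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy example with random features: almost no frame pairs survive the threshold.
rng = np.random.default_rng(0)
print(len(sparse_dotplot(rng.standard_normal((200, 39)), threshold=0.5)))
```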

50 NLP on Spoken Documents Without Automatic Speech Recognition
Mark Dredze, Aren Jansen, Glen Coppersmith, Ken Church (EMNLP-2010)

51 WE DON'T NEED SPEECH RECOGNITION TO PROCESS SPEECH (at least not for all NLP tasks)

52 Speech Processing Chain (diagram): Speech Collection → Speech Recognition → Full Transcripts (or Manual Transcripts) → Text Processing (Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis). This talk: a Bag of Words representation, good enough for many tasks.

53 NLP on Speech
Full speech recognition to obtain transcripts:
– Acoustic phonetic model
– Sequence model and decoder
– Language model
NLP requirements:
– A bag of words is sufficient for many NLP tasks (document classification, clustering, etc.)
Do we really need the whole speech recognition pipeline?

54 Our Goal (diagram contrasting the speech recognition → full transcripts → text processing path with a direct find-segments → extract-features path): find long segments (>0.5 s) of repeated acoustic signal and extract features directly from the repeated segments (Jansen et al., Interspeech 2010).

55 Creating Pseudo-Terms (figure: matched audio segments grouped into pseudo-terms P1, P2, P3)

56 Example Pseudo-Terms (table): term IDs (term_5, term_6, term_63, term_113 through term_122) paired with the phrases they cover, e.g. our_life_insurance, term, life_insurance, how_much_we, long_term, budget_for, budget, end_of_the_month, stay_within_a_certain, you_know, have_to, certain_budget.

57 Graph-Based Clustering
Nodes: each matched audio segment
Edges: an edge between two segments if their fractional overlap exceeds a threshold
Extract the connected components of the graph:
– This work: one pseudo-term for each connected component
– Future work: better graph clustering algorithms
Example: Pseudo-term 1 = {newspapers, newspaper, a paper}; Pseudo-term 2 = {keep track of, keep track}
(A minimal sketch of this step follows below.)
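The sketch below uses a small union-find in place of a graph library. It only implements the overlap edges described on this slide (in the full system, the discovered match pairs also contribute edges), and it represents each matched segment as a plain (start, end) interval in seconds, which is an illustrative simplification.

```python
def fractional_overlap(a, b):
    """Overlap of two (start, end) intervals, relative to the shorter one."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / min(a[1] - a[0], b[1] - b[0])

def pseudo_terms(segments, threshold=0.5):
    """Group matched segments into pseudo-terms = connected components of the graph
    whose edges join segments with fractional overlap above the threshold."""
    parent = list(range(len(segments)))
    def find(x):                                  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if fractional_overlap(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)         # union the two components
    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(segments[i])
    return list(clusters.values())

# "newspapers" / "newspaper" style: heavily overlapping segments land in one pseudo-term.
print(pseudo_terms([(10.0, 10.8), (10.1, 10.9), (42.0, 42.6)]))
```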

58 Bags of Words → Bags of Pseudo-Terms
Are pseudo-terms good enough?
(figure): "Four score and seven years is a lot of years." becomes a bag of words (four, score, seven, years, … with their counts) or, without any transcript, a bag of pseudo-terms (term_12, term_5, … with their counts).
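A small sketch of the representation swap: count pseudo-term IDs per document exactly the way a text system would count words. The term IDs and vocabulary below are placeholders.

```python
from collections import Counter

def bag_of_pseudo_terms(doc_terms, vocab):
    """doc_terms: pseudo-term IDs discovered in one spoken document.
    vocab: fixed ordering of all pseudo-term IDs in the collection.
    Returns the count vector a text classifier would normally build from words."""
    counts = Counter(doc_terms)
    return [counts[t] for t in vocab]

vocab = ["term_5", "term_12", "term_63"]
print(bag_of_pseudo_terms(["term_12", "term_5", "term_12"], vocab))  # [1, 2, 0]
```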

59 Evaluation: Data
Switchboard telephone speech corpus:
– 600 conversation sides, 6 topics, 60+ hours of audio
– Topics: family life, news media, public education, exercise, pets, taxes
Identify all pairs of matched regions; graph clustering to produce pseudo-terms.
O(n²) on 60+ hours is a lot!
– Efficient algorithms & sparsity → not as bad as you think
– 500-terapixel dotplot from 60+ hours of speech
– Compute time: 100 cores, 5 hours

60 Clustering Results: Every time I fire an ASR guy, my performance doesn't go down as much as you would have thought…

61 What We've Achieved (anything you can do with a BOW, we can do with a BOP)
(diagram): Speech Collection → Pseudo-Term Discovery → Text Processing (Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis)

62 BACKUP

63 The Answer to All Questions
Kenneth Church, ((almost) former) president

64 Overview: Taxonomy of Questions
Easy questions:
– How do I improve my performance?
Google:
– Q&A: the answer to all questions is a URL
Siri:
– Will you marry me? I love you. Hello.
Harder questions:
– What does a professor do?
Deeper questions (linguistics & philosophy):
– What is the meaning of a lexical item?

65 Q&A: The Answer to All Questions → What is the highest point on Earth? When was Bill Gates born?

66 Siri: "Real World" Q&A (not all questions can be answered with a URL)
Siri:
– Will you marry me?
– I love you
– Hello

68 Spoken Web Queries
You can ask any question you want:
– As long as the answer is a URL
– Or a joke
– Or an action on the phone (open an app, call mom)
Easy (high frequency / low entropy):
– Wikipedia
– Google
Harder (OOVs):
– Wingardium Leviosa → "When Guardian love Llosa"
– Non-speech: water running → "Or" (a universal acceptor)

69 Coping Strategies
When users don't get satisfaction, they try various coping strategies, which are rarely effective.
Examples:
– Repeat the query
– Over-articulation (+ screaming & clipping)
– Spell mode

70 Pseudo-Truth (Self-Training)
Train on ASR output: a negative feedback loop (with β = 1?)
Obviously, pseudo-truth ≠ real truth:
– Especially for universal acceptors
– If the ASR output is "or" (a universal acceptor), the pseudo-truth is wrong

71 Pseudo-Truth & Repetition
Average WER is about 15%, but the average case is not the typical case.
Usually right: easy queries (Wikipedia)
Usually wrong: hard queries (OOVs) and universal acceptors ("or")
Pseudo-truth is more credible when lots of queries (& clicks) agree across users (speaker independent).
Pseudo-truth is less credible when a single user repeats the previous query (speaker dependent).

72 Zero Resource Apps
Detect repetitions (with DTW; see the sketch below):
– If a user repeats the previous query, the pseudo-truth is probably wrong (speaker dependent)
– If lots of users issue the same query, the pseudo-truth is probably right (speaker independent)
Need to distinguish "Google" from "Or":
– Both are common, but most instances of "google" will match one another, unlike "or" (a universal acceptor)
Probably can't use ASR output to detect repetitions:
– Wingardium Leviosa → lots of different (wrong) ASR outputs
Probably can't use ASR output to distinguish "google" from "or".
DTW >> ASR output for detecting repetition and improving pseudo-truth.
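A minimal DTW sketch for the repetition check, assuming each query is represented as a sequence of acoustic feature vectors (MFCC-like); a small normalized distance between the current and previous query suggests the user repeated themselves. This is textbook DTW, not the production matcher, and the feature shapes are invented.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a (n, d) and b (m, d),
    normalized by path length so queries of different lengths are comparable."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

# Toy check: a sequence against a time-stretched copy of itself scores far lower
# than against unrelated audio, which is the repetition signal.
rng = np.random.default_rng(1)
q1 = rng.standard_normal((40, 13))
q1_slow = np.repeat(q1, 2, axis=0)     # crude "spoken more slowly" repetition
q2 = rng.standard_normal((40, 13))
print(dtw_distance(q1, q1_slow), dtw_distance(q1, q2))
```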

73 Spoken Web Search ≠ Dictation
Spoken Web Search:
– Queries are short
– Big fat head & long tail (OOVs)
– Large cohorts: lots of "Google," "or" & "wingardium leviosa"
OOVs are not one-offs:
– Lots of instances of "wingardium leviosa"
Small samples are good enough for:
– Labeling (via Mechanical Turk?)
– Calibration (via Mechanical Turk?)

74 Spelling Correction Products: Microsoft Office ≠ Bing
Microsoft Office:
– A dictionary of general vocabulary is essential
Bing (Web Search):
– A dictionary is essentially useless
General vocabulary is more important in documents than in web queries.
Pre-Web survey: Kukich (1992). Cucerzan and Brill (2004) propose an iterative process that is more appropriate for web queries.

75 Spelling Correction: Bing ≠ Office

76 Context is helpful: it is easier to correct MWEs (names) than single words (Cucerzan and Brill, 2004).

77 ASR without a Dictionary
More like did-you-mean than spelling correction for Microsoft Office.
Use zero-resource methods to cluster OOVs:
– Label/calibrate small samples with Mechanical Turk
Wisdom of the crowds:
– Crowds are smarter than all the lexicographers in the world
How many words/MWEs do people know?
– 20k? 400k? 1M? 13M?
– Is there a bound, or does V (vocabulary) increase with N (corpus size)?
– Is V a measure of intelligence or just experience?
– Is Google smarter than we are? Or just more experienced?

78 Conclusions: Opportunities for Zero Resource Methods
Bing ≠ Office:
– Spoken web queries ≠ dictation
Pseudo-truth ≠ truth (β << 1):
– Use zero-resource methods to distinguish "Google" from "Or" (tie βs): "Google" = same audio; "Or" = different audio
Assign confidence to pseudo-truth:
– More credible: repetition across users (with clicks)
– Less credible: repetition within a session (without clicks); the (ineffective) coping strategy is to ask the same question again (and again)
OOVs: like did-you-mean spelling correction
– Use zero-resource methods to find lots of instances of the same thing
– Label/calibrate small samples with Mechanical Turk

