Spelling Correction for Advertising: How “Noise” Can Help
Silviu Cucerzan, Microsoft Research, Text Mining Search and Navigation


1 Spelling Correction for Advertising: How “Noise” Can Help
Silviu Cucerzan, Microsoft Research, Text Mining Search and Navigation
NISS Workshop on Computational Advertising, November 2009

2 Buying Cheap(er) on eBay
Cannon 30d / Canon 30d
Not good for the sellers. Not good for most buyers. Not good for the middle man.

3 Good Ads for Bad Queries
epresso machines, espesso machines, espreso machines, espressomachines, esspreso machines, esspresso machines, expresso machines, exspresso machines → espresso machines
singular wireless, cingulair wireless, cigular wireless, cingulare wireless, cingullar wireless, cinguilar wireles, cingluarwireless, circular wireless → cingular wireless

4 Is a Trusted Dictionary Enough?
Search: max payne chats and codes (chats → cheats); new humwee pics
Music: selin dion color of my love (selin → celine, color → colour); cristina aquillara (→ christina aguilera)
Shopping: pansonic dvd reorders (pansonic → panasonic, reorders → recorders); brita water filer (filer → filter)
Help and Support: printer divers for window vista (divers → drivers, window → windows); insert flash flies into power point (flies → files, power point → powerpoint)

5 Web Query Logs as Corpora
– Web Search: over to 1 billion queries per day!
– 10-15% of the queries contain spelling errors
– highly dynamic domain: many new names and concepts become popular every day
– extremely difficult to maintain a high-coverage lexicon
– difficult to define what a valid web query is, e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx
Labels on the slide: The problem, The solution

6 Problems To Be Handled
– Concatenate and split: cheese cake factory → cheesecake factory; chat inspanich → chat in spanish
– Recognize out-of-lexicon valid words: amd processors → amd processors
– Change in-lexicon words to out-of-lexicon words: gun dam fighter → gundam fighter
– Context-sensitive correction of out-of-lexicon words: power crd → power cord; video crd → video card
– Context-sensitive correction of in-lexicon words: chicken sop → chicken soup; sop opera → soap opera

7 An HMM Architecture for Spelling Correction
input query: brita water filer
states: all alternative spellings from the query log
brita: brita, brit, brit., brits, briat, rita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
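The lattice on this slide can be decoded with a standard Viterbi pass. The sketch below is only an illustration under stated assumptions: the bigram counts in `BIGRAM_COUNTS` are invented, the candidate lists are a small subset of the ones on the slide, and the flat one-point penalty for changing a word stands in for whatever edit model the actual system uses.

```python
import math

# Hypothetical bigram counts; the real system would read these from a query log.
BIGRAM_COUNTS = {
    ("brita", "water"): 50,
    ("water", "filter"): 400,
    ("brit", "waiter"): 1,
    ("waiter", "file"): 1,
}

def bigram_logp(prev, cur):
    # Smoothed log score for a transition between adjacent candidate words.
    return math.log(BIGRAM_COUNTS.get((prev, cur), 0) + 0.1)

def viterbi_correct(query_words, candidates, change_penalty=1.0):
    """Return the best-scoring candidate sequence for the query.

    candidates[i] lists the alternative spellings (states) for query_words[i].
    Changing a word costs `change_penalty` (an assumed, flat edit cost).
    """
    def emit(pos, cand):
        return 0.0 if cand == query_words[pos] else -change_penalty

    # best[c] = (score, path) for the best path ending in candidate c
    best = {c: (emit(0, c), [c]) for c in candidates[0]}
    for pos in range(1, len(candidates)):
        nxt = {}
        for c in candidates[pos]:
            score, path = max(
                ((s + bigram_logp(p, c), pth) for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            nxt[c] = (score + emit(pos, c), path + [c])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

query = ["brita", "water", "filer"]
cands = [["brita", "brit", "rita"],
         ["water", "waiter", "later"],
         ["filer", "filter", "file"]]
result = viterbi_correct(query, cands)
print(" ".join(result))
```

With these toy counts the decoder keeps "brita" and "water" and changes only "filer" to "filter", because the frequent bigram "water filter" outweighs the one-point cost of the edit.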

8 What about terrible misspellings?
input: arnol shwartzeggar
desired output: arnold schwarzenegger
unweighted edit distance: 5

9 An Iterative Approach
Misspelled query: arnol shwartzeggar
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwarzenegger
Fourth iteration: arnold schwarzenegger (no more changes)
Speller output: arnold schwarzenegger

10 Some Intuition
hunny moon → (iterative spelling correction process) → honeymoon
Search Query Log Statistics:
honemoon          8
honemoons         3
honeybeemon       3
honeymonn         14
honeymoon         19019
honeymoon's       12
honeymooner       3
honeymooner's     6
honeymooners      771
honeymooning      29
honeymoonitis     6
honeymoons        5259
honneymoon        6
honneymoons       9
honnymoon         4
honoeymoon        3
honymoon          19
huneymoon         10
honey moon        333
honey moon's      5
honey mooners     34
honey moons       136
honney moon       6
hony moon         4
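The intuition behind the iterative process can be sketched with the counts on this slide. The code below is a simplification, not the speller itself: it assumes that each pass moves to the most frequent logged form within an unweighted edit distance of 2 (the threshold and the function names are hypothetical), and it repeats until nothing changes.

```python
# Query-log counts copied from the slide.
LOG_COUNTS = {
    "honemoon": 8, "honemoons": 3, "honeybeemon": 3, "honeymonn": 14,
    "honeymoon": 19019, "honeymoon's": 12, "honeymooner": 3,
    "honeymooner's": 6, "honeymooners": 771, "honeymooning": 29,
    "honeymoonitis": 6, "honeymoons": 5259, "honneymoon": 6,
    "honneymoons": 9, "honnymoon": 4, "honoeymoon": 3, "honymoon": 19,
    "huneymoon": 10, "honey moon": 333, "honey moon's": 5,
    "honey mooners": 34, "honey moons": 136, "honney moon": 6,
    "hony moon": 4,
}

def edit_distance(a, b):
    """Unweighted Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_once(query, counts, max_dist=2):
    """One pass: move to the most frequent nearby form, if it is more frequent."""
    best, best_count = query, counts.get(query, 0)
    for form, count in counts.items():
        if count > best_count and edit_distance(query, form) <= max_dist:
            best, best_count = form, count
    return best

def correct_iteratively(query, counts, max_iters=10):
    """Repeat single passes until the query stops changing."""
    chain = [query]
    for _ in range(max_iters):
        nxt = correct_once(chain[-1], counts)
        if nxt == chain[-1]:
            break
        chain.append(nxt)
    return chain

chain = correct_iteratively("hunny moon", LOG_COUNTS)
print(" -> ".join(chain))
```

In this toy setting "honeymoon" is more than two edits away from "hunny moon", so a single pass cannot reach it; the intermediate, moderately frequent misspellings in the log are what bridge the gap.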

11 Basic Assumptions about the “Noise”
– query logs contain a lot of different misspellings for most words
– the better spelled a word form, the more frequent it is
– the correct forms are much more frequent than their misspellings

12 Another Example
albert einstein      4834
albert einstien      525
albert einstine      149
albert einsten       27
albert einsteins     25
albert einstain      11
albert einstin       10
albert eintein       9
albeart einstein     6
aolbert einstein     6
alber einstein       4
albert einseint      3
albert einsteirn     3
albert einsterin     3
albert eintien       3
alberto einstein     3
albrecht einstein    3
alvert einstein      3

13 Concatenation and Splitting
Store word unigrams and bigrams in the same searchable trie structure.
Find alternative spellings for the input words in this common structure.
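One way to picture the shared structure is a character trie in which a bigram is stored as a single string with a space, so that the space can be inserted or deleted like any other character during approximate lookup. The sketch below assumes unweighted edit distance and invented counts; `TrieNode`, `insert`, and `fuzzy_search` are illustrative names, not the system's API.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # > 0 marks the end of a stored unigram or bigram

def insert(root, entry, count):
    node = root
    for ch in entry:
        node = node.children.setdefault(ch, TrieNode())
    node.count = count

def fuzzy_search(root, query, max_dist):
    """Return (entry, distance, count) for entries within max_dist edits."""
    results = []

    def walk(node, prefix, prev_row):
        for ch, child in node.children.items():
            # Extend the Levenshtein table by one row for this trie character.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                row.append(min(row[i - 1] + 1,
                               prev_row[i] + 1,
                               prev_row[i - 1] + (query[i - 1] != ch)))
            if child.count and row[-1] <= max_dist:
                results.append((prefix + ch, row[-1], child.count))
            if min(row) <= max_dist:  # prune branches that cannot match
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return results

# Unigrams and bigrams live in the same trie (counts are made up).
root = TrieNode()
for entry, count in [("cheese", 500), ("cake", 400), ("cheesecake", 900),
                     ("cheese cake", 50), ("in", 1000), ("spanish", 700),
                     ("in spanish", 300), ("gun", 200), ("dam", 100),
                     ("gundam", 250)]:
    insert(root, entry, count)

concat = fuzzy_search(root, "cheese cake", 1)  # finds the unigram "cheesecake"
split = fuzzy_search(root, "inspanich", 2)     # finds the bigram "in spanish"
print(concat)
print(split)
```

Because both kinds of entries share one structure, concatenation ("cheese cake" to "cheesecake") and splitting ("inspanich" to "in spanish") fall out of the same lookup as ordinary spelling alternatives.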

14 Avoid Changing the User’s Intent
input query: brita water filer
brita: brita, brit, brit., brits, briat, rita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
Correction to avoid: brita water filer → brit waiter file

15 Modified Viterbi Search – Fringes
e.g.: water filer → waiter file
in-lexicon words: k1 × k2 → k1 + k2 paths
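The slide gives only the example and the path counts, so the following is an interpretation rather than the documented rule: it assumes the restriction is that two adjacent in-lexicon words may not both be changed in the same pass. Under that assumption the candidate pairs for two neighboring words shrink from a product to roughly a sum (the exact figure depends on whether the original spellings are counted among the k1 and k2 candidates). The function name and candidate lists are hypothetical.

```python
from itertools import product

def allowed_pairs(word1, alts1, word2, alts2, restrict=True):
    """Candidate pairs for two adjacent in-lexicon words.

    With restrict=True, at least one of the two words must keep its
    original spelling (the assumed fringe restriction).
    """
    cands1 = [word1] + alts1
    cands2 = [word2] + alts2
    pairs = list(product(cands1, cands2))
    if restrict:
        pairs = [(a, b) for a, b in pairs if a == word1 or b == word2]
    return pairs

alts_water = ["waiter", "later", "wafer"]
alts_filer = ["filter", "file"]

full = allowed_pairs("water", alts_water, "filer", alts_filer, restrict=False)
fringe = allowed_pairs("water", alts_water, "filer", alts_filer)

k1, k2 = 1 + len(alts_water), 1 + len(alts_filer)
print(len(full), "pairs without the restriction (k1 * k2)")
print(len(fringe), "pairs with the restriction (k1 + k2 - 1 here)")
```

Under this reading the rewrite "water filer" to "waiter file" is unreachable in one pass, while "water filer" to "water filter" remains available.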

16 Modified Viterbi Search – Stop words
e.g.: lord of teh rigs → lord of the rings

17 Evaluation (agreement with the annotators, %)
                     All queries   Valid   Misspelled
Nr. queries          1044          864     180
Full system          81.8          84.8    67.2
No lexicon           70.3          72.2    61.1
No query log         77.0          82.1    52.8
All edits equal      80.4          83.3    66.1
Unigrams only        54.7          57.4    41.7
1 iteration only     80.9          88.0    47.2
2 iterations only    81.3          84.4    66.7
No fringes           80.6          83.3    67.2

18 A Closer Look at the Results
81.8% overall agreement with the annotators
inter-annotator agreement rate: 91.3%
Errors:
– alternative queries for valid queries
  many false positives are reasonable suggestions, e.g. cowboy robes → cowboy ropes
– alternative queries for misspelled queries
  some suggestions could be valid (user’s intent not known), e.g. massanger → massager / messenger

19 Evaluation – When we “know” user’s intent
368 queries, e.g.:
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms

Full system          73.1
No lexicon           59.2
No query log         44.9
All edits equal      69.9
Unigrams only        43.0
1 iteration only     45.5
2 iterations only    68.2
No fringes           71.0

20 Learning Curve
Silviu Cucerzan and Eric Brill, “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004

