Text Mining Search and Navigation Spelling Correction for Advertising: How “Noise” Can Help Silviu Cucerzan Microsoft Research Text Mining Search and Navigation.

Presentation transcript:

Spelling Correction for Advertising: How “Noise” Can Help
Silviu Cucerzan, Microsoft Research (Text Mining, Search and Navigation)
NISS Workshop on Computational Advertising, November 2009

Buying Cheap(er) on eBay
Cannon 30d / Canon 30d
Not good for the sellers. Not good for most buyers. Not good for the middle man.

Good Ads for Bad Queries
epresso machines, espesso machines, espreso machines, espressomachines, esspreso machines, esspresso machines, expresso machines, exspresso machines → espresso machines
singular wireless, cingulair wireless, cigular wireless, cingulare wireless, cingullar wireless, cinguilar wireles, cingluarwireless, circular wireless → cingular wireless

Is a Trusted Dictionary Enough?
Search: max payne chats and codes (chats → cheats); new humwee pics
Music: selin dion color of my love (selin → celine, color → colour); cristina aquillara (→ christina aguilera)
Shopping: pansonic dvd reorders (pansonic → panasonic, reorders → recorders); brita water filer (filer → filter)
Help and Support: printer divers for window vista (divers → drivers, window → windows); insert flash flies into power point (flies → files, power point → powerpoint)

Web Query Logs as Corpora
Web Search: over to 1 billion queries per day!
10-15% of the queries contain spelling errors.
Highly dynamic domain: many new names and concepts become popular every day.
Extremely difficult to maintain a high-coverage lexicon.
Difficult to define what a valid web query is, e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx
The problem / The solution

Problems To Be Handled
Concatenate and split: cheese cake factory → cheesecake factory; chat inspanich → chat in spanish
Recognize out-of-lexicon valid words: amd processors → amd processors
Change in-lexicon words to out-of-lexicon words: gun dam fighter → gundam fighter
Context-sensitive correction of out-of-lexicon words: power crd → power cord; video crd → video card
Context-sensitive correction of in-lexicon words: chicken sop → chicken soup; sop opera → soap opera

An HMM Architecture for Spelling Correction
Input query: brita water filer
States: all alternative spellings from the query log
brita: brita, brit, brit., brits, briat, rita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
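The lattice above can be searched with a standard Viterbi pass. The following is only a minimal sketch under stated assumptions: the emission score is a flat penalty per unweighted edit, the transition score is an add-one smoothed bigram probability, and every count in `unigram` and `bigram` is invented for illustration. The talk does not specify these models, and the names `viterbi`, `edit_penalty`, `unigram` and `bigram` are hypothetical.

```python
import math

def edit_distance(a, b):
    """Unweighted Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def viterbi(query_words, candidates, unigram, bigram, edit_penalty=2.0):
    """Return the best-scoring sequence of candidate spellings (one per input word)."""
    vocab = len(unigram)
    total = sum(unigram.values())

    def emit(observed, state):
        # Assumed error model: each edit costs a fixed log-penalty.
        return -edit_penalty * edit_distance(observed, state)

    def trans(prev_state, state):
        # Assumed language model: add-one smoothed log P(state | prev_state).
        return math.log((bigram.get((prev_state, state), 0) + 1)
                        / (unigram.get(prev_state, 0) + vocab))

    # best maps each state of the current word to (score, path ending in that state).
    best = {s: (math.log((unigram.get(s, 0) + 1) / (total + vocab))
                + emit(query_words[0], s), [s])
            for s in candidates[0]}
    for observed, states in zip(query_words[1:], candidates[1:]):
        new_best = {}
        for s in states:
            score, path = max(((ps + trans(p, s), pp) for p, (ps, pp) in best.items()),
                              key=lambda t: t[0])
            new_best[s] = (score + emit(observed, s), path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Hypothetical query-log counts, chosen only to make the example concrete.
unigram = {"brita": 50, "brit": 200, "rita": 300, "water": 5000, "waiter": 400,
           "later": 3000, "filer": 20, "filter": 900, "file": 4000, "fiber": 600}
bigram = {("brita", "water"): 30, ("water", "filter"): 120, ("water", "file"): 2}
candidates = [["brita", "brit", "rita"],
              ["water", "waiter", "later"],
              ["filer", "filter", "file", "fiber"]]

best_path = viterbi(["brita", "water", "filer"], candidates, unigram, bigram)
```

With these toy counts the bigram evidence for "water filter" outweighs the one-edit penalty on "filer", so the path "brita water filter" wins.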

What about terrible misspellings?
Input: arnol shwartzeggar
Desired output: arnold schwarzenegger
Unweighted edit distance: 5
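For reference, unweighted edit distance can be computed with the textbook Levenshtein recurrence, sketched below. The function name `levenshtein` is an illustrative choice, and the slide does not say whether its figure covers the whole query or only the surname. Computed on the surname alone the result is 5; the missing "d" in the first name is one further edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insertions, deletions and substitutions all cost 1."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute, or match at no cost
        prev = cur
    return prev[-1]

surname_distance = levenshtein("shwartzeggar", "schwarzenegger")
first_name_distance = levenshtein("arnol", "arnold")
```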

An Iterative Approach
Misspelled query: arnol shwartzeggar
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwarzenegger
Fourth iteration: arnold schwarzenegger (no more changes; this is the speller output)
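The iteration above can be sketched as a fixed-point loop. This is a simplified illustration, not the system described in the talk: each step replaces every word with the most frequent query-log word within two unweighted edits, and the counts in `log_counts` are invented. The names `correct_once`, `correct_iteratively` and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    """Unweighted Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_once(query, log_counts, max_dist=2):
    """One pass: move each word to the most frequent log word within max_dist edits."""
    out = []
    for word in query.split():
        near = [w for w in log_counts if edit_distance(word, w) <= max_dist]
        best = max(near, key=log_counts.get, default=word)
        # Only move to a strictly more frequent form; otherwise keep the word.
        out.append(best if log_counts.get(best, 0) > log_counts.get(word, 0) else word)
    return " ".join(out)

def correct_iteratively(query, log_counts, max_iters=10):
    """Repeat single passes until the query stops changing; return every step."""
    trace = [query]
    for _ in range(max_iters):
        nxt = correct_once(trace[-1], log_counts)
        if nxt == trace[-1]:
            break
        trace.append(nxt)
    return trace

# Hypothetical counts: better-spelled forms are assumed to be more frequent.
log_counts = {"arnol": 10, "arnold": 5000, "shwartzeggar": 1, "schwartzneggar": 5,
              "schwartzenegger": 200, "schwarzenegger": 3000}

trace = correct_iteratively("arnol shwartzeggar", log_counts)
```

No single step moves more than two edits, yet the chain of increasingly frequent forms reaches a target that is five edits away from the input surname.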

Some Intuition: Search Query Log Statistics
Iterative spelling correction process: hunny moon → honeymoon
Single-word forms (count): honemoon 8, honemoons 3, honeybeemon 3, honeymonn 14, honeymoon 19019, honeymoon's 12, honeymooner 3, honeymooner's 6, honeymooners 771, honeymooning 29, honeymoonitis 6, honeymoons 5259, honneymoon 6, honneymoons 9, honnymoon 4, honoeymoon 3, honymoon 19, huneymoon 10
Two-word forms (count): honey moon 333, honey moon's 5, honey mooners 34, honey moons 136, honney moon 6, hony moon 4

Basic Assumptions about the “Noise”
Query logs contain a lot of different misspellings for most words.
The better spelled a word form, the more frequent it is.
The correct forms are much more frequent than their misspellings.

Another Example
Query log counts: albert einstein 4834, albert einstien 525, albert einstine 149, albert einsten 27, albert einsteins 25, albert einstain 11, albert einstin 10, albert eintein 9, albeart einstein 6, aolbert einstein 6, alber einstein 4, albert einseint 3, albert einsteirn 3, albert einsterin 3, albert eintien 3, alberto einstein 3, albrecht einstein 3, alvert einstein 3
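As a quick arithmetic check of the claim that correct forms are much more frequent than their misspellings, the counts from this slide can be tallied directly. The shares below cover only the forms listed on the slide, not the full query log, and the variable names are illustrative.

```python
# Counts copied from the slide; only the forms shown there are included.
counts = {
    "albert einstein": 4834, "albert einstien": 525, "albert einstine": 149,
    "albert einsten": 27, "albert einsteins": 25, "albert einstain": 11,
    "albert einstin": 10, "albert eintein": 9, "albeart einstein": 6,
    "aolbert einstein": 6, "alber einstein": 4, "albert einseint": 3,
    "albert einsteirn": 3, "albert einsterin": 3, "albert eintien": 3,
    "alberto einstein": 3, "albrecht einstein": 3, "alvert einstein": 3,
}

total = sum(counts.values())
top_form, top_count = max(counts.items(), key=lambda kv: kv[1])
runner_up = sorted(counts.values(), reverse=True)[1]

share = top_count / total        # fraction of listed queries with the top form
ratio = top_count / runner_up    # top form versus the most common misspelling
print(f"{top_form}: {share:.1%} of {total} listed queries, {ratio:.1f}x the runner-up")
```

The most frequent form is the correctly spelled one; it accounts for roughly 86% of the listed queries and is about nine times as frequent as the most common misspelling.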

Concatenation and Splitting
Store word unigrams and bigrams in the same searchable trie structure. Find alternative spellings for the input words in this common structure.
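One way to picture this is a character trie in which a bigram is simply stored as a string containing a space, so that inserting or deleting a space is an ordinary edit. The sketch below is an assumption about how such a structure could look, not the talk's implementation. The class and method names are hypothetical, the counts are invented, and the fuzzy lookup uses plain unweighted edit distance.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # > 0 marks the end of a stored unigram or bigram

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, entry, count):
        """Store a unigram ("cake") or a bigram ("cheese cake") with its count."""
        node = self.root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def search(self, text, max_dist):
        """Return {entry: count} for stored entries within max_dist edits of text."""
        results = {}
        first_row = list(range(len(text) + 1))
        for ch, child in self.root.children.items():
            self._walk(child, ch, ch, text, first_row, max_dist, results)
        return results

    def _walk(self, node, ch, prefix, text, prev_row, max_dist, results):
        # row[j] = edit distance between the trie path so far and text[:j].
        row = [prev_row[0] + 1]
        for j in range(1, len(text) + 1):
            row.append(min(row[j - 1] + 1, prev_row[j] + 1,
                           prev_row[j - 1] + (text[j - 1] != ch)))
        if node.count and row[-1] <= max_dist:
            results[prefix] = node.count
        if min(row) <= max_dist:  # prune branches that can no longer match
            for nxt, child in node.children.items():
                self._walk(child, nxt, prefix + nxt, text, row, max_dist, results)

trie = Trie()
for entry, count in [("cheesecake", 900), ("cheese", 4000), ("cake", 3000),
                     ("cheese cake", 40), ("in spanish", 700), ("spanish", 5000)]:
    trie.add(entry, count)  # hypothetical counts

concat_hits = trie.search("cheese cake", 1)   # finds the concatenated form too
split_hits = trie.search("inspanich", 2)      # finds the split, corrected form
```

Because unigrams and bigrams share one structure, a single lookup on "cheese cake" returns "cheesecake", and a single lookup on "inspanich" returns "in spanish".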

Avoid Changing the User’s Intent
Input query: brita water filer
brita: brita, brit, brit., brits, briat, rita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
Undesired correction: brita water filer → brit waiter file

Modified Viterbi Search – Fringes
In-lexicon words, e.g.: water filer → waiter file
Paths: k1 × k2 → k1 + k2
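One reading of this slide, stated here as an assumption, is that two adjacent in-lexicon words are not allowed to change at the same time, which cuts the transitions between their candidate sets from about k1 × k2 to about k1 + k2. The sketch below only enumerates the transitions allowed under that reading; `allowed_pairs` and the candidate lists are hypothetical.

```python
from itertools import product

def allowed_pairs(word1, alts1, word2, alts2):
    """Candidate pairs in which at most one of two adjacent words is changed."""
    return [(a, b) for a, b in product(alts1, alts2) if a == word1 or b == word2]

# Each candidate list includes the original word itself.
alts_water = ["water", "waiter", "later", "wafer"]   # k1 = 4
alts_filer = ["filer", "filter", "file"]             # k2 = 3

all_pairs = list(product(alts_water, alts_filer))
kept = allowed_pairs("water", alts_water, "filer", alts_filer)
```

With the original word counted among the candidates, exactly k1 + k2 - 1 pairs survive (6 of 12 here). "water filter" is kept, while "waiter file", which changes both words, is not.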

Modified Viterbi Search – Stop words
e.g.: lord of teh rigs → lord of the rings

Evaluation
Columns: All queries, Valid, Misspelled
Rows: Nr. queries; Full system; No lexicon; No query log; All edits equal; Unigrams only; iteration only; iterations only; No fringes

A Closer Look at the Results
81.8% overall agreement with the annotators
Annotator inter-agreement rate: 91.3%
Errors:
Alternative queries for valid queries: many false positives are reasonable suggestions, e.g. cowboy robes → cowboy ropes
Alternative queries for misspelled queries: some suggestions could be valid (user’s intent not known), e.g. massanger → massager / messenger

Evaluation – When we “know” user’s intent
368 queries
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms
Full system 73.1
No lexicon 59.2
No query log 44.9
All edits equal 69.9
Unigrams only
iteration only
iterations only 68.2
No fringes 71.0

Learning Curve
Silviu Cucerzan and Eric Brill, “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004