Scott Wen-tau Yih (Microsoft Research). Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing).

Presentation transcript:

Same article

Subject: The most popular 400% on first deposit
Dear Player : ) They offer a multi-levelled bonus, which if completed earns you a total o= take your 400% right now on your first deposit Get Started right now >>>
__________________________
Windows Live™: Keep your life in sync.

Subject: sweet dream 400% on first deposit
Dear Player : ) bets in light of the new legislation passed threatening the entire online g=ming... take your 400% right now on your first deposit Get Started right now >>>
_________________________________________________________________
News, entertainment and everything you care about at Live.com. Get it now= Nothing can be better than buying a good with a discount.

Same payload info

Search engines
- Smaller index and storage of crawled pages
- Present non-redundant information
Email spam filtering
- Spam campaign detection
Online advertising
- Web plagiarism detection
- Not showing content ads on low-quality pages

Capture the notion of “near-duplicate”
- Whether a document fragment is important depends on the target application
Generalize well for future data
- e.g., identify important names even if they were unseen before
Preserve efficiency
- Most applications target large document sets; cannot sacrifice efficiency for accuracy

Improves accuracy by learning a better document representation
- Learns the notion of “near-duplicate” from (a small number of) labeled documents
Has a simple feature design
- Alleviates the out-of-vocabulary problem, generalizes well
- Easy to evaluate, little additional computation
Plugs in a learning component
- Can be easily combined with existing NDD methods

Introduction
Adaptive Near-duplicate Detection
- A unified view of NDD methods
- Improve accuracy via similarity learning
Experiments
Conclusions

Example document: “BP to proceed with pressure test on leaking well …” → signature 01101

For efficient document comparison and processing
Encode document into a set of hash code(s)
- Shingles: MinHash
- I-Match: SHA1 (single hash value)
- Charikar’s random projection: SimHash [Henzinger ‘06]
Example signatures: AB12FE, DFA15, 009F12485, …
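
The three signature schemes named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the helper names (`hash64`, `shingles`, `minhash_signature`, `imatch_signature`, `simhash`), the 3-token shingle size, the 64 hash functions, and the 64-bit fingerprint width are all hypothetical choices. The I-Match sketch also assumes the caller has already filtered the terms; the real method selects terms with a lexicon before hashing.

```python
import hashlib


def hash64(text, seed=0):
    """Deterministic 64-bit hash of a string, salted with an integer seed."""
    digest = hashlib.sha1(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shingles(tokens, k=3):
    """Set of k-token shingles (contiguous windows of k tokens)."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """MinHash: keep the minimum hash value under each seeded hash function."""
    return [min(hash64(x, seed) for x in items) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching positions; an estimate of the Jaccard coefficient."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def imatch_signature(terms):
    """I-Match style: one SHA1 value over the sorted set of (pre-filtered) terms."""
    return hashlib.sha1(" ".join(sorted(set(terms))).encode("utf-8")).hexdigest()


def simhash(term_weights, bits=64):
    """SimHash: each term votes +weight or -weight on every bit of the fingerprint."""
    totals = [0.0] * bits
    for term, weight in term_weights.items():
        h = hash64(term)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # A bit is set when the weighted vote for it is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents are then compared through their signatures alone: matching MinHash positions approximate Jaccard similarity, a small Hamming distance between SimHash fingerprints indicates a high cosine similarity, and I-Match declares a duplicate only when the single hash values are equal.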

Quality of the term vectors determines the final prediction accuracy
Hashing schemes approximate the vector similarity function (e.g., cosine and Jaccard)
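
For reference, the two similarity functions that the hashing schemes approximate can be computed exactly on term vectors. The sketch below assumes a document is represented as a dictionary mapping each term to its weight; the function names are hypothetical. The Jaccard version here is the plain set form over terms with non-zero weight, which is an assumption, since the transcript does not say which real-valued generalization (if any) was used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard(u, v):
    """Set Jaccard over the terms that carry non-zero weight in each vector."""
    terms_u = {t for t, w in u.items() if w != 0}
    terms_v = {t for t, w in v.items() if w != 0}
    union = terms_u | terms_v
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(terms_u & terms_v) / len(union)
```

Because both functions read only the term weights, changing how those weights are assigned changes the similarity score without touching the hashing step.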

Doc-independent features
- Evaluated by table lookup
- e.g., Doc frequency (DF), Query frequency (QF)
Doc-dependent features
- Evaluated by linear scan
- e.g., Term frequency (TF), Term location (Loc)
No lexical features used
Very easy to compute
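
A rough sketch of how such features might be gathered for one term is shown below. Everything specific in it is an assumption made for illustration: the function names, the `doc_freq` and `query_freq` lookup tables, the definition of Loc as the relative position of the first occurrence, and the linear scoring function with a `coefficients` dictionary. The transcript does not give the functional form of the learned weighting.

```python
def term_features(term, doc_tokens, doc_freq, query_freq):
    """Collect the four features for one term in one tokenized document."""
    # Doc-independent features: plain table lookups.
    df = doc_freq.get(term, 0)
    qf = query_freq.get(term, 0)

    # Doc-dependent features: one linear scan over the document.
    tf = 0
    first_pos = None
    for i, token in enumerate(doc_tokens):
        if token == term:
            tf += 1
            if first_pos is None:
                first_pos = i

    # Assumed definition: relative position of the first occurrence,
    # with 1.0 standing in for "term does not occur".
    loc = first_pos / len(doc_tokens) if first_pos is not None else 1.0
    return {"DF": df, "QF": qf, "TF": tf, "Loc": loc}


def term_weight(features, coefficients):
    """Assumed linear scorer: weighted sum of the feature values."""
    return sum(coefficients[name] * value for name, value in features.items())
```

Since none of the features depends on the identity of the term, a term that never appeared in the training data still gets a weight, which is how a design like this can sidestep the out-of-vocabulary problem.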

Introduction
Adaptive Near-duplicate Detection
Experiments
- Data sets: News & Email
- Quality of raw vector representations
- Quality of document signatures
- Learning curve
Conclusions

Web News Articles (News)
- Near-duplicate news pages [Theobald et al. SIGIR-08]
- 68 clusters; 2,160 news articles in total
- 5 times 2-fold cross-validation
Hotmail Outbound Messages (Email)
- Training: 400 clusters (2,256 msg) from Dec 2008
- Testing: 475 clusters (658 msg) from Jan 2009
- Initial clusters selected using Shingle and I-Match; labels are further corrected manually

[Figure: panels Cosine and Jaccard]

[Figure: panels Cosine and Jaccard]

[Figure: panels Initial Model and Final Model]