Ziv Bar-Yossef (Google and Technion), Maxim Gurevich (Technion)



Impressions and ImpressionRank
- Impression of page/site x on a keyword w:
  - A user sends w to a search engine
  - The search engine returns x as one of the results
  - The user sees the result x
- ImpressionRank of x: number of impressions of x within a certain time frame
- A measure of page/site visibility in a search engine
- Example: each result shown for the keyword "www 2009" (e.g. www2009.org/calls.html) has an impression on that keyword

Popular Keyword Extraction
- The Popular Keyword Extraction problem:
  - Input: web page x, integer k
  - Output: the k keywords on which x has the most impressions among all keywords
- Example: x = …; keywords: sarah palin, john mccain, cindy mccain
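To make the two definitions concrete, here is a minimal sketch that computes ImpressionRank and the top-k keywords from a toy impression log. The log format, the function names, and the sample rows (including the second URL) are assumptions for illustration only; they are not taken from the talk.

```python
from collections import Counter

# Hypothetical impression log: one (keyword, result_url) pair per result a user saw.
LOG = [
    ("www 2009", "www2009.org/calls.html"),
    ("www 2009", "www2009.org/calls.html"),
    ("www 2009", "www2009.org/"),
    ("call for papers", "www2009.org/calls.html"),
]

def impression_rank(log, x):
    """Number of impressions of page x in the log (a single time frame)."""
    return sum(1 for _, url in log if url == x)

def popular_keywords(log, x, k):
    """The k keywords on which x has the most impressions."""
    counts = Counter(w for w, url in log if url == x)
    return [w for w, _ in counts.most_common(k)]

print(impression_rank(LOG, "www2009.org/calls.html"))      # 3
print(popular_keywords(LOG, "www2009.org/calls.html", 1))  # ['www 2009']
```

This is the "internal" view: it needs the log itself, which only the search engine has.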

Motivation
- Popularity rating of pages and sites
- Site analytics
  - Enable site owners to determine their visibility in different search engines
  - Combine with traffic data to derive click-through rates
  - Compare to other sites
- Keyword suggestions for online advertising
- Social analysis
- Search engine evaluation
- Finding similar pages

Internal Measurements of ImpressionRank and Popular Keyword Extraction
- Search engines can compute both ImpressionRank and popular keywords from their query logs
- Query logs are not publicly released due to privacy concerns
- Caveats:
  - Only search engines can do this
  - Non-transparent

External Measurements of ImpressionRank and Popular Keyword Extraction
- Diagram: Target page URL → ImpressionRank estimator / Popular keyword extractor → ImpressionRank / Popular Keywords
- Main cost measure: number of requests to the search engine and to the suggestion server

Our Contributions
- Reduce ImpressionRank estimation to popular keyword extraction
- First external algorithm for popular keyword extraction
  - Accurate
  - Uses relatively few search engine requests
- Applies to:
  - Single web pages (…)
  - Web sites (…)
  - Domains (*.cnn.com/*)

Related Work
- Keyword extraction [Frank et al 99, Turney 00, …]
- Keyword suggestions (for online advertising) [Yih et al 06, Fuxman et al 08]
- Query by Document [Yang et al 09]
- Commercial traffic reporting [GoogleTrends, comScore, Nielsen, Compete]

Roadmap
- The naïve popular keyword extraction algorithm
- The improved popular keyword extraction algorithm
- Best-First Search
- Experimental results

Popular Keyword Extraction: The Naïve Algorithm
- Diagram: Target Page → Term Extractor → Term Pool (e.g. "mp3", "tag") → Candidate keyword generator → Candidate Verifier → Popular Keywords
- Verification procedure for keyword w:
  - Submit w to the search engine and the suggestion server
  - Verify that w returns the target page
  - Verify that the popularity of w > 0 [BG08]
- Recall problem: the target page may have impressions on keywords that do not occur in its text
- Efficiency problem: 10^3 terms → … term candidates
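The naïve pipeline can be sketched as follows, mainly to show why the candidate count explodes. The callables `returns_target` and `popularity` are hypothetical stand-ins for the search engine and suggestion server requests, and generating candidates as ordered term combinations is an assumption about the candidate generator.

```python
from itertools import permutations

def naive_extract(terms, max_len, returns_target, popularity):
    """Enumerate every ordered term combination up to max_len and verify each one.

    returns_target(w) and popularity(w) are hypothetical callables that stand
    in for requests to the search engine and the suggestion server.
    """
    popular, verifications = [], 0
    for n in range(1, max_len + 1):
        for combo in permutations(terms, n):
            w = " ".join(combo)
            verifications += 1  # every candidate costs external requests
            if returns_target(w) and popularity(w) > 0:
                popular.append(w)
    return popular, verifications

# Toy stand-ins for the external services (made-up numbers).
FAKE_POPULARITY = {"mp3": 50, "mp3 tag": 8}
found, cost = naive_extract(
    ["mp3", "tag", "editor"], 2,
    returns_target=lambda w: w in FAKE_POPULARITY,
    popularity=lambda w: FAKE_POPULARITY.get(w, 0),
)
print(found, cost)  # ['mp3', 'mp3 tag'] 9
```

Even with three terms and keywords of at most two terms, nine candidates are verified; with a term pool of about 10^3 terms the same loop is infeasible.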

Popular Keyword Extraction: The Improved Algorithm
- Diagram: Target Page → Term Extractor → Term Pool → Candidate keyword generator (Best-First Search) → Candidate Verifier

Best-First Search
- Diagram: candidate keyword trie (nodes such as "mp3", "weather", "mp3 song", "mp3 tag"; node labels 35 and 8) feeding the Candidate Verifier
- Goals:
  - Prune as many candidates as possible
  - Verify the most promising candidates first
- Algorithm:
  - Start with single-term candidates
  - Score the candidates
  - While the search engine request budget is not exceeded:
    - w = top-scoring candidate
    - Send w to the verifier
    - Decide whether to prune w
    - If w is not pruned, expand w: generate and score the children of w
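The loop above maps naturally onto a priority queue. The sketch below is one possible reading, not the authors' implementation: `score`, `verify` and `should_prune` are hypothetical callables, children are formed by appending one unused term, and the toy scores are made up.

```python
import heapq

def best_first_search(terms, score, verify, should_prune, budget):
    """Verify the highest-scoring candidates first, within a request budget."""
    # heapq is a min-heap, so scores are negated to pop the best candidate.
    frontier = [(-score(t), t) for t in terms]
    heapq.heapify(frontier)
    popular, used = [], 0
    while frontier and used < budget:
        _, w = heapq.heappop(frontier)
        used += 1
        if verify(w):
            popular.append(w)
        if should_prune(w):
            continue  # w is never expanded, so its descendants are dropped too
        for t in terms:
            if t not in w.split():
                child = f"{w} {t}"
                heapq.heappush(frontier, (-score(child), child))
    return popular

# Toy run with made-up scores.
SCORES = {"mp3": 35, "weather": 8, "mp3 tag": 20, "mp3 song": 12}
result = best_first_search(
    ["mp3", "weather", "tag", "song"],
    score=lambda w: SCORES.get(w, 0),
    verify=lambda w: w in {"mp3", "mp3 tag"},
    should_prune=lambda w: w != "mp3",
    budget=4,
)
print(result)  # ['mp3', 'mp3 tag']
```

Because a pruned node is never expanded, its whole subtree is skipped without any further requests.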

Pruning
- Pruning decision for keyword w:
  - Submit the query "inurl: w"; if there are no results, prune w and all its descendants
  - Retrieve suggestions for w; if there are none, prune w and all its descendants
- Pruning eliminates the vast majority of candidates
- A single search/suggestion request may eliminate thousands of candidates
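A minimal sketch of the pruning decision, under stated assumptions: `search` and `suggest` are hypothetical callables returning lists of results, and the query is assumed to restrict results to the target URL via `inurl:` (the exact query form is not recoverable from the slide).

```python
def should_prune(w, search, suggest, target_url):
    """Return True if w, and hence every extension of w, can be discarded."""
    if not search(f"inurl:{target_url} {w}"):
        return True   # the target page is not returned for w
    if not suggest(w):
        return True   # the suggestion server returns nothing for w
    return False

# Toy stand-ins for the two services (made-up data).
INDEX = {"inurl:example.com/mp3 mp3", "inurl:example.com/mp3 mp3 tag"}
SUGGESTIONS = ["mp3 tag editor", "mp3 songs"]

def search(query):
    return [query] if query in INDEX else []

def suggest(prefix):
    return [s for s in SUGGESTIONS if s.startswith(prefix)]

print(should_prune("mp3 tag", search, suggest, "example.com/mp3"))    # False
print(should_prune("mp3 table", search, suggest, "example.com/mp3"))  # True
```

Each call costs at most two requests, yet a True result removes an entire subtree of candidates.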

Scoring
- The Best-First search algorithm considers only the top-scoring candidates, given the budget
- Want to predict:
  - Whether the search engine returns the target page on w
  - Whether w is a popular keyword
- score(w) = tf(w)^α · idf(w)^β · popularity_score(w)^γ
  - α, β, and γ: relative weights of the scoring components
  - tf(w)^α · idf(w)^β predicts whether the search engine returns the target page on w
  - popularity_score(w)^γ predicts the popularity of w
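As a small worked example of the score, assuming the three relative weights enter as exponents on tf, idf and the popularity score (named `alpha`, `beta`, `gamma` here; the names and default values are assumptions):

```python
def score(tf, idf, popularity, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted product of the three scoring components."""
    return (tf ** alpha) * (idf ** beta) * (popularity ** gamma)

# Down-weighting popularity with gamma = 0.5: 4 * 2.0 * sqrt(9)
print(score(tf=4, idf=2.0, popularity=9, gamma=0.5))  # 24.0
```

Setting a weight to 0 removes that component from the ranking entirely, which is a convenient way to test each signal on its own.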

How to Compute Candidate Scores
- Every time the algorithm expands a keyword, it needs to compute scores for all its children
  - There could be thousands of such children
- TF score: straightforward; no search requests needed
- IDF score: approximated from an offline corpus; no search requests needed
- Popularity score:
  - [BarYossefGurevich 08]: algorithm for estimating keyword popularity using the query suggestion service
    - Too costly: may use dozens of suggestion requests per estimate
  - We present a new algorithm that estimates popularity for all the children in bulk
    - Uses hundreds of suggestion requests to estimate the popularity of all the children
    - Estimates are less accurate
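The two request-free components can be sketched like this. The smoothed IDF formula, the tokenisation, and the restriction to single terms are assumptions; the slides only say that TF is straightforward and IDF comes from an offline corpus.

```python
import math
from collections import Counter

def tf(term, page_tokens):
    """Raw count of term in the target page; needs no search requests."""
    return Counter(page_tokens)[term]

def idf(term, corpus):
    """Smoothed inverse document frequency over an offline corpus of token lists."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df))

page = "mp3 tag editor for mp3 files".split()
corpus = [["mp3", "songs"], ["weather", "today"], ["mp3", "tag"]]
print(tf("mp3", page))               # 2
print(round(idf("tag", corpus), 3))  # 0.693, i.e. log(4/2)
```

Both functions touch only local data, so scoring thousands of children this way costs nothing against the request budget.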

Cheap Popularity Estimation
- Input: a keyword w
- Goal: estimate the popularity of all of w's children
- Steps:
  - Bucket the children according to their first character
  - Estimate the relative popularity of each bucket: an estimate of popularity_score(prefix) from the BG08 popularity estimator
  - Estimate the relative popularity within each bucket
- Example: w = "mp3"; children: "mp3 song", "mp3 tag", "mp3 table", …
  - Buckets: "mp3 s" → "mp3 song", …; "mp3 t" → "mp3 tag", "mp3 table", …
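The bucketing step can be sketched as below. This is only an outline under assumptions: `prefix_popularity` and `share_in_bucket` are hypothetical callables, the numbers are made up, and combining the two estimates by multiplication is an assumed way of putting the bucket-level and within-bucket estimates together.

```python
from collections import defaultdict

def bulk_popularity(w, children, prefix_popularity, share_in_bucket):
    """Estimate the popularity of every child of w via first-character buckets."""
    buckets = defaultdict(list)
    for child in children:
        first_char = child[len(w) + 1]           # first character after "w "
        buckets[f"{w} {first_char}"].append(child)
    estimates = {}
    for prefix, members in buckets.items():
        bucket_pop = prefix_popularity(prefix)   # one estimate per bucket
        for child in members:
            estimates[child] = bucket_pop * share_in_bucket(child, members)
    return estimates

est = bulk_popularity(
    "mp3", ["mp3 song", "mp3 tag", "mp3 table"],
    prefix_popularity={"mp3 s": 0.6, "mp3 t": 0.4}.get,
    share_in_bucket=lambda child, members: 1 / len(members),
)
print(est)
```

The point of the design is that the expensive estimator runs once per bucket rather than once per child, trading accuracy for far fewer suggestion requests.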

Popular Keyword Extraction Algorithm: Quality Analysis
- Precision: 100%
  - All extracted keywords return the target page
- Recall: do we miss some popular keywords?
  - More difficult to measure: no ground truth to compare to
  - Estimate a lower bound on the recall
  - Google: recall > 90%
  - Yahoo!: recall = 70%-80%

Resource Usage
- ~10,000 suggestion server requests per page
- ~1,000 search engine requests per page
- 85% (Google), 75% (Yahoo!) after 25% of resources spent

ImpressionRank of News Sites (March 2009)
- Chart; keyword labels as extracted: weather cnn video obama weather cnn bristol palin news amazon movies barack obama stimulus package new york times barack obama

ImpressionRank of Social Sites (March 2009)

Conclusions
- First external algorithms for:
  - ImpressionRank estimation
  - Popular keyword extraction
- Future work:
  - Improve efficiency
  - Improve recall