Ziv Bar-Yossef (Google and Technion) and Maxim Gurevich (Technion)

Impressions and ImpressionRank. An impression of page/site x on a keyword w occurs when a user sends w to a search engine, the search engine returns x as one of the results, and the user sees the result x. The ImpressionRank of x is the number of impressions of x within a certain time frame; it is a measure of page/site visibility in a search engine. Example: each result has an impression on the keyword “www 2009”: www.2009.org, www2009.org/calls.html, www.loginconference.com, ...
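The definition above can be sketched as a count over an impression log. This is only an illustration of the definition: the log format (one `(keyword, results_seen)` pair per query instance), the function names, and the toy data are assumptions, and such a log is visible only to the search engine itself.

```python
from collections import Counter

def impression_rank(log, target):
    """Number of impressions of `target` in a hypothetical impression log.

    `log` is an iterable of (keyword, results_seen) pairs, one pair per
    query instance; `results_seen` lists the results the user saw.
    """
    return sum(1 for _keyword, results in log if target in results)

def impressions_by_keyword(log, target):
    """Break the impressions of `target` down by keyword."""
    counts = Counter()
    for keyword, results in log:
        if target in results:
            counts[keyword] += 1
    return counts

# Toy log, made up for illustration (all entries fall in one time frame).
toy_log = [
    ("www 2009", ["www2009.org", "www2009.org/calls.html"]),
    ("www 2009", ["www2009.org"]),
    ("mp3 tag", ["example.org/mp3"]),
]

rank = impression_rank(toy_log, "www2009.org")
per_keyword = impressions_by_keyword(toy_log, "www2009.org")
```

Restricting the log to a time window before counting gives the "within a certain time frame" part of the definition.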

Popular Keyword Extraction. The Popular Keyword Extraction problem: the input is a web page x and an integer k; the output is the k keywords on which x has the most impressions among all keywords. Example: for x = www.johnmccain.com, the keywords are "sarah palin", "john mccain", "cindy mccain".
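As a specification of the problem (not of the external algorithm), the output is a top-k selection over per-keyword impression counts. The sketch below assumes those counts are already known; the function name and the counts are made up for illustration.

```python
import heapq

def popular_keywords(impressions, k):
    """Return the k keywords with the most impressions, most popular first.

    `impressions` maps keyword -> impression count of the target page.
    """
    return heapq.nlargest(k, impressions, key=impressions.get)

# Hypothetical impression counts for x = www.johnmccain.com.
toy_counts = {
    "sarah palin": 50,
    "john mccain": 30,
    "cindy mccain": 10,
    "arizona": 2,
}

top3 = popular_keywords(toy_counts, 3)
```

An external algorithm has no access to such counts, which is why the rest of the talk works through search and suggestion requests instead.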

Motivation. Popularity rating of pages and sites. Site analytics: enable site owners to determine their visibility in different search engines, combine with traffic data to derive click-through rates, and compare to other sites. Keyword suggestions for online advertising. Social analysis. Search engine evaluation. Finding similar pages.

Internal Measurements of ImpressionRank and Popular Keyword Extraction. Search engines can compute both ImpressionRank and popular keywords based on their query logs. Query logs are not publicly released due to privacy concerns. Caveats: only search engines can do this, and the measurement is non-transparent.

External Measurements of ImpressionRank and Popular Keyword Extraction. Pipeline: target page URL → ImpressionRank estimator / popular keyword extractor → ImpressionRank / popular keywords. Main cost measure: the number of requests to the search engine and to the suggestion server.

Our Contributions. Reduce ImpressionRank estimation to popular keyword extraction. First external algorithm for popular keyword extraction: it is accurate and uses relatively few search engine requests. Applies to single web pages (www.cnn.com), web sites (www.cnn.com/*), and domains (*.cnn.com/*).

Related Work. Keyword extraction [Frank et al 99, Turney 00, …]. Keyword suggestions (for online advertising) [Yih et al 06, Fuxman et al 08]. Query by Document [Yang et al 09]. Commercial traffic reporting [GoogleTrends, comScore, Nielsen, Compete].

Roadmap. The naïve popular keyword extraction algorithm. The improved popular keyword extraction algorithm: Best-First Search. Experimental results.

Popular Keyword Extraction: The Naïve Algorithm. Pipeline: Target Page → Term Extractor → Term Pool → Candidate keyword generator → Candidate Verifier → Popular Keywords (figure example: mp3 tag). Verification procedure for keyword w: submit w to the search engine and the suggestion server; verify that w returns the target page; verify that the popularity of w > 0 [BG08]. Recall problem: the target page may have impressions on keywords that do not occur in its text. Efficiency problem: 10^3 terms → 10^9 3-term candidates.
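A minimal sketch of the naïve pipeline, to make the efficiency problem concrete. It assumes candidates are ordered sequences of distinct pool terms; `returns_target` and `popularity` are hypothetical callables standing in for one search engine request and one popularity estimate, and the toy data is made up.

```python
from itertools import permutations

def count_candidates(num_terms, length):
    """Upper bound on the number of `length`-term candidates from a term pool."""
    return num_terms ** length

def naive_extract(term_pool, max_len, returns_target, popularity):
    """Enumerate every candidate and keep those that pass verification.

    returns_target(w): stand-in for "w returns the target page".
    popularity(w):     stand-in for the estimated popularity of w.
    """
    popular = []
    for n in range(1, max_len + 1):
        for terms in permutations(term_pool, n):
            w = " ".join(terms)
            # Verification: w returns the target page and has popularity > 0.
            if returns_target(w) and popularity(w) > 0:
                popular.append(w)
    return popular

# 10^3 terms give 10^9 three-term candidates, each needing its own requests.
three_term_candidates = count_candidates(10**3, 3)

# Toy run with a two-term pool.
found = naive_extract(
    ["mp3", "tag"],
    2,
    returns_target=lambda w: w in {"mp3", "mp3 tag"},
    popularity=lambda w: 5 if w == "mp3 tag" else 0,
)
```

Because every candidate is verified, the request count grows with the number of candidates, not with the number of popular keywords.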

Popular Keyword Extraction: The Improved Algorithm. Pipeline: Target Page → Term Extractor → Term Pool → Candidate keyword generator (Best-First Search) → Candidate Verifier.

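Best-First Search. The candidate keywords are organized in a trie (for example, "mp3" and "weather" at the first level, with "mp3 song" and "mp3 tag" below "mp3"); the best-first search feeds candidates from the trie to the Candidate Verifier. Goals: prune as many candidates as possible, and verify the most promising candidates first. Algorithm: (1) start with single-term candidates; (2) score the candidates; (3) while the search engine request budget is not exceeded: let w be the top-scoring candidate; send w to the verifier; decide whether to prune w; if w is not pruned, expand w, that is, generate and score the children of w.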
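The loop above can be sketched with a priority queue. This is a sketch under stated assumptions, not the authors' implementation: `score`, `verify`, and `should_prune` are hypothetical callables, children are formed by appending one pool term to w, and each popped candidate is charged one request against the budget.

```python
import heapq

def best_first_search(terms, score, verify, should_prune, budget):
    """Verify candidates in decreasing score order until the budget runs out."""
    # heapq is a min-heap, so scores are negated to pop the best candidate first.
    frontier = [(-score(t), t) for t in terms]
    heapq.heapify(frontier)
    found, requests = [], 0
    while frontier and requests < budget:
        _neg_score, w = heapq.heappop(frontier)
        requests += 1  # simplification: one request per examined candidate
        if verify(w):
            found.append(w)
        if should_prune(w):
            continue  # w's children are never generated
        for t in terms:  # expand w: generate and score its children
            child = f"{w} {t}"
            heapq.heappush(frontier, (-score(child), child))
    return found

# Toy setup: the only popular keyword is "mp3 tag".
pool = ["mp3", "tag", "weather"]
toy_scores = {"mp3": 3, "mp3 tag": 2}
toy_score = lambda w: toy_scores.get(w, 1)
toy_verify = lambda w: w == "mp3 tag"
toy_prune = lambda w: not "mp3 tag".startswith(w)

result = best_first_search(pool, toy_score, toy_verify, toy_prune, budget=10)
```

With a budget of 1 the same call examines only "mp3" and returns nothing, which shows how the budget caps the work regardless of the size of the candidate space.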

Pruning. Pruning decision for keyword w: submit the query inurl: w; if there are no results, prune w and all its descendants. Retrieve suggestions for w; if there are no results, prune w and all its descendants. Pruning eliminates the vast majority of candidates: a single search/suggestion request may eliminate thousands of candidates.
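The pruning decision can be written as two checks in sequence. In this sketch `search_inurl` and `suggestions` are hypothetical callables that wrap the search query and the suggestion request and return a possibly empty list; issuing the suggestion request only when the search returned results is an assumption made here to save requests.

```python
def should_prune(w, search_inurl, suggestions):
    """Return True if w and all its descendants should be pruned."""
    if not search_inurl(w):
        return True   # no search results for w
    if not suggestions(w):
        return True   # the suggestion server returns nothing for w
    return False

# Toy stubs: record suggestion calls to show the second request is skipped.
suggestion_calls = []

def toy_suggestions(w):
    suggestion_calls.append(w)
    return ["mp3 tag editor"]

pruned_early = should_prune("mp3 zebra", lambda w: [], toy_suggestions)
kept = should_prune("mp3 tag", lambda w: ["hit"], toy_suggestions)
```

Since a pruned node's children are never generated, one negative answer removes the whole subtree below w from consideration.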

Scoring. The Best-First search algorithm considers only the top-scoring candidates given the budget. We want to predict whether the search engine returns the target page on w and whether w is a popular keyword. $\mathrm{score}(w) = \mathrm{tf}(w)^{\alpha} \cdot \mathrm{idf}(w)^{\beta} \cdot \mathrm{popularity\_score}(w)^{\gamma}$, where $\alpha$, $\beta$, and $\gamma$ are the relative weights of the scoring components. The tf and idf components predict whether the search engine returns the target page on w; popularity_score predicts the popularity of w.
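A small sketch of the scoring function. It assumes the three components are combined as a product with the relative weights applied as exponents; that combination, the default weights of 1.0, and the example values are assumptions for illustration.

```python
def score(tf, idf, popularity, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted product of the three scoring components of a candidate.

    tf, idf:    predict whether the search engine returns the target page.
    popularity: predicts the popularity of the candidate.
    alpha, beta, gamma: relative weights (hypothetical defaults).
    """
    return (tf ** alpha) * (idf ** beta) * (popularity ** gamma)

equal_weights = score(2, 3, 4)
popularity_only = score(2, 3, 4, alpha=0.0, beta=0.0)
```

Setting a weight to 0 removes the corresponding component from the ranking, which is a convenient way to see how much each component contributes.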

How to Compute Candidate Scores. Every time the algorithm expands a keyword, it needs to compute scores for all its children, and there could be thousands of such children. TF score: straightforward; no search requests needed. IDF score: approximated based on an offline corpus; no search requests needed. Popularity score: [BarYossefGurevich 08] give an algorithm for estimating keyword popularity using the query suggestion service, but it is too costly, since it may use dozens of suggestion requests per estimate. We present a new algorithm that estimates popularity for all the children in bulk: it uses hundreds of suggestion requests to estimate the popularity of all the children, and its estimates are less accurate.

Cheap Popularity Estimation. Input: a keyword w. Goal: estimate the popularity of all of w’s children. Steps: bucket the children according to their first character; estimate the relative popularity of each bucket, using the BG08 popularity estimator to estimate popularity_score(prefix); estimate the relative popularity within each bucket. Example: w = “mp3” with children “mp3 song”, “mp3 tag”, “mp3 table”, …; the bucket prefix “mp3 s” contains “mp3 song”, and the bucket prefix “mp3 t” contains “mp3 tag” and “mp3 table”.
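The bucketing idea can be sketched as follows. Everything beyond the bucketing itself is an assumption: `prefix_popularity` stands in for one call to a per-prefix popularity estimator, `relative_weight` is a hypothetical within-bucket weight, and the numbers are made up.

```python
from collections import defaultdict

def estimate_children_popularity(w, children, prefix_popularity, relative_weight):
    """Estimate the popularity of all children of w in bulk.

    Children are bucketed by the first character that follows "w ".
    Each bucket costs one prefix estimate, which is then split among the
    bucket's members in proportion to their relative weights.
    """
    buckets = defaultdict(list)
    for child in children:
        extension = child[len(w) + 1:]  # the part of the child after "w "
        buckets[extension[0]].append(child)

    estimates = {}
    for first_char, members in buckets.items():
        bucket_pop = prefix_popularity(f"{w} {first_char}")
        total = sum(relative_weight(c) for c in members)
        for c in members:
            estimates[c] = bucket_pop * relative_weight(c) / total if total else 0.0
    return estimates

# Toy numbers, for illustration only.
toy_prefix_pop = {"mp3 s": 6, "mp3 t": 9}
toy_weights = {"mp3 song": 1, "mp3 tag": 2, "mp3 table": 1}

estimates = estimate_children_popularity(
    "mp3",
    ["mp3 song", "mp3 tag", "mp3 table"],
    prefix_popularity=toy_prefix_pop.get,
    relative_weight=toy_weights.get,
)
```

The expensive estimator is called once per bucket rather than once per child, which is where the saving comes from, at the price of coarser per-child estimates.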

Popular Keyword Extraction Algorithm: Quality Analysis. Precision: 100%, since all extracted keywords return the target page. Recall: do we miss some popular keywords? This is more difficult to measure, because there is no ground truth to compare to, so we estimate a lower bound on the recall. Google: recall > 90%. Yahoo!: recall = 70%-80%.

Resource Usage. ~10,000 suggestion server requests per page. ~1,000 search engine requests per page. 85% (Google), 75% (Yahoo) after 25% of resources spent.

ImpressionRank of News Sites (March 2009). [Figure not reproduced; keyword labels from the chart: weather cnn video obama weather cnn bristol palin news amazon movies barack obama stimulus package new york times barack obama]

ImpressionRank of Social Sites (March 2009)

Conclusions. First external algorithms for ImpressionRank estimation and popular keyword extraction. Future work: improve efficiency; improve recall.
