Published by Johnathan Kennedy. Modified over 8 years ago.
1
Contextual Ranking of Keywords Using Click Data
Utku Irmak, Vadim von Brzeski, Reiner Kraft (Yahoo! Inc.)
ICDE '09, Data Mining session
2010. 04. 09.
Summarized by Park, Sung Eun, IDS Lab., Seoul National University
Presented by Park, Sung Eun, IDS Lab., Seoul National University
2
Copyright 2008 by CEBT
Contents
  Introduction
  Contextual Shortcuts
  Concept Ranking Method
    – Feature Space
    – Interestingness and Relevance of a Concept
  Evaluation
    – Cross Validation Approach, Editorial Evaluation, Real World Results
  Conclusion
3
Introduction
Determining and ranking the key concepts in a document
Goal
  Given the candidate set of entities, learn a ranking function that orders the entities by their interestingness and relevance
Applications
  Contextual advertising systems
  Text summarization
  User-centric entity detection systems
    – Detect entities and concepts within text
    – Transform the detected entities into actionable items such as “intelligent hyperlinks”
4
Contextual Shortcut
5
Contextual Shortcut: A concept vector
Concepts: a piece of text that refers to an abstract thought or idea. Ex) car insurance, justice
Generating the concept vector
  – Term vector: TF/IDF from documents in Yahoo! Search
  – Unit vector: all units found in the document
    Units are constructed from query logs in an iterative statistical approach using the frequencies of the distinct queries
  – Concept vector: the term vector and the unit vector are merged
6
Previous Concept Ranking Method: AG(TF,Unit)
1. A term appears in the term vector, but not in the unit vector
   – punish its term vector weight
2. A term appears in the unit vector, but not in the term vector
   – add this term to the concept vector with its unit weight
3. A term appears in both the term vector and the unit vector
   – sum its term vector and unit vector weights

Document Concept     AG(TF,Unit) Score   Ranking
President Bush       1.1549              1
Iraq war             1.1833              2
Political parties    0.6147              3
…
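The merge behind AG(TF,Unit) can be sketched as follows. This is a minimal illustration, not the production implementation: the function names and the `penalty` factor of 0.5 are assumptions, since the slide only says that a term missing from the unit vector is "punished".

```python
def merge_vectors(term_vec, unit_vec, penalty=0.5):
    """Merge term-vector and unit-vector weights into one concept vector.

    `penalty` is a hypothetical multiplier; the slide does not say how
    strongly a term-only entry is punished.
    """
    concept_vec = {}
    for term in set(term_vec) | set(unit_vec):
        in_term = term in term_vec
        in_unit = term in unit_vec
        if in_term and in_unit:
            # Present in both vectors: sum the two weights.
            concept_vec[term] = term_vec[term] + unit_vec[term]
        elif in_term:
            # Term vector only: punish the term-vector weight.
            concept_vec[term] = term_vec[term] * penalty
        else:
            # Unit vector only: add it with its unit weight.
            concept_vec[term] = unit_vec[term]
    return concept_vec


def rank_concepts(concept_vec):
    """Return concepts ordered by descending merged weight."""
    return sorted(concept_vec, key=concept_vec.get, reverse=True)
```

Calling `rank_concepts(merge_vectors(term_vec, unit_vec))` yields the ordering that this baseline would display for a document.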
7
Proposed Concept Ranking Method
Ranking function: SVM (Support Vector Machine)
  – SVM light: an open source library for ranking SVM
Interestingness: 9 features of a concept
Relevance: pre-mined terms of the concept
[Figure: Terms (Term 1 … Term 7, …) and Features feed into SVM light]

Concept    Interestingness   Relevance   Ranking
Concept1   I1                R1          1
Concept2   I2                R2          2
Concept3   I3                R3          3
…          …                 …           …
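SVM light reads ranking examples as text lines of the form `<target> qid:<id> <index>:<value> ...`, where examples are only compared within the same `qid`. A minimal sketch of serializing one concept is shown below; treating each document as one `qid` and using a CTR-derived value as the target are assumptions made for illustration, and the helper name is hypothetical.

```python
def to_svmlight_line(target, qid, features):
    """Format one training example in SVM light's ranking input format.

    target   -- preference value; higher means ranked higher within a qid
    qid      -- group id (assumed here: one id per document)
    features -- dict mapping 1-based feature index to feature value
    """
    # SVM light expects feature indices in increasing order.
    pairs = " ".join(f"{idx}:{val}" for idx, val in sorted(features.items()))
    return f"{target} qid:{qid} {pairs}"
```

For example, `to_svmlight_line(2, 1, {1: 3, 2: 0.5})` produces `2 qid:1 1:3 2:0.5`.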
8
Interestingness of a concept

Category                     Feature                 Details
Search Engine Query Logs     Freq exact              # of queries received that are exactly the same as the concept
                             Freq phrase contained   # of queries that contain the concept as a phrase
                             Unit score              The score in the unit vector
Search Engine Result Pages   Search engine phrase    # of result pages returned when the concept is issued as a query
Text Based Features          Concept size            # of terms in the concept
                             Number of characters    # of characters in the concept
                             Subconcepts             # of subconcepts contained in the concept
Taxonomy                     High level type         If the concept exists in one of the editorially maintained lists, use it as a feature
Others                       Wiki word count         The length of the concept's Wikipedia article
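The three text-based features are simple enough to sketch directly. The version below is illustrative only: it assumes whitespace tokenization, counts spaces as characters, and treats a subconcept as any shorter contiguous run of terms that appears in a hypothetical `known_concepts` set.

```python
def text_features(concept, known_concepts):
    """Compute the text-based interestingness features of a concept."""
    terms = concept.split()
    n = len(terms)
    subconcepts = set()
    # Enumerate every contiguous run of terms shorter than the concept itself.
    for start in range(n):
        for end in range(start + 1, n + 1):
            if end - start < n:
                candidate = " ".join(terms[start:end])
                if candidate in known_concepts:
                    subconcepts.add(candidate)
    return {
        "concept_size": n,              # number of terms
        "num_chars": len(concept),      # number of characters, spaces included
        "subconcepts": len(subconcepts),
    }
```

The query-log, result-page, taxonomy, and Wikipedia features need external data sources and are not covered by this sketch.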
9
Relevance of a Concept in a Context
A mining approach to obtain a good relevance scoring mechanism
  – Use pre-mined keywords (relevant terms) for each concept
  – The relevance of the concept can be computed based on the co-occurrence of the pre-mined keywords in the context
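One simple way to turn co-occurrence into a number is to sum the scores of the pre-mined keywords that also occur in the surrounding text. The sketch below assumes exactly that; the summation, the single-word keywords, and the lowercase alphanumeric tokenizer are illustrative choices, not details taken from the slides.

```python
import re


def relevance(context_text, relevant_terms):
    """Score a concept's relevance to a context.

    relevant_terms -- dict mapping a pre-mined (single-word, lowercase)
                      keyword of the concept to its mined score
    """
    # Tokenize the context into a set of lowercase words.
    context_tokens = set(re.findall(r"[a-z0-9]+", context_text.lower()))
    # Sum the scores of the keywords that co-occur with the concept.
    return sum(score for term, score in relevant_terms.items()
               if term in context_tokens)
```

A concept whose mined keywords never appear in the document gets a relevance of 0, which is the intended behavior for out-of-context mentions.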
10
Relevance of a Concept in a Context
Relevant term scoring
1. Search engine snippets
   – Using the Yahoo! Developer Network API
   – Treat the returned snippets as a single document and compute score = tf*idf
   – Keep the top m = 100 terms based on the score
2. Prisma query refinement tool
   – Prisma is a tool that helps users augment or replace their queries by providing feedback terms. It considers the top 50 documents in a large collection and weighs factors such as the count and position of the terms, document rank, and occurrence of query terms within the input phrase.
   – Construct a single document from the concepts returned by Prisma for concept c_i and compute the score based on the tf*idf values
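The snippet-based step (treat the snippets as one document, score by tf*idf, keep the top m terms) might look like the sketch below. The document-frequency table `doc_freq`, the corpus size `num_docs`, and the smoothed idf formula are assumptions; the slides do not say how idf is estimated, and no search API call is shown.

```python
import math
import re
from collections import Counter


def top_terms(snippets, doc_freq, num_docs, m=100):
    """Return the m highest tf*idf terms of the concatenated snippets.

    doc_freq -- dict: term -> number of corpus documents containing it
    num_docs -- total number of documents in that corpus
    """
    # Treat all returned snippets as a single document.
    tokens = re.findall(r"[a-z0-9]+", " ".join(snippets).lower())
    tf = Counter(tokens)
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]
```

The same function could be reused for the Prisma step by passing the returned concepts in place of the snippets.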
11
Relevance of a Concept in a Context
Relevant term scoring
3. Related query suggestions
   – Using the Yahoo! Developer Network API
   – 300 suggestions and the query frequencies of the suggestions
   – Let k be the number of times a term appears in the suggestion lists
[Figure panels: Snippet, Prisma, Query Suggestions]
12
Intuition of Query Suggestion and Prisma
13
Evaluation
Cross Validation Approach
Data
  – Randomly sampled news stories that were annotated by Contextual Shortcuts
  – The number of times these stories were viewed and the number of clicks received by each concept detected in the stories
  – 870 stories, 6420 concepts, 16549 sample clicks
Metric: Weighted Error Rate
  – Click-through rate (CTR) = (the number of clicks) / (the number of views)
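The CTR definition above is straightforward to compute. The weighted error rate is sketched here under a loudly stated assumption: each pair of concepts in a story counts as an error when the model orders it against the observed CTRs, and each pair is weighted by its absolute CTR difference. This weighting is one plausible reading, not a formula taken from the slides.

```python
from itertools import combinations


def ctr(clicks, views):
    """Click-through rate: number of clicks divided by number of views."""
    return clicks / views if views else 0.0


def weighted_error_rate(predicted_scores, ctrs):
    """Fraction of CTR-weighted concept pairs that are ordered wrongly.

    Assumption: pair weight = |CTR_i - CTR_j|; pairs with equal CTR are skipped.
    """
    wrong = 0.0
    total = 0.0
    for i, j in combinations(range(len(ctrs)), 2):
        diff = ctrs[i] - ctrs[j]
        if diff == 0:
            continue
        weight = abs(diff)
        total += weight
        # A pair is wrong when the predicted order disagrees with (or ties on) the CTR order.
        if (predicted_scores[i] - predicted_scores[j]) * diff <= 0:
            wrong += weight
    return wrong / total if total else 0.0
```

Under this definition, swapping two concepts with nearly identical CTRs costs little, while misplacing a heavily clicked concept costs a lot.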
14
Evaluation
NDCG (normalized discounted cumulative gain)
  – A valuable metric for applications that require high precision at the top ranks
  – Computes a score for a sorted list of k concepts on document i
  – score(j) = bucketNo(CTR(j)/100), where bucketNo() returns a bucket number between 0 and 1000 considering all the CTR values observed in the system in increasing order
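For reference, the widely used form of NDCG can be sketched as below. It assumes the common gain `2**score - 1` and a `log2(1 + j)` position discount, normalized by the DCG of the ideal ordering; the exact variant used in the evaluation is not shown on the slide, so treat these choices as assumptions. The `scores` are the per-concept values (for example, the bucketed CTR scores) listed in the order the ranker produced.

```python
import math


def dcg(scores):
    """Discounted cumulative gain of scores given in ranked order."""
    return sum((2 ** s - 1) / math.log2(1 + j)
               for j, s in enumerate(scores, start=1))


def ndcg(scores_in_ranked_order, k=None):
    """DCG of the top k, normalized by the DCG of the ideal ordering."""
    ranked = scores_in_ranked_order[:k]
    ideal = sorted(scores_in_ranked_order, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg else 0.0
```

A perfectly ordered list scores 1.0, and any ordering that pushes high-score concepts down the list scores less.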
15
Evaluation: Interestingness features
16
Evaluation: Relevance score
17
Evaluation: Interestingness Features and Relevance Score
18
Evaluation: Editorial Evaluation
1. A processed set of documents is presented to the judges.
2. A judge is asked to select a document from the pool.
3. The judge is asked to read the document and rate each entity or concept highlighted in the document in terms of its interestingness and relevance.
19
Contributions
  – We propose to use implicit user feedback in the form of click data to determine the most interesting and relevant concepts in a context via a machine learning approach.
  – We describe a feature space pertinent to the interestingness of a concept, and present algorithms to identify the relevance of a concept in a given context.
  – We evaluate the proposed techniques extensively using click data, an editorial study, and an analysis on a production system. The results show significant improvements.
  – We provide a detailed description of a framework that enables efficient implementation of the proposed techniques in a production system.
20
Discussion
  – No theoretical basis is given for the feature selection assumptions; there are no references or underlying theory at all.
  – The approach depends on technology already developed in previous studies.
  – Having access to a valuable dataset is a huge advantage.
21
Q&A
Thank you