Web Projections: Learning from Contextual Subgraphs of the Web
Jure Leskovec, CMU; Susan Dumais, MSR; Eric Horvitz, MSR

Motivation
Information retrieval traditionally considered documents as independent
Web retrieval incorporates global hyperlink relationships to enhance ranking (e.g., PageRank, HITS)
– Operates on the entire graph
– Uses just one feature (principal eigenvector) of the graph
Our work on Web projections focuses on
– contextual subsets of the web graph, in between the independent and global consideration of the documents
– a rich set of graph-theoretic properties

Web projections
Web projections: how do they work?
– Project a set of web pages of interest onto the web graph
– This creates a subgraph of the web called the projection graph
– Use the graph-theoretic properties of the subgraph for tasks of interest
Query projections
– Query results give the context (set of web pages)
– Use characteristics of the resulting graphs for predictions about search quality and user behavior
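
The projection step can be sketched as taking the subgraph induced by a set of pages. This is a minimal illustration, assuming the web graph is held as a dictionary mapping each page to the set of pages it links to; the function name `project` and the toy data are hypothetical and not from the talk.

```python
def project(web_graph, pages):
    """Return the subgraph of web_graph induced by `pages`.

    web_graph maps each node to the set of nodes it links to (directed).
    Only edges whose two endpoints are both in `pages` are kept.
    """
    keep = set(pages)
    return {u: {v for v in web_graph.get(u, ()) if v in keep} for u in keep}


# Toy web graph (hypothetical data): "x" is not part of the result set.
web = {
    "a": {"b", "x"},
    "b": {"c"},
    "c": set(),
    "x": {"c"},
}
projection = project(web, ["a", "b", "c"])
```

In this toy case the link a -> x is dropped because x is outside the result set, so the projection keeps only a -> b and b -> c.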

Query projections
Diagram stages: Query, Results, Projection on the web graph, Query projection graph, Query connection graph, Generate graphical features, Construct case library, Predictions

Questions we explore
How do query search results project onto the underlying web graph?
Can we predict the quality of search results from the projection on the web graph?
Can we predict users' behavior when issuing and reformulating queries?

Is this a good set of search results?

Will the user reformulate the query?

Resources and concepts
Web as a graph
– URL graph: nodes are web pages, edges are hyperlinks (data from March 2006; 22 million nodes, 355 million edges)
– Domain graph: nodes are domains (cmu.edu, bbc.co.uk), with a directed edge (u,v) if there exists a webpage at domain u pointing to v (data from February 2006; 40 million nodes, 720 million edges)
Contextual subgraphs for queries
– Projection graph
– Connection graph
Compute graph-theoretic features
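
As a rough sketch of how a domain graph relates to a URL graph, the snippet below collapses URL-level links into domain-level edges. Two simplifications are assumptions of this sketch and not statements from the talk: the full hostname stands in for the domain, and links within a single domain are ignored. The helper names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(url):
    # Simplification (assumption): treat the full hostname as the domain.
    return urlsplit(url).hostname


def domain_graph(url_edges):
    """Collapse (source_url, target_url) links into directed domain edges."""
    edges = set()
    for src, dst in url_edges:
        u, v = domain_of(src), domain_of(dst)
        if u != v:  # assumption: skip links that stay inside one domain
            edges.add((u, v))
    return edges


# Toy links (hypothetical data).
url_edges = [
    ("http://cmu.edu/a", "http://bbc.co.uk/news"),
    ("http://cmu.edu/b", "http://bbc.co.uk/sport"),
    ("http://cmu.edu/a", "http://cmu.edu/b"),
]
domains = domain_graph(url_edges)
```

Two page-level links between the same pair of domains become a single domain edge, which is the existence condition stated for edge (u,v).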

“Projection” graph
Example query: Subaru
Project the top 20 results returned by the search engine
Number in the node denotes the search engine rank
Color indicates relevance as assigned by a human judge:
– Perfect
– Excellent
– Good
– Fair
– Poor
– Irrelevant

“Connection” graph
Projection graph is generally disconnected
Find connector nodes
Connector nodes are existing nodes that are not part of the original result set
Ideally, we would like to introduce the fewest possible nodes to make the projection graph connected
Figure legend: connector nodes, projection nodes

Finding connector nodes
Finding connector nodes is a Steiner tree problem, which is NP-hard
Our heuristic:
– Connect the 2nd largest connected component via a shortest path to the largest
– This makes a new largest component
– Repeat until the graph is connected
Figure labels: largest component, 2nd largest component
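
The heuristic above can be sketched as follows. This is only an illustration under stated assumptions: the graph is treated as undirected and stored as a symmetric adjacency dictionary, shortest paths are found with breadth-first search in the full graph, and the loop stops early if the two largest components cannot be joined. All function names are hypothetical.

```python
from collections import deque


def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`, largest first."""
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


def shortest_path(sources, targets, adj):
    """Multi-source BFS in the full graph; returns a node list or None."""
    parent = {s: None for s in sources}
    queue = deque(sorted(sources))
    while queue:
        u = queue.popleft()
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def find_connectors(projection_nodes, adj):
    """Repeatedly join the 2nd largest component to the largest one."""
    nodes = set(projection_nodes)
    while True:
        comps = components(nodes, adj)
        if len(comps) < 2:
            break
        path = shortest_path(comps[1], comps[0], adj)
        if path is None:
            break  # simplification: give up if no path exists
        nodes.update(path)
    return nodes - set(projection_nodes)


# Toy undirected graph (hypothetical data): a-b, b-x, x-d.
adj = {"a": {"b"}, "b": {"a", "x"}, "x": {"b", "d"}, "d": {"x"}}
connectors = find_connectors(["a", "b", "d"], adj)
```

Here the projection nodes a, b and d form two components, and the shortest path from d to the larger component passes through x, so x is the only connector node added.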

Extracting graph features
The idea
– Find features that describe the structure of the graph
– Then use the features for machine learning
Want features that describe
– Connectivity of the graph
– Centrality of projection and connector nodes
– Clustering and density of the core of the graph

Examples of graph features
Projection graph
– Number of nodes/edges
– Number of connected components
– Size and density of the largest connected component
– Number of triads in the graph
Connection graph
– Number of connector nodes
– Maximal connector node degree
– Mean path length between projection/connector nodes
– Triads on connector nodes
We consider 55 features total
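
A few of the projection graph features listed above can be computed as in the sketch below. It assumes an undirected graph given as a symmetric adjacency dictionary in which every node appears as a key, reads "triads" as triangles, and uses the usual undirected density 2m / (n(n-1)); these readings are assumptions, and the talk does not give exact definitions. The brute-force triangle count is only meant for small graphs such as a 20-result projection.

```python
from itertools import combinations


def graph_features(adj):
    """Compute a handful of structural features of an undirected graph."""
    nodes = set(adj)
    edges = {frozenset((u, v)) for u in adj for v in adj[u] if u != v}

    # Connected components via depth-first search.
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)

    largest = max(comps, key=len) if comps else set()
    n = len(largest)
    m = sum(1 for e in edges if e <= largest)  # edges inside the largest component
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    # Triangles: unordered node triples that are pairwise connected.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(nodes), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "components": len(comps),
        "largest_component_size": n,
        "largest_component_density": density,
        "triangles": triangles,
    }


# Toy graph (hypothetical data): a triangle a-b-c plus an isolated node d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
features = graph_features(toy)
```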

Experimental setup
Diagram stages: Query, Results, Projection on the web graph, Query projection graph, Query connection graph, Generate graphical features, Construct case library, Predictions

Constructing case library for machine learning
Given a task of interest
Generate contextual subgraph and extract features
Each graph is labeled by target outcome
Learn statistical model that relates the features with the outcome
Make prediction on unseen graphs

Experiments overview
Given a set of search results, generate projection and connection graphs and their features
Predict quality of a search result set
– Discriminate top 20 vs. top 40 to 60 results
– Predict rating of the highest rated document in the set
Predict user behavior
– Predict queries with high vs. low reformulation probability
– Predict query transition (generalization vs. specialization)
– Predict direction of the transition

Experimental details
Features
– 55 graphical features
– Note we use only graph features, no content
Learning
– We use probabilistic decision trees (“DNet”)
Report classification accuracy using 10-fold cross validation
Compare against 2 baselines
– Marginals: predict most common class
– RankNet: use 350 traditional features (document, anchor text, and basic hyperlink features)
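
To make the Marginals baseline and the 10-fold protocol concrete, here is a small sketch that scores "predict the most common class" under k-fold cross validation. The round-robin fold assignment and the toy labels are assumptions of this sketch; the talk does not say how folds were drawn.

```python
from collections import Counter


def marginal_baseline_cv(labels, k=10):
    """Accuracy of always predicting the training majority class, k-fold."""
    correct = 0
    for fold in range(k):
        # Assumption: example i belongs to fold i % k.
        train = [y for i, y in enumerate(labels) if i % k != fold]
        test = [y for i, y in enumerate(labels) if i % k == fold]
        majority = Counter(train).most_common(1)[0][0]
        correct += sum(1 for y in test if y == majority)
    return correct / len(labels)


# Toy labels (hypothetical data): 12 "good" and 8 "poor" result sets.
labels = ["good"] * 12 + ["poor"] * 8
accuracy = marginal_baseline_cv(labels)
```

On this toy data the majority class is "good" in every fold, so the baseline accuracy equals the share of "good" labels, 0.6.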

Search results quality
Dataset:
– 30,000 queries
– Top 20 results for each
Each result is labeled by a human judge using a 6-point scale from "Perfect" to "Bad"
Task:
– Predict the highest rating in the set of results
6-class problem
2-class problem: “Good” (top 3 ratings) vs. “Poor” (bottom 3 ratings)
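
The step from per-result ratings to the two prediction targets can be sketched like this. The rating names and their ordering are an assumption borrowed from the color legend on the projection graph slide; this slide only says the scale runs from "Perfect" to "Bad". The class names `good_set` and `poor_set` are hypothetical, chosen so they are not confused with the individual ratings.

```python
# Assumed rating names, ordered from worst to best.
RATINGS = ["Irrelevant", "Poor", "Fair", "Good", "Excellent", "Perfect"]
RANK = {name: i for i, name in enumerate(RATINGS)}


def highest_rating(result_ratings):
    """6-class target: the best rating among the results in the set."""
    return max(result_ratings, key=RANK.__getitem__)


def binary_label(rating):
    """2-class target: top 3 ratings vs. bottom 3 ratings."""
    return "good_set" if RANK[rating] >= 3 else "poor_set"


# Toy result set (hypothetical data).
best = highest_rating(["Fair", "Excellent", "Poor"])
label = binary_label(best)
```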

Search quality: the task
Predict the rating of the top result in the set
Figure labels: Predict “Good”, Predict “Poor”

Search quality: results
Predict top human rating in the set
– Binary classification: Good vs. Poor
10-fold cross validation classification accuracy
Observations:
– Web Projections outperform both baseline methods
– Just projection graph already performs quite well
– Projections on the URL graph perform better
Table (rows: attributes; columns: URL Graph, Domain Graph): Marginals 0.55; RankNet; Projection; Connection; Projection + Connection; All

Search quality: the model
The learned model shows graph properties of good result sets
Good result sets have:
– Search result nodes that are hub nodes in the graph (have large degrees)
– Small connector node degrees
– Big connected component
– Few isolated nodes in projection graph
– Few connector nodes

Predict user behavior
Dataset
– Query logs for 6 weeks
– 35 million unique queries, 80 million total query reformulations
– We only take queries that occur at least 10 times
– This gives us 50,000 queries and 120,000 query reformulations
Task
– Predict whether the query is going to be reformulated
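
The frequency filter in the dataset description (keep queries that occur at least 10 times) amounts to a simple count-and-threshold step. The sketch below assumes the log is a flat list of query strings, which is an assumption about the data format; the function name is hypothetical.

```python
from collections import Counter


def frequent_queries(log, min_count=10):
    """Return the set of queries that occur at least `min_count` times."""
    counts = Counter(log)
    return {query for query, count in counts.items() if count >= min_count}


# Toy log (hypothetical data).
log = ["subaru"] * 10 + ["medline"] * 9
kept = frequent_queries(log)
```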

Query reformulation: the task
Given a query and corresponding projection and connection graphs
Predict whether query is likely to be reformulated
Figure labels: query not likely to be reformulated, query likely to be reformulated

Query reformulation: results
Observations:
– Gradual improvement as more features are used
– Using Connection graph features helps
– URL graph gives better performance
We can also predict type of reformulation (specialization vs. generalization) with 0.80 accuracy
Table (rows: attributes; columns: URL Graph, Domain Graph): Marginals 0.54; Projection; Connection; Projection + Connection; All

Query reformulation: the model
Queries likely to be reformulated have:
– Search result nodes with low degree
– Connector nodes that are hubs
– Many connector nodes
– Results that come from many different domains
– Results that are sparsely knit

Conclusion
We introduced Web projections
– A general approach of using context-sensitive sets of web pages to focus attention on a relevant subset of the web graph
– And then using rich graph-theoretic features of the subgraph as input to statistical models to learn predictive models
We demonstrated Web projections using search result graphs for
– Predicting result set quality
– Predicting user behavior when reformulating queries

Future directions
Combine with content and usage features
Explore other ways to define the context
– E.g., web pages that user recently visited
Explore the role of connector nodes
– Should they be included in the result set?
Move beyond set level prediction
– Characterize individual node’s position in the graph
– Use to enhance ranking, identify page properties
Characterize web and query dynamics
– Understand user’s search paths
– Model the evolution of communities and topics

Additional material

Projection on URL & Domain graph
Query: encyclopedia
Figure panels: URL graph, Domain graph
Domain graph projections are denser (better connected)

Projection and connection graphs
Query: Yahoo
Figure panels: Projection graph, Connection graph

Good vs. Poor result set
Good (top 20) vs. Poor (top 40 to 60)
Query: medline
Domain graph projection
Figure panels: Good result set (top 20), Poor result set (top 40 to 60)

Good vs. Poor: the task
Good (top 20) vs. Poor (top 40 to 60)
Query: Wisconsin
URL graph projection
Figure panels: Good result set, Poor result set

Good vs. Poor: performance
Project top 20 and top 40 to 60 results (ordered by human rating)
Predict whether a given graph is composed from top or bottom search results
Results:
– Gradual increase in performance
– Projections on the domain graph perform better
10-fold cross validation classification accuracy vs. attributes
Table (rows: attributes; columns: URL Graph, Domain Graph): Marginals 0.50; RankNet 0.74; Projection; Connection; Projection + Connection; All 0.88

Good vs. Poor: the model
Good result sets have:
– Few isolated and dangling nodes
– Results from fewer domains
Poor result sets are the opposite:
– Disconnected tree-like graphs with many connector nodes

Specialization vs. Generalization
Given a query transition, predict whether it is:
– Specialization (words were added)
– Generalization (words were removed)
Example query transition: Q: free house plans, then Q: house plans
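
The two transition types defined above can be told apart by comparing the word sets of the two queries. This is a minimal sketch, assuming whitespace tokenization and treating any transition that both adds and removes words as "other"; neither detail is specified in the talk, and the function name is hypothetical.

```python
def transition_type(before, after):
    """Classify a query transition by comparing word sets."""
    b, a = set(before.split()), set(after.split())
    if b < a:  # every old word kept, at least one word added
        return "specialization"
    if a < b:  # every new word was already present, at least one removed
        return "generalization"
    return "other"


# Example transitions taken from the slides.
first = transition_type("free house plans", "house plans")
second = transition_type("strawberry shortcake", "strawberry shortcake pictures")
```

Note that this only produces the label for a transition from the query text. The prediction task in the talk works from the graphs before and after the transition, not from the words.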

Predict type of query transition
Given graphs before and after transition, predict the transition type
Is the transition a specialization or a generalization?
Example query transition: Q: strawberry shortcake, then Q: strawberry shortcake pictures

Type of transition: task
Predict whether a given transition was a specialization or a generalization
Gradual increase in performance as richer attributes are used
Table (rows: attributes; columns: URL Graph, Domain Graph): Marginals 0.50; Projection; Connection; Projection + Connection; All

Type of transition: the model
Specializations:
– Decrease in number of connected components
– Decrease in number of isolated nodes
– Largest component increases
– Number of connector nodes decreases

Guess query reformulation
Given a query, predict whether it is likely to get specialized or generalized.
Results show …
Table (rows: attributes; columns: URL Graph, Domain Graph): Marginals 0.50; Projection; Connection; Projection + Connection; All

Impact and applications
Identify queries that the search engine does poorly on
Given query reformulation predictions, we know whether the user will be happy or not
Use predictions on query reformulation to
– suggest alternative queries
– spot badly formulated queries