Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements
Raju Balakrishnan (Arizona State University)

Agenda
- Trust and Relevance Based Ranking of Web Databases for the Deep Web
- Ad Ranking Considering Mutual Influences (Optimal Ad Ranking for Profit Maximization)

Deep Web Integration Problem (architecture diagram): a mediator receives the user query, forwards it to web databases, and collects the answer tuples. The deep web comprises millions of databases containing structured tuples, an uncontrolled collection of redundant information.

Source Selection in the Deep Web
Given a user query, select a subset of sources that provide the most relevant and trustworthy answers.
- Trustworthiness: degree of belief in the correctness of the data.
- Relevance: degree to which the data satisfies the information needs of the user.
Search results must be trustworthy and relevant. Surface web search combines hyperlink-based PageRank with relevance to assure trust and relevance of results.

Source Agreement Observations
- Many sources return answers to the same query.
- Comparing the semantics of the answers is facilitated by the structure of the tuples.
Idea: compare the agreement of answers returned by different sources to assess the reputation of the sources.
Agreement-based relevance and trust assessment may be intuitively understood as a meta-reviewer assessing the quality of a paper based on the agreement between the primary reviews: reviewers agreed with by other reviewers are likely to be relevant and trustworthy.

Agreement Implies Trust & Relevance
The probability that two independently selected irrelevant or false tuples agree is vanishingly small, whereas the probability that two independently picked relevant and true tuples agree is much higher. Agreement across sources is therefore evidence of both relevance and trustworthiness.

Computing Agreement between Sources
- Closely related to the record linkage problem of integrating databases without common domains (Cohen 98).
- We used a greedy matching between tuples with Jaro-Winkler similarity combined with SoftTF-IDF, since this measure performs best for named-entity matching (Cohen et al. 03).
- Agreement is computed over the top-5 answer tuples for sample queries (200 queries per domain).
- Using top-k answers, the computational complexity is O(k^2 |V|^2) per query, where V is the set of data sources.
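To make the matching step concrete, here is a minimal sketch of pairwise agreement between two sources' top-k result sets. It substitutes a simple token-overlap similarity for the Jaro-Winkler with SoftTF-IDF measure used in the actual system, and the greedy one-to-one matching threshold of 0.6 is an assumption for illustration.

```python
import re

def tuple_similarity(t1, t2):
    """Token-overlap (Jaccard) similarity between two tuples; a stand-in for
    the Jaro-Winkler + SoftTF-IDF measure used in the dissertation."""
    tokens1 = set(re.findall(r"\w+", " ".join(t1).lower()))
    tokens2 = set(re.findall(r"\w+", " ".join(t2).lower()))
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

def agreement(results_i, results_j, threshold=0.6):
    """Greedily match tuples of source j against tuples of source i and count
    how many tuples of source j are acknowledged by source i."""
    unmatched = list(results_i)
    agreed = 0
    for tj in results_j:
        best, best_sim = None, 0.0
        for ti in unmatched:
            sim = tuple_similarity(ti, tj)
            if sim > best_sim:
                best, best_sim = ti, sim
        if best is not None and best_sim >= threshold:
            agreed += 1
            unmatched.remove(best)  # each tuple of source i matches at most once
    return agreed

# Example: two book sources answering the same query with their top tuples
s1 = [("The Godfather", "Mario Puzo", "1969"),
      ("The Godfather Returns", "Mark Winegardner", "2004")]
s2 = [("Godfather, The", "Puzo, Mario", "1969"),
      ("The Sicilian", "Mario Puzo", "1984")]
print(agreement(s1, s2))  # 1: s1 agrees with one of the two tuples of s2
```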

Representation: Agreement Graph
Link semantics from S_i to S_j with weight w: S_i acknowledges a fraction w of the tuples in S_j. With R_i and R_j denoting the result sets of S_i and S_j, the link weight is w(S_i → S_j) ≈ β + (1 − β) × A(R_i, R_j)/|R_j|, where A(R_i, R_j) is the number of tuples in R_j agreed with by R_i, and the β term induces the smoothing links that account for unseen samples. (A sample agreement graph for the book sources is shown on the slide.)

Calculating SourceRank
How does a searcher use the agreement graph?
1. Start at a random node (source).
2. If the results are satisfactory, randomly traverse an agreement link, with probability proportional to its weight, to search an agreed-with database.
3. Otherwise, restart the search by traversing a smoothing link.
This is a weighted Markov random walk on the agreement graph. The SourceRank of a database is the stationary visit probability of this random walk on the corresponding vertex.
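The stationary visit probabilities can be computed by power iteration on the row-normalized agreement graph. The sketch below assumes the weight matrix already includes the smoothing links; the tolerance and iteration cap are illustrative parameters, not values from the dissertation.

```python
import numpy as np

def source_rank(weights, tol=1e-9, max_iter=1000):
    """Stationary visit probabilities of the weighted random walk.

    weights[i][j] is the (smoothed) agreement-graph weight of the link from
    source i to source j; rows are normalized into transition probabilities.
    """
    w = np.asarray(weights, dtype=float)
    transition = w / w.sum(axis=1, keepdims=True)  # row-stochastic matrix
    rank = np.full(w.shape[0], 1.0 / w.shape[0])   # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = rank @ transition               # one step of the random walk
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Toy 3-source agreement graph; sources 1 and 2 strongly acknowledge source 0
graph = [[0.05, 0.10, 0.10],
         [0.90, 0.05, 0.10],
         [0.85, 0.10, 0.05]]
print(source_rank(graph))  # source 0 gets the highest SourceRank
```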

Combining Coverage and SourceRank
Coverage of a set of tuples T w.r.t. a query q is the aggregate relevance of the tuples in T to q. Coverage is calculated over sample queries, using Jaro-Winkler with SoftTF-IDF similarity between the query and the tuple as the relevance measure. The final score of a database is a weighted combination of its Coverage and its SourceRank, and databases are ranked by this combined score.
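For illustration, the combined ranking step could look like the following; the linear combination and the value of alpha are assumptions for this sketch, since the exact weighting used on the slide is not recoverable here.

```python
def rank_databases(sources, alpha=0.1):
    """Rank databases by a weighted combination of coverage and SourceRank.

    `sources` maps a database name to (coverage, source_rank); alpha is an
    assumed weight on coverage for this sketch.
    """
    combined = lambda cov, sr: alpha * cov + (1 - alpha) * sr
    return sorted(sources, key=lambda s: combined(*sources[s]), reverse=True)

# Hypothetical scores: a high-coverage but low-agreement source ranks last
print(rank_databases({"SourceA": (0.80, 0.30),
                      "SourceB": (0.90, 0.25),
                      "SpamSource": (0.95, 0.02)}))
```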

Evaluations and Results
Evaluated on web databases in the movies and books domains listed in the UIUC TEL-8 repository, twenty-two from each domain.
Evaluation metrics:
1. Ability to remove closely related out-of-domain sources.
2. Top-5 precision (relevance evaluation).
3. Ability to remove corrupted sources (trustworthiness).
4. Time to compute the agreement graph.

1. Ranks of Out-of-Domain Sources (chart)

2. Top-5 Precision, Movies: charts for top-4 (36%) and top-8 (40%) source selection.

2. Top-5 Precision, Books: charts for top-4 and top-8 source selection.

3. Trustworthiness of Source Selection: charts for movies and books.

4. Time to Compute Agreement Graph: time vs. number of sources, and time vs. top-k tuples.

System Implementation (system architecture diagram): implemented as a web application that searches real online books and movies web databases.

Agenda
- Trust and Relevance Based Ranking of Web Databases for the Deep Web
- Ad Ranking Considering Mutual Influences (Optimal Ad Ranking for Profit Maximization)

Ad Ranking: State of the Art
Early approach: sort ads by bid amount. Current approach: sort by bid amount × relevance. In both, ads are considered in isolation, ignoring mutual influences. We instead consider ads as a set, and ranking is based on the user's browsing model.

Mutual Influences
Three manifestations of mutual influences on an ad:
1. Similar ads placed above: reduce the user's residual relevance of the ad.
2. Relevance of other ads placed above: the user may click an ad above and never view the ad.
3. Abandonment probability of other ads placed above: the user may abandon the search and never view the ad.

User's Browsing Model
The user browses down starting at the first ad. At every ad a_i he may:
- click the ad with relevance probability R(a_i);
- abandon browsing with abandonment probability γ(a_i);
- or go down to the next ad with probability 1 − R(a_i) − γ(a_i).
The process repeats for the ads below with a reduced view probability. If an ad placed above is similar to a_i, the residual relevance of a_i goes down and its abandonment probability goes up.

Expected Profit Considering Ad Similarities
Considering bid amounts $(a_i), residual relevances R_r(a_i), abandonment probabilities γ(a_i), and the similarities between ads, the expected profit from a set of n ads is

Expected Profit = Σ_{i=1..n} R_r(a_i) · $(a_i) · Π_{j=1..i−1} (1 − R_r(a_j) − γ(a_j))

THEOREM: Optimal ad placement considering similarities between the ads is NP-hard. The proof is a reduction of the independent set problem to choosing the top-k ads considering similarities.

Expected Profit Considering the Other Two Mutual Influences (2 and 3)
Dropping similarity, and hence replacing residual relevance R_r(a_i) by absolute relevance R(a_i), the expected profit becomes

Expected Profit = Σ_{i=1..n} R(a_i) · $(a_i) · Π_{j=1..i−1} (1 − R(a_j) − γ(a_j))

Ranking to maximize this expected profit is a sorting problem.
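A minimal sketch of this expected-profit computation under the browsing model (similarity dropped, so absolute relevance is used). The ad representation and the toy numbers are assumptions for illustration.

```python
def expected_profit(ads):
    """Expected profit of an ad ordering under the browsing model.

    Each ad is a (relevance R, abandonment probability gamma, bid amount) triple;
    the user views ad i only if every ad above was neither clicked nor caused
    abandonment.
    """
    profit, view_prob = 0.0, 1.0
    for r, gamma, bid in ads:
        profit += view_prob * r * bid      # expected revenue from this ad
        view_prob *= (1.0 - r - gamma)     # probability the user browses past it
    return profit

# Toy example: the same three ads in two different orders
ads = [(0.30, 0.10, 1.0), (0.20, 0.05, 2.0), (0.10, 0.40, 3.0)]
print(expected_profit(ads))         # one ordering
print(expected_profit(ads[::-1]))   # reversed ordering yields a different profit
```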

Optimal Ranking
Rank ads in descending order of

RF(a) = R(a) · $(a) / (R(a) + γ(a))

The physical meaning: RF is the profit generated per unit of consumed view probability of the ad. Ads placed above have more view probability, so placing ads that produce more profit per consumed view probability higher up is intuitively justifiable. (Refer to Balakrishnan & Kambhampati, WebDB 08, for the proof of optimality.)
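A sketch of ranking by RF, with a brute-force check on the toy ads that sorting by RF maximizes the expected profit defined above (ads use the same assumed triple format: relevance, abandonment probability, bid).

```python
from itertools import permutations

def expected_profit(ads):
    """Same browsing-model profit as in the previous sketch."""
    profit, view_prob = 0.0, 1.0
    for r, gamma, bid in ads:
        profit += view_prob * r * bid
        view_prob *= (1.0 - r - gamma)
    return profit

def rank_by_rf(ads):
    """Sort ads in descending order of RF = R * bid / (R + gamma)."""
    return sorted(ads, key=lambda a: a[0] * a[2] / (a[0] + a[1]), reverse=True)

ads = [(0.30, 0.10, 1.0), (0.20, 0.05, 2.0), (0.10, 0.40, 3.0)]
best = max(permutations(ads), key=expected_profit)   # exhaustive search over orderings
assert expected_profit(rank_by_rf(ads)) == expected_profit(best)
print(rank_by_rf(ads))
```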

Comparison to Yahoo and Google
Yahoo!: sorting by bid amount corresponds to assuming the abandonment probability γ(a) = 0, i.e., the user has infinite patience to go down the results until he finds the ad he wants.
Google: sorting by bid × relevance corresponds to assuming γ(a) = c − R(a), where c is a constant for all ads, i.e., the abandonment probability is negatively proportional to relevance.

Quantifying Expected Profit
Simulation setup: number of clicks Zipf random with exponent 1.5; abandonment probability, relevance, and bid amounts uniform random.
The proposed strategy gives the maximum profit over the entire range (45.7% and 35.9% gains annotated in the chart). The difference in profit between RF and the competing strategies is significant; the bid-amount-only strategy becomes optimal only as abandonment probabilities approach zero.
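A rough simulation sketch in the spirit of this comparison, drawing relevance, abandonment probability, and bid amounts uniformly at random; the Zipf click-count component of the slide's setup is omitted, and the numbers produced are illustrative only, not a reproduction of the chart.

```python
import random

def expected_profit(ads):
    profit, view_prob = 0.0, 1.0
    for r, gamma, bid in ads:
        profit += view_prob * r * bid
        view_prob *= (1.0 - r - gamma)
    return profit

def simulate(trials=10000, n_ads=8, seed=7):
    random.seed(seed)
    totals = {"RF": 0.0, "bid x relevance": 0.0, "bid only": 0.0}
    for _ in range(trials):
        ads = []
        for _ in range(n_ads):
            r = random.random()
            gamma = random.uniform(0.0, 1.0 - r)   # keep r + gamma <= 1
            ads.append((r, gamma, random.random()))
        # tiny epsilon guards against a degenerate zero denominator
        rf_key = lambda a: a[0] * a[2] / (a[0] + a[1] + 1e-12)
        totals["RF"] += expected_profit(sorted(ads, key=rf_key, reverse=True))
        totals["bid x relevance"] += expected_profit(sorted(ads, key=lambda a: a[0] * a[2], reverse=True))
        totals["bid only"] += expected_profit(sorted(ads, key=lambda a: a[2], reverse=True))
    return {name: total / trials for name, total in totals.items()}

print(simulate())  # average expected profit per strategy; RF should come out highest
```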

Contributions
SourceRank:
- Agreement-based computation of relevance and trust of deep web sources.
- System implementation to search the deep web, and formal evaluation.
Ad Ranking:
- Extended the expected-profit model of ads based on the browsing model, considering mutual influences.
- Optimal ad ranking considering mutual influences other than ad similarities.
Thank You!

Deep Web Integration Roadmap