GDG DevFest Central Italy 2013. Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google)

GDG DevFest Central Italy

Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

The AdWords Problem (example query: Soccer Shoes)

Google Advertisement in Numbers. Over a billion queries a day. A lot of advertisers.

Challenges. Several scientific and technological challenges. How to find the best ads in real time? How to price each ad? How to suggest new queries to advertisers? The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).

Google Advertisement in Numbers. 2012 revenues: 46 billion USD. 95% advertisement: 43 billion USD.

Goals of the Project. Tackle AdWords data to automatically identify, for each advertiser, its main competitors, and to suggest relevant queries to each advertiser. Goals: useful business information; improved advertisement; more relevant performance benchmarks.

Information Deluge. Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments with very different advertisers.

Query: Nike store New York. Information: Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks.
Query: Soccer shoes. Information: Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks.
Query: Soccer ball. Information: Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks.
… millions of other queries …

Representing the data. How to represent the salient features of the data? Relationships between advertisers and queries. Statistics: clicks, costs, etc. Take into account the categories. Efficient algorithms.

Graphs: the lingua franca of Big Data. Mathematical objects studied well before the history of computers: the Königsberg bridges problem (Euler, 1735).

Graphs: the lingua franca of Big Data. Graphs are everywhere! Social networks, technological networks, natural networks.

Graphs: the lingua franca of Big Data. Formal definition: a set of nodes (A, B, C, D in the figure) and a set of edges. The edges might have a weight.
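The formal definition above can be sketched in a few lines of Python. This is only an illustration: the class name `WeightedGraph`, its methods, and the example weights are assumptions made here, not something taken from the slides.

```python
from collections import defaultdict


class WeightedGraph:
    """Undirected weighted graph stored as an adjacency map (illustrative only)."""

    def __init__(self):
        # adj[u][v] holds the weight of the edge between u and v
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight=1.0):
        # Store the edge in both directions because the graph is undirected
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def nodes(self):
        return set(self.adj)

    def edges(self):
        # Report each undirected edge once, as (smaller node, larger node, weight)
        return {
            (min(u, v), max(u, v), w)
            for u, nbrs in self.adj.items()
            for v, w in nbrs.items()
        }


g = WeightedGraph()
g.add_edge("A", "B", 2.0)   # made-up weights
g.add_edge("A", "C", 1.0)
g.add_edge("C", "D", 0.5)
```

An adjacency map keeps the neighbors of a node directly accessible, which is the access pattern random-walk algorithms rely on.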

AdWords data as a (Bipartite) Graph. A lot of advertisers, billions of queries, hundreds of labels.

Semi-Formal Problem Definition. A bipartite graph of Advertisers and Queries, a node A, and a set of Labels. Goal: Find the nodes most “similar” to A.
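As a rough sketch of this setup, the data can be held as labeled advertiser-query edges, and the graph for a chosen set of labels built by filtering. The advertiser names, the record layout `(advertiser, query, label, clicks)`, and the helper `build_bipartite` are all hypothetical; the slides do not say where exactly the labels attach or how the data is stored.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, query, label, clicks)
EDGES = [
    ("adv_1", "soccer shoes", "Apparel", 4),
    ("adv_2", "soccer shoes", "Apparel", 7),
    ("adv_1", "soccer ball", "Equipment", 5),
    ("adv_3", "soccer ball", "Equipment", 2),
]


def build_bipartite(edges, labels):
    """Keep only edges whose label is in `labels`; weights are click counts."""
    adj = defaultdict(dict)
    for adv, query, label, clicks in edges:
        if label not in labels:
            continue
        a, q = ("adv", adv), ("query", query)   # tag nodes with their side
        adj[a][q] = adj[a].get(q, 0) + clicks
        adj[q][a] = adj[q].get(a, 0) + clicks
    return dict(adj)


apparel = build_bipartite(EDGES, {"Apparel"})
full = build_bipartite(EDGES, {"Apparel", "Equipment"})
```

Restricting to a label subset changes which advertisers are reachable from A, which is why the set of labels is part of the problem input.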

How to Define Similarity? There are several node similarity measures in the literature, based on the graph structure, random walks, etc. What is the accuracy? Can it scale to graphs with billions of nodes? Can it be computed in real time?

The three ingredients of Big Data. A lot of data… A sophisticated infrastructure: MapReduce. Efficient algorithms: graph mining.

MapReduce

The work is spread in parallel across several machines connected by fast links.
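To make the programming model concrete, here is a single-process imitation of the map, shuffle, and reduce phases. It is a sketch only: the function names and the click-counting task are assumptions chosen for illustration, and a real MapReduce system would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (advertiser, clicks) for one hypothetical log record."""
    advertiser, _query, clicks = record
    yield advertiser, clicks


def reduce_phase(key, values):
    """Sum all click counts that share the same advertiser key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:               # map: each record handled independently
        for key, value in mapper(record):
            groups[key].append(value)    # shuffle: group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())  # reduce per key


records = [
    ("adv_1", "soccer shoes", 4),
    ("adv_2", "soccer shoes", 7),
    ("adv_1", "soccer ball", 5),
]
totals = run_mapreduce(records, map_phase, reduce_phase)
```

Because each map call sees one record and each reduce call sees one key, both phases can be split across machines without coordination beyond the shuffle.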

Algorithms. Personalized PageRank: random walks on the graph. Closely related to the celebrated Google PageRank™.

Personalized PageRank

Idea: perform a very long random walk (starting from v). Ranking nodes by probability of visit assigns a similarity score to each node w.r.t. node v. Strong community bias (this can be formalized).
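The idea can be simulated directly. In the standard formulation of Personalized PageRank the walk jumps back to v with a fixed probability at every step, which is what keeps the scores tied to v; the sketch below assumes that formulation. The restart probability of 0.15, the step count, and the toy graph are assumptions made for illustration.

```python
import random
from collections import Counter


def random_walk_scores(adj, source, steps=50_000, restart=0.15, seed=0):
    """Estimate similarity to `source` by visit frequency of a restarting walk."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    visits = Counter()
    node = source
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart or not adj[node]:
            node = source              # jump back to the starting node
        else:
            neighbors = list(adj[node])
            weights = [adj[node][n] for n in neighbors]
            # Move to a neighbor with probability proportional to edge weight
            node = rng.choices(neighbors, weights=weights, k=1)[0]
    return {n: count / steps for n, count in visits.items()}


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d
adj = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
}
scores = random_walk_scores(adj, "a")
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this toy graph the nodes in the same triangle as `a` are visited more often than those in the other triangle, a small instance of the community bias mentioned above.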

Personalized PageRank. Exact computation is infeasible (O(n^3)), but it can be approximated very well. Very efficient MapReduce algorithm scaling to large graphs (hundreds of millions of nodes). However…
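One common way to approximate the scores is power iteration, sketched below on a single machine. This is not the MapReduce algorithm from the talk, which the slides do not describe; the function name, the restart probability, and the toy graph are assumptions. Each iteration touches every edge once, and the error shrinks by a factor of (1 - restart) per iteration.

```python
def personalized_pagerank(adj, source, restart=0.15, iterations=50):
    """Approximate Personalized PageRank scores w.r.t. `source` by power iteration."""
    scores = {node: 0.0 for node in adj}
    scores[source] = 1.0               # all probability mass starts at the source
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        nxt[source] += restart         # mass that jumps back to the source
        for node, mass in scores.items():
            total = sum(adj[node].values())
            if total == 0:
                # A node without edges sends its remaining mass back to the source
                nxt[source] += (1 - restart) * mass
                continue
            for nbr, weight in adj[node].items():
                nxt[nbr] += (1 - restart) * mass * weight / total
        scores = nxt
    return scores


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d
adj = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
}
scores = personalized_pagerank(adj, "a")
```

The per-node update only needs the scores of a node's neighbors, which is the property that makes this style of computation a natural fit for a distributed setting.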

Algorithmic Bottleneck. Our graphs are simply too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).
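The "exponential time" remark is simple counting: k categories have 2^k - 1 non-empty subsets, while there are only k individual categories. The helper name and the value k = 100 below are assumptions for illustration.

```python
def precomputation_counts(num_categories):
    """Return (rankings needed for every non-empty subset, rankings needed per category)."""
    return 2 ** num_categories - 1, num_categories


all_subsets, per_category = precomputation_counts(100)
```

With 100 categories the first number already exceeds 10^30, so storing one ranking per subset is out of the question, while one ranking per category stays manageable.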

1st idea: Tackling Real Graph Structure. Data size is the main bottleneck. Compressing the graph would speed up the computation.

1st idea: Tackling Real Graph Structure. [Figure: step 1, from the graph of advertisers and queries to a graph with only advertisers; step 2, from the advertiser-only graph to the ranking of the entire graph.]

1st idea: Tackling Real Graph Structure. Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer ’89, etc.).
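A small numerical check suggests why an exact collapse is plausible on a bipartite graph. This is a textbook-style identity worked out here, not necessarily the construction or theorem from the talk. Assumptions: the walk starts at an advertiser, every node has at least one edge, and the restart probability is alpha. Two steps of the walk (advertiser to query to advertiser) then form a chain on advertisers only, and Personalized PageRank on that chain with restart alpha * (2 - alpha) matches the advertiser scores of the full graph after rescaling. All names and weights are made up.

```python
def transition(adj):
    """Row-normalize edge weights into transition probabilities."""
    return {
        u: {v: w / sum(nbrs.values()) for v, w in nbrs.items()}
        for u, nbrs in adj.items()
    }


def ppr(P, source, restart, iterations=200):
    """Personalized PageRank by power iteration on transition probabilities P."""
    scores = {u: 0.0 for u in P}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in P}
        nxt[source] = restart
        for u, mass in scores.items():
            for v, p in P[u].items():
                nxt[v] += (1 - restart) * mass * p
        scores = nxt
    return scores


# Toy bipartite graph: advertisers A, B, C and queries a, b, c (made-up clicks)
adv_to_query = {"A": {"a": 4, "b": 1}, "B": {"b": 2, "c": 3}, "C": {"c": 1}}
adj = {adv: dict(qs) for adv, qs in adv_to_query.items()}
for adv, qs in adv_to_query.items():
    for q, w in qs.items():
        adj.setdefault(q, {})[adv] = w
P = transition(adj)

# Step 1: collapse to advertisers only (two-step transition probabilities)
collapsed = {}
for a in adv_to_query:
    row = {}
    for q, p1 in P[a].items():
        for a2, p2 in P[q].items():
            row[a2] = row.get(a2, 0.0) + p1 * p2
    collapsed[a] = row

alpha = 0.15
full = ppr(P, "A", alpha)                        # ranking on the entire graph
small = ppr(collapsed, "A", alpha * (2 - alpha)) # ranking on advertisers only
adv_mass = sum(full[a] for a in adv_to_query)

# Step 2: recover query scores from advertiser scores with one more step
query_scores = {}
for a in adv_to_query:
    for q, p in P[a].items():
        query_scores[q] = query_scores.get(q, 0.0) + (1 - alpha) * (small[a] * adv_mass) * p
```

In this toy case the advertiser-only graph has half the nodes of the full graph, yet the scores of all six nodes are recovered from it.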

Algorithmic Bottleneck. Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).

Two-stage Approach. First stage: large-scale (but feasible) MapReduce pre-computation. Second stage: fast iterative algorithm.

First Stage: Individual Category Rankings. [Figure: the graph of Advertisers and Queries, with Precomputed Rankings for each individual category.]

Second Stage: Rank aggregation. [Figure: two Precomputed Rankings are combined into the Ranking of Red + Yellow.] A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.
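The slides do not describe the iterative aggregation algorithm, so the sketch below uses a different and much simpler stand-in: a weighted average of per-category score tables. It only illustrates the shape of the second stage (look up precomputed rankings, combine them on request). The function name, the uniform default weights, and the scores are assumptions.

```python
def average_scores(per_category_scores, weights=None):
    """Combine per-category score tables by weighted averaging (naive stand-in)."""
    if weights is None:
        # Default assumption: every selected category counts equally
        weights = [1.0 / len(per_category_scores)] * len(per_category_scores)
    combined = {}
    for table, weight in zip(per_category_scores, weights):
        for node, score in table.items():
            combined[node] = combined.get(node, 0.0) + weight * score
    return combined


# Hypothetical precomputed scores of one advertiser for two categories
red = {"x": 0.6, "y": 0.4}
yellow = {"y": 0.5, "z": 0.5}
combined = average_scores([red, yellow])
ranking = sorted(combined, key=combined.get, reverse=True)
```

Averaging ignores how the categories interact inside the graph, which is presumably why the talk relies on an iterative algorithm instead.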

Algorithmic Bottleneck. Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).

Conclusions. Experimental evaluation shows the accuracy of the results. Fully implemented and currently under evaluation for integration in production systems. Ongoing research project for future scientific publications.