GDG DevFest Central Italy 2013. Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google)


1 GDG DevFest Central Italy 2013

2 Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

3 The AdWords Problem

4 ?

5 ?

6 Soccer Shoes

7 The AdWords Problem Soccer Shoes

8 Google Advertisement in Numbers Over a billion queries a day. A lot of advertisers. www.google.com/competition/howgooglesearchworks.html

9 Challenges Several scientific and technological challenges. How to find the best ads in real time? How to price each ad? How to suggest new queries to advertisers? The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).

10 Google Advertisement in Numbers 2012 Revenues: 46 billion USD. 95% Advertisement: 43 billion USD. http://investor.google.com/financial/tables.html

11 Goals of the Project Tackling AdWords data to identify automatically, for each advertiser, its main competitors and suggest relevant queries to each advertiser. Goals: Useful business information. Improve advertisement. More relevant performance benchmarks.

12 Information Deluge Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments with very different advertisers.
Query | Information
Nike store New York | Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks
Soccer shoes | Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks
Soccer ball | Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks
… millions of other queries …

13 Representing the data How to represent the salient features of the data? Relationships between advertisers and queries Statistics: clicks, costs, etc. Take into account the categories. Efficient algorithms.

14 Graphs: the lingua franca of Big Data Mathematical objects studied well before the history of computers. Königsberg’s bridges problem. Euler, 1735.

15 Graphs: the lingua franca of Big Data Graphs are everywhere! Social Networks Technological Networks Natural Networks

16 Graphs: the lingua franca of Big Data Formal definition A B C D A set of Nodes

17 Graphs: the lingua franca of Big Data Formal definition A B C D A set of Edges

18 Graphs: the lingua franca of Big Data Formal definition A B C D The edges might have a weight 1 4 2 3
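
The formal definition on slides 16 to 18 (nodes, edges, optional weights) can be sketched in a few lines of Python. The node names A to D and the weights 1, 4, 2, 3 come from the slide; which pair of nodes each weight belongs to is not recoverable from the transcript, so the edge list below is an assumption made only for illustration.

```python
# Assumed edge list: the slide gives nodes A-D and weights 1, 4, 2, 3,
# but not which edge carries which weight.
edges = [("A", "B", 1), ("A", "C", 4), ("B", "D", 2), ("C", "D", 3)]


def build_graph(edge_list):
    """Build an undirected weighted graph as a dict of dicts."""
    graph = {}
    for u, v, weight in edge_list:
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


graph = build_graph(edges)
nodes = sorted(graph)          # the set of nodes
neighbors_of_a = graph["A"]    # edges leaving A, with their weights
```

A dict of dicts keeps neighbor lookup and weight lookup both constant-time, which is convenient for the random-walk computations discussed later in the talk.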

19 AdWords data as a (Bipartite) Graph A lot of Advertisers Billions of Queries Hundreds of Labels

20 Semi-Formal Problem Definition Advertisers Queries

21 Semi-Formal Problem Definition A Advertisers Queries

22 Semi-Formal Problem Definition A Advertisers Queries Labels:

23 Semi-Formal Problem Definition A Advertisers Queries Labels:

24 Semi-Formal Problem Definition A Advertisers Queries Labels: Goal: Find the nodes most “similar” to A.
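
As a rough illustration of the setting on slides 19 to 24, the sketch below stores advertiser-query edges with a click count and a label, and extracts the bipartite subgraph for a chosen set of labels. The record layout, the advertiser names, and the function `bipartite_subgraph` are hypothetical; the talk does not describe how the data is actually stored.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, query, label, clicks).
records = [
    ("adv_A", "soccer shoes", "Apparel", 4),
    ("adv_A", "soccer ball", "Equipment", 5),
    ("adv_B", "soccer shoes", "Apparel", 7),
    ("adv_C", "soccer ball", "Equipment", 2),
]


def bipartite_subgraph(rows, labels):
    """Keep only edges whose label is in `labels`; index them from both sides."""
    adv_to_query = defaultdict(dict)
    query_to_adv = defaultdict(dict)
    for advertiser, query, label, clicks in rows:
        if label in labels:
            adv_to_query[advertiser][query] = clicks
            query_to_adv[query][advertiser] = clicks
    return dict(adv_to_query), dict(query_to_adv)


adv_side, query_side = bipartite_subgraph(records, {"Apparel"})
```

Restricting to `{"Apparel"}` leaves only the advertisers that share the query "soccer shoes", which is the kind of neighborhood in which nodes "similar" to A would be sought.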

25 How to Define Similarity? Several node similarity measures in the literature based on the graph structure, random walks, etc. What is the accuracy? Can it scale to graphs with billions of nodes? Can it be computed in real time?

26 The three ingredients of Big Data A lot of data… A sophisticated infrastructure: MapReduce Efficient algorithms: Graph mining

27 MapReduce

28 The work is spread across several machines in parallel connected with fast links.
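
The MapReduce model itself can be imitated in a single process to show its three phases (map, shuffle, reduce). This is only a toy sketch with made-up click records and hypothetical function names; a real deployment distributes each phase across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    query, clicks = record
    yield query, clicks


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up click log: (query, clicks).
log = [("soccer shoes", 4), ("soccer ball", 5), ("soccer shoes", 6)]
totals = run_mapreduce(log)
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both can run in parallel on different machines, which is what makes the model scale.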

29 Algorithms Personalized PageRank: Random walks on the graph Closely related to the celebrated Google PageRank™.

30 Personalized PageRank

43 Idea: perform a very long random walk starting from v. Ranking nodes by their probability of being visited assigns each node a similarity score w.r.t. node v. Strong community bias (this can be formalized).
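
The random-walk idea can be simulated directly. The sketch below uses the standard formulation of Personalized PageRank, in which the walk jumps back to v with a fixed probability at every step; the restart probability of 0.15, the step count, and the toy graph of two triangles joined by one edge are assumptions for illustration, not values from the talk.

```python
import random
from collections import Counter


def personalized_pagerank_mc(graph, start, steps=20000, restart=0.15, seed=0):
    """Estimate visit probabilities of a random walk with restart at `start`."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart or not graph[node]:
            node = start                      # jump back to the source node
        else:
            node = rng.choice(graph[node])    # follow a random edge
    return {n: visits[n] / steps for n in graph}


# Toy graph: two triangles (a1-a3 and b1-b3) joined by the edge a3-b1.
graph = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a3"],
    "a3": ["a1", "a2", "b1"],
    "b1": ["a3", "b2", "b3"],
    "b2": ["b1", "b3"],
    "b3": ["b1", "b2"],
}
scores = personalized_pagerank_mc(graph, "a1")
ranking = sorted(scores, key=scores.get, reverse=True)
```

Starting from `a1`, most of the visit probability stays inside the first triangle, which is the community bias mentioned on the slide.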

44 Personalized PageRank Exact computation is infeasible (O(n^3)), but it can be approximated very well. A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes). However…
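
One common way to approximate Personalized PageRank without an O(n^3) matrix inversion is power iteration: repeatedly push probability mass along the edges until the scores stop changing. The sketch below is a single-machine version under assumed parameters (restart probability 0.15, a made-up weighted graph); it is not the MapReduce algorithm referred to on the slide, whose details are not given.

```python
def personalized_pagerank(graph, start, restart=0.15, tol=1e-10, max_iter=1000):
    """Power iteration on a weighted graph given as {node: {neighbor: weight}}."""
    scores = {n: 0.0 for n in graph}
    scores[start] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in graph}
        nxt[start] += restart                       # restart mass goes to the source
        for u, mass in scores.items():
            total = sum(graph[u].values())
            if total == 0:                          # dangling node: send mass back
                nxt[start] += (1 - restart) * mass
                continue
            for v, weight in graph[u].items():
                nxt[v] += (1 - restart) * mass * weight / total
        delta = sum(abs(nxt[n] - scores[n]) for n in graph)
        scores = nxt
        if delta < tol:
            break
    return scores


# Made-up weighted graph: two triangles joined by a light edge a3-b1.
weighted = {
    "a1": {"a2": 2, "a3": 2},
    "a2": {"a1": 2, "a3": 2},
    "a3": {"a1": 2, "a2": 2, "b1": 1},
    "b1": {"a3": 1, "b2": 2, "b3": 2},
    "b2": {"b1": 2, "b3": 2},
    "b3": {"b1": 2, "b2": 2},
}
ppr = personalized_pagerank(weighted, "a1")
```

Each iteration touches every edge once, so the cost per iteration is linear in the number of edges, and each round is a natural fit for one map-and-reduce pass over the edge list.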

45 Algorithmic Bottleneck Our graphs are simply too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).

46 1st idea: Tackling Real Graph Structure Data size is the main bottleneck. Compressing the graph would speed up the computation.

47 1st idea: Tackling Real Graph Structure [Figure: (1) from the graph of advertisers (A, B) and queries (a to g) to a graph of only advertisers.]

48 1st idea: Tackling Real Graph Structure [Figure: (1) from the graph of advertisers (A, B) and queries (a to g) to a graph of only advertisers; (2) from the graph of only advertisers to the ranking of the entire graph.]

49 1st idea: Tackling Real Graph Structure Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer ’89, etc.).
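
The talk does not spell out how the advertiser-only graph is built. One simple way to obtain such a graph, shown here purely as an assumption, is to collapse every two-step walk advertiser → query → advertiser into a single weighted advertiser → advertiser transition. The function name and the data are hypothetical, and this sketch does not by itself reproduce the theorem on the slide.

```python
def collapse_queries(adv_to_query, query_to_adv):
    """Return advertiser-to-advertiser two-step transition probabilities."""
    collapsed = {}
    for advertiser, queries in adv_to_query.items():
        out_total = sum(queries.values())
        row = {}
        for query, weight in queries.items():
            back = query_to_adv[query]
            back_total = sum(back.values())
            for other, back_weight in back.items():
                step = (weight / out_total) * (back_weight / back_total)
                row[other] = row.get(other, 0.0) + step
        collapsed[advertiser] = row
    return collapsed


# Made-up bipartite graph with unit weights.
adv_to_query = {"A": {"q1": 1, "q2": 1}, "B": {"q1": 1}}
query_to_adv = {"q1": {"A": 1, "B": 1}, "q2": {"A": 1}}
advertiser_graph = collapse_queries(adv_to_query, query_to_adv)
```

Each row of `advertiser_graph` is a probability distribution over advertisers, so a random walk can run on it directly while touching far fewer nodes than the full advertiser-query graph.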

50 Algorithmic Bottleneck Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).

51 Two-stage Approach First stage: Large-scale (but feasible) MapReduce pre-computation. Second Stage: Fast iterative algorithm.

52 First Stage: Individual Category Rankings Advertisers Queries

53 First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings

54 First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings Precomputed Rankings

55 First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings Precomputed Rankings Precomputed Rankings

56 Second Stage: Rank aggregation Precomputed Rankings Precomputed Rankings Ranking of Red + Yellow A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.
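
The iterative aggregation algorithm itself is not described in the transcript. As a deliberately naive stand-in, the sketch below combines precomputed per-category scores for one node with a weighted sum and re-sorts; the category names, scores, and uniform weights are all made up, and the real second stage is presumably more refined than this.

```python
def aggregate_rankings(per_category_scores, categories, weights=None):
    """Naive aggregation: weighted sum of per-category scores, then sort."""
    if weights is None:
        weights = {c: 1.0 / len(categories) for c in categories}
    combined = {}
    for category in categories:
        for node, score in per_category_scores[category].items():
            combined[node] = combined.get(node, 0.0) + weights[category] * score
    # Highest combined score first; ties broken by node name.
    return sorted(combined, key=lambda n: (-combined[n], n))


# Made-up precomputed rankings of other advertisers with respect to one node.
precomputed = {
    "red": {"B": 0.6, "C": 0.4},
    "yellow": {"C": 0.7, "D": 0.3},
    "blue": {"D": 1.0},
}
red_plus_yellow = aggregate_rankings(precomputed, ["red", "yellow"])
```

The point of the two-stage design survives even in this toy form: only one ranking per category is stored, and any subset of categories is answered at query time instead of precomputing all exponentially many subsets.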

57 Algorithmic Bottleneck Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).

58 Conclusions Experimental evaluation shows the accuracy of the results. Fully implemented and currently under evaluation for integration in production systems. Ongoing research project for future scientific publications.
