
1
Online Advertising Open lecture at Warsaw University February 25/26, 2011 Ingmar Weber Yahoo! Research Barcelona Please interrupt me at any point!

2
Disclaimers & Acknowledgments This talk presents the opinions of the author. It does not necessarily reflect the views of Yahoo! Inc. or any other entity. Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo! or any other company. Many of the slides in this lecture are based on tables/graphs from the referenced papers. Please see the actual papers for more details.

3
Review from last lecture Lots of money –Ads essentially pay for the WWW Mostly sponsored search and display ads –Sp. search: sold using variants of GSP –Disp. ads: sold in GD contracts or on the spot Many computational challenges –Finding relevant ads, predicting CTRs, new/tail content and queries, detecting fraud, …

4
Plan for today and tomorrow So far –Mostly introductory, “text book material” Now –Mostly recent research papers –Crash course in machine learning, information retrieval, economics, … Hopefully more “think-along” (not sing-along) and not “shut-up-and-listen”

5
But first … Third party cookies (many others …)

6
Efficient Online Ad Serving in a Display Advertising Exchange Kevin Lang, Joaquin Delgado, Dongming Jiang, et al. WSDM’11

7
Not-so-simple landscape for display advertising (diagram): Advertisers (“Buy shoes at nike.com”, “Visit asics.com today”, “Rolex is great.”), Publishers (a running blog, The Legend of Cliff Young, celebrity gossip), Users (32m likes running, 50f loves watches, 16m likes sports). Basic problem: Given a (user, publisher) pair, find a good ad(vertiser)

8

9
Ad networks and Exchanges Ad networks –Bring together supply (publishers) and demand (advertisers) –Have bilateral agreements via revenue sharing to increase market fluidity Exchanges –Do the actual real-time allocation –Implement the bilateral agreements

10
User constraints: no alcohol ads to minors. Supply constraints: conservative network doesn’t want left publishers. Demand constraints: premium blogs don’t want spammy ads. Example: a middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1). D only known at end.

11
Valid Paths & Objective Function

12
Algorithm A Worst case running time? Typical running time? Depth-first search enumeration
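The depth-first enumeration behind Algorithm A can be sketched on a tiny hypothetical exchange graph (the node names and edges here are invented for illustration; the paper's graphs carry constraints and revenue-sharing terms on each edge):

```python
# Hypothetical ad-exchange graph: nodes are networks, edges are bilateral
# agreements; the source is a publisher, the sinks are advertisers.
GRAPH = {
    "P1": ["N1", "N2"],
    "N1": ["N2", "A1"],
    "N2": ["A1", "A2"],
}

def enumerate_paths(graph, node, sinks, path=None):
    """Depth-first enumeration of all source-to-sink paths.
    Worst-case running time is exponential in the number of nodes,
    since the number of distinct paths can be exponential."""
    path = (path or []) + [node]
    if node in sinks:
        yield path
        return
    for nxt in graph.get(node, []):
        if nxt not in path:  # avoid revisiting a node on the current path
            yield from enumerate_paths(graph, nxt, sinks, path)

paths = list(enumerate_paths(GRAPH, "P1", {"A1", "A2"}))
```

Even this 5-node graph yields 5 distinct publisher-to-advertiser paths, which is why the paper's Algorithm B adds pruning.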

13
Algorithm B Worst case running time? Sum vs. product? Optimizations? D pruning Upper bound Why? US pruning

14
Reusable Precomputation What if space limitations? How would you prioritize? Cannot fully enforce D Depends on reachable sink … … which depends on U

15
Experiments – artificial data

16
Experiments – real data

17
Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results Christian Danescu-Niculescu-Mizil, Andrei Broder, et al. WWW’10 What would you investigate? What would you suspect?

18
Things to look at General bias for near-identical things –Ads are preferred (as further “North”) –Organic results are preferred Interplay between ad CTR and result CTR –Better search results, less ad clicks? –Mutually reinforcing? Dependence on type –Navigational query vs. informational query –Responsive ad vs. incidental ad

19
Data One month of traffic for a subset of Y! search servers Only North ads, served at least 50 times For each query q_i: the most clicked ad A_i* and the most clicked organic result O_i* 63,789 (q_i, O_i*, A_i*) triples Bias?

20
(Non-)Commercial bias? Look at A* and O* with identical domain Probably similar quality … … but the (North) ad is higher What do you think? In 52% ctrO > ctrA

21
Correlation (plots of av. ctrA vs. ctrO, and av. ctrO vs. ctrA) For a given (range of) ctrO, bucket all ads.

22
Navigational vs. non-navigational (plots of av. ctrO vs. ctrA, and av. ctrA vs. ctrO) Navigational: antagonistic effect Non-navigational: (mild) reinforcement

23
Dependence on similarity Bag-of-words overlap of title terms: overlap(“Free Radio”, “Pandora Radio – Listen to Free Internet Radio, Find New Music”) = 2/9
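The 2/9 above is the Jaccard overlap of the two title term sets: the union holds 9 distinct terms, of which 2 ("free", "radio") are shared. A minimal sketch (the exact tokenization is an assumption):

```python
import re

def title_overlap(a, b):
    """Jaccard overlap between bag-of-words title term sets."""
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    return len(ta & tb) / len(ta | tb)

sim = title_overlap(
    "Free Radio",
    "Pandora Radio - Listen to Free Internet Radio, Find New Music")
# {free, radio} is shared; the union has 9 distinct terms, so sim = 2/9
```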

24
Dependence on similarity (plot of av. ctrA vs. title overlap)

25
A simple model Want to model Also need:

26
A simple model Explains basic (quadratic) shape of overlap vs. ad click-through-rate

27
Improving Ad Relevance in Sponsored Search Dustin Hillard, Stefan Schroedl, Eren Manavoglu, et al. WSDM’10

28
Ad relevance Ad attractiveness Relevance –How related is the ad to the search query –q=“cocacola”, ad=“Buy Coke Online” Attractiveness –Essentially click-through rate –q=“cocacola”, ad=“Coca Cola Company Job” –q=*, ad=“Lose weight fast and easy” Hope: decoupling leads to better (cold-start) CTR predictions

29
Basic setup Get relevance from editorial judgments –Perfect, excellent, good, fair, bad –Treat non-bad as relevant Machine learning approach –Compare query to the ad –Title, description, display URL –Word overlap (uni- and bigram), character overlap (uni- and bigram), cosine similarity, ordered bigram overlap –Query length Data –7k unique queries (stratified sample) –80k query-ad judged relevant pairs

30
Basic results – text only What other features? Precision = (“said ‘yes’ and was ‘yes’”)/(“said ‘yes’”) Recall = (“said ‘yes’ and was ‘yes’”)/(“was ‘yes’”) Accuracy = (“said the right thing”)/(“said something”) F1-score = 2/(1/P + 1/R) harmonic mean < arithmetic mean
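The four metrics on this slide, spelled out on hypothetical confusion counts (the counts are invented for illustration):

```python
def prf1(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from confusion counts."""
    precision = tp / (tp + fp)   # said 'yes' and was 'yes' / said 'yes'
    recall = tp / (tp + fn)      # said 'yes' and was 'yes' / was 'yes'
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of P and R
    return precision, recall, accuracy, f1

p, r, acc, f1 = prf1(tp=40, fp=10, fn=20, tn=30)
# The harmonic mean never exceeds the arithmetic mean (p + r) / 2,
# so F1 punishes an imbalance between precision and recall.
```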

31
Incorporating user clicks Can use historic CTRs –Assumes (ad,query) pair has been seen Useless for new ads –Also evaluate in blanked-out setting

32
Translation Model In search, translation models are common Here D = ad Good translation = ad click Typical model: maximum likelihood (for historic data) Any problem with this? (Diagram: a query term mapped to an ad term)
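One plausible reading of the maximum-likelihood estimate, sketched on an invented click log (the data and counting scheme are assumptions, not the paper's exact model). The problem the slide asks about shows up immediately: any (query term, ad term) pair never seen in the log gets probability exactly zero.

```python
from collections import Counter

# Toy click log: (query term, ad term) pairs observed on clicked ads.
clicks = [("shoes", "nike"), ("shoes", "nike"), ("shoes", "running"),
          ("watch", "rolex")]

pair_counts = Counter(clicks)
query_counts = Counter(q for q, _ in clicks)

def p_translate(ad_term, query_term):
    """MLE translation probability p(ad term | query term) from click counts.
    Unseen pairs get probability 0 -- hence the need for smoothing."""
    if query_counts[query_term] == 0:
        return 0.0
    return pair_counts[(query_term, ad_term)] / query_counts[query_term]
```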

33
Digression on MLE Maximum likelihood estimator –Pick the parameter that’s most likely to generate the observed data Example: Draw a single number from a hat with numbers {1, …, n}. You observe 7. Maximum likelihood estimator? Underestimates size (cf. # of species) Underestimates unknown/impossible Unbiased estimator?
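The hat example, worked out: for a single draw X from Uniform{1, …, n}, the likelihood of seeing x is 1/n for every n ≥ x, so the MLE picks the smallest consistent n, i.e. the observation itself, which systematically underestimates n. The estimator 2X − 1 is unbiased, since E[2X − 1] = 2·(n + 1)/2 − 1 = n.

```python
def mle_n(x):
    """MLE of n from a single draw x of Uniform{1, ..., n}: the likelihood
    1/n is maximized by the smallest n consistent with x, i.e. n = x."""
    return x

def unbiased_n(x):
    """2X - 1 is unbiased: E[2X - 1] = 2 * (n + 1) / 2 - 1 = n."""
    return 2 * x - 1

# Observing 7: the MLE says n = 7, the unbiased estimator says n = 13.
```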

34
Remove position bias Train one model as described before –But with smoothing Train a second model using expected clicks Ratio of model for actual and expected clicks Add these as additional features for the learner

35
Filtering low quality ads Showing fewer ads gave more clicks per search! Use to remove irrelevant ads –Don’t show ads below relevance threshold

36

37

38
Second part of Part 2

39
Estimating Advertisability of Tail Queries for Sponsored Search Sandeep Pandey, Kunal Punera, Marcus Fontoura, et al. SIGIR’10

40
Two important questions Query advertisability –When to show ads at all –How many ads to show Ad relevance and clickability –Which ads to show –Which ads to show where Focus on first problem. Predict: will there be an ad click? Difficult for tail queries!

41
Word-based Model s(q) = # instances of q with an ad click n(q) = # instances of q without an ad click Query q has words {w_i}. Model q’s click propensity as: Good/bad? Variant w/o bias for long queries: Maximum likelihood attempt to learn these:

42
Word-based Model Then give up …each q only one word

43
Linear regression model Problem? Different model: words contribute linearly Add regularization to avoid overfitting of underdetermined problem
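A minimal one-feature sketch of the regularized-regression idea (real word-weight models have thousands of features and use cross-validated regularization; the numbers here are invented). Even in one dimension the shrinkage effect is visible:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge regression for one feature, no intercept:
    w = sum(x*y) / (sum(x^2) + lambda). The penalty shrinks w toward 0,
    stabilizing an otherwise underdetermined word-weight problem."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

w_ols = ridge_1d([1, 2, 3], [2, 4, 6], lam=0.0)  # plain least squares: 2.0
w_reg = ridge_1d([1, 2, 3], [2, 4, 6], lam=1.0)  # shrunk below 2.0
```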

44
Digression Taken from: http://www.dtreg.com/svm.htm

45
Topical clustering Latent Dirichlet Allocation –Implicitly uses co-occurrences patterns Incorporate the topic distributions as features in the regression model

46
Evaluation Why not use the observed c(q) directly? –“Ground truth” is not trustworthy – tail queries Sort things by predicted c(q) –Should have included optimal ordering!

47
Learning Website Hierarchies for Keyword Enrichment in Contextual Advertising Pavan Kumar GM, Krishna Leela, Mehul Parsana, Sachin Garg WSDM’11

48
The problem(s) Keywords extracted for contextual advertising are not always perfect Many pages are not indexed – no keywords available. Still have to serve ads Want a system that for a given URL (indexed or not) outputs good keywords Key observation: use in-site similarity between pages and content

49
Preliminaries Mapping URLs u to key-value pairs Represent webpage p as vector of keywords –tf, df, and section where found Goals: 1.Use u to introduce new kw and/or update existing weights 2.For unindexed pages get kw via other pages from same site Latency constraint!

50
What they do Conceptually: –Train a decision tree with keys K as attribute labels, V as attribute values and pages P as class labels –Too many classes (sparseness, efficiency) What they do: –Use clusters of web pages as labels

51
Digression: Large scale clustering How (and why) to detect mirror pages? –“ls man” Want a summarizing “fingerprint”? –Direct hashing won’t work What would you do? Syntactic clustering of the Web, Broder et al., 1997

52
Shingling w-shingles of a document (say, w=4) –“If you are lonely when you are alone, you are in bad company.” (Sartre) {(if you are lonely), (you are lonely when), (are lonely when you), (lonely when you are), …} Resemblance r_w(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)| Works well, but how to compute efficiently?!
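The shingle-set and resemblance definitions above, as a small sketch (the tokenization details are an assumption):

```python
def shingles(text, w=4):
    """Set of w-word shingles of a document."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """r_w(A,B) = |S(A,w) & S(B,w)| / |S(A,w) | S(B,w)| (Jaccard)."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

quote = "If you are lonely when you are alone, you are in bad company."
```

Computing this exactly needs the full shingle sets, which is what the sketching trick on the next slide avoids.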

53
Obtaining a “sketch” Fix shingle size w, shingle universe U. Each individual shingle is a number (by hashing) Let W be a set of shingles. Define: MIN_s(W) = the set of s smallest elements in W, if |W| ≥ s; W otherwise Theorem: Let π: U → U be a permutation of U chosen uniformly at random. Let M(A) = MIN_s(π(S(A))) and M(B) = MIN_s(π(S(B))). The value |MIN_s(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / |MIN_s(M(A) ∪ M(B))| is an unbiased estimate of the resemblance of A and B.
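A sketch of the theorem in code, with the random permutation simulated by a seeded hash (an assumption; any good hash family works in practice). When s is at least |S(A) ∪ S(B)| the sketches contain everything and the estimate is exact, which the example below exploits:

```python
import hashlib

def hashed(shingle_set, seed=0):
    """Simulate a random permutation of the universe via seeded hashing."""
    return {int(hashlib.sha1((str(seed) + s).encode()).hexdigest(), 16)
            for s in shingle_set}

def min_s(wset, s):
    """MIN_s(W): the s smallest elements of W, or W itself if |W| < s."""
    return set(sorted(wset)[:s])

def estimate_resemblance(sa, sb, s):
    """Unbiased resemblance estimate from the two size-s sketches."""
    ma, mb = min_s(hashed(sa), s), min_s(hashed(sb), s)
    sketch = min_s(ma | mb, s)
    return len(sketch & ma & mb) / len(sketch)

A = {f"x{i}" for i in range(100)}
B = {f"x{i}" for i in range(50, 150)}  # true resemblance 50/150 = 1/3
```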

54
Proof Note: MIN_s(M(A)) has a fixed size (namely s).

55
Back to where we were They (essentially) use agglomerative single-linkage clustering with a min similarity stopping threshold Splitting criteria –How would you do it? Do you know agglomerative clustering?

56
Not the best criterion? IG prefers attributes with many values –They claim: high generalization error –They use: Gain Ratio (GR)

57
Take impressions into account So far (unweighted) pages –Class probability = number of pages More weight for recent visits: Weight things by impressions.

58
Stopping criterion Stop splitting in tree construction when –All children part of the same class –Too few impressions under the node –Statistically not meaningful (Chi-square test) Now we have a decision tree for URLs (leaves) –What about interior nodes?

59
Obtaining keywords for nodes Belief propagation – from leaves up …and back down Now we have keywords for nodes. Keywords for matching nodes are used.

60
Evaluation Two state-of-the-art baselines –Both use the content –JIT uses only first 500 bytes, syntactical –“Semantic” uses topical page hierarchies –All used with cosine similarity to find ads Relevance evaluation –Human judges evaluated ad relevance

61
(Some) Results nDCG … slide

62
Digression - nDCG Normalized Discounted Cumulative Gain CG: total relevance at positions 1 to p DCG: the higher the better nDCG: take problem difficulty into account
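The three quantities on this slide, sketched with the standard log2 position discount (relevance grades here are invented for illustration):

```python
from math import log2

def dcg(rels):
    """Discounted cumulative gain: relevance at position i (1-based)
    is discounted by log2(i + 1), so early positions matter most."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels):
    """Normalize by the DCG of the ideal (sorted) ordering, so the score
    accounts for how difficult the query is."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```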

63
An Expressive Mechanism for Auctions on the Web Paul Dütting, Monika Henzinger, Ingmar Weber WWW’11

64
More general utility functions Usually –u_i,j(p_j) = v_i,j – p_j –Sometimes with (hard) budget b_i We want to allow –u_i,j(p_j) = v_i,j – c_i,j · p_j, i.e. (i,j)-dependent slopes –multiple slopes on different intervals –non-linear utilities altogether

65
Why (i,j)-dependent slopes? Suppose mechanism uses CPC pricing … … but a bidder has CPM valuation Mechanism computes Guarantees

66
Translating back to impressions … Why (i,j)-dependent slopes?

67
Why different slopes over intervals? Suppose bidding on a car on ebay –Currently only 1-at-a-time (or dangerous)! –Utility depends on rates of loan

68
Why non-linear utilities? Suppose the drop is supra-linear –The higher the price the lower the profit … –… and the higher the uncertainty –Maybe log(C i,j -p j ) –“Risk-averse” bidders Will use piece-wise linear for approximation –Give approximation guarantees

69
Input definition Set of n bidders I, set of k items J. Items contain a dummy item j_0. Each bidder i has an outside option o_i. Each item j has a reserve price r_j.

70
Compute an outcome Outcome is feasible if Outcome is envy free if for all i and (i,j) ∈ I×J Bidder optimal if for all other envy free and for all bidders i (strong!) Problem statement

71
Bidder optimality vs. truthfulness Two bidders i ∈ {1,2} and two items j ∈ {1,2}. r_j = 0 for j ∈ {1,2}, and o_i = 0 for i ∈ {1,2} What’s a bidder optimal outcome? What if bidder 1 underreports u_1,1(·)? Note: “degenerate” input! Theorem: General position => Truthfulness. [See paper for definition of “general position”.]

72
Main Results Definition:

73
Overdemand-preserving directions Basic idea –Algorithm iteratively increases the prices –Price increases required to solve overdemand Tricky bits –preserve overdemand (will explain) –show necessity (for bidder optimality) –accounting for unmatching (for running time)

74
Overdemand-preserving directions The simple case (figure): bipartite graph of bidders and items, with edge utilities 10-p_1, 9-p_2, 12-p_1, 7-p_2, 5-p_3, 3-p_3, 8-p_1, 2-p_3, 11-p_2 at prices p_1=1, p_2=0, p_3=0. Explain: first choice graph Explain: increase required Explain: path augmentation

75
Overdemand-preserving directions The not-so-simple case (figure): edge utilities 11-2p_1, 9-p_1, 4-3p_2, 5-p_3, 3-p_3, 8-4p_1, 2-p_3, 9-7p_2 at prices p_1=1, p_2=0, p_3=0. Explain: the c_i,j matter!

76
Finding ov.d-preserving directions Key observation (not ours!): –minimize –or equivalently No longer preserves full first choice graph –But alternating tree Still allows path augmentation

77
The actual mechanism

78
Effects of Word-of-Mouth Versus Traditional Marketing: Findings from an Internet Social Networking Site Michael Trusov, Randolph Bucklin, Koen Pauwels Journal of Marketing, 2009

79
The growth of a social network Driving factors –Paid event marketing (101 events in 36 wks) –Media attention (236 in 36 wks) –Word-of-Mouth (WOM) Can observe –Organized marketing events –Mentions in the media –WOM referrals (through invites) –Number of new sign-ups

80
What could cause what? Media coverage => new sign-ups? New sign-ups => media coverage? WOM referrals => new sign-ups? ….

81
Time series modeling sign-ups, WOM referrals, media appearances, promo events; controls: intercept, linear trend, holidays, day of week Up to 20 days of lags Lots of parameters

82
Time series modeling Overfitting?

83
Granger Causality Correlation ≠ causality –Regions with more storks have more babies –Families with more TVs live longer Granger causality attempts more –Works for time series –Y and (possible) cause X –First, explain (= linear regression) Y by lagged Y –Explain the rest using lagged X –Significant improvement in fit?
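The two-step recipe above, as a toy one-lag sketch (invented data; a real Granger test regresses on several lags jointly and judges the improvement with an F-test, which this deliberately omits):

```python
def ols_residuals(xs, ys):
    """Fit y ~ a + b*x by least squares and return the residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def granger_gain(y, x):
    """Step 1: explain y_t by y_{t-1}. Step 2: see what share of the
    leftover variance lagged x explains. Returns that share in [0, 1]."""
    resid = ols_residuals(y[:-1], y[1:])     # y_t regressed on y_{t-1}
    resid2 = ols_residuals(x[:-1], resid)    # leftovers regressed on x_{t-1}
    rss1 = sum(r * r for r in resid)
    rss2 = sum(r * r for r in resid2)
    return 1 - rss2 / rss1 if rss1 else 0.0

x = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 1, 0, 1, 0, 1, 0]   # y_t simply copies x_{t-1}
gain = granger_gain(y, x)      # lagged x explains part of what lagged y cannot
```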

84
What causes what?

85
Response of Sign-Ups to Shock IRF: impulse response function New to me …

86
Digression: Bass diffusion New “sales” at time t: Ultimate market potential m is given.
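The slide's formula did not survive conversion; in the standard Bass model (my reconstruction, not necessarily the slide's exact notation) new adoptions at time t are n(t) = (p + q·N(t−1)/m)·(m − N(t−1)), with N the cumulative adoption, p the innovation coefficient, q the imitation coefficient, and m the given market potential:

```python
def bass_adoptions(p, q, m, periods):
    """Discrete-time Bass diffusion: returns new adoptions per period.
    Innovators adopt at rate p; imitators at rate q * N / m."""
    N, new = 0.0, []
    for _ in range(periods):
        n_t = (p + q * N / m) * (m - N)
        new.append(n_t)
        N += n_t
    return new

# Illustrative parameters (not from the paper): the curve rises to a
# peak and then decays as the market saturates toward m.
curve = bass_adoptions(p=0.03, q=0.38, m=1000, periods=30)
```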

87
Model comparison 197 train (= in-sample) 61 test (= out-of-sample)

88
Monetary Value of WOM CPM about $.40 (per ad) Impressions visitor/month about 130 Say 2.5 ads per impression $.13 per month per user, or about $1.50/yr IRF: 10 WOM = 5 new sign-ups over 3 wk 1 WOM worth approx $.75/yr
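The back-of-the-envelope arithmetic on this slide, spelled out (the slide's $.75/yr follows from rounding $1.56 down to about $1.50):

```python
cpm = 0.40                   # dollars per 1000 ad impressions
impressions_per_month = 130  # page views per visitor per month
ads_per_impression = 2.5

value_per_month = cpm * impressions_per_month * ads_per_impression / 1000
value_per_year = value_per_month * 12            # about $1.50-$1.56
signups_per_wom = 5 / 10     # IRF: 10 WOM referrals -> 5 new sign-ups
wom_value_per_year = signups_per_wom * value_per_year  # about $0.75
```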
