
1 Online Advertising
Open lecture at Warsaw University, February 25/26, 2011.
Ingmar Weber, Yahoo! Research Barcelona, ingmar@yahoo-inc.com
Please interrupt me at any point!

2 Disclaimers & Acknowledgments
This talk presents the opinions of the author. It does not necessarily reflect the views of Yahoo! Inc. or any other entity. Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo! or any other company. Many of the slides in this lecture are based on tables/graphs from the referenced papers. Please see the actual papers for more details.

3 Review from last lecture
Lots of money: ads essentially pay for the WWW.
Mostly sponsored search and display ads.
Sponsored search: sold using variants of GSP.
Display ads: sold in guaranteed-delivery (GD) contracts or on the spot market.
Many computational challenges: finding relevant ads, predicting CTRs, new/tail content and queries, detecting fraud, …

4 Plan for today and tomorrow
So far: mostly introductory, “text book material”.
Now: mostly recent research papers, plus a crash course in machine learning, information retrieval, economics, …
Hopefully more “think-along” (not sing-along) and less “shut-up-and-listen”.

5 But first … third-party cookies (and many others …)

6 Efficient Online Ad Serving in a Display Advertising Exchange
Kevin Lang, Joaquin Delgado, Dongming Jiang, et al. WSDM’11

7 A not-so-simple landscape for display advertising (DA)
Advertisers: “Buy shoes at nike.com”, “Visit asics.com today”, “Rolex is great.”
Publishers: a running blog, “The legend of Cliff Young”, celebrity gossip.
Users: a 32-year-old man who likes running, a 50-year-old woman who loves watches, a 16-year-old boy who likes sports.
Basic problem: given a (user, publisher) pair, find a good ad(vertiser).

8

9 Ad networks and exchanges
Ad networks: bring together supply (publishers) and demand (advertisers); have bilateral agreements via revenue sharing to increase market fluidity.
Exchanges: do the actual real-time allocation and implement the bilateral agreements.

10 Which edges can be pruned?
Example: a middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1).
The demand D is only known at the end.
User constraints: no alcohol ads to minors.
Supply constraints: a conservative network doesn’t want “left” publishers.
Demand constraints: premium blogs don’t want spammy ads.

11 Valid Paths & Objective Function

12 Algorithm A: depth-first search enumeration
Enumerates all valid paths (Figure 2 in the paper): exponential in the worst case, linear for trees.
Worst-case running time? Typical running time?
(A minimal sketch of such an enumeration follows below.)
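To make the enumeration concrete, here is a minimal sketch; the adjacency-list graph and all names are illustrative assumptions, not the paper’s data structures.

```python
def enumerate_paths(graph, source, sinks, path=None):
    """Depth-first enumeration of all source-to-sink paths (Algorithm A's
    exhaustive strategy): exponential in the worst case, linear on trees."""
    path = (path or []) + [source]
    if source in sinks:
        yield path
        return
    for nxt in graph.get(source, []):
        if nxt not in path:  # no revisits
            yield from enumerate_paths(graph, nxt, sinks, path)

# Toy exchange graph: publisher -> ad networks -> advertisers
graph = {"P1": ["N1", "N2"], "N1": ["A1", "A2"], "N2": ["A2"]}
for p in enumerate_paths(graph, "P1", sinks={"A1", "A2"}):
    print(" -> ".join(p))
```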

13 Algorithm B: pruning
First prune edges: U/S pruning (user and supply constraints), then D pruning (demand constraints).
Compute an upper bound via a single-source, multi-sink Dijkstra pass: O(E log N). Why does an upper bound help? Sum vs. product?
The Dijkstra pass dominates the cost: the edge reversal is at most O(E); the sort is at most [bound lost in transcript].
Worst-case running time? Further optimizations?
(A Dijkstra sketch follows below.)
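For reference, a standard binary-heap Dijkstra, the O(E log N) primitive the upper-bound pruning relies on; the graph encoding is an assumption. If the objective multiplies revenue shares along a path (the slide’s “sum vs. product” hint), running Dijkstra on −log(share) edge weights turns the product into a sum.

```python
import heapq

def dijkstra(graph, source):
    """graph[u] = list of (v, weight) pairs; returns shortest distances."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```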

14 Reusable precomputation
Pre-enforce the S and D constraints; only U has to be added at query time.
But D cannot be fully enforced up front: it depends on the reachable sink, which in turn depends on U.
Precompute for all sources s? We can vary for how many s we precompute things.
What if there are space limitations? How would you prioritize?

15 Experiments – artificial data

16 Experiments – real data

17 Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results
Christian Danescu-Niculescu-Mizil, Andrei Broder, et al. WWW’10
What would you investigate? What would you suspect?

18 Things to look at
General bias for near-identical things: are ads preferred (as they sit further “North”), or are organic results preferred?
Interplay between ad CTR and result CTR: do better search results mean fewer ad clicks, or are the two mutually reinforcing?
Dependence on type: navigational vs. informational query; responsive vs. incidental ad.

19 Data
One month of traffic for a subset of Y! search servers.
Only North ads, served at least 50 times.
For each query q_i: the most clicked ad A_i* and the most clicked organic result O_i*.
63,789 (q_i, O_i*, A_i*) triples. Bias?

20 (Non-)commercial bias?
Look at A* and O* with an identical domain: probably similar quality, but the (North) ad is placed higher.
What do you think? In 52% of cases ctr_O > ctr_A.

21 Correlation
[Figure: average ctr_A plotted against ctr_O]
For a given (range of) ctr_O, bucket all ads and plot their average ctr_A.

22 Navigational vs. non-navigational
[Figure: average ctr_A vs. ctr_O, split by query type]
Navigational queries: antagonistic effect. Non-navigational queries: (mild) reinforcement.

23 Dependence on similarity
Bag-of-words overlap of title terms, e.g. (“Free Radio”, “Pandora Radio – Listen to Free Internet Radio, Find New Music”) = 2/9: the titles share 2 of the 9 distinct words.
(A sketch of this similarity follows below.)
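A minimal sketch that reproduces the 2/9, assuming the overlap is Jaccard similarity over distinct lowercased title words:

```python
import re

def title_overlap(a, b):
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    return len(ta & tb) / len(ta | tb)  # Jaccard similarity

print(title_overlap("Free Radio",
                    "Pandora Radio - Listen to Free Internet Radio, "
                    "Find New Music"))  # 2/9 = 0.222...
```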

24 Dependence on similarity
[Figure: average ctr_A as a function of title overlap]

25 A simple model
Want to model: [formula on slide]. Also need: [formula on slide].

26 A simple model (cont.)
Explains the basic (quadratic) shape of overlap vs. ad click-through rate.

27 Improving Ad Relevance in Sponsored Search
Dustin Hillard, Stefan Schroedl, Eren Manavoglu, et al. WSDM’10

28 Ad relevance ≠ ad attractiveness
Relevance: how related is the ad to the search query? E.g. q = “cocacola”, ad = “Buy Coke Online”.
Attractiveness: essentially the click-through rate. E.g. q = “cocacola”, ad = “Coca Cola Company Job” (relevant, not attractive); q = *, ad = “Lose weight fast and easy” (attractive, not relevant).
Hope: decoupling the two leads to better (cold-start) CTR predictions.

29 Basic setup
Get relevance from editorial judgments: perfect, excellent, good, fair, bad; treat non-bad as relevant.
Machine learning approach: compare the query to the ad (title, description, display URL).
Features: word overlap (uni- and bigram), character overlap (uni- and bigram), cosine similarity, ordered bigram overlap, query length.
Data: 7k unique queries (stratified sample), 80k judged query-ad pairs.

30 Basic results – text only
Precision = (“said ‘yes’ and was ‘yes’”) / (“said ‘yes’”)
Recall = (“said ‘yes’ and was ‘yes’”) / (“was ‘yes’”)
Accuracy = (“said the right thing”) / (“said something”)
F1 score = 2 / (1/P + 1/R), the harmonic mean, which is at most the arithmetic mean.
What other features could help? (A worked example follows below.)
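A small worked example of these definitions from raw decision counts (the counts are made up):

```python
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)   # said 'yes' and was 'yes' / said 'yes'
    recall = tp / (tp + fn)      # said 'yes' and was 'yes' / was 'yes'
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of P and R
    return precision, recall, accuracy, f1

# P = 0.5 and R = 1.0 give F1 = 2/3, below their arithmetic mean 0.75
print(metrics(tp=10, fp=10, fn=0, tn=80))
```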

31 Incorporating user clicks
Can use historic CTRs, but that assumes the (ad, query) pair has been seen before: useless for new ads.
Also evaluate in a blanked-out setting (as if the ad were new).

32 Translation model
In search, translation models are common; here the “document” D is the ad.
A good translation = an ad click.
Typical model: [formula on slide] relating a query term to an ad term, with maximum likelihood estimates from historic data.
Any problem with this? (A sketch follows below.)
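As a hedged sketch of such a maximum-likelihood estimate (the log format and the exact normalization are assumptions, not the paper’s):

```python
from collections import defaultdict

def translation_table(clicked_pairs):
    """P(ad term | query term) estimated from (query, clicked ad text) pairs."""
    pair, total = defaultdict(float), defaultdict(float)
    for query, ad_text in clicked_pairs:
        ad_terms = ad_text.lower().split()
        for qt in query.lower().split():
            total[qt] += len(ad_terms)
            for at in ad_terms:
                pair[(qt, at)] += 1.0
    return {(qt, at): c / total[qt] for (qt, at), c in pair.items()}

table = translation_table([("cheap flights", "discount airline tickets"),
                           ("cheap hotels", "discount hotel deals")])
print(table[("cheap", "discount")])  # 2/6; rare pairs get noisy estimates
```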

33 Digression on MLE
Maximum likelihood estimator: pick the parameter value under which the observed data is most likely.
Example: draw a single number from a hat with numbers {1, …, n}; you observe 7. The maximum likelihood estimate of n is 7, since the likelihood 1/n is maximized by the smallest n consistent with the observation.
MLE underestimates the size (cf. estimating the number of species) and underestimates the unknown/impossible.
An unbiased estimator? (See the worked calculation below.)
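The calculation behind the example (standard facts, not from the slides): the likelihood of n given the single observation X = 7 is maximized by the smallest admissible n, while unbiasedness requires inflating the observation.

```latex
L(n \mid X = 7) = \begin{cases} 1/n & n \ge 7 \\ 0 & n < 7 \end{cases}
\;\Rightarrow\; \hat{n}_{\mathrm{MLE}} = X = 7 \quad (\text{never overestimates}),
\qquad
\mathbb{E}[2X - 1] = 2 \cdot \tfrac{n+1}{2} - 1 = n
\;\Rightarrow\; \hat{n}_{\mathrm{unb}} = 2X - 1 = 13 .
```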

34 Remove position bias
Train one model as described before, but with smoothing.
Train a second model using expected clicks.
Use the ratio of the models for actual and expected clicks.
Add these as additional features for the learner. (A COEC-style sketch follows below.)
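One common way to build the “expected clicks” quantity is a clicks-over-expected-clicks (COEC) ratio; this sketch is a generic illustration under assumed position priors, not the paper’s exact model.

```python
POSITION_CTR_PRIOR = {1: 0.12, 2: 0.06, 3: 0.03}  # assumed priors

def coec(positions_shown, clicks):
    """Clicks over expected clicks: > 1 means better than position alone
    predicts, factoring the position bias out of the raw CTR."""
    expected = sum(POSITION_CTR_PRIOR.get(p, 0.01) for p in positions_shown)
    return clicks / expected if expected else 0.0

print(coec([1, 1, 2, 3], clicks=2))  # 2 / 0.33 ~ 6.1
```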

35 Filtering low-quality ads
Use the relevance model to remove irrelevant ads: don’t show ads below a relevance threshold.
Showing fewer ads gave more clicks per search!

36

37

38 Second part of Part 2

39 Estimating Advertisability of Tail Queries for Sponsored Search
Sandeep Pandey, Kunal Punera, Marcus Fontoura, et al. SIGIR’10

40 Two important questions
Query advertisability: when to show ads at all, and how many ads to show.
Ad relevance and clickability: which ads to show, and where.
Focus here on the first problem. Predict: will there be an ad click? Difficult for tail queries!

41 Word-based model
Query q has words {w_i}. Model q’s click propensity as: [formula on slide]. Good or bad?
Variant without bias for long queries: [formula on slide].
Maximum likelihood attempt to learn these:
s(q) = # instances of q with an ad click
n(q) = # instances of q without an ad click
(A hedged sketch follows below.)
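The slide’s formulas did not survive the transcript; the following is a hedged sketch of one plausible reading: per-word MLE propensities c(w) = s(w)/(s(w)+n(w)), combined by a geometric mean so long queries are not penalized. The paper’s exact aggregation may differ.

```python
import math

def word_propensities(stats):
    """stats[w] = (s, n): impressions of w with / without an ad click."""
    return {w: s / (s + n) for w, (s, n) in stats.items() if s + n > 0}

def query_propensity(query, c, default=0.01):
    words = query.lower().split()
    logs = [math.log(c.get(w, default)) for w in words]
    return math.exp(sum(logs) / len(logs))  # geometric mean of the c(w)

c = word_propensities({"running": (30, 970), "shoes": (80, 920)})
print(query_propensity("running shoes", c))
```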

42 Word-based model (cont.)
Then give up … each q is treated as only one word.

43 Linear regression model
A different model: words contribute linearly.
Add regularization to avoid overfitting the underdetermined problem.
Problem? (A ridge sketch follows below.)
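A minimal ridge-regression sketch of “words contribute linearly, plus L2 regularization”; the closed form below is standard, the data is made up.

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """w = (X^T X + lam*I)^{-1} X^T y, the L2-regularized least squares fit."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1, 1, 0],    # query 1 contains words 1 and 2
              [0, 1, 1],    # query 2 contains words 2 and 3
              [1, 0, 1]], dtype=float)
y = np.array([0.30, 0.10, 0.20])   # observed click propensities
print(ridge(X, y))                  # per-word contributions
```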

44 Digression
Taken from: http://www.dtreg.com/svm.htm and …

45 Topical clustering
Latent Dirichlet Allocation (LDA) implicitly uses co-occurrence patterns.
Incorporate the topic distributions as features in the regression model. (A sketch follows below.)
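A sketch of deriving LDA topic features with scikit-learn, a stand-in for whatever implementation the paper used; corpus and hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

queries = ["cheap running shoes", "marathon training plan",
           "buy rolex watch", "luxury watches on sale"]
counts = CountVectorizer().fit_transform(queries)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)  # one topic distribution per query
print(features.round(2))              # feed these into the regression model
```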

46 Evaluation
Why not use the observed c(q) directly? The “ground truth” is not trustworthy for tail queries.
Sort queries by predicted c(q).
They should have included the optimal ordering as a reference!

47 Learning Website Hierarchies for Keyword Enrichment in Contextual Advertising
Pavan Kumar GM, Krishna Leela, Mehul Parsana, Sachin Garg. WSDM’11

48 The problem(s)
Keywords extracted for contextual advertising are not always perfect.
Many pages are not indexed, so no keywords are available, yet ads still have to be served.
Want a system that outputs good keywords for a given URL (indexed or not).
Key observation: exploit in-site similarity between pages and content.

49 Preliminaries
Map URLs u to key-value pairs.
Represent webpage p as a vector of keywords, with tf, df, and the section where each was found.
Goals: use u to introduce new keywords and/or update existing weights; for unindexed pages, get keywords via other pages from the same site.
Latency constraint!

50 What they do
Conceptually: train a decision tree with keys K as attribute labels, values V as attribute values, and pages P as class labels. But there are too many classes (sparseness, efficiency).
What they actually do: use clusters of web pages as labels.

51 Digression: large-scale clustering
Syntactic clustering of the Web, Broder et al., 1997.
How (and why) to detect mirror pages? (“ls man”)
Want a summarizing “fingerprint”; direct hashing won’t work.
What would you do?

52 Shingling
w-shingles of a document (say, w = 4):
“If you are lonely when you are alone, you are in bad company.” (Sartre)
{(if you are lonely), (you are lonely when), (are lonely when you), (lonely when you are), …}
Resemblance: r_w(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|
Works well, but how to compute it efficiently?!

53 Obtaining a “sketch”
Fix the shingle size w and the shingle universe U. Each individual shingle is a number (by hashing).
Let W be a set of shingles. Define MIN_s(W) = the set of the s smallest elements of W if |W| ≥ s, and W otherwise.
Theorem: Let π: U → U be a permutation of U chosen uniformly at random. Let M(A) = MIN_s(π(S(A))) and M(B) = MIN_s(π(S(B))). Then
|MIN_s(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / |MIN_s(M(A) ∪ M(B))|
is an unbiased estimate of the resemblance of A and B.
(A sketch implementation follows below.)
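A compact sketch of the whole pipeline: shingling, true resemblance, and the MIN_s estimate. A salted hash stands in for the random permutation π, which is the usual practical shortcut.

```python
def shingles(text, w=4):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(sa, sb):
    return len(sa & sb) / len(sa | sb)

def min_sketch(shingle_set, s=50, salt=0x9E3779B9):
    """MIN_s of the (salted-hash-)permuted shingle numbers."""
    return set(sorted(hash((salt, sh)) for sh in shingle_set)[:s])

def estimate(ma, mb, s=50):
    pool = set(sorted(ma | mb)[:s])         # MIN_s(M(A) U M(B))
    return len(pool & ma & mb) / len(pool)  # the theorem's unbiased estimate
```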

54 Proof
Note: MIN_s(M(A)) has a fixed size (namely s).

55 Back to where we were
They (essentially) use agglomerative single-linkage clustering with a minimum-similarity stopping threshold.
Splitting criteria: how would you do it? Do you know agglomerative clustering?

56 Not the best criterion?
Information gain (IG) prefers attributes with many values; they claim this leads to high generalization error.
They use gain ratio (GR) instead. (The standard definition is given below.)
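For reference, the standard (Quinlan) definition, not copied from the paper: gain ratio divides information gain by the split information, which grows with the number of attribute values and thus penalizes many-valued attributes.

```latex
\mathrm{GR}(A) = \frac{\mathrm{IG}(A)}{\mathrm{SplitInfo}(A)},
\qquad
\mathrm{SplitInfo}(A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},
```

where the S_i are the subsets of S induced by the values of attribute A.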

57 Take impressions into account
So far pages are unweighted: class probability = number of pages.
Instead, weight by impressions, with more weight for recent visits: [formula on slide].
(An illustrative decay is sketched below.)
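The recency formula is lost in the transcript; purely as an illustration, an exponential decay with a chosen half-life would look like this (an assumption, not the paper’s formula):

```python
import math

def impression_weight(age_days, half_life_days=30.0):
    """Exponential recency decay: weight 1 now, 0.5 after one half-life."""
    return math.exp(-math.log(2.0) * age_days / half_life_days)

print(impression_weight(0), impression_weight(30), impression_weight(90))
# 1.0  0.5  0.125
```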

58 Stopping criterion
Stop splitting during tree construction when: all children belong to the same class; too few impressions fall under the node; or the split is statistically not meaningful (chi-square test).
Now we have a decision tree with URLs at the leaves. What about the interior nodes? (A chi-square sketch follows below.)
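A sketch of the statistical test, using SciPy’s chi-square test of independence on click counts across a candidate split’s children (counts are made up):

```python
from scipy.stats import chi2_contingency

table = [[30, 970],   # child 1: [clicks, non-clicks]
         [22, 978]]   # child 2
chi2, p, dof, _ = chi2_contingency(table)
if p > 0.05:
    print(f"p = {p:.2f}: split not statistically meaningful, stop")
```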

59 Obtaining keywords for nodes
Belief propagation: from the leaves up, and back down.
Now we have keywords for all nodes; the keywords of matching nodes are used.

60 Evaluation
Two state-of-the-art baselines, both using page content: “JIT” uses only the first 500 bytes (syntactic); “Semantic” uses topical page hierarchies. All methods use cosine similarity to find ads.
Relevance evaluation: human judges evaluated ad relevance.

61 (Some) results
[nDCG table on slide]

62 Digression – nDCG
Normalized discounted cumulative gain.
CG_p = sum of rel_i for positions i = 1 to p: total relevance at positions 1 to p.
DCG_p (one standard form) = sum of rel_i / log2(i + 1): discounts relevance lower down the ranking; the higher, the better.
nDCG_p = DCG_p / IDCG_p: normalizing by the ideal DCG takes the difficulty of the query into account.
(A sketch follows below.)
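A minimal implementation using the common rel_i / log2(i+1) discount; rels is the list of graded relevances in ranked order:

```python
import math

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))  # i is 0-based

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))  # best possible ordering (IDCG)
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # < 1.0: the two 3s are not both ranked first
```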

63 An Expressive Mechanism for Auctions on the Web
Paul Dütting, Monika Henzinger, Ingmar Weber WWW’11

64 More general utility functions
Usually u_{i,j}(p_j) = v_{i,j} − p_j, sometimes with a (hard) budget b_i.
We want to allow u_{i,j}(p_j) = v_{i,j} − c_{i,j} · p_j, i.e. (i,j)-dependent slopes; multiple slopes on different intervals; non-linear utilities altogether.

65 Why (i,j)-dependent slopes?
Suppose the mechanism uses CPC pricing, but a bidder has a CPM valuation.
Mechanism computes: [formula on slide]. Guarantees: [formula on slide].

66 Why (i,j)-dependent slopes?
Translating back to impressions …

67 Why different slopes over intervals?
Suppose you are bidding on a car on eBay: currently you can only bid on one at a time (or it gets dangerous)!
The utility depends on the rates of the loan.

68 Why non-linear utilities?
Suppose the drop in utility is supra-linear: the higher the price, the lower the profit, and the higher the uncertainty. Maybe log(C_{i,j} − p_j): “risk-averse” bidders.
We will use piece-wise linear functions for approximation and give approximation guarantees.

69 Input definition
A set of n bidders I and a set of k items J.
The items contain a dummy item j_0.
Each bidder i has an outside option o_i; each item j has a reserve price r_j.

70 Problem statement
Compute an outcome (a matching and prices).
The outcome is feasible if: [conditions on slide].
The outcome is envy free if for all (i,j) ∈ I×J: [condition on slide].
It is bidder optimal if, compared with every other envy-free outcome, every bidder i is at least as well off (strong!).

71 Bidder optimality vs. truthfulness
Two bidders i ∈ {1,2} and two items j ∈ {1,2}; r_j = 0 and o_i = 0 for all i, j.
What is a bidder-optimal outcome? What if bidder 1 underreports u_{1,1}(·)?
Note: this is a “degenerate” input!
Theorem: general position ⇒ truthfulness. [See the paper for the definition of “general position”.]

72 Main results
Definition: [definitions and theorems on slide].

73 Overdemand-preserving directions
Basic idea: the algorithm iteratively increases the prices; price increases are required to resolve overdemand.
Tricky bits: preserve overdemand (to be explained); show necessity of the increases (for bidder optimality); account for unmatching (for the running time).

74 Overdemand-preserving directions: the simple case
[Figure: bipartite graph of 3 bidders and 3 items with linear utilities such as 10 − p_1, 9 − p_2, 5 − p_3, …, and current prices p_1 = 1, p_2 = 0, p_3 = 0]
Explain: the required increase; the first-choice graph; path augmentation.

75 Overdemand-preserving directions: the not-so-simple case
[Figure: the same bipartite graph but with (i,j)-dependent slopes, e.g. 11 − 2p_1, 8 − 4p_1, 9 − 7p_2, …]
Explain: the coefficients c_{i,j} matter!

76 Finding overdemand-preserving directions
Key observation (not ours!): minimize [formula on slide], or equivalently [formula on slide].
This no longer preserves the full first-choice graph, but it preserves an alternating tree, which still allows path augmentation.

77 The actual mechanism

78 Effects of Word-of-Mouth Versus Traditional Marketing: Findings from an Internet Social Networking Site
Michael Trusov, Randolph Bucklin, Koen Pauwels. Journal of Marketing, 2009

79 The growth of a social network
Driving factors: paid event marketing (101 events in 36 weeks), media attention (236 mentions in 36 weeks), word-of-mouth (WOM).
Observable: organized marketing events, mentions in the media, WOM referrals (through invites), number of new sign-ups.

80 What could cause what?
Media coverage => new sign-ups? New sign-ups => media coverage? WOM referrals => new sign-ups? …

81 Time series modeling
Explain sign-ups by: an intercept, a linear trend, holidays, day of week, and lagged sign-ups, WOM referrals, media appearances, and promo events.
Lags of up to 20 days. Lots of parameters!

82 Time series modeling Overfitting?

83 Granger causality
Correlation ≠ causality: regions with more storks have more babies; families with more TVs live longer.
Granger causality attempts more. It works for a time series Y and a possible cause X:
First, explain Y by lagged Y (a linear regression).
Then try to explain the rest using lagged X.
Is the improvement in fit significant? (A sketch follows below.)
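A sketch of the two-regression procedure just described (restricted model: lags of y only; unrestricted: plus lags of x; F-test on the improvement). This is the generic textbook construction, not the paper’s exact specification.

```python
import numpy as np
from scipy.stats import f as f_dist

def lagged(v, lags, n):
    # rows t = lags..n-1; columns v[t-1], ..., v[t-lags]
    return np.column_stack([v[lags - k:n - k] for k in range(1, lags + 1)])

def granger_f_test(y, x, lags=7):
    n = len(y)
    Y = y[lags:]
    X_r = np.column_stack([np.ones(n - lags), lagged(y, lags, n)])
    X_u = np.column_stack([X_r, lagged(x, lags, n)])
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    df1, df2 = lags, len(Y) - X_u.shape[1]
    F = ((rss(X_r) - rss(X_u)) / df1) / (rss(X_u) / df2)
    return F, f_dist.sf(F, df1, df2)  # small p: x Granger-causes y
```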

84 What causes what?

85 Response of sign-ups to a shock
IRF: impulse response function. (New to me …)

86 Digression: Bass diffusion
New “sales” at time t (standard discrete Bass form): n(t) = (p + q · N(t−1)/m) · (m − N(t−1)), where N is cumulative adoption, p the innovation and q the imitation coefficient.
The ultimate market potential m is given. (A simulation sketch follows below.)
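A simulation sketch of the discrete Bass model with illustrative parameters (p: innovation, q: imitation):

```python
def bass_new_adopters(m, p, q, periods):
    """n(t) = (p + q * N(t-1)/m) * (m - N(t-1)); N is cumulative adoption."""
    N, path = 0.0, []
    for _ in range(periods):
        n_t = (p + q * N / m) * (m - N)
        N += n_t
        path.append(round(n_t))
    return path

print(bass_new_adopters(m=1_000_000, p=0.03, q=0.38, periods=8))
```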

87 Model comparison
197 observations for training (in-sample), 61 for testing (out-of-sample).

88 Monetary value of WOM
CPM of about $0.40 per ad; about 130 impressions per visitor per month; say 2.5 ads per impression.
That is 130 × 2.5 / 1000 × $0.40 ≈ $0.13 per month per user, or about $1.50 per year.
IRF: 10 WOM referrals ≈ 5 new sign-ups over 3 weeks, so 1 WOM referral is worth roughly 0.5 × $1.50 ≈ $0.75 per year.

