Internet Research: What's hot in Search, Advertizing & Cloud Computing. Rajeev Rastogi, Yahoo! Labs Bangalore.

Presentation transcript:

Internet Research: What's hot in Search, Advertizing & Cloud Computing. Rajeev Rastogi, Yahoo! Labs Bangalore

The most visited site on the internet, with 600 million+ users per month. Super popular properties: News, finance, sports; Answers, flickr, del.icio.us; Mail, messaging; Search.

Unparalleled scale: 25 terabytes of data collected each day (over 4 billion clicks, over 4 billion emails, and over 6 billion instant messages per day). Over 20 billion web documents indexed. Over 4 billion images searchable. No other company on the planet processes as much data as we do!

Yahoo! Labs Bangalore: Focus is on basic and applied research in search, advertizing, and cloud computing. University relations: faculty research grants, summer internships, sharing data/computing infrastructure, conference sponsorships, PhD co-op program.

Web Search

What does search look like today?

Search results of the future: Structured abstracts (examples from yelp.com, babycenter, epicurious, answers.com, LinkedIn, webmd, New York Times, Gawker)

Search results of the future: Query refinement

Search results of the future: Rich media

Technologies that are enabling search transformation: information extraction (structured abstracts), Web page classification (query refinement), multimedia search (rich media).

Information extraction (IE). Goal: extract structured records from Web pages. Example attributes: Name, Address, Category, Phone, Price, Map, Reviews.

Multiple verticals Business, social networking, video, ….

One schema per vertical. Business: Name, Category, Address, Phone, Price. Social networking: Name, Title, Education, Connections. Video: Title, Posted by, Date, Rating, Views.

IE on the Web is a hard problem: Web pages are noisy, and pages belonging to different Web sites have different layouts.

Web page types Template-based Hand-crafted

Template-based pages: Pages within a Web site are generated using scripts and have very similar structure, which can be leveraged for extraction. They make up ~30% of crawled Web pages, are information rich, and frequently appear in the top results of search queries. E.g., for the search query "Chinese Mirch New York", 9 of the top 10 results are template-based pages.

Wrapper Induction. Learn: sample pages from the Web site, annotate the sample pages, and learn wrappers (XPath rules) from the annotations. Extract: apply the wrappers to the Web site's pages to extract records. Enables extraction from template-based pages.

Example: the XPath /html/body/div/div/div/div/div/div/span generalizes to /html/body//div//span.

Filters: Apply filters to prune the multiple candidates that match an XPath expression. XPath: /html/body//div//span. Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
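
The XPath-plus-filter step above can be sketched in Python as follows. This is an illustration under assumptions, not the system described in the talk: the page fragment, the fictional phone number, and the name `extract_phone` are hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for a real HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real Web pages need an HTML parser).
PAGE = """<html><body>
<div><h1><span>Chinese Mirch</span></h1></div>
<div><span>120 Lexington Avenue</span><p><span>(212) 555-0100</span></p></div>
<div><span>Call 555-0100 today</span></div>
</body></html>"""

# Phone filter from the slide: (ddd) ddd-dddd
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def extract_phone(page: str) -> list[str]:
    root = ET.fromstring(page)  # root element is <html>
    # "body//div//span" relative to <html> mirrors /html/body//div//span.
    spans = root.findall("body//div//span")
    texts = [(s.text or "").strip() for s in spans]
    # Nested <div>s can yield the same <span> more than once, so de-duplicate.
    candidates = list(dict.fromkeys(texts))
    # The generalized XPath matches several spans; the regex prunes them.
    return [c for c in candidates if PHONE.fullmatch(c)]


phones = extract_phone(PAGE)
```

Here the generalized XPath matches four spans, and only the one whose full text has the phone pattern survives the filter.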

Limitations of wrappers: They won't work across Web sites due to different page layouts. Scaling to thousands of sites can be a challenge: a separate wrapper must be learned for each site, and annotating example pages from thousands of sites can be time-consuming and expensive.

Research challenge: unsupervised IE. Extract attribute values from pages of a new Web site without annotating a single page from the site; only annotate pages from a few sites initially as training data.

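Conditional Random Fields (CRFs): Model the conditional probability distribution of a label sequence $y = y_1, \ldots, y_n$ given an input sequence $x = x_1, \ldots, x_n$. In the standard linear-chain form, $p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{t=1}^{n} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)$, where the $f_k$ are features, the $\lambda_k$ are weights, and $Z(x)$ normalizes over all label sequences. Choose the $\lambda_k$ to maximize the log-likelihood of the training data. Use the Viterbi algorithm to compute the label sequence $y$ with the highest probability.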
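
A minimal sketch of the Viterbi decoding step, assuming the weighted feature sum for one position is available as a function `score(prev, cur, tokens, t)`. The function names, the toy features, and their weights are hypothetical and are not the model from the talk; the phone number is fictional.

```python
import re


def viterbi(tokens, labels, score):
    """Return the label sequence maximizing the sum of score(prev, cur, tokens, t).

    Assumes tokens is non-empty; prev is None at position 0.
    """
    # best[t][y]: best total score of a labeling of tokens[:t+1] that ends in y
    best = [{y: score(None, y, tokens, 0) for y in labels}]
    back = []  # back[t-1][y]: best previous label when position t has label y
    for t in range(1, len(tokens)):
        col, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + score(p, y, tokens, t))
            col[y] = best[-1][prev] + score(prev, y, tokens, t)
            ptr[y] = prev
        best.append(col)
        back.append(ptr)
    # Trace back from the best final label.
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


def toy_score(prev, cur, tokens, t):
    """Hypothetical hand-set features and weights, for illustration only."""
    tok, s = tokens[t], 0.0
    if cur == "Phone" and re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", tok):
        s += 3.0
    if cur == "Address" and tok[0].isdigit():
        s += 2.0
    if cur == "Category" and "," in tok:
        s += 2.0
    if cur == "Name" and t == 0:
        s += 2.0
    if prev == cur:
        s -= 1.0  # transition feature: discourage repeating a label
    return s


LABELS = ["Name", "Category", "Address", "Phone", "Noise"]
FIELDS = ["Chinese Mirch", "Chinese, Indian", "120 Lexington Avenue", "(212) 555-0100"]
decoded = viterbi(FIELDS, LABELS, toy_score)
```

Because the normalizer $Z(x)$ is the same for every labeling of a given input, maximizing the summed scores is equivalent to maximizing $p(y \mid x)$.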

CRF-based IE: Web pages can be viewed as labeled sequences (labels: Name, Category, Address, Phone, Noise). Train a CRF using pages from a few Web sites, then use the trained CRF to extract from the remaining sites.

Drawbacks of CRFs: They require too many training examples. They have been used previously to segment short strings with similar structure; however, they may not work too well across Web sites that contain long pages with lots of noise or have very different structure.

An alternate approach that exploits site knowledge: (1) Build attribute classifiers for each attribute, using pages from a few initial Web sites. (2) For each page from a new Web site, segment the page into a sequence of fields (using static repeating text) and use the attribute classifiers to assign attribute labels to fields. (3) Use constraints to disambiguate labels. Uniqueness: an attribute occurs at most once in a page. Proximity: attribute values appear close together in a page. Structural: relative positions of attributes are identical across pages of a Web site.

Attribute classifiers + constraints example
Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY | (212)
Page 2: Jewel of India | Indian | 15 W 44th St, New York, NY | (212)
Page 3: 21 Club | American | 21 W 52nd St, New York, NY | (212)
Labels assigned by the attribute classifiers (as shown in the figure): Phone Address Category Name Category Category, Name Name Name, Noise Address Phone
Uniqueness constraint: Name. Precedence constraint: Name < Category.
Result for Page 3: 21 Club (Name), American (Category), 21 W 52nd St, New York, NY (Address), (212) (Phone)

Other IE scenarios: Browse page extraction Similar-structured records

IE big picture/taxonomy. Things to extract from: template-based, browse, and hand-crafted pages; text. Things to extract: records, tables, lists, named entities. Techniques used: structure-based (HTML tags, DOM tree paths), e.g. wrappers; content-based (attribute values/models), e.g. dictionaries; structure + content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs. Level of automation: manual, supervised, unsupervised.

Web Page Classification: Requirements. Quality: high precision and recall; leverage structured input (links, co-citations) and output (taxonomy). Scalability: large numbers of training examples, features, and classes; complex structured input and output. Cost: small human effort (for labeling of pages); compact classifier model; low prediction time.

Structured Output Learning. Structured output examples: multi-class, taxonomy (e.g. Sport: Cricket (One-day, Test), Soccer; Health: Fitness, Medicine). Naïve approach: a separate binary classifier per class, or a separate classifier for each taxonomy level. Better approach: a single (SVM) classifier, with higher accuracy and more efficiency. Sequential Dual Method (SDM): visit each example sequentially and solve the associated QP problem (in the dual) efficiently; an order of magnitude faster.

Classification With Relational Information. Relational information: Web page links, structural similarity. Graph representation: pages as nodes (with labels); edge weights s(j,k) encode page similarity, out-link/co-citation existence, etc. (edge types in the figure: link, co-citation, similar structure). Classification can be expressed as an optimization problem over this graph.

Multimedia Search: Availability and consumption of multimedia content on the Internet is increasing; 500 billion images will be captured in 2010. Leveraging content and metadata is important for MM search. Some big technical challenges are results diversity, relevance, and image classification (e.g., pornography).

Near-Duplicate Detection: Multiple near-similar versions of an image exist on the internet (scaled, cropped, captioned, small scene change, etc.). Near-duplicates adversely impact user experience. Can we use a compact description and dedup in constant time? The Fourier-Mellin Transform (FMT) is translation, rotation, and scale invariant. Signatures are generated using a small number of low-frequency coefficients of the FMT.

Filtering noisy tags to improve relevance: Measures such as IDF may assign high weights to noisy tags, because they treat tag sets as a bag of words, i.e. a random collection of terms. Boosting the weights of tags based on their co-occurrence with other tags can filter out noise. (Figure: tag weights under IDF vs. co-occurrence.)

Online Advertizing

Search query Ad Sponsored search ads

How it works: Advertisers submit bids to the sponsored search engine, e.g. "I want to bid $5 on canon camera" or "I want to bid $2 on cannon camera"; the ads are stored in the ad index. The engine decides when/where to show an ad on the search results page. The advertiser pays only if the user clicks on the ad.

Ad selection criterion. Problem: which ads to show from among the ads containing the keyword? Ads with the highest bid may not maximize revenue. Choose ads with maximum expected revenue, i.e. weigh the bid amount with the click probability. (Table columns: Ad, Bid, Click Prob, Expected Revenue; rows A1, A2, A3.)
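
A minimal sketch of ranking by expected revenue (bid times click probability). The bids and click probabilities below are made-up illustrative values, not the numbers from the slide's table, and `rank_by_expected_revenue` is a hypothetical name.

```python
# (ad id, bid in dollars, estimated click probability): hypothetical values
ADS = [
    ("A1", 5.00, 0.01),
    ("A2", 2.00, 0.05),
    ("A3", 1.00, 0.02),
]


def expected_revenue(ad):
    _, bid, click_prob = ad
    return bid * click_prob


def rank_by_expected_revenue(ads):
    """Order ads from highest to lowest expected revenue per impression."""
    return sorted(ads, key=expected_revenue, reverse=True)


ranking = [ad_id for ad_id, _, _ in rank_by_expected_revenue(ADS)]
highest_bid = max(ADS, key=lambda ad: ad[1])[0]
```

With these values the highest bidder (A1) is not ranked first, because A2's higher click probability more than makes up for its lower bid.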

Contextual Advertising Ads

Contextual ads: Similar to sponsored search, but now ads are shown on general Web pages as opposed to only search pages. Advertisers bid on keywords. The advertiser pays only if the user clicks; Y! and the publisher share the paid amount. The ad matching engine ranks ads based on expected revenue (bid amount * click probability).

Estimating click probability: Use a logistic regression model, $p(\text{click} \mid \text{ad}, \text{page}, \text{user}) = \frac{1}{1 + \exp\left(-\sum_i w_i f_i\right)}$, where $f_i$ is the $i$th feature for (ad, page, user) and $w_i$ is the weight for feature $f_i$. Training data: ad click logs (all clicks + non-click samples). Optimize the log-likelihood to learn the weights.
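
The model and its training objective can be sketched as below. This is a toy illustration, assuming sparse features stored in dictionaries and plain stochastic gradient ascent on the log-likelihood; the feature names, the learning rate, and the functions `click_probability` and `sgd_step` are hypothetical.

```python
import math


def click_probability(weights, features):
    """Logistic model: p(click) = 1 / (1 + exp(-sum_i w_i * f_i))."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, features, clicked, lr=0.1):
    """One gradient ascent step on the log-likelihood of a single example.

    The gradient with respect to w_i is (y - p) * f_i, where y is 1 for a click.
    """
    error = (1.0 if clicked else 0.0) - click_probability(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + lr * error * value


# Hypothetical combined features for one (ad, page, user) triple.
example = {"apple_in_ad_title": 1.0, "ipod_in_page_body": 1.0, "user_age_20_30": 1.0}

weights = {}
before = click_probability(weights, example)  # all-zero weights give 0.5
for _ in range(20):
    sgd_step(weights, example, clicked=True)
after = click_probability(weights, example)
```

Repeated updates on a clicked example push its predicted probability above the initial 0.5; a real system would train on logged clicks and sampled non-clicks together.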

Features. Ad: bid terms, title, body, category, … Page: url, title, keywords in body, category, … User: geographic (location, time), demographic (age, gender), behavioral. Combine the above to get (billions of) richer features, e.g. (apple ∈ ad title) ∧ (ipod ∈ page body) ∧ (20 < user age < 30). Select the subset that leads to improvement in likelihood.

Banner ads: Show Web page with display ads. The ad creates brand awareness.

How it works: The advertiser places a request with the banner ad engine, e.g. "I want 1M impressions on finance.yahoo.com, gender = male, age = …, during the month of April 2009"; the ad is stored in the ad index. The engine guarantees 1M impressions. The advertiser pays a fixed price, with no dependence on clicks. The engine does admission control and decides the allocation of ads to pages.

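Allocation example (figure). Supply is given as (Qty, Price) for regions defined by Age (e.g. > 30) and Gender (Male, Female), with cells such as (10M, $20) and (10M, $10). Demand is given as (Target, Qty): (Gender=Male, 12M) and (Age>30, 12M). A suboptimal allocation leaves (6M, $10) unallocated, value = $60M; the optimal allocation leaves (6M, $20) unallocated, value = $120M.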

Research problem. Goal: allocate demands so that the value of unallocated inventory is maximized. Similar to the transportation problem.

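Transportation problem (figure): a bipartite graph with demands d1, d2, …, di on one side and supply regions 1, 2, …, j, with supply s1, s2, …, sj and prices p1, p2, …, pj, on the other. Demand i has edges to its regions Ri, carrying flows xi1, xi2, …, xij, where xij is the number of units of demand i allocated to region j.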
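
Using the symbols from the figure, the allocation problem can be written as a linear program. This is a sketch under assumptions rather than the formulation from the talk: it assumes every demand must be met in full, and it reads $R_i$ as the set of regions that demand $i$ may be served from.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{j} p_j \Big( s_j - \sum_{i:\, j \in R_i} x_{ij} \Big) \\
\text{s.t.}\quad & \sum_{j \in R_i} x_{ij} = d_i \quad \text{for every demand } i, \\
& \sum_{i:\, j \in R_i} x_{ij} \le s_j \quad \text{for every region } j, \\
& x_{ij} \ge 0 .
\end{aligned}
```

Since $\sum_j p_j s_j$ is a constant, maximizing the value of unallocated inventory is the same as minimizing $\sum_{i,j} p_j x_{ij}$, which is the classical transportation objective with cost $p_j$ per unit shipped into region $j$.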

Ads taxonomy (online ads)
Placement: Search pages (sponsored search); Web pages (contextual, banner)
Targeting: Keywords, Attributes
Guarantees: NG, G
Model: CPC, CPM/CPC, CPM, CPC

Major trend: Ads convergence. Today: separate systems for contextual (CPC) and display (CPM) ads. Tomorrow: a unified ads marketplace (Y! Ad Exchange; CPC and CPM) that unifies contextual and display, increases supply and demand, enables better matching, and lets CPC and CPM ads compete. Advertisers create demand; publishers create the supply of pages.

Research challenge: Which ad to select between competing CPC and CPM ads? Use eCPM: for CPM ads, eCPM = bid; for CPC ads, eCPM = bid * Pr(click). Select the ad with max eCPM to maximize revenue. Problem: the ad with the highest actual eCPM may not get selected. eCPMs are estimated from historical data and can differ from actual eCPMs; the variance in estimated eCPMs is higher for CPC ads; so selection gets biased towards ads with higher variance, as they have a higher probability of over-estimated eCPMs. (Figure: estimated vs. actual eCPM for a CPC ad and a CPM ad.)
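
The selection bias described above can be reproduced with a small simulation. Everything here is a hypothetical setup chosen for illustration: one CPM ad whose eCPM is known exactly, several CPC ads that are all slightly worse in truth, and Gaussian noise on the CPC estimates.

```python
import random


def simulate(trials=10_000, n_cpc=5, noise_sd=0.2, seed=0):
    """Return (fraction of auctions won by the CPM ad,
    mean over-estimate of the winning CPC ad's eCPM)."""
    rng = random.Random(seed)
    cpm_ecpm = 1.05  # CPM ad: eCPM = bid, no estimation error (hypothetical)
    cpc_ecpm = 1.00  # actual eCPM of every CPC ad (hypothetical)
    cpm_wins, cpc_wins, overestimate = 0, 0, 0.0
    for _ in range(trials):
        # Each CPC ad's eCPM is estimated with noise around its actual value.
        best_estimate = max(rng.gauss(cpc_ecpm, noise_sd) for _ in range(n_cpc))
        if cpm_ecpm >= best_estimate:
            cpm_wins += 1
        else:
            cpc_wins += 1
            overestimate += best_estimate - cpc_ecpm
    mean_over = overestimate / cpc_wins if cpc_wins else 0.0
    return cpm_wins / trials, mean_over


cpm_win_rate, mean_overestimate = simulate()
```

Although the CPM ad has the highest actual eCPM in every auction, it wins only a small fraction of them, and whenever a CPC ad wins, its estimated eCPM exceeds its actual eCPM.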

Cloud Computing

Much of the stuff we do is compute/data-intensive. Search: index 100+ billion crawled Web pages; build the Web graph and compute PageRank. Advertizing: construct ML models to predict click probability. Cluster and classify Web pages: improve search relevance and ad matching. Data mining: analyze TBs of Web logs to compute correlations between (billions of) user profiles and page views.

Solution: Cloud computing. A cloud consists of 1000s of commodity machines (e.g., Linux PCs) and a software layer for distributing data across machines, parallelizing application execution across the cluster, and detecting and recovering from failures. Yahoo!'s software layer is based on Hadoop (open source).

Cloud computing benefits: Enables processing of massive compute-intensive tasks. Reduces computing and storage costs: resource sharing leads to efficient utilization, and it uses commodity hardware and open source software. Shields application developers from the complexity of building reliability and scalability into their programs: in large clusters, machines fail every day, and parallel programming is hard.

Cloud computing at Yahoo! 10,000s of nodes running Hadoop, TBs of RAM, PBs of disk Multiple clusters, largest is a 1600 node cluster

Hadoop's Map/Reduce Framework: a framework for parallel computation over massive data sets on large clusters. As an example, consider the problem of creating an index for word search. Input: thousands of documents/web pages, e.g. "Farmer1 has the following animals: bees, cows, goats. Some other animals …". Output: a mapping of word to document IDs, e.g. Animals: 1, 2, 3, 4, 12; Bees: 1, 2, 23, 34; Dog: 3, 9; Farmer1: 1, 7; …

Hadoop's Map/Reduce: index example (contd.)
Input split and map tasks, with sorted intermediate output:
– Machine1: Animals: 1,3; Dog: 3
– Machine2: Animals: 2,12; Bees: 23
– Machine3: Dog: 9; Farmer1: 7
Shuffle:
– Machine4: Animals: 1,3; Animals: 2,12; Bees: 23
– Machine5: Dog: 3; Dog: 9; Farmer1: 7
Reduce tasks:
– Machine4: Animals: 1,2,3,12; Bees: 23
– Machine5: Dog: 3,9; Farmer1: 7
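
The map, shuffle, and reduce stages of the index example can be imitated in a few lines of plain Python. This is a single-process sketch and does not use Hadoop's API; the document texts in `SPLITS` are hypothetical, chosen so that the result matches the document IDs in the example, and the function names are illustrative.

```python
from collections import defaultdict

# Three input splits, one per map machine: {doc_id: text} (hypothetical texts).
SPLITS = [
    {1: "animals", 3: "animals dog"},
    {2: "animals", 12: "animals", 23: "bees"},
    {9: "dog", 7: "farmer1"},
]


def map_task(split):
    """Emit sorted (word, doc_id) pairs for one input split."""
    return sorted(
        (word.lower(), doc_id)
        for doc_id, text in split.items()
        for word in set(text.split())
    )


def shuffle(map_outputs):
    """Group the intermediate pairs by word, as the shuffle phase does."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for word, doc_id in pairs:
            groups[word].append(doc_id)
    return groups


def reduce_task(word, doc_ids):
    """Merge all document IDs for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_index(splits):
    grouped = shuffle(map_task(split) for split in splits)
    return dict(reduce_task(word, ids) for word, ids in sorted(grouped.items()))


index = build_index(SPLITS)
```

In a real cluster each `map_task` call runs on the machine holding its split, and the shuffle routes every word to one reducer, so no single machine needs to see all of the input.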

Research challenges. Data distribution and replication: compute nodes are organized in racks (Rack 1, Rack 2, …, Rack i, …, Rack n), and the data blocks for a given job are distributed and replicated across nodes in a rack and across racks. Challenges: optimize distribution to provide maximum locality; optimize replication to provide the best fault tolerance. Job scheduling: job queues are based on priorities and SLAs (figure: queues Q1 … Qm labeled SDS 40%, YST 35%, ATG 25%). Challenges: schedule jobs to maximize resource utilization while preserving SLAs; schedule jobs to maximize data locality; performance modeling.

Summary: The Internet is an exciting place, with plenty of research needed to improve user experience, monetization, and scalability. Search -> information extraction, classification, … Advertizing -> click prediction, ad placement, … Cloud computing -> job scheduling, perf modeling, … Solving these problems will require techniques from multiple disciplines: ML, statistics, economics, algos, systems, …