
1 Large Scale Applications on Hadoop in Yahoo!
Vijay K Narayanan, Yahoo! Labs, 04.26.2010
Massive Data Analytics Over the Cloud (MDAC 2010)

2 Outline
- Hadoop in Yahoo!
- Common types of applications on Hadoop
- Sample applications in:
  - Content analysis
  - Web graph
  - Mail spam filtering
  - Search
  - Advertising
- User modeling on Hadoop
- Challenges and practical considerations

3 Hadoop in Yahoo!

4 By the Numbers
- About 30,000 nodes in tens of clusters
- Typical node configuration: 4 × 1 TB disks, 8 cores, 16 GB RAM
- Largest single cluster: about 4,000 nodes
- 4 tiers of clusters:
  - Application research and development
  - Production clusters
  - Hadoop platform development and testing
  - Proofs of concept and ad-hoc work
- Over 1,000 users across research, engineering, operations, etc.
- Running more than 100,000 jobs per day
- More than 3 PB of data (compressed, un-replicated volume)
- Currently running Hadoop 0.20

5 Advantages
- Wide applicability of the M/R computing model: many problems in the internet domain can be solved by data parallelism
- High throughput: stream through 100 TB of data in less than 1 hour; applications that previously took weeks now complete in hours
- Research prototyping, development, and production deployment systems are (almost) identical
- Scalable, economical, fault-tolerant
- Shared resource with common infrastructure operations
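
To make the M/R computing model concrete, here is a minimal sketch of a Hadoop Streaming job in Python that counts log events by type. The tab-separated field layout is an assumption for illustration, not an actual Yahoo! log schema.

    import sys

    def mapper():
        # Assumed input: tab-separated log lines with the event type in
        # column 2 (a made-up layout, not a real Yahoo! feed format).
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue                      # skip malformed records
            print("%s\t1" % fields[1])

    def reducer():
        # Streaming delivers lines sorted by key, so equal keys are adjacent.
        current, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = key, 0
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

A job like this would be launched with the streaming jar, passing the script as something like -mapper "events.py map" and -reducer "events.py reduce"; the same map/sort/reduce shape underlies most of the applications described below.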

6 Entities in the internet eco-system
(Diagram: users browse content such as pages and blogs, search and interact with a search engine, issue queries, and see ads (text, display, etc.) served by search advertising and content/display advertising systems.)
Yahoo! leverages Hadoop extensively in all of these domains.

7 Common Types of Applications

8 Common applications on Hadoop in Yahoo!
1. Near real-time data pipelines
- Backbone for analytics, reporting, research, etc.
- Multi-step pipelines to create data feeds from logs:
  - Web servers: page content, layout and links, clicks, queries, etc.
  - Ad servers: ad serving opportunity data, impressions
  - Clicks, beacons, conversion data servers
- Process a large volume of events:
  - Tens of billions of events/day
  - Tens of TB of (compressed) data/day
- Latencies of tens of minutes to a few hours
- Continuous runs of jobs working on chunks of data

9 Example: Data Pipelines
Pipeline stages (tens of billions of events/day): parse and transform event streams; join clicks with views; filter out robots; aggregate, sort, and partition; data quality checks.
Downstream consumers:
- Analytics: ads and content, user profiles, user sessions, network analytics, experiment reporting, optimizing traffic and engagement
- User session and click-stream analysis: path and funnel analysis, user segment analysis, interest measurements
- Modeling and scoring; experimentation
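
A hedged sketch of one such pipeline step, as a Hadoop Streaming reduce-side join of ad views with clicks plus a trivial robot filter; the record formats, the ROBOT_AGENTS list, and the key layout are illustrative assumptions, not the real feeds.

    import sys

    ROBOT_AGENTS = ("crawler", "spider", "bot")   # toy robot filter

    def mapper():
        # Assumed input: rec_type \t view_id \t payload. Tag views '0'
        # and clicks '1' so a view sorts before its clicks for one id.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t", 2)
            if len(fields) != 3:
                continue
            rec_type, view_id, payload = fields
            if any(a in payload.lower() for a in ROBOT_AGENTS):
                continue                          # filter out robots
            tag = "0" if rec_type == "view" else "1"
            print("%s|%s\t%s" % (view_id, tag, payload))

    def reducer():
        # Requires partitioning on the view_id part of the key (e.g. a
        # KeyFieldBasedPartitioner) so a view and its clicks meet here.
        view = None
        for line in sys.stdin:
            key, payload = line.rstrip("\n").split("\t", 1)
            view_id, tag = key.rsplit("|", 1)
            if tag == "0":
                view = (view_id, payload)
            elif view is not None and view[0] == view_id:
                print("%s\t%s\t%s" % (view_id, view[1], payload))  # joined

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()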

10 Common applications on Hadoop in Yahoo!
2. High-throughput engine for ETL and reporting applications
- Put large data sources (e.g. logs) on HDFS
- Run canned aggregations, transformations, normalizations
- Load reports into RDBMS/data marts
- Hourly and daily batch jobs
3. Exploratory data research
- Ad-hoc analysis and insights into data
- Leveraging Pig and custom MapReduce scripts
- Pig is based on Pig Latin (upcoming support for SQL), a procedural language designed for data parallelism, with support for nested relational data structures
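
For the canned-aggregation case, a minimal streaming sketch in Python; the page/hour/bytes layout is a made-up example. Because the reducer's output has the same shape as its input, the same script can also be passed to streaming as a combiner to shrink shuffle traffic.

    import sys

    def mapper():
        # Assumed input: page \t hour \t bytes; key on (page, hour).
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue
            page, hour, nbytes = fields
            print("%s|%s\t%s\t1" % (page, hour, nbytes))

    def reducer():
        # Emits total bytes and hit count per (page, hour).
        current, total, hits = None, 0, 0
        for line in sys.stdin:
            key, nbytes, count = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d\t%d" % (current, total, hits))
                current, total, hits = key, 0, 0
            total += int(nbytes)
            hits += int(count)
        if current is not None:
            print("%s\t%d\t%d" % (current, total, hits))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()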

11 Common applications on Hadoop in Yahoo!
4. Indexing for efficient retrieval
- Build and update indices of content, ads, etc.
- Updated in batch mode and pushed out for online serving
- Efficient retrieval of content and ads during serving
5. Offline modeling
- Supervised and unsupervised learning algorithms
- Outlier detection methods
- Association rule mining techniques
- Graph analysis methods
- Time series analysis, etc.

12 Common applications on Hadoop in Yahoo!
6. Batch and near real-time scoring applications
- Offline model scoring for upload to serving applications
- Frequency: hourly or daily jobs
7. Near real-time feedback from serving systems
- Update features and model weights based on feedback from serving
- Periodically push these updates to online scoring and serving
- Typical updates in minutes or hours
8. Monitoring and performance dashboards
- Analyze scoring and serving logs for:
  - Monitoring end-to-end performance of the scoring and serving systems
  - Measurements of model performance

13 Sample Applications

14 Application: Content Analysis
Web data: information about every web site, page, and link crawled by Yahoo!; a growing corpus of more than 100 TB of data from tens of billions of documents.
Document processing pipeline on Hadoop enriches documents with features from the page, site, etc.:
- Page segmentation
- Term document vectors and weighted variants
- Entity analysis: detection, disambiguation, and resolution of entities on a page
- Concept and topic modeling and clustering
- Page quality analysis
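
As an illustration of the term-document-vector step, a minimal mapper sketch; the input layout, the crude tokenizer, and the raw term-frequency weighting are simplifying assumptions (the weighted variants mentioned above would replace tf with e.g. tf-idf).

    import re
    import sys
    from collections import Counter

    TOKEN = re.compile(r"[a-z0-9]+")

    def mapper():
        # Assumed input: doc_id \t page text; emits doc_id -> tf vector.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t", 1)
            if len(fields) != 2:
                continue
            doc_id, text = fields
            tf = Counter(TOKEN.findall(text.lower()))
            vector = " ".join("%s:%d" % (t, c) for t, c in tf.most_common())
            print("%s\t%s" % (doc_id, vector))

    if __name__ == "__main__":
        mapper()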

15 Application: Web graph analysis
- Directed graph of the web, with aggregated views by different dimensions: sites, domains, hosts, etc.
- Large-scale analysis of this graph: 2 trillion links
- Jobs utilize 100,000+ maps, ~10,000 reduces
- ~300 TB compressed output

Attribute               | Before Hadoop           | With Hadoop
Time                    | 1 month                 | Days
Maximum number of URLs  | ~ order of 100 billion  | Many times larger

16 Application: Mail spam filtering
Scale of the problem: ~25B connections, ~5B deliveries per day; ~450M mailboxes. User feedback on spam is often late, noisy, and not always actionable.

Problem                              | Algorithm                                          | Data size                      | Running time on Hadoop
Detecting spam campaigns             | Frequent itemset mining                            | ~20 MM spam votes              | 1 hour
Gaming of spam IP votes by spammers  | Connected components (squaring a bipartite graph)  | ~500K spammers, 500K spam IPs  | 1 hour
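
A rough sketch of campaign detection cast as frequent-itemset counting: each spam vote is a set of attribute:value items, the mapper emits candidate itemsets (pairs here, for brevity), and the reducer keeps those above a support threshold. A real Apriori/FP-growth run would iterate to larger itemsets; the input format and threshold are assumptions.

    import sys
    from itertools import combinations

    MIN_SUPPORT = 1000   # illustrative support threshold

    def mapper():
        # Assumed input: one comma-separated attribute:value set per vote.
        for line in sys.stdin:
            items = sorted(set(line.rstrip("\n").split(",")))
            for pair in combinations(items, 2):
                print("%s,%s\t1" % pair)

    def reducer():
        # Counts each candidate itemset; emits only the frequent ones.
        current, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None and count >= MIN_SUPPORT:
                    print("%s\t%d" % (current, count))
                current, count = key, 0
            count += int(value)
        if current is not None and count >= MIN_SUPPORT:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()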

17 Application: Mail Spam Filtering — Campaigns
Sample campaigns discovered by frequent itemset mining (count, then itemset):
92595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77)
92457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78)
92447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227)
92432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227)
92376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149)
92184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149)
91990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227)
91899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227)
91743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78)
91706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227)
91693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227)
91689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203)

18 Application: Search Ranking
- Rank web pages based on relevance to queries
- Features based on content of page, site, queries, web graph, etc.
- Train machine learning models to rank relevant pages for queries
- Periodically learn new models

Dimension     | Before Hadoop    | Using Hadoop
Features      | ~100s            | ~1000s
Running time  | ~days to weeks   | ~hours

19 Application: Search Assist™
- Related concepts occur together
- Analyze ~3 years of logs
- Build dictionaries on Hadoop and push them to online serving

Dimension         | Before Hadoop  | Using Hadoop
Time              | 4 weeks        | < 30 minutes
Language          | C++            | Python
Development time  | 2-3 weeks      | 2-3 days

20 Applications in Advertising
- Expanding sets of seed keywords for matching with text ads: analyze text corpora and user query sessions, cluster keywords, etc.
- Indexing ads for fast retrieval: build and update an index of more than a billion text ads
- Response prediction and relevance modeling
- Categorization of pages and queries to help in matching (adult pages, gambling pages, etc.)
- Forecasting of ad inventory
- User modeling
- Model performance dashboards

21 User Modeling on Hadoop

22 User activities
- Large dimensionality of possible user activities, but a typical user has a sparse activity vector
- Attributes of the events change over time
- Building a pipeline on Hadoop to model user interests from activities

Attribute  | Possible values      | Typical values per user
Pages      | ~ MM                 | 10 - 100
Queries    | ~ 100s of MM         | Few
Ads        | ~ 100s of thousands  | 10s

23 User Modeling Pipeline
5 main components to train, score, and evaluate models:
1. Data generation
   a. Data acquisition
   b. Feature and target generation
2. Model training
3. Offline scoring and evaluation
4. Batch scoring and upload to online serving
5. Dashboard to monitor the online performance

24 Overview of User Modeling Pipeline
(Diagram: a work flow manager drives three Hadoop stages over HDFS. Data generation (filtering, projection, joins, transformations, aggregations, merging) turns user event history files into feature and target sets; the modeling engine performs model training; scoring and evaluation (scoring, score- and graph-based evaluation) produces model files, scores, and reports, which flow to the online serving systems.)

25 1a. Data Acquisition
Input: multiple user event feeds (browsing activities, search, etc.) per time period.

User | Time | Event                                            | Source
U1   | T0   | visited autos.yahoo.com                          | Web server logs
U1   | T1   | searched for car insurance                       | Search logs
U1   | T2   | browsed stock quotes                             | Web server logs
U1   | T3   | saw an ad for discount brokerage, but did not click | Ad logs
U1   | T4   | checked Yahoo Mail                               | Web server logs
U1   | T5   | clicked on an ad for auto insurance              | Ad logs, click server logs

26 1a. Data Acquisition
Map operations turn the raw event feeds into Normalized Events (NE) on HDFS:
- Project relevant event attributes
- Filter irrelevant events
- Tag and transform (categorization, topics, ...)
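
An illustrative "project / filter / tag" mapper for this step; the event schema, the DROP_EVENTS set, and the categorize() stub are assumptions standing in for the real feed formats and categorizers.

    import sys

    DROP_EVENTS = {"mail_usage", "heartbeat"}   # assumed irrelevant events

    def categorize(event, detail):
        # Stand-in for the real categorization/topic taggers.
        return "Auto Insurance" if "insurance" in detail.lower() else "Other"

    def mapper():
        # Assumed raw layout: user \t timestamp \t event \t detail \t ...
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue
            user, ts, event, detail = fields[:4]     # project attributes
            if event in DROP_EVENTS:                 # filter irrelevant events
                continue
            print("\t".join((user, ts, event, categorize(event, detail))))  # tag

    if __name__ == "__main__":
        mapper()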

27 1a. Data Acquisition
Output: a single normalized feed containing all events for all users per time period.

User | Time | Event            | Tag
U1   | T0   | Content browsing | Autos, Mercedes Benz
U2   | T2   | Search query     | Category: Auto Insurance
...  | ...  | ...              | ...
U23  | T23  | Mail usage       | Drop event
U36  | T36  | Ad click         | Category: Auto Insurance

28 1b. Feature and Target Generation
Features: summaries of user activities over a time window
- Aggregates, moving averages, rates, etc. over moving time windows
- Support online updates to existing features
Targets: constructed in the offline model training phase
- Typically user actions in the future time period indicating interest:
  - Clicks/click-through rates on ads and content
  - Site and page visits
  - Conversion events: purchases, quote requests, sign-ups to newsletters, registrations, etc.

29 1b. Feature and Target Windows
(Timeline illustration: events such as a query and a visit to Y! Finance fall in the feature window leading up to T0; the interest event falls in the target window after T0; both windows move forward in time.)

30 1b. Feature Generation
(Diagram: mappers read normalized events (NE1 ... NE9) from HDFS and emit records keyed by user, e.g. (U1, Event 1); the shuffle brings all events for U1 to one reducer and all events for U2 to another; the reducers aggregate them into the feature set. Example grouped events for U1: content browsing (Autos, Mercedes Benz) at T0, a search query (Category: Auto Insurance) at T2, a click on a search result (Category: Insurance premiums) at T3, an ad click (Category: Auto Insurance) at T4.)
Summaries over user event history: aggregates within the window, time- and event-weighted averages, event rates, etc.
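
A sketch of the reducer side of this flow: the shuffle groups normalized events by user id, so the reducer sees all of one user's events together and folds them into windowed aggregates. The feature names and window length are illustrative assumptions.

    import sys
    from collections import Counter

    WINDOW_DAYS = 7.0    # assumed length of the feature window

    def emit(user, counts, total):
        feats = ["%s_count:%d" % (tag.replace(" ", "_"), n)
                 for tag, n in sorted(counts.items())]
        feats.append("event_rate:%.4f" % (total / WINDOW_DAYS))
        print("%s\t%s" % (user, " ".join(feats)))

    def reducer():
        # Input (assumed): user \t timestamp \t event \t tag, sorted by user.
        current, counts, total = None, Counter(), 0
        for line in sys.stdin:
            user, ts, event, tag = line.rstrip("\n").split("\t")
            if user != current:
                if current is not None:
                    emit(current, counts, total)
                current, counts, total = user, Counter(), 0
            counts[tag] += 1
            total += 1
        if current is not None:
            emit(current, counts, total)

    if __name__ == "__main__":
        reducer()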

31 1b. Joining Features and Targets
- Low target rates: typical response rates are in the range of 0.01% to 1%, and many users have no interest activities in the target window
- First construct the targets, then compute the feature vector only for users with targets
- Reduces the need for computing features for users without target actions
- Allows stratified sampling of users with different target and feature attributes
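
A hedged sketch of the "targets first" idea as a reducer: records are keyed by user with a tag that sorts targets ('0') before feature events ('1') (which assumes partitioning on the user part of the key plus a secondary sort on the tag), so feature computation can be skipped entirely for users who never performed a target action.

    import sys

    def reducer():
        # Assumed input key: "user|tag" with tag '0' = target, '1' = feature.
        current_user, has_target = None, False
        for line in sys.stdin:
            key, payload = line.rstrip("\n").split("\t", 1)
            user, tag = key.rsplit("|", 1)
            if user != current_user:
                current_user, has_target = user, False
            if tag == "0":
                has_target = True        # this user has a target action
                print("%s\tTARGET\t%s" % (user, payload))
            elif has_target:
                print("%s\tFEATURE\t%s" % (user, payload))
            # feature events for users without targets are silently dropped

    if __name__ == "__main__":
        reducer()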

32 2. Model Training
Supervised models trained using a variety of techniques:
- Regressions
  - Different flavors: linear, logistic, Poisson, etc.
  - Constraints on weights
  - Different regularizations: L1 and L2
- Decision trees: used for both regression and ranking problems
- Boosted trees
- Naïve Bayes
- Support vector machines: commonly used in text classification, query categorization, etc.
- Online learning algorithms

33 2. Model Training
- Maximum entropy modeling: log-linear link function; classification problems with large-dimensional, sparse features
- Conditional Random Fields: sequence labeling and named-entity recognition problems
- Some of these algorithms are implemented in Mahout; not all algorithms are easy to implement in the MR framework
- Train one model per node: each node can train a model for one target response
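
To illustrate "one model per node": each node can run an ordinary single-machine trainer for one target response. A minimal sketch of logistic regression with L2 regularization in plain NumPy, under the assumption of a dense feature matrix; nothing here is the actual Yahoo! trainer.

    import numpy as np

    def train_logistic(X, y, l2=1.0, lr=0.1, iters=200):
        """X: (n, d) feature matrix; y: (n,) 0/1 targets for one response."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X.dot(w)))        # predicted probabilities
            grad = X.T.dot(p - y) / n + l2 * w / n     # log-loss + L2 gradient
            w -= lr * grad
        return w

    if __name__ == "__main__":
        # Synthetic smoke test: recover weights from generated data.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))
        true_w = rng.normal(size=20)
        y = (X.dot(true_w) + rng.normal(size=1000) > 0).astype(float)
        w = train_logistic(X, y)
        print("recovered weights (first 5):", np.round(w[:5], 2))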

34 3. Offline Scoring and Evaluation
- Apply weights from the model training phase to features from the feature generation component
  - Mapper operations only
- Janino* equation editor
  - The embedded compiler can compile arbitrary scoring equations
  - Can also embed any class invoked during scoring
  - Can modify features on the fly before scoring
- Evaluation metrics
  - Sort by scores and compute metrics in the reducer
  - Precision vs. recall curve
  - Lift charts
* http://docs.codehaus.org/display/JANINO/Home
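
A sketch of map-only scoring plus the evaluation reducer, under illustrative assumptions: a toy hard-coded weight dictionary stands in for the Janino-compiled scoring equations, and the reducer assumes the framework has been configured to sort records by descending score (e.g. with a key-field comparator).

    import math
    import sys

    WEIGHTS = {"bias": -2.0, "autos_count": 0.8, "insurance_count": 1.2}  # toy model

    def mapper():
        # Assumed input: user \t label(0/1) \t space-separated name:value features.
        for line in sys.stdin:
            user, label, feats = line.rstrip("\n").split("\t")
            score = WEIGHTS["bias"]
            for f in feats.split():
                name, value = f.rsplit(":", 1)
                score += WEIGHTS.get(name, 0.0) * float(value)
            prob = 1.0 / (1.0 + math.exp(-score))
            print("%.6f\t%s\t%s" % (prob, user, label))

    def reducer(total_positives):
        # Fed records sorted by descending score, trace the P/R curve:
        # at each rank, precision = tp/rank, recall = tp/positives.
        tp = 0
        for rank, line in enumerate(sys.stdin, 1):
            label = line.rstrip("\n").split("\t")[2]
            tp += int(label == "1")
            print("rank=%d precision=%.3f recall=%.3f"
                  % (rank, tp / float(rank), tp / total_positives))

    if __name__ == "__main__":
        if sys.argv[1:2] == ["map"]:
            mapper()
        else:
            reducer(float(sys.argv[1]))   # pass the total positive count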

35 Modeling Workflow
(Diagram, training phase: data acquisition reads user event history; target generation and feature generation produce targets and features; model training outputs weights.
Evaluation phase: the same data acquisition, target generation, and feature generation feed model scoring, whose scores are evaluated against the targets.)

36 4. Batch Scoring
(Diagram: data acquisition reads user event history; feature generation produces features; model scoring applies the trained weights and uploads the resulting scores to the online serving systems.)

37 User modeling pipeline system

Component                      | Data processed                  | Time
Data acquisition               | ~1 TB per time period           | 2-3 hours
Feature and target generation  | ~1 TB × size of feature window  | 4-6 hours
Model training                 | ~50-100 GB                      | 1-2 hours for 100s of models
Scoring                        | ~500 GB                         | 1 hour

38 Challenges and Practical Considerations

39 Current challenges
- Limited size of the name-node
  - File and block meta-data in HDFS is kept in RAM on the name-node
  - A name-node with 64 GB RAM handles ~100 million file blocks and ~60 million files
  - Practical upper limit of about 4,000 nodes per cluster
  - Adding more reducers leads to a large number of small files
- Copying data in/out of HDFS is limited by the read/write rates of external file systems
- High latency for small jobs: the set-up overhead may be large relative to the job itself
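
A back-of-envelope check using only the figures quoted above, to show why object count rather than raw data volume is the binding constraint on the name-node:

    heap_bytes = 64 * 2**30               # 64 GB name-node heap
    metadata_objects = 60e6 + 100e6       # ~60M files + ~100M blocks
    print("bytes per object: ~%.0f" % (heap_bytes / metadata_objects))
    # -> roughly 430 bytes of heap per file/block object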

40 Practical considerations
Reduce the amount of data transferred from mapper to reducer:
- There is still a disk write/read in going from mapper to reducer
- Mapper output (= reducer input) files can become large; you can run out of disk space for intermediate storage
- Project only the subset of relevant attributes in the mapper to send to the reducer
- Use combiners
- Compress intermediate data
Distribution of keys:
- A reducer can become a bottleneck for common keys
- Use a Partitioner to control the distribution of map records to reducers, e.g. distribute mapper records with common keys across multiple reducers in a round-robin manner, as sketched below
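
A sketch of the round-robin trick via key salting in the mapper (an equivalent effect can be had with a custom Partitioner in Java): records for known hot keys are spread over several reducers, and a cheap follow-up pass re-combines the partial aggregates. HOT_KEYS and the fan-out are illustrative assumptions.

    import random
    import sys

    HOT_KEYS = {"US", "GB"}   # keys known to dominate the data (assumed)
    FAN_OUT = 8               # reducers to spread each hot key over

    def mapper():
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t", 1)
            if len(fields) != 2:
                continue
            key, value = fields
            if key in HOT_KEYS:
                key = "%s#%d" % (key, random.randrange(FAN_OUT))  # salt
            print("%s\t%s" % (key, value))

    # The reducer aggregates per salted key as usual; a follow-up job
    # strips the "#n" suffix and merges the FAN_OUT partial aggregates.

    if __name__ == "__main__":
        mapper()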

41 Practical considerations
Judicious partitioning of data:
- Multiple files help parallelism, but run into name-node limits
- A smaller number of files keeps the name-node happy, but at the expense of parallelism
Hadoop is less than ideal for distributed algorithms requiring communication (e.g. distributed decision trees); MPI is run on top of the cluster for communication.

42 Acknowledgment
Numerous wonderful colleagues!
Questions?

43 Appendix: More Applications

44 Application: Content Optimization
- Optimizing content across the Yahoo! portal pages
  - Rank articles from an editorial pool based on interest (Yahoo! Front Page, Yahoo! News, etc.)
  - Customize feeds on the My Yahoo! portal page
  - Top buzzing queries
  - Content recommendations (RSS feeds)
- Hadoop computes feature aggregates and model weight updates in near real time, which are uploaded to online serving

45 Yahoo! Front Page — Case Study
(Screenshot of the Yahoo! front page annotated with the Hadoop-backed systems behind it: ads optimization, the search index, machine-learned spam filters, RSS feed recommendations, and content optimization.)

46 Application: Search Logs Analysis
- Analyze search result view and click logs
- Reporting and measurement of user click response
- User session analysis
- Enrich, expand, and re-write queries: spelling corrections, suggesting related queries
- Traffic quality and protection: detect and filter out fraudulent traffic and clicks

47 Mail Spam Filtering: Connected Components
Y1 = Yahoo user 1, Y2 = Yahoo user 2; IP1 = IP address of the host from which Y1 voted not-spam.
(Diagram: in the bipartite vote graph, y1 and y2 both vote from IP1 and IP2; SQUARING the graph projects it onto users, producing an edge y1-y2 with weight = 2, the number of shared IPs.)
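
A sketch of this squaring step as a streaming job: the mapper emits (IP, user) for each not-spam vote, and the reducer emits a user-user edge for every pair of users sharing an IP; a second job (not shown) would sum the edge weights and run connected components on the resulting user-user graph. The input format is an assumption.

    import sys
    from itertools import combinations

    def mapper():
        # Assumed input: user \t voting IP.
        for line in sys.stdin:
            user, ip = line.rstrip("\n").split("\t")
            print("%s\t%s" % (ip, user))

    def reducer():
        # All users who voted from one IP arrive together; emit each pair.
        def emit_pairs(users):
            for u, v in combinations(sorted(set(users)), 2):
                print("%s,%s\t1" % (u, v))    # weight 1 per shared IP
        current_ip, users = None, []
        for line in sys.stdin:
            ip, user = line.rstrip("\n").split("\t")
            if ip != current_ip:
                emit_pairs(users)
                current_ip, users = ip, []
            users.append(user)
        emit_pairs(users)

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()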

48 Mail Spam Filtering: Connected Components — Voting
(Diagram: Yahoo IDs voting not-spam (y1, y2, y3) connect a set of voted-from IPs (IP1, IP2) to a set of voted-on IPs (IP3, IP4). The set of IPs/YIDs used exclusively for voting not-spam exposes the set of (likely new) spamming IPs which are worth voting for.)

