Who We Are
Media6° is an Online Advertising Company Specializing in Social Graph Targeting
–Birds of a feather flock together!
–We build custom audiences for marketers composed of the existing customers of a brand and the consumers most closely connected to them via the social graph.
We use non-personally identifiable data from across social media to deliver highly scalable audiences across the top comScore 1000 sites.
How We Do It
Gather Data, Build Models, Identify Targets, Show Ads
–sample browser visitation data from micro (social network) and macro (blog) user-generated content (UGC) sites
–acquire browser visitation data from client assets
–correlate brand interest with UGC affinities to identify brand neighbors
–build out brand-specific audiences in ad exchanges
–purchase impressions on brand neighbor browsers from exchanges
Why Hadoop?
Business Intelligence Needs:
–monitor data gathering for reach, value and data partner payment
–monitor campaign audience generation and expansion
–monitor app server activity levels and data quality
Previous experience:
–online advertising platform reporting
–relational databases, data warehousing and ETL processing
Needed a web-scale solution that:
–was affordable (free) and would run on available hardware
–could handle initial logs of 50 to 100 million lines/day
–could grow to handle an expected 1 billion log lines/day
Possibilities considered were a custom application and Hadoop
–a custom application offered known capabilities and relatively quick implementation, but would likely be outgrown
–Hadoop promised a proper foundation, but was an unknown, with a learning curve and the potential that it wouldn't meet our needs
Initial Implementation
Had legacy (2004) hardware inherited from prior company
–3 slaves: dual 3GHz single-core Xeon, 4GB RAM, 120GB disk
–master: dual 3GHz single-core Xeon, 4GB RAM, 660GB disk
–running Linux CentOS 5, Java 1.6
Set-up of development environment and cluster took about 3 days
–master setup took a day
–slave setup took about an hour each
–cluster set-up took a couple of days (Retrying connect to server...)
Developed custom Java Map/Reduce application
–mapper included 20 classes to parse log lines into fields and do counts
–combiner and reducer consisted of one class to aggregate counts
–development time was approximately 2 weeks for initial prototype
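The mapper/combiner/reducer split described above can be illustrated with a minimal sketch. This is not the original Java application: it is plain Python with no Hadoop dependency, and the tab-separated log layout, the field names (site, campaign, event) and the function names are all hypothetical. It shows only why a single aggregation class can serve as both combiner and reducer: summing counts is associative, so partial sums per input split can be summed again.

```python
from collections import Counter

# Hypothetical tab-separated log layout: site, campaign, event type.
FIELDS = ("site", "campaign", "event")


def map_line(line):
    """Parse one log line into fields and emit (group key, counts) pairs."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return []  # skip malformed lines
    record = dict(zip(FIELDS, parts))
    key = (record["site"], record["campaign"])
    return [(key, Counter({record["event"]: 1}))]


def combine(pairs):
    """Sum counts per key; the same logic works as combiner and reducer."""
    totals = {}
    for key, counts in pairs:
        totals.setdefault(key, Counter()).update(counts)
    return sorted(totals.items())


def run_job(splits):
    """Map each split, combine locally, then reduce across all splits."""
    combined = []
    for split in splits:
        mapped = [pair for line in split for pair in map_line(line)]
        combined.extend(combine(mapped))  # combiner: per-split partial sums
    return dict(combine(combined))        # reducer: final sums
```

Calling run_job with a list of splits, each a list of log lines, returns a dictionary keyed by (site, campaign) whose values hold the per-event counts.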
v1.0
Production configuration:
–5 legacy servers as slaves, 1 legacy server as master
–CentOS 5, Java 1.6, Hadoop 0.16; upgrade from 0.15 to 0.16 was seamless
11 aggregation sets which group on 4 fields and have 10 counts
Maximum throughput of 6,000 lines/second
–jobs consisted of up to 5 million lines in up to 300 files and took from 4 to more than 30 minutes
–processed an average of 160 million lines/day with peaks of 260 million
–no data was maintained in the Hadoop file system
Normal behavior was for the Hadoop cluster to run continuously, starting between 3 and 7 pm and finishing around 5 am.
We experienced no Hadoop-specific errors or unplanned downtime in 8 months of continuous operation from May to December 2008.
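As a rough check on the v1.0 figures, the arithmetic below converts the stated maximum throughput into a daily ceiling and into the minimum hours needed for the average and peak daily volumes. The function names are illustrative only, and the results are lower bounds because they assume every job runs at the maximum rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_capacity(lines_per_second):
    """Lines processed in 24 hours at a constant rate."""
    return lines_per_second * SECONDS_PER_DAY


def hours_to_process(lines, lines_per_second):
    """Hours needed to process a volume at a constant rate."""
    return lines / lines_per_second / 3600


V1_MAX_RATE = 6_000  # lines/second, from the slide

print(daily_capacity(V1_MAX_RATE))                           # 518400000
print(round(hours_to_process(160_000_000, V1_MAX_RATE), 1))  # 7.4
print(round(hours_to_process(260_000_000, V1_MAX_RATE), 1))  # 12.0
```

A peak day therefore needs about 12 hours even at the maximum rate, which is in line with the cluster running from late afternoon until around 5 am.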
v2.0
Updated configuration
–6 slaves: dual 2.5GHz quad-core Xeon, 16GB RAM, 4TB disk
–2 masters: fully fault tolerant (DRBD) with automatic failover, dual 2.5GHz quad-core Xeon, 16GB RAM, 1.4TB disk
–CentOS 5, Java 1.6, Hadoop 0.18; upgrade from 0.16 was seamless
Currently have 16 aggregations plus jobs to gather browser-specific data and generate input for data models
–currently maintain 8.5TB of data with a replication factor of 2 (17TB total); a replication factor of 2 is used to maximize available disk space
Experienced throughput of more than 22,500 lines/second, estimated capacity of more than 40,000 lines/second
–jobs consist of up to 15 million lines and up to 1000 files
–process 360 files/hr and between 6 and 30 million lines/hr
–average 450 million lines/day; record was 771 million
Normal behavior is for Hadoop to be essentially idle 40% of the time
We have still experienced no Hadoop-specific errors
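The storage trade-off behind the replication factor can be worked through with the numbers on this slide. The sketch below assumes raw capacity is simply six slaves times 4TB, ignoring operating system and work-space overhead, so the utilization figures are approximate; the names are illustrative.

```python
def stored_tb(logical_tb, replication):
    """Physical space consumed by data stored with a given replication factor."""
    return logical_tb * replication


RAW_TB = 6 * 4  # six slaves with 4TB of disk each, per the slide

for replication in (2, 3):
    used = stored_tb(8.5, replication)
    print(replication, used, round(used / RAW_TB, 2))
# 2 17.0 0.71
# 3 25.5 1.06
```

Under these assumptions, 8.5TB at a replication factor of 3 would need more disk than the six slaves provide, while a factor of 2 uses about 71% of it.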
Primary Processing Cycle
Every 10 minutes, 60 Tomcat servers generate gzipped log files of between 3MB and 100MB
A cron job runs every 10 minutes to download the files to a to-do directory on the Hadoop master
4 additional cron jobs run every 5 minutes to
–copy batches of files into an HDFS directory named with a job ID
–run an initial m/r job to generate aggregations and extract browser-specific event data
–copy the aggregated data to the local file system, move the raw input data to an archive directory within HDFS and copy the browser-specific data into a secondary staging directory within HDFS
–load the aggregated data into MySQL tables
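The first of the four steps, moving a batch of downloaded files into a directory named with a job ID, might look roughly like the sketch below. It uses local directories as a stand-in for HDFS, and the function name, the job ID format and the 300-file batch limit are assumptions made for illustration, not details of the original cron scripts.

```python
import os
import shutil
import tempfile


def stage_batch(todo_dir, staging_root, job_id, max_files=300):
    """Move up to max_files from the to-do directory into a per-job directory.

    Returns the job directory, or None when there is nothing to process.
    """
    names = sorted(os.listdir(todo_dir))[:max_files]
    if not names:
        return None
    job_dir = os.path.join(staging_root, job_id)
    os.makedirs(job_dir)
    for name in names:
        shutil.move(os.path.join(todo_dir, name), os.path.join(job_dir, name))
    return job_dir


if __name__ == "__main__":
    # Small demonstration with temporary directories and empty files.
    with tempfile.TemporaryDirectory() as todo, tempfile.TemporaryDirectory() as staging:
        for i in range(5):
            open(os.path.join(todo, f"log{i}.gz"), "w").close()
        print(stage_batch(todo, staging, "job_0001", max_files=3))
        print(sorted(os.listdir(todo)))  # files left for the next run
```

Capping the batch size keeps each job bounded, and anything left in the to-do directory is picked up by the next run.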
Browser Data Processing
Every 30 minutes a cron job runs to pull the latest browser-specific data from what has been extracted from the logs over the course of the day.
–on average 1.25 million new browsers are added every hour, with an average of 30 million unique browsers with new data touched daily
Every morning at 2:30 am, details of brand-specific browser activity accumulated the prior day are compiled using a map job with no reducer.
–approximately 1.75 million (6%) of browsers have brand-specific activity
–from these browser records, approximately 20 million brand relational data points are identified
–the results are exported to the local file system and imported into MySQL tables which feed our data modeling
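A map job with no reducer works here because each browser record can be processed independently, so no cross-record aggregation is needed. The sketch below illustrates that idea in plain Python. The record layout (a browser ID plus a list of visited sites), the site-to-brand lookup table and the function names are hypothetical and are not taken from the original job.

```python
def map_brand_activity(browser_id, visited_sites, brand_sites):
    """Emit (browser_id, brand, visit_count) for brand-related visits only."""
    counts = {}
    for site in visited_sites:
        brand = brand_sites.get(site)
        if brand is not None:
            counts[brand] = counts.get(brand, 0) + 1
    return [(browser_id, brand, n) for brand, n in sorted(counts.items())]


def run_map_only(records, brand_sites):
    """Apply the mapper to every record; output is written as-is, no reduce."""
    output = []
    for browser_id, visited_sites in records:
        output.extend(map_brand_activity(browser_id, visited_sites, brand_sites))
    return output
```

Browsers with no brand-related visits emit nothing, which is how a small fraction of daily browsers can yield the full set of brand relational data points.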
HDFS Layout/Maintenance
HDFS space is divided between work space, raw log archives and browser history data
–persistent file space utilization is limited to 70% to allow for work space and redistribution of data if a slave fails
–raw logs are maintained for 14 days in the original m/r input directories
Browser history data is partitioned by date and divided into:
–21 days of browser data extracted from raw logs
–90 days of daily browser data
–90 days of brand relational data
A cron job runs once an hour and removes the oldest files when utilization is greater than 70%.
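The hourly cleanup rule can be sketched as a pure function that decides which files to delete. This is an illustration under stated assumptions, not the original cron script: files are represented as hypothetical (path, date, size) tuples with ISO-formatted dates, and the function only returns the paths to remove rather than touching HDFS.

```python
def files_to_remove(files, capacity_bytes, limit=0.70):
    """Pick the oldest files to delete until utilization is at or below limit.

    files: list of (path, date, size_bytes), with dates as 'YYYY-MM-DD' strings
    so that string order matches chronological order.
    """
    used = sum(size for _, _, size in files)
    doomed = []
    for path, _, size in sorted(files, key=lambda f: f[1]):
        if used / capacity_bytes <= limit:
            break
        doomed.append(path)
        used -= size
    return doomed
```

Because the data is partitioned by date, removing the oldest partitions first frees space without disturbing recent history.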
v3.0+
Reduce or eliminate dependence on MySQL for data set generation
–data set builds currently take 50 to 80 hours; aim is to reduce that to 10% or less
Replace static MySQL data sets with a distributed cache with real-time updates
Potential for use of HBase and Cascading