Thursdays 9:00 ET/PT Building Web Analytics on Hadoop at CBS Interactive Michael Sun Big Data Workshop, Boston 03/10/2012.

1 Thursdays 9:00 ET/PT Building Web Analytics on Hadoop at CBS Interactive Michael Sun Big Data Workshop, Boston 03/10/2012

2 Brands and Websites of CBS Interactive, Samples: GAMES & MOVIES; TECH, BIZ & NEWS; SPORTS; ENTERTAINMENT; MUSIC

3 CBSi Scale
Top 20 global web property
235M worldwide monthly unique users
Hadoop cluster size:
– Current workers: 40 nodes (260 TB)
– This month: adding 24 nodes, 800 TB total
– Next quarter: ~80 nodes, ~1 PB
DW peak processing: > 500M events/day globally, doubling next quarter (ad logs)
1 - Source: comScore, March 2011

4 Web Analytics Processing
Collect web logs for web metrics analysis
– Web logs track clicks, page views, downloads, streaming video events, ad events, etc.
Provide internal metrics for website monitoring
A/B testing
Billers apps, external reporting
Ad event tracking to support sales
Provide data service
– Support marketing by providing data for data mining
– User-centric datastore (stay tuned)
– Optimize user experience

5
2105595680218152960 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115& e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs& e=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 - 1

schemas.append(Schema((  # schemas[0]
    SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
    SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
    SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
    SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
    SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
    SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'),
    SchemaField('http_status', 'int', nullable=True, signed=True),
    SchemaField('bytes_sent', 'int', nullable=True, signed=True),
    SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2)
)))
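The log record and the schema on this slide line up roughly field by field. As an illustration only, the leading fields could be pulled apart with the standard library as sketched here. The regex, the `parse_line` helper, and the shortened `SAMPLE` line are assumptions made for this example; they are not Lumberjack's parser, which the slides do not show.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical pattern: event id, IP (or "-"), one empty field, timestamp,
# request line, status, bytes sent, then everything else (cookie, agent, ...).
LOG_RE = re.compile(
    r'(?P<web_event_id>\d+) (?P<ip_address>\S+) (?P<empty1>\S+) '
    r'\[(?P<req_date>[^\]]+)\] '
    r'(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>HTTP/\d\.\d) '
    r'(?P<http_status>\d{3}) (?P<bytes_sent>\d+) (?P<rest>.*)'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the numeric fields, mirroring the 'int' columns of the schema.
    for name in ('web_event_id', 'http_status', 'bytes_sent'):
        rec[name] = int(rec[name])
    # Decode the tracking parameters carried in the query string.
    query = urlsplit(rec['url']).query
    rec['params'] = {k: v[0] for k, v in parse_qs(query).items()}
    return rec

# Shortened version of the slide's record (most query parameters omitted).
SAMPLE = ('2105595680218152960 - - [07/Mar/2012:16:00:00 +0000] '
          'GET /clear/c.gif?ts=1331136009989&sid=115&brlang=zh-CN HTTP/1.1 '
          '200 42 clgf=Cg+5E02cT/eWAAAAo0Y Mozilla/5.0 (Windows NT 5.1)')

record = parse_line(SAMPLE)
```

Returning `None` for unparseable lines leaves the caller free to count or log rejects instead of aborting a long batch.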

6 Modernize the platform
Web log processing on a proprietary platform had reached its limits
– Code base was 10 years old
– The vendor no longer supported the version we used
– Not fault-tolerant
– Upgrading to the newer version was not cost-effective
Data volume is increasing all the time
– 300+ web sites
– Video tracking is increasing the fastest
– Need to support new business initiatives
Use open source systems as much as possible

7 Hadoop to the Rescue / Research
Open source: scalable data processing framework based on MapReduce
Processing PB of data using Hadoop
Distributed file system (HDFS)
– High throughput
– Fault-tolerant
Distributed computing model
– Based on a functional programming model
  − MapReduce (M|S|R)
Execution engine
– Used as a cluster for ETL
– Collect data (distributed harvester)
– Analyze data (M/R, streaming + scripting + R, Pig/Hive)
– Archive data (distributed archive)
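The M|S|R model can be mimicked in a single process: a mapper emits key/value pairs, a sort brings equal keys together, and a reducer folds each run of values. The sketch below assumes hypothetical tab-separated input of site name and event type; `mapper`, `reducer`, and `run_local` are illustrative names, and on a real cluster Hadoop performs the sorting and distribution.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (site, 1) for every page-view event."""
    for line in lines:
        site, event = line.rstrip('\n').split('\t')
        if event == 'pageview':
            yield site, 1

def reducer(pairs):
    """Sum the counts per key; pairs must arrive sorted by key."""
    for site, group in groupby(pairs, key=itemgetter(0)):
        yield site, sum(count for _, count in group)

def run_local(lines):
    """Map, sort, reduce in one process, like `map | sort | reduce` in a shell."""
    return list(reducer(sorted(mapper(lines))))

# Hypothetical input records: "<site>\t<event type>"
events = [
    'site_a\tpageview',
    'site_b\tpageview',
    'site_a\tclick',
    'site_a\tpageview',
]
counts = run_local(events)
```

Because the reducer only relies on sorted input, the same two functions could be split into separate stdin/stdout scripts without changing their logic.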

8 The Plan
Build web log collection (codename Fido)
– Apache web logs piped to cronolog
– Hourly M/R collector job to
  − Gzip hourly log files & checksum
  − Scp from web servers to Hadoop datanodes
  − Put on HDFS
Build Python ETL framework (codename Lumberjack)
– Based on stdin/stdout streaming, one process/one thread
– Can run stand-alone or on Hadoop
– Pipeline
– Filter
– Schema
Build web log processing with Lumberjack
– Parse
– Sessionize
– Lookup
– Format data/Load to DB
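The slide names Pipeline and Filter as building blocks of the framework but does not show them. One way a stdin/stdout design of this kind could look is sketched here; the `Filter`, `DropEmpty`, `SplitTabs`, and `Pipeline` classes are hypothetical and are not Lumberjack's API.

```python
import sys

class Filter:
    """Base class: turn an iterable of records into another iterable."""
    def process(self, records):
        raise NotImplementedError

class DropEmpty(Filter):
    """Drop blank lines."""
    def process(self, records):
        return (r for r in records if r.strip())

class SplitTabs(Filter):
    """Split each line into a list of tab-separated fields."""
    def process(self, records):
        return (r.rstrip('\n').split('\t') for r in records)

class Pipeline:
    """Chain filters so each lazily consumes the previous filter's output."""
    def __init__(self, *filters):
        self.filters = filters

    def run(self, records):
        for f in self.filters:
            records = f.process(records)
        return records

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream records from stdin to stdout in one process and one thread."""
    pipeline = Pipeline(DropEmpty(), SplitTabs())
    for fields in pipeline.run(stdin):
        stdout.write('|'.join(fields) + '\n')
```

A script that ends by calling `main()` reads lines and writes lines, so it can be tested from a shell pipe and, in principle, handed unchanged to Hadoop Streaming as a mapper or reducer.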

9 [Architecture diagram. Labels: Sites; Apache Logs; Distribute log by Fido; External data sources; CMS Systems; Hadoop; HDFS; Python-ETL; MapReduce; Hive; DW Database; Web Analytics; Web metrics; Billers; Data mining]

10 Clickmap

11 Web log Processing by Hadoop Streaming and Python-ETL
Parsing web logs
– IAB filtering and checking
– Parsing user agents by regex
– IP range lookup
– Look up product key, etc.
Sessionization
– Prepare Sessionize
– Sessionize
– Filter-unpack
Process huge dimensions, URL/Page Title
Load Facts
– Format load data/Load data to DB
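Sessionization groups each user's events into visits. The slides do not give the record layout or the session rule, so the sketch below assumes events are `(user_id, timestamp)` pairs and that 30 minutes of inactivity ends a session; both are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

def sessionize(events, timeout=SESSION_TIMEOUT):
    """Yield (user_id, session_number, timestamp) for each event.

    Events are sorted by (user_id, timestamp) first, mimicking a sort
    keyed on both fields ahead of the reduce step.
    """
    for user_id, group in groupby(sorted(events), key=itemgetter(0)):
        session = 0
        previous = None
        for _, ts in group:
            if previous is not None and ts - previous > timeout:
                session += 1  # gap too long: start a new session
            previous = ts
            yield user_id, session, ts

# Hypothetical events: (user id, Unix timestamp in seconds)
events = [('u2', 100), ('u1', 0), ('u1', 600), ('u1', 2401)]
sessions = list(sessionize(events))
```

Here the third `u1` event arrives 1801 seconds after the second, just over the assumed timeout, so it opens that user's second session.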

12 Benefits to Ops
Processing time to reach SLA cut by 6 hours
Running 2 years in production without any big issues
Withstood a 50% per year increase in data volume
Architecture designed to make adding new processing logic easy
Robust and fault-tolerant
– Five dead datanodes, jobs still ran OK
– Upgraded JVM on a few datanodes while jobs were running
– Reprocessed old data while processing the current day's data

13 Conclusions I – Create a Tool Appropriate to the Job if What Exists Doesn't Have What You Want
Python ETL Framework and Hadoop Streaming together can do complex, high-volume ETL work
Python ETL Framework
– Home grown, under review for open-source release
– Rich functionality from Python
– Extensible
– NLS support
Put on top of another platform, e.g. Hadoop
– Distributed/Parallel
– Sorting
– Aggregation

14 Conclusions II – Power and Flexibility for Processing Big Data
Hadoop: scale and computing horsepower
– Robustness
– Fault-tolerance
– Scalability
– Significant reduction of processing time to reach SLA
– Cost-effective
  − Commodity HW
  − Free SW
Currently:
– Build multi-tenant Hadoop clusters using Fair Scheduler

15 Questions? Follow up Lumberjack
