Web Mining in the Cloud Hadoop/Cascading/Bixo in EC2 Ken Krugler, Bixo Labs, Inc. ACM Data Mining SIG 08 December 2009
About me Background in vertical web crawl –Krugle search engine for open source code –Bixo open source web mining toolkit Consultant for companies using EC2 –Web mining –Data processing Founder of Bixo Labs –Elastic web mining platform –http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101 Extracting & Analyzing Web Data More Than Just Search Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
4 Steps in Web Mining Collect - fetch content from web Parse - extract data from formats Analyze - tokenize, rate, classify, cluster Produce - “useful data”
Web Mining versus Data Mining Scale - 10 million isn’t a big number Access - public but restricted –Special implicit rules apply Structure - not much
How to Mine Large Scale Web Data? Start with scalable map-reduce platform Add a workflow API layer Mix in a web crawling toolkit Write your custom data processing code Run in an elastic cloud environment
One Solution - the HECB Stack: Bixo / Cascading / Hadoop / EC2
EC2 - Amazon Elastic Compute Cloud True cost of non-cloud environment –Cost of servers & networking (2 year life) –Cost of colo (6 servers/rack) –Cost of OPS salary (15% of FTE/cluster) –Managing servers is no fun Web mining is perfect for the cloud –“bursty” => savings are even greater –Data is distilled, so no transfer $$$ pain
Why Hadoop? Perfect for processing lots of data –Map-reduce –Distributed file system Open source, large community, etc. Runs well in EC2 clusters Elastic MapReduce as an option
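For concreteness, here is what "map-reduce" looks like at the code level: the canonical Hadoop word count, written against the org.apache.hadoop.mapreduce API of that era. Nothing here is specific to web mining; it is the kind of low-level boilerplate the Cascading layer on the next slide exists to hide.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}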
Why Cascading? API on top of Hadoop Supports efficient, reliable workflows Reduces painful low-level MR details Build workflow using “pipe” model
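And here is roughly the same job expressed with Cascading's "pipe" model, as a sketch in the Cascading 1.x style that was current at the time of this talk (package names moved around in later versions). Taps say where tuples come from and go; pipes say what happens to them; Cascading plans the assembly into Hadoop jobs.

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {

    public static void main(String[] args) {
        // Source and sink taps: where tuples are read from and written to.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        // The pipe assembly: split lines into words, group by word, count.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // Cascading plans this assembly into one or more map-reduce jobs.
        Flow flow = new FlowConnector().connect("word-count", source, sink, assembly);
        flow.complete();
    }
}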
Why Bixo? Plugs into Cascading-based workflow –Scales with Hadoop cluster –Runs well in EC2 Handles grungy web crawling details –Polite yet efficient fetching –Errors, web servers that lie –Parsing lots of formats, broken HTML Open source toolkit for web mining apps
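Bixo's fetcher takes care of the "polite yet efficient" part for you. As a rough illustration of what politeness means (this is not Bixo's API, just a single-threaded sketch with an assumed per-domain minimum delay):

import java.util.HashMap;
import java.util.Map;

// Sketch only: never hit the same domain more often than some minimum
// delay, no matter how many of its URLs are queued. A real fetcher keeps
// per-domain queues and schedules them instead of sleeping like this.
public class PoliteFetchDelay {

    private final long minDelayMs;
    private final Map<String, Long> lastFetchTime = new HashMap<String, Long>();

    public PoliteFetchDelay(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Sleep until it is polite to hit this domain again, then record the fetch.
    public void waitForSlot(String domain) throws InterruptedException {
        Long last = lastFetchTime.get(domain);
        if (last != null) {
            long earliestNext = last + minDelayMs;
            long now = System.currentTimeMillis();
            if (now < earliestNext) {
                Thread.sleep(earliestNext - now);
            }
        }
        lastFetchTime.put(domain, System.currentTimeMillis());
    }
}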
SEO Keyword Data Mining Example of typical web mining task Find common keywords (1,2,3 word terms) –Do domain-centric web crawl –Parse pages to extract title, meta, h1, links –Output keywords sorted by frequency Compare to competitor site(s)
Custom Code for Example Filtering URLs inside domain –Non-English content –User-generated content (forums, etc) Generating keywords from text –Special tokenization –One, two, three word phrases But 95% of code was generic
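The phrase-generation piece of that custom code might look roughly like the following sketch (illustrative only, not the actual Bixo Labs code): take already-tokenized text and emit every one-, two-, and three-word phrase, which downstream steps then count and rank by frequency.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseGenerator {

    // Emit all phrases of 1..maxWords consecutive tokens.
    public static List<String> phrases(List<String> tokens, int maxWords) {
        List<String> result = new ArrayList<String>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(tokens.get(start + len - 1));
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("open", "source", "web", "mining");
        // Prints all 1-, 2-, and 3-word phrases, e.g. "open source web"
        for (String p : phrases(tokens, 3)) {
            System.out.println(p);
        }
    }
}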
End Result in Data Mining Tool
What Next? Another example - mining mailing lists Go straight to Summary/Q&A Talk about Web Scale Mining Write tweets, posts & emails “No minute off-line goes unpunished”
Another Example - HUGMEE Hadoop Users who Generate the Most Effective Emails
Helpful Hadoopers Use mailing list archives for data (collect) Parse mbox files and emails (parse) Score based on key phrases (analyze) End result is score/name pair (produce)
Scoring Algorithm Very sophisticated point system “thanks” == 5 “owe you a beer” == 50 “worship the ground you walk on” == 100
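The phrase list and point values above come straight from the slide; everything else in this small scorer (lower-casing, simple substring matching) is an assumption made for illustration.

import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class HelpfulnessScorer {

    // The "very sophisticated point system" from the slide.
    private static final Map<String, Integer> POINTS = new LinkedHashMap<String, Integer>();
    static {
        POINTS.put("thanks", 5);
        POINTS.put("owe you a beer", 50);
        POINTS.put("worship the ground you walk on", 100);
    }

    // Sum the points for every scoring phrase found in the reply text.
    public static int score(String replyText) {
        String text = replyText.toLowerCase(Locale.ENGLISH);
        int total = 0;
        for (Map.Entry<String, Integer> entry : POINTS.entrySet()) {
            if (text.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("Thanks, I owe you a beer!"));  // prints 55
    }
}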
High Level Steps Collect emails –Fetch mod_mbox generated page –Parse it to extract links to mbox files –Fetch mbox files –Split into separate emails Parse emails –Extract key headers (messageId, email, etc) –Parse body to identify quoted text
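A sketch of the splitting/parsing step, assuming standard mbox conventions: each message starts with a "From " line at column zero, and quoted text in replies is prefixed with ">". Real mbox files need more care (e.g. ">From " escaping), which is exactly the kind of detail a toolkit ends up owning.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MboxSplitter {

    // Split an mbox file into individual messages, dropping quoted (">") lines.
    public static List<String> splitMessages(String path) throws IOException {
        List<String> messages = new ArrayList<String>();
        StringBuilder current = null;
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("From ")) {        // start of a new message
                    if (current != null) {
                        messages.add(current.toString());
                    }
                    current = new StringBuilder();
                }
                if (current != null && !line.startsWith(">")) {  // skip quoted text
                    current.append(line).append('\n');
                }
            }
            if (current != null) {
                messages.add(current.toString());
            }
        } finally {
            reader.close();
        }
        return messages;
    }
}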
High Level Steps Analyze emails –Find key phrases in replies (ignore signoff) –Score emails by phrases –Group & sum by message ID –Group & sum by email address Produce ranked list –Toss addresses with no love –Sort by summed score
Building the Flow
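The "Building the Flow" slide shows the pipe assembly as a diagram. A hedged sketch of how the group-and-sum steps might be wired as Cascading pipes follows; the field names ("msgId", "email", "score") and the upstream fetch/parse pipes are assumptions, not the actual code.

import cascading.operation.aggregator.Sum;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class HugmeeAssembly {

    // Incoming tuples are assumed to carry (msgId, email, score), one per
    // scored key phrase found in a reply; the pipes that produce them are omitted.
    public static Pipe buildScoring(Pipe scoredPhrases) {
        // Group & sum by message ID: total the phrase scores within each message.
        Pipe byMessage = new GroupBy(scoredPhrases, new Fields("msgId", "email"));
        byMessage = new Every(byMessage, new Fields("score"),
                new Sum(new Fields("msgScore")));

        // Group & sum by email address: total score per person.
        Pipe byEmail = new GroupBy(byMessage, new Fields("email"));
        byEmail = new Every(byEmail, new Fields("msgScore"),
                new Sum(new Fields("totalScore")));

        return byEmail;
    }
}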
This Hug’s for Ted!
Web Scale Mining Bigger Data –100M pages versus 1M pages Bigger Breadth –100K domains versus 1K domains Bigger Clusters –50 servers versus 5 servers
Web Scale == Endless Heuristics Document feature detection –Charset –Mime-type –Language –Many noisy sources of “truth” Duplicate detection –Quest for the perfect hash function Spam/porn/link farm detection
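The simplest possible duplicate detector is sketched below: normalize the extracted text and take an exact hash. That only catches byte-identical pages; near-duplicates (boilerplate changes, ads, shuffled navigation) are why the slide talks about the quest for the perfect hash function, and why real systems reach for fuzzier schemes such as shingling or simhash.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SimpleDupeHash {

    // Exact-duplicate fingerprint of a page's extracted text.
    public static String contentHash(String extractedText) throws NoSuchAlgorithmException {
        // Collapse whitespace and lower-case so trivial differences don't change the hash.
        String normalized = extractedText.trim().toLowerCase().replaceAll("\\s+", " ");
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(normalized.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}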
Web Scale == Challenges All web servers lie Edge cases ad nauseam Avoiding spam/porn/junk Focusing on English content Scaling to 100K domains/100M pages –Avoid bottlenecks –Fix large cluster issues
Public Terabyte Dataset Sponsored by Concurrent/Bixolabs High quality crawl of top domains –HECB Stack using Elastic Map Reduce Hosted by Amazon in S3, free to EC2 users Crawl & processing code available Questions, input?
Web Scale Case Study - PTD Crawl Robots.txt - Robot Exclusion Protocol –Not a real standard, lots of extensions –Many ways to mess it up (HTML, typos, etc) Great performance when all is well –25K pages/minute fetching –50K pages/minute parsing Hadoop vs –Different APIs, behavior, bugs –At painful cluster tuning stage
Large Scale Web Mining Summary 10K is easy, 100M is hard –You encounter endless edge cases –There’s always another bottleneck –Cluster tuning is challenging Web mining toolkit approach works –Easier to customize/optimize –Easier to solve problems
Summary HECB stack works well for web mining –Cheaper than typical colo option –Scales to hundreds of millions of pages –Reliable and efficient workflow Web mining has high & increasing value –Search engine optimization, advertising –Social networks, reputation –Competitive pricing –Etc, etc, etc.