Web Mining in the Cloud Hadoop/Cascading/Bixo in EC2 Ken Krugler, Bixo Labs, Inc. ACM Data Mining SIG 08 December 2009.



2 Web Mining in the Cloud Hadoop/Cascading/Bixo in EC2 Ken Krugler, Bixo Labs, Inc. ACM Data Mining SIG 08 December 2009

3 About me  Background in vertical web crawl –Krugle search engine for open source code –Bixo open source web mining toolkit  Consultant for companies using EC2 –Web mining –Data processing  Founder of Bixo Labs –Elastic web mining platform –

4 Typical Data Mining

5 Data Mining Victory!

6 Meanwhile, Over at McAfee…

7 Web Mining 101  Extracting & Analyzing Web Data  More Than Just Search  Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…

8 4 Steps in Web Mining  Collect - fetch content from web  Parse - extract data from formats  Analyze - tokenize, rate, classify, cluster  Produce - “useful data”

9 Web Mining versus Data Mining  Scale - 10 million isn’t a big number  Access - public but restricted –Special implicit rules apply  Structure - not much

10 How to Mine Large Scale Web Data?  Start with scalable map-reduce platform  Add a workflow API layer  Mix in a web crawling toolkit  Write your custom data processing code  Run in an elastic cloud environment

11 One Solution - the HECB Stack  Bixo  Cascading  Hadoop  EC2

12 EC2 - Amazon Elastic Compute Cloud  True cost of non-cloud environment –Cost of servers & networking (2 year life) –Cost of colo (6 servers/rack) –Cost of OPS salary (15% of FTE/cluster) –Managing servers is no fun  Web mining is perfect for the cloud –“bursty” => savings are even greater –Data is distilled, so no transfer $$$ pain

13 Why Hadoop?  Perfect for processing lots of data –Map-reduce –Distributed file system  Open source, large community, etc.  Runs well in EC2 clusters  Elastic Map Reduce as option

14 Why Cascading?  API on top of Hadoop  Supports efficient, reliable workflows  Reduces painful low-level MR details  Build workflow using “pipe” model

15 Why Bixo?  Plugs into Cascading-based workflow –Scales with Hadoop cluster –Runs well in EC2  Handles grungy web crawling details –Polite yet efficient fetching –Errors, web servers that lie –Parsing lots of formats, broken HTML  Open source toolkit for web mining apps

16 SEO Keyword Data Mining  Example of typical web mining task  Find common keywords (1,2,3 word terms) –Do domain-centric web crawl –Parse pages to extract title, meta, h1, links –Output keywords sorted by frequency  Compare to competitor site(s)
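The keyword step on this slide (one, two and three word terms, output sorted by frequency) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Cascading/Bixo code from the talk; the tokenizer regex and the names `ngrams` and `keyword_counts` are assumptions.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keyword_counts(texts, max_words=3):
    """Count 1..max_words word terms across text fragments (title, meta, h1, ...)."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for n in range(1, max_words + 1):
            counts.update(ngrams(tokens, n))
    return counts

# Hypothetical fragments extracted from two pages.
fragments = ["Web Mining in the Cloud", "Web mining toolkit"]
counts = keyword_counts(fragments)

# Most frequent first; ties broken alphabetically so the order is stable.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Terms never span two fragments, because each fragment is tokenized on its own. Running the same count over a competitor's pages gives the second list for the comparison step.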

17 Workflow

18 Custom Code for Example  Filtering URLs inside domain –Non-English content –User-generated content (forums, etc)  Generating keywords from text –Special tokenization –One, two, three word phrases  But 95% of code was generic
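A rough sketch of the URL filtering mentioned on this slide, assuming that "inside domain" means the host is the target domain or one of its subdomains, and that user-generated content can be recognized by path prefix. The marker list and function names are hypothetical, and language detection is left out because it needs the page content rather than the URL.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes that mark user-generated content.
UGC_MARKERS = ("/forum", "/forums", "/comments")

def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)

def keep_url(url, domain):
    """Keep in-domain URLs that do not look like user-generated content."""
    if not in_domain(url, domain):
        return False
    path = urlparse(url).path.lower()
    return not any(path.startswith(marker) for marker in UGC_MARKERS)
```

Matching on `"." + domain` instead of a bare suffix keeps a host such as `notexample.com` from passing as part of `example.com`.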

19 End Result in Data Mining Tool

20 What Next?  Another example - mining mailing lists  Go straight to Summary/Q&A  Talk about Web Scale Mining  Write tweets, posts & emails “No minute off-line goes unpunished”

21 Another Example - HUGMEE  Hadoop  Users who  Generate the  Most  Effective  Emails

22 Helpful Hadoopers  Use mailing list archives for data (collect)  Parse mbox files and emails (parse)  Score based on key phrases (analyze)  End result is score/name pair (produce)

23 Scoring Algorithm  Very sophisticated point system  “thanks” == 5  “owe you a beer” == 50  “worship the ground you walk on” == 100
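The point system above maps directly onto a lookup table. The sketch below assumes case-insensitive substring matching and that every occurrence of a phrase counts; the slide does not say how matching was actually done, and `score_text` is a made-up name.

```python
# Points per key phrase, taken from the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}

def score_text(text):
    """Sum the points for every occurrence of a key phrase (case-insensitive)."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())
```

Plain substring matching also counts "thanks" inside "Thanksgiving"; a word-boundary regex would be stricter.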

24 High Level Steps  Collect emails –Fetch mod_mbox generated page –Parse it to extract links to mbox files –Fetch mbox files –Split into separate emails  Parse emails –Extract key headers (messageId, email, etc) –Parse body to identify quoted text
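As an illustration of the "parse emails" step, the sketch below uses Python's standard `email` package to pull out key headers and drop quoted lines. It assumes single-part plain-text messages and the common convention that quoted text starts with ">"; the talk's own parsing code is not shown in this transcript, and `parse_email` and the sample message are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

def parse_email(raw):
    """Extract key headers and the non-quoted part of a plain-text email."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a str for single-part messages (assumed here)
    # Assumption: quoted text is marked with a leading ">".
    new_lines = [line for line in body.splitlines()
                 if not line.lstrip().startswith(">")]
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "email": parseaddr(msg.get("From", ""))[1],
        "reply_text": "\n".join(new_lines).strip(),
    }

# Hypothetical reply as it might appear inside an mbox file.
raw = (
    "From: Alice Example <alice@example.com>\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "Subject: Re: job fails\n"
    "\n"
    "> Try raising the heap size.\n"
    "Thanks, that fixed it!\n"
)
parsed = parse_email(raw)
```

Dropping the quoted lines means only the words the replier added are examined later.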

25 High Level Steps  Analyze emails –Find key phrases in replies (ignore signoff) –Score emails by phrases –Group & sum by message ID –Group & sum by email address  Produce ranked list –Toss email addresses with no love –Sort by summed score
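The two "group & sum" steps and the final ranking can be sketched without Hadoop for a small data set. The sketch assumes each scored reply credits the message it replies to, and that a separate lookup gives that message's author; both are one reading of the slide, and all names here are hypothetical.

```python
from collections import defaultdict

def rank_helpers(scored_replies, author_of):
    """Rank authors by the total score of the replies their messages received.

    scored_replies: iterable of (replied_to_message_id, score) pairs.
    author_of: dict mapping message ID to the author's email address.
    """
    # Group & sum by message ID.
    by_message = defaultdict(int)
    for message_id, score in scored_replies:
        by_message[message_id] += score

    # Group & sum by email address.
    by_email = defaultdict(int)
    for message_id, total in by_message.items():
        email = author_of.get(message_id)
        if email is not None:
            by_email[email] += total

    # Toss addresses with no love, then sort by summed score (highest first).
    ranked = [(email, total) for email, total in by_email.items() if total > 0]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

For example, `rank_helpers([("m1", 5), ("m1", 50)], {"m1": "ted@example.com"})` credits both replies to the author of message `m1`.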

26 Workflow

27 Building the Flow

28 mod_mbox Page

29 Custom Operation

30 Validate

31 This Hug’s for Ted!

32 Produce

33 Web Scale Mining  Bigger Data –100M pages versus 1M pages  Bigger Breadth –100K domains versus 1K domains  Bigger Clusters –50 servers versus 5 servers

34 Web Scale == Endless Heuristics  Document features detection –Charset –Mime-type –Language –Many noisy sources of “truth”  Duplicates detection –Quest for the perfect hash function  Spam/porn/link farm detection
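For the duplicates bullet, the simplest baseline is an exact hash over normalized text. The sketch below assumes whitespace collapsing and lowercasing as the normalization and SHA-1 as the hash; it catches only pages that are identical after normalization, which is part of why the slide calls the search for a better hash a quest. The function names are hypothetical.

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(pages):
    """Return URLs of pages whose fingerprint has not been seen before.

    pages: iterable of (url, text) pairs, in crawl order.
    """
    seen = set()
    unique = []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique
```

A page that differs by a single timestamp or ad block gets a different fingerprint, so near-duplicate detection needs a different technique than this one.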

35 Web Scale == Challenges  All web servers lie  Edge cases ad nauseam  Avoiding spam/porn/junk  Focusing on English content  Scaling to 100K domains/100M pages –Avoid bottlenecks –Fix large cluster issues

36 Public Terabyte Dataset  Sponsored by Concurrent/Bixolabs  High quality crawl of top domains –HECB Stack using Elastic Map Reduce  Hosted by Amazon in S3, free to EC2 users  Crawl & processing code available  Questions, input?

37 Web Scale Case Study - PTD Crawl  Robots.txt - Robot Exclusion Protocol –Not a real standard, lots of extensions –Many ways to mess it up (HTML, typos, etc)  Great performance when all is well –25K pages/minute fetching –50K pages/minute parsing  Hadoop 0.18.3 vs. 0.19.2 –Different APIs, behavior, bugs –At painful cluster tuning stage
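To make the robots.txt point concrete, here is a small sketch using `urllib.robotparser` from the Python standard library. It is not the parser used in the PTD crawl; the rules, the `example-crawler` agent name and the URLs are made up, and `Crawl-delay` is one of the non-standard extensions the slide alludes to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check two URLs on behalf of a made-up crawler name.
allowed = parser.can_fetch("example-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("example-crawler", "http://example.com/private/data.html")

# Seconds to wait between requests, or None if the file sets no delay.
delay = parser.crawl_delay("example-crawler")
```

A real crawl also has to decide what to do when robots.txt comes back as HTML, contains typos, or cannot be fetched at all, which is where the edge cases pile up.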


39 Large Scale Web Mining Summary  10K is easy, 100M is hard –You encounter endless edge cases –There’s always another bottleneck –Cluster tuning is challenging  Web mining toolkit approach works –Easier to customize/optimize –Easier to solve problems

40 Summary  HECB stack works well for web mining –Cheaper than typical colo option –Scales to hundreds of millions of pages –Reliable and efficient workflow  Web mining has high & increasing value –Search engine optimization, advertising –Social networks, reputation –Competitive pricing –Etc, etc, etc.

41 Any Questions?  My email:  Bixo mailing list:

