Pulling Structured Data Out of Unstructured Data Greg Lindahl CTO, blekko.


1 Pulling Structured Data Out of Unstructured Data Greg Lindahl CTO, blekko

2 blekko who?
- Search engine, like Google only smaller
- Founded 2007, launched to the public in 2010
- $55M in funding, 1,500 servers, 46 employees, 20 PBytes of disk

3 What this talk is not…
- Lots of clever research on finding templates to extract data from websites which do not want their data extracted
- Classic example: Yelp data
- Shion Deysarkar of Datafiniti does this

4 What does a web crawl look like?
- Our crawl size is limited by our index size, which must fit on the 100T of SSD we have on each serving cluster
- 4 billion URLs crawled, all less than 8 weeks old
- Crawl frontier of 25 billion URLs; Google has 1T
- Compressed data size of 58T; uncompressed > 10X larger?

5 Easy data in a web crawl
- Email addresses? 140M
- Your obfuscation is useless: law of large numbers
- Social security numbers
- Valid-looking credit cards
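
The talk does not say how these items were matched. As a rough illustration only, the sketch below finds plain and lightly obfuscated email addresses with regular expressions and treats a digit run as a "valid-looking" card number when it passes the Luhn checksum. The patterns, function names, and the choice of Luhn are assumptions made for this example, not blekko's method.

```python
import re

# Hypothetical patterns, chosen for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
OBFUSCATED_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)\s*(?:\(at\)|\[at\]|\sat\s)\s*"
    r"([A-Za-z0-9-]+)\s*(?:\(dot\)|\[dot\]|\sdot\s)\s*([A-Za-z]{2,})",
    re.IGNORECASE,
)
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def find_emails(text):
    """Return plain addresses plus 'user (at) host (dot) tld' style ones."""
    found = {m.group(0).lower() for m in EMAIL_RE.finditer(text)}
    for user, domain, tld in OBFUSCATED_RE.findall(text):
        found.add(f"{user}@{domain}.{tld}".lower())
    return found


def luhn_valid(digits):
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like(text):
    """Return digit strings of card-like length that pass the Luhn check."""
    out = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            out.append(digits)
    return out
```

Patterns this loose produce false positives on any single page; the "law of large numbers" point on the slide is that simple obfuscation still leaves a very large number of addresses recoverable across billions of pages.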

6 Map/Reduce and WebGrep
- Map step brings the computation to the data: embarrassingly parallel
- Shuffle-Reduce step produces an answer
- blekko’s NoSQL database does Shuffle-Reduce in the database using combinators
- More info? highscalability.com article series
- WebGrep is a blekko feature that lets people suggest 3 strings to be searched over the whole web
- No limit of 1,000 matches; accurate counts
- Search in the HTML code of webpages
- https://blekko.com/webgrep
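
To make the "Shuffle-Reduce in the database using combinators" idea concrete, here is a toy in-memory sketch. It assumes a combinator is an associative, commutative merge function applied at write time, so mappers can write partial results in any order and the table ends up holding the reduced answer. `CombinatorTable`, `map_page`, and `run_job` are hypothetical names for this example and do not reflect blekko's actual API.

```python
import operator
from collections import defaultdict


class CombinatorTable:
    """Toy stand-in for a key-value store whose writes are merged
    with a combining function instead of overwriting the cell."""

    def __init__(self, combine):
        self.combine = combine
        self.cells = {}

    def write(self, key, value):
        if key in self.cells:
            self.cells[key] = self.combine(self.cells[key], value)
        else:
            self.cells[key] = value


def map_page(html):
    """Map step: runs next to the data and emits (token, partial_count)."""
    local = defaultdict(int)
    for token in html.lower().split():
        local[token] += 1
    return local.items()


def run_job(shards, table):
    """Each shard is processed independently; the table does the reduce."""
    for shard in shards:
        for page in shard:
            for key, count in map_page(page):
                table.write(key, count)
    return table.cells


if __name__ == "__main__":
    table = CombinatorTable(operator.add)
    print(run_job([["a b a"], ["b c"]], table))
```

Because addition is associative and commutative, the order in which shards write does not change the final counts, which is what lets the separate shuffle step disappear in this toy version.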

7 Webgrep examples
- Scheme-less links
- href="http:// - 10.2 billion
- href="https:// - 862 million
- href="// - 58 million
- Semantic web depends on microformats
- Microformats are unpopular!
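
The counts above come from searching raw HTML for literal prefixes. A minimal sketch of the same tally over a single page might look like this; the regular expression and the function name `count_href_schemes` are assumptions for illustration, and WebGrep's own matching rules are not described in the talk.

```python
import re
from collections import Counter

# Matches href="http://, href="https://, or the scheme-less href="//
HREF_PREFIX_RE = re.compile(r'href="(https?:)?//', re.IGNORECASE)


def count_href_schemes(html):
    """Tally absolute links by scheme; '//' means no scheme was given."""
    counts = Counter()
    for m in HREF_PREFIX_RE.finditer(html):
        scheme = (m.group(1) or "").lower()
        counts[scheme + "//"] += 1
    return counts
```

Site-relative links such as `href="/about"` are deliberately not counted, since they do not start with `//`.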

8 What’s the question?
- Posed by Michael Trott of Wolfram Research
- “the average size of a * is *” -- 58 million instances on the web, says Google
- What are * and *?

9 Full question
- + is the largest +
- + is the smallest +
- + is the biggest +
- + is the highest +
- + is the tallest +
- + is the deepest +
- + is the strongest +
- + is the rarest +
- + is the richest +
- + is the poorest +
- + is the shortest +
- + is the longest +
- …

10 As a mapjob
- 5-10 hours to run on ~1 petabyte of input text, 4 billion URLs
- 300 million lines of output
- 2.5 gigabytes of output
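
A mapjob of this kind can be pictured as a per-page function that emits every phrase built around "is the <superlative>", followed by a count per phrase. The sketch below is only a guess at the shape of such a job: the tokenizer, the window of one word before and three words after, and the names `map_text` and `reduce_counts` are all assumptions, not details given in the talk.

```python
import re
from collections import Counter

# Superlatives listed on the "Full question" slide.
SUPERLATIVES = frozenset([
    "largest", "smallest", "biggest", "highest", "tallest", "deepest",
    "strongest", "rarest", "richest", "poorest", "shortest", "longest",
])
WORD_RE = re.compile(r"[a-z0-9']+")


def map_text(text, before=1, after=3):
    """Yield a phrase window around each 'is the <superlative>' match."""
    words = WORD_RE.findall(text.lower())
    for i in range(len(words) - 2):
        if (words[i] == "is" and words[i + 1] == "the"
                and words[i + 2] in SUPERLATIVES):
            start = max(0, i - before)
            yield " ".join(words[start:i + 3 + after])


def reduce_counts(phrases):
    """Reduce step: count how often each phrase window was emitted."""
    return Counter(phrases)
```

For example, `reduce_counts(map_text(page_text))` on one page gives that page's phrase counts; summing such counters across pages would give web-wide totals of the kind shown on the following slides.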

11 What “is the biggest” on the Internet?

12 What “is the biggest” on the internet? “youporn is the biggest free porn tube” – 123,089 instances

13 SEO Search Engine Optimization

14 Legit looking answers
- “what is the biggest challenge facing the” – 33,447
- “what is the biggest considerion for public”
- “do you think is the biggest deterrant to american”
- “what is the biggest determinant of total”
- “asia is the biggest continent on earth” – 11,640

15 Phrase frequencies
- “consists of” -- 18,924,925
- “a * is a”
- “with the largest”
- “is a member of”
- “most * are”
- “is the largest” -- 7,834,488

16 “consists of”
- “audience consists of 50 percent or” – 307,826
- “skeleton consists of groups of bones”
- “eapc consists of all nato member” – 179,489
- “that consists of 28 independent member” – 179,456
- ….

