Slide 1 International Internet Preservation Consortium General Assembly 2014, Paris Mining a Large Web Corpus Robert Meusel Christian Bizer.


1 Slide 1 International Internet Preservation Consortium General Assembly 2014, Paris Mining a Large Web Corpus Robert Meusel Christian Bizer

2 Slide 2 The Common Crawl

3 Slide 3 Hyperlink Graphs Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts or to understand social phenomena.

4 Slide 4 HTML-embedded Data on the Web Several million websites semantically mark up the content of their HTML pages.
Markup syntaxes:
• Microformats
• RDFa
• Microdata
Data snippets within info boxes

5 Slide 5 Relational HTML Tables HTML tables offer semi-structured data that can be used to build or extend knowledge bases such as DBpedia.
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB
• In a corpus of 14B raw tables, 154M are “good” relations (1.1%)

6 Slide 6 The Web Data Commons Project
• Has developed an Amazon-based framework for extracting data from large web crawls
  • Capable of running on any cloud infrastructure
• Has applied this framework to the Common Crawl data
  • Adaptable to other crawls
• Results and framework are publicly available
• Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.

7 Slide 7 Extraction Framework
Diagram components: Master, AWS SQS, AWS EC2 instances, AWS S3
1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & Upload
6: Collect results
Legend: automated, manual
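
The six steps on this slide describe a queue-driven master/worker pattern. The following is a minimal local sketch of that pattern, not the WDC framework itself: an in-memory `queue.Queue` stands in for AWS SQS, threads stand in for EC2 instances, and a dictionary stands in for the S3 output location. The names `run_extraction`, `extract`, and `num_workers` are hypothetical.

```python
import queue
import threading


def run_extraction(file_refs, extract, num_workers=4):
    """Process every file reference once, using a shared work queue."""
    tasks = queue.Queue()   # stand-in for AWS SQS (assumption)
    results = {}            # stand-in for the S3 output location (assumption)
    lock = threading.Lock()

    # Step 1: the master fills the queue with file references.
    for ref in file_refs:
        tasks.put(ref)

    def worker():
        while True:
            # Step 3: request the next file reference; stop when none is left.
            try:
                ref = tasks.get_nowait()
            except queue.Empty:
                return
            # Steps 4 and 5: "download" the file, extract, "upload" the output.
            output = extract(ref)
            with lock:
                results[ref] = output

    # Step 2: launch the workers (threads here, EC2 instances on the slide).
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Step 6: collect the results.
    return results
```

Calling `run_extraction(["a.warc", "bb.warc"], extract=len)` returns one entry per file reference. Because workers only share the queue, adding workers does not require any coordination between them.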

8 Slide 8 Extraction Worker
Diagram labels: AWS S3, WDC Extractor, .(w)arc, Worker, Filter, output, Download file, Upload output file
Worker:
• Written in Java
• Processes one page at a time
• Independent from other files and workers
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
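
The filter idea can be illustrated with a short sketch: reject pages by MIME type first, then run a cheap regular expression over the raw bytes so that the expensive extractor only sees pages likely to contain markup. The accepted MIME types, the hint pattern, and the function name `should_extract` are assumptions made for illustration; the slides do not list the actual rules used by the WDC worker.

```python
import re

# Hypothetical hints: cheap textual signs that a page may carry
# Microdata, RDFa, or Microformats markup.
MARKUP_HINTS = re.compile(
    rb"itemscope|itemprop|typeof=|property=|vcard",
    re.IGNORECASE,
)
# Hypothetical set of MIME types worth parsing.
ACCEPTED_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def should_extract(mime_type: str, body: bytes) -> bool:
    """Return True only if the page passes both cheap pre-filters."""
    # MIME-type filter: drop parameters such as "; charset=utf-8".
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type not in ACCEPTED_MIME_TYPES:
        return False
    # Regex detection: skip the expensive parser when no hint is present.
    return MARKUP_HINTS.search(body) is not None
```

A regex pre-filter of this kind can produce false positives, which is acceptable here because the full extractor still decides what is actually emitted.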

9 Slide 9 Web Data Commons – Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault tolerant and cheap
  • 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
  • Only the worker has to be adapted
  • The worker is a single process method that processes one file at a time
  • Scaling is automated by the framework
• Access the open source code:
  • https://www.assembla.com/code/commondata/
Alternative: a Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
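
As a rough sanity check on the cost figure, the three numbers quoted on this slide can be turned into unit costs. This is plain arithmetic on the slide's rounded values, so the results are approximate.

```python
COST_USD = 300            # total extraction cost, from the slide
CORPUS_TB = 44            # size of the processed crawl, from the slide
STATEMENTS_BILLION = 17   # extracted RDF statements, from the slide

usd_per_tb = COST_USD / CORPUS_TB
usd_per_billion_statements = COST_USD / STATEMENTS_BILLION

print(f"{usd_per_tb:.2f} USD per TB of crawl data")
print(f"{usd_per_billion_statements:.2f} USD per billion RDF statements")
```

That works out to a little under 7 USD per terabyte processed and a little under 18 USD per billion statements extracted.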

10 Slide 10 Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables

11 Slide 11 Hyperlink Graph
• Extracted from the Common Crawl 2012 Dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB

12 Slide 12 Hyperlink Graph Discovery of evolutions in the global structure of the World Wide Web.
• Degrees do not follow a power law
• Detection of spam pages
• Further insights:
  • WWW’14: Graph Structure in the Web – Revisited (Meusel et al.)
  • WebSci’14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
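
Any statement about the shape of the degree distribution starts from a count of how many nodes have each degree. The sketch below shows only that counting step for in-degrees on a small in-memory edge list; it is not the method of the cited papers, and a graph of this size would need a distributed or streaming implementation. The function name `in_degree_distribution` is hypothetical.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree value to the number of nodes having it."""
    in_degree = Counter()
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
        in_degree[target] += 1
    # Nodes that are never linked to have in-degree 0.
    return Counter(in_degree[node] for node in nodes)
```

Plotting the resulting counts on log-log axes is a common first look at whether a power law is plausible; a proper test requires a statistical fit rather than visual inspection.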

13 Slide 13 Hyperlink Graph Discovery of important and interesting sites using different popularity rankings or website categorization libraries Websites connected by at least ½ Million Links

14 Slide 14 HTML-embedded Data More and more websites semantically mark up the content of their HTML pages. Markup syntaxes: RDFa, Microformats, Microdata
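
To make the notion of embedded markup concrete, here is a minimal sketch that scans a page for Microdata attributes using Python's standard `html.parser` module. It only lists the `itemtype` URLs and `itemprop` names it encounters; it does not resolve nesting, read property values, or produce RDF triples, and it is not the extractor used by Web Data Commons. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataScanner(HTMLParser):
    """Collect itemtype URLs and itemprop names found in a page."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # itemscope is a boolean attribute; itemtype names the class.
        if "itemscope" in attributes and attributes.get("itemtype"):
            self.item_types.append(attributes["itemtype"])
        if attributes.get("itemprop"):
            self.properties.append(attributes["itemprop"])


def scan_microdata(html: str):
    """Return (item types, property names) in document order."""
    scanner = MicrodataScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.item_types, scanner.properties
```

For a product snippet marked up with `itemtype="http://schema.org/Product"` and two `itemprop` spans, `scan_microdata` returns the type URL and the two property names.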

15 Slide 15 Websites containing Structured Data (2013)
1.8 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%).
585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%).
Web Data Commons - Microformat, Microdata, RDFa Corpus
17 billion RDF triples from Common Crawl 2013
Next release will be in winter 2014

16 Slide 16 Top Classes Microdata (2013) schema = Schema.org; dv = Google's Rich Snippet Vocabulary

17 Slide 17 HTML Tables
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB
Crestan, Pantel: Web-Scale Table Census and Classification. WSDM
In a corpus of 14B raw tables, 154M are “good” relations (1.1%). Cafarella (2008)
Classification precision: 70-80%

18 Slide 18 WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables
  • selected out of 11.2 B raw tables (1.3%)
  • download includes the HTML pages of the tables (1 TB zipped)
• Table Statistics
• Heterogeneity: Very high.
Min | Max | Average | Median
Attributes: 22,
Data Rows: 170,

19 Slide 19 WDC - Web Tables Corpus
• Attribute Statistics: 28,000,000 different attribute labels

Attribute       #Tables
name            4,600,000
price           3,700,000
date            2,700,000
artist          2,100,000
location        1,200,000
year            1,000,000
manufacturer    375,000
country         340,000
isbn            99,000
area            95,000
population      86,000

• Subject Attribute Values: 1.74 billion rows, 253,000,000 different subject labels

Value               #Rows
usa                 135,000
germany             91,000
greece              42,000
new york            59,000
london              37,000
athens              11,000
david beckham       3,000
ronaldinho          1,200
oliver kahn         710
twist shout         2,000
yellow submarine    1,400
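
Attribute statistics of this kind amount to counting, for each header label, the number of tables in which it occurs. The sketch below does this for an in-memory list of table headers. The normalization (trimming and lower-casing) and the function name `count_attribute_labels` are assumptions; the slides do not say how labels were normalized in the corpus.

```python
from collections import Counter


def count_attribute_labels(tables):
    """Count in how many tables each normalized header label appears."""
    counts = Counter()
    for header in tables:
        # Use a set so a label repeated within one table counts once.
        labels = {label.strip().lower() for label in header if label.strip()}
        counts.update(labels)
    return counts
```

Calling `count_attribute_labels(...).most_common(10)` on a list of headers then gives a ranking in the same form as the attribute table on this slide.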

20 Slide 20 Conclusion Three factors are necessary to work with web-scale data:
Availability of crawls
• Thanks to Common Crawl, this data is available
Availability of cheap, easy-to-use infrastructures
• Such as Amazon or other on-demand cloud services
Easy-to-adopt, scalable extraction frameworks
• The Web Data Commons framework, or standard tools like Pig
• Costs have to be evaluated per task, but the WDC framework has turned out to be cheaper

21 Slide 21 Questions
• Please visit our website:
• Data and framework are available as free download
• Web Data Commons is supported by:

