Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. © Copyright 2008 STI INNSBRUCK, www.sti-innsbruck.at

Presentation transcript:

© Copyright 2008 STI INNSBRUCK Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis OC Working Group – Serge Tymaniuk

Overview
– Introduction
– Methodology
– Results
– Questions

Introduction
Written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1)
– (1) Data and Web Science Group, University of Mannheim, Germany
– (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands
Features:
– Analysis of RDFa, Microdata, and Microformats adoption on the Web
– Based on a large public Web crawl of 3 billion HTML pages
– Aims at revealing the main topical areas of the published data and the different vocabularies used within each topical area
– Examines structural richness (which properties are used to describe popular types of entities)

Web Crawl
– Web crawl provided by the Common Crawl foundation, available as ARC files from Amazon S3
– 3,005,626,093 unique HTML pages from 40.6 million pay-level domains (PLDs)
– Crawling conducted between January and June 2012
– Compressed size of the corpus is 48 TB
– The crawler relies on the PageRank algorithm

Data Extraction Process
– The parsing framework is executed on Amazon EC2
– Relies on the Anything To Triples (Any23) parsing library from Apache (http://any23.apache.org/)
– The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
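To make the extraction step more concrete, the following is a minimal sketch of how a page could be checked for the three formats. It is not Any23 and not the authors' framework: it uses only Python's standard `html.parser`, and the attribute and class-name sets, the `FormatDetector` class, and the `detect_formats` helper are hypothetical, deliberately small illustrations.

```python
from html.parser import HTMLParser

# Illustrative subsets only (assumptions for this sketch); a real extractor
# such as Any23 applies the full format specifications.
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hentry"}


class FormatDetector(HTMLParser):
    """Records which markup formats appear in one HTML document."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.formats.add("microdata")
        if names & RDFA_ATTRS:
            self.formats.add("rdfa")
        classes = set((dict(attrs).get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.formats.add("microformats")


def detect_formats(html):
    """Return the set of formats detected in the given HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.formats
```

Calling `detect_formats` on each crawled page and grouping the results by website would give per-format page and domain counts; the sketch only flags the presence of a format and does not extract triples.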

Results: Overall picture
Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
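The two percentages can be re-derived from the figures on the slides. The sketch below assumes the rounded counts quoted above (369M pages, 2.29M domains) together with the exact page total from the Web Crawl slide; the variable names and the `share` helper are illustrative only.

```python
# Rounded counts as quoted on the slides (assumption: the exact counts differ slightly).
pages_with_data = 369_000_000
total_pages = 3_005_626_093
domains_with_data = 2_290_000
total_domains = 40_600_000


def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


page_share = share(pages_with_data, total_pages)        # about 12.3 % of pages
domain_share = share(domains_with_data, total_domains)  # about 5.64 % of domains
```

The page-level share is roughly twice the domain-level share, which suggests that websites containing structured data contribute more pages to the corpus on average than websites without it.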

Results: Deployment by FORMAT
* PLDs – pay-level domains (i.e. websites)
* URLs – HTML pages

Results: Deployment by POPULARITY
* According to the Alexa Internet Inc. (AL) list of the most frequently visited websites

Results: Deployment by domains

Results: Deployment on the same Website
93.5% of all websites that contain structured data use only a single format
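A figure of this kind could be computed from a mapping of websites to the set of formats found on them. The following is a minimal sketch under that assumption; the `single_format_share` function, the mapping layout, and the example domains are hypothetical and are not taken from the study.

```python
def single_format_share(formats_by_site):
    """Percentage of sites with structured data that use exactly one format.

    `formats_by_site` maps a website to the set of formats detected on it;
    sites with an empty set carry no structured data and are ignored.
    """
    with_data = [formats for formats in formats_by_site.values() if formats]
    if not with_data:
        return 0.0
    single = sum(1 for formats in with_data if len(formats) == 1)
    return 100 * single / len(with_data)


# Made-up example data, not results from the paper.
example = {
    "a.example": {"rdfa"},
    "b.example": {"microdata"},
    "c.example": {"microformats", "rdfa"},
    "d.example": set(),
}
example_share = single_format_share(example)
```

In this toy data, two of the three sites with structured data use a single format, so the share is two thirds.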

Results: Deployment of RDFa
Most frequently used RDFa classes:
Alexa top 100 websites that use RDFa:
– IMDB
– Microsoft News Portal
– BBC
Most frequently used properties co-occurring with all 4 of the most frequently used OGP classes:

Results: Deployment of Microdata
Most frequently used Microdata classes:
Alexa top 100 websites that use Microdata:
– eBay
– Microsoft Corp.
– Apple Inc.

Results: Deployment of Microformats
Most frequently used Microformats classes:
Alexa top 100 websites that use Microformats:
– Wikipedia
– Adobe
– Taobao marketplace

Results: Topical Domains
Dominant domains of the published data:
– Persons and Organizations (by all 3 formats)
– Blog- and CMS-related metadata (by RDFa and Microdata)
– Navigational metadata (by RDFa and Microdata)
– Product data (by all 3 formats)
– Event data (by Microformats)

Results: Structural Richness
Only a small set of generic properties is used to describe entities:
– Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases
– Instances of the Schema class “Product” are largely described only by name and description
Additional extraction techniques have to be employed for a deeper understanding

Sources
1. Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2012). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf

Thank you for your attention!
Questions?