Co-funded by the European Union Semantic CMS Community Semantic Lifting for Traditional Content Resources Copyright IKS Consortium 1 Lecturer Organization.

Presentation transcript:

Semantic Lifting for Traditional Content Resources
Semantic CMS Community, co-funded by the European Union
Copyright IKS Consortium

Page: Course outline, lectures (1) to (10)
• Part I: Foundations
  - Introduction of Content Management
  - Foundations of Semantic Web Technologies
• Part II: Semantic Content Management
  - Storing and Accessing Semantic Data
  - Knowledge Interaction and Presentation
  - Knowledge Representation and Reasoning
  - Semantic Lifting
• Part III: Methodologies
  - Designing Interactive Ubiquitous IS
  - Requirements Engineering for Semantic CMS
  - Designing Semantic CMS
  - Semantifying your CMS

Page: What is this Lecture about?
• We have learned ...
  - ... how to build ontologies representing complex knowledge domains.
  - ... a way to reason about knowledge.
• We need a way ...
  - ... to extract knowledge from content in an automatic way.
• This is the task of Semantic Lifting.
• Position in the course: Part II: Semantic Content Management, lectures (3) to (6): Storing and Accessing Semantic Data, Knowledge Interaction and Presentation, Knowledge Representation and Reasoning, Semantic Lifting

Page: Overview
• What is semantic lifting?
• Core concepts
• Scenarios
• Requirements
• Technologies
• Semantic Reengineering
• Semantic Enhancements of textual content

Page: What is "Semantic Lifting"?
• Semantic Lifting refers to the process of associating content items with suitable semantic objects as metadata, turning "unstructured" content items into semantic knowledge resources.
• Semantic Lifting makes "hidden" metadata in content items explicit.

Page: Semantic Lifting Targets
• Semantic Reengineering of structured data
  - Semantic Lifting harmonizes metadata representations.
  - Semantic Lifting reengineers data from an existing resource so that the data can be reused within a semantic repository.
• Semantic Content Enhancement
  - Semantic Lifting generates additional metadata and annotations by semantic analysis of content items.
  - Semantic Lifting classifies content objects by means of semantic annotations.

Page: Structured Content
• Structured content provides implicit semantics through the structure definition.
  - Table definitions in relational databases, XML schemata, field definitions for address books, calendars, etc.
  - Application programs are designed to "know" how to interpret the structures and the data within them.
• Semantic Lifting is used for reengineering, to support data exchange and seamless interoperability between different systems.

Page: Unstructured Content
• Unstructured content
  - Images, texts, videos, music, web pages composed of various types of media items
  - Meaningful only to humans, not to machines
• Content must be described semantically by metadata to become meaningful to machines, e.g. what the text or image is about.
• Semantic Lifting is used as content enhancement.

Page: Mixed Content
• There is no strict dichotomy between structured and unstructured content.
  - Structured databases are used to store unstructured content types, such as texts, images, etc.
  - Documents can be composed of unstructured content items, such as free text and images, as well as more structured information, e.g. tables and charts.
[Figure: a document with regions labelled "Free text" and "Structured content"]

Page: Metadata: Variants
• Metadata exist in many forms:
  - Free text descriptions
  - Descriptive, content-related keywords or tags, from fixed vocabularies or in free form
  - Taxonomic and classificatory labels
  - Media-specific metadata, such as MIME types, encoding, language, bit rate
  - Media-type-specific structured metadata schemes, such as EXIF for photos, IPTC tags for images, ID3 tags for MP3, MPEG-7 for videos, etc.
  - Content-related structured knowledge markup, e.g. to specify what objects are shown in an image or mentioned in a text, what the actors are doing, etc.

Page: Metadata: Variants
• Inline metadata are part of the content.
  - ID3 tags embedded in MP3 files
• Offline metadata are kept separate from the content.

Page: Formal semantic metadata
• Data representation in a formalism with a formal semantic interpretation that defines the concept of (logical) entailment for reasoning:
  - Soundness: conclusions are valid entailments
  - Completeness: every valid entailment can be deduced
  - Decidability: a procedure exists to determine whether a conclusion can be deduced
• Embodiments:
  - Logics
  - Knowledge Representation Systems, Description Logics
  - Semantic Web: RDF, OWL

Page: "Semantics" in CMS
• CMSs provide various methods to include metadata:
  - Organization of content in hierarchies
  - Hierarchical taxonomies
  - Attachment of properties to content items as metadata
  - Content type definitions with inheritance
• These methods are used in CMSs in an ad hoc fashion, without clear semantics. Therefore no well-defined reasoning is possible.

Page: Semantic Lifting Usage
• Content Creation and Acquisition
  - Authoring content
    - Support content editors in providing metadata of specified types
  - Uploading external content/documents
    - Automatic extraction and analysis, e.g. for indexing
  - Importing content from external sources/documents
    - Integration of external content into the content repository
    - Content needs to be transformed to match internal CMS structures and metadata schemes
  - Cross-referencing/linking among CMS content items and external content
    - Detect related or additional content
    - Add pointers/links to related or additional content

Page: Semantic Lifting Usage
• Access to external documents and content repositories
  - Semantic harmonization with CMS semantic structures
  - Semantic interoperability in data exchange with other content repositories
  - The CMS needs to understand the data structures used by external services and programs
    - E.g. synchronization of a local calendar from Outlook with an external calendar based on the iCalendar format
    - E.g. importing RDF from a Linked Data endpoint such as DBpedia
  - The CMS must present its data in a form understood by external target services or programs

Page: Semantic Lifting Usage
• Publishing content with metadata
  - Metadata need to be transformed into a form compatible with the publication format
  - E.g. converting FreeDB metadata into ID3 tags for inclusion in an MP3 file
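The FreeDB-to-ID3 conversion mentioned above can be illustrated with a minimal sketch. It assumes the FreeDB record has already been parsed into a dictionary keyed by xmcd field names (DTITLE, DYEAR, DGENRE, TTITLE0, ...) and targets ID3v2.3 text frames (TPE1, TALB, TIT2, TRCK, TYER, TCON). The function name and the sample record are hypothetical; writing the frames into an actual MP3 file would be left to a tagging library.

```python
def freedb_to_id3(record: dict, track_index: int) -> dict:
    """Map one track of a parsed FreeDB (xmcd) record to ID3v2.3 text frames."""
    # FreeDB stores artist and album together in DTITLE as "Artist / Album".
    artist, separator, album = record["DTITLE"].partition(" / ")
    if not separator:
        # Assumption: without a separator, artist and album share the same value.
        artist = album = record["DTITLE"]
    track_count = sum(1 for key in record if key.startswith("TTITLE"))
    frames = {
        "TPE1": artist,                               # lead performer
        "TALB": album,                                # album title
        "TIT2": record[f"TTITLE{track_index}"],       # track title
        "TRCK": f"{track_index + 1}/{track_count}",   # FreeDB tracks are 0-based
    }
    if record.get("DYEAR"):
        frames["TYER"] = record["DYEAR"]
    if record.get("DGENRE"):
        frames["TCON"] = record["DGENRE"]
    return frames


# Hypothetical sample record, for illustration only.
sample_record = {
    "DTITLE": "Example Artist / Example Album",
    "DYEAR": "2001",
    "DGENRE": "Jazz",
    "TTITLE0": "First Track",
    "TTITLE1": "Second Track",
}
frames = freedb_to_id3(sample_record, track_index=1)
```

The point of the sketch is that lifting for publication is largely a schema mapping: each source field is assigned to the target field with the same meaning, and combined fields such as DTITLE have to be split first.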

Page: Publishing Web Content with semantic metadata
• Augmenting web content with structured information is becoming increasingly important.
• Several methods have emerged in recent years to include structured metadata in Web pages:
  - Microformats
  - RDFa
  - Microdata (HTML5)
• They are supported by the major search engines to improve search and result presentation, e.g. Google ("Rich Snippets"), Bing, Yahoo.

Page: Augmenting Web Content
• The HTML code contains a review of a restaurant in plain text, using only line breaks for structuring.
• Without specialized information extraction tools it cannot be interpreted by a machine, e.g. that it is a review (of what, and when?), who the reviewer was, etc.

L'Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.
Rating:

Page: Microformats
• Same text, but with additional span elements whose class attributes encode the type of the contained information (hReview) and the properties of that type

L'Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.
Rating:

Page: RDFa
• Same text, but with additional attributes and span elements encoding an RDF structure:
  - Namespace declaration of the ontology used
  - RDF class encoded by a typeof attribute, and its properties by property attributes

L'Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.
Rating:

Page: Microdata (HTML5)
• Same text, but with additional attributes and span elements:
  - A class declaration as the value of an itemtype attribute, and its properties as values of itemprop attributes

L'Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza, brought to your table promptly and without fuss. An ideal neighborhood pizza joint.
Rating:
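To make the microdata variant concrete, the following sketch embeds one possible markup of the review in a Python string and reads the properties back with the standard library HTML parser. The itemtype URL and the property names (itemreviewed, reviewer, summary, description) are assumptions chosen for illustration, not markup taken from the slides, and the rating is left out because its value is not given. The extractor is deliberately simple and only handles flat, non-nested properties.

```python
from html.parser import HTMLParser

# Illustrative markup; vocabulary URL and property names are assumptions.
REVIEW_HTML = """
<div itemscope itemtype="http://data-vocabulary.org/Review">
  <span itemprop="itemreviewed">L'Amourita Pizza</span>
  Reviewed by <span itemprop="reviewer">Ulysses Grant</span> on Jan 6.
  <span itemprop="summary">Delicious, tasty pizza on Eastlake!</span>
  <span itemprop="description">An ideal neighborhood pizza joint.</span>
</div>
"""


class MicrodataExtractor(HTMLParser):
    """Collect the itemtype and the text of each flat itemprop element."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_property = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        if "itemprop" in attributes:
            self._current_property = attributes["itemprop"]
            self.properties[self._current_property] = ""

    def handle_data(self, data):
        # Text outside an itemprop element is ignored.
        if self._current_property is not None:
            self.properties[self._current_property] += data

    def handle_endtag(self, tag):
        # Flat model: any closing tag ends the currently open property.
        self._current_property = None


extractor = MicrodataExtractor()
extractor.feed(REVIEW_HTML)
```

After parsing, `extractor.item_type` holds the declared class and `extractor.properties` maps each property name to its text, which is the structured view a search engine can obtain without any information extraction.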

Page: Lifting Requirements: Overview
• Top-level requirements
  - Semantic Associations with Content
  - Semantic Harmonization
  - Semantic Linking
  - Interactive Lifting
  - Customizability
  - Semantically Transparent Structured Content Sources

Page: Semantic Associations with Content
• Unstructured content and information must be supplied with structured semantic annotations and metadata.
  - Support for various content/media types
  - Information extraction from text, topic classification, image tagging, …
  - Support for the creation of semantic annotations in content authoring

Page: Semantic Harmonization
• Metadata and annotations must be harmonized with the requirements for semantic processing in the CMS.
  - Reengineering methods, interpreters and wrappers for all types and formats of metadata and annotations, e.g. tags, microformats, XML metadata (MPEG-7, …), ID3 tags, EXIF data, …
  - Ensure semantic interoperability of data and annotation schemes within the CMS and across external resources
  - Ontology mapping and harmonization of annotations
    - External metadata
    - Metadata generated by semantic analysis

Page: Semantic Linking
• Lifting must enable the interlinking of content objects by semantic relationships.
  - Internal linking of content items within the CMS
  - Links to external resources, e.g. Linked Open Data
  - Establish semantic relatedness of content for different views as well as different search, navigation and browsing strategies, …
    - Direct semantic links among content items and metadata
    - Similarity relations over sets of content items
    - Clustering of content items
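One simple way to obtain similarity relations over content items is to compare the sets of annotations attached to them. The sketch below uses the Jaccard coefficient for this; the choice of measure, the function names, the threshold and the sample annotations are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_pairs(annotations: dict, threshold: float) -> list:
    """Return (item, item, score) for all item pairs at or above the threshold."""
    pairs = []
    for id_a, id_b in combinations(sorted(annotations), 2):
        score = jaccard(annotations[id_a], annotations[id_b])
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical content items with their semantic annotations.
annotations = {
    "doc1": {"Paderborn", "Germany", "city"},
    "doc2": {"Paderborn", "Germany", "university"},
    "doc3": {"pizza", "review"},
}
links = related_pairs(annotations, threshold=0.4)
```

Here doc1 and doc2 share two of four distinct annotations (score 0.5) and would be linked, while doc3 stays unlinked.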

Page: Interactive Lifting
• Lifting must interact with CMS users.
  - Suggest semantic annotations during content creation
  - Support for various publishing formats such as microformats, RDFa, etc.
  - Automatic annotations (autotagging) with the option of manual correction
  - Automatic annotation components that learn and adapt from user feedback

Page: Customizability
• Lifting components must be customizable by CMS users/customers.
  - Users must not be restricted to predefined vocabularies, ontologies, …
  - Domain ontologies, terminologies and tag sets are defined by CMS users/customers.
  - Browsers and editors for component resources are necessary.

Page: Transparent Structured Content Sources
• Structured content sources need to be reengineered into semantic resources.
  - Support uniform data access to structured content repositories, e.g. SPARQL endpoints based on D2RQ technologies for transparent access to RDF and non-RDF databases
  - Extraction of ontologies from database structures, schemata, XML, resources, …
  - Alignment and mapping of the descriptions

Page: Semantic Reengineering of structured data sources
• Focus on tools for reengineering structured data sources to RDF representations
• Many tools and platforms exist:
  - D2R Server: exposes relational DBs as RDF
  - Talis platform: Linked Open Data
  - Triplify: like D2R, but in PHP
  - Virtuoso middleware
  - Krextor/OntoCape: generating RDF from XML
  - Various transformers for inducing RDF ontologies and instance data from XSD and XML
• More details in the presentation on Knowledge Representation (KReS)
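The basic idea behind exposing a relational database as RDF can be sketched in a few lines: each row becomes a resource, each column a property, and each cell a literal. The sketch below is not how D2R Server or Triplify are implemented; the namespace, the URI scheme, the function name and the sample table are assumptions, and only the standard library sqlite3 module is used.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def table_to_ntriples(connection, table, key_column):
    """Serialize all rows of a (trusted) table name as N-Triples lines."""
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        values = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{values[key_column]}>"
        # The table name serves as the class of every row resource.
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}vocab/{table}> .")
        for column, value in values.items():
            if column == key_column or value is None:
                continue  # NULL cells produce no triple
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}vocab/{table}_{column}> "{literal}" .')
    return triples


# Hypothetical sample database, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
connection.execute("INSERT INTO city VALUES (1, 'Paderborn', 'Germany')")
triples = table_to_ntriples(connection, "city", key_column="id")
```

Real tools add what this sketch omits: configurable mappings to existing vocabularies, typed literals, foreign keys rendered as links between resources, and query rewriting so that the data need not be materialized.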

Page: Semantic Content Enhancements: Overview
• Focus here is on textual content
• Metadata extraction from existing content in various formats, to make embedded metadata explicit
• Information extraction from textual content:
  - Named entities
  - Coreference
  - Relationships
• Classification and clustering of content items
  - Statistical methods and tools
  - Semantic classification based on ontological definitions

Page: Information Extraction
• Rule-based approaches for shallow text analysis
  - Usually based on finite-state technology: fast, robust
  - Cascaded processing
  - Based on templates as target structures to be filled
• Example platforms:
  - GATE
  - SProUT
• Can be used for nearly any kind of extraction/annotation task, including Named Entity Recognition (NER)
• Easy customization
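The rule-based, cascaded, template-filling style can be sketched with regular expressions, which are a convenient stand-in for finite-state rules. This is a toy illustration, not the rule formalism of GATE or SProUT; the two rules, the template slots and the sample sentence are assumptions.

```python
import re

# Stage 1: a person is a title followed by two or more capitalized words.
PERSON = re.compile(
    r"\b(?P<title>Dr\.|Prof\.|Mr\.|Ms\.)\s+(?P<name>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)
# Stage 2: an affiliation directly following a recognized person.
AFFILIATION = re.compile(
    r"\s+(?:of|from|at)\s+(?P<org>[A-Z][\w&]*(?:\s+[A-Z][\w&]*)*)"
)


def extract_persons(text):
    """Fill one person template per match; stage 2 builds on stage 1 output."""
    templates = []
    for match in PERSON.finditer(text):
        template = {
            "title": match.group("title"),
            "name": match.group("name"),
            "organization": None,
        }
        # The second rule is only tried where the first rule ended.
        follow = AFFILIATION.match(text, match.end())
        if follow:
            template["organization"] = follow.group("org")
        templates.append(template)
    return templates


# Hypothetical input sentence.
text = "Prof. Ada Example of Example University gave a talk. Dr. John Doe attended."
persons = extract_persons(text)
```

Customizing such a system means editing rules and gazetteers rather than re-annotating a corpus, which is why the slide lists easy customization as an advantage.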

Page: Information Extraction
• Semi-supervised learning approaches
  - Rule induction from corpora
  - Use example annotations as seeds for bootstrapping
  - Pattern rules learned from contextual features, with generalization over contexts
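The bootstrapping loop can be sketched as follows: contexts around known seed entities are turned into patterns, and the patterns are then applied to the corpus to find further entities. This is a heavily simplified illustration in which a "pattern" is just the word before and the word after an entity; the function names and the sample sentences are assumptions, and real systems score and filter patterns to keep them from drifting.

```python
import re


def learn_patterns(sentences, entities):
    """Collect (left word, right word) contexts around known entities."""
    patterns = set()
    for sentence in sentences:
        for entity in entities:
            match = re.search(rf"(\w+) {re.escape(entity)} (\w+)", sentence)
            if match:
                patterns.add((match.group(1), match.group(2)))
    return patterns


def apply_patterns(sentences, patterns):
    """Find capitalized word sequences occurring in a learned context."""
    found = set()
    for sentence in sentences:
        for left, right in patterns:
            rule = rf"\b{re.escape(left)} ([A-Z]\w*(?: [A-Z]\w*)*) {re.escape(right)}\b"
            for match in re.finditer(rule, sentence):
                found.add(match.group(1))
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between pattern learning and entity extraction."""
    entities = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(sentences, entities)
        entities |= apply_patterns(sentences, patterns)
    return entities


# Hypothetical corpus and seed annotation.
sentences = [
    "He works for Acme Corp in Berlin.",
    "She works for Example Ltd in Paris.",
    "They met in Rome.",
]
organizations = bootstrap(sentences, seeds={"Acme Corp"})
```

From the single seed the sketch learns the context "for ... in" and uses it to find the second organization, without any further annotation.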

Page: Named Entities
• Statistical approaches: examples
  - LingPipe: Hidden Markov Models
  - OpenNLP: Maximum Entropy Models
  - Stanford NER: Conditional Random Fields
• Statistical models created by supervised learning techniques
  - Large annotated corpora required
  - Customization difficult except by re-annotation/re-training
  - Not suitable for every type of named entity

Page: NER Document Markup

Page: NER Markup for a Web Page

Page: IE Template
A Person template (as a typed feature structure) instantiated from text. The template supports the extraction of various properties of a person.

Page: Classification
• Assign a data item to some predefined class
• Statistical classification
  - Numerous methods, e.g.:
    - Bayes classifiers
    - K-Nearest Neighbor (KNN)
    - Support Vector Machines (SVM)
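Of the methods listed, K-Nearest Neighbor is the easiest to show in full. The sketch below classifies an item by a majority vote among the k closest training examples, using Euclidean distance on numeric feature vectors; the function name, the feature vectors and the labels are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, item, k=3):
    """Return the majority label among the k training examples nearest to item.

    training is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(training, key=lambda pair: math.dist(pair[0], item))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data with two predefined classes.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
label = knn_classify(training, (1.1, 1.0))
```

For text, the feature vectors would typically be term weights of a document, which is the kind of preprocessing a text classification toolkit takes care of.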

Page: Semantic Classification
• Semantic classification in knowledge representation formalisms
  - Infer an item's class from the item's properties by matching them with the class definitions: which classes allow for these properties?
• Example: assume that our ontology contains two classes with some properties:
  - SpatialThing: latitude, longitude
  - PopulatedPlace: population
• Paderborn is an object with latitude "51°43′0″N", longitude "8°46′0″E" and a population value.
• Then we can infer that Paderborn is a SpatialThing, because those are the things that have latitudes and longitudes in our ontology. We can also infer that it is a PopulatedPlace, because those are the things that have a population.
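The inference in the Paderborn example can be sketched in the style of RDFS domain entailment: if a property is declared to belong to a class, every object that has the property is inferred to be a member of that class. The dictionary layout and the function name below are assumptions; only property names are modelled, so no property values are needed.

```python
# Each property is declared with the class it belongs to (its domain).
PROPERTY_DOMAINS = {
    "latitude": "SpatialThing",
    "longitude": "SpatialThing",
    "population": "PopulatedPlace",
}


def infer_classes(item_properties, property_domains=PROPERTY_DOMAINS):
    """Infer class membership from the properties an item is described with."""
    return {
        property_domains[name]
        for name in item_properties
        if name in property_domains
    }


# Paderborn is described with these three properties in the example.
paderborn_classes = infer_classes({"latitude", "longitude", "population"})
```

Unlike statistical classification, no training data is involved: the class follows logically from the ontology, and an item described only by latitude and longitude would be inferred to be a SpatialThing but not a PopulatedPlace.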

Page: Clustering
• Detection of classes in a data set
• Partitioning data into classes in an unsupervised way, with high intra-class similarity and low inter-class similarity
• Main variants:
  - Hierarchical clustering
    - Agglomerative
  - Partitioning clustering
    - K-Means
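K-Means, the partitioning variant named above, alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The following is a minimal sketch with explicitly supplied starting centroids, so that the result is deterministic; the function name and the sample points are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(points, centroids, iterations=10):
    """Cluster points around the given starting centroids (Lloyd's algorithm)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

In contrast to classification, no labels are supplied; the two groups emerge from the distances alone, which is what makes the method unsupervised.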

Page: Tools for Classification and Clustering
• Generic:
  - WEKA: Java library implementing several dozen methods for data mining. Application to textual data requires special preprocessing.
• Text:
  - MALLET: Java library with implementations of major methods for text and document classification and clustering

Page: Evaluation Measures
• Standard evaluation measures for IE/IR etc. systems:
  - Accuracy: (tp + tn) / (tp + tn + fp + fn)
  - Precision: tp / (tp + fp)
  - Recall: tp / (tp + fn)
  - F-Measure: 2 * Precision * Recall / (Precision + Recall)
• tp = true positives, tn = true negatives, fp = false positives, fn = false negatives
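The four measures can be computed directly from the counts tp, tn, fp and fn using their standard definitions. The sketch below is one way to write this down; the function name and the sample counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something the slides prescribe.

```python
def evaluation_measures(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-measure from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        # Harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
scores = evaluation_measures(tp=6, tn=10, fp=2, fn=2)
```

With these counts the accuracy is 16/20 = 0.8, and precision, recall and F-measure are all 6/8 = 0.75.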

Page: Evaluation Measures: Classification
• A confusion matrix reporting the classification of 27 wines by grape variety. The reference is the true variety; the response is the blind evaluation of a human judge.

Many-way confusion matrix (rows: reference, columns: response)

Reference          Cabernet  Syrah  Pinot   Precision  Recall  F-Measure
Cabernet               9       3      0       0,69      0,75     0,72
Syrah                  3       5      1       0,56      0,56     0,56
Pinot                  1       1      4       0,80      0,67     0,73
Macro average                                 0,68      0,66     0,67
Overall accuracy                              0,67

Examples: precision for Cabernet = 9/(9+3+1); recall for Pinot = 4/(1+1+4)
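The per-class figures follow mechanically from the confusion matrix: precision divides the diagonal cell by its column total (everything the judge called that variety), recall divides it by its row total (everything that truly is that variety). The sketch below recomputes the wine example; the function names are assumptions, and the code assumes that no row or column sums to zero.

```python
LABELS = ["Cabernet", "Syrah", "Pinot"]
# Rows are the reference (true variety), columns the response (judge's answer).
MATRIX = [
    [9, 3, 0],
    [3, 5, 1],
    [1, 1, 4],
]


def per_class_scores(matrix):
    """Return a (precision, recall, f_measure) tuple for every class."""
    scores = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        precision = true_positives / sum(row[i] for row in matrix)
        recall = true_positives / sum(matrix[i])
        f_measure = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f_measure))
    return scores


def macro_average(scores):
    """Unweighted mean of each measure over all classes."""
    return tuple(sum(column) / len(scores) for column in zip(*scores))


def overall_accuracy(matrix):
    """Share of items on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


scores = per_class_scores(MATRIX)
macro = macro_average(scores)
accuracy = overall_accuracy(MATRIX)
```

Rounded to two decimals, the results reproduce the slide: macro averages of 0.68, 0.66 and 0.67, and an overall accuracy of 18/27 = 0.67.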

Page: Evaluation Measures: NER
• Reference annotations:
  - [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
• Recognized annotations:
  - [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

      Count  Entities
TP      1    [Microsoft Corp.]
TN
FP      3    [CEO] [Steve] [today]
FN      2    [Windows 7] [Steve Ballmer]

Precision: 1/(1+3) = 0,25
Recall: 1/(1+2) = 0,33
F-Measure: 2*0,25*0,33/(0,25+0,33) = 0,28 (0,29 if precision and recall are not rounded first)
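The counts in this example come from exact matching of annotation boundaries: a recognized entity is a true positive only if it covers exactly the same tokens as a reference entity, so [Steve] does not match [Steve Ballmer]. The sketch below represents each annotation as a (start, end) token span with an exclusive end; the representation and the function name are assumptions.

```python
def evaluate_spans(reference, recognized):
    """Score recognized spans against reference spans by exact match."""
    reference, recognized = set(reference), set(recognized)
    tp = len(reference & recognized)   # identical spans
    fp = len(recognized - reference)   # recognized but not in the reference
    fn = len(reference - recognized)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return tp, fp, fn, precision, recall, f_measure


# Token positions: Microsoft(0) Corp.(1) CEO(2) Steve(3) Ballmer(4) announced(5)
# the(6) release(7) of(8) Windows(9) 7(10) today(11)
reference = [(0, 2), (3, 5), (9, 11)]            # Microsoft Corp., Steve Ballmer, Windows 7
recognized = [(0, 2), (2, 3), (3, 4), (11, 12)]  # Microsoft Corp., CEO, Steve, today
tp, fp, fn, precision, recall, f_measure = evaluate_spans(reference, recognized)
```

True negatives are not counted because every stretch of text that is not annotated would qualify, which is why NER evaluation reports precision, recall and F-measure rather than accuracy.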

Page: NER Evaluation
• Nobel Prize Corpus from NYT, BBC, CNN
• 538 documents (Ø 735 words/document)
• person, organization occurrences

             SProUT   Calais   Stanford NER   OpenNLP
Precision    77,26    94,22    73,21          57,69
Recall       65,85    86,66    73,62          42,86
F1           71,10    90,28    73,41          49,18

Page: References
• Microformats:
• RDFa:
• Google Rich Snippets:
• Linked Data:
• Linked Data: Heath and Bizer, Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool (also available online)
• Information Extraction: Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer, 2006
• Text Mining: Feldman and Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. CUP, 2007