
1 Language Technology I © 2006 Paul Buitelaar Language Technology I 2005/06 Paul Buitelaar German Research Center for Artificial Intelligence (DFKI) Knowledge Extraction/Semantic Web

2 Overview
Semantic Web
- Introduction
- Semantic Web Representation and Query Languages
- Semantic Web Tools
Ontologies and Knowledge Markup
- Ontologies and other Knowledge Organization Systems
- Knowledge Markup for Ontology Population
- Ontology Life-Cycle
Knowledge Extraction
- Ontology Population
- Ontology Learning

3 Semantic Web

4 Web (diagram): Docs, Data

5 Web > Semantic Web (diagram): Docs, Data; Knowledge Markup

6 Web > Semantic Web (diagram): Docs, Data; Knowledge Markup; Ontologies

7 Web > Semantic Web (diagram): Knowledge Markup; Ontologies

8 Accessing the Semantic Web - Machines (diagram): Knowledge Markup; Ontologies; Semantic Web Services

9 Accessing the Semantic Web - Humans (diagram): Intelligent Man-Machine Interface; Knowledge Markup; Ontologies; Semantic Web Services

10 Semantic Web Layer Cake
Introduced by Tim Berners-Lee in 2001
Built upon existing WWW standards

11 Resource Description Framework (RDF)
RDF is an extensible language for expressing graph structures
Serializes to XML
Graph (figure): node1 with name "DFKI GmbH", location "Kaiserslautern", www http://www.dfki.de
XML serialization (element markup not preserved in this transcript): <rdf:RDF xmlns:rdf="… rdf-syntax-ns#" xmlns:rdfs="… rdf-schema#" xmlns="http://example.org"> … DFKI GmbH … Kaiserslautern …
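
As a minimal sketch of the graph on this slide, the three statements about node1 can be held as (subject, predicate, object) triples and written out in an N-Triples-like form. The namespace string follows the slide's `xmlns="http://example.org"` with a `#` appended; that suffix, the blank-node label `_:node1` and the helper `to_ntriples` are assumptions made for illustration.

```python
# Namespace taken from the slide's default xmlns; the trailing '#' is an assumption.
EX = "http://example.org#"

# Each object is tagged as a literal or a URI so it can be serialized correctly.
triples = [
    ("_:node1", EX + "name", ("literal", "DFKI GmbH")),
    ("_:node1", EX + "location", ("literal", "Kaiserslautern")),
    ("_:node1", EX + "www", ("uri", "http://www.dfki.de")),
]


def to_ntriples(triples):
    """Serialize the triples one statement per line (hypothetical helper)."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        obj = f'"{value}"' if kind == "literal" else f"<{value}>"
        lines.append(f"{subject} <{predicate}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

The same three statements could equally be serialized to RDF/XML; the line-based form is used here only because it makes the graph structure easy to read.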

12 RDF Schema (RDFS)
Adds a vocabulary for representing classes and properties to RDF
Diagram labels: classes Person, Teacher, Student, Course; rdf:Literal; properties name, teaches, enrolledIn; is-a links
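
The effect of the is-a links can be sketched in a few lines: an instance of a subclass also counts as an instance of its superclasses. The reading that Teacher and Student are both subclasses of Person is an assumption about the diagram, and the individuals `anna` and `ben` are hypothetical.

```python
# Assumed reading of the slide's diagram: Teacher is-a Person, Student is-a Person.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}

# Asserted types for two hypothetical individuals.
TYPES = {"anna": "Teacher", "ben": "Student"}


def superclasses(cls):
    """Return cls followed by all its ancestors along the is-a links."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def instances_of(cls):
    """Individuals whose asserted type is cls or one of its subclasses."""
    return sorted(ind for ind, typ in TYPES.items() if cls in superclasses(typ))


print(instances_of("Person"))
```

Asking for instances of `Person` returns both individuals even though neither was typed as `Person` directly, which is the kind of inference a class hierarchy makes possible.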

13 Web Ontology Language (OWL)
OWL - Based on Description Logics
Adds further modelling vocabulary on top of RDFS
Layer diagram (from Syntax to Semantics): XML; Namespaces; XML Schema (Data Types); RDF (Interpretation Context); RDF Schema (Formalization: Classes (Inheritance), Properties); OWL (Formalization: Classes, Class Definitions, Properties, Property Types (e.g. Transitivity))

14 Semantic Web Query Languages - SPARQL
SPARQL - query language developed by W3C
Syntactically based on SQL
Results available as XML Documents
PREFIX foaf:
SELECT ?foafName
WHERE { ?x foaf:name ?foafName . OPTIONAL { ?x foaf:mbox ?mbox } . }
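
To illustrate what the OPTIONAL clause in this query does, the sketch below evaluates the same pattern by hand over a small in-memory list of triples. It is not a SPARQL engine; the data and the function `select_names` are hypothetical.

```python
# Hypothetical data: two resources have a foaf:name, only one has a foaf:mbox.
triples = [
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:name", "Bob"),
]


def select_names(triples):
    """Mimic SELECT ?foafName WHERE { ?x foaf:name ?foafName .
    OPTIONAL { ?x foaf:mbox ?mbox } } over a list of triples."""
    rows = []
    for x, predicate, name in triples:
        if predicate != "foaf:name":
            continue
        mboxes = [o for s, p, o in triples if s == x and p == "foaf:mbox"]
        # OPTIONAL: keep the solution even when no mbox matches (mbox unbound).
        for mbox in mboxes or [None]:
            rows.append({"x": x, "foafName": name, "mbox": mbox})
    # SELECT projects only ?foafName.
    return [row["foafName"] for row in rows]


print(select_names(triples))
```

Without the OPTIONAL semantics, the resource that lacks a mailbox would drop out of the result; with it, both names are returned.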

15 Semantic Web Tools
Programming APIs
- Jena - Java
- Redland – Python, …
- RAP - PHP
Editors
- Protégé
- OntoStudio
- Triple20 - Prolog
Storage
- Sesame
- OntoBroker

16 Ontologies and Knowledge Markup

17 Ontologies in Philosophy
Ontology is a branch of philosophy that deals with the nature and the organization of reality
Science of Being (Aristotle, Metaphysics)
- What characterizes being?
- Eventually, what is being?

18 Ontologies in Computer Science
Ontology refers to an engineering artifact
- a specific vocabulary used to describe a certain reality
- a set of explicit assumptions regarding the intended meaning of the vocabulary
An Ontology is
- an explicit specification of a conceptualization [Gruber 93]
- a shared understanding of a domain of interest [Uschold/Gruninger 96]

19 Why Develop an Ontology?
Make domain assumptions explicit
- Easier to change domain assumptions
- Easier to understand and update legacy data
Separate domain knowledge from operational knowledge
- Re-use domain and operational knowledge separately
A community reference for applications
Shared understanding of what information means

20 Types of Ontologies [Guarino, 98]
Top-level ontologies: describe very general concepts like space, time, event, which are independent of a particular problem or domain. It seems reasonable to have unified top-level ontologies for large communities of users.
Domain ontologies: describe the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology.
Task ontologies: describe the vocabulary related to a generic task or activity by specializing the top-level ontologies.
Application ontologies: these are the most specific ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.

21 Ontologies and Their Relatives
Spectrum (figure): Catalog / ID; Terms / Glossary; Thesauri; Informal Is-a; Formal Is-a; Formal Instance; Frames; Value Restrictions; General logical constraints; Axioms; Disjoint; Inverse Relations, ...

22 Knowledge Organization Systems
Semantic Lexicons – e.g. WordNet
- … group together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.
Thesauri
- … group together domain terms according to a set of taxonomic relations, including broader term, narrower term, sibling, etc.
Semantic Networks and Ontologies
- … group together classes of objects according to a set of relations that originate in the nature of the domain of application.
- Ontologies are defined by a formal semantics, but semantic networks may be informally defined. Therefore all ontologies are semantic networks, but not all semantic networks are ontologies.

23 Thesauri - Examples
MeSH (Medical Subject Headings) is organized by terms (currently over 250,000) that correspond to a specific medical subject. For each such term a list of syntactic, morphological or semantic variants is given.
MeSH Heading: Databases, Genetic
Entry Term: Genetic Databases
Entry Term: Genetic Sequence Databases
Entry Term: OMIM
Entry Term: Online Mendelian Inheritance in Man
Entry Term: Genetic Data Banks
Entry Term: Genetic Data Bases
Entry Term: Genetic Databanks
Entry Term: Genetic Information Databases
See Also: Genetic Screening
EuroVoc covers terminology in all of the official EU languages for all fields that concern the EU institutions, e.g., politics, trade, law, science, energy, agriculture, 27 such fields in total.
MT 3606 natural and applied sciences
UF gene pool, genetic resource, genetic stock, genotype, heredity
BT1 biology
BT2 life sciences
NT1 DNA
NT1 eugenics
RT genetic engineering (6411)

24 Semantic Networks - Examples
UMLS (Unified Medical Language System) integrates linguistic, terminological and semantic information. The Semantic Network consists of 134 semantic types and 54 relations between types.
Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
GO (Gene Ontology) allows for “consistent descriptions of gene products in different databases, including several of the world’s major repositories for plant, animal and microbial genomes…” Organizing principles are molecular function, biological process and cellular component.
Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage
all : all (164142)
GO:0008150 : biological process (115947)
GO:0007275 : development (11892)
GO:0009292 : genetic transfer (69)
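
The Term Lineage listing for GO:0009292 can be read as a chain of parent links, and walking that chain is a simple loop. The sketch below treats the lineage as a single chain, which is an assumption (GO terms can have more than one parent); the function name `lineage` is hypothetical.

```python
# Parent links read off the slide's "Term Lineage" listing, treated as one chain.
PARENT = {
    "GO:0009292": "GO:0007275",
    "GO:0007275": "GO:0008150",
    "GO:0008150": "all",
}
NAME = {
    "all": "all",
    "GO:0008150": "biological process",
    "GO:0007275": "development",
    "GO:0009292": "genetic transfer",
}


def lineage(term):
    """Return the path from the root down to the given term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


for depth, term in enumerate(lineage("GO:0009292")):
    print("  " * depth + f"{term} : {NAME[term]}")
```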

25 Example Ontology
Consider an Example Ontology for the Newspaper Domain

26 Knowledge Markup
Ontologies are used to semantically organize and retrieve data (structured, textual, multimedia) through knowledge markup
Knowledge Markup from Text is based on Named-Entity Recognition, Semantic Tagging (Term to Class Mapping) and Relation Extraction
Consider the following example:
<news:story xmlns:jobs="http://www.jobs.org/owl-jobs#" xmlns:com="http://www.companies.org/owl-companies#" xmlns:it="http://www.it.net/owl-it#"> “We were surprised by several of the results, particularly the order of finish,” said Dan Olds. IBM finished first with very strong results, and HP scored a solid number two; we expected to see Sun Microsystems challenging for first place or at least a strong second place. As the largest UNIX vendor in terms of number of installed systems, a third place finish should put their management on notice that their installed base may be vulnerable.

27 Knowledge Markup - Images
Semantic Annotation of Medical Images (miAKT Project - UK)

28 Knowledge Markup - Images
Semantic Annotation of Video (SmartMedia – DFKI KM)

29 Ontology Life-Cycle
- Create/Select: Development and/or Selection
- Populate: Knowledge Base Generation
- Validate: Consistency Checks
- Evolve: Extension, Modification
- Maintain: Usability Tests
- Deploy: Knowledge Retrieval

30 Knowledge Extraction
Ontology Population & Ontology Learning

31 Ontology Life-Cycle – Ontology Population
- Create/Select: Development and/or Selection
- Populate: Knowledge Base Generation
- Validate: Consistency Checks
- Evolve: Extension, Modification
- Maintain: Usability Tests
- Deploy: Knowledge Retrieval

32 Ontology Population with SOBA
SOBA: SmartWeb Ontology-based Annotation
Application Context
- SmartWeb (http://www.smartweb-projekt.de/) – German Project around World-Cup 2006
- Integrates
  - Multimodal Dialog Processing
  - IR-based Question Answering
  - Ontology-Based Information Extraction
  - Semantic Web Services
Ontology-Based Information Extraction …
- Combines:
  - Semantic Wrapping of Semi-Structured Data
  - Semantic and Linguistic Annotation of Free Text
  - Inference Rules for Instantiation and Integration of Annotated Entities and Events
… and Display
- Ontology-driven Hyperlink Generation for Display of Extracted Information

33 SOBA – Processing and Data Flow
Diagram components: Documents; Ontologies; Wrapping of Semi-Structured Data; PDF Analysis; Image Extraction; Linguistic Annotation; Named Entity Recognition & Semantic Tagging; Inference Rules for Instantiation & Integration; Knowledge Base

34 SWIntO: SmartWeb Integrated Ontology
Class hierarchy excerpt (figure): SmartDOLCE:Entity; SmartSUMO:Attribute; SmartSUMO:SocialRole; SmartSUMO:Proposition; SportEvent:FootballPlayer; SportEvent:Goalkeeper; SportEvent:FootballOrganizationPerson; SportEvent:FootballClubPresident; …
SWIntO (by AIFB, DFKI KM/IUI, EML) covers
- Foundational (DOLCE) and General (SUMO) Knowledge
- Domain- and Task-Specific Knowledge
  - Football / Sport Events
  - Navigation, Discourse, Multimedia
  - other

35 SmartWeb Integrated Ontology (by AIFB, DFKI KM/IUI, EML)

36

37 SmartWeb Corpus
(Growing) Web Corpus through Monitor on
- http://fifaworldcup.yahoo.com/
- http://www.uefa.com/competitions/worldcup
Semi-Structured Data
- Tabular: Match Reports, Teams, etc.
Free Text
- Match Reports
- Image Captions

38 Semi-Structured Data - HTML

39 Semi-Structured Data - XML

40 Semi-Structured Data – F-Logic

41 Information Extraction from Free Text
Figure labels: MatchEvent [Score, Team1, Team2]; FootballPlayer

42 Information Extraction from Image Captions
Figure labels: FoulEvent [FootballPlayer]; FootballPlayer

43 Linguistic and Semantic Annotation
Mark Crossley saved twice with his legs from Huckerby.
Named Entity Recognition & Semantic Tagging
[Mark Crossley GOALKEEPER] [saved GOALKEEPER_ACTION] twice with his legs from [Huckerby PLAYER].
Linguistic Annotation
[Mark Crossley GOALKEEPER : SUBJ] [saved PRED : GOALKEEPER_ACTION] twice [with his legs PP_OBJ] [from [Huckerby PLAYER] PP_ADJUNCT].
[ GOALKEEPER_ACTION = 'save', GOALKEEPER = 'Mark Crossley', PLAYER = 'Huckerby', MANNER = 'legs' ]
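
The tagging step on this slide can be approximated with a lookup-based sketch. The gazetteer, the action lexicon and the functions `tag` and `frame` are hypothetical stand-ins; the slides do not describe how the actual system implements this step, and the MANNER slot is left out because it depends on the parse.

```python
import re

# Hypothetical gazetteer and action lexicon; a real system would derive these
# from the ontology and from linguistic resources.
GAZETTEER = {"Mark Crossley": "GOALKEEPER", "Huckerby": "PLAYER"}
ACTIONS = {"saved": ("GOALKEEPER_ACTION", "save")}


def tag(sentence):
    """Wrap known names and action words in [text LABEL] brackets."""
    entries = list(GAZETTEER.items()) + [(w, a[0]) for w, a in ACTIONS.items()]
    for surface, label in entries:
        pattern = rf"\b{re.escape(surface)}\b"
        sentence = re.sub(pattern, f"[{surface} {label}]", sentence)
    return sentence


def frame(sentence):
    """Collect the tagged items into a flat attribute-value frame."""
    slots = {}
    for surface, label in GAZETTEER.items():
        if surface in sentence:
            slots[label] = surface
    for word, (label, lemma) in ACTIONS.items():
        if re.search(rf"\b{word}\b", sentence):
            slots[label] = lemma
    return slots


text = "Mark Crossley saved twice with his legs from Huckerby."
print(tag(text))
print(frame(text))
```

Pure lookup recovers the entity and action slots; filling a slot like MANNER requires the grammatical roles added by the linguistic annotation step.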

44 Annotation/Extraction Example
Example Sentence from Match Report
Allerdings ist Petrow fuer die Partie gegen Schweden gesperrt und kann erst gegen Ungarn eingesetzt werden.
“However, Petrow is suspended for the match against Sweden and cannot be fielded again until the match against Hungary.”
Annotated/Extracted Information (with SProUT IE Tool - DFKI-LT)
player_action & [GAME_EVENT "Ban", AGENT player & [SURNAME "PETROW"], IN_MATCH game & [TEAM2 "SWE", TOURNAMENT "Match"]]
team & [NAME "HUN"]

45 Knowledge Base Generation
Transformation of SProUT Output to F-Logic via Declarative Mappings, e.g.:

46 SProUT to F-Logic
SProUT (XML feature structure, mostly not preserved in this transcript): <FS type="player_action"> …
F-Logic:
soba#player124:sportevent#FootballPlayer[sportevent#impersonatedBy -> soba#Guido_BUCHWALD].
soba#Guido_BUCHWALD:dolce#"natural-person"[dolce#"HAS-DENOMINATION" -> soba#Guido_BUCHWALD_Denomination].
soba#Guido_BUCHWALD_Denomination:dolce#"natural-person-denomination"[dolce#LASTNAME -> "Buchwald"; dolce#FIRSTNAME -> "Guido"].
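
A mapping of this kind can be sketched as a small template function that turns the extracted name fields into the three F-Logic facts shown on the slide. The function `player_to_flogic` and its parameters are hypothetical; the slides state only that the real mappings are declarative, not how they are written.

```python
def player_to_flogic(player_id, first_name, last_name):
    """Render one extracted player as three F-Logic facts (illustrative only)."""
    person = f"soba#{first_name}_{last_name.upper()}"
    denomination = f"{person}_Denomination"
    return [
        f"soba#{player_id}:sportevent#FootballPlayer"
        f"[sportevent#impersonatedBy -> {person}].",
        f'{person}:dolce#"natural-person"'
        f'[dolce#"HAS-DENOMINATION" -> {denomination}].',
        f'{denomination}:dolce#"natural-person-denomination"'
        f'[dolce#LASTNAME -> "{last_name}"; dolce#FIRSTNAME -> "{first_name}"].',
    ]


for fact in player_to_flogic("player124", "Guido", "Buchwald"):
    print(fact)
```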

47 A Complex Example
semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO":sportevent#FieldMatchFootballPlayer[
  externalRepresentation@(de) ->> "Luis CRISTALDO (7)";
  sportevent#number -> 7;
  sportevent#impersonatedBy -> semistruct#"Luis_CRISTALDO" ].
semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00"[
  sportevent#matchEvents -> soba#ID25 ].
soba#ID25:sportevent#Foul[
  sportevent#commitedBy -> semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO" ].
mediainst#ID67:media#Picture[
  media#URL -> "http://fifaworldcup.yahoo.com/06/de/photos/index.html?aid=124155&d=1";
  media#shows -> ID25 ].

48 Display of Extracted Information

49 Ontology Life-Cycle – Ontology Learning
- Create/Select: Development and/or Selection
- Populate: Knowledge Base Generation
- Validate: Consistency Checks
- Evolve: Extension, Modification
- Maintain: Usability Tests
- Deploy: Knowledge Retrieval

50 Ontology Learning Layer Cake
Layers, with examples:
- Terms: disease, doctor, hospital
- (Multilingual) Synonyms: {disease, illness, Krankheit}
- Concepts: DISEASE:=
- Taxonomy: is_a(DOCTOR, PERSON)
- Relations: cure(dom:DOCTOR, range:DISEASE)
- Rules & Axioms
Introduced in: Philipp Cimiano, PhD Thesis University of Karlsruhe, forthcoming

51 Some Current Work on Ontology Learning from Text
Term Extraction
- Statistical Analysis
- Patterns
- (Shallow) Linguistic Parsing
- Term Disambiguation & Compositional Interpretation
- Combinations
Taxonomy Extraction
- Statistical Analysis & Clustering (e.g. FCA)
- Patterns
- (Shallow) Linguistic Parsing
- WordNet
- Combinations
Relation Extraction
- Anonymous Relations (e.g. with Association Rules)
- Named Relations (Linguistic Parsing)
- (Linguistic) Compound Analysis
- Web Mining, Social Network Analysis
- Combinations
Definition Extraction
- (Linguistic) Compound Analysis (incl. WordNet)
Overview of Current Work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini, Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

52 RelExt - Relation Extraction for Ontology Learning
Ontology Learning Layer Cake, with examples:
- Terms: disease, doctor, hospital
- (Multilingual) Synonyms: {disease, illness, Krankheit}
- Concepts: DISEASE:=
- Taxonomy: is_a(DOCTOR, PERSON)
- Relations: cure(dom:DOCTOR, range:DISEASE)
- Rules & Axioms

53 RelExt - Motivation
Extend Ontology with Relations
- Currently ~ 60 Relations in the Sport Events Ontology
  - Mostly Properties, e.g. hasName, atMinute, …
- Representation of (Verbal) Relations Enables Better Modeling of Events for Information Extraction Purposes
Example
“Ballack shoots the ball in the net.”
Relation: Shoot (Domain: FootballPlayer, Range: BallObject)

54 RelExt – System Architecture
Linguistic Annotation: Corpus > Named-Entity Rec. & Semantic Tagging, Shallow Parsing > Annotated Corpus
Statistical Processing: Relevance Measure (Frequencies in BNC, NZZ) > Relevance Scores (Heads, Preds); Co-occurrence Measure > Co-occurrence Scores (Heads <> Preds)
Relation Extraction and Evaluation: Triple Generation > Triples (Head : Pred : Head) > Evaluation
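
The relevance step compares how often a word occurs in the domain corpus with how often it occurs in a general reference corpus. The slide does not say which measure is used, so the sketch below substitutes a simple smoothed relative-frequency ratio; that choice, the toy corpora and the function `relevance` are all assumptions.

```python
from collections import Counter


def relevance(domain_tokens, reference_tokens):
    """Score each domain word by its relative frequency in the domain corpus
    divided by its add-one-smoothed relative frequency in the reference corpus.
    This ratio is a stand-in, not the measure used by the system on the slide."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    vocab = len(set(domain_tokens) | set(reference_tokens))
    scores = {}
    for word, count in dom.items():
        p_dom = count / n_dom
        p_ref = (ref[word] + 1) / (n_ref + vocab)
        scores[word] = p_dom / p_ref
    return scores


# Toy corpora (hypothetical): a football text and a general news text.
domain = "keeper save ball keeper foul ball the the".split()
reference = "the market the bank the ball the vote".split()

scores = relevance(domain, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Words that are frequent in the domain text but rare in the reference text rise to the top, while function words that are frequent everywhere sink to the bottom.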

55 Linguistic Annotation
Named-Entity Recognition
“Michael Ballack” : FootballPlayer
Semantic Tagging
“Ball” (ball), “Leder” (leather) : BallObject
Shallow Parsing
- Part-of-Speech Tagging
  Fussballspieler (soccer player): Noun
- Morphological Analysis
  Fussballspieler: Fussball – Spieler
- Dependency Structure Analysis
  “The team won the second match.” SUBJECT PREDICATE DIRECT_OBJECT

56 Relevance Ranking
Top-10 Head-Nouns before and after mapping to Ontology Classes
Top-10 Predicates

57 Co-Occurrence Analysis
…
flanken (to cross) SUBJ:FOOTBALLPLAYER “Klasnic”
flanken DOBJ:FOOTBALLPLAYER “Klose”
flanken_in PP_ADJ “Zuschauer” (audience)
…
beschimpfen (to insult) SUBJ:FOOTBALLPLAYER “Klasnic”
…
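
One way to turn observations like these into candidate Head : Pred : Head triples is to count, per predicate, which ontology classes fill the subject and direct-object roles and pair the most frequent ones. This counting scheme is a simplification assumed for illustration, not the scoring described on the slides, and the observation list is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed observations: (predicate, grammatical role, class of the head).
observations = [
    ("flanken", "SUBJ", "FootballPlayer"),
    ("flanken", "DOBJ", "FootballPlayer"),
    ("flanken", "SUBJ", "FootballPlayer"),
    ("flanken", "DOBJ", "BallObject"),
    ("flanken", "DOBJ", "BallObject"),
    ("beschimpfen", "SUBJ", "FootballPlayer"),
]


def candidate_triples(observations):
    """Pair the most frequent SUBJ class with the most frequent DOBJ class
    for every predicate that has been seen with both roles."""
    by_pred = defaultdict(lambda: {"SUBJ": Counter(), "DOBJ": Counter()})
    for pred, role, cls in observations:
        if role in ("SUBJ", "DOBJ"):
            by_pred[pred][role][cls] += 1
    triples = []
    for pred, roles in by_pred.items():
        if roles["SUBJ"] and roles["DOBJ"]:
            domain = roles["SUBJ"].most_common(1)[0][0]
            range_ = roles["DOBJ"].most_common(1)[0][0]
            triples.append((domain, pred, range_))
    return triples


print(candidate_triples(observations))
```

Predicates seen with only one of the two roles, such as `beschimpfen` in this toy data, yield no triple and would need more evidence.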

58 Integration into Ontology Development

59 OntoLT – Protégé Plug-In for Ontology Extraction from Text
Ontology Learning Layer Cake, with examples:
- Terms: disease, doctor, hospital
- (Multilingual) Synonyms: {disease, illness, Krankheit}
- Concepts: DISEASE:=
- Taxonomy: is_a(DOCTOR, PERSON)
- Relations: cure(dom:DOCTOR, range:DISEASE)
- Rules & Axioms

60 OntoLT – Basic Idea
Middleware Solution in Ontology Development
- Supports the Ontology Engineer through Semi-Automatic Extraction of Ontology Fragments from Domain-Relevant Document Collections
- Download http://olp.dfki.de/OntoLT/OntoLT.htm
Based on
- Automatic Linguistic Annotation
- Manual Definition of Mapping Rules
- Statistical Preprocessing (Option)
- Interactive Validation of Candidates
- Generation in Protégé of Ontology Fragments
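
To make the idea of a mapping rule concrete, the sketch below applies one plausible rule to annotated noun phrases: the head noun becomes a class and each modifier-head combination becomes a subclass of it. The rule, the input format and the function `head_mod_rule` are illustrative assumptions, not the tool's actual rule language.

```python
def head_mod_rule(noun_phrases):
    """Map (modifier, head) pairs to candidate classes.

    Returns a dict from class name to parent class (None for top-level
    candidates). Purely illustrative of what a mapping rule could produce."""
    classes = {}
    for modifier, head in noun_phrases:
        classes.setdefault(head, None)
        if modifier:
            classes[f"{modifier}_{head}"] = head
    return classes


# Hypothetical output of the linguistic annotation step.
phrases = [("football", "player"), ("", "player"), ("match", "report")]
candidates = head_mod_rule(phrases)
for name, parent in candidates.items():
    print(name, "is-a", parent)
```

In the workflow on this slide, candidates like these would still pass through interactive validation before any ontology fragment is generated.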

61 OntoLT – System Architecture

62 Corpus Example – KMI News

63 Mapping Rules

64 Statistical Relevance

65 Extract Candidates

66 Generate Ontology Fragments

67 Exercises
Knowledge Extraction
- Ontology Modeling (from Text)
- Ontology Population
- Ontology Learning (Extension)
- Ontology Mapping

