From Information Search to Semantic Search

Content derived from many sources, most notably the Semantic Search presentation by Wiltrud Kessler, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2014, and informationretrieval.org from Stanford University.

Information Retrieval

Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

• You should already be very familiar with information retrieval from performing Internet searches with Google, Bing, Yahoo, DuckDuckGo, etc.
• Example: Shakespeare search

Consider Searching the Collected Works of William Shakespeare

• Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?
• The grep technique
  • Based on the UNIX command grep, which performs a linear scan of a collection of documents
  • Efficient, effective, and allows use of wildcards
  • For Boolean queries, perform the set operation (Brutus ∧ Caesar) ∧ ¬Calpurnia
• Limitations
  • Slow for very large collections
  • No relevance ranking
  • No use of distance or semantic context, e.g., "Brutus" near "Caesar", where near could be defined as within 5 words, in the same sentence, etc.

Grep for the word "peanut" retrieves nothing. Grep using the regular expression peanut[a-z] finds "peanut" followed by any lowercase letter.
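The linear scan described above can be sketched in a few lines. This is only an illustration: the three short texts are hypothetical stand-ins for full plays, and `linear_scan` is a made-up helper name, not a real grep interface.

```python
import re

def linear_scan(docs, required, excluded):
    """Scan every document in turn; keep titles that contain all
    required words and none of the excluded words."""
    hits = []
    for title, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(w in words for w in required) and not any(w in words for w in excluded):
            hits.append(title)
    return hits

# Hypothetical one-line stand-ins for full play texts.
docs = {
    "Play A": "Brutus speaks to Caesar",
    "Play B": "Caesar, Brutus and Calpurnia enter",
    "Play C": "Caesar alone",
}
result = linear_scan(docs, required=["brutus", "caesar"], excluded=["calpurnia"])
```

Every query reads every document, which is why this approach slows down as the collection grows.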

Solution?

• Build an index that relates terms to documents
• This is known as an incidence matrix

Incidence matrix: one row per term (Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser) and one column per play (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, ...).

Using an Incidence Matrix

Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?

• This search is resolved via a bit-wise AND over all search terms: the rows for Brutus and Caesar and the complement of the row for Calpurnia
• Conjunction: 100100
• Two hits:
  • Antony and Cleopatra
  • Hamlet
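A minimal sketch of the bit-wise AND follows. The three row vectors are an assumption: they are the values used in the well-known textbook version of this example (Manning, Raghavan and Schütze), chosen because they reproduce the conjunction 100100 quoted above.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play.
# Assumed values (textbook example), not read off the slide.
ROWS = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
ALL_PLAYS = (1 << len(PLAYS)) - 1  # 0b111111, used to clip the complement

def brutus_caesar_not_calpurnia(rows):
    """Bit-wise AND of Brutus, Caesar and the complement of Calpurnia."""
    return rows["Brutus"] & rows["Caesar"] & (~rows["Calpurnia"] & ALL_PLAYS)

def plays_for(bits):
    """Translate a bit vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]

bits = brutus_caesar_not_calpurnia(ROWS)
```

Formatting `bits` with `format(bits, "06b")` gives the same six-digit vector as the slide, and `plays_for(bits)` maps it back to the two hits.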

Beyond Incidence Matrices

• This technique also doesn't scale well
• Solution?
  • These matrices tend to be sparse (many cells contain 0)
  • Thus, don't index the absence of words.

Matrix size: one column per work, one row for each of ~32,000 words.

Inverted Index

Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?

• A dictionary stores each term and has a pointer to its postings list
• Each play (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth) is identified by a Doc ID
• Layout: a dictionary of terms (Brutus, Caesar, Calpurnia), each pointing to its postings, sorted by Doc ID
• In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk.
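As a rough sketch of the structure above, an inverted index can be built as a mapping from each term to a sorted list of Doc IDs. The three tiny documents are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Only terms that occur are stored; absence is never indexed.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents keyed by Doc ID.
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia and brutus",
    3: "mercy",
}
index = build_inverted_index(docs)
```

Here the dictionary is the set of keys and each value plays the role of a postings list.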

Conjunctive Search Algorithm

INTERSECT(t1, ..., tn)
1. terms ← SORTBYINCREASINGFREQUENCY(t1, ..., tn)
2. result ← postings(first(terms))
3. terms ← rest(terms)
4. while terms ≠ NIL and result ≠ NIL
5.   do result ← INTERSECT(result, postings(first(terms)))
6.      terms ← rest(terms)
7. return result

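Tracing the Algorithm

• This algorithm attempts to maximize efficiency by always starting with the conjunction of the two rarest terms in the collection

Trace for Brutus ∧ Calpurnia:
Step | terms | result | Comments
1 | Calpurnia, Brutus | |
2 | Calpurnia, Brutus | 2, 31, 54, 101 |
3 | Brutus | 2, 31, 54, 101 |
4 | Brutus | 2, 31, 54, 101 |
5 | Brutus | 2, 31 | Intersection of result and postings(Brutus)
6 | NIL | 2, 31 |
4 | NIL | 2, 31 |
7 | | 2, 31 | Return

Trace for Brutus ∧ Caesar ∧ Calpurnia:
Step | terms | result | Comments
1 | Calpurnia, Brutus, Caesar | |
2 | Calpurnia, Brutus, Caesar | 2, 31, 54, 101 |
3 | Brutus, Caesar | 2, 31, 54, 101 |
4 | Brutus, Caesar | 2, 31, 54, 101 |
5 | Brutus, Caesar | 2, 31 | Intersection of result and postings(Brutus)
6 | Caesar | 2, 31 |
4 | Caesar | 2, 31 |
5 | Caesar | 2 | Intersection of result and postings(Caesar)
6 | NIL | 2 |
4 | NIL | 2 |
7 | | 2 | Return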
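The traced algorithm can be sketched in Python as follows. The Calpurnia postings (2, 31, 54, 101) come from the trace; the Brutus and Caesar postings are assumptions taken from the textbook version of this example, and the function names are illustrative.

```python
def intersect_two(a, b):
    """Merge-intersect two postings lists that are sorted by doc ID."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect(terms, postings):
    """Conjunctive query: process the rarest terms first."""
    if not terms:
        return []
    terms = sorted(terms, key=lambda t: len(postings[t]))  # step 1
    result = postings[terms[0]]                            # step 2
    for term in terms[1:]:                                 # steps 3-6
        if not result:
            break  # nothing left to intersect
        result = intersect_two(result, postings[term])
    return result                                          # step 7

POSTINGS = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],   # assumed
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],      # assumed
    "Calpurnia": [2, 31, 54, 101],               # from the trace
}
two_terms = intersect(["Brutus", "Calpurnia"], POSTINGS)
three_terms = intersect(["Brutus", "Caesar", "Calpurnia"], POSTINGS)
```

Starting from the shortest list keeps every intermediate result no longer than the rarest term's postings.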

Improving Performance of the Algorithm

INTERSECT(t1, ..., tn)
1. terms ← SORTBYINCREASINGFREQUENCY(t1, ..., tn)

• To fulfill step 1, the server must access disk to get frequencies
• Unless we add the frequencies to the dictionary
• The dictionary of terms is kept in memory; the postings, sorted by Doc ID, are kept on disk

Conjunctive Search Algorithm

(Republican ∨ Democrat) ∧ (primary ∨ caucus) ∧ (Iowa ∨ Delaware)

• This query can also follow the Conjunctive Search Algorithm by calculating the frequencies of the disjunctive pairs from memory (not from disk)
• freq(Republican ∨ Democrat) is estimated as freq(Republican) + freq(Democrat)
• These disjunctive frequencies are said to be conservative. Why?
  • (They assume no overlap for the disjunctions, so the sum is an upper bound.)
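A small sketch of how the conservative estimates could drive the processing order. The document frequencies below are invented for illustration; only the query itself comes from the slide.

```python
# Hypothetical document frequencies, as they might be stored in the dictionary.
FREQ = {"Republican": 900, "Democrat": 850, "primary": 400,
        "caucus": 60, "Iowa": 120, "Delaware": 40}

def estimate(group, freq):
    """Upper bound on the size of an OR-group: the sum of its term frequencies."""
    return sum(freq[term] for term in group)

def processing_order(groups, freq):
    """Process the OR-groups with the smallest estimated size first."""
    return sorted(groups, key=lambda group: estimate(group, freq))

query = [("Republican", "Democrat"), ("primary", "caucus"), ("Iowa", "Delaware")]
order = processing_order(query, FREQ)
```

With these made-up numbers the (Iowa ∨ Delaware) group would be processed first, because its estimated size is the smallest.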

Next Generation Algorithm: Faster Postings List Intersection via Skip Pointers

• Previously, we walked through each DocID in the postings lists
• An improved postings data structure can add "skip points" to various elements on the list.
  • Each skip reference should also be a skip point (unless you are near the end of the list)
  • DocID 16 in the candle list is both a skip reference (from DocID 2) and a skip point
• Look at the algorithm (next slide) and explain why this is a good idea
• Design questions
  • Where to place skip pointers
  • How to do efficient merging using skip pointers

Example postings lists: candle and butcher.

Conjunctive Search Algorithm With Skips

INTERSECTWITHSKIPS(p1, p2)
1. answer ← ⟨ ⟩
2. while p1 ≠ NIL and p2 ≠ NIL
3. do if docID(p1) = docID(p2)
4.      then ADD(answer, docID(p1))
5.           p1 ← next(p1)
6.           p2 ← next(p2)
7.      else if docID(p1) < docID(p2)
8.             then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
9.                    then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
10.                         do p1 ← skip(p1)
11.                   else p1 ← next(p1)
12.            else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
13.                   then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
14.                        do p2 ← skip(p2)
15.                   else p2 ← next(p2)
16. return answer
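A Python sketch of the same procedure, using list positions in place of pointers. Two things are assumptions rather than part of the slide: skips are placed roughly every √n positions (a common heuristic), and the two sample lists are hypothetical, not the candle and butcher postings.

```python
import math

def skip_targets(postings):
    """Assumed placement: a skip every ~sqrt(n) positions.
    Maps a position to the position its skip pointer leads to."""
    n = len(postings)
    step = max(1, int(math.sqrt(n)))
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when safe."""
    s1, s2 = skip_targets(p1), skip_targets(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                # Follow skips while they do not overshoot p2's current doc ID.
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

# Hypothetical postings lists.
list_a = [2, 4, 8, 16, 19, 23, 28, 43]
list_b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
common = intersect_with_skips(list_a, list_b)
```

A skip is only taken when its target does not exceed the other list's current Doc ID, so no matching document can be jumped over.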

Next Generation Algorithm: Faster Postings List Intersection via Skip Pointers

• Using this algorithm, we were able to completely skip over documents 19, 23, 60 and 71
• Skip pointers only help for Boolean AND queries. Why?
• A full trace of this algorithm for candle and butcher is available on the web site

Precision and Recall

• Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents. 7,000 / 20,000 = 0.35
• Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search. 7,000 / 10,000 = 0.70
• Example: Search for European runners: Europe% AND runner (% is the wildcard, using the SQL standard)

Counts: query result set 10,000; relevant documents 20,000; intersection 7,000.

Sample documents:
• "A French Olympic runner has developed a reputation for the way he marks the end of his races"
• "Socialist front-runner Martin Schulz launched his European campaign"
• "The Central Athletics Club runner took part in the 3000m at the European Team Championships."

Precision, Recall and Limitations of Boolean Searches

• Example: Search for European runners: Europe% AND runner
• Recall: 0.35
• Precision: 0.70
• Boolean logic queries using AND tend to produce high precision but low recall searches

Precision, Recall and Limitations of Boolean Searches

• Example: Search for P!nk: Pink OR P!nk
• Query result set: 1,000,000 documents
• Intersection (relevant documents retrieved): 19,990
• Relevant documents missed: 10 articles that do NOT mention P!nk or Pink, about either:
  • the remake of the song "Get This Party Started" on the album Punk Goes Pop (2002) by Stretch Armstrong
  • Alecia Beth Moore
• Recall: 19,990 / (19,990 + 10) = 0.9995
• Precision: 19,990 / 1,000,000 ≈ 0.02
• Boolean logic queries using OR tend to produce low precision but high recall searches
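The arithmetic for the two example queries can be checked with a small helper. For the OR query, the total of 20,000 relevant documents is an assumption: the 19,990 relevant articles that were retrieved plus the 10 that mention neither spelling.

```python
def precision_recall(retrieved, relevant, relevant_retrieved):
    """Return (precision, recall) from three document counts."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    return precision, recall

# AND query: Europe% AND runner
and_precision, and_recall = precision_recall(
    retrieved=10_000, relevant=20_000, relevant_retrieved=7_000)

# OR query: Pink OR P!nk (relevant total assumed to be 19,990 + 10)
or_precision, or_recall = precision_recall(
    retrieved=1_000_000, relevant=19_990 + 10, relevant_retrieved=19_990)
```

The AND query trades recall for precision; the OR query does the opposite.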

Maximizing Information Retrieval

• Traditional queries can be enhanced by adding proximity
• Most search engines today use the following to improve accuracy:
  • Implicit proximity
  • Lemmatization
  • Synonym substitution
  • Spelling correction
  • Contextual correction

The Semantic Web

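Tim Berners-Lee's Vision of the Semantic Web

Layers of the stack, starting from the bottom:
• Unicode and URI
• XML + XML Schema + Namespaces (self-describing documents)
• RDF + RDF Schema (data)
• Ontology vocabulary (data)
• Logic (rules)
• Proof
• Trust
• Digital Signature (alongside the layers)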

Unicode / URI

• Build on standards
• Data is identifiable and uniformly encoded
• Unicode: standard character set
• URI (uniform resource identifier): identifies a web resource (e.g., a URL)

XML Layer

XML sample (a ship order document):
<shiporder orderid="889923" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="shiporder.xsd">
Element values inside the order: John Smith; Ola Nordmann; Langgt; Stavanger; Norway; Empire Burlesque; Special Edition.

XML Layer

• Data is described in a uniform syntax
  • XML tags/categorizes data
  • XML Schema enforces syntax rules
  • Namespaces allow domain-specific definitions
• XML provides syntax for content structure within documents
• The XML vision was to mark up documents of arbitrary structure, but…
  • It really only caught on as a data interchange format and for configuration files
  • It was once the primary return type for web services, but is being rapidly replaced by JSON
  • Unwieldy as an authoring standard; unstructured data tended to remain unstructured.
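To illustrate how tags categorize data for a consumer that already knows the format, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element names (orderperson, shipto, name, country) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical ship order; the element names are assumed for this sketch.
DOC = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <country>Norway</country>
  </shipto>
</shiporder>"""

root = ET.fromstring(DOC)
order_id = root.get("orderid")             # attribute on the root element
person = root.findtext("orderperson")      # text of a direct child
country = root.findtext("shipto/country")  # text of a nested element
```

The parser recovers structure, but nothing in the document says what "shipto" means; both sides must already agree on that.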

XML Layer

• Even as early as 2000, XML's importance was being questioned
• XML may be relegated to quasi-structural text, such as scientific documents

There is no way to recognize a semantic unit from a particular domain because XML aims at document structure and imposes no common interpretation of the data contained in the document. [1]

XML is useful for data interchange between applications that both know what the data is, but not for situations where new communication partners are frequently added. [1]

This true “core” of science is not textual, in the sense of normal grammar expressed in a certain language, but is in itself information expressed in another vocabulary. [2]

[1] Decker, Stefan, et al. The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, Sept/Oct 2000.
[2] Marchiori, Massimo. Accessing Scientific Information on the Web. International Journal of Computer and Electronics Research. Volume 3, Issue 6, December 2014.

RDF Layer

• Resource Description Framework
• Subject - predicate - object triples

RDF/XML sample: an rdf:RDF document that declares the rdf and cd namespaces and contains two rdf:Description resources. The first carries the values Disturbed, USA and Reprise; the second, whose rdf:about value ends in "Jukebox", carries Bruno Mars, USA and Atlantic Records.

Note: This is RDF in an XML syntax. There is also a way to write RDF using JSON syntax.

RDF Layer

• RDF Schema includes standard properties such as
  • rdfs:subClassOf - the subject is a subclass of a class
  • rdfs:subPropertyOf - the subject is a subproperty of a property
  • rdfs:domain - a domain of the subject property
  • rdfs:range - a range of the subject property
  • rdfs:label - a human-readable name for the subject
  • rdfs:comment - a description of the subject resource
  • rdfs:member - a member of the subject resource
  • rdfs:seeAlso - further information about the subject resource
  • rdfs:isDefinedBy - the definition of the subject resource

RDF Layer

RDF triples are often represented in a graph structure.
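One simple way to picture triples as a graph is an adjacency list keyed by subject. The sketch below uses plain tuples rather than an RDF library; the sample data are facts about the film Avatar, and the ex: identifiers are made up for illustration.

```python
from collections import defaultdict

# (subject, predicate, object); the ex: names are hypothetical.
TRIPLES = [
    ("ex:Avatar", "ex:director", "ex:James_Cameron"),
    ("ex:Avatar", "ex:genre", "ex:Science_fiction"),
    ("ex:James_Cameron", "ex:birthDate", "1954-08-16"),
]

def to_graph(triples):
    """Adjacency list: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if subject in (None, t[0])
            and predicate in (None, t[1])
            and obj in (None, t[2])]

graph = to_graph(TRIPLES)
directed_by = match(TRIPLES, predicate="ex:director")
```

Subjects and objects become nodes and predicates become labelled edges, so an object of one triple can be the subject of another.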

RDF Layer: HTML Which Is More Easily Transformed into RDF

• Add machine-readable information to Web pages
• Example page content: Avatar; Director: James Cameron (born August 16, 1954); Science fiction; Trailer

Ontology Layer

Ontology: a formal naming and definition of the types, properties, and interrelationships of the entities that exist for a particular domain of discourse.

• SPARQL: query language of the Semantic Web. It lets us:
  • Pull values from structured and semi-structured data
  • Explore data by querying unknown relationships
  • Perform complex joins of disparate databases in a single, simple query
  • Transform RDF data from one vocabulary to another
• SPARQL stands for SPARQL Protocol And RDF Query Language (a recursive acronym!)
• SPARQL 1.0: 2008
• SPARQL 1.1: 2013

Ontology Layer

• FOAF ("friend of a friend") is an ontology describing persons, their activities and their relationships to other persons and objects
• Example RDF triple: foaf:friendOf rdfs:subPropertyOf foaf:knows

SPARQL query (IRIs omitted):
PREFIX foaf:
SELECT ?name
WHERE {
  foaf:knows [ foaf:name ?name ].
  SERVICE { foaf:name ?name }
}

This SPARQL query uses the foaf ontology to query RDF data to find all friends of Bob. OWL (Web Ontology Language) is a popular language for writing ontologies.
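The effect of a triple such as foaf:friendOf rdfs:subPropertyOf foaf:knows can be sketched without an RDF library: every statement made with the subproperty also holds for the superproperty. The identifiers ex:Bob and ex:Alice are hypothetical, and this is a toy closure, not a full RDFS reasoner.

```python
def infer_subproperties(triples, sub_property_of):
    """Add (s, super, o) for every (s, sub, o) where sub is declared
    a subproperty of super; repeat until nothing new is found."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            for sup in sub_property_of.get(p, ()):
                if (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred

SUB_PROPERTY_OF = {"foaf:friendOf": ["foaf:knows"]}
data = {("ex:Bob", "foaf:friendOf", "ex:Alice")}  # hypothetical statement
closure = infer_subproperties(data, SUB_PROPERTY_OF)
```

A query for whom Bob knows then also finds Alice, even though only the friendOf statement was asserted.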

Ontology Demo

• Protégé
  • Open source ontology development environment developed by Stanford
  • Demonstration: Fleetwood Mac (partial ontology). Myers, Jack. Building and Utilizing Ontologies for Knowledge Representation.
  • Protégé will infer certain facts from other facts.
• Excerpt from the Fleetwood Mac ontology: (the album) "Bare Trees" is of the Folk Rock genre

Ontology Layer

• First-order logic rules can enable deductions of new facts.
• Deduce Charlie Sheen's father

Facts:
  Parent(Martin_Sheen, Charlie_Sheen)
  Male(Martin_Sheen)
Rule:
  ∀x,y (Parent(x,y) ∧ Male(x) ⇒ Father(x,y))
Deduction:
  Father(Martin_Sheen, Charlie_Sheen)
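The deduction above can be sketched by applying the single rule to sets of facts. This is a hand-written illustration of that one rule, not a general first-order reasoner; the function name is made up.

```python
def deduce_fathers(parents, males):
    """Apply Parent(x, y) AND Male(x) => Father(x, y) to the given facts."""
    return {(x, y) for (x, y) in parents if x in males}

parents = {("Martin_Sheen", "Charlie_Sheen")}  # Parent(x, y) facts
males = {"Martin_Sheen"}                        # Male(x) facts
fathers = deduce_fathers(parents, males)
```

The new Father fact was never stated; it follows from the two existing facts and the rule.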

Ontology Layer

• But dealing with incorrect or incomplete information is problematic
• Inconsistencies can occur when we get to know more about a domain
  • Birds fly: ∀x: bird(x) ⇒ fly(x)
  • Penguins are birds: ∀x: penguin(x) ⇒ bird(x)
  • Joe is a penguin: penguin(Joe)
  • Deduction: fly(Joe)
  • New fact: Penguins don't fly

See: On the Expressiveness of the Languages for the Semantic Web - Making a Case for ‘A Little More’
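A toy sketch of how the inconsistency arises: the rules are chained forward, and then the new fact is checked against what was derived. Representing "penguins don't fly" as a (premise, forbidden) pair is an assumption made for this illustration.

```python
RULES = [("penguin", "bird"), ("bird", "fly")]  # premise(x) => conclusion(x)

def closure(facts, rules):
    """Forward-chain unary rules over (predicate, individual) facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, who in list(facts):
                if pred == premise and (conclusion, who) not in facts:
                    facts.add((conclusion, who))
                    changed = True
    return facts

def contradictions(facts, negative_rules):
    """negative_rules holds (premise, forbidden) pairs, e.g. penguins don't fly.
    Return derived facts that a negative rule forbids."""
    return {(forbidden, who)
            for premise, forbidden in negative_rules
            for pred, who in facts
            if pred == premise and (forbidden, who) in facts}

derived = closure({("penguin", "Joe")}, RULES)
clashes = contradictions(derived, [("penguin", "fly")])
```

The rule base derives fly(Joe) before the new fact arrives, so adding "penguins don't fly" leaves the knowledge base inconsistent.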

Final Layers

• The Logic layer enables the writing of rules
• The Proof layer executes the rules and, together with the Trust layer mechanism, evaluates for applications whether to trust the given proof or not
• Digital Signatures authenticate document accuracy

Final Layers

• The Proof layer determines whether an answer found in the Semantic Web is correct
• It is based on:
  • How it has been derived, i.e., the logic
  • On which data, i.e., the data sources
  • By whom, i.e., the chain of providers of data needs to be considered, too! (Trust)
• Network Security at play!

Semantic Search

Improving Information Search with Semantics

• Traditional information retrieval is keyword-based.
  • No interpretation of the "meaning" of the information.
• Problems of this basic approach:
  • Polysemy (jaguar the cat vs. jaguar the car)
  • Synonyms (movie vs. film)
  • Missing information about subclass or part-of relation (watersport vs. diving, surfing, ...)
  • Relations between search terms (“books about recommender systems” vs. “systems that recommend books”)
• This is where Semantic Web technologies can help!

Semantic Search is a process of information access, where one or several activities can be supported by a set of functionalities enabled by semantic technologies.

• Search engine functionalities: query construction, query processing, result presentation
• Semantic technologies: knowledge extraction, knowledge representation, reasoning

Semantic Search

• As we have seen, entries in the dictionary are keywords
• We can add an ontology and map keywords to elements in the ontology.
• The ontology is used to disambiguate the query, e.g., to select the right word sense, and to expand the query.

Possible Ways for Query Expansion (1)

• The query can be expanded with words/concepts from a thesaurus, e.g., Princeton University's WordNet and Medical Subject Headings (MeSH)
• Expand the query with words/concepts from a domain ontology, e.g., MeSH:
  • Between organ & physiological process: Bone and Bones see related Osteogenesis
  • Between organ & drug acting on it: Bronchi see related Bronchoconstrictor Agents
  • Between organ & procedure: Bile Ducts see related Cholangiography
  • Linguistic roots: Brain consider also terms at CEREBR- and ENCEPHAL-

WordNet Example

Possible Ways for Query Expansion (2)

• More specific concepts and/or more general concepts
• Related concepts
• Synonyms and antonyms
• Meronym: the name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
  • "nose" is a meronym of "face"
• Entailment: a verb X entails Y if X cannot be done unless Y is, or has been, done.
  • "massage" entails "touch"
• Holonym: the name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
  • "tree" is a holonym of "bark"
• Hypernym: the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.
  • "bird" is a hypernym of "seagull"
• Hyponym: the specific term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.
  • "seagull" is a hyponym of "bird"
• Troponym: a verb expressing a specific manner elaboration of another verb. X is a troponym of Y if to X is to Y in some manner.
  • "stroll" is a troponym of "walk"
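Query expansion along such relations can be sketched with a small hand-made thesaurus. The dictionary below is a hypothetical stand-in for a resource like WordNet or MeSH (it is not their API); its entries reuse word pairs from this deck.

```python
# Hypothetical mini-thesaurus; a real system would consult WordNet, MeSH, etc.
THESAURUS = {
    "movie": {"synonyms": ["film"], "hyponyms": []},
    "watersport": {"synonyms": [], "hyponyms": ["diving", "surfing"]},
    "bird": {"synonyms": [], "hyponyms": ["seagull"]},
}

def expand_query(terms, thesaurus, relations=("synonyms", "hyponyms")):
    """Turn each query term into an OR-group of the term plus its
    related words along the chosen relations."""
    expanded = []
    for term in terms:
        group = [term]
        entry = thesaurus.get(term, {})
        for relation in relations:
            group.extend(entry.get(relation, []))
        expanded.append(group)
    return expanded

groups = expand_query(["watersport", "movie"], THESAURUS)
```

Each inner list would be searched as a disjunction, and the groups combined as a conjunction, so a search for watersport also finds documents that only mention diving or surfing.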

A Document as Concepts and Relations

A document doesn’t contain keywords, it discusses concepts that are in a relation with the document. Concepts serve as description of the document for some known properties, e.g., type of document, author…

Example (a recipe document and its relations):
• contains: angel food cake mix, cherry pie filling, almond extract, sliced almonds
• has author: Kristina_Vanni
• instance of: dessert recipe

Keyword Indices

• Use one index for every relation (field) we are interested in
• Possible relations are specified in an ontology
• Relations may depend on the type of document (most applications only support a specific class from a small domain, e.g., scientific documents, recipes)

Index layout: a dictionary of relations and terms ("contains": almond, basil; "has author": basil, vanni), each entry pointing to its postings, sorted by Doc ID.
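A minimal sketch of one index per relation, keyed first by relation and then by term. The three documents and their Doc IDs are hypothetical; the relation names and terms follow the layout above.

```python
from collections import defaultdict

def build_relation_index(documents):
    """Build {relation: {term: sorted doc IDs}} from
    {doc_id: {relation: [terms]}}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, relations in documents.items():
        for relation, terms in relations.items():
            for term in terms:
                index[relation][term].add(doc_id)
    return {relation: {term: sorted(ids) for term, ids in terms.items()}
            for relation, terms in index.items()}

# Hypothetical recipe documents.
DOCS = {
    1: {"contains": ["almond", "cherry"], "has author": ["vanni"]},
    2: {"contains": ["basil"], "has author": ["basil"]},
    3: {"contains": ["basil", "almond"], "has author": ["vanni"]},
}
index = build_relation_index(DOCS)
```

Keeping the relations apart lets a query distinguish recipes that contain basil from recipes written by an author named Basil.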

Searching for Concepts and Relations

• The user decides what type of results (class in the ontology) they are looking for
• Properties of this class in the ontology can be used to narrow down results (“faceted search”)
• Faceted searches allow users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways
• The possible values of a facet can be literals or other concepts, and they can be restricted by the user
• The ontology hierarchy relations can be used for inference (a search for companies includes all subclasses/instances)
• As a result, documents from the class that match the property restrictions are returned
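The combination of class inference and facet filtering can be sketched as follows. The class hierarchy, the recipes and the facet name "diet" are all invented for illustration.

```python
def descendants(cls, subclass_of):
    """cls plus every class that is transitively a subclass of it."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def faceted_search(items, cls, subclass_of, facets):
    """Names of items in cls (or any subclass) that match every facet value."""
    wanted = descendants(cls, subclass_of)
    return [item["name"] for item in items
            if item["class"] in wanted
            and all(item.get(key) == value for key, value in facets.items())]

# Hypothetical ontology fragment and documents.
SUBCLASS_OF = {"cake": "dessert", "dessert": "recipe", "soup": "recipe"}
RECIPES = [
    {"name": "cherry cake", "class": "cake", "diet": "vegetarian"},
    {"name": "tiramisu", "class": "dessert", "diet": "vegetarian"},
    {"name": "chicken soup", "class": "soup", "diet": "none"},
]
vegetarian_desserts = faceted_search(
    RECIPES, "dessert", SUBCLASS_OF, {"diet": "vegetarian"})
```

Searching for the class dessert also returns the cake, because the hierarchy says every cake is a dessert.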

Example: Yummly

• Search for cherry cake
• Refine:
  • Use facet for prep
  • Use facets for taste
  • Use facet for diet

Google Knowledge Graph

The Google Knowledge Graph is a system that Google launched in May 2012 that understands facts about people, places and things and how these entities are all connected.

Broccoli (1)

Broccoli (2)

After selecting an Instance, Relations are displayed.