From Information Search to Semantic Search


1 From Information Search to Semantic Search
Content derived from many sources, most notably the Semantic Search presentation by Wiltrud Kessler, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2014, and informationretrieval.org from Stanford University.

2 Information Retrieval

3 Information Retrieval
• You should already be very familiar with information retrieval from performing Internet searches with Google, Bing, Yahoo, DuckDuckGo, etc.
• Example: Shakespeare search at http://www.rhymezone.com/shakespeare/
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). http://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html

4 Consider Searching the Collected Works of William Shakespeare
• Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?
• The grep technique
  • Based on the UNIX command grep, which performs a linear scan of a collection of documents
  • Efficient and effective for modest collections, and allows the use of wildcards
  • For Boolean queries, perform set operations: (Brutus ∧ Caesar) ∧ ¬Calpurnia
• Limitations
  • Slow for very large collections
  • No relevance ranking
  • No use of distance or semantic context, e.g., "Brutus" near "Caesar", where "near" could be defined as within 5 words, in the same sentence, etc.
Grep for the word "peanut" retrieves nothing. Grep with a regular expression that ends in the character class [a-z] finds "peanut" followed by any lowercase letter.

5 Solution?
• Build an index that relates terms to documents
• This is known as an incidence matrix

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth   ...
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
...

6 Using an Incidence Matrix
Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?
• This search is resolved via a bit-wise AND over the rows of all search terms, using the complement of the row for a negated term
  Brutus:                    110100
  Caesar:                    110111
  complement of Calpurnia:   101111
  conjunction:               100100
• Two hits:
  • Antony and Cleopatra
  • Hamlet
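The bit-wise evaluation on this slide can be sketched in a few lines of Python. This is only an illustration: the names PLAYS, INCIDENCE, MASK and plays_for are assumptions made for the example, and the bit vectors are copied from the matrix rows (leftmost bit = first play).

```python
# Plays in the column order of the incidence matrix (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit vector per term, copied from the matrix rows on the slide.
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Mask that keeps the complement limited to the six play columns.
MASK = (1 << len(PLAYS)) - 1


def plays_for(bits):
    """Translate a result bit vector back into play titles."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if bits & (1 << (width - 1 - i))]


# Brutus AND Caesar AND NOT Calpurnia
result = (INCIDENCE["Brutus"]
          & INCIDENCE["Caesar"]
          & (~INCIDENCE["Calpurnia"] & MASK))
hits = plays_for(result)
print(format(result, "06b"), hits)
```

Python integers are unbounded, so the complement has to be masked to six bits; a fixed-width bit set would not need that step.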

7 Beyond Incidence Matrices
• This technique also doesn't scale well (197 works, ~32,000 words)
• Solution?
  • These matrices tend to be sparse (many cells contain 0)
  • Thus, don't index the absence of words.

8 Inverted Index
Which plays of Shakespeare contain the words "Brutus" and "Caesar", but not "Calpurnia"?
• A dictionary stores each term and has a pointer to its postings list

Doc IDs: 1 Antony and Cleopatra, 2 Julius Caesar, 3 The Tempest, 4 Hamlet, 5 Othello, 6 Macbeth

Term (dictionary of terms)   List of documents (postings, sorted by doc ID)
Brutus                       → 1 2 4 11 31 45 173 174
Caesar                       → 1 2 4 5 6 16 57 132 ...
Calpurnia                    → 2 31 54 101

• In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk.
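As a rough illustration of how a dictionary of terms and its postings lists might be built, here is a minimal sketch. The function name build_inverted_index and the three toy documents are assumptions for the example, not material from the slides, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of the doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Only terms that actually occur get an entry: absence is never indexed.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical toy documents, keyed by doc ID.
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia and brutus",
    3: "the tempest",
}
index = build_inverted_index(docs)
print(index["brutus"], index["calpurnia"])
```

A term such as "peanut" simply has no dictionary entry, which is how the inverted index avoids the sparse zero cells of the incidence matrix.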

9 Conjunctive Search Algorithm
INTERSECT(⟨t1, ..., tn⟩)
1. terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
2. result ← postings(first(terms))
3. terms ← rest(terms)
4. while terms ≠ NIL and result ≠ NIL
5.   do result ← INTERSECT(result, postings(first(terms)))
6.      terms ← rest(terms)
7. return result

10 Tracing the Algorithm
• This algorithm attempts to maximize efficiency by processing terms in order of increasing frequency, so the first intersection always involves the two rarest terms in the query

Brutus ∧ Calpurnia
Step   terms                       result           Comments
1      Calpurnia, Brutus
2                                  2, 31, 54, 101
3      Brutus                      2, 31, 54, 101
4      Brutus                      2, 31, 54, 101
5      Brutus                      2, 31            Intersection of result and postings(Brutus)
6      NIL                         2, 31
4      NIL                         2, 31
7                                  2, 31            Return

Brutus ∧ Caesar ∧ Calpurnia
Step   terms                       result           Comments
1      Calpurnia, Brutus, Caesar
2                                  2, 31, 54, 101
3      Brutus, Caesar              2, 31, 54, 101
4      Brutus, Caesar              2, 31, 54, 101
5      Brutus, Caesar              2, 31            Intersection of result and postings(Brutus)
6      Caesar                      2, 31
4      Caesar                      2, 31
5      Caesar                      2                Intersection of result and postings(Caesar)
6      NIL                         2
4      NIL                         2
7                                  2                Return
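The conjunctive search algorithm traced above can be sketched in Python as follows. The names POSTINGS, intersect_two and intersect are illustrative assumptions. The Caesar list contains only the entries visible on the slide, which elides the rest, so list length stands in for term frequency here.

```python
# Postings lists from the slides; the slide elides further Caesar entries.
POSTINGS = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def intersect_two(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect(terms, postings=POSTINGS):
    """Intersect the postings of all terms, rarest term first."""
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    if not ordered:
        return []
    result = postings[ordered[0]]
    for term in ordered[1:]:
        if not result:  # nothing left to match, so stop early
            break
        result = intersect_two(result, postings[term])
    return result


print(intersect(["Brutus", "Calpurnia"]))
print(intersect(["Brutus", "Caesar", "Calpurnia"]))
```

Starting with the shortest list keeps every intermediate result no larger than the rarest term's postings, which is the point of step 1 in the pseudocode.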

11 Improving Performance of the Algorithm
INTERSECT(⟨t1, ..., tn⟩)
1. terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
• To fulfill step 1, the server must access disk to get frequencies
• Unless we add the frequencies to the dictionary

Term (dictionary, in memory)   Frequency   Postings (on disk, sorted by doc ID)
Brutus                         8           → 1 2 4 11 31 45 173 174
Caesar                         10          → 1 2 4 5 6 16 57 132 ...
Calpurnia                      4           → 2 31 54 101

12 Conjunctive Search Algorithm
(Republican ∨ Democrat) ∧ (primary ∨ caucus) ∧ (Iowa ∨ Delaware)
• This query can also follow the Conjunctive Search Algorithm by estimating the frequencies of the disjunctive pairs from the frequencies held in memory (not from disk)
  • freq(Republican ∨ Democrat) ≈ freq(Republican) + freq(Democrat)
• These disjunctive frequencies are said to be conservative. Why?
  • They assume no overlap between the disjuncts, so each estimate is an upper bound on the true frequency.
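A small sketch of how the conservative estimates could order the three disjunctions before intersecting them. The frequencies in FREQ are made up for illustration (the slides give none), and the helper names estimate_or_size and order_conjuncts are likewise assumptions.

```python
# Hypothetical document frequencies, as they might sit in the in-memory dictionary.
FREQ = {
    "Republican": 120, "Democrat": 150,
    "primary": 40, "caucus": 15,
    "Iowa": 60, "Delaware": 25,
}


def estimate_or_size(terms, freq=FREQ):
    """Conservative size estimate of a disjunction: assume no overlap, so sum."""
    return sum(freq[t] for t in terms)


def order_conjuncts(groups, freq=FREQ):
    """Order the OR-groups by increasing estimated size, smallest first."""
    return sorted(groups, key=lambda group: estimate_or_size(group, freq))


query = [("Republican", "Democrat"), ("primary", "caucus"), ("Iowa", "Delaware")]
ordered = order_conjuncts(query)
print(ordered)
```

With these made-up numbers the estimates are 270, 55 and 85, so (primary ∨ caucus) would be processed first.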

13 Next Generation Algorithm: Faster Postings List Intersection via Skip Pointers
• Previously, we walked through each DocID in the postings lists
• An improved postings data structure can add "skip points" to various elements on the list.
  • Each skip reference should also be a skip point (unless you are near the end of the list)
  • DocID 16 in the candle list is both a skip reference (from DocID 2) and a skip point
• Look at the algorithm (next slide) and explain why this is a good idea
• Design questions
  • Where to place skip pointers
  • How to do efficient merging using skip pointers

candle  → 2 4 8 16 19 23 28 43 111 113      (skip pointers: 2 → 16 → 28 → 113)
butcher → 1 2 3 5 8 41 51 60 71 98 140      (skip pointers: 1 → 5 → 51 → 98)

14 Conjunctive Search Algorithm With Skips
INTERSECTWITHSKIPS(p1, p2)
 1. answer ← ⟨ ⟩
 2. while p1 ≠ NIL and p2 ≠ NIL
 3. do if docID(p1) = docID(p2)
 4.      then ADD(answer, docID(p1))
 5.           p1 ← next(p1)
 6.           p2 ← next(p2)
 7.      else if docID(p1) < docID(p2)
 8.             then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
 9.                    then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
10.                         do p1 ← skip(p1)
11.                    else p1 ← next(p1)
12.             else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
13.                    then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
14.                         do p2 ← skip(p2)
15.                    else p2 ← next(p2)
16. return answer

15 Next Generation Algorithm: Faster Postings List Intersection via Skip Pointers
• Using this algorithm, we were able to completely skip over documents 19, 23, 60 and 71
• Skip pointers only help for Boolean AND queries. Why?
• A full trace of this algorithm for candle and butcher is available on the web site
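The skip-pointer intersection can be sketched in Python using list indices in place of linked-list pointers. Two things here are assumptions rather than content from the slides: the helper names (build_skips, advance, intersect_with_skips) and the placement rule, which spaces skips floor(sqrt(n)) entries apart. For these two lists that rule happens to reproduce the skip targets 16, 28, 113 and 5, 51, 98.

```python
import math

CANDLE = [2, 4, 8, 16, 19, 23, 28, 43, 111, 113]
BUTCHER = [1, 2, 3, 5, 8, 41, 51, 60, 71, 98, 140]


def build_skips(postings):
    """Return {index: target index} for evenly spaced skip pointers."""
    step = math.isqrt(len(postings))
    if step < 2:  # list too short for skips to be useful
        return {}
    return {i: i + step for i in range(0, len(postings) - step, step)}


def advance(postings, skips, pos, other_doc, skipped):
    """Follow skips while the target does not overshoot other_doc, else step by one."""
    if pos in skips and postings[skips[pos]] <= other_doc:
        while pos in skips and postings[skips[pos]] <= other_doc:
            skipped.extend(postings[pos + 1:skips[pos]])  # entries jumped over
            pos = skips[pos]
        return pos
    return pos + 1


def intersect_with_skips(p1, p2):
    """Return (doc IDs present in both lists, doc IDs never examined)."""
    skips1, skips2 = build_skips(p1), build_skips(p2)
    answer, skipped = [], []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, skips1, i, p2[j], skipped)
        else:
            j = advance(p2, skips2, j, p1[i], skipped)
    return answer, skipped


answer, skipped = intersect_with_skips(CANDLE, BUTCHER)
print(answer, skipped)
```

A skip is only taken when its target does not overshoot the other list's current DocID, so no match can be jumped over.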

16 Precision and Recall
• Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents: 7,000 / 20,000 = 0.35
• Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search: 7,000 / 10,000 = 0.70
• Example: Search for European runners: Europe% AND runner (% is a wildcard, using the SQL standard)
  • Query result set: 10,000
  • Relevant documents: 20,000
  • Intersection: 7,000
• Sample documents:
  • "A French Olympic runner has developed a reputation for the way he marks the end of his races"
  • "Socialist front-runner Martin Schulz launched his European campaign"
  • "The Central Athletics Club runner took part in the 3000m at the European Team Championships."
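The two ratios on this slide can be checked with a short sketch. The function name precision_recall and its parameter names are assumptions for the example; the counts are the ones given on the slide.

```python
def precision_recall(retrieved, relevant, relevant_retrieved):
    """Return (precision, recall) from the three set sizes."""
    precision = relevant_retrieved / retrieved  # share of the result set that is relevant
    recall = relevant_retrieved / relevant      # share of relevant documents that were found
    return precision, recall


# Europe% AND runner: 10,000 retrieved, 20,000 relevant, 7,000 in both sets.
precision, recall = precision_recall(10_000, 20_000, 7_000)
print(f"precision={precision:.2f} recall={recall:.2f}")
```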

17 Precision, Recall and Limitations of Boolean Searches
• Example: Search for European runners: Europe% AND runner
  • Query result set: 10,000
  • Relevant documents: 20,000
  • Intersection: 7,000
• Recall: 0.35
• Precision: 0.70
Boolean logic queries using AND tend to produce high-precision but low-recall searches.

18 Precision, Recall and Limitations of Boolean Searches
• Example: Search for P!nk: Pink OR P!nk
  • Query result set: 1,000,000
  • Intersection: 19,990
  • Not retrieved: 10 articles that do NOT mention P!nk or Pink, about either the remake of the song "Get This Party Started" on the album Punk Goes Pop (2002) by Stretch Armstrong, or Alecia Beth Moore
• Recall: 0.9995
• Precision: 0.0200
Boolean logic queries using OR tend to produce low-precision but high-recall searches.
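The figures on this slide can be worked through as follows, on the assumption that the 10 articles that mention neither spelling are the only relevant documents the query misses, which gives 19,990 + 10 = 20,000 relevant documents in total:

```latex
\[
\text{Recall} = \frac{19{,}990}{19{,}990 + 10} = \frac{19{,}990}{20{,}000} = 0.9995
\qquad
\text{Precision} = \frac{19{,}990}{1{,}000{,}000} = 0.01999 \approx 0.0200
\]
```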

19 Maximizing Information Retrieval
• Traditional queries can be enhanced by adding proximity
• To improve accuracy, most search engines today use
  • Implicit proximity
  • Lemmatization
  • Synonym substitution
  • Spelling correction
  • Contextual correction

20 The Semantic Web

21 Tim Berners-Lee's Vision of the Semantic Web
Layers, from bottom to top:
• Unicode, URI
• XML + XML Schema + Namespaces (self-describing documents)
• RDF + RDF Schema (data)
• Ontology vocabulary
• Logic (rules)
• Proof
• Trust
• Digital Signature

22 Unicode / URI
• Build on standards
• Data is identifiable and uniformly encoded
  • Unicode: standard character set
  • URI (uniform resource identifier): identifies a web resource (e.g., a URL)

23 XML Layer
XML + XML Schema + Namespaces: self-describing documents

<shiporder orderid="889923" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  ...

24 XML Layer
• Data is described in a uniform syntax
  • XML tags/categorizes data
  • XML Schema enforces syntax rules
  • Namespaces allow domain-specific definitions
• XML provides syntax for content structure within documents
• The XML vision was to mark up documents of arbitrary structure, but:
  • It really only caught on as a data interchange format and for configuration files
  • It was once the primary return type for web services, but is being rapidly replaced by JSON
  • It is unwieldy as an authoring standard; unstructured data tended to remain unstructured.

25 XML Layer
• Even as early as 2000, XML's importance was being questioned
  • There is no way to recognize a semantic unit from a particular domain because XML aims at document structure and imposes no common interpretation of the data contained in the document. [1]
  • XML is useful for data interchange between applications that both know what the data is, but not for situations where new communication partners are frequently added. [1]
• XML may be relegated to quasi-structural text, such as scientific documents
  • This true "core" of science is not textual, in the sense of normal grammar expressed in a certain language, but is in itself information expressed in another vocabulary. [2]
[1] Decker, Stefan, et al. The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, Sept/Oct 2000.
[2] Marchiori, Massimo. Accessing Scientific Information on the Web. International Journal of Computer and Electronics Research, Volume 3, Issue 6, December 2014.

26 RDF Layer
• Resource Description Framework
• Subject - predicate - object triples

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Importalized">
    <cd:artist>Disturbed</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Reprise</cd:company>
    <cd:price>13.99</cd:price>
    <cd:year>2015</cd:year>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Unauthorized Jukebox">
    <cd:artist>Bruno Mars</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Atlantic Records</cd:company>
    <cd:price>13.99</cd:price>
    <cd:year>2012</cd:year>
  </rdf:Description>
</rdf:RDF>

Note: This is RDF in an XML syntax. There is also a way to write RDF using JSON syntax.

27 RDF Layer
• RDF Schema includes standard properties such as
  • rdfs:subClassOf - the subject is a subclass of a class
  • rdfs:subPropertyOf - the subject is a subproperty of a property
  • rdfs:domain - a domain of the subject property
  • rdfs:range - a range of the subject property
  • rdfs:label - a human-readable name for the subject
  • rdfs:comment - a description of the subject resource
  • rdfs:member - a member of the subject resource
  • rdfs:seeAlso - further information about the subject resource
  • rdfs:isDefinedBy - the definition of the subject resource

28 RDF Layer
RDF triples are often represented in a graph structure.
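To make the subject - predicate - object idea concrete, here is a minimal sketch that stores a few triples as Python tuples and matches them against a pattern. The resource URIs and values come from the CD example earlier in the deck. The property names cd:artist and cd:year are assumptions (the property tags did not survive in the transcript), as is the helper name match. A real application would normally use an RDF library rather than plain tuples.

```python
CD = "http://www.recshop.fake/cd#"  # namespace from the CD example
ALBUM_1 = "http://www.recshop.fake/cd/Importalized"
ALBUM_2 = "http://www.recshop.fake/cd/Unauthorized Jukebox"

# Each triple is an edge in the graph: subject --predicate--> object.
# The property names "artist" and "year" are assumed for illustration.
TRIPLES = [
    (ALBUM_1, CD + "artist", "Disturbed"),
    (ALBUM_1, CD + "year", "2015"),
    (ALBUM_2, CD + "artist", "Bruno Mars"),
    (ALBUM_2, CD + "year", "2012"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]


artists = [o for _, _, o in match(TRIPLES, predicate=CD + "artist")]
about_album_1 = match(TRIPLES, subject=ALBUM_1)
print(artists, len(about_album_1))
```

Matching with wildcards like this is, in spirit, what a single triple pattern in a SPARQL query does.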

29 RDF Layer: HTML Which Is More Easily Transformed into RDF
• Add machine-readable information to Web pages

Avatar
Director: James Cameron (born August 16, 1954)
Science fiction
Trailer

30 Ontology Layer
Ontology: a formal naming and definition of the types, properties, and interrelationships of the entities that exist for a particular domain of discourse.
• SPARQL: the query language of the Semantic Web. It lets us:
  • Pull values from structured and semi-structured data
  • Explore data by querying unknown relationships
  • Perform complex joins of disparate databases in a single, simple query
  • Transform RDF data from one vocabulary to another
• SPARQL stands for SPARQL Protocol And RDF Query Language (a recursive acronym!)
  • SPARQL 1.0: 2008
  • SPARQL 1.1: 2013

31 Ontology Layer
• foaf ("friend of a friend") is an ontology describing persons, their activities and their relationships to other persons and objects
• Example RDF triple
  • foaf:friendOf rdfs:subPropertyOf foaf:knows

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  <...> foaf:knows [ foaf:name ?name ] .
  SERVICE <...> { <...> foaf:name ?name }
}

This SPARQL query uses the foaf ontology to query RDF data to find all friends of Bob. OWL (the Web Ontology Language) is a popular language for writing ontologies.

32 Ontology Demo
• Protégé
  • Open source ontology development environment developed by Stanford
  • Demonstration: Fleetwood Mac (partial ontology). Myers, Jack. Building and Utilizing Ontologies for Knowledge Representation.
• Protégé will infer certain facts from other facts.
  • Excerpt from the Fleetwood Mac ontology: (the album) "Bare Trees" is of the Folk Rock genre

33 Ontology Layer
• First-order logic rules can enable deductions of new facts.
• Deduce Charlie Sheen's father
  • Facts: Parent(Martin_Sheen, Charlie_Sheen), Male(Martin_Sheen)
  • Rule: ∀x,y (Parent(x,y) ∧ Male(x) ⇒ Father(x,y))
  • Deduced fact: Father(Martin_Sheen, Charlie_Sheen)
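The deduction on this slide can be mimicked with one step of forward chaining. This is a toy sketch of a single hard-coded rule, not how a Semantic Web reasoner is actually implemented. The tuple encoding of facts and the function name apply_father_rule are assumptions for the example.

```python
# Facts are tuples: (predicate, argument, ...), taken from the slide.
facts = {
    ("Parent", "Martin_Sheen", "Charlie_Sheen"),
    ("Male", "Martin_Sheen"),
}


def apply_father_rule(facts):
    """Apply: for all x, y: Parent(x, y) and Male(x) implies Father(x, y)."""
    derived = set(facts)
    for fact in facts:
        if fact[0] == "Parent":
            _, x, y = fact
            if ("Male", x) in facts:
                derived.add(("Father", x, y))
    return derived


closure = apply_father_rule(facts)
print(("Father", "Martin_Sheen", "Charlie_Sheen") in closure)
```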

34 Ontology Layer
• But dealing with incorrect or incomplete information is problematic
• Inconsistencies can occur when we get to know more about a domain
  • Birds fly: ∀x: bird(x) ⇒ fly(x)
  • Penguins are birds: ∀x: penguin(x) ⇒ bird(x)
  • Joe is a penguin: penguin(Joe)
  • Deduced: fly(Joe)
  • New fact: Penguins don't fly
See: On the Expressiveness of the Languages for the Semantic Web - Making a Case for 'A Little More'

35 Final Layers
• The Logic layer enables the writing of rules
• The Proof layer executes the rules and, together with the Trust layer, gives applications a mechanism for evaluating whether or not to trust a given proof
• Digital signatures authenticate a document's origin and integrity

36 Final Layers
• The Proof layer determines whether an answer found in the Semantic Web is correct
• It is based on:
  • How it has been derived, i.e., the logic
  • On which data, i.e., the data sources
  • By whom, i.e., the chain of providers of the data needs to be considered, too (Trust)
Network security at play!

37 Semantic Search

38 Improving Information Search with Semantics
• Traditional information retrieval is keyword-based.
• No interpretation of the "meaning" of the information.
• Problems of this basic approach:
  • Polysemy (jaguar the cat vs. jaguar the car)
  • Synonyms (movie vs. film)
  • Missing information about subclass or part-of relation (watersport vs. diving, surfing, ...)
  • Relations between search terms ("books about recommender systems" vs. "systems that recommend books")
• This is where Semantic Web technologies can help!
Semantic Search is a process of information access, where one or several activities can be supported by a set of functionalities enabled by semantic technologies.
• Search engine functionalities: query construction, query processing, result presentation
• Semantic technologies: knowledge extraction, knowledge representation, reasoning

39 Semantic Search
• As we have seen, entries in the dictionary are keywords
• We can add an ontology and map keywords to elements in the ontology.
• The ontology is used to:
  • Disambiguate the query, e.g., to select the right word sense
  • Expand the query

40 Possible Ways for Query Expansion (1)
• The query can be expanded with words/concepts from a thesaurus, e.g., Princeton University's WordNet and Medical Subject Headings (MeSH)
• Expand the query with words/concepts from a domain ontology, e.g., MeSH:
  • Between organ & physiological process: Bone and Bones, see related Osteogenesis
  • Between organ & drug acting on it: Bronchi, see related Bronchoconstrictor Agents
  • Between organ & procedure: Bile Ducts, see related Cholangiography
  • Linguistic roots: Brain, consider also terms at CEREBR- and ENCEPHAL-
MeSH: https://www.nlm.nih.gov/mesh/introduction.html
WordNet: https://wordnet.princeton.edu/

41 WordNet Example

42 Possible Ways for Query Expansion (2)
• More specific concepts and/or more general concepts
• Related concepts
• Synonyms and antonyms
• Meronym: the name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
  • "nose" is a meronym of "face"
• Entailment: a verb X entails Y if X cannot be done unless Y is, or has been, done.
  • "massage" entails "touch"
• Holonym: the name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
  • "tree" is a holonym of "bark"
• Hypernym: the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.
  • "bird" is a hypernym of "seagull"
• Hyponym: the specific term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.
  • "seagull" is a hyponym of "bird"
• Troponym: a verb expressing a specific manner elaboration of another verb. X is a troponym of Y if to X is to Y in some manner.
  • "stroll" is a troponym of "walk"
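A toy sketch of query expansion with synonyms and hyponyms. The two small dictionaries stand in for a thesaurus such as WordNet and contain only pairs mentioned in this deck. The function name expand_query and the decision to expand only one level deep are assumptions for the example.

```python
# Tiny stand-in thesaurus built from examples in this deck.
SYNONYMS = {"movie": ["film"], "film": ["movie"]}
HYPONYMS = {"bird": ["seagull"], "watersport": ["diving", "surfing"]}


def expand_query(terms):
    """Add synonyms and more specific terms (hyponyms) to each query term."""
    expanded = []
    for term in terms:
        candidates = [term, *SYNONYMS.get(term, []), *HYPONYMS.get(term, [])]
        for candidate in candidates:
            if candidate not in expanded:  # keep first occurrence, drop duplicates
                expanded.append(candidate)
    return expanded


print(expand_query(["bird", "movie"]))
print(expand_query(["watersport"]))
```

The expanded terms would typically be combined with OR, which raises recall at some cost in precision, as in the earlier Boolean examples.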

43 A Document as Concepts and Relations
A document doesn't contain keywords; it discusses concepts that are in a relation with the document. Concepts serve as a description of the document for some known properties, e.g., type of document, author...
Example document:
• contains: angel food cake mix, cherry pie filling, almond extract, sliced almonds
• has author: Kristina_Vanni
• instance of: dessert recipe

44 Keyword Indices
• Use one index for every relation (field) we are interested in
• Possible relations are specified in an ontology
• Relations may depend on the type of document (most applications only support a specific class from a small domain, e.g., scientific documents, recipes)

Relation       Term (dictionary of relations and terms)   List of documents (postings, sorted by doc ID)
"contains"     almond                                     → 1 2 4 11 31 45 173 174
"contains"     basil                                      → 3 4 7 21 24 29 57 132 ...
"has author"   basil                                      → 56182
"has author"   vanni                                      → 1822104
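One way to picture a per-relation index is a dictionary keyed by (relation, term). The sketch below is an assumption-laden illustration: the postings are short made-up lists rather than the ones on the slide, and the names FIELD_INDEX and search are invented for the example.

```python
# Hypothetical postings, keyed by (relation, term); not the slide's exact lists.
FIELD_INDEX = {
    ("contains", "almond"): [1, 2, 4, 11, 31],
    ("contains", "basil"): [3, 4, 7, 21],
    ("has author", "basil"): [5, 7],
    ("has author", "vanni"): [2, 11, 18],
}


def search(conditions, index=FIELD_INDEX):
    """Return doc IDs that satisfy every (relation, term) condition."""
    result = None
    for relation, term in conditions:
        postings = set(index.get((relation, term), []))
        result = postings if result is None else result & postings
    return sorted(result or [])


# Recipes that contain almond AND were written by vanni.
print(search([("contains", "almond"), ("has author", "vanni")]))
```

Keying on the relation keeps "basil" the ingredient apart from "basil" the author, which a single keyword index cannot do.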

45 Searching for Concepts and Relations
• The user decides what type of results (class in the ontology) they are looking for
• Properties of this class in the ontology can be used to narrow down results ("faceted search")
  • Faceted searches allow users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways
  • The possible values of a facet can be literals or other concepts, and they can be restricted by the user
• The ontology hierarchy relations can be used for inference (a search for companies includes all subclasses/instances)
• As a result, documents from the class that match the property restrictions are returned
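Faceted search as described above amounts to filtering one class of items along several explicit dimensions. A minimal sketch follows. The RECIPES data, the facet names (course, diet, prep_minutes) and the function name faceted_search are all made up for illustration.

```python
# Hypothetical recipe records; each key other than "name" is a facet.
RECIPES = [
    {"name": "Cherry almond cake", "course": "dessert", "diet": "vegetarian", "prep_minutes": 20},
    {"name": "Cherry cheesecake", "course": "dessert", "diet": "vegetarian", "prep_minutes": 45},
    {"name": "Basil chicken", "course": "main", "diet": "none", "prep_minutes": 30},
]


def faceted_search(items, max_prep_minutes=None, **facets):
    """Keep items whose facet values match every restriction the user applied."""
    hits = []
    for item in items:
        if any(item.get(key) != value for key, value in facets.items()):
            continue  # fails one of the exact-match facets
        if max_prep_minutes is not None and item["prep_minutes"] > max_prep_minutes:
            continue  # fails the numeric range facet
        hits.append(item["name"])
    return hits


print(faceted_search(RECIPES, course="dessert"))
print(faceted_search(RECIPES, max_prep_minutes=30, course="dessert", diet="vegetarian"))
```

Each extra facet narrows the result set further.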

46 Example: Yummly
http://www.yummly.com/
• Search for cherry cake
• Refine
  • Use facet for prep
  • Use facets for taste
  • Use facet for diet

47 Google Knowledge Graph
The Google Knowledge Graph is a system, launched by Google in May 2012, that understands facts about people, places and things and how these entities are all connected.

48 Broccoli (1)
http://broccoli.informatik.uni-freiburg.de/demos/BroccoliFreebase/

49 Broccoli (2)
After selecting an instance, its relations are displayed.

