Query Models: Use, Types, What Do Search Engines Do?
What we have covered
What is IR
Evaluation
Tokenization and properties of text
Vector models of documents
Web crawling
This time
– Query models
A Typical Web Search Engine
[Diagram components: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]
Online vs offline processing
[Diagram components: Users, Interface, Query Engine, Web; marked Off-line: Indexer, Index, Crawler]
A Typical Web Search Engine: Queries
[Diagram components: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]
Why the interest in queries?
Queries are the ways we interact with IR systems
– Expression of an information need
Non-query methods?
Types of queries?
Issues with Query Structures
Matching and ranking criteria
Given a query, what documents are retrieved?
In what order (rank)?
Types of Query Structures
Query Models (languages), most common:
Boolean Queries
Extended-Boolean Queries
– Vector space Boolean
Vector queries
Natural Language Queries
Others?
Simple query language: Boolean
– Earliest query model
– Terms + connectors (or operators)
– Terms: words, normalized (stemmed) words, phrases, thesaurus terms
– Connectors: AND, OR, NOT
– Ex: Beethoven AND sonata
Simple query language: Boolean
– Geek-speak
– Variations are still used in search engines!
– Ex: X AND Y, Y AND X
Truth Tables – Boolean Logic
Presence of P: P = 1
Absence of P: P = 0
True = 1, False = 0
Problems with Boolean Queries
How do you express your need in a Boolean query? (geekspeak)
No good way to weight terms for significance
– Want music by Beethoven, preferably a sonata
– Query?
Ranking?
– Binary (either there or not)
Problems with Boolean Queries
Ranking?
Incorrect interpretation of Boolean connectives AND and OR
Example: seeking Saturday entertainment
Queries:
Dinner AND sports AND symphony
Dinner OR sports OR symphony
Dinner AND sports OR symphony
Order of precedence of operators
Example of query: is A AND B the same as B AND A? Why?
Sample Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)
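Satisfaction of Boolean Query
(Cat OR Dog) AND (Collar OR Leash)
– Each of the following column combinations works:
[Table: rows Cat, Dog, Collar, Leash, with an x for each term present in a column; column alignment lost in extraction]
Others?
Satisfaction of Boolean Query
(Cat OR Dog) AND (Collar OR Leash)
– None of the following column combinations work:
[Table: rows Cat, Dog, Collar, Leash, with an x for each term present in a column; column alignment lost in extraction]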
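The combinations that do and do not satisfy this query can be listed by brute force. A minimal sketch, assuming each "column" is simply the set of the four terms present in a document; the names `satisfies` and `all_term_sets` are made up for illustration.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def satisfies(present):
    """(Cat OR Dog) AND (Collar OR Leash), tested on the set of terms present."""
    return (("cat" in present) or ("dog" in present)) and (
        ("collar" in present) or ("leash" in present)
    )

def all_term_sets():
    """Yield every subset of TERMS, i.e. every possible column of x marks."""
    for flags in product((False, True), repeat=len(TERMS)):
        yield frozenset(t for t, f in zip(TERMS, flags) if f)

matching = [s for s in all_term_sets() if satisfies(s)]
failing = [s for s in all_term_sets() if not satisfies(s)]

print(len(matching), "combinations work,", len(failing), "do not")
```

Of the 16 possible combinations, 9 satisfy the query and 7 do not; for example Cat + Collar works, while Cat + Dog alone does not.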
Order of Precedence
– Define order of precedence. Ex: a OR b AND c
– Infix notation
Parentheses are evaluated first, with operators of equal precedence applied left to right
Next, NOTs are applied
Then ANDs
Then ORs
– a OR b AND c becomes a OR (b AND c)
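The precedence rules on this slide (parentheses, then NOT, then AND, then OR) can be sketched as a small recursive-descent evaluator. This is an illustrative sketch, not any particular engine's parser: it assumes operators are written in upper case, terms are single words, and the query is well formed (no error handling for unbalanced parentheses).

```python
import re

def tokenize(query):
    # Parentheses are their own tokens; everything else splits on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", query)

def evaluate(query, present):
    """Evaluate an infix Boolean query against the set of terms in a document."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # OR binds loosest
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse the right side to consume tokens
            value = value or rhs
        return value

    def parse_and():  # AND binds tighter than OR
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # unary prefix NOT binds tightest
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = advance()
        if tok == "(":
            value = parse_or()
            advance()  # consume the closing ")"
            return value
        return tok.lower() in present

    return parse_or()

doc = {"a"}
print(evaluate("a OR b AND c", doc))    # parsed as a OR (b AND c)
print(evaluate("(a OR b) AND c", doc))  # parentheses change the grouping
```

For a document containing only term a, the first query is true and the second is false, which is why the grouping matters.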
Infix Notation
– Usually expressed as INFIX operators in IR: ((a AND b) OR (c AND b))
– NOT is a UNARY PREFIX operator: ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators: (a AND b AND c AND d)
– Some rules (De Morgan revisited):
NOT(a) AND NOT(b) = NOT(a OR b)
NOT(a) OR NOT(b) = NOT(a AND b)
NOT(NOT(a)) = a
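In retrieval terms, each rule is a statement about sets of documents. A small check, assuming a made-up collection of eight document ids and made-up posting sets for terms a and b; NOT is taken as the complement with respect to the whole collection. One example does not prove the rules, it only illustrates them.

```python
ALL_DOCS = frozenset(range(1, 9))  # hypothetical collection: doc ids 1..8
A = frozenset({1, 2, 3, 4})        # docs containing term a (made up)
B = frozenset({3, 4, 5, 6})        # docs containing term b (made up)

def complement(docs):
    """NOT: every document in the collection that is not in docs."""
    return ALL_DOCS - docs

# NOT(a) AND NOT(b) = NOT(a OR b)
lhs_and = complement(A) & complement(B)
rhs_and = complement(A | B)

# NOT(a) OR NOT(b) = NOT(a AND b)
lhs_or = complement(A) | complement(B)
rhs_or = complement(A & B)

print(sorted(lhs_and), sorted(rhs_and))
print(sorted(lhs_or), sorted(rhs_or))
print(complement(complement(A)) == A)
```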
DNFs and CNFs
All queries can be rewritten as
– Disjunctive Normal Forms (DNFs)
– Conjunctive Normal Forms (CNFs)
DNF Constituents:
– Terms (words or phrases)
– Conjuncts (terms joined by ANDs)
– Disjuncts (conjuncts joined by ORs)
– Ex: (A AND B) OR (A AND NOT C)
CNF Constituents:
– Terms (words or phrases)
– Disjuncts (terms joined by ORs)
– Conjuncts (disjuncts joined by ANDs)
– Ex: (A OR B) AND (A OR NOT C)
Effect of CNFs
All complex Boolean queries can be simplified
Why do reference librarians like CNFs?
ANDs reduce the size of the set returned and are easily expandable
– So do minuses
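One way to see that any query has a DNF is to write one conjunct for every truth assignment that satisfies it. A rough sketch, assuming the query is supplied as a Python predicate; `to_dnf` is a hypothetical helper, and the result is the full (not the shortest) DNF.

```python
from itertools import product

def to_dnf(variables, predicate):
    """Build a DNF string with one conjunct per satisfying truth assignment."""
    conjuncts = []
    for values in product((True, False), repeat=len(variables)):
        if predicate(*values):
            literals = [v if val else f"NOT {v}" for v, val in zip(variables, values)]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)  # empty string if nothing satisfies the query

# The CNF example from the slide: (A OR B) AND (A OR NOT C)
dnf = to_dnf(("A", "B", "C"), lambda a, b, c: (a or b) and (a or not c))
print(dnf)
```

Here five of the eight assignments satisfy the CNF example, so the DNF has five conjuncts; the only one with A absent is (NOT A AND B AND NOT C).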
Boolean Logic
[Diagram: terms t1, t2, t3 with documents D1 to D11 placed in the regions they define; the eight regions are labeled m1 to m8, each a conjunction of t1, t2, t3 with every term either present or negated (negation marks lost in extraction)]
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Diagram labels: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
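The relaxed query is "any three of the four concepts". A short sketch of generating it, assuming the single-letter abbreviations from the slide; `relaxed_query` and `matches_at_least` are illustrative names.

```python
from itertools import combinations

def relaxed_query(terms, k):
    """OR together every AND-group of k terms chosen from terms."""
    groups = ["(" + " AND ".join(combo) + ")" for combo in combinations(terms, k)]
    return " OR ".join(groups)

def matches_at_least(doc_terms, terms, k):
    """Equivalent test: the document contains at least k of the query terms."""
    return sum(1 for t in terms if t in doc_terms) >= k

concepts = ["C", "B", "W", "P"]  # Cracks, Beams, Width measurement, Prestressed concrete
print(relaxed_query(concepts, 3))
print(matches_at_least({"C", "B", "P"}, concepts, 3))
```

The generated groups are the same four as on the slide, listed in a different order.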
Pseudo-Boolean Queries
A new notation, from web search
– +cat dog +collar leash
– + means this term must appear in the document
Does not mean the same thing! Need a way to group combinations.
Phrases:
– “stray cat” AND “frayed collar”
– +“stray cat” +“frayed collar”
[Diagram of the retrieval process. Boxes: Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank]
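A rough sketch of how the + notation might be interpreted. The semantics here are an assumption for illustration: + terms are required, - terms are excluded, and unmarked terms are optional and only add to a simple score. Quoted phrases are not handled.

```python
def parse_pseudo_boolean(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, optional, excluded = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, optional, excluded

def matches(doc_terms, query):
    """A document matches if it has every + term and no - term."""
    required, _, excluded = parse_pseudo_boolean(query)
    return required <= doc_terms and not (excluded & doc_terms)

def score(doc_terms, query):
    """Count how many optional terms the document contains."""
    _, optional, _ = parse_pseudo_boolean(query)
    return len(optional & doc_terms)

q = "+cat dog +collar leash"
print(matches({"cat", "collar"}, q), score({"cat", "collar"}, q))
print(matches({"cat", "dog", "leash"}, q))
```

Under these assumptions a document with only cat and collar matches, while one with cat, dog and leash does not, which differs from the grouped Boolean query (Cat OR Dog) AND (Collar OR Leash).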
Result Sets
Run a query, get a result set
Two choices
– Reformulate query, run on entire collection
– Reformulate query, run on result set
Example: Dialog query
(Redford AND Newman) -> S1, 1450 documents
(S1 AND Sundance) -> S2, 898 documents
[Diagram of the retrieval process with feedback. Boxes: Information need, text input, Parse, Query, Reformulated Query, Collections, Pre-process, Index, Rank, Re-Rank]
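Refining a result set with AND is a set intersection. A minimal sketch with made-up postings (the document ids are invented and unrelated to the Dialog counts above):

```python
# Hypothetical postings: term -> set of ids of documents containing it
index = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7},
    "sundance": {3, 4, 5, 9},
}

# (Redford AND Newman) -> S1
s1 = index["redford"] & index["newman"]

# (S1 AND Sundance) -> S2, run against the saved result set only
s2 = s1 & index["sundance"]

# The same refinement run against the entire collection
full = index["redford"] & index["newman"] & index["sundance"]

print(sorted(s1), sorted(s2), s2 == full)
```

For an AND refinement the two choices return the same documents; running on the result set just avoids repeating the earlier work.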
Ordering (ranking) of Retrieved Documents
Pure Boolean has no ordering
Term is there or it’s not
In practice:
– order chronologically
– order by total number of “hits” on query terms
What if one term has more hits than others?
Is it better to have one of each term or many of one term?
Boolean Query - Summary
Advantages
– simple queries are easy to understand
– relatively easy to implement
Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
Dominant language in commercial systems until the WWW
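The last two questions can be made concrete by ordering the same documents two ways. A toy sketch with invented documents: one ordering uses total hits, the other counts how many distinct query terms appear.

```python
from collections import Counter

# Invented toy documents
docs = {
    "d1": "cat cat cat cat collar",
    "d2": "cat dog collar leash",
    "d3": "dog dog leash",
}
query_terms = ["cat", "dog", "collar"]

def total_hits(text):
    """Total occurrences of all query terms in the document."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)

def distinct_hits(text):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(1 for t in query_terms if t in words)

by_total = sorted(docs, key=lambda d: total_hits(docs[d]), reverse=True)
by_distinct = sorted(docs, key=lambda d: distinct_hits(docs[d]), reverse=True)

print(by_total)
print(by_distinct)
```

By total hits d1 comes first because of its many occurrences of cat; by distinct terms d2 comes first because it contains all three query terms.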
Vector Space Model
Queries treated as small documents
Documents and queries are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
Query and document weights are based on length and direction of their vector
A vector distance measure between the query and documents is used to rank retrieved documents
Document Vectors
Documents are represented as “bags of words”
– Words are terms with no order
Represented as vectors when used computationally
– A vector is like an array of floating point values
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
Queries
Vocabulary: (dog, house, white)
Queries:
dog: (1,0,0)
house: (0,1,0)
white: (0,0,1)
house and dog: (1,1,0)
dog and house: (1,1,0)
Show 3-D space plot
Documents (queries) in Vector Space
[Plot: axes t1, t2, t3; points D1 to D11]
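The mapping from query text to vector can be sketched in a few lines, assuming binary weights and the three-word vocabulary from the slide. Words outside the vocabulary (such as "and") are simply ignored, and `to_vector` is an illustrative name.

```python
VOCABULARY = ("dog", "house", "white")

def to_vector(text):
    """Binary vector: 1 where the vocabulary term occurs in the text, else 0."""
    words = set(text.lower().split())
    return tuple(1 if term in words else 0 for term in VOCABULARY)

for query in ("dog", "house", "white", "house and dog", "dog and house"):
    print(query, to_vector(query))
```

Because word order is discarded, "house and dog" and "dog and house" map to the same point in term space.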
Documents in 3D Space
Assumption: documents that are “close together” in space are similar in meaning.
Vector Query Problems
Significance of queries
– Can different values be placed on the different terms, e.g. 2 dog, 1 house?
Scaling: size of vectors
Number of words in the dictionary? 100,000
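Both problems can be illustrated together. The sketch below assumes cosine similarity as the closeness measure (one common choice, not necessarily the one intended here), stores vectors as sparse dictionaries so a 100,000-word dictionary costs nothing for absent terms, and weights the query as 2 dog, 1 house. The documents are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = {"dog": 2.0, "house": 1.0}  # "2 dog, 1 house"

# Invented documents; only non-zero terms are stored
docs = {
    "D1": {"dog": 1.0},
    "D2": {"house": 1.0},
    "D3": {"dog": 1.0, "house": 1.0, "white": 1.0},
}

ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
for d in ranking:
    print(d, round(cosine(query, docs[d]), 3))
```

With dog weighted twice as heavily as house, the dog-only document ranks above the house-only one.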
Proximity Searches
Proximity: terms occur within K positions of one another
– pen w/5 paper
A “Near” function can be more vague
– near(pen, paper)
Sometimes order can be specified
Also, phrases and collocations
– “United Nations” “Bill Clinton”
Phrase variants
– “retrieval of information” “information retrieval”
Filters/field limiters
Filters: reduce the set of candidate docs
Often specified simultaneously with the query
Usually restrictions on metadata
– restrict by: date range, internet domain (.edu, .com, berkeley.edu), author, size
– limit number of documents returned
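A proximity test needs word positions rather than a bag of words. A minimal sketch, assuming "within K" means the two positions differ by at most K; the exact counting convention varies between systems, and `within` is an illustrative name.

```python
def positions(tokens, term):
    """All positions at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and some occurrence of b are at most k apart.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            if ordered and j < i:
                continue
            if i != j and abs(i - j) <= k:
                return True
    return False

tokens = "the pen lay on top of the lined paper".split()
print(within(tokens, "pen", "paper", 5))  # pen and paper are 7 positions apart
print(within(tokens, "pen", "paper", 7))
```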
Natural Language Queries
The “Holy Grail” of information retrieval
Issues in Natural Language Processing
– syntax
– semantics
– pragmatics
– speech understanding
– speech generation
What do search engines do?
Tags
– Title
– Meta
Term frequency and location
Popularity
Others
What do search engines do?
Collection of various methods, sometimes called pseudo-Boolean
– quotes, minus, plus
– pseudo AND truth in vs in truth
– stop words?
What does Google do?
Basic search
Search operators
Search query string
The portion of a dynamic URL that contains the search parameters when a dynamic Web site is searched. A query string does not exist until a user submits the search variables, at which point the dynamic URL is created with a query string built from those variables. The query string typically follows a ? in the URL and often contains % characters from URL encoding.
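Building and reading a query string can be sketched with Python's standard `urllib.parse` module. The host name and the parameter names `q` and `page` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://search.example.com/results"  # hypothetical search site
params = {"q": "stray cat & frayed collar", "page": "2"}

# The query string only comes into being once the user's variables are known.
url = base + "?" + urlencode(params)
print(url)

# Reading the parameters back out of the dynamic URL
recovered = parse_qs(urlsplit(url).query)
print(recovered["q"][0])
```

The & inside the search text is encoded as %26 so it is not mistaken for the separator between parameters.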
Lucene Basics
Searches are supported through a wide range of Query options
– Keyword
– Terms
– Phrases
– Wildcards
– Many, many more
QueryParser syntax examples
Query expression                     Document matches if…
java                                 Contains the term java in the default field
java junit, java OR junit            Contains the term java or junit or both in the default field (the default operator can be changed to AND)
+java +junit, java AND junit         Contains both java and junit in the default field
title:ant                            Contains the term ant in the title field
title:extreme -subject:sports        Contains extreme in the title and not sports in subject
(agile OR extreme) AND java          Boolean expression matches
title:"junit in action"              Phrase matches in title
title:"junit action"~5               Proximity matches (within 5) in title
java*                                Wildcard matches
java~                                Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]    Range matches
Types of Query Structures
Query Models (languages), most common:
Boolean Queries
– Old model
Vector queries
– Very common: in all search engines to some extent
Web queries
– Search engines
Probabilistic models
– Mostly research (Indri)
Holy grail of search
– Natural Language Queries