Presentation transcript: Alexander Gelbukh (www.Gelbukh.com), "Special Topics in Computer Science: The Art of Information Retrieval", Chapter 4: Query Languages
1 Special Topics in Computer Science: The Art of Information Retrieval
Chapter 4: Query Languages
Alexander Gelbukh, www.Gelbukh.com
2 Previous chapter
Main measures: precision and recall, defined for sets.
Rankings are evaluated through their initial subsets.
Some measures combine the two into one.
  They involve user-defined preferences; in the F-measure the preference is set to 50-50.
Many other characteristics exist.
  An algorithm can be good at some and bad at others.
Averages are used, but they are not always meaningful.
Reference collections with known answers exist for evaluating new algorithms.
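As a reminder of how these measures are computed, here is a minimal sketch. The function names and the sample document ids are illustrative assumptions, and the F-measure is written in its usual 50-50 form (the harmonic mean of precision and recall).

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def precision_at_k(ranking, relevant, k):
    """Evaluate a ranking through its initial subset: the top k answers."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k


# Hypothetical document ids, for illustration only.
p, r, f = precision_recall_f([1, 2, 3, 4], {2, 4, 5})
```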
3 Previous chapter: research issues
Different types of interfaces; interactive systems:
  What measures should be used?
  How do people judge relevance?
  How can “user satisfaction” be measured? Modeled?
4 Query languages
Query language = the types of queries that can be posed.
The types of queries depend on the IR model.
Types:
  IR (= ranked output)
  Data retrieval
  User-oriented
  Low-level (= protocols)
Assume all pre-processing has been done:
  thesaurus, stop-words, ...
  (I think this must be a part of the language!)
Returns “documents” (chapter, paragraph, ...)
5 In this chapter
Keyword-based languages
Pattern matching
Structure taken into account
Protocols
6 Keyword-based languages: Single word
Intuitive, easy to express, fast ranking.
Words can be highlighted in the output.
What is a word?
  Letters, separators
  Non-splitting characters: on-line.
  The database decides.
TF-IDF is designed for words.
Used for the main models (Boolean, Vector, Probabilistic).
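The question "what is a word?" can be made concrete with a small tokenizer sketch. The function name `tokenize` and its `non_splitting` parameter are assumptions made for illustration. Restricting letters to A-Z is also a simplification.

```python
import re


def tokenize(text, non_splitting=""):
    """Split text into lowercase words.

    Runs of letters form words. Characters listed in `non_splitting`
    (for example "-") may join letter runs instead of separating them.
    """
    if non_splitting:
        joiner = "[" + re.escape(non_splitting) + "]"
        pattern = rf"[A-Za-z]+(?:{joiner}[A-Za-z]+)*"
    else:
        pattern = r"[A-Za-z]+"
    return [word.lower() for word in re.findall(pattern, text)]


default_words = tokenize("On-line retrieval, fast!")
hyphen_words = tokenize("On-line retrieval, fast!", non_splitting="-")
```

With the default setting "on-line" becomes two words. With the hyphen declared non-splitting it stays one word, which is the kind of decision the database has to make.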
7 Keyword-based languages: Context queries
Ensure that the words are related.
Phrase
  “enhance retrieval”
  Allows separators and stopwords: “enhance the retrieval”
Proximity
  “enhance the quality of information retrieval”
  Distance: in words or in letters. Order: same or not.
Not clear how to rank.
  Research issue.
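A proximity query can be sketched over a list of word tokens as follows. The function name and parameters are illustrative assumptions. Distance is counted in words, and a phrase query corresponds to a distance of 1 once stopwords have been removed.

```python
def proximity_match(tokens, first, second, max_distance, ordered=True):
    """Return True if `first` and `second` occur within `max_distance` words.

    With ordered=True, `second` must come after `first`.
    """
    positions_first = [i for i, t in enumerate(tokens) if t == first]
    positions_second = [i for i, t in enumerate(tokens) if t == second]
    for a in positions_first:
        for b in positions_second:
            distance = b - a
            if not ordered:
                distance = abs(distance)
            if 0 < distance <= max_distance:
                return True
    return False


tokens = "enhance the quality of information retrieval".split()
found = proximity_match(tokens, "enhance", "retrieval", max_distance=5)
```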
8 Keyword-based languages: Boolean queries
Boolean expressions (can combine basic queries).
Query syntax tree: translation AND (syntax OR syntactic)
  Operations on the sets.
  Result: a set.
Operators: OR, AND, e1 BUT e2.
NOT is not used: it could return (almost) all documents (= unsafe).
Good: can highlight occurrences, sort.
Bad: difficult for the users.
Remedy (?): fuzzy Boolean (see below).
Basic query = keyword or pattern.
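Evaluating a query syntax tree as operations on sets can be sketched as below. The tuple representation of the tree, the function name, and the tiny inverted index are assumptions for illustration only.

```python
def evaluate(node, index):
    """Evaluate a Boolean query tree; the result is a set of document ids.

    A leaf is a keyword (str). An inner node is (operator, left, right).
    """
    if isinstance(node, str):
        return set(index.get(node, set()))
    operator, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if operator == "AND":
        return a & b
    if operator == "OR":
        return a | b
    if operator == "BUT":  # e1 BUT e2: in e1 and not in e2
        return a - b
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical inverted index: keyword -> ids of documents containing it.
index = {"translation": {1, 2, 4}, "syntax": {2, 3}, "syntactic": {4, 5}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index)
```

BUT is safe where a bare NOT is not: it only removes documents from a set that was already retrieved.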
9 Keyword-based languages: Fuzzy Boolean, natural language
Fuzzy Boolean: OR and AND are both relaxed to “some”.
  AND penalizes absence, OR rewards multiple occurrences.
  Natural ranking: how many times?
Natural language: OR = AND.
  BUT can be expressed (= penalty).
How to rank? Different ways.
  Vector space model
    The query is a vector.
    A document can also be taken as a vector. Relevance feedback!
  Proximity is ignored.
    (Why? Research issue.)
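One possible way to rank under these relaxed semantics is sketched below. The scoring rule (one point per wanted word present, one penalty point per unwanted word present) is an assumption chosen for simplicity, not a formula from the course, and the names are hypothetical.

```python
def soft_score(doc_words, wanted, unwanted=(), penalty=1.0):
    """Score a document: reward wanted words, penalize unwanted (BUT) words."""
    words = set(doc_words)
    hits = sum(1 for w in wanted if w in words)
    misses = sum(1 for w in unwanted if w in words)
    return hits - penalty * misses


def rank(docs, wanted, unwanted=()):
    """Return document names ordered from best to worst score."""
    scored = [(soft_score(words, wanted, unwanted), name)
              for name, words in docs.items()]
    return [name for score, name in sorted(scored, key=lambda pair: -pair[0])]


# Hypothetical documents, for illustration only.
docs = {
    "d1": "query languages for information retrieval".split(),
    "d2": "information retrieval".split(),
    "d3": "database query languages".split(),
}
ranking = rank(docs, ["information", "retrieval", "query"], ["database"])
```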
10 Pattern matching...
Pattern = sequence of features.
A text segment matches the pattern.
Types:
  Words
  Prefixes, suffixes, substrings: comput-, -ters, -any flow- (many flowers)
  Ranges
    Imply some order, e.g., lexicographical (= alphabetic).
  Allowing errors
    Levenshtein (= edit) distance: historical / hysterical
    Number of insertions, deletions, replacements. Threshold.
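The edit distance can be computed with the standard dynamic-programming recurrence. This is a minimal sketch, and the helper `matches_with_errors` is a hypothetical name for the threshold test.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]


def matches_with_errors(word, pattern, threshold):
    """True if `word` is within `threshold` edits of `pattern`."""
    return levenshtein(word, pattern) <= threshold


distance = levenshtein("historical", "hysterical")  # two replacements: i/y and o/e
```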
11 ...Pattern matching
...Types:
  Regular expressions
    union (= or): if e1 and e2 are expressions, so is (e1 | e2)
    concatenation: e1 e2
    repetition: e* (0 or more occurrences)
  Extended patterns
    User-friendly; can be converted internally into simple ones.
    Case-insensitive, “anything” (wildcard), digit, vowel, ...
    Conditionals, optional parts
    Some parts match exactly and others with errors, etc.
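These operators map directly onto Python's `re` module, used here only as a convenient illustration. The constant names and the helper `matches` are assumptions. The last pair shows an extended pattern (an optional part) next to an equivalent pattern built from union alone.

```python
import re

UNION = r"(syntax|syntactic)"      # e1 | e2
CONCATENATION = r"query language"  # e1 e2
REPETITION = r"(ab)*"              # e*, zero or more occurrences
OPTIONAL_EXTENDED = r"colou?r"     # extended pattern: optional "u"
OPTIONAL_SIMPLE = r"colo(u|)r"     # same language, written with union only


def matches(pattern, text, ignore_case=False):
    """True if the whole text matches the pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, text, flags) is not None


# Extended features: case-insensitive matching and a digit class.
chapter_found = matches(r"chapter \d+", "Chapter 12", ignore_case=True)
```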
12 Structural queries
Old days: fields. No nesting, no overlap, fixed order.
  subject, body, sender, ...
  = Relational database with a text type, treated as text should be.
  Versions of SQL with text operators.
Hypertext
  Not well developed. Too free.
  WebGlimpse: search the neighborhood.
Hierarchical
  Intermediate level of freedom.
  Volumes, chapters, sections, paragraphs, sentences, ...
14 Hierarchical models ...
PAT expressions
  Hierarchy is defined at query time.
  Regions are included in the index, e.g., sections, italics, ...
  Regions of different types can overlap; regions of the same type cannot.
  Can query for words in a region, regions in a region, etc.
  Complex computation, unclear semantics.
Overlapped lists
  Evolution of PAT: areas of the same type can overlap (but not nest).
  Uses the same inverted file.
  Can combine regions, specify order, ...
  n-words: all (overlapping) areas of n words.
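Region queries of this kind can be sketched by treating a region as a (start, end) pair of word positions. This representation, the function names, and the sample data are assumptions for illustration, not how PAT or overlapped lists are actually implemented.

```python
def regions_containing(regions, positions):
    """Regions that contain at least one of the given word positions."""
    return [(start, end) for (start, end) in regions
            if any(start <= p <= end for p in positions)]


def regions_inside(inner, outer):
    """Regions of `inner` that lie completely inside some region of `outer`."""
    return [(start, end) for (start, end) in inner
            if any(o_start <= start and end <= o_end
                   for (o_start, o_end) in outer)]


def n_words(text_length, n):
    """All (overlapping) areas of n consecutive words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]


# Hypothetical index entries: word positions of regions and of one keyword.
sections = [(0, 9), (10, 19), (20, 29)]
italics = [(3, 5), (18, 22)]
keyword_positions = [4, 25]

sections_with_keyword = regions_containing(sections, keyword_positions)
italics_in_sections = regions_inside(italics, sections)
```

The italic region (18, 22) overlaps two sections without lying inside either, so it is not returned by the second query.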
16 ... Hierarchical models ...
List of references
  Answers are references (pointers) to regions.
  Only one type of region (e.g., only sections). No nesting.
  Known at index time.
  Ancestry of nodes. Can query paths.
Proximal nodes
  Compromise between expressiveness and efficiency.
  Many (overlapping) fixed hierarchies.
  Interesting queries: “3rd paragraph of each chapter”, ...
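A query such as “3rd paragraph of each chapter” can be sketched over a fixed hierarchy as follows. The dictionary representation, the helper names, and the restriction to direct children are simplifying assumptions, not the proximal nodes model itself.

```python
def node(kind, children=(), text=""):
    """Build a hierarchy node (hypothetical representation)."""
    return {"type": kind, "children": list(children), "text": text}


def nth_child_of_each(root, parent_type, child_type, n):
    """Collect the n-th (1-based) direct child of `child_type` under every
    node of `parent_type` in the hierarchy."""
    found = []
    if root["type"] == parent_type:
        candidates = [c for c in root["children"] if c["type"] == child_type]
        if len(candidates) >= n:
            found.append(candidates[n - 1])
    for child in root["children"]:
        found.extend(nth_child_of_each(child, parent_type, child_type, n))
    return found


def paragraphs(*texts):
    return [node("paragraph", text=t) for t in texts]


book = node("book", [
    node("chapter", paragraphs("p1", "p2", "p3")),
    node("chapter", paragraphs("p4", "p5")),
    node("chapter", paragraphs("p6", "p7", "p8", "p9")),
])
third_paragraphs = [p["text"]
                    for p in nth_child_of_each(book, "chapter", "paragraph", 3)]
```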
18 ... Hierarchical models
Tree matching
  The query is a tree. Match it against the text tree.
  Ordered or unordered trees (are siblings ordered?)
  Prolog-like constraints on different parts of the tree.
    Variables
  Answer: the root of a match.
  Very inefficient (usually NP-hard),
    due to variables and unordered matching.
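The easy case, ordered matching without variables, can be sketched as below. Trees are written as (label, children) tuples, which is an assumed representation. A pattern node matches a text node when the labels agree and the pattern's children match some of the text node's children in the same left-to-right order.

```python
def matches_at(pattern, text_node):
    """True if `pattern` matches with its root placed at `text_node`."""
    pattern_label, pattern_children = pattern
    node_label, node_children = text_node
    if pattern_label != node_label:
        return False
    i = 0  # next unused child of the text node
    for pattern_child in pattern_children:
        # Greedily take the leftmost remaining child that matches.
        while i < len(node_children) and not matches_at(pattern_child,
                                                        node_children[i]):
            i += 1
        if i == len(node_children):
            return False
        i += 1
    return True


def find_matches(pattern, text_node):
    """Return the roots of all matches, in pre-order."""
    roots = [text_node] if matches_at(pattern, text_node) else []
    for child in text_node[1]:
        roots.extend(find_matches(pattern, child))
    return roots


# Hypothetical text tree and query tree.
text_tree = ("book", [
    ("chapter", [("title", []), ("section", []), ("figure", [])]),
    ("chapter", [("figure", []), ("title", [])]),
    ("chapter", [("section", [])]),
])
query_tree = ("chapter", [("title", []), ("figure", [])])
answers = find_matches(query_tree, text_tree)
```

Only the first chapter is returned: the second has both children, but in the wrong order. Allowing unordered siblings or variables means trying many assignments, which is where the cost grows.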
19 Research issues in hierarchical models
Static or dynamic?
  Define the hierarchy at index time or at query time?
  Static: text markup. Dynamic: tags, indexed.
Restrictions on the structure
  Restrict the structure or restrict the query language?
  For efficiency.
Integration with text
  Which is of secondary importance: structure (in IR) or text (in DB)?
  Combine them.
Query language
  Standardization, expressiveness taxonomy, categorization.
20 Query protocols
Used internally.
Standard: one client can query different libraries.
  On CD-ROMs: disk interchangeability.
Z39.50: bibliographic (used for other types, too)
WAIS (Wide Area Information Service)
  Includes Z39.50.
For CD-ROMs:
  CCL, Common Command Language
  CD-RDx (Compact Disk Read only Data Exchange)
  SFQL (Structured Full-text Query Language). Like DB.
22 Trends and research topics
Models: to better understand the user needs.
Query languages: flexibility, power, expressiveness, functionality.
Visual languages
  Example: a library shown on the screen. Actions: take books, open catalogs, etc.
Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
23 Conclusions
Breadth-wise:
  words, phrases, proximity, fuzzy Boolean, natural language
Depth-wise:
  Pattern matching
If queries return sets, they can be combined using the Boolean model.
Combining with structure
  Hierarchical structure
Standardized low-level languages: protocols
  Reusable
24 Thank you!
Till October 16.
October 23: midterm exam.