Special Topics in Computer Science: The Art of Information Retrieval. Chapter 4: Query Languages. Alexander Gelbukh, www.Gelbukh.com
2 Previous Chapter
  Main measures: Precision & Recall.
    o For sets
    o Rankings are evaluated through initial subsets
  There are measures that combine them into one
    o Involve user-defined preferences. In F-measure set to
  Many (other) characteristics
    o An algorithm can be good at some and bad at others
    o Averages are used, but are not always meaningful
  A reference collection with known answers exists for evaluating new algorithms
3 Previous chapter: research issues
  Different types of interfaces; interactive systems:
    o What measures to use?
    o How do people judge relevance?
    o How can user satisfaction be measured? Modeled?
4 Query languages
  Query language = type of possible queries
  The types of queries depend on the IR model
  Types:
    o IR (= ranked output)
    o Data retrieval
    o User-oriented
    o Low-level (= protocols)
  Assume all pre-processing has been done
    o Thesaurus, stop-words, ...
    o (I think this must be a part of the language!)
  Returns documents (chapter, paragraph, ...)
5 In this chapter
  Keyword-based languages
  Pattern matching
  Structure taken into account
  Protocols
6 Keyword-based languages: Single word
  Intuitive, easy to express, fast ranking.
    o Words can be highlighted in the output.
  What is a word?
    o Letters, separators
    o Non-splitting characters: on-line.
    o The database decides.
  TF-IDF is designed for words
  Used for the main models (Boolean, Vector, Probabilistic)
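A minimal sketch of the "what is a word?" decision, assuming words are runs of ASCII letters and that the system (the "database") chooses which characters are non-splitting. The function name `tokenize` and its `non_splitting` parameter are hypothetical, introduced only for illustration.

```python
import re

def tokenize(text, non_splitting="-"):
    """Split text into words; characters in `non_splitting` may join letter runs.

    Assumption: a word is a run of ASCII letters. Real systems usually
    also handle digits, accents, and other scripts.
    """
    if non_splitting:
        joiner = "[" + re.escape(non_splitting) + "]"
        pattern = rf"[A-Za-z]+(?:{joiner}[A-Za-z]+)*"
    else:
        pattern = r"[A-Za-z]+"
    return re.findall(pattern, text)

# With "-" as a non-splitting character, "on-line" stays one word.
kept_together = tokenize("An on-line database.")
# With no non-splitting characters, the hyphen acts as a separator.
split_apart = tokenize("An on-line database.", non_splitting="")
```

The same text yields different index terms depending on this choice, which is why a query for "on-line" behaves differently from one system to another.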
7 Keyword-based languages: Context Queries
  Ensure that the words are related
  Phrase
    o Example: "enhance retrieval"
    o Allows separators and stopwords: also matches "enhance the retrieval"
  Proximity
    o Example: matches "enhance the quality of information retrieval"
    o Distance: in words or letters. Order: same or not
  Not clear how to rank
    o Research issue
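Phrase and proximity queries can be illustrated with a positional index. The sketch below assumes distance is measured in words and that a strict phrase is simply proximity with a maximum gap of 1. The helper names `build_positional_index` and `within` are hypothetical.

```python
import re
from collections import defaultdict

def build_positional_index(text):
    """Map each lowercase word to the list of its word positions in the text."""
    index = defaultdict(list)
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index[word].append(pos)
    return index

def within(index, first, second, max_gap, ordered=True):
    """True if `second` occurs within `max_gap` word positions of `first`.

    ordered=True requires `second` to come after `first`;
    ordered=False accepts either order.
    """
    for p in index.get(first, []):
        for q in index.get(second, []):
            gap = q - p if ordered else abs(q - p)
            if 0 < gap <= max_gap:
                return True
    return False

doc = "We enhance the quality of information retrieval"
idx = build_positional_index(doc)

phrase_match = within(idx, "enhance", "retrieval", 1)     # adjacent words only
proximity_match = within(idx, "enhance", "retrieval", 6)  # up to 6 words apart
```

This only answers yes or no. How to turn such matches into a ranking is the open question the slide mentions.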
8 Keyword-based languages: Boolean Queries
  Boolean expressions (can combine basic queries)
  Query syntax tree
    o Example: translation AND (syntax OR syntactic)
  Operations on the sets
    o Result: set
  Operators: OR, AND, e1 BUT e2
    o NOT is not used: it could give (almost) all docs (= unsafe)
  Good: can highlight occurrences, sort
  Bad: difficult for the users
  Remedy (?): fuzzy Boolean (see below).
  Basic = keyword, pattern
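As a rough illustration of evaluating a query syntax tree with set operations, the sketch below uses a made-up toy inverted index and represents the tree as nested tuples. Both are assumptions for the example, not part of the original slides.

```python
# Toy inverted index: term -> set of document ids (made-up data).
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
    "semantic": {4},
}

def evaluate(node):
    """Evaluate a query syntax tree bottom-up; every node yields a set of doc ids."""
    if isinstance(node, str):          # leaf: a basic keyword query
        return index.get(node, set())
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":                    # e1 BUT e2: docs in e1 that are not in e2
        return a - b
    raise ValueError(f"unknown operator: {op}")

# translation AND (syntax OR syntactic)
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query)

# (translation AND (syntax OR syntactic)) BUT semantic
but_result = evaluate(("BUT", query, "semantic"))
```

BUT is a set difference against an already retrieved set, so it never needs the complement of the whole collection. A bare NOT would, which is why it is considered unsafe.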
9 Keyword-based languages: Fuzzy Boolean, Natural Language
  Fuzzy Boolean: OR, AND = some.
    o AND punishes for absence, OR encourages multiple.
    o Natural ranking: how many times?
  Natural Language: OR = AND
    o BUT can be expressed (= penalty)
    o How to rank? Different ways
  Vector space model
    o Query is a vector
    o A doc can be taken as a vector. Relevance feedback!
  Proximity is ignored
    o (Why? Research issue.)
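One possible reading of the "some" operator, sketched under assumptions: a document is retrieved if it contains at least one query term, and documents are ranked first by how many distinct query terms they contain, then by the total number of occurrences. The scoring rule and the name `rank_some` are illustrative choices, and the documents are made-up data.

```python
from collections import Counter

docs = {
    "d1": "translation of syntax and syntax trees",
    "d2": "syntactic translation",
    "d3": "cooking recipes",
}

def rank_some(query_terms, docs):
    """Rank docs by number of distinct query terms present, then by total occurrences."""
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        matched = sum(1 for t in query_terms if counts[t] > 0)
        occurrences = sum(counts[t] for t in query_terms)
        if matched:                      # "some": at least one term must be present
            scored.append((matched, occurrences, doc_id))
    # Best first; ties are broken by doc id to keep the order deterministic.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]

ranking = rank_some(["translation", "syntax", "syntactic"], docs)
```

A missing term lowers the score instead of excluding the document, and repeated terms raise it. Word positions play no role here, so proximity is ignored.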
10 Pattern matching...
  Pattern = sequence of features
    o Text segment matches the pattern
  Types:
  Words
  Prefixes, suffixes, substrings:
    o comput-, -ters, -any flow- (many flowers).
  Ranges
    o Implies some order, e.g., lexicographical = alphabetic
  Allowing errors
    o Levenshtein (= edit) distance: historical / hysterical
    o # insertions, deletions, replacements. Threshold.
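Matching with errors can be sketched with the standard dynamic-programming computation of the Levenshtein distance, followed by a threshold test. The function names are hypothetical. Comparing whole words is an assumption, since real systems usually search for approximate matches inside the text.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,            # deletion
                curr[j - 1] + 1,        # insertion
                prev[j - 1] + cost,     # replacement (or match)
            ))
        prev = curr
    return prev[-1]

def matches_with_errors(word, pattern, threshold):
    """True if word is within `threshold` edits of pattern."""
    return levenshtein(word, pattern) <= threshold

distance = levenshtein("historical", "hysterical")
```

For the slide's example the distance is 2 (i/y and o/e), so the two words match with a threshold of 2 but not with a threshold of 1.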
11 ...Pattern matching... Types
  Regular expressions
    o union = or: if e1, e2 are expressions, (e1 | e2) is too
    o concatenation: e1 e2
    o repetition: e* (0 or more occurrences)
  Extended patterns
    o user-friendly; can be internally converted into simple ones
    o case-insensitive, anything (wildcard), digit, vowel, ...
    o conditionals, optional parts
    o some parts match exactly and others with errors
    o etc.
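The three basic operations and a few extended features can be tried with Python's standard `re` module. The patterns themselves are made-up examples, and Python's syntax is only one of several regular-expression dialects.

```python
import re

# Union: (e1 | e2)
union = re.compile(r"(syntax|syntactic)")

# Concatenation and repetition: "a", then zero or more "b", then "c"
concat_repeat = re.compile(r"ab*c")

# Extended features: case-insensitive flag, wildcard ".", optional digit "\d?"
extended = re.compile(r"pro.lem\d?", re.IGNORECASE)

union_ok = union.fullmatch("syntactic") is not None
zero_repeats_ok = concat_repeat.fullmatch("ac") is not None
many_repeats_ok = concat_repeat.fullmatch("abbbc") is not None
extended_ok = extended.fullmatch("Problem7") is not None
```

The extended constructs add convenience, not expressive power: the optional digit `\d?`, for example, could be written out as a union of the empty string and the ten digits.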
12 Structural queries
  Old days: fields. No nesting, no overlap, fixed order.
    o subject, body, sender, ...
    o = Relational database with text type, treated as text should be
    o Versions of SQL with text operators
  Hypertext
    o Not well developed. Too free
    o WebGlimpse: search the neighborhood
  Hierarchical
    o Intermediate level of freedom
    o Volumes, chapters, sections, paragraphs, sentences, ...
13 [Figure labels: Too fixed; Too free; Intermediate]
14 Hierarchical Models...
  PAT expressions
    o Hierarchy is defined at query time.
    o Regions are included in the index, e.g., sections, italics, ...
    o Different types of regions can overlap; regions of the same type can't
    o Can query for words in a region, regions in a region, etc.
    o Complex computation, unclear semantics
  Overlapped lists
    o Evolution of PAT: areas of the same type can overlap (not nest)
    o Uses the same inverted file
    o Can combine regions, specify order, ...
    o n-words: all (overlapping) areas of n words.
15 Overlapping lists
16 ...Hierarchical Models...
  List of references
    o Answers are references (pointers) to regions
    o Only one type of regions (e.g., only sections). No nesting.
    o Known at index time
    o Ancestry of nodes. Can query paths
  Proximal nodes
    o Compromise between expressiveness and efficiency
    o Many (overlapping) fixed hierarchies
    o Interesting queries: 3rd paragraph of each chapter, ...
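To give a feel for a positional query over a fixed hierarchy such as "3rd paragraph of each chapter", here is a deliberately simplified sketch. It assumes a single two-level hierarchy (chapters containing paragraphs) stored as a plain dictionary with made-up contents. It is not an implementation of the proximal nodes model, which supports several hierarchies and works on an index.

```python
# Toy document hierarchy: chapter -> ordered list of paragraphs (made-up data).
book = {
    "Chapter 1": ["c1 p1", "c1 p2", "c1 p3", "c1 p4"],
    "Chapter 2": ["c2 p1", "c2 p2"],
    "Chapter 3": ["c3 p1", "c3 p2", "c3 p3"],
}

def nth_paragraph_of_each_chapter(book, n):
    """Return the n-th paragraph (1-based) of every chapter that has one."""
    return {
        chapter: paragraphs[n - 1]
        for chapter, paragraphs in book.items()
        if len(paragraphs) >= n
    }

third_paragraphs = nth_paragraph_of_each_chapter(book, 3)
```

Chapters with fewer than three paragraphs simply contribute nothing to the answer.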
17 Proximal nodes
18 ...Hierarchical Models
  Tree matching
    o Query is a tree. Match the text tree.
    o Ordered or unordered trees (are siblings ordered?)
    o Prolog-like constraints on different parts of the tree
      Variables
    o Answer: root of a match
    o Very inefficient (usually NP-hard)
      Due to variables and unordered matching
19 Research issues in hierarchical models
  Static or dynamic?
    o Define the hierarchy at index time or at query time?
    o Static: text markup. Dynamic: tags, indexed.
  Restrictions on the structure
    o Restrict the structure or restrict the query language
    o For efficiency
  Integration with text
    o Which is of secondary importance: structure (in IR) or text (in DB)?
    o Combine
  Query language
    o Standardization, expressiveness taxonomy, categorization
20 Query protocols
  Used internally
  Standard: one client can query different libraries
    o In CD-ROMs, disk interchangeability
  Z39.50: bibliographic (used for other types, too)
  WAIS (Wide Area Information Servers)
    o Includes Z39.50
  For CD-ROMs:
    o CCL, Common Command Language
    o CD-RDx (Compact Disk Read only Data Exchange)
    o SFQL (Structured Full-text Query Language). Like DB.
21 Types of queries we have discussed
22 Trends and research topics
  Models: to better understand the user needs
  Query languages: flexibility, power, expressiveness, functionality
  Visual languages
    o Example: library shown on the screen. Act: take books, open catalogs, etc.
    o Better Boolean queries: I need books by Cervantes AND Lope de Vega?!
23 Conclusions
  Breadth-wise:
    o words, phrases, proximity, fuzzy Boolean, natural language
  Depth-wise:
    o Pattern matching
  If queries return sets, they can be combined using the Boolean model
  Combining with structure
    o Hierarchical structure
  Standardized low-level languages: protocols
    o Reusable
24 Thank you!
  Till October 16
  October 23: midterm exam