Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book

CSE 5331/7331 F07 2 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query: Find all documents about “data mining”.

CSE 5331/7331 F07 3 DB vs IR Records (tuples) vs. documents Well defined results vs. fuzzy results DB grew out of files and traditional business systesm IR grew out of library science and need to categorize/group/access books/articles

CSE 5331/7331 F07 4 DB vs IR (cont’d)  Data retrieval  which docs contain a set of keywords?  Well defined semantics  a single erroneous object implies failure!  Information retrieval  information about a subject or topic  semantics is frequently loose  small errors are tolerated  IR system:  interpret contents of information items  generate a ranking which reflects relevance  notion of relevance is most important

CSE 5331/7331 F07 5 Motivation  IR in the last 20 years:  classification and categorization  systems and languages  user interfaces and visualization  Still, area was seen as of narrow interest  Advent of the Web changed this perception once and for all  universal repository of knowledge  free (low cost) universal access  no central editorial board  many problems though: IR seen as key to finding the solutions!

CSE 5331/7331 F07 6 Basic Concepts Logical view of the documents Document representation viewed as a continuum: logical view of docs might shift structure Accents spacing stopwords Noun groups stemming Manual indexing Docs structureFull textIndex terms

CSE 5331/7331 F07 7 User Interface Text Operations Query Operations Indexing Searching Ranking Index Text query user need user feedback ranked docs retrieved docs logical view inverted file DB Manager Module Text Database Text The Retrieval Process

CSE 5331/7331 F07 8 IR is Fuzzy SimpleFuzzy Accept Reject

CSE 5331/7331 F07 10 Indexing IR systems usually adopt index terms to process queries Index term: a keyword or group of selected words any word (more general) Stemming might be used: connect: connecting, connection, connections An inverted file is built for the chosen index terms

CSE 5331/7331 F07 11 Indexing Docs Information Need Index Terms doc query Ranking match

CSE 5331/7331 F07 12 Inverted Files There are two main elements: vocabulary – set of unique terms Occurrences – where those terms appear The occurrences can be recorded as terms or byte offsets Using term offset is good to retrieve concepts such as proximity, whereas byte offsets allow direct access VocabularyOccurrences (byte offset) ……

CSE 5331/7331 F07 13 Inverted Files The number of indexed terms is often several orders of magnitude smaller when compared to the documents size (Mbs vs Gbs) The space consumed by the occurrence list is not trivial. Each time the term appears it must be added to a list in the inverted file That may lead to a quite considerable index overhead

CSE 5331/7331 F07 14 Example Text: Inverted file 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful beautiful flowers garden house 70 45, 58 18, 29 6 VocabularyOccurrences

CSE 5331/7331 F07 15 Ranking A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query A ranking is based on fundamental premisses regarding the notion of relevance, such as: common sets of index terms sharing of weighted terms likelihood of relevance Each set of premisses leads to a distinct IR model

CSE 5331/7331 F07 16 Classic IR Models - Basic Concepts Each document represented by a set of representative keywords or index terms An index term is a document word useful for remembering the document main themes Usually, index terms are nouns because nouns have meaning by themselves However, search engines assume that all words are index terms (full text representation)

CSE 5331/7331 F07 17 Classic IR Models - Basic Concepts The importance of the index terms is represented by weights associated to them k i - an index term d j - a document w ij - a weight associated with (k i,d j ) The weight w ij quantifies the importance of the index term for describing the document contents

CSE 5331/7331 F07 18 Classic IR Models - Basic Concepts t is the total number of index terms K = {k 1, k 2, …, k t } is the set of all index terms w ij >= 0 is a weight associated with (k i,d j ) w ij = 0 indicates that term does not belong to doc d j = (w 1j, w 2j, …, w tj ) is a weighted vector associated with the document d j g i (d j ) = w ij is a function which returns the weight associated with pair (k i,d j )

CSE 5331/7331 F07 19 The Boolean Model Simple model based on set theory Queries specified as boolean expressions precise semantics and neat formalism Terms are either present or absent. Thus, w ij  {0,1} Consider q = k a  (k b   k c ) q dnf = (1,1,1)  (1,1,0)  (1,0,0) q cc = (1,1,0) is a conjunctive component

CSE 5331/7331 F07 20 The Vector Model Use of binary weights is too limiting Non-binary weights provide consideration for partial matches These term weights are used to compute a degree of similarity between a query and each document Ranked set of documents provides for better matching

CSE 5331/7331 F07 21 The Vector Model w ij > 0 whenever k i appears in d j w iq >= 0 associated with the pair (k i,q) d j = (w 1j, w 2j,..., w tj ) q = (w 1q, w 2q,..., w tq ) To each term k i is associated a unitary vector i The unitary vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) The t unitary vectors i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors

CSE 5331/7331 F07 22 Query Languages Keyword Based Boolean Weighted Boolean Context Based (Phrasal & Proximity) Pattern Matching Structural Queries

CSE 5331/7331 F07 23 Keyword Based Queries Basic Queries Single word Multiple words Context Queries Phrase Proximity

CSE 5331/7331 F07 24 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.

CSE 5331/7331 F07 25 Boolean Retrieval with Inverted Indices Primitive keyword: Retrieve containing documents using the inverted index. OR: Recursively retrieve e 1 and e 2 and take union of results. AND: Recursively retrieve e 1 and e 2 and take intersection of results. BUT: Recursively retrieve e 1 and e 2 and take set difference of results.

CSE 5331/7331 F07 26 Phrasal Queries Retrieve documents with a specific phrase (ordered list of contiguous words) “information theory” May allow intervening stop words and/or stemming. “buy camera” matches: “buy a camera” “buying the cameras” etc.

CSE 5331/7331 F07 27 Phrasal Retrieval with Inverted Indices Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase.

CSE 5331/7331 F07 28 Proximity Queries List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words match “…dogs will begin the race…” May also perform stemming and/or not count stop words.

CSE 5331/7331 F07 29 Pattern Matching Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

CSE 5331/7331 F07 30 Simple Patterns Prefixes: Pattern that matches start of word. “anti” matches “antiquity”, “antibody”, etc. Suffixes: Pattern that matches end of word: “ix” matches “fix”, “matrix”, etc. Substrings: Pattern that matches arbitrary subsequence of characters. “rapt” matches “enrapture”, “velociraptor” etc. Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. “tin” to “tix” matches “tip”, “tire”, “title”, etc.

Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

Similar presentations

Presentation on theme: "Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

Similar presentations

Presentation on theme: "Information Retrieval Introduction/Overview Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto."— Presentation transcript:

Similar presentations

About project

Feedback