Information Retrieval: Indexing
Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)

Roadmap
What is a document?
Representing the content of documents
– Luhn's analysis
– Generation of document representatives
– Weighting
Inverted files

Indexing Language
Language used to describe documents and queries
Index terms
– selected subset of words
– derived from the text or arrived at independently
Keyword searching
– Statistical analysis of documents based on word occurrence frequency
– Automated, efficient and potentially inaccurate
Searching using controlled vocabularies
– More accurate results, but time-consuming if documents are manually indexed

Luhn's analysis
Resolving power of significant words:
– ability of words to discriminate document content
– peaks at the rank-order position halfway between the two cut-offs

Generating document representatives

Input text: full text, abstract, title
Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in the input text
A document is indexed by a class name if one of its significant words occurs as a member of that class
Phases:
– identify words: lexical analysis (tokenising)
– removal of high-frequency words
– suffix stripping (stemming)
– detecting equivalent stems
– thesauri
– others (noun phrase, noun group, logical formula, structure)
– index structure creation

Process View
Document -> Lexical analysis -> Stopword removal -> Stemming -> Indexing features

Lexical Analysis
The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms)
– requires decisions on how to treat digits, hyphens, punctuation marks and the case of the letters
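A minimal sketch of lexical analysis in Python, assuming one particular policy that the slides leave open: letters are lowercased, digits are kept, and hyphens and punctuation act as separators. The function name `tokenize` is illustrative only.

```python
import re


def tokenize(text: str) -> list[str]:
    """Turn a character stream into lowercase candidate index terms.

    Assumed policy: fold case, keep digits, and treat hyphens and
    punctuation marks as word separators.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("State-of-the-art IR systems, circa 1999."))
```

A different policy (for example keeping hyphenated words whole) would only change the regular expression.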

Stopword Removal
Removal of high-frequency words
– list of stop words (implements Luhn's upper cut-off)
– filters out words with very low discrimination values for retrieval purposes
– example: “been”, “a”, “about”, “otherwise”
– compare the input text with the stop list
– reduction in text size: between 30 and 50 per cent
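Comparing the input text with a stop list can be sketched as a simple filter. The stop list below is a tiny illustrative set (the four examples from the slide plus “of” and “the”), not a real list, and the name `remove_stopwords` is an assumption.

```python
# Tiny illustrative stop list; real lists contain a few hundred words.
STOP_WORDS = frozenset({"a", "about", "been", "of", "otherwise", "the"})


def remove_stopwords(tokens: list[str], stop_words: frozenset[str] = STOP_WORDS) -> list[str]:
    """Keep only the tokens that are not on the stop list."""
    return [t for t in tokens if t not in stop_words]


tokens = ["the", "destruction", "of", "the", "amazon", "rain", "forests"]
print(remove_stopwords(tokens))
```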

Conflation
Conflation reduces word variants to a single form
– similar words generally have similar meanings
– retrieval effectiveness increases if the query is expanded with words similar in meaning to those it originally contained
A stemming algorithm is a conflation procedure
– reduces all words with the same root to a single root

Different forms: stemming
Stemming
– matching the query term “forests” to “forest” and “forested”
– “choke”, “choking”, “choked”
Suffix removal
– removal of suffixes, e.g. “worker”
– Porter algorithm: remove longest suffix
Porter algorithm
– based on heuristic rules, so errors occur: “equal” -> “eq”
– more effective than ordinary word forms
Detecting equivalent stems
– example: ABSORB- and ABSORPT-
Stemmers remove affixes
– prefixes? e.g. “megavolt”

Plural stemmer
Plurals in English
– if word ends in “ies” but not “eies” or “aies”: “ies” -> “y”
– if word ends in “es” but not “aes”, “ees” or “oes”: “es” -> “e”
– if word ends in “s” but not “us” or “ss”: “s” -> “”
– the first applicable rule is the one used
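The three rules translate almost directly into code. This sketch assumes that a rule counts as "applicable" only when both its suffix test and its exception test pass; the function name `plural_stem` is illustrative.

```python
def plural_stem(word: str) -> str:
    """Apply the first applicable plural rule; return the word unchanged otherwise."""
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


for w in ["forests", "ponies", "horses", "trees", "glass", "corpus"]:
    print(w, "->", plural_stem(w))
```

Note that “trees” fails the exception test of rule 2 and is then handled by rule 3.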

Processing “The destruction of the amazon rain forests”
Case normalisation
Stop word removal
– from fixed list
– “destruction amazon rain forests”
Suffix removal (stemming)
– “destruct amazon rain forest”

Thesauri
A collection of terms along with some structure or relationships between them (scope notes etc.)
1. provide a standard vocabulary for indexing and searching
2. assist the user in locating terms for proper query formulation
3. provide a classification hierarchy for broadening and narrowing the current query according to user need
Relationships:
– Equivalence: synonyms, preferred terms
– Hierarchical: broader/narrower terms (BT/NT)
– Association: related terms across the hierarchy (RT)

Thesauri Examples: WordNet

Faceted Classification

Thesauri Examples: AAT Art and Architecture Thesaurus

Hierarchical Classifications
Alphanumeric coding schemes
Subject classifications
A taxonomy that represents a classification or kind-of hierarchy
Examples: Dewey Decimal, AAT, SHIC, ICONCLASS
  41A32 Door
  41A322 Closing the Door
  41A323 Monumental Door
  41A324 Metalwork of a Door
  41A3241 Door-Knocker
  41A325 Threshold
  41A327 Door-keeper, houseguard
Relations mixed within the one hierarchy: kind of a door, something attached to a door, action associated with a door

Terminology/Controlled vocabulary
The descriptors from a thesaurus form a controlled vocabulary
Normalise indexing concepts
Identification of indexing concepts with clear semantics
Retrieval based on concepts rather than terms
Good for specific domains (e.g., medical)
Problematic for general domains (large, new, dynamic)

No One Classification

Generating document representatives: Outcome
Class: words with the same stem
Class name: stem
Document representative: list of class names (index terms or keywords)
The same process is applied to the query

Precision and Recall
Precision
– ratio of the number of relevant documents retrieved to the total number of documents retrieved
– the proportion of hits that are relevant
Recall
– ratio of the number of relevant documents retrieved to the total number of relevant documents
– the proportion of relevant documents that are hits

Precision and Recall
[Figure: retrieved documents and relevant documents as regions of the document space, shown for four cases: low precision/low recall, low precision/high recall, high precision/low recall, high precision/high recall]

Precision and Recall
The user is not usually given the whole answer set A at once
The documents in A are ranked by degree of relevance, and the user examines them in rank order
Recall and precision vary as the user proceeds through the answer set A
Notation: A = retrieved documents (answer set), R = relevant documents, RA = relevant documents in the answer set, all within the information space
$$\mathrm{Precision} = \frac{|RA|}{|A|} \qquad \mathrm{Recall} = \frac{|RA|}{|R|}$$
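To see how the two measures change as the user works down a ranking, the sketch below computes them at each cut-off. The document identifiers and relevance judgements are made up for illustration, and the ranking is assumed to contain no duplicates.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a retrieved list against a relevant set."""
    hits = sum(1 for doc in retrieved if doc in relevant)  # |RA|
    precision = hits / len(retrieved) if retrieved else 0.0  # |RA| / |A|
    recall = hits / len(relevant) if relevant else 0.0       # |RA| / |R|
    return precision, recall


ranking = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical ranked answer set A
relevant = {"d3", "d1", "d8"}             # hypothetical relevant set R

# Precision and recall after the user has examined the top k documents.
for k in range(1, len(ranking) + 1):
    p, r = precision_recall(ranking[:k], relevant)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```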

Precision and Recall Trade-Off
Increasing the number of documents retrieved
– is likely to retrieve more of the relevant documents and thus increase recall
– but typically retrieves more inappropriate documents and thus decreases precision
[Figure: trade-off curve with axes Recall and Precision, each up to 100%]

Index term weighting
Effectiveness of an indexing language:
Exhaustivity
– number of different topics indexed
– high exhaustivity: high recall and low precision
Specificity
– ability of the indexing language to describe topics precisely
– high specificity: high precision and low recall

Index term weighting
Exhaustivity
– related to the number of index terms assigned to a given document
Specificity
– number of documents to which a term is assigned in a collection
– related to the distribution of index terms in the collection
Index term weighting
– index term frequency: occurrence frequency of a term in a document
– document frequency: number of documents in which a term occurs

IR as Clustering
A query is a vague specification of a set of objects, A
IR is reduced to the problem of determining which documents are in set A and which are not
Intra-cluster similarity:
– which features best describe the objects in A?
Inter-cluster dissimilarity:
– which features best distinguish the objects in A from the remaining objects in C?
[Figure: A (retrieved documents) drawn as a subset of C (document collection)]

Index term weighting
Weight(t,d) = tf(t,d) x idf(t)
N: number of documents in the collection
n(t): number of documents in which term t occurs
idf(t): inverse document frequency
occ(t,d): number of occurrences of term t in document d
t_max: term in document d with the highest occurrence
tf(t,d): term frequency of t in document d

Index term weighting
Intra-cluster similarity
– the raw frequency of a term t inside a document d
– a measure of how well the term describes the document contents
Inter-cluster dissimilarity
– inverse document frequency
– inverse of the frequency of a term t among the documents in the collection
– terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one
Normalised frequency of term t in document d:
$$\mathrm{tf}(t,d) = \frac{\mathrm{occ}(t,d)}{\mathrm{occ}(t_{\max},d)}$$
Inverse document frequency:
$$\mathrm{idf}(t) = \log\frac{N}{n(t)}$$
Weight:
$$\mathrm{Weight}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$$
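A minimal sketch of this weighting, assuming documents are already reduced to lists of index terms and that the logarithm is base 10 (the slides do not state the base on this slide). The function names and the toy collection are illustrative.

```python
import math
from collections import Counter


def tf(term: str, doc: list[str]) -> float:
    """occ(t,d) divided by the occurrence count of the most frequent term in d."""
    counts = Counter(doc)
    return counts[term] / max(counts.values()) if counts else 0.0


def idf(term: str, collection: list[list[str]]) -> float:
    """log10(N / n(t)); defined here as 0 when the term occurs in no document."""
    n_t = sum(1 for doc in collection if term in doc)
    return math.log10(len(collection) / n_t) if n_t else 0.0


def weight(term: str, doc: list[str], collection: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, collection)


# Toy collection of three already-processed documents.
docs = [["moon", "car", "moon"], ["car", "truck"], ["car"]]
print(weight("moon", docs[0], docs))  # frequent in the document, rare in the collection
print(weight("car", docs[0], docs))   # occurs in every document, so idf is 0
```

A term that occurs in every document gets weight 0 regardless of how often it occurs, which is the inter-cluster dissimilarity idea in numeric form.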

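Term weighting schemes
Best known (term frequency x inverse document frequency):
$$\mathrm{weight}(t,d) = \frac{\mathrm{occ}(t,d)}{\mathrm{occ}(t_{\max},d)} \times \log\frac{N}{n(t)}$$
Variation for query term weights:
$$\mathrm{weight}(t,q) = \left(0.5 + \frac{0.5\,\mathrm{occ}(t,q)}{\mathrm{occ}(t_{\max},q)}\right) \times \log\frac{N}{n(t)}$$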
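The query-side variation can be sketched as follows, again assuming base-10 logarithms. The function name, its parameters and the example numbers are hypothetical.

```python
import math
from collections import Counter


def query_weight(term: str, query: list[str], n_docs: int, doc_freq: int) -> float:
    """(0.5 + 0.5 * occ(t,q) / occ(t_max,q)) * log10(N / n(t)).

    Returns 0 when the term is absent from the query or from the collection.
    """
    counts = Counter(query)
    if counts[term] == 0 or doc_freq <= 0:
        return 0.0
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    return tf_part * math.log10(n_docs / doc_freq)


# Hypothetical query and collection statistics: N = 100, n("car") = 10.
print(query_weight("car", ["moon", "moon", "car"], n_docs=100, doc_freq=10))
```

The 0.5 offset keeps the term-frequency factor between 0.5 and 1, so a query term that occurs only once is not penalised heavily.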

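Example
Term occurrences in the document:
  Nuclear 7
  Computer 9
  Poverty 5
  Unemployment 1
  Luddites 3
  Machines 19
  People 25
  And 49
The highest occurrence used for normalisation is 25 (People), so “And” is evidently treated as a stop word; N = 100 and logarithms are base 10.
Weight(machine) = 19/25 x log(100/50) = 0.76 x 0.30103 = 0.228783
Weight(luddite) = 3/25 x log(100/2) = 0.12 x 1.69897 = 0.2038764
Weight(poverty) = 5/25 x log(100/2) = 0.2 x 1.69897 = 0.339794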
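The arithmetic in this example can be checked with a few lines of Python. The sketch assumes base-10 logarithms, N = 100, and the document frequencies 50, 2 and 2 that the slide's own calculations use; “and” is left out on the assumption that it is a stop word.

```python
import math

# Occurrence counts from the example ("and" omitted as an assumed stop word).
occ = {"nuclear": 7, "computer": 9, "poverty": 5, "unemployment": 1,
       "luddites": 3, "machines": 19, "people": 25}
N = 100                                                 # documents in the collection
doc_freq = {"machines": 50, "luddites": 2, "poverty": 2}  # n(t) used on the slide

max_occ = max(occ.values())  # 25, from "people"
weights = {t: occ[t] / max_occ * math.log10(N / n_t) for t, n_t in doc_freq.items()}

for term, w in weights.items():
    print(f"{term}: {w:.6f}")
```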

Inverted Files
Word-oriented mechanism for indexing text collections to speed up searching
Searching:
– vocabulary search (query terms)
– retrieval of occurrences
– manipulation of occurrences

Original Document view
      cosmonaut  astronaut  moon  car  truck
D1        1          0        1    1     1
D2        0          1        1    0     0
D3        0          0        0    1     1

Inverted view
             D1  D2  D3
cosmonaut     1   0   0
astronaut     0   1   0
moon          1   1   0
car           1   0   1
truck         1   0   1

Inverted index
cosmonaut -> D1
astronaut -> D2
moon -> D1, D2
car -> D1, D3
truck -> D1, D3
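Going from the document view to the inverted index is a single pass over the documents. A minimal sketch, using the three-document example from these slides; the function name `build_inverted_index` is illustrative.

```python
from collections import defaultdict

# Document view: each document lists the terms it contains.
docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}


def build_inverted_index(docs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each term to the sorted list of documents that contain it."""
    index: dict[str, list[str]] = defaultdict(list)
    for doc_id in sorted(docs):           # visit documents in identifier order
        for term in set(docs[doc_id]):    # record each term once per document
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, "->", ", ".join(index[term]))
```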

Inverted File
The speed of retrieval is maximised by considering only those terms that have been specified in the query.
This speed is achieved only at the cost of very substantial storage and processing overheads.

Components of an inverted file
Header information
Index entry: term, field type, frequency, pointer
Postings file entry: document number, frequency

Producing an Inverted file
Term-document matrix (1 = term occurs in the document) and the postings derived from it:
Term    Doc: 1 2 3 4 5 6 7 8   Postings
aid          0 0 0 1 0 0 0 1   4, 8
all          0 1 0 1 0 1 0 0   2, 4, 6
back         1 0 1 0 0 0 1 0   1, 3, 7
brown        1 0 1 0 1 0 1 0   1, 3, 5, 7
come         0 1 0 1 0 1 0 1   2, 4, 6, 8
dog          0 0 1 0 1 0 0 0   3, 5
fox          0 0 1 0 1 0 1 0   3, 5, 7
good         0 1 0 1 0 1 0 1   2, 4, 6, 8
jump         0 0 1 0 0 0 0 0   3
lazy         1 0 1 0 1 0 1 0   1, 3, 5, 7
men          0 1 0 1 0 0 0 1   2, 4, 8
now          0 1 0 0 0 1 0 1   2, 6, 8
over         1 0 1 0 1 0 1 1   1, 3, 5, 7, 8
party        0 0 0 0 0 1 0 1   6, 8
quick        1 0 1 0 0 0 0 0   1, 3
their        1 0 0 0 1 0 1 0   1, 5, 7
time         0 1 0 1 0 1 0 0   2, 4, 6
[Figure also shows a letter index over the terms: A, B, C, D, F, G, J, L, M, N, O, P, Q, T, then AI, AL, BA, BR, TH, TI]

An Inverted file
Term     Postings
aid      4, 8
all      2, 4, 6
back     1, 3, 7
brown    1, 3, 5, 7
come     2, 4, 6, 8
dog      3, 5
fox      3, 5, 7
good     2, 4, 6, 8
jump     3
lazy     1, 3, 5, 7
men      2, 4, 8
now      2, 6, 8
over     1, 3, 5, 7, 8
party    6, 8
quick    1, 3
their    1, 5, 7
time     2, 4, 6
[Figure also shows a letter index over the terms: A, B, C, D, F, G, J, L, M, N, O, P, Q, T, then AI, AL, BA, BR, TH, TI]

Searching Algorithm
For each document D, Score(D) = 0
For each query term:
– search the vocabulary list
– pull out the postings list
– for each document J in the list, Score(J) = Score(J) + 1
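The algorithm above can be sketched in a few lines. The index here is a hand-picked fragment of the eight-document inverted file from the earlier slides; the function name `search` and the tie-breaking by document number are assumptions.

```python
from collections import defaultdict

# Fragment of an inverted file: term -> postings list of document numbers.
index = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
}


def search(query_terms: list[str], index: dict[str, list[int]]) -> list[tuple[int, int]]:
    """Score each document by the number of query terms it contains."""
    scores: dict[int, int] = defaultdict(int)   # Score(D) = 0 for every document
    for term in query_terms:
        for doc_id in index.get(term, []):      # vocabulary lookup, then postings
            scores[doc_id] += 1                 # Score(J) = Score(J) + 1
    # Highest score first; ties broken by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search(["quick", "brown", "fox"], index))
```

Only documents that appear in some postings list are ever touched, which is why the inverted file makes retrieval fast.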

What Goes in a Postings File?
Boolean retrieval
– just the document number
Ranked retrieval
– document number and term weight (TF*IDF, ...)
Proximity operators
– word offsets for each occurrence of the term
– example: Doc 3 (t17, t36), Doc 13 (t3, t45)
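For proximity operators, each posting carries the word offsets of the term. The sketch below builds such a positional index and answers a simple "within k words" query; the documents, the function names and the `within` operator itself are illustrative assumptions.

```python
from collections import defaultdict


def build_positional_index(docs: dict[int, list[str]]) -> dict[str, dict[int, list[int]]]:
    """Map term -> {document number -> list of word offsets}."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for offset, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def within(index, a: str, b: str, k: int) -> list[int]:
    """Documents in which terms a and b occur at most k words apart."""
    postings_a, postings_b = index.get(a, {}), index.get(b, {})
    result = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(i - j) <= k for i in postings_a[doc_id] for j in postings_b[doc_id]):
            result.append(doc_id)
    return sorted(result)


# Hypothetical documents, already tokenised.
docs = {3: ["rain", "forest", "fire"], 13: ["forest", "of", "the", "amazon", "rain"]}
index = build_positional_index(docs)
print(within(index, "rain", "forest", 1))
print(within(index, "rain", "forest", 4))
```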

How Big Is the Postings File?
Very compact for Boolean retrieval
– about 10% of the size of the documents, if an aggressive stopword list is used
Not much larger for ranked retrieval
– perhaps 20%
Enormous for proximity operators
– sometimes larger than the documents
But access is fast: you know where to look

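[Figure: overall process. Documents -> tokenize -> stop word removal -> stemming -> indexing features, stored in an inverted index (Term 1, Term 2, Term 3 pointing to documents di, dj, dk). Query -> tokenize -> stop word removal -> stemming -> query features. Matching the query features against the index yields a list of documents with scores, ranked so that s1 > s2 > s3 > ...]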

Similarity Matching
The process in which we compute the relevance of a document for a query
A similarity measure comprises
– a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance
– a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
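The slides do not name a particular similarity coefficient. As one simple possibility, assumed here purely for illustration, the sketch below uses the inner product of the query and document weight vectors and ranks documents by it. All weights and names are hypothetical.

```python
def inner_product(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Sum of products of the weights of the terms shared by query and document."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())


def rank(query_weights: dict[str, float],
         docs: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Score every document against the query and sort by descending score."""
    scored = [(doc_id, inner_product(query_weights, w)) for doc_id, w in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical term weights, e.g. as produced by a tf x idf scheme.
query = {"moon": 1.0, "car": 0.5}
docs = {
    "D1": {"moon": 0.5, "car": 0.5},
    "D2": {"moon": 1.0},
    "D3": {"car": 1.0},
}
print(rank(query, docs))
```

Other coefficients (for example ones that normalise by vector length) would slot into the same `rank` function in place of `inner_product`.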
