
1 Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined
– the goal is to return a subset of those documents
– individual documents have been separated into individual files
The remaining components must parse, index, find, and rank documents. The traditional approach is based on the words in the documents.

2 Extracting Lexical Features
Process a string of characters:
– assemble characters into tokens
– choose tokens to index
– in place (a problem for the WWW)
This is a standard lexical analysis problem, so a lexical analyser generator, such as lex, can be used.

3 Lexical Analyser
The basic idea is a finite state machine: triples of (input state, transition token, output state). It must be very efficient; it gets used a LOT.
[State diagram: states 0, 1, and 2, with transitions on blank, A-Z, and blank/EOF]
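A minimal sketch of such a machine in Python, mirroring the three-state diagram above (the function name and the folding of the accept state back into the start state are illustrative choices, not from the slides):

```python
# Minimal sketch of the diagrammed machine: state 0 skips blanks,
# state 1 accumulates A-Z characters, and a blank or EOF closes the
# token (the diagram's accept state, folded back into state 0 here).
def tokenize(text):
    tokens = []
    current = []
    state = 0
    for ch in text + " ":              # trailing blank stands in for EOF
        if state == 0:
            if ch.isalpha():           # A-Z: start a new token
                current.append(ch)
                state = 1
        elif state == 1:
            if ch.isalpha():           # A-Z: stay inside the token
                current.append(ch)
            else:                      # blank/EOF: emit the token
                tokens.append("".join(current))
                current = []
                state = 0
    return tokens

print(tokenize("Selecting Relevant Documents"))
# ['Selecting', 'Relevant', 'Documents']
```

A table-driven version (an explicit dictionary of the triples) would match the slide's formulation even more literally; the if/else form is used here only for brevity.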

4 Design Issues for Lexical Analyser
Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
Case:
– fold?
Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?
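These choices are natural to expose as switches on the analyser. A hedged sketch (the flag names are invented for illustration):

```python
import re

# Illustrative sketch: the slide's three design choices as flags.
def lex(text, fold_case=True, keep_digits=True, punct_as_space=True):
    if punct_as_space:
        text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> whitespace
    if fold_case:
        text = text.lower()                    # case folding
    # If digits are kept, they assemble into tokens; otherwise they
    # act like punctuation and separate the alphabetic tokens.
    pattern = r"[A-Za-z0-9]+" if keep_digits else r"[A-Za-z]+"
    return re.findall(pattern, text)

print(lex("IR in 2024: tokens, case-folding, digits!"))
# ['ir', 'in', '2024', 'tokens', 'case', 'folding', 'digits']
```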

5 Lexical Analyser
The output of the lexical analyser is a string of tokens, and the remaining operations are all on these tokens. We have already thrown away some information; this makes processing more efficient, but limits the power of our search.

6 Stemming
Additional processing at the token level: turn words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
This decreases the total number of different tokens to be processed. It decreases the precision of a search, but increases its recall.

7 Stemming -- How?
Plurals to singulars (e.g., children to child); verbs to infinitives (e.g., talked to talk). Clearly non-trivial in English! Typical stemmers use a context-sensitive transformation grammar, e.g.:
– (.*)SSES -> \1SS
50-250 rules are typical.
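A hedged sketch of a few such rules in Python; this is a toy subset in the spirit of the slide's example, not a real 50-250 rule grammar:

```python
import re

# A handful of suffix-rewrite rules, applied first-match-wins.
# Only the first rule comes from the slide; the rest are
# illustrative additions.
RULES = [
    (r"(.*)sses$",  r"\1ss"),   # the slide's rule: caresses -> caress
    (r"(.*)ies$",   r"\1y"),    # ponies -> pony
    (r"(.+)ing$",   r"\1"),     # walking -> walk
    (r"(.+)ed$",    r"\1"),     # walked -> walk
    (r"(.+[^s])s$", r"\1"),     # cars -> car (leaves 'ss' endings alone)
]

def stem(word):
    for pattern, replacement in RULES:
        new, n = re.subn(pattern, replacement, word)
        if n:
            return new
    return word

for w in ["caresses", "ponies", "walked", "cars", "children"]:
    print(w, "->", stem(w))
# Irregular forms such as "children" pass through unchanged; they
# need explicit rules of their own.
```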

8 Noise Words (Stop Words)
Function words that contribute little or nothing to meaning; very frequent words.
– If a word occurs in every document, it is not useful in choosing among documents.
– However, we need to be careful, because this is corpus-dependent.
Often implemented as a discrete list (stop.wrd on CD).
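A minimal sketch of stop-word removal (the stop list below is a tiny illustrative sample, not the stop.wrd list the slide mentions):

```python
# Tiny illustrative stop list; a real, corpus-tuned list such as
# stop.wrd is much larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "goal", "is", "to", "return", "documents"]))
# ['goal', 'return', 'documents']
```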

9 Example Corpora
We are assuming a fixed corpus. The text uses two sample corpora:
– AI abstracts
– email (anyone's email)
Both have textual fields and structured attributes.
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content

10 Structured Attributes for AI Theses
– Thesis #
– Author
– Year
– University
– Advisor
– Language

11 Textual Fields for AIT
Abstract:
– reasonably complete standard academic English capturing the basic meaning of the document
Title:
– short, formalized; captures the most critical part of the meaning
– (a proxy for the abstract)

12 Indexing
We have a tokenized, stemmed sequence of words. The next step is to parse each document, extracting index terms.
– Assume that each token is a word and we don't want to recognize any structures more complex than single words.
When all documents are processed, create the index.

13 Basic Indexing Algorithm
For each document in the corpus:
– get the next token
– save the posting in a list: doc ID, frequency
For each token found in the corpus:
– calculate #docs and total frequency
– sort by frequency
(pp. 53-54)
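A hedged sketch of this algorithm in Python, reading a posting as a (doc ID, frequency) pair; this is one plausible rendering of the slide, not the book's implementation on pp. 53-54:

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """corpus: dict mapping doc_id -> list of stemmed tokens."""
    # postings[token] is a list of (doc_id, frequency) pairs.
    postings = defaultdict(list)
    for doc_id, tokens in corpus.items():
        for token, freq in Counter(tokens).items():
            postings[token].append((doc_id, freq))
    # Per-token statistics: #docs and total frequency, sorted by
    # total frequency as on the slide.
    stats = sorted(
        ((tok, len(p), sum(f for _, f in p)) for tok, p in postings.items()),
        key=lambda row: row[2],
        reverse=True,
    )
    return postings, stats

corpus = {
    1: ["select", "relevant", "document"],
    2: ["document", "corpus", "document"],
}
postings, stats = build_index(corpus)
print(postings["document"])   # [(1, 1), (2, 2)]
print(stats[0])               # ('document', 2, 3)
```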

14 Fine Points
– Dynamic corpora
– Higher-resolution data (e.g., character position)
– Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
– Document-type-specific processing:
  – in HTML, we want to ignore tags
  – in email, we may want to ignore quoted material
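As an illustration of document-type-specific processing, a hedged sketch of stripping HTML tags and email-quoted lines before tokenizing (the regex is deliberately naive; a real system would use a proper HTML parser):

```python
import re

def strip_html_tags(html):
    # Naive tag removal, for illustration only.
    return re.sub(r"<[^>]+>", " ", html)

def drop_quoted_lines(email_body):
    # Drop lines starting with '>', the usual email quoting convention.
    return "\n".join(
        line for line in email_body.splitlines()
        if not line.lstrip().startswith(">")
    )

print(strip_html_tags("<p>Selecting <b>relevant</b> documents</p>"))
print(drop_quoted_lines("I agree.\n> earlier quoted text"))
```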

15 Choosing Keywords
We don't necessarily want to index on every word:
– takes more space for the index
– takes more processing time
– may not improve our resolving power
How do we choose keywords?
– manually
– statistically
Exhaustivity vs. specificity.

16 Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants.
– "best" match
– captures new terms easily
– easiest for the person choosing keywords
Constrained vocabulary: hand-crafted ontologies.
– can include hierarchical and other relations
– more consistent
– easier for searching; possible "magic bullet" search

17 Examples of Constrained Vocabularies
– ACM headings
– Medline Subject Headings

18 Automated Vocabulary Selection
Frequency: Zipf's Law.
– Within one corpus, words with middle frequencies are typically "best".
Document-oriented representation bias: lots of keywords per document.
Query-oriented representation bias: only the "most typical" words; assumes that we are comparing across documents.
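A hedged sketch of the middle-frequency heuristic; the 20%/80% rank cut-offs are invented for illustration, since the slide gives no thresholds:

```python
from collections import Counter

def middle_frequency_keywords(tokens, low=0.2, high=0.8):
    """Keep words whose rank lies in the middle band of the
    frequency-sorted vocabulary (cut-offs are illustrative)."""
    counts = Counter(tokens)
    ranked = [w for w, _ in counts.most_common()]  # most frequent first
    lo, hi = int(len(ranked) * low), int(len(ranked) * high)
    return set(ranked[lo:hi])

tokens = "the the the car car index index stem zipf".split()
print(middle_frequency_keywords(tokens))
# {'car', 'index', 'stem'}: the top-ranked 'the' and the tail are cut
```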

19 Choosing Keywords
"Best" depends on actual use: if a word occurs in only one document, it may be very good for retrieving that document, but not very effective overall. Words which have no resolving power within a corpus may be the best choices across corpora. This is not very important for web searching; it will be more relevant for text mining.

20 Keyword Choice for the WWW
– We don't have a fixed corpus of documents.
– New terms appear fairly regularly, and are likely to be common search terms.
– Queries that people want to make are wide-ranging and unpredictable.
Therefore: we can't limit keywords, except possibly to eliminate stop words. Even stop words are language-dependent, so determine the language first.

