Information Retrieval: aka “Google-lite” CMSC 16100 November 27, 2006.

1 Information Retrieval: aka “Google-lite” CMSC 16100 November 27, 2006

2 Roadmap
● Information Retrieval (IR)
  – Goal: Match Information Need to Document Concept
  – Solution: Vector Space Model
    ● Representation of Documents and Queries
    ● Computing Similarity
● Implementation:
  – Indexing: Documents -> Vectors
  – Query Construction: Query -> Vector
  – Retrieval: Finding “Best” match: Query/Document

3 The Information Retrieval Task
● Goal:
  – Match the information need expressed by user
    ● (the Query)
  – With concepts in documents
    ● (the Document collection)
● Issues:
  – How do we represent documents and queries?
  – How do we know if they're “similar”? Match?

4 Vector Space Model
● Represent documents and queries with
  – Pattern of words
    ● I.e., queries and documents with lots of the same words
  – Vector of word occurrences:
  – Each position in vector = word
    ● Value of position x in vector = # times word x occurs
● Similarity:
  – Dot product of document vector & query vector
  – Biggest wins
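The similarity measure on this slide can be sketched in a few lines. The slides call `dot-product` but never define it, so the definition below is an assumption, and the three count vectors are made-up values over a hypothetical 3-word dictionary. The `#lang racket` line assumes a modern Racket installation.

```racket
#lang racket
;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the pairwise products of two equal-length vectors.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; Hypothetical word-count vectors (one position per dictionary word)
(define doc-a (vector 2 0 1))
(define doc-b (vector 0 3 1))
(define query (vector 1 0 1))

(dot-product doc-a query)  ; 3
(dot-product doc-b query)  ; 1
```

Under "biggest wins", `doc-a` would be the better match for this query, since it shares more word occurrences with it.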

5 Vector Space Model
[Figure: vector space with axes Computer, Tv, Program]
● Two documents: computer program, tv program
● Query:
  – computer program: matches 1st doc: exact: distance=2 vs 0
  – educational program: matches both equally: distance=1

6 Information Retrieval in Scheme
● Representation:
  – A vector-rep is (vectorof number)
  – (define-struct doc-rep (id vec))
  – A doc is (make-doc-rep id vec)
    ● Where id: symbol; vec: vector-rep
  – A doc-index is (listof doc)
  – A query is vector-rep
● A simple-web-page (swp) is:
  ● (make-swp h b)
  ● Where (define-struct swp (h b)); h: symbol; b: (listof symbol)
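As a rough illustration of these data definitions, the sketch below builds one page and one document representation by hand. The page name, its words, and the 3-word dictionary order (computer, tv, program) are hypothetical, and `#lang racket` assumes a modern Racket installation.

```racket
#lang racket
(define-struct doc-rep (id vec))   ; id: symbol, vec: vector-rep
(define-struct swp (h b))          ; h: symbol, b: (listof symbol)

;; Hypothetical page and its hand-built word-count vector over the
;; assumed dictionary order: computer, tv, program
(define page (make-swp 'doc1 '(computer program)))
(define doc (make-doc-rep 'doc1 (vector 1 0 1)))
(define doc-index (list doc))      ; a doc-index is (listof doc)

(swp-h page)        ; doc1
(swp-b page)        ; (computer program)
(doc-rep-id doc)    ; doc1
(doc-rep-vec doc)   ; #(1 0 1)
```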

7 Three Steps to IR
● Three phases:
  – Indexing: Build collection of document representations
    ● Convert web pages to doc-rep
      – Vectors of word counts
  – Query construction:
    ● Convert query text to vector of word counts
  – Retrieval:
    ● Compute similarity between query and doc representation
    ● Return closest match

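8 Words-to-vector
(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds one to wvec for each word in wlist (mutating wvec) and returns wvec
  ;; dict is the global dictionary; its entries are read with map-wd and map-num
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (let ((cur-count (vector-ref wvec wpos)))
             (vector-set! wvec wpos (+ cur-count 1))
             (words-to-vector (cdr wlist) wvec))))))

(define (posn wd dict)
  ;; posn: symbol dictionary -> number
  ;; Returns the vector position of wd; signals an error if wd is not in dict
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))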
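The slide does not show how the dictionary is built. The sketch below assumes each entry is a `(word . position)` pair and defines `map-wd` and `map-num` as accessors over such pairs; the three dictionary words and the names `dict-size` and `counts` are likewise illustrative. `#lang racket` assumes a modern Racket installation.

```racket
#lang racket
;; Assumption: a dictionary entry pairs a word with its vector position.
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))
(define dict (list (cons 'computer 0) (cons 'tv 1) (cons 'program 2)))
(define dict-size (length dict))

;; posn: symbol dictionary -> number
(define (posn wd dict)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let* ((wpos (posn (car wlist) dict))
                (cur-count (vector-ref wvec wpos)))
           (vector-set! wvec wpos (+ cur-count 1))
           (words-to-vector (cdr wlist) wvec)))))

;; "program" occurs twice, so its position ends up with a count of 2
(define counts
  (words-to-vector '(computer program program) (make-vector dict-size 0)))
counts  ; #(1 0 2)
```

Because `words-to-vector` mutates the vector it is given, each document or query needs its own fresh `(make-vector dict-size 0)`; reusing one vector would accumulate counts across calls.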

9 Indexing
(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-h (car swp-list))
                             (words-to-vector (swp-b (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))

10 Query Construction
(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))

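11 Retrieval
(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query; index must be non-empty
  ;; dot-product is assumed to be defined elsewhere
  (define (score doc)
    (dot-product (doc-rep-vec doc) query))
  (define (best-doc best others)
    (cond ((null? others) best)
          ((> (score (car others)) (score best))
           (best-doc (car others) (cdr others)))
          (else (best-doc best (cdr others)))))
  (doc-rep-id (best-doc (car index) (cdr index))))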
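To see the three phases working together, here is a minimal end-to-end sketch. It uses compact variants of the slides' functions rather than the slide code itself: the dictionary is assumed to be an association list, `posn` and `words-to-vector` take fewer arguments than on the slides, `dot-product` is an assumed definition, and the two sample pages are hypothetical. `#lang racket` assumes a modern Racket installation.

```racket
#lang racket
(define-struct doc-rep (id vec))
(define-struct swp (h b))

;; Assumption: the dictionary is an association list of (word . position)
(define dict '((computer . 0) (tv . 1) (program . 2)))
(define dict-size (length dict))

;; posn: symbol -> number (looks the word up in the global dict)
(define (posn wd)
  (let ((entry (assq wd dict)))
    (if entry (cdr entry) (error "missing word" wd))))

;; words-to-vector: (listof symbol) -> vector-rep (fresh vector per call)
(define (words-to-vector wlist)
  (let ((wvec (make-vector dict-size 0)))
    (for-each (lambda (wd)
                (let ((i (posn wd)))
                  (vector-set! wvec i (+ 1 (vector-ref wvec i)))))
              wlist)
    wvec))

;; Indexing: convert each page to a doc-rep
(define (build-index swp-list)
  (map (lambda (page)
         (make-doc-rep (swp-h page) (words-to-vector (swp-b page))))
       swp-list))

;; Assumed similarity measure: sum of pairwise products
(define (dot-product v1 v2)
  (apply + (map * (vector->list v1) (vector->list v2))))

;; Retrieval: id of the highest-scoring doc; assumes a non-empty index
(define (retrieve query index)
  (let loop ((best (car index)) (others (cdr index)))
    (cond ((null? others) (doc-rep-id best))
          ((> (dot-product (doc-rep-vec (car others)) query)
              (dot-product (doc-rep-vec best) query))
           (loop (car others) (cdr others)))
          (else (loop best (cdr others))))))

(define index
  (build-index (list (make-swp 'doc1 '(computer program))
                     (make-swp 'doc2 '(tv program)))))

(retrieve (words-to-vector '(computer program)) index)  ; doc1
(retrieve (words-to-vector '(tv)) index)                ; doc2
```

On ties this sketch keeps the earlier document in the index; a query containing a word outside the dictionary raises the "missing word" error instead of being ignored.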

