Information Retrieval: aka “Google-lite” CMSC 16100 November 27, 2006.

Roadmap
● Information Retrieval (IR)
  – Goal: Match Information Need to Document Concept
  – Solution: Vector Space Model
    ● Representation of Documents and Queries
    ● Computing Similarity
● Implementation:
  – Indexing: Documents -> Vectors
  – Query Construction: Query -> Vector
  – Retrieval: Finding “Best” match: Query/Document

The Information Retrieval Task
● Goal:
  – Match the information need expressed by user
    ● (the Query)
  – With concepts in documents
    ● (the Document collection)
● Issues:
  – How do we represent documents and queries?
  – How do we know if they're “similar”? Match?

Vector Space Model
● Represent documents and queries with
  – Pattern of words
    ● I.e., queries and documents that share many words are similar
  – Vector of word occurrences:
  – Each position in vector = word
    ● Value of position x in vector = # times word x occurs
● Similarity:
  – Dot product of document vector & query vector
  – Biggest wins

Vector Space Model
[Figure: vector space with axes Computer, Tv, Program]
● Two documents: computer program, tv program
● Query:
  – computer program: matches 1st doc exactly: distance = 2 vs 0
  – educational program: matches both equally: distance = 1
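The retrieval code later in the deck calls dot-product without defining it. A minimal sketch of such a helper follows, assuming both vectors have the same length. The vectors doc1, doc2, and query are illustrative encodings of the example above under an assumed position order (computer, tv, program). Note that this computes the dot-product similarity, where bigger is better, which is a different measure from the distance values quoted on the slide.

```scheme
;; dot-product: (vectorof num) (vectorof num) -> num
;; Sums the pairwise products of two equal-length vectors.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; Assumed vocabulary positions: 0 = computer, 1 = tv, 2 = program
(define doc1 (vector 1 0 1))   ; "computer program"
(define doc2 (vector 0 1 1))   ; "tv program"
(define query (vector 1 0 1))  ; "computer program"

(dot-product doc1 query)  ; => 2
(dot-product doc2 query)  ; => 1
```

With "biggest wins", the first document is returned for this query because its score of 2 beats the second document's score of 1.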

Information Retrieval in Scheme
● Representation:
  – A vector-rep is (vectorof number)
  – (define-struct doc-rep (id vec))
  – A doc is (make-doc-rep id vec)
    ● Where id: symbol; vec: vector-rep
  – A doc-index is (listof doc)
  – A query is vector-rep
● A simple-web-page (swp) is:
  ● (make-swp header body)
  ● Where (define-struct swp (header body)); header: symbol; body: (listof symbol)

Three Steps to IR
● Three phases:
  – Indexing: Build collection of document representations
    ● Convert web pages to doc-rep
      – Vectors of word counts
  – Query construction:
    ● Convert query text to vector of word counts
  – Retrieval:
    ● Compute similarity between query and doc representation
    ● Return closest match

Words-to-vector
(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds one to the count of each word in wlist; mutates and returns wvec
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (let ((cur-count (vector-ref wvec wpos)))
             (vector-set! wvec wpos (+ cur-count 1))
             (words-to-vector (cdr wlist) wvec))))))

(define (posn wd dict)
  ;; Returns the vector position of wd; dict is a list of entries
  ;; read with map-wd (the word) and map-num (its position)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
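The slides never show how dict, map-wd, or map-num are defined. One possible representation, offered only as an assumption, is a list of (word . position) pairs. The helper make-dict is hypothetical and not part of the original deck.

```scheme
;; Assumed representation: each dictionary entry is a (word . position) pair.
(define (map-wd entry) (car entry))   ; the word of an entry
(define (map-num entry) (cdr entry))  ; its position in the count vector

;; make-dict (hypothetical helper): (listof symbol) -> (listof (symbol . num))
;; Assigns positions 0, 1, 2, ... to the words in the order given.
(define (make-dict words)
  (let loop ((ws words) (i 0))
    (if (null? ws)
        '()
        (cons (cons (car ws) i)
              (loop (cdr ws) (+ i 1))))))

(define dict (make-dict '(computer tv program)))
(define dict-size (length dict))

dict                     ; => ((computer . 0) (tv . 1) (program . 2))
(map-num (cadr dict))    ; => 1
```

Because posn signals an error for a word missing from the dictionary, every word in a page or query would have to be in the list passed to make-dict under this sketch.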

Indexing
(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-header (car swp-list))
                             (words-to-vector (swp-body (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))

Query Construction
(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))

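Retrieval
(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query; index must be non-empty
  (define (score doc) (dot-product (doc-rep-vec doc) query))
  ;; Walk the index, keeping the highest-scoring document seen so far
  (let loop ((best (car index)) (others (cdr index)))
    (cond ((null? others) (doc-rep-id best))
          ((> (score (car others)) (score best))
           (loop (car others) (cdr others)))
          (else (loop best (cdr others))))))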
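To see the three phases working together, here is a condensed end-to-end sketch, assuming Racket (the successor of PLT Scheme, which provides define-struct and argmax). It is a compact variant, not the code from the slides: the dictionary contents, the association-list lookup in posn, the list-based dot-product, and the page ids page1 and page2 are all illustrative assumptions.

```scheme
#lang racket
;; Condensed sketch of indexing, query construction, and retrieval.
(define-struct doc-rep (id vec))
(define-struct swp (header body))

;; Assumed dictionary: word -> position in the count vector
(define dict '((computer . 0) (tv . 1) (program . 2)))
(define dict-size (length dict))

;; posn: symbol -> num; errors on words outside the dictionary
(define (posn wd)
  (let ((entry (assq wd dict)))
    (if entry (cdr entry) (error "missing word"))))

;; Counts each word of wlist into wvec (mutated and returned)
(define (words-to-vector wlist wvec)
  (for-each (lambda (wd)
              (let ((i (posn wd)))
                (vector-set! wvec i (+ 1 (vector-ref wvec i)))))
            wlist)
  wvec)

(define (dot-product v1 v2)
  (apply + (map * (vector->list v1) (vector->list v2))))

;; Indexing: web pages -> document representations
(define (build-index pages)
  (map (lambda (p)
         (make-doc-rep (swp-header p)
                       (words-to-vector (swp-body p)
                                        (make-vector dict-size 0))))
       pages))

;; Query construction: word list -> count vector
(define (build-query wlist)
  (words-to-vector wlist (make-vector dict-size 0)))

;; Retrieval: id of the document with the largest dot product
(define (retrieve query index)
  (doc-rep-id
   (argmax (lambda (doc) (dot-product (doc-rep-vec doc) query)) index)))

(define index
  (build-index (list (make-swp 'page1 '(computer program))
                     (make-swp 'page2 '(tv program)))))

(retrieve (build-query '(computer program)) index)  ; => 'page1
(retrieve (build-query '(tv)) index)                ; => 'page2
```

When several documents tie, argmax returns the first of them, so a query of just program would return page1 here.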