1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.

2 Course Administration

3 Collection-level metadata
Several of the most difficult fields to extract automatically are the same across all pages in a web site. Therefore, create a collection record manually and combine it with automatic extraction of the other fields at item level.
For the CS 430 home page, collection-level metadata:
See: Jenkins and Inman

4 Collection-level metadata
Compare:
  (a) Metadata extracted automatically by DC-dot
  (b) Collection-level record
  (c) Combined item-level record (DC-dot plus collection-level)
  (d) Manual record

6 Metadata extracted automatically by DC-dot
D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
subject                    not included in this slide
publisher                  Corporation for National Research Initiatives
date         W3CDTF
type         DCMIType      Text
format                     text/html
format                     bytes
identifier

7 Collection-level record
D.C. Field   Qualifier     Content
publisher                  Corporation for National Research Initiatives
type                       article
type         resource      work
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn
language                   English
rights                     Permission is hereby given for the material in D-Lib Magazine to be used for...

8 Combined item-level record (DC-dot plus collection-level)
D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
publisher                  (*) Corporation for National Research Initiatives
date         W3CDTF
type                       (*) article
type         resource      (*) work
type         DCMIType      Text
format                     text/html
format                     bytes
(*) indicates collection-level metadata
continued on next slide

9 Combined item-level record (DC-dot plus collection-level)
D.C. Field   Qualifier     Content
relation     rel-type      (*) InSerial
relation     serial-name   (*) D-Lib Magazine
relation     issn          (*)
language                   (*) English
rights                     (*) Permission is hereby given for the material in D-Lib Magazine to be used for...
identifier
(*) indicates collection-level metadata
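
The combination of an automatically extracted record with a collection-level record can be sketched in Python. This is a minimal sketch, not the method used by DC-dot or by Jenkins and Inman: the record layout (a dict keyed by `(field, qualifier)`) and the function name `combine_records` are assumptions made for illustration, and the rule that a collection-level entry replaces an extracted entry with the same key is inferred from the `publisher (*)` row. Fields whose values are not shown on the slides are left out.

```python
# Hypothetical record layout: {(field, qualifier): content}.
# Only values that appear on the slides are included.
item_record = {  # extracted automatically at item level
    ("title", ""): "Digital Libraries and the Problem of Purpose",
    ("publisher", ""): "Corporation for National Research Initiatives",
    ("type", "DCMIType"): "Text",
    ("format", ""): "text/html",
}

collection_record = {  # created manually, once for the whole site
    ("publisher", ""): "Corporation for National Research Initiatives",
    ("type", ""): "article",
    ("type", "resource"): "work",
    ("relation", "rel-type"): "InSerial",
    ("relation", "serial-name"): "D-Lib Magazine",
    ("language", ""): "English",
}


def combine_records(item, collection):
    """Merge two records; collection-level entries win on a key clash.

    Returns {(field, qualifier): (content, from_collection)} so that
    collection-level rows can be marked with (*) as on the slides.
    """
    combined = {key: (content, False) for key, content in item.items()}
    for key, content in collection.items():
        combined[key] = (content, True)
    return combined


combined = combine_records(item_record, collection_record)
for (field, qualifier), (content, from_collection) in combined.items():
    marker = "(*)" if from_collection else "   "
    print(f"{field:<10} {qualifier:<12} {marker} {content}")
```

Keying on the field and qualifier together lets the three `type` rows coexist, which matches the combined record on the slides.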

10 Manually created record
D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
creator                    (+) David M. Levy
publisher                  Corporation for National Research Initiatives
date         publication   January 2000
type                       article
type         resource      work
(+) entry that is not in the automatically generated records
continued on next slide

11 Manually created record
D.C. Field   Qualifier     Content
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn
relation     volume        (+) 6
relation     issue         (+) 1
identifier   DOI           (+) /january2000-levy
identifier   URL
language                   English
rights                     (+) Copyright (c) David M. Levy
(+) entry that is not in the automatically generated records

12 SMART System
An experimental system for automatic information retrieval:
  automatic indexing to assign terms to documents and queries
  collecting related documents into common subject classes
  identifying documents to be retrieved by calculating similarities between documents and queries
  procedures for producing an improved search query based on information obtained from earlier searches
Gerald Salton and colleagues (Harvard, Cornell)

13 Vector Space Methods
Problem: Given two text documents, how similar are they? (One document may be a query.)
Vector space methods that measure similarity do not assume exact matches.
Benefits of similarity measures rather than exact matches:
  Encourage long queries, which are rich in information. An abstract should be very similar to its source document.
  Accept probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.

14 Vector space revision
$x = (x_1, x_2, x_3, \dots, x_n)$ is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem):
  $|x|^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2$
If $x_1$ and $x_2$ are vectors:
Inner product (or dot product) is given by:
  $x_1 \cdot x_2 = x_{11} x_{21} + x_{12} x_{22} + x_{13} x_{23} + \dots + x_{1n} x_{2n}$
Cosine of the angle between the vectors $x_1$ and $x_2$:
  $\cos(\theta) = \frac{x_1 \cdot x_2}{|x_1| \, |x_2|}$
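
The three definitions on this slide translate directly into a few lines of Python. This is a minimal sketch using plain lists as vectors; the function names `length`, `dot` and `cosine` and the sample vectors are chosen here for illustration and do not come from the slides.

```python
import math


def length(x):
    """Euclidean length |x| (Pythagoras extended to n dimensions)."""
    return math.sqrt(sum(xi * xi for xi in x))


def dot(x, y):
    """Inner (dot) product of two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y; undefined for a zero vector."""
    denom = length(x) * length(y)
    if denom == 0:
        raise ValueError("cosine is undefined for a zero-length vector")
    return dot(x, y) / denom


a = [3.0, 4.0]  # illustrative vectors, not from the slides
b = [4.0, 3.0]
print(length(a))     # 5.0
print(dot(a, b))     # 24.0
print(cosine(a, b))  # 24 / 25, approximately 0.96
```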

15 Vector Space Methods: Concept
An n-dimensional space, where n is the total number of different terms used to index a set of documents.
Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document.
Similarity between two documents is measured by the angle between their vectors: the smaller the angle, the more similar the documents.

16 Three terms represented in 3 dimensions
[Figure: axes t1, t2, t3; document vectors d1 and d2]

17 Example 1: Incidence array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox
terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
d1      1    1    0    0    0    0    0    0   √2
d2      1    1    0    1    0    0    0    1   √4
d3      0    0    1    1    1    1    1    0   √5
Weights: t_ij = 1 if document i contains term j and zero otherwise

18 Example 1 (continued)
Similarity of documents in example:
      d1    d2    d3
d1    1     0.71  0
d2    0.71  1     0.22
d3    0     0.22  1
Similarity measures the occurrences of terms, but no other characteristics of the documents.
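
With 0/1 weights the cosine can be computed directly from term sets: the inner product is the number of shared terms, and each length is the square root of the number of distinct terms in the document. The sketch below recomputes the similarities for Example 1 that way. The function name `binary_cosine` is illustrative only.

```python
import math

# Terms in each document of Example 1 (repeats collapse in a set).
docs = {
    "d1": {"ant", "bee"},  # from "ant ant bee"
    "d2": {"bee", "hog", "ant", "dog"},
    "d3": {"cat", "gnu", "dog", "eel", "fox"},
}


def binary_cosine(a, b):
    """Cosine similarity of two documents under 0/1 incidence weights."""
    if not a or not b:
        raise ValueError("cosine is undefined for an empty document")
    shared = len(a & b)  # inner product of the two incidence vectors
    return shared / (math.sqrt(len(a)) * math.sqrt(len(b)))


names = sorted(docs)
for i in names:
    row = "  ".join(f"{binary_cosine(docs[i], docs[j]):.2f}" for j in names)
    print(i, row)
# d1 1.00  0.71  0.00
# d2 0.71  1.00  0.22
# d3 0.00  0.22  1.00
```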

19 Example 2: Frequency array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox
terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
d1      2    1    0    0    0    0    0    0   √5
d2      1    1    0    1    0    0    0    1   √4
d3      0    0    1    1    1    1    1    0   √5
Weights: t_ij = frequency that term j occurs in document i

20 Example 2 (continued)
Similarity of documents in example:
      d1    d2    d3
d1    1     0.67  0
d2    0.67  1     0.22
d3    0     0.22  1
Similarity depends upon the weights given to the terms.
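
Example 2 can be checked with sparse term-frequency vectors. The sketch below uses `collections.Counter` from the Python standard library to count terms, so only non-zero weights are stored. The function name `tf_cosine` is an illustrative choice.

```python
import math
from collections import Counter

texts = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Term-frequency vectors stored sparsely: only non-zero weights are kept.
vectors = {name: Counter(text.split()) for name, text in texts.items()}


def tf_cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    # A Counter returns 0 for a missing term, so absent terms add nothing.
    inner = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for an empty document")
    return inner / (norm_u * norm_v)


print(f"{tf_cosine(vectors['d1'], vectors['d2']):.2f}")  # 0.67
print(f"{tf_cosine(vectors['d1'], vectors['d3']):.2f}")  # 0.00
print(f"{tf_cosine(vectors['d2'], vectors['d3']):.2f}")  # 0.22
```

The only change from the incidence weighting is that `ant` counts twice in d1, which lowers the d1/d2 similarity from 0.71 to 0.67.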

21 Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as:
  if term j does not occur in document i, t_ij = 0
  if term j occurs in document i, t_ij is greater than zero (the value of t_ij is called the weight of term j in document i)
Similarity between d_i and d_j is defined as:
  $\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik} t_{jk}}{|d_i| \, |d_j|}$

22 Simple use of vector similarity
Threshold: for query q, retrieve all documents with similarity more than 0.50.
Ranking: for query q, return the n most similar documents, ranked in order of similarity.
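
Both uses can be sketched on top of a cosine function. In the sketch below the documents are the three from the earlier examples with frequency weights; the query `ant dog` and the function names `retrieve_by_threshold` and `retrieve_top_n` are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (0.0 if either is empty)."""
    inner = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return inner / norm if norm else 0.0


def retrieve_by_threshold(query, docs, threshold=0.50):
    """Return names of all documents whose similarity exceeds the threshold."""
    return [name for name, vec in docs.items() if cosine(query, vec) > threshold]


def retrieve_top_n(query, docs, n):
    """Return the n most similar (name, similarity) pairs, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


docs = {
    "d1": Counter("ant ant bee".split()),
    "d2": Counter("bee hog ant dog".split()),
    "d3": Counter("cat gnu dog eel fox".split()),
}
query = Counter("ant dog".split())  # hypothetical query, not from the slides

print(retrieve_by_threshold(query, docs))                    # ['d1', 'd2']
print([name for name, _ in retrieve_top_n(query, docs, 2)])  # ['d2', 'd1']
```

The threshold version returns an unordered set of matches whose size depends on the query, while the ranking version always returns at most n documents in similarity order.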

23 Contrast with Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all:
  Encourages short queries
  Requires precise choice of index terms
  Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
  Encourages long queries to have as many dimensions as possible
  Benefits from large numbers of index terms
  Benefits from queries with many terms, not all of which need match the document

24 Document vectors as points on a surface
Normalize all document vectors to be of length 1.
Then the ends of the vectors all lie on a surface with unit radius.
For similar documents, we can represent parts of this surface as a flat region.
Similar documents are represented as points that are close together on this surface.
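
A short sketch shows why closeness on the unit surface corresponds to similarity. After normalization the dot product of two vectors is their cosine, and the squared straight-line distance between their end points equals 2 - 2 cos, so a higher cosine means a shorter distance. The function names are illustrative; the two vectors are the frequency vectors for d1 and d2 from Example 2.

```python
import math


def normalize(x):
    """Scale x to unit length so its end lies on the unit sphere."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [xi / norm for xi in x]


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def distance(x, y):
    """Straight-line (chord) distance between two points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Frequency vectors over the terms ant bee cat dog eel fox gnu hog.
d1 = normalize([2, 1, 0, 0, 0, 0, 0, 0])
d2 = normalize([1, 1, 0, 1, 0, 0, 0, 1])

cos = dot(d1, d2)     # for unit vectors the dot product is the cosine
print(round(cos, 2))  # 0.67
# For unit vectors, distance^2 = 2 - 2*cos.
print(math.isclose(distance(d1, d2) ** 2, 2 - 2 * cos))  # True
```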

25 Relevance feedback (concept)
[Figure: hits from original search, with the original query and the reformulated query marked]
  x  documents identified as non-relevant
  o  documents identified as relevant
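
The slide gives only the picture: the reformulated query lies closer to the documents identified as relevant and further from those identified as non-relevant. One common way to realise this is a Rocchio-style update, sketched below. The formula, the function name `reformulate`, the weights `alpha`, `beta` and `gamma`, and the toy vectors are all assumptions for illustration; the slide does not say which method is used.

```python
def centroid(vectors, dims):
    """Component-wise mean of a list of dense vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dims
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]


def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).

    Negative weights are clipped to zero, a common convention (assumed here).
    """
    dims = len(query)
    rel = centroid(relevant, dims)
    non = centroid(non_relevant, dims)
    return [
        max(0.0, alpha * query[k] + beta * rel[k] - gamma * non[k])
        for k in range(dims)
    ]


# Toy 3-term space; all vectors here are made up for illustration.
q = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

new_q = reformulate(q, relevant, non_relevant)
print(new_q)  # [1.75, 0.75, 0.0]
```

The new query gains weight on the second term, which occurs only in the relevant documents, and gains none on the third term, which occurs only in the non-relevant one.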

26 Document clustering (concept)
[Figure: documents plotted as points (x), grouped into clusters]
Document clusters are a form of automatic classification.
A document may be in several clusters.