Published by Phoebe Hicks. Modified over 8 years ago.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods
2 Course Administration
3 Collection-level metadata Several of the most difficult fields to extract automatically are the same across all pages in a web site. Therefore create a collection record manually and combine it with automatic extraction of other fields at item level. For the CS 430 home page, collection-level metadata: See: Jenkins and Inman
4 Collection-level metadata
Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manual record
6 Metadata extracted automatically by DC-dot

D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
subject                    not included in this slide
publisher                  Corporation for National Research Initiatives
date         W3CDTF        2000-05-11
type         DCMIType      Text
format                     text/html
format                     27718 bytes
identifier                 http://www.dlib.org/dlib/january00/01levy.html
7 Collection-level record

D.C. Field   Qualifier     Content
publisher                  Corporation for National Research Initiatives
type                       article
type         resource      work
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn          1082-9873
language                   English
rights                     Permission is hereby given for the material in D-Lib Magazine to be used for...
8 Combined item-level record (DC-dot plus collection-level)

D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
publisher                  (*) Corporation for National Research Initiatives
date         W3CDTF        2000-05-11
type                       (*) article
type         resource      (*) work
type         DCMIType      Text
format                     text/html
format                     27718 bytes

(*) indicates collection-level metadata
(continued on next slide)
9 Combined item-level record (DC-dot plus collection-level), continued

D.C. Field   Qualifier     Content
relation     rel-type      (*) InSerial
relation     serial-name   (*) D-Lib Magazine
relation     issn          (*) 1082-9873
language                   (*) English
rights                     (*) Permission is hereby given for the material in D-Lib Magazine to be used for...
identifier                 http://www.dlib.org/dlib/january00/01levy.html

(*) indicates collection-level metadata
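The combination of a DC-dot record with a collection-level record can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (field, qualifier, content) tuple layout, the function name combine_records, and the rule that a collection-level entry replaces an item-level entry with the same field and qualifier are not taken from the lecture. The subject field is left out because its content is not shown on the slide.

```python
# Assumed layout: each entry is a (field, qualifier, content) tuple.
ITEM_LEVEL = [  # extracted automatically (DC-dot slide)
    ("title", "", "Digital Libraries and the Problem of Purpose"),
    ("publisher", "", "Corporation for National Research Initiatives"),
    ("date", "W3CDTF", "2000-05-11"),
    ("type", "DCMIType", "Text"),
    ("format", "", "text/html"),
    ("format", "", "27718 bytes"),
    ("identifier", "", "http://www.dlib.org/dlib/january00/01levy.html"),
]

COLLECTION_LEVEL = [  # created manually, once per collection
    ("publisher", "", "Corporation for National Research Initiatives"),
    ("type", "", "article"),
    ("type", "resource", "work"),
    ("relation", "rel-type", "InSerial"),
    ("relation", "serial-name", "D-Lib Magazine"),
    ("relation", "issn", "1082-9873"),
    ("language", "", "English"),
    ("rights", "", "Permission is hereby given for the material in "
                   "D-Lib Magazine to be used for..."),
]


def combine_records(item_entries, collection_entries):
    """Merge two records, tagging each entry with its source.

    Assumption: a collection-level entry takes precedence over any
    item-level entry that has the same (field, qualifier) pair.
    """
    collection_keys = {(f, q) for f, q, _ in collection_entries}
    combined = [(f, q, c, "collection") for f, q, c in collection_entries]
    combined += [
        (f, q, c, "item")
        for f, q, c in item_entries
        if (f, q) not in collection_keys
    ]
    return combined


combined = combine_records(ITEM_LEVEL, COLLECTION_LEVEL)
for field, qualifier, content, source in combined:
    marker = "(*)" if source == "collection" else "   "
    print(f"{field:<11}{qualifier:<13}{marker} {content}")
```

The printed rows carry the same (*) marker that the slides use for collection-level metadata, although the row order differs from the slides.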
10 Manually created record

D.C. Field   Qualifier     Content
title                      Digital Libraries and the Problem of Purpose
creator                    (+) David M. Levy
publisher                  Corporation for National Research Initiatives
date         publication   January 2000
type                       article
type         resource      work

(+) entry that is not in the automatically generated records
(continued on next slide)
11 Manually created record, continued

D.C. Field   Qualifier     Content
relation     rel-type      InSerial
relation     serial-name   D-Lib Magazine
relation     issn          1082-9873
relation     volume        (+) 6
relation     issue         (+) 1
identifier   DOI           (+) 10.1045/january2000-levy
identifier   URL           http://www.dlib.org/dlib/january00/01levy.html
language                   English
rights                     (+) Copyright (c) David M. Levy

(+) entry that is not in the automatically generated records
12 SMART System
An experimental system for automatic information retrieval:
- automatic indexing to assign terms to documents and queries
- collect related documents into common subject classes
- identify documents to be retrieved by calculating similarities between documents and queries
- procedures for producing an improved search query based on information obtained from earlier searches
Gerald Salton and colleagues: Harvard 1964-1968, Cornell 1968-1988
13 Vector Space Methods
Problem: Given two text documents, how similar are they? (One document may be a query.)
Vector space methods that measure similarity do not assume exact matches.
Benefits of similarity measures rather than exact matches:
- Encourage long queries, which are rich in information. An abstract should be very similar to its source document.
- Accept probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.
14 Vector space revision
$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.
The length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem):
$$|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$
If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors, the inner product (or dot product) is given by:
$$\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11} x_{21} + x_{12} x_{22} + x_{13} x_{23} + \cdots + x_{1n} x_{2n}$$
Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$:
$$\cos(\theta) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1|\,|\mathbf{x}_2|}$$
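The three definitions on this slide translate directly into code. The following is a minimal sketch; the function names and the sample vectors a and b are illustrative choices, not part of the lecture.

```python
import math


def length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))


def dot(x, y):
    """Inner (dot) product of two vectors of the same dimension."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y (both must be non-zero)."""
    return dot(x, y) / (length(x) * length(y))


a = [3.0, 4.0]
b = [4.0, 3.0]
print(length(a))     # 5.0
print(dot(a, b))     # 24.0
print(cosine(a, b))  # 0.96
```

The cosine is undefined when either vector has length zero, so a caller would need to handle empty documents or queries separately.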
15 Vector Space Methods: Concept
n-dimensional space, where n is the total number of different terms used to index a set of documents.
Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document.
Similarity between two documents is measured by the angle between their vectors: the smaller the angle (the larger its cosine), the more similar the documents.
16 Three terms represented in 3 dimensions
[Figure: document vectors d1 and d2 drawn against three term axes t1, t2, t3.]
17 Example 1: Incidence array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms   ant  bee  cat  dog  eel  fox  gnu  hog   length
d1       1    1    0    0    0    0    0    0     √2
d2       1    1    0    1    0    0    0    1     √4
d3       0    0    1    1    1    1    1    0     √5

Weights: t_ij = 1 if document i contains term j and zero otherwise
18 Example 1 (continued)
Similarity of documents in example:

        d1     d2     d3
d1      1      0.71   0
d2      0.71   1      0.22
d3      0      0.22   1

Similarity measures the occurrences of terms, but no other characteristics of the documents.
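The similarity values for Example 1 can be reproduced with a short script. This is a sketch only: the helper names incidence_vector and cosine are illustrative, and the terms are put in alphabetical order to match the column order of the incidence array.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Index terms in alphabetical order: ant bee cat dog eel fox gnu hog
terms = sorted({t for text in docs.values() for t in text.split()})


def incidence_vector(text):
    """Weight 1 if the document contains the term, 0 otherwise."""
    present = set(text.split())
    return [1 if t in present else 0 for t in terms]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


vectors = {name: incidence_vector(text) for name, text in docs.items()}
similarity = {
    (a, b): round(cosine(vectors[a], vectors[b]), 2)
    for a in vectors
    for b in vectors
}

print(similarity[("d1", "d2")])  # 0.71
print(similarity[("d2", "d3")])  # 0.22
print(similarity[("d1", "d3")])  # 0.0
```

For d1 and d2 the dot product is 2 (ant and bee are shared) and the lengths are √2 and √4, which gives 2 / (√2 × 2) ≈ 0.71.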
19 Example 2: frequency array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms   ant  bee  cat  dog  eel  fox  gnu  hog   length
d1       2    1    0    0    0    0    0    0     √5
d2       1    1    0    1    0    0    0    1     √4
d3       0    0    1    1    1    1    1    0     √5

Weights: t_ij = frequency that term j occurs in document i
20 Example 2 (continued)
Similarity of documents in example:

        d1     d2     d3
d1      1      0.67   0
d2      0.67   1      0.22
d3      0      0.22   1

Similarity depends upon the weights given to the terms.
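Example 2 can be checked the same way with term frequencies as weights. The sketch below stores each document as a sparse mapping from term to count rather than a full array; that representation and the helper names are assumptions made for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Sparse frequency vectors: only terms that occur are stored.
vectors = {name: Counter(text.split()) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    # Indexing a Counter with a missing term returns 0.
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


print(round(cosine(vectors["d1"], vectors["d2"]), 2))  # 0.67
print(round(cosine(vectors["d2"], vectors["d3"]), 2))  # 0.22
print(round(cosine(vectors["d1"], vectors["d3"]), 2))  # 0.0
```

Counting "ant" twice in d1 changes its vector to length √5, so the similarity with d2 becomes 3 / (√5 × 2) ≈ 0.67 instead of 0.71.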
21 Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as:
- if term j does not occur in document i, $t_{ij} = 0$
- if term j occurs in document i, $t_{ij}$ is greater than zero (the value of $t_{ij}$ is called the weight of term j in document i)
Similarity between $d_i$ and $d_j$ is defined as:
$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{|d_i|\,|d_j|}$$
22 Simple use of vector similarity
Threshold: For query q, retrieve all documents with similarity more than 0.50.
Ranking: For query q, return the n most similar documents ranked in order of similarity.
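Both uses can be sketched as small functions over a table of similarity scores that has already been computed for a query. The function names and the sample scores below are hypothetical; only the 0.50 threshold comes from the slide.

```python
def retrieve_by_threshold(scores, threshold=0.50):
    """Return every document whose similarity exceeds the threshold."""
    hits = [doc for doc, score in scores.items() if score > threshold]
    return sorted(hits, key=lambda doc: scores[doc], reverse=True)


def retrieve_top_n(scores, n):
    """Return the n most similar documents, most similar first."""
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)[:n]


# Hypothetical similarities between a query q and four documents.
scores = {"d1": 0.71, "d2": 0.22, "d3": 0.0, "d4": 0.55}

print(retrieve_by_threshold(scores))  # ['d1', 'd4']
print(retrieve_top_n(scores, 3))      # ['d1', 'd4', 'd2']
```

A threshold returns a variable number of documents, possibly none, while ranking always returns up to n documents even when their similarities are low.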
23 Contrast with Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all:
- Encourages short queries
- Requires precise choice of index terms
- Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
- Encourages long queries to have as many dimensions as possible
- Benefits from large numbers of index terms
- Benefits from queries with many terms, not all of which need match the document
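The contrast can be illustrated with the three documents from the earlier examples and a three-term query. This is a sketch under assumptions: the query, the function names, and the use of a Boolean AND over all query terms are illustrative choices. With incidence weights, the cosine reduces to the number of shared terms divided by the square root of the product of the two term counts.

```python
import math

docs = {
    "d1": {"ant", "bee"},
    "d2": {"bee", "hog", "ant", "dog"},
    "d3": {"cat", "gnu", "dog", "eel", "fox"},
}
query = {"ant", "bee", "cat"}


def boolean_and(query_terms, documents):
    """Boolean AND: a document matches only if it has every query term."""
    return [name for name, terms in documents.items() if query_terms <= terms]


def incidence_cosine(a, b):
    """Cosine similarity for 0/1 weights, computed on term sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


print(boolean_and(query, docs))  # [] : no document contains all three terms

ranked = sorted(
    docs, key=lambda name: incidence_cosine(query, docs[name]), reverse=True
)
print(ranked)  # ['d1', 'd2', 'd3']
```

The Boolean query returns nothing, while the similarity measure still ranks every document that shares at least one term with the query.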
24 Document vectors as points on a surface
Normalize all document vectors to be of length 1.
Then the ends of the vectors all lie on a surface with unit radius.
For similar documents, we can represent parts of this surface as a flat region.
Similar documents are represented as points that are close together on this surface.
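A small numerical check shows why closeness on the unit surface corresponds to similarity. For unit-length vectors the cosine is just the dot product, and the squared straight-line distance between the end points equals 2 − 2 cos θ, so a larger cosine means a shorter distance. The sketch below uses d1 and d2 from Example 2 restricted to the terms ant, bee, dog, hog; the function name normalize is an illustrative choice.

```python
import math


def normalize(x):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


# Frequency vectors over the terms (ant, bee, dog, hog).
d1 = normalize([2, 1, 0, 0])
d2 = normalize([1, 1, 1, 1])

cos_theta = sum(a * b for a, b in zip(d1, d2))
dist_sq = sum((a - b) ** 2 for a, b in zip(d1, d2))

print(round(cos_theta, 2))  # 0.67
# For unit vectors, squared distance and cosine carry the same information.
print(abs(dist_sq - (2 - 2 * cos_theta)) < 1e-12)  # True
```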
25 Relevance feedback (concept)
[Figure: hits from the original search, with documents identified as non-relevant marked x and documents identified as relevant marked o; the original query and the reformulated query are both shown.]
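The slide does not say how the query is reformulated. One well-known formulation from the SMART project is Rocchio's, which adds a fraction of the average relevant vector to the query and subtracts a fraction of the average non-relevant vector. The sketch below assumes that formulation; the function names, the default weights alpha, beta and gamma, and the clipping of negative weights to zero are all assumptions rather than details from the lecture.

```python
def centroid(vectors, n):
    """Component-wise mean of a list of n-dimensional vectors."""
    if not vectors:
        return [0.0] * n
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n)]


def reformulate(query, relevant, non_relevant,
                alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query reformulation (assumed, not from the slide).

    new_query = alpha * query + beta * mean(relevant)
                - gamma * mean(non_relevant)
    Negative term weights are clipped to zero.
    """
    n = len(query)
    rel = centroid(relevant, n)
    non = centroid(non_relevant, n)
    return [
        max(0.0, alpha * query[k] + beta * rel[k] - gamma * non[k])
        for k in range(n)
    ]


original = [1.0, 0.0, 0.0]
relevant_docs = [[1.0, 1.0, 0.0]]      # marked "o" by the user
non_relevant_docs = [[0.0, 0.0, 1.0]]  # marked "x" by the user

print(reformulate(original, relevant_docs, non_relevant_docs,
                  alpha=1.0, beta=0.5, gamma=0.5))  # [1.5, 0.5, 0.0]
```

In this toy case the reformulated query gains weight on the second term, which appears only in the relevant document, and stays at zero on the third term, which appears only in the non-relevant one.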
26 Document clustering (concept)
[Figure: documents plotted as points (x), grouped into clusters.]
Document clusters are a form of automatic classification.
A document may be in several clusters.