Presentation transcript: Text Databases

Slide 1: Text Databases
- Text types
  - Unstructured text
  - Semi-structured text
  - Structured text
- Query: the user wants to find documents related to a topic T
  - The search program tries to find the documents in the text database that contain the string T
- Two problems
  - Synonymy: the word T may not occur anywhere in a document D, even though D is in fact closely related to topic T
  - Polysemy: the same word may mean many different things in different contexts

Slide 2: We discuss
- Measures of performance of a text retrieval system
- Latent semantic indexing
- Telescopic-Vector trees for document retrieval

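Slide 3: Precision and Recall
- Precision: how many of the returned documents are relevant?
  - precision = |relevant ∩ returned| / |returned|
  - Slide example: (20+1)/( ) [denominator not preserved]
- Recall: how many of the relevant documents are returned?
  - recall = |relevant ∩ returned| / |relevant|
  - Slide example: (20+1)/( ) [denominator not preserved]
[Figure: sets labeled Relevant documents, Returned documents, and All documents]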
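
The two measures can be sketched in a few lines. This is only an illustration: the document ids and the function name `precision_recall` are made up here, not taken from the slides.

```python
def precision_recall(relevant, returned):
    """Return (precision, recall) for two collections of document ids."""
    relevant, returned = set(relevant), set(returned)
    hits = len(relevant & returned)  # documents that are both relevant and returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 documents are relevant, 5 are returned, 3 are both.
relevant_docs = {"d1", "d2", "d3", "d4"}
returned_docs = {"d2", "d3", "d4", "d7", "d9"}
p, r = precision_recall(relevant_docs, returned_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

Returning every document in the archive drives recall to 1 but precision toward 0, which is why the two numbers are reported together.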

Slide 4: Some Concepts
- Stop list
  - A set of words that do not "discriminate" between the documents in a given archive
  - E.g., the Cornell SMART system has about 440 words on its stop list
- Word stems
  - Many words are small syntactic variants of each other
  - E.g., drug, drugged, and drugs are similar in the sense that they share a common "stem," the word drug
- Most document retrieval systems first eliminate words on stop lists and reduce words to their stems before creating a frequency table
- Frequency tables

Slide 5: Some Concepts
- Frequency tables
  - D is a set of N documents
  - T is a set of M words/terms occurring in the documents of D
  - Assume no words on the stop list for D occur in T and all words in T have been stemmed
  - The frequency table FreqT is an (M × N) matrix such that FreqT(i,j) equals the number of occurrences of the word t_i in the document d_j

Doc  String
d1   Sex, Drugs and Videotape
d2   The Iranian Connection
d3   Boating and Drugs: Slips owned by Cartel
d4   Connections between Terrorism and Asian Dope Operations

[Table: FreqT with rows drug, boat, iran, connection and columns d1 to d4; cell values not preserved]
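
A minimal sketch of building such a table from the four document strings on the slide. The stop list and the suffix-stripping stemmer below are hypothetical stand-ins chosen only to handle these strings; a real system would use a full stop list and a proper stemming algorithm, and the counts printed here are what this sketch computes, not necessarily the values on the original slide.

```python
import re
from collections import Counter

# Hypothetical stop list and suffixes, for illustration only.
STOP_WORDS = {"and", "the", "by", "between"}
SUFFIXES = ("ing", "ian", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_freq_table(docs, terms):
    """Return {term: [count in each document]}, one row per term."""
    counts = [Counter(tokenize(text)) for text in docs.values()]
    return {t: [c[t] for c in counts] for t in terms}

docs = {
    "d1": "Sex, Drugs and Videotape",
    "d2": "The Iranian Connection",
    "d3": "Boating and Drugs: Slips owned by Cartel",
    "d4": "Connections between Terrorism and Asian Dope Operations",
}
terms = ["drug", "boat", "iran", "connection"]
freq_t = build_freq_table(docs, terms)
for term, row in freq_t.items():
    print(f"{term:<12}", row)
```

Under these assumptions d4 gets a count of 0 for drug even though it mentions "Dope", which is the synonymy problem in miniature.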

Slide 6: Similarity
- d1 and d2 are similar because the distribution of the words in d1 mirrors the distribution of words in d2
  - Both contain lots of occurrences of t1 and t4, relatively few occurrences of t2 and t3, and moderately many occurrences of t5
- d3 and d5 are also similar
- d4 and d6 stand out as sharply different
[Table: FreqT with rows t1 to t5 and columns d1 to d6; cell values not preserved]

Slide 7: Similarity
- Is merely counting words enough?
  - It does not indicate the importance of the words
  - What about document lengths?
- We should also include the importance of the word in the document. How?
  - A word that occurs 3 times in a 100-word document may be more significant than one that occurs 3 times in a million-word document
  - Use the ratio of the number of occurrences of a word to the total number of words in the document
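
The ratio can be written as a one-line helper. The function name `relative_frequency` is an assumption made for this sketch.

```python
def relative_frequency(occurrences, total_words):
    """Occurrences of a word divided by the document's total word count."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return occurrences / total_words

short_doc = relative_frequency(3, 100)        # 3 hits in a 100-word document
long_doc = relative_frequency(3, 1_000_000)   # 3 hits in a million-word document
print(short_doc > long_doc)  # True
```

The same raw count of 3 yields a weight 10,000 times larger in the short document, which matches the intuition on the slide.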

Slide 8: Queries
- The user wants to execute the query: find the 25 documents that are maximally relevant with respect to banking operations and drugs
- After stemming, the relevant keywords are "drug" and "bank"
- Represent the query Q as a vector vec_Q
- We want to find the columns in FreqT that are as close as possible to vec_Q
- Closeness metrics
  - Term distance (between Q and d_r): $\sqrt{\sum_{j=1}^{M} \left(\mathrm{vec}_Q(j) - \mathrm{FreqT}(j,r)\right)^2}$
  - Cosine distance: $\dfrac{\sum_{j=1}^{M} \mathrm{vec}_Q(j) \times \mathrm{FreqT}(j,r)}{\sqrt{\sum_{j=1}^{M} \mathrm{vec}_Q(j)^2} \times \sqrt{\sum_{j=1}^{M} \mathrm{FreqT}(j,r)^2}}$ (larger values mean a closer match)
- Complexity of retrieval may be O(N × M), which could be very large (latent semantic indexing is a solution)
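
Both metrics can be sketched directly from their definitions. The 3-term, 3-document table and the query vector below are hypothetical; the loop over all N columns, each costing O(M), is the O(N × M) scan the slide refers to.

```python
import math

def term_distance(q, d):
    """Euclidean distance between the query vector and a document column."""
    return math.sqrt(sum((qj - dj) ** 2 for qj, dj in zip(q, d)))

def cosine(q, d):
    """Cosine of the angle between q and d; larger means closer."""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

def column(freq_t, r):
    """Column r of a table stored as a list of term rows."""
    return [row[r] for row in freq_t]

# Hypothetical table: rows are the terms drug, bank, boat; columns are 3 documents.
freq_t = [
    [2, 0, 1],
    [1, 0, 3],
    [0, 4, 0],
]
vec_q = [1, 1, 0]  # query containing "drug" and "bank"

ranked = sorted(range(3), key=lambda r: cosine(vec_q, column(freq_t, r)), reverse=True)
print(ranked)  # [0, 2, 1]
```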

Slide 9: Latent Semantic Indexing
- The number of terms M and the number of documents N are very large
  - M could be over 10,000,000 (English words, proper nouns)
- LSI tries to find a relatively small subset of K words which discriminate between the N documents in the archive
  - LSI is claimed to work effectively for around K = 200
- Advantage: each document is now a column vector of length 200 instead of length M
- But how do we find such a subset of K words?
  - A technique called singular value decomposition (SVD)

Slide 10: LSI
- The four-step approach used by LSI
  - Table creation: create the frequency matrix FreqT
  - SVD construction: compute the singular value decomposition (A,S,B) of FreqT
  - Vector identification: for each document d, let vec(d) be the set of all terms in FreqT whose corresponding rows have not been eliminated in the singular matrix S
  - Index creation: store the set of all vec(d)'s, indexed by any one of a number of techniques (such as a TV-tree)

Slide 11: Singular Value Decomposition
- Let M_1 and M_2 be two matrices of order (m_1 × n_1) and (m_2 × n_2), respectively
  - M_1 × M_2 is well defined iff n_1 = m_2
- The transpose of M is written M^T
- A vector is a matrix of order (1 × m)
- [Transpose example not preserved]

Slide 12: Singular Value Decomposition
- Two vectors X and Y of the same order are said to be orthogonal iff X^T Y = 0
  - Example: X = [10, 5, 20] and Y = [1, 2, -1] are orthogonal because 10·1 + 5·2 + 20·(-1) = 0
- A matrix M is orthogonal iff M^T × M is the identity matrix
  - [Example orthogonal matrix M not preserved]
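
Both orthogonality tests are easy to check numerically. The vectors are the ones from the slide; the 2 × 2 rotation matrix is a hypothetical example added here, since the slide's own matrix is not available.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product of a and b, both given as lists of rows."""
    b_cols = transpose(b)
    return [[dot(row, col) for col in b_cols] for row in a]

def is_orthogonal_matrix(m, tol=1e-9):
    """True iff M^T x M equals the identity matrix, up to tol."""
    product = matmul(transpose(m), m)
    n = len(product)
    return all(
        abs(product[i][j] - (1.0 if i == j else 0.0)) <= tol
        for i in range(n)
        for j in range(n)
    )

x = [10, 5, 20]
y = [1, 2, -1]
print(dot(x, y))  # 0

rotation = [[0, -1], [1, 0]]  # hypothetical example: a 90-degree rotation
print(is_orthogonal_matrix(rotation))  # True
```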

Slide 13: Singular Value Decomposition
- A matrix M is said to be diagonal iff the order of M is (m × m) and, for all 1 ≤ i, j ≤ m, i ≠ j implies M(i,j) = 0
  - A and B are diagonal, but C is not
- A diagonal matrix M of order (m × m) is said to be non-decreasing iff, for all 1 ≤ i, j ≤ m, i ≤ j implies M(i,i) ≤ M(j,j)
  - A is a non-decreasing diagonal matrix, but B is not
- [Example matrices A, B, and C not preserved]

Slide 14: SVD
- A singular value decomposition of FreqT is a triple (A,S,B) where:
  1. FreqT = (A × S × B^T)
  2. A is an (M × M) orthogonal matrix such that A^T A = I
  3. B is an (N × N) orthogonal matrix such that B^T B = I
  4. S is a diagonal matrix called a singular matrix
- Theorem: given any matrix M of order (m × m), it is possible to find a singular value decomposition (A,S,B) of M such that S is a non-decreasing diagonal matrix
- [Example: a matrix and its SVD; matrices not preserved] Here the singular values are 5, and the singular matrix S is non-decreasing
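
The four conditions can be checked numerically. This sketch assumes NumPy is available and uses a hypothetical 4-term by 3-document table; note that `numpy.linalg.svd` returns the singular values as a 1-D array sorted from largest to smallest, so the diagonal matrix S is assembled by hand here.

```python
import numpy as np

# Hypothetical frequency table: 4 terms (rows) by 3 documents (columns).
freq_t = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
m, n = freq_t.shape

# full_matrices=True yields A of shape (M, M) and B^T of shape (N, N).
a, singular_values, b_t = np.linalg.svd(freq_t, full_matrices=True)

# Place the singular values on the diagonal of an (M, N) matrix S
# so that the product A x S x B^T is defined.
s = np.zeros((m, n))
k = min(m, n)
s[:k, :k] = np.diag(singular_values)

print(np.allclose(a @ s @ b_t, freq_t))      # True: FreqT = A x S x B^T
print(np.allclose(a.T @ a, np.eye(m)))       # True: A^T A = I
print(np.allclose(b_t @ b_t.T, np.eye(n)))   # True: B^T B = I
```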

Slide 15: Returning to LSI
- Given a frequency matrix FreqT, we can decompose it by SVD into T S D^T, where S is non-decreasing
- If FreqT is of size (M × N), then T is of size (M × R), S is of order (R × R), where R is the rank of FreqT, and D^T is of order (R × N)
- We can now shrink the problem substantially by eliminating the least significant singular values from the singular matrix S
  - Choose an integer k that is substantially smaller than R
  - Replace S by S*, which is a (k × k) matrix such that S*(i,j) = S(i,j) for 1 ≤ i, j ≤ k
  - Replace the (R × N) matrix D^T by the (k × N) matrix D*^T, where D*^T(i,j) = D^T(i,j) for 1 ≤ i ≤ k and 1 ≤ j ≤ N
  - For the product T S* D*^T to remain defined, keep only the corresponding k columns of T

Slide 16: LSI
- How?
- Bottom line: throw away the least significant singular values and retain the rest of the matrix
- The key claim in LSI is that if k is chosen judiciously, then the k rows appearing in the singular matrix S* represent the k "most important" (from the point of view of retrieval) terms occurring in the "entire" document archive
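
A minimal sketch of the truncation step, assuming NumPy and a hypothetical 5-term by 4-document table. The helper name `lsi_truncate` is made up for this example. Because `numpy.linalg.svd` orders singular values from largest to smallest, the k most significant ones are the leading k entries.

```python
import numpy as np

def lsi_truncate(freq_t, k):
    """Return (T*, S*, D*^T), keeping only the k largest singular values."""
    t, s, d_t = np.linalg.svd(freq_t, full_matrices=False)
    return t[:, :k], np.diag(s[:k]), d_t[:k, :]

# Hypothetical frequency table: 5 terms (rows) by 4 documents (columns).
freq_t = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

t_k, s_k, d_t_k = lsi_truncate(freq_t, k=2)
print(d_t_k.shape)  # (2, 4): each document is now a column of length k

# The product of the three truncated matrices approximates the original table.
approx = t_k @ s_k @ d_t_k
print(np.linalg.norm(freq_t - approx))
```

The approximation error printed at the end comes entirely from the discarded singular values, so choosing k is a trade-off between vector length and fidelity.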

Slide 17: Analysis
- Usually k is taken to be 200
- The size of FreqT is (M × N), where
  - M = number of terms = 1,000,000
  - N = number of documents = 10,000 (even for a small database)
- After shrinking the singular matrix to k = 200:
  - the first matrix: (M × k) = 1,000,000 × 200 = 200,000,000
  - the singular matrix: (k × k) = 200 × 200 = 40,000 (only 200 need to be stored because all others are 0's)
  - the last matrix: (k × N) = 200 × 10,000 = 2,000,000
  - a total of 202,000,200 (about 200 million)
- In contrast, (M × N) is 10,000 million
- SVD reduces the space used to about 1/50th of that required by the original frequency table
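
The arithmetic can be reproduced in a few lines, using the slide's own figures (1,000,000 terms, 10,000 documents, 200 retained singular values). The variable names are chosen for this sketch.

```python
M = 1_000_000   # number of terms
N = 10_000      # number of documents
K = 200         # singular values kept after shrinking

first_matrix = M * K     # entries in the (M x K) matrix
singular_diagonal = K    # only the K diagonal entries need to be stored
last_matrix = K * N      # entries in the (K x N) matrix

total = first_matrix + singular_diagonal + last_matrix
original = M * N

print(f"{total:,}")                # 202,000,200
print(f"{original:,}")             # 10,000,000,000
print(f"{original / total:.1f}")   # 49.5
```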

Slide 18: LSI: Document Retrieval using SVD
- Given two documents d_1 and d_2 in the archive, how similar are they?
- Given a query string/document Q, what are the n documents in the archive that are most relevant for the query?
- Dot product
  - Suppose x = (x_1, ..., x_w) and y = (y_1, ..., y_w)
  - The dot product of x and y is $x \cdot y = \sum_{i=1}^{w} x_i \times y_i$
- The similarity of two documents with respect to the SVD representation T S* D*^T of a frequency table is the dot product of the two documents' columns in the matrix D*^T

Slide 19: LSI: Document Retrieval using SVD
- The top p matches for Q are documents d_1, ..., d_p such that:
  1. For all 1 ≤ i ≤ j ≤ p, the similarity between vec_Q and d_i is greater than or equal to the similarity between vec_Q and d_j
  2. There is no other document d_z such that the similarity between d_z and vec_Q exceeds that of d_p
- This can be done by using any indexing structure for k-dimensional spaces (R-trees, k-d trees)
- However, R-trees and k-d trees do not work well for high-dimensional data (more than 20 dimensions)
- Solution: TV-trees
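
As a baseline, the top p matches can be found with a plain linear scan over all reduced document vectors, scored by dot product. The vectors and names below are hypothetical, and this scan is exactly the work an index structure is meant to avoid.

```python
import heapq

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def top_p_matches(vec_q, doc_vectors, p):
    """Return the p (doc_id, similarity) pairs with the highest dot product."""
    scored = ((doc_id, dot(vec_q, vec)) for doc_id, vec in doc_vectors.items())
    return heapq.nlargest(p, scored, key=lambda pair: pair[1])

# Hypothetical reduced vectors (columns of D*^T) with k = 2.
doc_vectors = {
    "d1": [0.9, 0.1],
    "d2": [0.2, 0.8],
    "d3": [0.7, 0.3],
    "d4": [-0.5, 0.4],
}
vec_q = [1.0, 0.0]

print(top_p_matches(vec_q, doc_vectors, p=2))  # [('d1', 0.9), ('d3', 0.7)]
```

`heapq.nlargest` returns its results sorted from most to least similar, which satisfies condition 1 above directly.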

Slide 20: Telescopic Vector (TV) Trees
- Access to point data in very high-dimensional spaces should be highly efficient
- A document d may be viewed as a vector v of length k, where the singular matrix is of size (k × k)
  - Thus each document is a point in a k-dimensional space
  - A document database is a collection of such points
- To find the top p matches for Q, expressed as vec_Q of length k, we need to find the p nearest neighbors of vec_Q
- The TV-tree is a data structure similar to R-trees

Slide 21: Organization of a TV-tree
- NumChild: the maximum number of children a node is allowed to have
- α: a number with 0 < α < k, the number of active dimensions
- Each node in a TV(k, NumChild, α) tree represents a region; for this purpose, each node N contains 3 fields
  - N.Center: a point in k-dimensional space
  - N.Radius: a real number > 0
  - N.ActiveDims: a list of at most α dimensions, i.e., a subset of {1, ..., k} of cardinality α or less

Slide 22: Region associated with a node N
- Suppose x and y are points in k-dimensional space
  - $\text{act-dist}(x,y) = \sqrt{\sum_{i \in \text{ActiveDims}} (x_i - y_i)^2}$
- Let k = 200, α = 5, and ActiveDims = {1,2,3,4,5}
  - x = (10, 5, 11, 13, 7, x_6, ..., x_200)
  - y = (2, 4, 14, 8, 6, y_6, ..., y_200)
  - $\text{act-dist}(x,y) = \sqrt{(10-2)^2 + (5-4)^2 + (11-14)^2 + (13-8)^2 + (7-6)^2} = 10$
- Node N represents the region containing all points x such that the active distance between x and N.Center is ≤ N.Radius
  - If N.Center = (10, 5, 11, 13, 7, 0, 0, 0, ..., 0) and N.ActiveDims = {1,2,3,4,5}, then N represents the region consisting of all points x such that $\sqrt{(x_1-10)^2 + (x_2-5)^2 + (x_3-11)^2 + (x_4-13)^2 + (x_5-7)^2} \le \text{N.Radius}$
- A node also contains an array, Child, of pointers to other nodes
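
The active distance and the region test can be sketched as follows, using the slide's example points. The inactive coordinates x_6 to x_200 and y_6 to y_200 are not given on the slide, so arbitrary placeholder values are used here to show that they do not affect the result.

```python
import math

def act_dist(x, y, active_dims):
    """Euclidean distance restricted to the active dimensions (1-based, as on the slide)."""
    return math.sqrt(sum((x[i - 1] - y[i - 1]) ** 2 for i in active_dims))

def in_region(x, center, radius, active_dims):
    """True iff x lies in the region of a node with this center, radius, and active dims."""
    return act_dist(x, center, active_dims) <= radius

k = 200
active = [1, 2, 3, 4, 5]
x = [10, 5, 11, 13, 7] + [0] * (k - 5)   # placeholder zeros for inactive coordinates
y = [2, 4, 14, 8, 6] + [1] * (k - 5)     # placeholder ones for inactive coordinates

print(act_dist(x, y, active))  # 10.0
```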

Slide 23: Properties of TV-Trees
- All data is stored at the leaf nodes
- Each node (except the root and the leaves) must be at least half full
- If N is a node and N_1, ..., N_r are its children, then Region(N) is the union of all Region(N_i)'s

Slide 24: Insertion into TV-trees
- Three steps:
  1. Branch selection: when we insert a new vector v at node N,
     - for each child N_j of N, compute exp(v) = the amount by which we must expand N_j.Radius so that v's active distance from N_j.Center falls within this region
     - select the branch for which exp(v) is minimum
  2. Splitting: when a leaf node is full and cannot accommodate the new vector v, we have to split
     - Split the vectors into 2 groups G1, H1 such that we enclose all vectors in G1 with center c1 and radius r1, and all vectors in H1 with center c1' and radius r1'
     - There exist many such splits, e.g., G2, H2 (with (c2, r2), (c2', r2'))
     - Take the one with the minimum sum of radii, i.e., G1, H1 is better if (r1 + r1') < (r2 + r2')
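
Branch selection can be sketched as picking the child whose radius needs the smallest expansion. The `Node` class, its field names, and the example values below are assumptions made for illustration; they follow the three fields described for TV-tree nodes but are not a full TV-tree implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: list
    radius: float
    active_dims: list                      # 1-based dimension indices
    children: list = field(default_factory=list)

def act_dist(x, y, active_dims):
    return math.sqrt(sum((x[i - 1] - y[i - 1]) ** 2 for i in active_dims))

def expansion(node, v):
    """Amount node.radius must grow so that v falls inside the node's region."""
    return max(0.0, act_dist(v, node.center, node.active_dims) - node.radius)

def select_branch(parent, v):
    """Return the child of parent that needs the least expansion to cover v."""
    return min(parent.children, key=lambda child: expansion(child, v))

near = Node(center=[5, 5, 0], radius=2.0, active_dims=[1, 2])
far = Node(center=[0, 0, 0], radius=1.0, active_dims=[1, 2])
root = Node(center=[2, 2, 0], radius=8.0, active_dims=[1, 2], children=[far, near])

v = [4, 4, 9]  # the third coordinate is inactive and is ignored
print(select_branch(root, v) is near)  # True
```

Ties are broken here by child order, which is a simplification of this sketch.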

Slide 25: Insertion into TV-trees
  3. Telescoping: the active dimensions associated with a node or the children of the node change (either expand or contract); this is called telescoping. It happens in 2 cases:
     - When a node splits into two subnodes N1 and N2, the vectors in region(N1) all agree on not just the active dimensions of N, but a few more as well
     - When a new vector is added to a node N, the active dimensions may reduce

Slide 26: Other Retrieval Techniques: Inverted Indices
- A document_record contains 2 fields: doc_id, postings_list
  - postings_list is a list of terms (or pointers to terms) that occur in the document, sorted using a suitable relevance measure
- A term_record consists of 2 fields: term, postings_list
  - postings_list is a list specifying which documents the term appeared in
- Two hash tables are maintained: DocTable and TermTable
  - DocTable is constructed by hashing on doc_id
  - TermTable is constructed by hashing on term
- To find all documents associated with a term, merely return the postings_list
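
A minimal sketch of the two hash tables using Python dictionaries. The input term lists are hypothetical, already stop-listed and stemmed, and the document postings are left in document order rather than sorted by a relevance measure, which is a simplification.

```python
from collections import defaultdict

def build_tables(docs):
    """Return (doc_table, term_table) as dicts, which are hash tables in Python."""
    doc_table = {}
    term_table = defaultdict(list)
    for doc_id, terms in docs.items():
        doc_table[doc_id] = list(terms)        # postings_list of a document_record
        for term in dict.fromkeys(terms):      # each distinct term once, in order
            term_table[term].append(doc_id)    # postings_list of a term_record
    return doc_table, dict(term_table)

# Hypothetical pre-processed documents.
docs = {
    "d1": ["sex", "drug", "videotape"],
    "d2": ["iran", "connection"],
    "d3": ["boat", "drug", "slip", "owned", "cartel"],
    "d4": ["connection", "terrorism", "asian", "dope", "operation"],
}
doc_table, term_table = build_tables(docs)

print(term_table["drug"])          # ['d1', 'd3']
print(term_table.get("bank", []))  # []
```

Looking up a term is a single hash probe followed by returning its postings list, independent of the number of documents in the archive.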

Slide 27: Other Retrieval Techniques: Signature Files
- Associate a signature with each document
  - A signature is a representation of an ordered list of terms that describe the document
  - The list of terms in the signature may be derived from frequency analysis, stemming, and the use of stop lists

