
1 INFO624 -- Week 7 Indexing and Searching Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University

2 Effective Information Retrieval
 Data Structures
 Knowledge Representation
 User Interface and User Interaction

3 Data Structures
 Describes how text, attributes of text, and indexes are stored in memory, files, or databases.
 Describes the nature of relationships among information elements.

4 Data Models
Logical data model
 how the user views the data
 how to represent or capture semantic (logical) relationships in the data
 how to present the relationships in the data to the user
 independent of physical implementation and systems
Physical data model
 how data are actually stored in the computer
 techniques for improving the efficiency of data storage and access

5 Logical Data Models
Linear sequential model
 Records are arranged in a defined order.
 Advantage: fast access
 Disadvantage: data must be moved around when sorting
Linked sequential model
 Data are arranged in the order they are inserted.
 Each element has a link pointer to the next element.
 Advantage: no need to move data around
 Disadvantage: additional space for the links

6 Hierarchical (tree) model
 Has a unique root
 Each element (except the root) has one and only one parent
 Advantage: relationships are precisely defined
 Disadvantage: describes only one type of relationship

7 Poly-hierarchical model
 Allows an element in the tree structure to have more than one parent
 Computationally complex
 Advantage: represents more complex relationships
 Disadvantage: possible infinite loops

8 Network model
 Hypertext model
 Emphasizes links and nodes
 Less formalism
 A node can have any number of links
 A node can be freely defined (nodes do not have to be the same type)
 Advantage: flexible
 Disadvantage: lack of control; lack of theory

9 Space model
 Basics of the physical space:
 Dimensions, axes, coordinates
 Geography, physics, and rules of law
 Semantic space:
 Giving meaning to place
 Searching for features of the space
Vector space model
 Each indexing term is an axis
 Each document is a vector

10 Physical Data Structure
 How data are actually stored in the computer
 Technical implementation of data storage
 Record structures:
 Fixed-length
 Variable-length
 Tradeoff between speed and space

11 Examples:
Fixed-length record:
SMITH,JON 1287 MAPLE AVE,AKRON OH, 44444 8005551234890315SMIT
Name -- 1, address -- 21, telephone -- 61, date -- 91
Next record -- 97

12 Variable-length record:
SMITH,JON|18287 MAPLE AVE,AKRON OH, 44444|8005551234|900315#SMIT
1 -- name, 2 -- address, 3 -- telephone, 4 -- date

13 Variable-length record (fixed header):
00075|021|030|63|73*
SMITH,JON1287 MAPLE AVE,AKRON OH, 4444480055512348900315#000

14 File structures
 The focus of data structures in IR is file structures.
 A collection of documents is called a file.
 Each document is called a record.
 The key to file structures is the different search techniques or models for the files and for the indexes of files.

15 Index structures
 A main file and several indexing files
 The main file is sequential, without sorting.
 The indexing files are sorted and point into the main file.
 Inverted files
 How large are the inverted indexing files?
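The inverted-file idea above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the documents and function name are made up for the example):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns word -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # the posting lists are sorted, while the main file (docs) stays unsorted
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "data base management", 2: "data structures", 3: "file structures"}
index = build_inverted_index(docs)
print(index["data"])        # doc ids whose records contain "data"
print(index["structures"])
```

Each sorted posting list points back into the unsorted main file by record identifier, exactly as described above.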

16 Sizes of Inverted Indexing Files
Index                   Small (1 MB)    Medium (200 MB)   Large (2 GB)
Addressing words        45% / 73%       36% / 64%         35% / 63%
Addressing documents    19% / 26%       18% / 32%         26% / 47%
Addressing 64K blocks   27% / 41%       18% / 32%          5% /  9%
Addressing 256 blocks   18% / 25%       1.7% / 2.4%       0.5% / 0.7%
In each cell, the first number is the index size without stop words; the second is with stop words.

17 Searching
 Going through the list of words in the inverted indexing file sequentially takes a long time, even for a computer.
 Data structures need to be created to speed up the search:
 Trees
 Hash tables
 Signature files

18 Trees
Binary tree
 Each node contains a key.
 The left sub-tree stores all keys smaller than the parent key.
 The right sub-tree stores all keys larger than the parent key.
Balanced trees
 Every parent has balanced left and right sub-trees.
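A minimal binary search tree, as a sketch of the structure just described (my own illustrative code, with made-up keys):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # keys smaller than self.key
        self.right = None   # keys larger than self.key

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    # each comparison discards one sub-tree, so lookup cost follows tree depth
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

root = None
for k in ["management", "base", "system", "data"]:
    root = insert(root, k)
print(search(root, "data"), search(root, "index"))   # True False
```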

19 B-tree
 Each node can have more than one key.
 If a node has m keys, it has m+1 child branches.
 All keys in the i-th branch are smaller than key i.
 All leaves are at the same depth.
B+ tree
 A B-tree that stores all data in the leaves.
Example:
 A B-tree of 10,000,000 keys with 50 keys per node
 never needs to retrieve more than 4 nodes to find any key.

20 Procedures for Constructing Balanced Trees
1. Check if the original tree is balanced:
 Check if the left child is balanced; if it is not balanced, go to step 2.
 Check if the right child is balanced; if it is not balanced, go to step 2.

21 2. Rotate the unbalanced tree:
 If the left branch is deeper:
 Move the left child of the root up to become the new root; move the right branch of the new root to become the left branch of the old root.
 Make the old root the right child of the new root.
 If the right branch is deeper:
 … …
3. Go back to step 1 to check whether the new tree is balanced.
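The rotation for the "left branch is deeper" case can be sketched like this (an illustrative sketch of a single right rotation, not code from the slides):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(old_root):
    """Left branch deeper: promote the left child to be the new root."""
    new_root = old_root.left
    old_root.left = new_root.right   # new root's right branch -> old root's left
    new_root.right = old_root        # old root becomes the right child of new root
    return new_root

# Unbalanced chain 3 -> 2 -> 1 (every node is a left child)
tree = Node(3, left=Node(2, left=Node(1)))
tree = rotate_right(tree)
print(tree.key, tree.left.key, tree.right.key)   # 2 1 3
```

After one rotation the chain becomes a balanced tree rooted at 2; step 3 of the procedure would re-check the result.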

22 B+ tree:
Root: (F, M)
Internal nodes: (Ap, Bs, E), (Gr, H, L), (P, Ru, T)
Leaves: 1 2 3 4 5 6 7 8 9 10 11 12

23 Direct-Access Structures
Hashing
 Evenly distributes a long list into a short table using a hashing function.
 The remainder modulo a prime number is a common hashing function.
Example:
 Hashing function: H(k) = k mod 7
 Put the following numbers into the hashing table: 5, 22, 25, 89, 50, 71, 99

24 [Figure: the resulting hash table]
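The example above can be worked through in code (a sketch; chaining is my own choice for handling the collisions):

```python
# Hash table of 7 slots using H(k) = k mod 7, with chaining for collisions.
def hash_insert(table, k):
    table[k % 7].append(k)

table = [[] for _ in range(7)]
for k in [5, 22, 25, 89, 50, 71, 99]:
    hash_insert(table, k)

for slot, keys in enumerate(table):
    print(slot, keys)
# slot 1 collects 22, 50, 71, 99 (all leave remainder 1);
# slot 5 collects 5 and 89; slot 4 holds 25.
```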

25 Signature Files
Word         Signature
data         0000 0000 0000 0010 0000
base         0000 0001 0000 0000 0000
management   0000 1000 0000 0000 0000
system       0000 0000 0000 0000 1000
Block signature: 0000 1001 0000 0010 1000

26 Which of the following blocks contain the term “database”?
 0000 1000 0000 0010 1000
 0000 1010 0000 0010 1000
 0000 1011 0000 0010 1001
 0000 1011 0000 0000 1000
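The membership test can be sketched in code. A block *may* contain a term when every 1-bit of the term's signature is also set in the block signature (matches can still be false positives). I assume here that the signature of "database" is the OR of the signatures given for "data" and "base":

```python
def sig(bits):
    """Parse a spaced bit string into an integer bit mask."""
    return int(bits.replace(" ", ""), 2)

# assumed: signature("database") = signature("data") | signature("base")
database = sig("0000 0000 0000 0010 0000") | sig("0000 0001 0000 0000 0000")

blocks = [
    "0000 1000 0000 0010 1000",
    "0000 1010 0000 0010 1000",
    "0000 1011 0000 0010 1001",
    "0000 1011 0000 0000 1000",
]
results = []
for b in blocks:
    # the block qualifies only if it covers every bit of the query signature
    results.append((sig(b) & database) == database)
    print(b, "->", results[-1])
# only the third block has both required bits set
```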

27 Document Similarity
Documents
 D1 = {t11, t12, t13, …, t1n}
 D2 = {t21, t22, t23, …, t2n}
 where each tik is either 0 or 1.
Simple measurements of difference/similarity:
 w = the number of times t1k = 1 and t2k = 1
 x = the number of times t1k = 1 and t2k = 0
 y = the number of times t1k = 0 and t2k = 1
 z = the number of times t1k = 0 and t2k = 0
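Counting w, x, y, z for two binary term vectors is straightforward (a sketch with made-up vectors):

```python
def agreement_counts(d1, d2):
    """Return (w, x, y, z) for two equal-length binary term vectors."""
    w = sum(a == 1 and b == 1 for a, b in zip(d1, d2))  # both have the term
    x = sum(a == 1 and b == 0 for a, b in zip(d1, d2))  # only D1 has it
    y = sum(a == 0 and b == 1 for a, b in zip(d1, d2))  # only D2 has it
    z = sum(a == 0 and b == 0 for a, b in zip(d1, d2))  # neither has it
    return w, x, y, z

d1 = [1, 1, 0, 1, 0]
d2 = [1, 0, 1, 1, 0]
print(agreement_counts(d1, d2))   # (2, 1, 1, 1)
```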

28 Similarity Measure
Cosine Coefficient:
 cosine(D1, D2) = w / sqrt(n1 × n2)
The same as:
 the dot product of the two vectors divided by the product of their lengths; for binary vectors the dot product is w and the lengths are sqrt(n1) and sqrt(n2).

29
 D1’s terms only: n1 = w + x (the number of times t1k = 1)
 D2’s terms only: n2 = w + y (the number of times t2k = 1)
 Sameness count: sc = (w + z)/(n1 + n2)
 Difference count: dc = (x + y)/(n1 + n2)
 Rectangular distance: rd = MAX(n1, n2)
 Conditional probability: cp = min(n1, n2)
 Mean: mean = (n1 + n2)/2

30 Similarity Measure
Dice’s Coefficient:
 Dice(D1, D2) = 2w/(n1 + n2)
 where w is the number of terms that D1 and D2 have in common, and n1, n2 are the numbers of terms in D1 and D2.
Jaccard Coefficient:
 Jaccard(D1, D2) = w/(N − z) = w/(n1 + n2 − w)
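The three coefficients can be computed together from w, n1, n2, exactly as defined on the slides (the example vectors are made up):

```python
import math

def coefficients(d1, d2):
    """Dice, Jaccard, and cosine coefficients for binary term vectors."""
    w = sum(a and b for a, b in zip(d1, d2))  # terms in common
    n1, n2 = sum(d1), sum(d2)                 # terms in each document
    dice = 2 * w / (n1 + n2)
    jaccard = w / (n1 + n2 - w)
    cosine = w / math.sqrt(n1 * n2)
    return dice, jaccard, cosine

d1 = [1, 1, 0, 1, 0]
d2 = [1, 0, 1, 1, 0]
print(coefficients(d1, d2))   # w=2, n1=3, n2=3 -> (2/3, 0.5, 2/3)
```

With w = 2 and n1 = n2 = 3, Dice and cosine happen to coincide at 2/3, while Jaccard is the smaller 1/2, since it also penalizes the mismatched terms in its denominator.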

31 Similarity Metric
A metric has three defining properties:
 Its values are non-negative.
 It is symmetric.
 It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|

32 Lp Metrics
 Lp(D1, D2) = (Σk |t1k − t2k|^p)^(1/p)
 p = 1: city-block (Manhattan) distance
 p = 2: Euclidean distance
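A sketch of the Lp metric over term vectors (illustrative code; the vectors are made up):

```python
def lp_distance(d1, d2, p):
    """Lp metric: the p-th root of the sum of |differences| raised to p."""
    return sum(abs(a - b) ** p for a, b in zip(d1, d2)) ** (1 / p)

d1, d2 = [1, 0, 1, 1], [0, 0, 1, 0]
print(lp_distance(d1, d2, 1))   # city-block distance: 2.0
print(lp_distance(d1, d2, 2))   # Euclidean distance: sqrt(2) ≈ 1.414
```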

33 Similarity Matrix
Pairwise coupling of similarities among a group of documents:
S11 S12 S13 S14 S15 S16 S17 S18
S21 S22 S23 S24 S25 S26 S27 S28
S31 S32 S33 S34 S35 S36 S37 S38
S41 S42 S43 S44 S45 S46 S47 S48
S51 S52 S53 S54 S55 S56 S57 S58
S61 S62 S63 S64 S65 S66 S67 S68
S71 S72 S73 S74 S75 S76 S77 S78
S81 S82 S83 S84 S85 S86 S87 S88

34 Document clustering
Grouping similar documents into sets:
 Create a similarity matrix.
 Apply a hierarchical clustering algorithm:
 1. Identify the two closest documents and combine them into a cluster.
 2. Identify the next two closest documents or clusters and combine them into a cluster.
 3. If more than one cluster remains, return to step 2.
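The steps above can be sketched as a minimal agglomerative clustering over a similarity matrix (my own sketch; single-link merging and the stopping threshold are assumptions, and the matrix values are made up):

```python
def cluster(sim, threshold=0.0):
    """Merge the most similar pair of clusters until similarity drops below threshold."""
    clusters = [{i} for i in range(len(sim))]
    while len(clusters) > 1:
        # find the pair of clusters with the highest single-link similarity
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim[a][b] for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best <= threshold:
            break                      # remaining clusters are too dissimilar
        i, j = pair
        clusters[i] |= clusters.pop(j)  # combine the closest pair
    return clusters

sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(cluster(sim, threshold=0.5))   # documents 0 and 1 merge; 2 stays apart
```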

35 Application of Document Clustering
Vivisimo
 Clusters search results on the fly
 Hierarchical categories for drill-down capability
AltaVista
 Refine search:
 Clusters related words into different groups based on their co-occurrence rates in documents.

36 AltaVista

37 ViVisimo Cluster Search Engine

38 Clusty.com

39 Concept Clusters
Use terms’ co-occurrence frequencies
 to predict semantic relationships
 to build concept clusters
 to suggest search terms
Visualization of term relationships
 Link displays
 Map displays
 Drag-and-drop interface for searching

