1 CS 430 / INFO 430 Information Retrieval Lecture 26 Thesauruses and Cluster Analysis 2.


1 CS 430 / INFO 430 Information Retrieval Lecture 26 Thesauruses and Cluster Analysis 2

2 Course Administration CS 490 and CS 790 Independent Research Projects Web Research Infrastructure -- Build a system to bring complete crawls of the Web from the Internet Archive to the Cornell Theory Center and make them available for researchers through a standard API. (Continues planning work carried out this semester.) There will not be an independent research project in information retrieval.

3 Course Administration Final Examination The final examination is on Monday, December 13, between 12:00 and 1:30. There appears to be little interest in a make-up examination on another date. If, however, you have real problems with the scheduled date, send a message to Anat Nidar-Levi.

4 Cluster Analysis Methods that divide a set of n objects into m non-overlapping subsets. For information discovery, cluster analysis is applied to: terms, for thesaurus construction; documents, to divide into categories (sometimes called automatic classification, but classification usually requires a pre-determined set of categories).

5 Cluster Analysis Metrics
- Documents clustered on the basis of a similarity measure calculated from the terms that they contain.
- Documents clustered on the basis of co-occurring citations.
- Terms clustered on the basis of the documents in which they co-occur.

6 Non-hierarchical and Hierarchical Methods
Non-hierarchical methods: Elements are divided into m non-overlapping sets where m is predetermined.
Hierarchical methods: m is varied progressively to create a hierarchy of solutions.
Agglomerative methods: m is initially equal to n, the total number of elements, where every element is considered to be a cluster with one element. The hierarchy is produced by incrementally combining clusters.

7 Simple methods: Single Link [Figure: scatter of points marked x] Concept: Similarity between clusters is similarity between most similar elements.

8 Simple methods: Single Link Single Link: A simple agglomerative method. Initially, each element is its own cluster with one element. At each step, calculate the similarity between each pair of clusters as the similarity of their most similar pair of elements, one element taken from each cluster. Merge the two clusters that are most similar. May lead to long, straggling clusters (chaining). Very simple computation.
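The single-link procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `single_link`, the item labels p, q, r, s, and the similarity values are all invented for the example.

```python
from itertools import combinations

def single_link(items, sim):
    """Agglomerative single-link clustering.

    sim maps frozenset({a, b}) to a similarity (larger = more similar).
    Returns the merges in order as (cluster, cluster, similarity).
    """
    clusters = [frozenset([x]) for x in items]
    merges = []

    def link(c1, c2):
        # Single link: similarity of the MOST similar cross-cluster pair.
        return max(sim[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            s = link(clusters[i], clusters[j])
            if best is None or s > best[0]:
                best = (s, i, j)
        s, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical similarities forming a chain p - q - r - s.
pairs = {("p", "q"): 0.9, ("q", "r"): 0.8, ("r", "s"): 0.7,
         ("p", "r"): 0.1, ("p", "s"): 0.1, ("q", "s"): 0.1}
sim = {frozenset(k): v for k, v in pairs.items()}

merges = single_link(["p", "q", "r", "s"], sim)
for left, right, s in merges:
    print(sorted(left), "+", sorted(right), "at", s)
```

In this made-up data p and s are very dissimilar (0.1), yet they end up in one cluster because each merge only needs a single strong link. That is the chaining behaviour the slide warns about.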

9 Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
         alpha    bravo    charlie  delta    echo     foxtrot  golf
D1       1        1        1        1        1        1        1
D2       1        0        0        1        0        0        1
D3       0        1        1        0        1        1        0
D4       1        0        0        1        0        1        1
n        3        2        2        3        2        3        3
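As a small sketch of how the incidence array follows from the four documents on this slide, the Python below records a 1 when a term occurs in a document (repeats are ignored) and sums each column to get n. The function name `incidence` and the variable names are chosen for illustration only.

```python
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def incidence(docs):
    """Return (terms, rows): rows[d][k] is 1 if terms[k] occurs in document d."""
    terms = sorted({t for text in docs.values() for t in text.split()})
    rows = {}
    for name, text in docs.items():
        present = set(text.split())          # incidence ignores term frequency
        rows[name] = [1 if t in present else 0 for t in terms]
    return terms, rows

terms, rows = incidence(docs)
# n: number of documents containing each term (column sums).
counts = [sum(col) for col in zip(*rows.values())]

print(terms)
for name, row in rows.items():
    print(name, row)
print("n ", counts)
```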

10 Term similarity matrix
         alpha    bravo    charlie  delta    echo     foxtrot  golf
alpha             0.2      0.2      0.5      0.2      0.33     0.5
bravo                      0.5      0.2      0.5      0.4      0.2
charlie                             0.2      0.5      0.4      0.2
delta                                        0.2      0.33     0.5
echo                                                  0.4      0.2
foxtrot                                                        0.33
golf
Using incidence matrix and dice weighting
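The slide does not spell out the weighting formula, so the following is a sketch under an assumption: the pairwise values listed in the complete-linkage example later in these slides (for example ad/.5, af/.33, ab/.2) are reproduced by shared / (n_i + n_j), where shared is the number of documents containing both terms and n_i, n_j are the per-term document counts. The Dice coefficient is more commonly written with a factor of 2 in the numerator, which would double every value without changing the ranking of pairs. The names `postings` and `similarity` are illustrative.

```python
from itertools import combinations

# Documents (by number) in which each term occurs, read off the incidence array.
postings = {
    "alpha":   {1, 2, 4},
    "bravo":   {1, 3},
    "charlie": {1, 3},
    "delta":   {1, 2, 4},
    "echo":    {1, 3},
    "foxtrot": {1, 3, 4},
    "golf":    {1, 2, 4},
}

def similarity(a, b):
    # ASSUMPTION: shared / (n_a + n_b), inferred from the values on the slides.
    shared = len(postings[a] & postings[b])
    return shared / (len(postings[a]) + len(postings[b]))

sim = {(a, b): round(similarity(a, b), 2)
       for a, b in combinations(sorted(postings), 2)}

for (a, b), s in sim.items():
    print(f"{a:8s}{b:8s}{s}")
```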

11 Example -- single link [Figure: partial dendrogram over alpha, delta, golf, bravo, echo, charlie, foxtrot; merge 1 marked] Agglomerative: step 1

12 Example -- single link [Figure: partial dendrogram over alpha, delta, golf, bravo, echo, charlie, foxtrot; merges 1 and 2 marked] Agglomerative: step 2

13 Example -- single link [Figure: partial dendrogram over alpha, delta, golf, bravo, echo, charlie, foxtrot] Agglomerative: step 3

14 Example -- single link [Figure: dendrogram over alpha, delta, golf, bravo, echo, charlie, foxtrot] This style of diagram is called a dendrogram.

15 Simple methods: Complete Linkage [Figure: scatter of points marked x] Concept: Similarity between clusters is similarity between least similar elements.

16 Simple methods: complete linkage Complete linkage: A simple agglomerative method. Initially, each element is its own cluster with one element. At each step, calculate the similarity between each pair of clusters as the similarity between the least similar pair of elements in the two clusters. Merge the two clusters that are most similar. Generates small, tightly bound clusters.

17 Term similarity matrix
         alpha    bravo    charlie  delta    echo     foxtrot  golf
alpha             0.2      0.2      0.5      0.2      0.33     0.5
bravo                      0.5      0.2      0.5      0.4      0.2
charlie                             0.2      0.5      0.4      0.2
delta                                        0.2      0.33     0.5
echo                                                  0.4      0.2
foxtrot                                                        0.33
golf
Using incidence matrix and dice weighting

18 Example – complete linkage
Least similar pair / distance
Cluster a       b       c       d       e       f       g
a       -       ab/.2   ac/.2   ad/.5   ae/.2   af/.33  ag/.5
b               -       bc/.5   bd/.2   be/.5   bf/.4   bg/.2
c                       -       cd/.2   ce/.5   cf/.4   cg/.2
d                               -       de/.2   df/.33  dg/.5
e                                       -       ef/.4   eg/.2
f                                               -       fg/.33
g                                                       -
Step 1. Merge clusters {a} and {d}

19 Example – complete linkage
Least similar pair / distance
Cluster a,d     b       c       e       f       g
a,d     -       ab/.2   ac/.2   ae/.2   df/.33  ag/.5
b               -       bc/.5   be/.5   bf/.4   bg/.2
c                       -       ce/.5   cf/.4   cg/.2
e                               -       ef/.4   eg/.2
f                                       -       fg/.33
g                                               -
Step 2. Merge clusters {a,d} and {g}

20 Example – complete linkage
Least similar pair / distance
Cluster a,d,g   b       c       e       f
a,d,g   -       ab/.2   ac/.2   ae/.2   af/.33
b               -       bc/.5   be/.5   bf/.4
c                       -       ce/.5   cf/.4
e                               -       ef/.4
f                                       -
Step 3. Merge clusters {b} and {c}

21 Example – complete linkage
Least similar pair / distance
Cluster a,d,g   b,c     e       f
a,d,g   -       ab/.2   ae/.2   af/.33
b,c             -       be/.5   bf/.4
e                       -       ef/.4
f                               -
Step 4. Merge clusters {b,c} and {e}

22 Example -- complete linkage [Figure: dendrogram with leaves alpha, delta, golf, bravo, charlie, echo, foxtrot; merges labelled Step 1 to Step 6]
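The complete-linkage example can be re-run with a short Python sketch. Two points are assumptions rather than statements from the slides: the similarity is taken as shared / (n_i + n_j), which reproduces the pairwise values in the tables, and ties (several pairs score .5 in the early steps) are broken by taking the first pair in alphabetical cluster order. With that tie-break the first four merges come out in the same order as Steps 1 to 4. The names `postings`, `similarity`, and `complete_linkage` are illustrative.

```python
from itertools import combinations

# Documents (by number) in which each term occurs.
postings = {
    "alpha":   {1, 2, 4},
    "bravo":   {1, 3},
    "charlie": {1, 3},
    "delta":   {1, 2, 4},
    "echo":    {1, 3},
    "foxtrot": {1, 3, 4},
    "golf":    {1, 2, 4},
}

def similarity(a, b):
    # ASSUMPTION: weighting inferred from the values shown on the slides.
    shared = len(postings[a] & postings[b])
    return shared / (len(postings[a]) + len(postings[b]))

def complete_linkage(items):
    """Return merges in order as (cluster, cluster, similarity)."""
    clusters = [(x,) for x in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: similarity of the LEAST similar cross pair.
            s = min(similarity(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or s > best[0]:   # strict '>' keeps the first tie
                best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], round(s, 2)))
        clusters[i] = clusters[i] + clusters[j]   # merged cluster keeps slot i
        del clusters[j]
    return merges

merges = complete_linkage(sorted(postings))
for step, (left, right, s) in enumerate(merges, 1):
    print(f"Step {step}: merge {left} and {right} at {s}")
```

A different tie-breaking rule would give a different but equally valid merge order for the pairs that tie at .5, so the step numbering should not be read as unique.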

23 Problems with cluster analysis
- Selection of attributes on which items are clustered
- Choice of similarity measure and algorithm
- Computational resources
- Assessing validity and stability of clusters
- Updating clusters as data changes
- Method for searching the clusters

24 Example 1: Concept Spaces for Scientific Terms Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms. Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May.

25 Concept Spaces: Methodology Concept space: A similarity matrix based on co-occurrence of terms. Approach: Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept. Arrange concepts in a hierarchical classification.

26 Concept Spaces: INSPEC Data Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links. [24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]
computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education
  RT teaching

27 Concept Space: Compendex Data Data set 2: (a) 4,000,000 abstracts from the Compendex database covering all of engineering as the collection, partitioned along classification code lines into some 600 community repositories. [Four days of CPU on 64-processor Convex Exemplar.] (b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000 and the concept space for each set generated separately. The sets were selected by the existing classification scheme.

28 Objectives
- Semantic retrieval (using concept spaces for term suggestion)
- Semantic interoperability (vocabulary switching across subject domains)
- Semantic indexing (concept identification of document content)
- Information representation (information units for uniform manipulation)

29 Use of Concept Space: Term Suggestion

30 Future Use of Concept Space: Vocabulary Switching "I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."

31 Example 2: Visual thesaurus for geographic images Methodology: Divide images into small regions. Create a similarity measure based on properties of these images. Use cluster analysis tools to generate clusters of similar images. Provide alternative representations of clusters. Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May Thesaurus.html

32

33 Example 3: Cluster Analysis of Social Science Journals In the social sciences, subject boundaries are unclear. Can citation patterns be used to develop criteria for matching information services to the interests of users? W. Y. Arms and C. R. Arms, Cluster analysis used on social science citations, Journal of Documentation, 34 (1) pp 1-11, March 1978.

34 Methodology Assumption: Two journals are close to each other if they are cited by the same source journals, with similar relative frequencies. Sources of citations: Select a sample of n social science journals. Citation matrix: Construct an m x n matrix in which the ijth element is the number of citations to journal i from journal j. Normalization: All data was normalized so that the sum of the elements in each row is 1.
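The row normalization on this slide can be illustrated with a small Python sketch. The citation counts are invented, and the helper names `normalize_rows` and `euclidean` are illustrative; Euclidean distance is used here because the Algorithms slide says the main analysis was based on it, but this is not a reconstruction of the study's actual procedure.

```python
import math

def normalize_rows(matrix):
    """Scale each row so its elements sum to 1 (rows of zeros stay zero)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total else [0.0] * len(row))
    return out

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# HYPOTHETICAL counts: rows are cited journals i, columns are source journals j.
citations = [
    [10, 30, 60],   # journal A
    [1, 3, 6],      # journal B: far fewer citations, same relative pattern
    [60, 30, 10],   # journal C: cited mostly by a different source
]

profiles = normalize_rows(citations)
print(euclidean(profiles[0], profiles[1]))  # A vs B: effectively zero
print(euclidean(profiles[0], profiles[2]))  # A vs C: clearly larger
```

After normalization, journals A and B are close even though their raw counts differ tenfold, which matches the stated assumption that closeness depends on relative citation frequencies rather than volume.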

35 Data Pilot study: 5,000 citations from the 1970 volumes of 17 major journals from across the social sciences. Criminology citations: Every fifth citation from a set of criminology journals (3 sets of data for 1950, 1960, 1970). Main file (52,000 citations): (a)Every citation from the 1970 volumes of the 48 most cited source journals in the pilot study. (b)Every citation from the 1970 volumes of 47 randomly selected journals.

36 Sample sizes
Sample          Source journals   Target journals
Pilot           17                115
Criminology:
Main file:
  ranked        48                495
  random        47                254
Excludes journals that are cited by only one source. These were assumed to cluster with the source.

37 Algorithms Main analysis used a non-hierarchical method of E. M. L. Beale and M. G. Kendall based on Euclidean distance. For comparison, 36 psychology journals were clustered using single-linkage, complete-linkage, and van Rijsbergen's algorithm. The Beale/Kendall algorithm and complete-linkage produced similar results. Single-linkage suffered from chaining. Van Rijsbergen's algorithm seeks very clear-cut clusters, which were not found in the data.

38 Non-hierarchical clusters Economics clusters in the pilot study

39 Non-hierarchical dendrogram Part of a dendrogram showing non-hierarchical structure

40 Conclusion "The overall conclusion must be that cluster analysis is not a practical method of designing secondary services in the social sciences." Because of skewed distributions very large amounts of data are required. Results are complex and difficult to interpret. Overlap between social sciences leads to results that are sensitive to the precise data and algorithms chosen.

41 The End [Diagram labels: Search index, Return hits, Browse content, Return objects]