1 CS 430: Information Discovery Lecture 23 Cluster Analysis 2 Thesaurus Construction.

Presentation transcript:

1 CS 430: Information Discovery Lecture 23 Cluster Analysis 2 Thesaurus Construction

2 Course Administration Next week Guest lecture on Thursday, Thorsten Joachims. Final examination The final examination will include questions on all lectures, including the guest lectures, and the readings for the discussion classes. Examination date: Wednesday, December 18, 12:00 noon - 1:30 p.m. Early examination: Thursday December 12, 12:00 noon - 1:30 p.m. Contact Anat Nidar-Levi if you plan to take the early examination.

3 Example 2: Concept Spaces for Scientific Terms Large-scale searches can only match terms specified by the user to terms appearing in documents. Cluster analysis can be used to provide information retrieval by concepts, rather than by terms. Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen (University of Arizona), Federating Diverse Collections of Scientific Literature, IEEE Computer, May

4 Concept Spaces: Methodology Concept space: A similarity matrix based on co-occurrence of terms. Approach: Use cluster analysis to generate "concept spaces" automatically, i.e., clusters of terms that embrace a single semantic concept. Arrange concepts in a hierarchical classification.
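
The idea of a similarity matrix built from term co-occurrence can be sketched as follows. This is only an illustration: the slide does not say which co-occurrence measure the concept-space work used, so the Jaccard-style ratio, the function name `cooccurrence_similarity`, and the toy documents are all assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_similarity(docs):
    """Return {(term_a, term_b): similarity} for terms that share a document.

    Similarity is (documents containing both) / (documents containing either),
    an illustrative choice rather than the measure used in the cited work.
    """
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return {
        (a, b): n / (doc_freq[a] + doc_freq[b] - n)
        for (a, b), n in pair_freq.items()
    }

# Hypothetical toy collection; each document is a list of index terms.
docs = [
    ["bridge", "wind", "fluid"],
    ["cable", "water", "fluid"],
    ["bridge", "wind"],
]
sim = cooccurrence_similarity(docs)
```

Term pairs that never share a document are simply absent from the result, which keeps the matrix sparse for large vocabularies.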

5 Concept Spaces: INSPEC Data Data set 1: All terms in 400,000 records from INSPEC, containing 270,000 terms with 4,000,000 links. [24.5 hours of CPU on a 16-node Silicon Graphics supercomputer.] Example thesaurus entry: computer-aided instruction; see also education; UF teaching machines; BT educational computing; TT computer applications; RT education; RT teaching

6 Concept Space: Compendex Data Data set 2: (a) 4,000,000 abstracts from the Compendex database, covering all of engineering, used as the collection and partitioned along classification code lines into some 600 community repositories. [Four days of CPU on a 64-processor Convex Exemplar.] (b) In the largest experiment, 10,000,000 abstracts were divided into sets of 100,000 and the concept space for each set was generated separately. The sets were selected by the existing classification scheme.

7 Objectives Semantic retrieval (using concept spaces for term suggestion) Semantic interoperability (vocabulary switching across subject domains) Semantic indexing (concept identification of document content) Information representation (information units for uniform manipulation)

8 Use of Concept Space: Term Suggestion

9 Future Use of Concept Space: Vocabulary Switching "I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature."

10 Example 3: Visual thesaurus for browsing large collections of geographic images Methodology: Divide images into small regions. Create a similarity measure based on properties of these images. Use cluster analysis tools to generate clusters of similar images. Provide alternative representations of clusters. Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri for Browsing Large Collections of Geographic Images, May ( Thesaurus.html)

11

12 Information Visualization The human eye is excellent at identifying patterns in graphical data: trends in time-dependent data, broad patterns in complex data, and anomalies in scientific data. Visualizing information spaces for browsing.

13 Pad++ Concept. A large collection of information viewed at many different scales. Imagine a collection of documents spread out on an enormous wall. Zoom. Zoom out and see the whole collection with little detail. Zoom in part way to see sections of the collection. Zoom in to see every detail. Semantic Zooming. Objects change appearance when they change size, so as to be most meaningful. (Compare maps.) Performance. Rendering operations timed so that the frame refresh rate remains constant during pans and zooms.

14 Pad++ File Browser

15 Pad++ File Browser

16 Pad++ File Browser

17 Example: Tilebars The figure represents a set of hits from a text search. Each large rectangle represents a document or section of text. Each row represents a search term or subquery. The density of each small square indicates the frequency with which a term appears in a section of a document. Hearst 1995
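
The data behind a Tilebars-style display can be sketched as a grid with one row per search term and one column per section of text. The function names `tilebar` and `render`, the whitespace tokenization, and the four-level shading are assumptions for illustration, not details of Hearst's system.

```python
def tilebar(sections, query_terms):
    """Return one row per query term; each cell is that term's count in a section."""
    return [
        [section.lower().split().count(term) for section in sections]
        for term in query_terms
    ]

def render(grid, shades=" .:#"):
    """Map counts to characters of increasing density (counts above 3 are capped)."""
    return ["".join(shades[min(count, len(shades) - 1)] for count in row)
            for row in grid]

# Hypothetical document split into three sections.
sections = [
    "cluster analysis of terms",
    "terms terms and more terms",
    "nothing here",
]
grid = tilebar(sections, ["terms", "cluster"])
rows = render(grid)
```

Each string in `rows` corresponds to one row of small squares in the figure, with denser characters standing in for darker squares.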

18 Self Organizing Maps (SOM)

19 Automatic Thesaurus Construction Approach: Select a subject domain. Choose a corpus of documents that covers the domain. Create the vocabulary by extracting terms, normalization, precoordination of phrases, etc. Devise a measure of similarity between terms and thesaurus classes. Cluster terms into thesaurus classes, using complete linkage or another clustering method that generates compact clusters.
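
The clustering step can be sketched with a small complete-linkage routine. This is a minimal illustration under stated assumptions: the similarity values in `SIM`, the threshold of 0.5, and the function names are hypothetical, and a real thesaurus would derive similarities from a corpus.

```python
def complete_linkage(terms, similarity, threshold):
    """Agglomerative clustering that merges while the weakest link stays >= threshold."""
    clusters = [[t] for t in terms]

    def link(c1, c2):
        # Complete linkage: cluster similarity is the minimum pairwise similarity,
        # which is what keeps the resulting classes compact.
        return min(similarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical term-term similarities; unlisted pairs count as 0.
SIM = {
    frozenset(["teaching", "education"]): 0.9,
    frozenset(["teaching", "instruction"]): 0.8,
    frozenset(["education", "instruction"]): 0.7,
    frozenset(["bridge", "cable"]): 0.6,
}

def sim(a, b):
    return SIM.get(frozenset([a, b]), 0.0)

classes = complete_linkage(
    ["teaching", "education", "instruction", "bridge", "cable"], sim, 0.5
)
```

Because every pair inside a class must meet the threshold, a single strong link is not enough to pull an unrelated term into a class, unlike single linkage.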

20 Decisions in creating a thesaurus 1. Which terms should be included in the thesaurus? 2. How should the terms be grouped?

21 Terms to include Only terms that are likely to be of interest for content identification. Ambiguous terms should be coded for the senses likely to be important in the document collection. Each thesaurus class should have approximately the same frequency of occurrence. Terms of negative discrimination should be eliminated. (After Salton and McGill.)

22 Discriminant value The discriminant value of term k is the degree to which the term is able to discriminate between the documents of a collection: DV(k) = (average document similarity without term k) - (average document similarity with term k). Good discriminators decrease the average document similarity. Note that this definition uses the document similarity.

23 Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf  | distinct terms
D1      1      1       1       1      1      1       1   |       7
D2      1      0       0       1      0      0       1   |       3
D3      0      1       1       0      1      1       0   |       4
D4      1      0       0       1      0      1       1   |       4

24 Document similarity matrix [4 × 4 matrix of pairwise similarities between D1, D2, D3 and D4; the entries are missing from this transcript.] Average similarity = 0.55
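
The pairwise similarities can be recomputed from the four example documents. The sketch below assumes cosine similarity over binary term vectors; the slide does not name the measure, but this assumption reproduces the stated average of 0.55. The names `DOCS`, `cosine`, and `similarity` are illustrative.

```python
from itertools import combinations
from math import sqrt

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def cosine(a, b):
    """Cosine similarity of two binary term vectors represented as sets."""
    return len(a & b) / sqrt(len(a) * len(b))

# Binary incidence: repeated words within a document count once.
term_sets = {name: set(text.split()) for name, text in DOCS.items()}

similarity = {
    (p, q): cosine(term_sets[p], term_sets[q])
    for p, q in combinations(term_sets, 2)
}
average = sum(similarity.values()) / len(similarity)

for pair, value in similarity.items():
    print(pair, round(value, 2))
print("average", round(average, 2))
```

The average is taken over the six distinct document pairs, excluding each document's similarity with itself.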

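25 Discriminant value Average similarity = 0.55 [Table with columns "without" (the term removed), "average similarity" and "DV", with one row each for alpha, bravo, charlie, delta, echo, foxtrot and golf; the values are missing from this transcript.] alpha, delta, foxtrot, golf are good discriminators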
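
The discriminant-value calculation can be sketched on the documents D1 to D4 by removing one term at a time and recomputing the average similarity. The sketch assumes cosine similarity over binary term vectors, which is not stated on the slides, and the function names are hypothetical. Under these assumptions it yields small positive values for bravo, charlie and echo and negative values for alpha, delta, foxtrot and golf, so the table on the slide may rest on a different similarity measure or sign convention; treat the output as illustrative only.

```python
from itertools import combinations
from math import sqrt

DOCS = [
    {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},  # D1
    {"alpha", "delta", "golf"},                                         # D2
    {"bravo", "charlie", "echo", "foxtrot"},                            # D3
    {"alpha", "delta", "foxtrot", "golf"},                              # D4
]

def average_similarity(docs):
    """Mean cosine similarity over all document pairs (binary term vectors)."""
    pairs = list(combinations(docs, 2))
    total = 0.0
    for a, b in pairs:
        if a and b:  # an empty document has similarity 0 with everything
            total += len(a & b) / sqrt(len(a) * len(b))
    return total / len(pairs)

def discriminant_values(docs):
    """DV(k) = average similarity without term k - average similarity with term k."""
    baseline = average_similarity(docs)
    vocabulary = sorted(set().union(*docs))
    return {
        k: average_similarity([d - {k} for d in docs]) - baseline
        for k in vocabulary
    }

dv = discriminant_values(DOCS)
for term, value in dv.items():
    print(f"{term:8s} {value:+.2f}")
```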

26 Phrase construction In a thesaurus, term classes may contain phrases. Informal definitions: pair-frequency(i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence). A phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency. $\text{cohesion}(i, j) = \dfrac{\text{pair-frequency}(i, j)}{\text{frequency}(i) \times \text{frequency}(j)}$

27 Phrase construction Salton and McGill algorithm: 1. Compute pair-frequency for all terms. 2. Reject all pairs that fall below a certain threshold. 3. Calculate cohesion values. 4. If cohesion is above a threshold value, consider the word pair as a phrase. Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.
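
The four steps can be sketched as below, using the informal cohesion definition from the previous slide. Several details are assumptions: "in context" is taken to mean adjacent words within a sentence, the threshold values are arbitrary, and the function name `find_phrases` and the sample sentences are hypothetical.

```python
from collections import Counter

def find_phrases(sentences, min_pair_freq=2, min_cohesion=0.4):
    """Return {(word_i, word_j): cohesion} for word pairs passing both thresholds."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        # Step 1: count pairs of words that occur in succession within a sentence.
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (i, j), n in pair_freq.items():
        if n < min_pair_freq:  # Step 2: reject pairs below the frequency threshold
            continue
        cohesion = n / (word_freq[i] * word_freq[j])  # Step 3
        if cohesion > min_cohesion:  # Step 4: keep pairs with high cohesion
            phrases[(i, j)] = cohesion
    return phrases

sentences = [
    "cluster analysis of the terms",
    "cluster analysis of the documents",
    "the size of the thesaurus",
]
phrases = find_phrases(sentences)
```

In this toy input "of the" is the most frequent pair, but its cohesion is low because both words are common overall, which is the effect the cohesion measure is meant to capture.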