CS 430: Information Discovery, Lecture 16: Thesaurus Construction

Presentation transcript:

1 CS 430: Information Discovery Lecture 16 Thesaurus Construction

2 Course Administration
Midterm examination:
  Grades will be mailed over the weekend.
  Answer books will not be returned.
  Most questions will be discussed in class.
  The question paper will be posted on the course web site.

3 Decisions in creating a thesaurus
  1. Which terms should be included in the thesaurus?
  2. How should the terms be grouped?

4 Terms to include
  Only terms that are likely to be of interest for content identification.
  Ambiguous terms should be coded for the senses likely to be important in the document collection.
  Each thesaurus class should have approximately the same frequency of occurrence.
  Terms of negative discrimination should be eliminated.
(after Salton and McGill)

5 Discriminant value
Discriminant value is the degree to which a term is able to discriminate between the documents of a collection:
  discriminant value of term k = (average document similarity without term k) - (average document similarity with term k)
Good discriminators decrease the average document similarity.
Note that this definition uses the document similarity.

6 Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf    n
D1      1      1       1       1      1       1      1     7
D2      1      0       0       1      0       0      1     3
D3      0      1       1       0      1       1      0     4
D4      1      0       0       1      0       1      1     4

7 Document similarity matrix
[Matrix of pairwise similarities between D1, D2, D3, and D4; the cell values are missing from this transcript.]
Average similarity = 0.55
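The slides do not say which document similarity produced the 0.55 average. As a minimal sketch, assuming the cosine measure applied to the binary incidence vectors of the four example documents, the pairwise similarities and their average can be computed as follows. The function names are illustrative, not taken from the lecture.

```python
from itertools import combinations
from math import sqrt

# The four example documents from the incidence-array slide.
DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def incidence(docs):
    """Map each document name to the set of distinct terms it contains."""
    return {name: set(text.split()) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity of two binary term vectors, given as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def average_similarity(term_sets):
    """Mean cosine similarity over all unordered document pairs."""
    pairs = list(combinations(term_sets.values(), 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


sets = incidence(DOCS)
for (name_a, set_a), (name_b, set_b) in combinations(sets.items(), 2):
    print(name_a, name_b, round(cosine(set_a, set_b), 2))
print("average", round(average_similarity(sets), 2))
```

Under this assumption the average rounds to 0.55, which agrees with the figure on the slide.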

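8 Discriminant value
Average similarity = 0.55
[Table with one row per term (alpha, bravo, charlie, delta, echo, foxtrot, golf) and columns "without", "average similarity", and "DV"; the cell values are missing from this transcript.]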
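The discriminant value of each term can be recomputed from the example documents. The sketch below again assumes cosine similarity over binary incidence vectors (the slides do not name the measure), removes one term at a time, and subtracts the baseline average from the new average. The helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Distinct terms of the four example documents D1..D4.
TERM_SETS = [
    {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},
    {"golf", "delta", "alpha"},
    {"bravo", "charlie", "echo", "foxtrot"},
    {"foxtrot", "alpha", "golf", "delta"},
]


def avg_sim(term_sets):
    """Average cosine similarity over all document pairs (assumed measure)."""
    total, count = 0.0, 0
    for a, b in combinations(term_sets, 2):
        if a and b:
            total += len(a & b) / sqrt(len(a) * len(b))
        count += 1
    return total / count


def discriminant_values(term_sets):
    """DV_k = (average similarity without term k) - (average similarity with it)."""
    baseline = avg_sim(term_sets)
    vocabulary = sorted(set().union(*term_sets))
    return {
        term: avg_sim([s - {term} for s in term_sets]) - baseline
        for term in vocabulary
    }


dv = discriminant_values(TERM_SETS)
for term, value in dv.items():
    print(f"{term:8s} {value:+.3f}")
```

With this measure, terms that occur in most documents (such as alpha) come out with a negative discriminant value, while terms confined to fewer documents (such as bravo) come out positive.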

9 Similarities
Automatic thesaurus construction depends on a measure of similarity between terms.
One measure of similarity is the number of documents that have terms j and k in common:
  \[ S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} \, t_{ik} \]
where t_{ij} = 1 if document i contains term j and 0 otherwise.

10 Similarity measures
Improved similarity measures can be generated by:
  Using the term frequency matrix instead of the incidence matrix
  Weighting terms by frequency:
    cosine measure
      \[ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{|t_j| \, |t_k|} \]
    dice measure
      \[ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}} \]

11 Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf    n
D1      1      1       1       1      1       1      1     7
D2      1      0       0       1      0       0      1     3
D3      0      1       1       0      1       1      0     4
D4      1      0       0       1      0       1      1     4

12 Term similarity matrix
[Matrix of pairwise similarities between the terms alpha, bravo, charlie, delta, echo, foxtrot, and golf; the cell values are missing from this transcript.]
Using incidence matrix and dice weighting
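The term similarities can be recomputed from the example documents. The sketch below uses the incidence matrix and the dice measure in the form written on the similarity-measures slide, that is, without the factor of 2 found in the usual Dice coefficient. Function and variable names are illustrative assumptions.

```python
# The four example documents from the incidence-array slide.
DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def postings(docs):
    """Map each term to the set of documents that contain it."""
    index = {}
    for name, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(name)
    return index


def dice_as_on_slide(docs_j, docs_k):
    """Shared documents divided by the sum of the two document counts."""
    return len(docs_j & docs_k) / (len(docs_j) + len(docs_k))


index = postings(DOCS)
terms = sorted(index)

# Print the upper triangle of the term similarity matrix.
for row, t_j in enumerate(terms):
    for t_k in terms[row + 1:]:
        print(f"{t_j:8s} {t_k:8s} {dice_as_on_slide(index[t_j], index[t_k]):.2f}")
```

For example, alpha and delta both occur in D1, D2, and D4, so their similarity is 3 / (3 + 3) = 0.5, while alpha and bravo share only D1 and score 1 / (3 + 2) = 0.2.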

13 Clustering -- nearest neighbor
[Figure: the terms alpha, delta, golf, echo, bravo, charlie, and foxtrot drawn as nodes, joined by links numbered 1 to 6.]
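One common reading of nearest-neighbor clustering is single-link clustering: two terms land in the same class whenever a chain of sufficiently similar pairs connects them. The sketch below is an assumption about the method, not a reconstruction of the figure. It uses the slide-style dice measure on the incidence data, and the threshold values are hypothetical.

```python
from itertools import combinations

# Documents (by number) containing each term, read off the incidence array.
POSTINGS = {
    "alpha": {1, 2, 4}, "bravo": {1, 3}, "charlie": {1, 3}, "delta": {1, 2, 4},
    "echo": {1, 3}, "foxtrot": {1, 3, 4}, "golf": {1, 2, 4},
}


def sim(a, b):
    """Dice measure as written on the slides (no factor of 2)."""
    return len(POSTINGS[a] & POSTINGS[b]) / (len(POSTINGS[a]) + len(POSTINGS[b]))


def single_link_clusters(terms, threshold):
    """Group terms connected by chains of pairs with similarity >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent pointers to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


print(single_link_clusters(sorted(POSTINGS), 0.5))
print(single_link_clusters(sorted(POSTINGS), 0.4))
```

At a threshold of 0.5 this yields three classes (alpha, delta, golf; bravo, charlie, echo; foxtrot alone). Lowering the threshold to 0.4 attaches foxtrot to the bravo, charlie, echo class.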

14 Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
  pair-frequency(i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)
  a phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency
  \[ \text{cohesion}(i, j) = \frac{\text{pair-frequency}(i, j)}{\text{frequency}(i) \cdot \text{frequency}(j)} \]

15 Phrase construction
Salton and McGill algorithm
  1. Compute pair-frequency for all terms.
  2. Reject all pairs that fall below a certain threshold.
  3. Calculate cohesion values.
  4. If cohesion is above a threshold value, consider the word pair as a phrase.
Automatic phrase construction by statistical methods is rarely used in practice.
There is promising research on phrase identification using methods of computational linguistics.
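The four steps can be sketched in a few lines. This is a minimal illustration rather than the original algorithm's implementation: it assumes "in context" means adjacent words in the same sentence, uses whitespace tokenization, and the threshold defaults and sample sentences are invented for the example.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq=2, min_cohesion=0.1):
    """Return {(word_i, word_j): cohesion} for pairs accepted as phrases."""
    word_freq = Counter()
    pair_freq = Counter()

    # Step 1: count words and adjacent word pairs within each sentence.
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (w_i, w_j), freq in pair_freq.items():
        # Step 2: reject pairs whose pair-frequency is below the threshold.
        if freq < min_pair_freq:
            continue
        # Step 3: cohesion = pair-frequency / (frequency(i) * frequency(j)).
        cohesion = freq / (word_freq[w_i] * word_freq[w_j])
        # Step 4: keep the pair as a phrase if cohesion is above the threshold.
        if cohesion > min_cohesion:
            phrases[(w_i, w_j)] = cohesion
    return phrases


sample = [
    "information retrieval systems rank documents",
    "digital libraries support information retrieval",
    "the library holds the documents",
]
print(find_phrases(sample))
```

In the sample, only "information retrieval" occurs twice as an adjacent pair, and its cohesion is 2 / (2 * 2) = 0.5, so it is the only pair reported.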

16 Types of Information Discovery
[Diagram labels: media type (text; image, video, audio, etc.); searching; browsing; linking; statistical; user-in-loop; catalogs, indexes (metadata); CS 502; natural language processing; CS 474]