(C) 2003, The University of Michigan 1 Information Retrieval Handout #5 January 28, 2005.



(C) 2003, The University of Michigan 2 Course Information
Instructor: Dragomir R. Radev
Office: 3080, West Hall Connector
Phone: (734)
Office hours: M & Th 12-1 or via
Course page:
Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

(C) 2003, The University of Michigan 3 Approximate string matching

(C) 2003, The University of Michigan 4 Levenshtein edit distance
Examples:
– theatre -> theater
– Ghaddafi -> Qadafi
– computer -> counter
Edit distance (inserts, deletes, substitutions)
– Edit transcript
Done through dynamic programming

(C) 2003, The University of Michigan 5 Recurrence relation
Three dependencies
– D(i,0) = i
– D(0,j) = j
– D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
Simple edit distance:
– t(i,j) = 0 iff S1(i) = S2(j), and 1 otherwise
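A minimal Python sketch of the recurrence on this slide, assuming unit costs for inserts, deletes, and substitutions. The function name `edit_distance` is illustrative and does not come from the handout.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit costs, filled in row by row."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base case D(i,0) = i: delete i characters
    for j in range(m + 1):
        D[0][j] = j  # base case D(0,j) = j: insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1  # t(i,j)
            D[i][j] = min(D[i - 1][j] + 1,      # delete from s1
                          D[i][j - 1] + 1,      # insert into s1
                          D[i - 1][j - 1] + t)  # match or substitute
    return D[n][m]


print(edit_distance("theatre", "theater"))  # 2
print(edit_distance("vintner", "writers"))  # 5
```

The table takes O(nm) time and space; keeping only the previous row would cut the space to O(m) at the cost of losing the traceback.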

(C) 2003, The University of Michigan 6 Example (Gusfield 1997)
            W  R  I  T  E  R  S
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
V   1    1
I   2    2
N   3    3
T   4    4
N   5    5
E   6    6
R   7    7

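(C) 2003, The University of Michigan 7 Example (cont’d) (Gusfield 1997)
            W  R  I  T  E  R  S
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
V   1    1  1  2  3  4  5  6  7
I   2    2  2  2  2  3  4  5  6
N   3    3  3  3  3  3  4  5  6
T   4    4  4  4  4  *
N   5    5
E   6    6
R   7    7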

(C) 2003, The University of Michigan 8 Tracebacks (Gusfield 1997) [table: the VINTNER/WRITERS dynamic programming table with traceback pointers]
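The traceback idea can be sketched in Python as follows. This is an illustrative sketch rather than the handout's demo code: it refills the unit-cost table and then walks back from D(n,m) to D(0,0), emitting one edit transcript. The operation letters (M = match, R = replace, I = insert, D = delete) and the preference for diagonal moves on ties are assumptions made here.

```python
def edit_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript that turns s1 into s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Walk back from the bottom-right cell, choosing a neighbor that
    # explains the current value (diagonal first when several do).
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # delete s1[i-1]
            i -= 1
        else:
            ops.append("I")  # insert s2[j-1]
            j -= 1
    return "".join(reversed(ops))


transcript = edit_transcript("vintner", "writers")
print(transcript, sum(op != "M" for op in transcript))
```

Several optimal transcripts usually exist; which one is returned depends only on the tie-breaking order in the walk.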

(C) 2003, The University of Michigan 9 Weighted edit distance
Used to emphasize the relative cost of different edit operations
Useful in bioinformatics
– Homology information
– BLAST
– Blosum
– http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
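A hedged sketch of weighted edit distance in Python: the same dynamic program, but with the substitution cost supplied by the caller and a configurable insert/delete cost. The `vowel_aware_cost` function is a made-up toy weighting for illustration only; it is not BLOSUM, which assigns similarity scores to amino-acid pairs.

```python
from typing import Callable


def weighted_edit_distance(s1: str, s2: str,
                           sub_cost: Callable[[str, str], float],
                           indel_cost: float = 1.0) -> float:
    """Edit distance where each operation carries its own cost."""
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel_cost,
                          D[i][j - 1] + indel_cost,
                          D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return D[n][m]


VOWELS = set("aeiou")


def vowel_aware_cost(a: str, b: str) -> float:
    """Hypothetical weighting: swapping one vowel for another is cheap."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


print(weighted_edit_distance("cat", "cut", vowel_aware_cost))  # 0.5
print(weighted_edit_distance("cat", "car", vowel_aware_cost))  # 1.0
```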

(C) 2003, The University of Michigan 10 Web sites: – –
Demo:
– /clair4/class/ir-w05/dp
– ./dp.pl theater theatre

(C) 2003, The University of Michigan 11 Clustering

(C) 2003, The University of Michigan 12 Clustering
Exclusive/overlapping clusters
Hierarchical/flat clusters
The cluster hypothesis
– Documents in the same cluster tend to be relevant to the same queries

(C) 2003, The University of Michigan 13 Representations for document clustering
Typically: vector-based
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.
Each document is represented as a vector in an n-dimensional space
Similar documents appear nearby in the vector space (distance measures are needed)
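As an illustration of the vector-based representation, the sketch below turns each document into a sparse term-count vector and compares two vectors by cosine similarity. Cosine is one common choice and an assumption here; the slide does not name a specific measure. Whitespace tokenization and the function names are also illustrative.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Sparse term-count vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


d1 = to_vector("cat dog")
d2 = to_vector("cat cat")
d3 = to_vector("fish bird")
print(cosine_similarity(d1, d2))  # about 0.707
print(cosine_similarity(d1, d3))  # 0.0
```

A distance can be derived as 1 minus the similarity when an algorithm expects "smaller is closer".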

(C) 2003, The University of Michigan 14 Hierarchical clustering
Dendrograms
E.g., language similarity:

(C) 2003, The University of Michigan 15 Another example
Kingdom = animal
Phylum = Chordata
Subphylum = Vertebrata
Class = Osteichthyes
Subclass = Actinopterygii
Order = Salmoniformes
Family = Salmonidae
Genus = Oncorhynchus
Species = Oncorhynchus kisutch (Coho salmon)

(C) 2003, The University of Michigan 16 Clustering using dendrograms
REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A
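The REPEAT/UNTIL loop can be sketched in Python as below. Everything specific is an assumption made for illustration: Jaccard overlap of token sets as the similarity, the union of member tokens as the representation of a merged node, and the four toy sentences (the sentence boundaries of the slide's example did not survive in this transcript, so it is not reused).

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two token sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(docs: dict):
    """docs maps a label to a list of tokens (at least one document).

    Returns the dendrogram as nested tuples of labels.
    """
    # Each node is (tree, tokens); a merged node keeps the union of its tokens.
    nodes = [(label, frozenset(tokens)) for label, tokens in docs.items()]
    while len(nodes) > 1:
        # Compute pairwise similarities and identify the closest pair.
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: jaccard(nodes[p[0]][1], nodes[p[1]][1]))
        # Merge the pair into a single node.
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] | nodes[j][1])
        nodes = [nd for k, nd in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


toy = {"s1": "A B C".split(), "s2": "A B D".split(),
       "s3": "E F G".split(), "s4": "E F H".split()}
print(agglomerate(toy))  # (('s1', 's2'), ('s3', 's4'))
```

Recomputing all pairwise similarities on every pass is simple but costly; practical implementations cache the similarity matrix and update only the merged row.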

(C) 2003, The University of Michigan 17 Methods
Single-linkage
– One common pair is sufficient
– Disadvantages: long chains
Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
Average-linkage
Centroid-based (online)
– Look at distances to centroids
Demo:
– /clair4/class/ir-w05/clustering
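To make the differences between these methods concrete, here is a small Python sketch that scores the same pair of clusters under each criterion. It is phrased in terms of Euclidean distance between 2-D points (smaller means closer), which is an assumption; the slide describes the criteria in terms of matching pairs. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product


def single_linkage(c1, c2):
    """Distance of the closest cross-cluster pair (one close pair suffices)."""
    return min(math.dist(p, q) for p, q in product(c1, c2))


def complete_linkage(c1, c2):
    """Distance of the farthest cross-cluster pair (all pairs must be close)."""
    return max(math.dist(p, q) for p, q in product(c1, c2))


def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))


a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
print(single_linkage(a, b))     # 2.0
print(complete_linkage(a, b))   # 5.0
print(average_linkage(a, b))    # 3.5
print(centroid_distance(a, b))  # 3.5
```

Single linkage reports the two clusters as much closer than complete linkage does, which is why it tends to chain clusters together while complete linkage merges conservatively.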

(C) 2003, The University of Michigan 18 k-means
Needed: small number k of desired clusters
Hard vs. soft decisions
Example: Weka

(C) 2003, The University of Michigan 19 k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while
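A minimal Python sketch of the pseudocode on this slide, under a few stated assumptions: documents are dense numeric tuples, distance is Euclidean, the caller supplies the initial centroids, and "no further improvement" is taken to mean that no assignment changed. The toy points are invented for illustration. `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Hard-assignment k-means; returns (assignment, centroids)."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each document to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centroids would not change either
        assignment = new_assignment
        # Recompute each centroid as the mean of its documents.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centers = kmeans(points, centroids=[(1, 1), (8, 8)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

The result depends on the initial centroids, so in practice the algorithm is often run several times from different starting points.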

(C) 2003, The University of Michigan 20 Example
Cluster the following vectors into two groups:
– A =
– B =
– C =
– D =
– E =
– F =

(C) 2003, The University of Michigan 21 Complexity
Complexity = O(kn) per iteration, because at each step each of the n documents has to be compared to each of the k centroids.

(C) 2003, The University of Michigan 22 Weka
A general environment for machine learning (e.g. for classification and clustering)
Book by Witten and Frank

(C) 2003, The University of Michigan 23 Demos
– http:// pletKM.html
– http:// r
% cd /data2/tools/weka
% export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar
% java weka.clusterers.SimpleKMeans -t data/weather.arff

(C) 2003, The University of Michigan 24 Human clustering
Human subjects disagree significantly on the number of clusters, the overlap of clusters, and the composition of clusters (Macskassy et al. 1998).