Lecture 6: Clustering
楊立偉教授 台灣科大資管系 wyang@ntu.edu.tw
(These slides are adapted from the slides accompanying the book Introduction to Information Retrieval, Ch. 16 & 17.)

Clustering: Introduction

Clustering: Definition
(Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data.

A data set with a clear cluster structure. Exercise: propose an algorithm for finding the cluster structure in this example. [Figure: 2-D scatter plot of points with visually apparent clusters]

Classification vs. Clustering
Classification: supervised learning. Classes are human-defined and part of the input to the learning algorithm.
Clustering: unsupervised learning. Clusters are inferred from the data without human input.

Why cluster documents?
- Whole corpus analysis/navigation: better user interface (analysis and navigation of the whole document collection)
- For improving recall in search applications: better search results
- For better navigation of search results: effective "user recall" will be higher
- For speeding up vector space retrieval: faster search

Yahoo! Hierarchy (www.yahoo.com/Science)
[Figure: a topic tree with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry; botany, cell, evolution; magnetism, relativity; AI, HCI, courses; craft, missions.]
Final data set: Yahoo Science hierarchy, consisting of the text of web pages pointed to by Yahoo, gathered in summer 1997. 264 classes; only a sample is shown here.

For visualizing a document collection: Wise et al., "Visualizing the non-visual", PNNL; ThemeScapes, Cartia. [Mountain height = cluster size]

For improving search recall
Cluster hypothesis: "closely associated documents tend to be relevant to the same requests". Therefore, to improve search recall:
- Cluster the docs in the corpus first.
- When a query matches a doc D, also return the other docs in the cluster containing D (i.e., suggest the whole matching cluster).
Hope if we do this: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen? Because they share similar document features.
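To make the mechanism concrete, here is a minimal sketch, assuming cluster assignments have already been computed somehow; `cluster_of` and the matched-document set below are hypothetical placeholders:

```python
from collections import defaultdict

def expand_with_clusters(matched_docs, cluster_of):
    """Return the matched docs plus all other docs in the clusters they belong to."""
    members = defaultdict(set)
    for doc, cluster in cluster_of.items():   # invert doc -> cluster into cluster -> docs
        members[cluster].add(doc)
    expanded = set(matched_docs)
    for doc in matched_docs:
        expanded |= members[cluster_of[doc]]  # also return the whole cluster containing doc
    return expanded

# Toy example: the query "car" matches only d1, but d2 ("automobile") is in the same cluster.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
print(expand_with_clusters({"d1"}, cluster_of))   # {'d1', 'd2'}
```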

For better navigation of search results For grouping search results thematically clusty.com / Vivisimo (Enterprise Search – Velocity)

Issues for clustering (1)
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
Representation for clustering:
- Document representation: how to represent a document
- Need a notion of similarity/distance: how to measure similarity between documents

Issues for clustering (2)
How to decide the number of clusters:
- Fixed a priori: assume the number of clusters K is given.
- Data driven: semiautomatic methods for determining K
- Avoid very small and very large clusters
- Define clusters that are easy to explain to the user

Clustering Algorithms
Flat (partitional) algorithms (non-hierarchical clustering):
- Usually start with a random (partial) partitioning
- Refine it iteratively (keep adjusting the assignment)
- e.g., K-means clustering
Hierarchical algorithms:
- Create a hierarchy
- Bottom-up, agglomerative (merge upward)
- Top-down, divisive (split downward)

Flat (Partitioning) Algorithms
Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions (finding the truly best partition is usually too time-consuming).
Effective heuristic methods: K-means and K-medoids algorithms (a heuristic, approximate solution is good enough in practice).

Hard vs. Soft clustering
Hard clustering: each document belongs to exactly one cluster. More common and easier to do.
Soft clustering: a document can belong to more than one cluster. Useful for applications like creating browsable hierarchies. Ex.: put "sneakers" in two clusters, sports apparel and shoes; you can only do that with a soft clustering approach.
We will do hard clustering only in this class. See IIR 16.5, IIR 17, IIR 18 for soft clustering.

K-means algorithm

K-means
Perhaps the best known clustering algorithm. Simple, works well in many cases. Used as the default / baseline for clustering documents.

K-means
In the vector space model, K-means assumes documents are real-valued vectors. Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:
μ(c) = (1/|c|) Σ_{x∈c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-means algorithm
1. Select K random docs {s1, s2, ..., sK} as seeds (pick the initial seeds first).
2. Until clustering converges or another stopping criterion holds, repeat:
   2.1 For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal (put the document into the nearest cluster).
   2.2 For each cluster cj: sj = μ(cj) (update the seeds to the centroid of each cluster, then iterate again).
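A minimal NumPy sketch of these steps, assuming the documents are already represented as rows of a dense real-valued matrix `X` and using Euclidean distance; this is an illustration, not an optimized implementation:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select K random docs as seeds.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # 2.1 Assign each doc to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 2.2 Update each seed to the centroid (mean) of its cluster.
        new_centroids = np.array([
            X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return assign, centroids
```

Calling `kmeans(doc_vectors, K=2)` returns one cluster index per document plus the final centroids.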

K-means algorithm

K-means example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign → ... → converged!]
In practice the clustering is usually roughly stable after 3 to 4 iterations (but this depends on the data and the number of clusters).

Termination conditions
Several possibilities, e.g.:
- A fixed number of iterations.
- The doc partition is unchanged.
- The centroid positions don't change.

Convergence of K-Means
Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change (convergence)?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge, although the number of iterations could be large. (In theory it is guaranteed to converge; the only question is how many iterations it takes.)

Convergence of K-Means: proof sketch
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
Gk = Σi (di − ck)²  (sum over all di in cluster k)
G = Σk Gk
(i.e., compute the squared distance between each document and its cluster centroid, then sum them all up)
Reassignment monotonically decreases G, since each vector is moved to the closest centroid; recomputing each centroid as the cluster mean also does not increase G. Each iteration therefore only makes G smaller (or leaves it unchanged).

Time Complexity
Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
Reassigning clusters: O(Kn) distance computations, i.e., O(Knm).
Computing centroids: each doc gets added once to some centroid: O(nm).
Assume these two steps are each done once in each of I iterations: O(IKnm).
(I iterations; K clusters; n documents; m terms → slow and not scalable. Possible improvements: use approximation, sampling, or seed-selection techniques to speed it up.)

Issue (1): Seed Choice
Results can vary based on the random seed selection. Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
- Try out multiple starting points (see the restart sketch below)
[Example showing sensitivity to seeds: six points A-F.] In that example, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
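One way to mitigate seed sensitivity, sketched below on top of the `kmeans` function from the earlier sketch: run it from several random starting points and keep the result with the smallest within-cluster sum of squares G.

```python
def kmeans_restarts(X, K, n_restarts=10):
    best_G, best = float("inf"), None
    for seed in range(n_restarts):
        assign, centroids = kmeans(X, K, seed=seed)
        # G = sum of squared distances of each point to its assigned centroid
        G = float(((X - centroids[assign]) ** 2).sum())
        if G < best_G:
            best_G, best = G, (assign, centroids)
    return best
```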

Issue (2): How Many Clusters?
So far the number of clusters K is given: partition n docs into a predetermined number of clusters. But finding the "right" number of clusters is part of the problem: suppose we do not even know how many clusters there should be; given the docs, partition them into an "appropriate" number of subsets. E.g., for query results the ideal value of K is not known up front (though the UI may impose limits); when clustering search results we usually cannot fix the number of groups in advance.

If K is not specified in advance
Suggest K automatically, using heuristics based on N or a K vs. cluster-size diagram. There is a tradeoff to weigh between having fewer clusters (better focus within each cluster) and having too many clusters.

K-means variations
Recomputing the centroid after every single assignment (rather than after all points are reassigned) can improve the speed of convergence of K-means.

Evaluation of Clustering

What Is A Good Clustering?
Internal criterion: a good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high (the more homogeneous each cluster, the better)
- the inter-class similarity is low (large differences between clusters)
The measured quality of a clustering depends on both the document representation and the similarity measure used.

External criteria for clustering quality
Based on a gold-standard data set (ground truth), e.g., the Reuters collection we also used for the evaluation of classification.
Goal: the clustering should reproduce the classes in the gold standard. Quality is measured by the clustering's ability to discover some or all of the hidden patterns (i.e., we judge how good the clustering is by how few members end up in the wrong group).

External criterion: Purity
Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes. For each cluster ωk: find the class cj with the most members nkj in ωk. Sum all the nkj and divide by the total number of points:
purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
(purity is the proportion of each cluster taken up by its single most frequent class, summed over all clusters.)

Example for computing purity
To compute purity:
5 = maxj |ω1 ∩ cj| (class x, cluster 1)
4 = maxj |ω2 ∩ cj| (class o, cluster 2)
3 = maxj |ω3 ∩ cj| (class ⋄, cluster 3)
Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
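The same numbers can be reproduced with a small purity function; the label lists below are one arrangement consistent with the counts on the slide (majority counts 5, 4, 3 out of 17 points):

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of gold-standard class labels per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

clusters = [
    ["x"] * 5 + ["o"],            # cluster 1: 6 points, majority class x (5)
    ["o"] * 4 + ["x", "d"],       # cluster 2: 6 points, majority class o (4)
    ["d"] * 3 + ["x", "x"],       # cluster 3: 5 points, majority class d (3)
]
print(round(purity(clusters), 2))  # 0.71
```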

Rand Index
Count the number of pairs of points falling into each of four cases:
                                     Same cluster in clustering | Different clusters in clustering
Same class in ground truth                       A             |                C
Different classes in ground truth                B             |                D

Rand index: symmetric version
RI = (A + D) / (A + B + C + D)
Compare with standard Precision and Recall: P = A / (A + B), R = A / (A + C).

Rand Index example
                                     Same cluster | Different clusters
Same class in ground truth               A = 20   |      C = 24
Different classes in ground truth        B = ?    |      D = 72
With 17 points there are C(17,2) = 136 pairs in total, so B = 136 − 20 − 24 − 72 = 20, and
RI = (20 + 72) / 136 ≈ 0.68.
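A small sketch that counts the four kinds of pairs explicitly; the per-point (cluster, class) assignments below are one arrangement consistent with the counts above and with the 17-point purity example:

```python
from itertools import combinations

def rand_index(cluster_labels, class_labels):
    A = B = C = D = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            A += 1          # same cluster, same class
        elif same_cluster:
            B += 1          # same cluster, different classes
        elif same_class:
            C += 1          # different clusters, same class
        else:
            D += 1          # different clusters, different classes
    return (A + D) / (A + B + C + D)

points = ([(1, "x")] * 5 + [(1, "o")] +
          [(2, "o")] * 4 + [(2, "x"), (2, "d")] +
          [(3, "d")] * 3 + [(3, "x")] * 2)
clusters, classes = zip(*points)
print(round(rand_index(clusters, classes), 2))   # 0.68
```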

Hierarchical Clustering

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. One approach: recursive application of a partitional clustering algorithm (the hierarchy can be built by running a clustering algorithm repeatedly at each level).
[Example taxonomy: animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]

Dendrogram: Hierarchical Clustering
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster. (Another way to think about it: cut the hierarchy tree horizontally; whatever remains connected below the cut forms one cluster.)

Hierarchical Clustering algorithms
Agglomerative (bottom-up): start with each document being a single cluster; eventually all documents belong to the same cluster.
Divisive (top-down): start with all documents belonging to the same cluster; eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance (no need to decide the number of groups beforehand), but needs a termination condition.

Hierarchical Agglomerative Clustering (HAC) Algorithm
Start with each doc in a separate cluster (every document initially forms its own cluster), then repeatedly join the closest pair of clusters until there is only one cluster. The history of merging forms a binary tree or hierarchy.

Dendrogram: Document Example
As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.
[Dendrogram over d1-d5: d4 and d5 merge into {d4,d5}, then d3 joins to form {d3,d4,d5}; d1 and d2 merge into {d1,d2}.]

Closest pair of clusters (how do we measure the distance between two clusters?)
There are many variants for defining the closest pair of clusters; a small sketch using these linkage options follows below.
- Single-link: similarity of the most cosine-similar pair (the closest pair of points between the two clusters represents them)
- Complete-link: similarity of the "furthest" points, the least cosine-similar pair (the farthest pair of points represents them)
- Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine similarity between pairs of elements (average over all cross-cluster pairs)
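SciPy's hierarchical clustering implements several of these linkage criteria directly; a short sketch (the random matrix stands in for real document vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 50)   # 10 "documents" with 50-dimensional vectors (placeholder data)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="cosine")    # merge history: closest pair first
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, labels)
```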

Single Link Agglomerative Clustering
Use the maximum similarity over pairs:
sim(ci, cj) = max_{x∈ci, y∈cj} sim(x, y)
Can result in "straggly" (long, thin, loosely connected) clusters due to the chaining effect. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim(ci ∪ cj, ck) = max(sim(ci, ck), sim(cj, ck))

Single Link Example

Complete Link Agglomerative Clustering
Use the minimum similarity over pairs:
sim(ci, cj) = min_{x∈ci, y∈cj} sim(x, y)
Makes "tighter", more spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
sim(ci ∪ cj, ck) = min(sim(ci, ck), sim(cj, ck))

Complete Link Example

Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²) (pairwise document similarity).
In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. Including the merging process, this gives O(n² log n) overall.

Key notion: cluster representative
We want a notion of a representative point in a cluster (how do we represent the whole group? — we can use the centroid or some other point). The representative should be some sort of "typical" or central point in the cluster, e.g., its centroid or a medoid (see below).

Example: n = 6, k = 3, closest pair of centroids.
[Figure: d1 and d2 merge; the centroid after the first step and after the second step are shown.]

Outliers in centroid computation
We can ignore outliers when computing the centroid. What is an outlier? There are lots of statistical definitions, e.g., moment of a point to the centroid > M × some cluster moment, say M = 10. (Simply put, a point that lies too far away, e.g., 10× farther than typical, is just ignored.)

Using a Medoid As Cluster Representative
The centroid does not have to be a document. Medoid: a cluster representative that is one of the actual documents (a single document used to stand for the whole cluster), e.g., the document closest to the centroid.
Why use a medoid? Consider the representative of a large cluster (>1000 documents): the centroid of this cluster will be a dense vector, whereas the medoid of this cluster will be a sparse vector.
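A minimal sketch of the "document closest to the centroid" definition of a medoid, assuming the cluster's documents are rows of a NumPy array:

```python
import numpy as np

def medoid(cluster_docs):
    """Index of the document closest to the cluster centroid."""
    centroid = cluster_docs.mean(axis=0)
    dists = np.linalg.norm(cluster_docs - centroid, axis=1)
    return int(dists.argmin())

docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
print(medoid(docs))   # 1 -- an actual (sparse) document, not the dense average
```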

Clustering: discussion

Feature selection (choose good terms before clustering)
Which terms should be used as the axes of the vector space? IDF is a form of feature selection: keep the most discriminating terms (terms with high discriminating power). Ex.: use only nouns/noun phrases.

Labeling (adding labels to the resulting clusters)
After the clustering algorithm finds clusters, how can they be made useful to the end user? We need a pithy (concise) label for each cluster. In search results, say "Animal" or "Car" in the jaguar example. In topic trees (Yahoo), we need navigational cues. Often done by hand, a posteriori (edited manually after the fact).

How to Label Clusters
- Show the titles of typical documents (use the titles of a few representative documents as the label)
- Show words/phrases prominent in the cluster (use a few representative terms as the label); more likely to fully represent the cluster
- Use distinguishing words/phrases (in combination with automatic keyword-extraction techniques)

Labeling
Common heuristic: list the 5-10 most frequent terms in the centroid vector (typically 5 to 10 terms are used to represent the cluster); a small sketch follows below.
Differential labeling by frequent terms: within a collection about "Computers", all clusters have the word "computer" as a frequent term, so use discriminant analysis of the centroids (we need to pick terms with discriminating power).
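A short sketch of the frequent-terms heuristic: take the centroid of a cluster's tf-idf vectors and list its highest-weighted terms as the label. The three example documents and the use of scikit-learn's TfidfVectorizer are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

cluster_docs = [
    "the jaguar is a large cat living in the americas",
    "jaguar cats hunt at night in the rainforest",
    "the jaguar cat preys on deer and capybara",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cluster_docs)
centroid = np.asarray(X.mean(axis=0)).ravel()     # centroid of the cluster's tf-idf vectors
terms = vectorizer.get_feature_names_out()
top5 = centroid.argsort()[::-1][:5]               # indices of the 5 highest-weighted terms
print([terms[i] for i in top5])                   # candidate label terms for this cluster
```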