Hierarchical Topic Detection: UMass at TDT 2004
Ao Feng, James Allan
Center for Intelligent Information Retrieval, University of Massachusetts Amherst


Task this year
- 4 times the size of TDT4 (407,503 stories in three languages)
- Many clustering algorithms are not feasible: anything with complexity Ω(n²) takes too long (roughly 8 × 10^10 story pairs)
- Time was limited - one month
- Pilot study this year
- We need a simple algorithm that can finish in a short time

HTD system of UMass
- Two-step clustering
- Step 1 - K-NN
- Step 2 - agglomerative clustering
- (Slide flowchart: items are merged only if similarity > threshold)

Step 1 - event threading
Why event threading?
- Event: something that happens at a specific time and location
- An event contains multiple stories
- Each topic is composed of one or more related events
- Events have temporal locality
What do we do?
- Each story is compared to a limited number of previous stories (see the sketch below)
- For simplicity, events do not overlap (a simplifying assumption that is not strictly true)
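
Below is a minimal sketch of this windowed comparison, assuming a caller-supplied story-vs-event similarity function. The threshold and window values, the deque bookkeeping, and the function names are illustrative, not the exact UMass implementation.

```python
from collections import deque

def thread_events(stories, similarity, threshold=0.3, window=100):
    """Assign each incoming story to the most similar recent event,
    or start a new event when nothing recent is similar enough.
    Events never overlap: every story lands in exactly one event."""
    events = []                     # all events found so far (lists of stories)
    recent = deque(maxlen=window)   # indices of events still eligible for comparison
    for story in stories:           # stories are assumed to arrive in time order
        best, best_sim = None, 0.0
        for idx in recent:
            sim = similarity(story, events[idx])
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim > threshold:
            events[best].append(story)        # extend an existing event
        else:
            recent.append(len(events))        # track the new event as "recent"
            events.append([story])            # start a new event
    return events
```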

Step 2 - agglomerative clustering
- Plain agglomerative clustering has complexity Ω(n²), so a modification is required
- Online clustering algorithm with a limited window size (a rough sketch follows below)
- Merge until 1/3 of the clusters in the window remain
- The first half of the surviving clusters is removed and new events come in
- Clusters do not overlap
- Assumption: stories from the same source are more likely to belong to the same topic
  - Clusters from the same source are merged first
  - Then clusters in the same language
  - Finally clusters across all languages
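
A rough sketch of the windowed agglomerative step, under stated assumptions: greedy best-pair merging, a caller-supplied cluster similarity, and the 1/3 and one-half fractions taken literally from the slide. The source-first, language-second merge ordering is not modelled here.

```python
def windowed_agglomerative(events, cluster_similarity, window=120):
    """Online agglomerative clustering over a stream of events.
    When the window fills, clusters are merged until 1/3 remain, then the
    older half of the survivors is frozen so new events can enter."""
    active, frozen = [], []
    for event in events:
        active.append([event])                    # each event starts as its own cluster
        if len(active) >= window:
            while len(active) > window // 3:      # merge until 1/3 are left
                i, j = max(
                    ((a, b) for a in range(len(active)) for b in range(a + 1, len(active))),
                    key=lambda p: cluster_similarity(active[p[0]], active[p[1]]),
                )
                active[i] = active[i] + active[j]  # flat merge; the real system keeps a hierarchy
                del active[j]
            half = len(active) // 2
            frozen.extend(active[:half])           # oldest clusters leave the window
            active = active[half:]
    return frozen + active
```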

Official runs
We submitted 3 runs for each condition:
- UMASSv1 (UMass3): baseline run
  - tf-idf term weighting
  - Cosine similarity (sketched below)
  - Threshold = 0.3
  - Window size = 120
- UMASSv12 (UMass2): smaller clusters get higher priority in agglomerative clustering
- UMASSv19 (UMass1): similar to UMASSv12, but with double the window size
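
The baseline similarity is plain cosine over tf-idf vectors. Here is a minimal sketch using one common log-tf × idf weighting; the exact variant used in the UMass runs is not stated on the slide, so treat the weighting details as an assumption.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Log-tf x idf weights for one document (one common variant, assumed)."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * math.log(n_docs / doc_freq[t])
            for t, c in tf.items() if doc_freq.get(t)}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

In the baseline run, two items are merged when this cosine score exceeds the 0.3 threshold.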

Evaluation results (scores by site; the numeric values in the slide's table are not recoverable)
- eng,nat condition: TNO, ICT (run ICT3d), UMass, CUHK (run CUHK1)
- mul,eng condition: TNO, ICT (run ICT1e), UMass, CUHK (run CUHK1)

Our result is not good - why?
- Online clustering algorithm
  - Reduces complexity
  - But stories far apart in time can never end up in the same cluster
  - The time-locality assumption is not valid for topics
- Non-overlapping clusters
  - Increase the miss rate
- The correct granularity is missed and hard to find
- The UMass HTD system is reasonably quick (about one day per run) but ineffective

What did TNO do?
TNO achieved roughly 1/8 of UMass's detection cost with a similar travel cost. How? Four steps:
- Build the similarity matrix for a sample of 20,000 stories
- Run agglomerative clustering on the sample to build a binary tree
- Simplify the tree to reduce travel cost
- For each story not in the sample, find the 10 closest sample stories and add the story to all of the relevant clusters (see the sketch below)
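
The distinctive step is the last one. Here is a sketch of that nearest-neighbour assignment, assuming the sample has already been clustered (steps 1-3) and its tree nodes are handed in as a list of clusters; the function and parameter names are illustrative, not TNO's actual code.

```python
import heapq

def assign_out_of_sample(stories, sample, sample_clusters, similarity, k=10):
    """For every story outside the clustered sample, find its k nearest sample
    stories and add it to every cluster containing one of them, so a single
    story may appear in up to k branches of the tree.
    sample_clusters: list of sets of sample stories (one set per tree node)."""
    expanded = [set(cluster) for cluster in sample_clusters]   # copies that will grow
    for story in stories:
        nearest = heapq.nlargest(k, sample, key=lambda s: similarity(story, s))
        for idx, cluster in enumerate(sample_clusters):
            if any(n in cluster for n in nearest):
                expanded[idx].add(story)
    return expanded
```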

Why is TNO successful?
- To deal with the large collection, TNO clustered only a 20,000-document sample
- The clustering tree is binary, which preserves the largest number of possible granularities
- A branching factor of 2 or 3 reduces travel cost
- Assigning each story to up to 10 clusters greatly increases the chance of finding a perfect or nearly perfect cluster

Detection cost
- Overlapping clusters
  - According to TNO's observation, adding a story to several clusters decreases the miss rate significantly
- Branching factor
  - A smaller branching factor keeps more possible granularities; in our experiments, limiting the branching factor improved performance
- Similarity function
  - There is no evidence that different similarity functions make a large difference
- Time locality
  - Our experiments contradict this assumption: a larger window size gives better results
The cost function behind these comparisons is sketched below.
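
The detection cost discussed here is the TDT-style weighted combination of miss and false-alarm rates, minimized over the clusters in the hierarchy for each reference topic. The constants below are the values commonly used in TDT evaluations and are assumed here; the official TDT 2004 HTD evaluation plan is authoritative.

```python
# Commonly used TDT constants (assumed; see the official evaluation plan for exact values)
C_MISS, C_FA, P_TOPIC = 1.0, 0.1, 0.02

def detection_cost(cluster, topic, n_stories):
    """Normalized detection cost of one hypothesized cluster against one topic.
    cluster and topic are sets of story ids; n_stories is the collection size."""
    hits = len(cluster & topic)
    p_miss = 1.0 - hits / len(topic)
    p_fa = (len(cluster) - hits) / (n_stories - len(topic))
    cost = C_MISS * p_miss * P_TOPIC + C_FA * p_fa * (1 - P_TOPIC)
    return cost / min(C_MISS * P_TOPIC, C_FA * (1 - P_TOPIC))

def best_cluster_cost(hierarchy, topic, n_stories):
    """HTD scores each topic by its best-matching cluster anywhere in the hierarchy."""
    return min(detection_cost(c, topic, n_stories) for c in hierarchy)
```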

Travel cost
- With the current parameter setting, a smaller branching factor is preferred (optimal value 3)
- Comparison of travel cost across ICT, CUHK, UMass, and TNO for the eng,nat and mul,eng conditions (numeric values not recoverable from the transcript)
  - Reason for the differences: branching factors
- The current normalization factor is very large, so the normalized travel cost is negligible in comparison to the detection cost

Toy example
- Most topics are small: only 20 (8%) have more than 100 stories
- Generate all possible clusters of size 1 to 100 and put them in a binary tree
- Detection cost for 92% of the topics is 0!
- With the empty cluster and the whole set added, the remaining 8% cost at most 1
- Travel cost is … , so the combined cost is … - comparable to most participants!
- With careful arrangement of the binary tree, this can easily be improved further

What is wrong?
- The idea of the travel cost is to prevent cheating setups like the power set
- The normalized travel cost and the detection cost should be comparable in magnitude
- With the current parameter setting, a small branching factor reduces both travel cost and detection cost
Suggested modifications:
- Use a smaller normalization factor, like the old one - the travel cost of the optimal hierarchy
- If the normalized travel cost is still too large, give it a smaller weight
- Increase C_TITLE and decrease C_BRANCH so that the optimal branching factor becomes larger (5-10?) - see the schematic below
- Consider other evaluation measures, such as expected travel cost (still too expensive; it needs an approximation algorithm)
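
To make the branching-factor argument concrete, here is a schematic travel-cost model, not the official TDT 2004 formula: descending a balanced k-ary tree costs a fixed amount per level plus an amount per branch scanned at that level. Under this reading, the per-level cost plays the role the slide assigns to C_TITLE and the per-branch cost the role of C_BRANCH (that mapping is an assumption), so raising the former or lowering the latter moves the optimal branching factor upward.

```python
import math

def approx_travel_cost(n_stories, branching, per_level=0.2, per_branch=1.0):
    """Schematic cost of reaching a leaf in a balanced k-ary tree over n_stories
    leaves: log_k(n) levels, each paying a fixed per-level cost plus a cost for
    every branch examined. Constants are illustrative only."""
    depth = math.log(n_stories) / math.log(branching)
    return depth * (per_level + per_branch * branching)

# With these illustrative constants the minimum over the printed values is at k = 3;
# raising per_level or lowering per_branch moves the optimum toward larger k.
for k in (2, 3, 5, 10):
    print(k, round(approx_travel_cost(407_503, k), 1))
```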

Summary
- This year's evaluation shows that overlapping clusters and a small branching factor give better results
- The current normalization scheme for travel cost does not work well and needs modification
- New evaluation methods?

Reference
Allan, J., Feng, A., and Bolivar, A. Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT. In CIKM 2003.