
1 Hierarchical Topic Detection
UMass - TDT 2004
Ao Feng, James Allan
Center for Intelligent Information Retrieval, University of Massachusetts Amherst

2 Task this year
- 4 times the size of TDT4: 407,503 stories in three languages
- Many clustering algorithms are not feasible: any algorithm with complexity Ω(n²) would take too long (407,503² is roughly 1.7×10¹¹ pairwise comparisons)
- Time limited: one month
- Pilot study this year
- We need a simple algorithm that can finish in a short time

3 HTD system of UMass
Two-step clustering (a sketch follows):
- Step 1: k-NN. Each incoming story is merged into the best-matching cluster when similarity > threshold, otherwise it starts a new cluster
- Step 2: agglomerative clustering
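A minimal sketch of the two-step pipeline, assuming a toy word-overlap similarity; the function names and data are illustrative, not the actual UMass implementation. Step 1 applies the flowchart's decision: merge when similarity exceeds the threshold, otherwise open a new cluster.

```python
def overlap_sim(a: set, b: set) -> float:
    """Toy similarity: Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_pass(stories, sim, threshold):
    """Step 1: attach each story to its most similar cluster when the
    similarity exceeds the threshold, otherwise start a new cluster."""
    clusters = []
    for story in stories:
        best, best_sim = None, 0.0
        for cluster in clusters:
            score = max(sim(story, member) for member in cluster)  # nearest neighbour
            if score > best_sim:
                best, best_sim = cluster, score
        if best is not None and best_sim > threshold:   # the flowchart's decision
            best.append(story)
        else:
            clusters.append([story])
    return clusters

# Step 2 (agglomerative merging of these clusters) is sketched on slide 5.
stories = [{"quake", "tokyo"}, {"quake", "japan", "tokyo"}, {"election", "vote"}]
print(knn_pass(stories, overlap_sim, threshold=0.3))
```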

4 Step 1 – event threading
Why event threading?
- Event: something that happens at a specific time and location
- An event contains multiple stories
- Each topic is composed of one or more related events
- Events have temporal locality
What we do (sketched below):
- Each story is compared only to a limited number of previous stories
- For simplicity, events do not overlap (a false assumption)
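A sketch of the event-threading pass under the temporal-locality assumption: each story is compared only to the most recent stories, and events never overlap. The `lookback` value and helper names are assumptions; it can be run with the toy `overlap_sim` from the previous sketch.

```python
def thread_events(stories, sim, threshold=0.3, lookback=100):
    """Compare each incoming story only to the `lookback` most recent
    stories; join the best match's event or start a new one. Events
    never overlap, mirroring the (false) assumption on the slide."""
    events = []     # each event is a list of story indices
    owner = {}      # story index -> the event that owns it
    for i, story in enumerate(stories):
        best_j, best_s = None, 0.0
        for j in range(max(0, i - lookback), i):   # temporal locality
            score = sim(story, stories[j])
            if score > best_s:
                best_j, best_s = j, score
        if best_j is not None and best_s > threshold:
            owner[i] = owner[best_j]
            owner[i].append(i)                     # join an existing event
        else:
            owner[i] = [i]                         # start a new event
            events.append(owner[i])
    return events
```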

5 Step 2 – agglomerative clustering
- Agglomerative clustering has complexity Ω(n²), so modification is required
- Online clustering algorithm with a limited window size (sketched below):
  - Merge until 1/3 of the clusters are left
  - The first half of the clusters are removed and new events come in
  - Clusters do not overlap
- Assumption: stories from the same source are more likely to belong to the same topic
  - Clusters from the same source are merged first
  - Then clusters in the same language
  - Finally across all languages
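A sketch of the windowed online agglomeration described above, assuming average-link similarity between clusters; the same-source-first, then same-language merge ordering is omitted for brevity, and all names are illustrative.

```python
import itertools

def avg_link(c1, c2, sim):
    """Average-link similarity between two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def online_agglomerate(events, sim, window=120):
    """Keep at most `window` active clusters; when full, merge the
    closest pairs until a third remain, then retire the older half."""
    finished, active = [], []
    for event in events:
        active.append(list(event))
        if len(active) >= window:
            while len(active) > window // 3:        # merge until 1/3 left
                i, j = max(itertools.combinations(range(len(active)), 2),
                           key=lambda p: avg_link(active[p[0]], active[p[1]], sim))
                active[i].extend(active.pop(j))     # j > i, so index i stays valid
            half = len(active) // 2                 # retire the first half;
            finished.extend(active[:half])          # new events take their place
            active = active[half:]
    return finished + active
```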

6 Official runs
We submitted 3 runs for each condition:
- UMASSv1 (UMass3): baseline run
  - tf-idf term weights, cosine similarity (sketched below)
  - Threshold = 0.3, window size = 120
- UMASSv12 (UMass2): smaller clusters have higher priority in agglomerative clustering
- UMASSv19 (UMass1): same as UMASSv12 with double the window size
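A minimal sketch of the baseline's term weighting and similarity. The slide does not say which tf-idf variant UMass used; this assumes raw term frequency with a log idf.

```python
import math
from collections import Counter

def tfidf(tokens, df, n_docs):
    """tf-idf weights for one document (raw tf, log idf: an assumed
    variant, since the slide does not specify one)."""
    return {t: f * math.log(n_docs / df[t]) for t, f in Counter(tokens).items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["quake", "hits", "tokyo"],
        ["quake", "hits", "tokyo", "damage"],
        ["election", "results"]]
df = Counter(t for d in docs for t in set(d))
vecs = [tfidf(d, df, len(docs)) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 3))   # ~0.54, above the 0.3 merge threshold
```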

7 Evaluation results
Scores by site and condition, as combined cost (detection cost, travel cost), with the submitted run in brackets:

eng,nat:
- TNO:   0.0262 (0.0377, 0.0040) [TNO2]
- ICT:   0.0898 (0.0966, 0.0767) [ICT3d]
- UMass: 0.2125 (0.3204, 0.0030) [UMass1]
- CUHK:  0.3273 (0.4674, 0.0554) [CUHK1]

mul,eng:
- TNO:   0.0275 (0.0403, 0.0027) [TNO3]
- ICT:   0.1118 (0.1212, 0.0934) [ICT1e]
- UMass: 0.1942 (0.2910, 0.0063) [UMass1]
- CUHK:  0.2783 (0.3969, 0.0481) [CUHK1]

8 Our result is not good. Why?
- Online clustering algorithm
  - Reduces complexity
  - But stories far apart in time can never land in the same cluster, and the time-locality assumption is not valid at the topic level
- Non-overlapping clusters
  - Increase the miss rate
  - Miss the correct granularity, which is hard to find
- The UMass HTD system is reasonably quick (one day per run) but ineffective

9 What did TNO do?
TNO achieved 1/8 of UMass's detection cost at a similar travel cost. How? Four steps (sketched below):
1. Build the similarity matrix for a sample of 20,000 stories
2. Run agglomerative clustering to build a binary tree
3. Simplify the tree to reduce travel cost
4. For each story not in the sample, find the 10 closest stories in the sample and add it to all the relevant clusters
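A runnable toy version of this strategy; the sample size and k match the slide, but the data structures, the naive agglomeration, and the skipped tree-simplification step are all illustrative.

```python
import itertools, random

def avg_link(c1, c2, sim):
    """Average-link similarity between two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate_tree(items, sim):
    """Naive agglomeration over the sample; returns every intermediate
    cluster, i.e. the nodes of the binary merge tree. O(n^3) toy version."""
    clusters = [[x] for x in items]
    nodes = [list(c) for c in clusters]
    while len(clusters) > 1:
        i, j = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: avg_link(clusters[p[0]], clusters[p[1]], sim))
        clusters[i] = clusters[i] + clusters.pop(j)   # j > i, so i stays valid
        nodes.append(list(clusters[i]))
    return nodes

def tno_style(stories, sim, sample_size=20000, k=10):
    """Cluster only a sample, then assign each remaining story to every
    cluster that contains one of its k nearest sampled stories."""
    sample = random.sample(stories, min(sample_size, len(stories)))
    nodes = agglomerate_tree(sample, sim)   # quadratic work, but only on the sample
    # (TNO also simplified the tree to cut travel cost; elided here.)
    for story in stories:
        if story in sample:
            continue
        nearest = sorted(sample, key=lambda s: sim(story, s), reverse=True)[:k]
        for node in nodes:
            if any(s in node for s in nearest):
                node.append(story)          # a story may join many clusters
    return nodes
```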

10 Why is TNO successful?
- To deal with the large collection, TNO clustered only a 20,000-document sample
- The clustering tree is binary, which preserves the most possible granularities
- A branching factor of 2 or 3 reduces travel cost
- Each story can be assigned to up to 10 clusters, which greatly increases the probability of finding a perfect or nearly perfect cluster

11 Detection cost
- Overlapping clusters: by TNO's observation, adding a story to several clusters decreases the miss rate significantly
- Branching factor: a smaller branching factor keeps more possible granularities; in our experiment, a limited branching factor improved performance
- Similarity function: there is no evidence that different similarity functions make a large difference
- Time locality: our experiment refutes the assumption, since a larger window size gives better results

12 Travel cost
- With the current parameter setting, a smaller branching factor is preferred (optimal value 3; a back-of-the-envelope model follows)
- Comparison of travel cost:

         eng,nat  mul,eng
  ICT    0.0767   0.0934
  CUHK   0.0554   0.0481
  UMass  0.0030   0.0063
  TNO    0.0040   0.0027

- The reason is the branching factors
- The current normalization factor is very large, so the normalized travel cost is negligible compared to the detection cost
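One plausible reading of why small branching factors win here (our model, not the official metric): descending a balanced tree with branching factor b over N leaves visits about log_b(N) nodes, reading one title and scanning b branches at each, so travel(b) ≈ (C_BRANCH·b + C_TITLE)·log_b(N). When C_TITLE is negligible this is minimized near b = e, i.e. an integer optimum of 3.

```python
# Back-of-the-envelope travel-cost model; the cost form and the
# parameter defaults are assumptions, not the official TDT metric.
import math

def travel(b, n, c_branch=1.0, c_title=0.0):
    """Approximate travel cost of a balanced b-ary tree over n leaves."""
    return (c_branch * b + c_title) * math.log(n) / math.log(b)

n = 407_503                                   # corpus size from slide 2
best = min(range(2, 20), key=lambda b: travel(b, n))
print(best)                                   # -> 3
```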

13 Toy example
- Most topics are small: only 20 (8%) have more than 100 stories
- Generate all possible clusters of size 1 to 100 and put them in a binary tree
- The detection cost for 92% of the topics is then 0! (see the sketch below)
- With the empty cluster and the whole set added, the other 8% cost at most 1
- Travel cost is …, so the combined cost is …: comparable to most participants!
- With careful arrangement of the binary tree, it can easily be improved
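A sketch of the detection-cost argument, assuming a TDT-style cost C_det = C_miss·P_miss·P_topic + C_fa·P_fa·(1−P_topic) scored against the best-matching cluster; the parameter values are the usual TDT choices, assumed here rather than taken from the slide. A topic that appears verbatim as a cluster scores exactly 0.

```python
def detection_cost(topic, clusters, n_stories,
                   c_miss=1.0, c_fa=0.1, p_topic=0.02):   # assumed TDT-style values
    """Cost of the best-matching cluster for one topic (sets of story ids)."""
    def cost(cluster):
        p_miss = len(topic - cluster) / len(topic)
        p_fa = len(cluster - topic) / (n_stories - len(topic))
        return c_miss * p_miss * p_topic + c_fa * p_fa * (1 - p_topic)
    return min(cost(c) for c in clusters)

topic = frozenset(range(40))                  # a "small" topic, <=100 stories
clusters = [frozenset(range(40)),             # the topic appears verbatim
            frozenset(range(500))]            # a sloppier alternative
print(detection_cost(topic, clusters, n_stories=10_000))   # -> 0.0
```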

14 What is wrong?
- The idea of the travel cost is to prevent cheating entries like the power set
- The normalized travel cost and the detection cost should be comparable
- With the current parameter setting, a small branching factor reduces both travel cost and detection cost
Suggested modifications (see the sketch below):
- Use a smaller normalization factor, like the old one: the travel cost of the optimal hierarchy
- If the normalized travel cost is too large, give it a smaller weight
- Increase C_TITLE and decrease C_BRANCH so that the optimal branching factor is larger (5~10?)
- Other evaluation algorithms, like expected travel cost (still too expensive; it needs an approximation algorithm)
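Continuing the toy travel-cost model from slide 12, raising the C_TITLE/C_BRANCH ratio pushes the optimal branching factor upward, which is the effect the third suggestion aims for; the specific ratios are illustrative.

```python
# Same assumed cost form as the slide-12 sketch, swept over
# illustrative C_TITLE/C_BRANCH ratios.
import math

def travel(b, n, c_branch, c_title):
    return (c_branch * b + c_title) * math.log(n) / math.log(b)

n = 407_503
for ratio in (0.0, 5.0, 10.0):                # C_TITLE / C_BRANCH
    best = min(range(2, 40), key=lambda b: travel(b, n, 1.0, ratio))
    print(ratio, best)                        # -> 3, 6, 9: optimum moves to 5~10
```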

15 Summary
- This year's evaluation shows that overlapping clusters and a small branching factor get better results
- The current normalization scheme for travel cost does not work well
  - It needs some modification
  - New evaluation methods?
Reference
Allan, J., Feng, A., and Bolivar, A. Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT. In Proceedings of CIKM 2003, pp. 263-270.


