EVENT DETECTION USING A CLUSTERING ALGORITHM
Kleisarchaki Sofia, University of Crete



CONTENTS
Problem Statement
Clustering Framework
  Pre-process
  Clusterer
Experimental Setup
  Corpus
  Training Methodology
Evaluation Methodology
  Quality Metrics
Results
Future Work

PROBLEM STATEMENT (1/2)
Problem Definition: Consider a set of social media documents where each document is associated with an (unknown) event. Our goal is to partition this set of documents into clusters such that each cluster corresponds to all documents that are associated with one event. [1]
Definition: An event is something that occurs in a certain place at a certain time. [1]

PROBLEM STATEMENT (2/2)
Equivalent Problem: Find a clustering algorithm where each cluster corresponds to one event and consists of all the social media documents associated with that event. Different clusters correspond to different events.
Our algorithm has the following characteristics:
  Single-pass
  Incremental
  Threshold-based
  Supervised

CLUSTERING FRAMEWORK (1/3)
Pre-process Step
  Term weighting using the Vector Space Model: $w_{ij} = f_{ij} \cdot \log(N / n_i)$, where $f_{ij}$ is the frequency of word i in document (instance) j, $N$ is the number of documents, and $n_i$ is the number of documents containing word i
  No stemming applied
  Stop-word removal
  Kept the topX words per dataset
  Based on the Weka software (implemented in Java)
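The term weighting above could be sketched in Python as follows. This is a minimal sketch, not the Weka-based implementation used in the presentation. The function name `tfidf_vectors`, the sparse dictionary representation, and the choice to rank the topX words by document frequency are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=frozenset(), top_x=None):
    """Return one sparse {term: weight} dict per tokenized document.

    Weight follows w_ij = f_ij * log(N / n_i). Ranking the kept words
    by document frequency is an assumption, not taken from the slides.
    """
    # Stop-word removal; no stemming is applied.
    docs = [[t for t in doc if t not in stop_words] for doc in docs]
    n_docs = len(docs)
    # n_i: number of documents that contain term i.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vocab = set(doc_freq)
    if top_x is not None:
        vocab = {t for t, _ in doc_freq.most_common(top_x)}
    vectors = []
    for doc in docs:
        # f_ij: frequency of term i in document j.
        term_freq = Counter(t for t in doc if t in vocab)
        vectors.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return vectors
```

Note that a term appearing in every document (for example `rt` in a retweet-heavy corpus) receives weight zero under this formula, so it does not contribute to similarity.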

CLUSTERING FRAMEWORK (2/3)
Clusterer Step
  Build mappings from documents to clusters.
  Use textual information and a similarity metric.
    Cosine similarity metric
  Centroid-based clusters
    Average weight per term
    Centroid is updated and maintained at low cost
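As an illustration of the two ingredients named above, the sketch below shows cosine similarity over sparse term vectors and a centroid that stores the average weight per term. Keeping running per-term sums is one plausible way to make the update cheap; the slides do not say how the centroid is maintained, so the `Centroid` class and its methods are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

class Centroid:
    """Hypothetical centroid holding per-term sums and a member count."""

    def __init__(self):
        self.sums = {}
        self.size = 0

    def add(self, vec):
        # Cost is proportional to the number of terms in the new document.
        for t, w in vec.items():
            self.sums[t] = self.sums.get(t, 0.0) + w
        self.size += 1

    def vector(self):
        # Average weight per term over all members of the cluster.
        return {t: s / self.size for t, s in self.sums.items()}
```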

CLUSTERING FRAMEWORK (3/3)
Algorithm
foreach tweet T in corpus do
    foreach term t in T do
        foreach tweet T' that contains t do
            compute cosine_similarity(T, centroid(T'))
        end
    end
    maxSimilarity = max over T' of cosine_similarity(T, centroid(T'))
    if maxSimilarity > threshold then
        add T to the cluster of the best-matching T'
        update that cluster's centroid
    else
        create a new cluster containing T
    end
end
Here centroid(T') is the centroid of the cluster that contains T'.
Threshold (experimentally defined): 0.2
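The single-pass, threshold-based procedure on this slide might look like the following in Python. It is a sketch under stated assumptions rather than the presented implementation: the function names, the inverted index `term_to_clusters` (used to find candidate clusters that share a term with the tweet), and the dictionary-based cluster records are all hypothetical. Input vectors are sparse `{term: weight}` dictionaries.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_cluster(vectors, threshold=0.2):
    """Assign each vector to a cluster in one pass; return cluster ids."""
    clusters = []                        # each: {"sums": {...}, "size": int}
    term_to_clusters = defaultdict(set)  # term -> ids of clusters containing it
    assignments = []
    for vec in vectors:
        # Only clusters sharing at least one term can have similarity > 0.
        candidates = set()
        for term in vec:
            candidates |= term_to_clusters[term]
        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            cluster = clusters[cid]
            centroid = {t: s / cluster["size"] for t, s in cluster["sums"].items()}
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is None or best_sim <= threshold:
            # No sufficiently similar cluster: open a new one.
            best_id = len(clusters)
            clusters.append({"sums": {}, "size": 0})
        cluster = clusters[best_id]
        cluster["size"] += 1
        for term, weight in vec.items():
            # Incremental centroid update via running sums.
            cluster["sums"][term] = cluster["sums"].get(term, 0.0) + weight
            term_to_clusters[term].add(best_id)
        assignments.append(best_id)
    return assignments
```

Because each tweet is compared against centroids only once and never reassigned, the result depends on the order in which tweets arrive.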

EXPERIMENTAL SETUP (1/4)
Corpus
  Collection of Twitter data
  3079 time-stamped tweets
  Data was collected through Twitter’s streaming API
Training Methodology
  A simple graphical user interface was created for tweet labelling

EXPERIMENTAL SETUP (2/4)
[Figure: screenshot with panels labelled Connection Options, Query Execution, Query Results, and Information Panel]

EXPERIMENTAL SETUP (3/4)
[Figure: grouping tweets]

EXPERIMENTAL SETUP (4/4)
The “ground truth” dataset consists of 3 events, where each event is self-contained and independent of the other events in the dataset. Specifically:
Event                  Tag         # of tweets
Kubica seriously hurt  Kubica      931
Gary Moore dead        #GaryMoore  930
Egypt                  #egypt

EVALUATION METHODOLOGY (1/2)
Quality Metrics
  Normalized Mutual Information (NMI)
    Measures how much information is shared between the actual “ground truth” events and the clustering assignment.
    $C = \{c_1, \ldots, c_n\}$: set of clusters.
    $E = \{e_1, \ldots, e_n\}$: set of events.
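The NMI formula itself is not preserved in this transcript. As a hedged illustration, the sketch below uses one common normalization, NMI(C, E) = 2 I(C; E) / (H(C) + H(E)), which may differ from the variant shown on the slide. The function name `nmi` and its label-list inputs are assumptions.

```python
import math
from collections import Counter

def nmi(cluster_labels, event_labels):
    """NMI between a clustering and ground-truth events.

    Uses 2 * I(C; E) / (H(C) + H(E)); this normalization is an
    assumption, since the slide's own formula is not available.
    """
    n = len(cluster_labels)
    c_counts = Counter(cluster_labels)
    e_counts = Counter(event_labels)
    joint = Counter(zip(cluster_labels, event_labels))
    # Mutual information over the non-empty cluster/event intersections.
    mutual_info = sum(
        (n_ce / n) * math.log(n_ce * n / (c_counts[c] * e_counts[e]))
        for (c, e), n_ce in joint.items()
    )
    h_c = -sum((k / n) * math.log(k / n) for k in c_counts.values())
    h_e = -sum((k / n) * math.log(k / n) for k in e_counts.values())
    if h_c + h_e == 0.0:
        return 1.0  # one cluster and one event: trivially identical partitions
    return 2.0 * mutual_info / (h_c + h_e)
```

A perfect clustering scores 1 regardless of how the cluster ids are numbered, and a clustering that is independent of the events scores 0.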

EVALUATION METHODOLOGY (2/2)
Quality Metrics
  Precision:
  Recall:
  F-Measure:
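The formulas for these three metrics are not preserved in this transcript. The sketch below uses the standard set-based definitions for a single cluster compared against a single event, with F-measure as the harmonic mean of precision and recall; whether the slide used exactly these forms is an assumption. The function name and the use of tweet-id sets are hypothetical.

```python
def precision_recall_f(cluster_members, event_members):
    """Precision, recall and F-measure of one cluster against one event.

    Both arguments are collections of tweet ids. Standard definitions
    are assumed; the slide's own formulas are not available.
    """
    cluster = set(cluster_members)
    event = set(event_members)
    overlap = len(cluster & event)
    # Precision: fraction of the cluster that belongs to the event.
    precision = overlap / len(cluster) if cluster else 0.0
    # Recall: fraction of the event that the cluster captured.
    recall = overlap / len(event) if event else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, a cluster of 4 tweets that shares 2 tweets with a 6-tweet event has precision 0.5 and recall 1/3.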

RESULTS (1/4)
Performance of the algorithm over the given test set.
Stemmer      Threshold     WordsToKeep  #clusters  NMI
NullStemmer
NullStemmer
NullStemmer
NullStemmer
NullStemmer
NullStemmer  (0.35, 0.45)

RESULTS (2/4)
Performance of the algorithm over the given test set.
Stemmer      Threshold     WordsToKeep  #clusters  NMI
NullStemmer
NullStemmer
NullStemmer
NullStemmer
NullStemmer
NullStemmer  (0.35, 0.45)
Callout: Egypt, #garymoore, http, kubica, rt

RESULTS (3/4)
F-Measure per cluster (WordsToKeep: 5, threshold: 0.4)
Events: Event #1 (kubica), Event #2 (garymoore), Event #3 (egypt)
Top word per cluster: Cluster #1: #egypt, Cluster #2: kubica, Cluster #3: #garymoore

RESULTS (4/4)
Content of each cluster
Format: {..., [word_i: weight (#tweets containing word_i)], ...}
Cluster #1 (egypt): {[kubica: (10)], [ (471)], [rt: (781)], [#egypt: (1203)]}
Cluster #2 (kubica): {[kubica: (783)], [ (345)], [#garymoore: (1)], [rt: (213)]}
Cluster #3 (#garymoore): {[ (307)], [#garymoore: (905)], [rt: (153)], [#egypt: (1)]}

FUTURE WORK
Improve:
  Pre-process Step
    Term representation
    Feature extraction: not only textual features
  Clusterer
    Similarity metrics
    Cluster representation
Extend Quality Metrics
  B-Cubed

Questions?

REFERENCES
1. Streaming First Story Detection with Application to Twitter
2. Learning Similarity Metrics for Event Identification in Social Media
3. On-line New Event Detection and Tracking
4. More can be found: