New Event Detection & Tracking
ÖZGÜR BAĞLIOĞLU, SÜLEYMAN KARDAŞ, H. ÇAĞDAŞ ÖCALAN, ERKAN UYAR
Bilkent Information Retrieval Group, Computer Engineering Department, Bilkent University
22/03/07 First Event Detection & Event Tracking
Outline
Introduction
– What is a new event detection and tracking system?
– Motivation
Related Work
– TDT
– Google News
– NewsInEssence
Proposed System
– Test Collection Preparation (TTracker)
– Novelty Detection & Event Tracking
– C3M concept
– Design Details
Future Work
– Named Entities with NED
Conclusion
Introduction
Event
– Bound to a specific time and place
Topic
– A seminal event or activity
The difference
– "Computer virus detected at British Telecom, March 3, 1993" is an event.
– "Computer virus outbreaks" is a topic.
Introduction
New event detection is the task of detecting stories about previously unseen events in a stream of news stories.
– Examples: airplane crashes, earthquakes, government elections, etc.
Properties of a new event
– When the event occurred
– Who was involved
– Where it took place
– How it happened
– Impact, significance, or consequence of the event
Introduction
Information filtering system
– Uses a long-lived profile of a user's request to identify relevant material in a stream of arriving documents.
– In contrast, new event detection has no knowledge of what events will happen in the news, so it must operate without a pre-specified query.
Usage areas of new event detection and tracking (NEDT)
– Categorization systems
– People who need the latest news: government analysts, financial analysts, stock market traders
– Distinguishing new e-mails from previously seen ones
Related Work
Topic Detection and Tracking (TDT)
– Research ongoing since 1997
– Broadcast news, written and spoken news stories in multiple languages
Research areas
– Story Segmentation: detect changes between topically cohesive sections
– Topic Tracking: keep track of stories similar to a set of example stories
– Topic Detection: build clusters of stories that discuss the same topic
– First Story Detection: detect whether a story is the first story of a new, unknown topic
– Link Detection: detect whether or not two stories are topically linked
Related Work
Google News
– A novel approach to news
– Uses 4,500 English news sources worldwide
– Groups similar stories together
– Displays them according to each reader's personalized interests
Related Work
NewsInEssence
– Since 2001
– Summarizes clusters of related news articles from multiple sources on the Web
– Developed by the CLAIR group at the University of Michigan
– Partially funded by the NSF under the ITR program, grant number ITR-0082884
Proposed System
Handling of Test Data (Milliyet, TRT, Zaman, Haber 7, CNN Türk)
– Distribution of the data among collections
– Processing the raw data
Test Collection Preparation (TTracker)
– Profiles and their properties
– Sample profiles from the collection
Novelty Detection & Event Tracking
– C3M concept
– Algorithm details
Future Work
– Named entities
– System evaluation
Conclusion
Handling of Test Data
Data is collected from five different sources:
– CNN Türk (http://www.cnnturk.com)
– Haber 7 (http://www.haber7.com)
– Milliyet Gazetesi (http://www.milliyet.com.tr)
– TRT (http://www.trt.net.tr)
– Zaman Gazetesi (http://www.zaman.com.tr)
From these sources, the time-stamped (date and time) news stories of 2005 were crawled.
Handling of Test Data
Each source represents a different point of view:
– CNN Türk: international, American style
– TRT: governmental, more restrictive
– Milliyet Gazetesi: modern perspective
– Zaman Gazetesi: conservative
– Haber 7: provides variety
Hence, the different perspectives pose a nice challenge when tracking the news.
Handling of Test Data
Statistics about the sources:

News Source          No. of News   % of Total News   Average News Length (no. of words)
CNN Türk                  31,919              14.2                               270.57
Haber 7                   59,304              26.3                               237.85
Milliyet Gazetesi         72,506              32.1                               218.34
TRT                       19,102               8.5                               120.75
Zaman Gazetesi            42,749              19.0                                96.76
All                      225,580             100.0                               199.56

After crawling, HTML tags are stripped from the text using the HTMLParser library.
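The slide names an "HTMLParser library" for tag stripping but gives no further detail. As an illustration only, the following minimal sketch performs the same kind of cleaning with Python's standard `html.parser` module, which is an assumed stand-in and not necessarily the library the authors used. The names `_TextExtractor` and `strip_tags` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p><script>x=1</script>"))
```

Word counts such as the average news length in the table could then be taken from `len(strip_tags(page).split())`, assuming whitespace tokenization.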
Test Collection Preparation: TTracker
– TTracker is a sub-component for collecting the test and training data semi-automatically.
– It is based on an information retrieval system.
– The system allows annotators to define profiles and the tracking news for each profile.
– The system also provides statistical information about the profiles.
– The success of the system will also be compared with manual tracking.
Test Collection Preparation: TTracker
A profile contains the following fields:
– Topic Title: a one- or two-word definition.
– Seminal Event: a definition of at most two or three sentences.
– What: what happened during the event.
– Who: who was involved in the event.
– When: when the event occurred.
– Where: where the event occurred.
– Topic Size: estimated number of tracking news stories.
– Seed: seed document of the event.
– Event Type: category of the event.
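The profile fields listed above map naturally onto a small record type. The sketch below is one possible representation, not TTracker's actual data model: the class name `EventProfile`, the field types, the `validate` helper, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventProfile:
    """One event profile; field names follow the slide, types are assumed."""
    topic_title: str    # one- or two-word definition
    seminal_event: str  # at most two or three sentences
    what: str
    who: str
    when: str
    where: str
    topic_size: int     # estimated number of tracking news stories
    seed: str           # identifier of the seed document (format assumed)
    event_type: str

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means no problem found."""
        problems = []
        if not 1 <= len(self.topic_title.split()) <= 2:
            problems.append("topic_title should be one or two words")
        if self.topic_size < 0:
            problems.append("topic_size must not be negative")
        return problems


# Made-up placeholder values, not taken from the real collection.
example = EventProfile(
    topic_title="Example Earthquake",
    seminal_event="An earthquake struck an example city.",
    what="Earthquake",
    who="Residents of the example city",
    when="2005-01-01",
    where="Example city",
    topic_size=40,
    seed="seed-document-id",
    event_type="Natural disaster",
)
```

Calling `example.validate()` returns an empty list for this record; a three-word title or a negative topic size would each add one message.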
Test Collection Preparation: TTracker
The tracking news for a profile is defined in five stages:
– Stage 1: Using the seed document as a query.
– Stage 2: Using the event profile as a query.
– Stage 3: Using tracking news as queries.
– Stage 4: Creative query searching.
– Stage 5: Quality control of the profile.
After these stages are completed, the quality of the profiles is also checked by administrators.
[Figure: flowchart from Create/Start through Stages 1 to 5 to Finish]
Test Collection Preparation: TTracker
In each stage, annotators can label a news story as "tracking", "non-tracking", "not-sure", or "not-evaluated".
Annotators evaluate:
– 200 documents in the 1st stage,
– 300 documents in the 2nd stage,
– 400 documents in the 3rd stage,
– 200 documents for each query of the 4th stage.
Test Collection Preparation: TTracker
So far, we have collected nearly 60 completed profiles with the valuable contribution of our friends.
We take extra care to avoid bias in the collection: the number of profiles per annotator, the event types, and the profile lengths are all kept in balance.
Per-profile statistics (columns, in the order the values appear: Time-Spent, Not-Evaluated, Not-Sure, Non-Tracking, Tracking, Retrieved; the values of each row run together in the transcript):
Max.  825614377614541129
Min.  2000142221
Avg.  13077137889546
Test Collection Preparation: TTracker
[Figure: example profiles and their lifetime statistics]
Test Collection Preparation: TTracker
Distribution of news over the year for two sample profiles generated with TTracker:
[Figure: two panels, "Sahte Rakı" (Counterfeit Rakı) and "2005 Eurovision Şarkı Yarışması" (2005 Eurovision Song Contest); x-axis: days of 2005, y-axis: number of news stories]
Test Collection Preparation: TTracker
– TTracker is semi-automatic: it is built on an information retrieval system.
– TTracker's recall will be compared with the recall of the manual system (= 1).
– A t-test will be used to measure the correctness of the system.
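The slide does not say which form of the t-test is intended. One plausible reading is a one-sample t-test of per-profile recall values against the manual recall of 1; the sketch below computes that statistic using only the standard library. The function name `one_sample_t` and the recall values are hypothetical, not measurements from the study.

```python
import math
import statistics
from typing import Sequence, Tuple


def one_sample_t(values: Sequence[float], reference: float) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for H0: mean(values) == reference."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    spread = statistics.stdev(values)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("all observations are identical; t is undefined")
    t = (statistics.mean(values) - reference) / (spread / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Made-up recall values for four profiles, compared with manual recall = 1.
    recalls = [0.9, 0.8, 1.0, 0.7]
    t, df = one_sample_t(recalls, reference=1.0)
    print(f"t = {t:.4f}, df = {df}")
```

The resulting statistic would then be compared with the critical value of the t distribution for the given degrees of freedom at the chosen significance level.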
Proposed System: Novelty Detection & Event Tracking
Novelty detection
– The identification of new data that a machine learning system was not aware of during training.
– One of the fundamental requirements of a good classification or identification system.
Proposed System
A special case of novelty detection...
[Figure: timeline from 0 to Now, with labels Old News, Window, First Event, and Tracking Events]
Proposed System
Cover Coefficient Based Clustering Methodology (C3M) [Can F., Ozkarahan E., 1990]
– A single-pass seed algorithm
Working principles:
– Determining the number of clusters
– Determining the cluster seeds
– Assigning the other documents to the clusters initiated by the seeds
– A two-stage probability experiment is performed
Proposed System
C3M concept
– Example D (document-term) and C (cover coefficient) matrices
– $C_{ij} = \alpha_i \sum_{k=1}^{m} d_{ik}\,\beta_k\,d_{jk}$
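To make the cover coefficient formula concrete, the sketch below builds C from a small D matrix. Following the C3M definition, it takes α_i as the reciprocal of the i-th row sum of D and β_k as the reciprocal of the k-th column sum. The example matrix is made up (the slide's own example matrices are not in the transcript), and the function name `cover_coefficients` is hypothetical.

```python
from typing import List, Sequence


def cover_coefficients(d: Sequence[Sequence[float]]) -> List[List[float]]:
    """Compute C with C[i][j] = alpha_i * sum_k d[i][k] * beta_k * d[j][k]."""
    n_docs = len(d)
    n_terms = len(d[0]) if d else 0
    row_sums = [sum(row) for row in d]
    col_sums = [sum(row[k] for row in d) for k in range(n_terms)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("every document needs a term and every term a document")
    alpha = [1.0 / s for s in row_sums]  # reciprocal of row sums
    beta = [1.0 / s for s in col_sums]   # reciprocal of column sums
    return [
        [
            alpha[i] * sum(d[i][k] * beta[k] * d[j][k] for k in range(n_terms))
            for j in range(n_docs)
        ]
        for i in range(n_docs)
    ]


# Made-up binary document-term matrix: 3 documents, 3 terms.
D = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
C = cover_coefficients(D)
# In C3M the sum of the diagonal entries C[i][i] estimates the number of clusters.
estimated_clusters = sum(C[i][i] for i in range(len(C)))

if __name__ == "__main__":
    for row in C:
        print([round(x, 3) for x in row])
    print("estimated number of clusters:", round(estimated_clusters, 3))
```

Each row of C sums to 1, so C[i][j] can be read as the extent to which document i is covered by document j, and C[i][i] as how much document i is covered only by itself.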
Proposed System
NEDT using the C3M concept
The threshold δ_W (for new event detection) depends on:
– Window size
– C_ii of the incoming event
– C_ij of the incoming event with respect to the other events in the window
The threshold δ_G depends on:
– Cluster centroid similarity (C_ij)
– C_ii of the incoming event
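The slides do not spell out the decision rule, so the following is only a rough sketch of how a window threshold such as δ_W could be applied. It assumes that an incoming story is flagged as a new event when its best coverage score against the stories in the window falls below the threshold. The function `detect_new_events`, the fixed threshold, and the word-overlap score `word_overlap` (a simple stand-in for C_ij) are all assumptions, not the authors' method.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for a C_ij-style score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def detect_new_events(
    stories: Sequence[str],
    coverage: Callable[[str, str], float],
    window_size: int,
    delta_w: float,
) -> List[str]:
    """Label each story 'new' or 'tracking' using a sliding window."""
    window: Deque[str] = deque(maxlen=window_size)  # oldest stories drop out
    labels = []
    for story in stories:
        best = max((coverage(story, old) for old in window), default=0.0)
        labels.append("new" if best < delta_w else "tracking")
        window.append(story)
    return labels


if __name__ == "__main__":
    stream = [
        "earthquake hits city",
        "earthquake hits city again",
        "election results announced",
    ]
    print(detect_new_events(stream, word_overlap, window_size=2, delta_w=0.5))
```

In the system described here the threshold is not a constant: it would vary with the window size and with C_ii of the incoming story, which this sketch leaves out.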
Proposed System
Two thresholds must be found:
– One within the window
– One within the collection
A possible choice is a high threshold within the window, but selecting it is complicated; so far it has been found intuitively through experimental trials...
Results are as follows:
Proposed System
Experiments will be conducted to improve the threshold using pattern recognition techniques such as:
– Mixture of Gaussians
– SVM
– Decision trees
Another problem with finding the threshold:
– The dataset is not large enough
– Only two features are available
[Figure: scatter plot; blue dots: new events, green dots: tracking events; x-axis: C_ii, y-axis: C_ij]
Future Work
Improving NED by using named entities
– Topic-conditioned novelty detection (Yang et al., 2002)
– A new similarity measure with semantic classes (Makkonen et al., 2002)
– Modified similarity metrics (Kumaran and Allan, 2004)
– Using names and topics (Kumaran and Allan, 2005)
Future Work
Intuition behind named entities:
– Who, where, when
– People, organizations, places, dates, and times
How to embed named entities into NED:
– A new similarity matrix
– An additional similarity comparison using the extracted named entities
Future Work
Evaluation of NED
Judging documents
– Random documents are selected from different categories
– Annotators judge the documents
– The same documents are processed by our system
– Finally, the system is evaluated by precision and recall against the annotators' judgments
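As a small illustration of the final evaluation step, the sketch below computes precision and recall from two sets of document identifiers: the documents the system flags as new events and the documents the annotators judge to be new events. The function name `precision_recall` and the identifiers are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(
    system_new: Iterable[Hashable],
    annotator_new: Iterable[Hashable],
) -> Tuple[float, float]:
    """Return (precision, recall) of the system against annotator judgments."""
    system_set = set(system_new)
    gold_set = set(annotator_new)
    true_positives = len(system_set & gold_set)
    # By convention here, an empty denominator yields 0.0 rather than an error.
    precision = true_positives / len(system_set) if system_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up document ids.
    system = {"doc2", "doc3", "doc4", "doc5"}
    annotators = {"doc1", "doc2", "doc3"}
    p, r = precision_recall(system, annotators)
    print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Here two of the four flagged documents are correct and two of the three annotated new events are found, so precision is 2/4 and recall is 2/3.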
Future Work
Developing an effective, real-time Web application capable of:
– detecting new events
– tracking old ones
Conclusion
We have covered:
– New event detection and tracking concepts
– Test collection preparation
– Details of the designed system
Goals:
– Carry out leading research for Turkish
– Make dreams come true in information retrieval
– "Rising like a sun in the science world" (Fazli Can)
References
– Can F. and Ozkarahan E. A. "Concepts and effectiveness of the cover coefficient based clustering methodology for text databases". 1990.
– Kumaran G. and Allan J. "Text classification and named entities for new event detection". 2004.
– Makkonen J., Ahonen-Myka H., and Salmenkivi M. "Applying semantic classes in event detection and tracking". 2002.
– Yang Y., Zhang J., Carbonell J., and Jin C. "Topic-conditioned novelty detection". 2002.
Questions?
Thanks for your patience...