New Event Detection & Tracking ÖZGÜR BAĞLIOĞLU SÜLEYMAN KARDAŞ H. ÇAĞDAŞ ÖCALAN ERKAN UYAR Bilkent Information Retrieval Group Computer Engineering Department.

Presentation transcript:

New Event Detection & Tracking
ÖZGÜR BAĞLIOĞLU, SÜLEYMAN KARDAŞ, H. ÇAĞDAŞ ÖCALAN, ERKAN UYAR
Bilkent Information Retrieval Group, Computer Engineering Department, Bilkent University

22/03/07 First Event Detection & Event Tracking

Outline
Introduction
– What is a new event detection and tracking system?
– Motivation
Related Work
– TDT
– Google News
– NewsInEssence
Proposed System
– Test Collection Preparation (TTracker)
– Novelty Detection & Event Tracking
– C3M concept
– Design Details
Future Work
– Named Entities with NED
Conclusion

Introduction
Event
– Tied to a specific time and place
Topic
– A seminal event or activity
The difference:
– “Computer virus detected at British Telecom, March 3, 1993” is an event.
– “Computer virus outbreaks” is a topic.

Introduction
New event detection is the task of detecting stories about previously unseen events in a stream of news stories.
– Examples: an airplane crash, an earthquake, governmental elections, etc.
Properties of a new event:
– When the event occurred
– Who was involved
– Where it took place
– How it happened
– Impact, significance, or consequence of the event

Introduction
Information filtering system
– Uses a long-lived profile of a user’s request to identify relevant material in a stream of arriving documents.
– In contrast, new event detection has no knowledge of what events will happen in the news, so it must operate without a pre-specified query.
NEDT usage areas
– In categorization systems
– For people who need to know the latest news: governmental analysts, financial analysts, stock market traders
– Identifying new e-mails among previously seen ones

Related Work
Topic Detection and Tracking (TDT)
– Research ongoing since 1997
– Covers broadcast news and written and spoken news stories in multiple languages
Research tasks
– Story Segmentation: detect changes between topically cohesive sections
– Topic Tracking: keep track of stories similar to a set of example stories
– Topic Detection: build clusters of stories that discuss the same topic
– First Story Detection: detect whether a story is the first story of a new, unknown topic
– Link Detection: detect whether or not two stories are topically linked

Related Work
Google News
– A novel approach to news
– Uses 4,500 English news sources worldwide
– Groups similar stories together
– Displays them according to each reader's personalized interests

Related Work
NewsInEssence
– Running since 2001
– Summarizes clusters of related news articles from multiple sources on the Web
– Developed by the CLAIR group at the University of Michigan
– Partially funded by the NSF under the ITR program, grant number ITR

Proposed System
Handling of Test Data (Milliyet, TRT, Zaman, Haber7, Cnnturk)
– Distribution of the data among collections
– Processing the raw data
Test Collection Preparation (TTracker)
– Profiles and their properties
– Sample profiles from the collection
Novelty Detection & Event Tracking
– C3M concept
– Algorithm details
Future Work
– Named entities
– System evaluation
Conclusion

Handling of Test Data
Data is collected from 5 different sources:
– CNN Türk (
– Haber 7 (
– Milliyet Gazetesi (
– TRT (
– Zaman Gazetesi (
From these sources, the news stories of 2005 that carry time stamps (date and time) were crawled.

Handling of Test Data
Each source represents a different point of view:
– CNN Türk: international, American style
– TRT: governmental, more restrictive
– Milliyet Gazetesi: modern perspective
– Zaman Gazetesi: conservative
– Haber 7: provides variety
Hence, the different perspectives make tracking the news a nice challenge.

Handling of Test Data
Statistics about the sources:
[Table: News Source | No. of News | % Addition to Total News | Average News Length (no. of words); rows: CNN Türk, Haber 7, Milliyet Gazetesi, TRT, Zaman Gazetesi, All]
After crawling the data, the text is cleaned of HTML tags using the HTMLParser library.

Test Collection Preparation: TTracker
– TTracker is a sub-component that collects the test and training data semi-automatically.
– It is based on an information retrieval system.
– The system lets annotators define event profiles and the news stories that track them.
– The system also provides some statistical information about the profiles.
– The success of the system will also be compared with manual tracking.

Test Collection Preparation: TTracker
A profile contains the following fields:
– Topic Title: a one- or two-word definition.
– Seminal Event: a definition of at most two or three sentences.
– What: what happened during the event.
– Who: who was involved in the event.
– When: when the event occurred.
– Where: where the event occurred.
– Topic Size: estimated number of tracking news stories.
– Seed: seed document of the event.
– Event Type: category of the event.

Test Collection Preparation: TTracker
The tracking news stories are identified in five stages:
– Stage 1: Using the seed document as a query.
– Stage 2: Using the event profile as a query.
– Stage 3: Using the tracking news as queries.
– Stage 4: Creative query searching.
– Stage 5: Quality control of the profile.
After these stages are completed, the quality of the profiles is also checked by administrators.
[Flow diagram: Start, Create, Stage 1 to Stage 5, Finish]

Test Collection Preparation: TTracker
In each stage, annotators can label a news story as “tracking”, “non-tracking”, “not-sure”, or “not-evaluated”.
Annotators evaluate:
– 200 documents in the 1st stage,
– 300 documents in the 2nd stage,
– 400 documents in the 3rd stage,
– 200 documents for each query of the 4th stage.

Test Collection Preparation: TTracker
So far, we have collected nearly 60 completed profiles with the valuable contribution of our friends.
We take particular care to avoid bias in the collection: the number of profiles per person, the event types, and the profile lengths are all kept in balance.
[Table: columns Retrieved | Tracking | Non-Tracking | Not-Sure | Not-Evaluated | Time Spent; rows Max, Min, Avg.]

Test Collection Preparation: TTracker
Example profiles and their lifetime statistics:

Test Collection Preparation: TTracker
Distribution of news over the year for two sample profiles generated with TTracker:
[Figure: two panels, “Sahte Rakı” (counterfeit rakı) and “Eurovision Şarkı Yarışması” (Eurovision Song Contest); x-axis: days of 2005; y-axis: number of news stories]

Test Collection Preparation: TTracker
– TTracker is semi-automatic and built on an information retrieval system.
– TTracker’s recall will be compared with the recall of fully manual tracking, which is 1 by definition.
– A t-test will be used to measure the correctness of the system.
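One way to read the planned t-test is as a one-sample test of the per-profile recall values against the manual recall of 1. The sketch below assumes that reading, which the slides do not spell out; the recall values and the name `one_sample_t` are hypothetical and only illustrate the arithmetic.

```python
import math
import statistics


def one_sample_t(values, mu):
    """t statistic for H0: mean(values) == mu.

    Needs at least two values with nonzero variance.
    """
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu) / standard_error


# Hypothetical recall of TTracker on three profiles (not measured data).
recalls = [0.8, 1.0, 0.5]

# Manual tracking has recall 1 by definition, so that is the reference mean.
t = one_sample_t(recalls, mu=1.0)
print(round(t, 3))
```

A strongly negative statistic would suggest that the semi-automatic procedure misses tracking stories that manual tracking finds; turning it into a p-value requires the t distribution with n - 1 degrees of freedom.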

Proposed System
Novelty Detection & Event Tracking
Novelty detection is
– the identification of new data that a machine learning system was not aware of during training;
– one of the fundamental requirements of a good classification or identification system.

Proposed System
A special case of novelty detection...
[Timeline diagram with labels: 0, time, Old News, Window, First Event, Tracking Events, Now]
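The windowed view of the problem can be sketched as follows. This is a minimal illustration, not the system described in these slides: it swaps in plain cosine similarity over word counts instead of cover coefficients, and the window size, the threshold, and the function names are all assumptions.

```python
import math
from collections import Counter, deque


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def detect(stream, window_size=3, threshold=0.3):
    """Label each story 'new' or 'tracking' using a sliding window.

    A story is 'tracking' if it is similar enough to any story still in
    the window; otherwise it is treated as the first story of a new event.
    """
    window = deque(maxlen=window_size)  # oldest stories fall out automatically
    labels = []
    for text in stream:
        vec = Counter(text.lower().split())
        best = max((cosine(vec, old) for old in window), default=0.0)
        labels.append("tracking" if best >= threshold else "new")
        window.append(vec)
    return labels


stories = [
    "earthquake hits coastal town",
    "rescue teams reach earthquake town",
    "parliament votes on budget",
    "budget vote passes in parliament",
]
print(detect(stories))  # ['new', 'tracking', 'new', 'tracking']
```

Stories older than the window are never compared, so a follow-up that arrives after its first story has left the window is labeled as new.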

Proposed System
Cover Coefficient Based Clustering Methodology (C3M) [Can F., Ozkarahan E., 1990]
– Single-pass seed algorithm
Working principles:
– Determining the number of clusters
– Determining the cluster seeds
– Assigning the other documents to the clusters initiated by the seeds
A two-stage probability experiment is performed.

Proposed System
C3M Concept
– Example D (document-term) and C (cover coefficient) matrices
– $C_{ij} = \alpha_i \sum_{k=1}^{m} d_{ik}\,\beta_k\,d_{jk}$
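The formula can be tried on a small matrix. The sketch below follows the definitions in Can and Ozkarahan's paper, where α_i is the reciprocal of the i-th row sum of D and β_k is the reciprocal of the k-th column sum; the slide itself does not define them. The example matrix and the function name `cover_coefficients` are made up for illustration.

```python
def cover_coefficients(D):
    """Return the cover coefficient matrix C for a document-term matrix D.

    D is a list of rows (documents) over m terms. alpha_i is the reciprocal
    of row i's sum and beta_k is the reciprocal of column k's sum, so every
    document and every term must have a nonzero sum.
    """
    n, m = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]
    beta = [1.0 / sum(D[i][k] for i in range(n)) for k in range(m)]
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(m))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical binary document-term matrix: 3 documents, 4 terms.
D = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
C = cover_coefficients(D)
for row in C:
    print([round(c, 3) for c in row])
```

Each row of C sums to 1, so C_ij can be read as the extent to which document i is covered by document j, and C_ii as how much a document covers itself. In the original C3M formulation the sum of the diagonal entries serves as the estimated number of clusters.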

Proposed System
NEDT using the C3M concept:
The threshold value $\delta_W$ (for new event detection) depends on:
– Window size
– $C_{ii}$ of the incoming event
– $C_{ij}$ of the incoming event with respect to the other events in the window
The threshold value $\delta_G$ depends on:
– Cluster centroid similarity ($C_{ij}$)
– $C_{ii}$ of the incoming event

Proposed System
Two thresholds should be found:
– In the window
– In the collection
One possible selection, which works well in the window but is complicated, was found intuitively through experimental trials.
Results are as follows:

Proposed System
Further experiments will be conducted to improve the threshold, using pattern recognition techniques such as:
– Mixture of Gaussians
– SVM
– Decision trees
Other problems in finding the threshold:
– The dataset is not large enough
– Only 2 features are available
Note (plot legend): blue dots are new events; green dots are tracking events; x-axis: $C_{ii}$; y-axis: $C_{ij}$

Future Work
Improving NED by using named entities:
– Topic-conditioned novelty detection (Yang et al., 2002)
– A new similarity measure with semantic classes (Makkonen et al., 2002)
– Modified similarity metrics (Kumaran and Allan, 2004)
– Using names and topics (Kumaran and Allan, 2005)

Future Work
Intuition behind named entities:
– Who, Where, When
– People, organizations, places, date and time
How to embed named entities into NED:
– A new similarity matrix
– An additional similarity comparison with the extracted named entities
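The "additional similarity comparison" could take many forms. The sketch below shows one hypothetical option, not a method taken from the slides or the cited papers: the Jaccard overlap of the named entities is averaged over entity types and blended with a term-based similarity. The mixing weight and all names are assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(term_sim, entities_a, entities_b, weight=0.5):
    """Blend a term-based similarity with named-entity overlap.

    entities_a, entities_b: dicts mapping an entity type (for example
    'who', 'where', 'when') to the names found in each story.
    weight: hypothetical share given to the entity part, between 0 and 1.
    """
    types = set(entities_a) | set(entities_b)
    if not types:
        return term_sim  # nothing extracted, fall back to terms only
    entity_sim = sum(
        jaccard(entities_a.get(t, ()), entities_b.get(t, ())) for t in types
    ) / len(types)
    return (1 - weight) * term_sim + weight * entity_sim


story_a = {"who": {"Ali"}, "where": {"Ankara"}}
story_b = {"who": {"Ali"}, "where": {"Izmir"}}
print(combined_similarity(0.4, story_a, story_b))
```

Two stories that share vocabulary but differ in place or date would score lower than under term similarity alone, which is the intuition behind using Who, Where, and When.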

Future Work
Evaluation of the NED system
Judging documents:
– Random documents are selected from different categories
– Annotators judge the documents
– The same documents are processed by our system
– Finally, the system is evaluated by precision and recall with respect to the annotators’ judgments
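The final step can be sketched as a comparison of two sets of document identifiers. The identifiers below are invented, and the function name is hypothetical; only the precision and recall definitions are standard.

```python
def precision_recall(predicted_new, judged_new):
    """Precision and recall of the system's 'new event' decisions.

    predicted_new: ids of documents the system flagged as new events.
    judged_new:    ids of documents the annotators judged to be new events.
    """
    predicted_new, judged_new = set(predicted_new), set(judged_new)
    hits = len(predicted_new & judged_new)
    precision = hits / len(predicted_new) if predicted_new else 0.0
    recall = hits / len(judged_new) if judged_new else 0.0
    return precision, recall


# Hypothetical outcome: the system flags four documents, annotators mark three.
p, r = precision_recall({"d1", "d4", "d7", "d9"}, {"d1", "d4", "d5"})
print(p, r)
```

Here two of the four flagged documents are correct (precision 0.5) and two of the three judged new events are found (recall 2/3).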

Future Work
Developing an
– effective
– real-time
Web application capable of detecting new events and tracking old ones

Conclusion
We covered:
– New event detection and tracking concepts
– Test collection preparation
– Details of the designed system
Goals:
– Carry out leading research on Turkish
– Make dreams come true in Information Retrieval
– “Rising like a sun in the science world” (Fazli Can)

References
– Can F. and Ozkarahan E. A. “Concepts and effectiveness of the cover coefficient based clustering methodology for text databases”
– Kumaran G. and Allan J. “Text classification and named entities for new event detection”
– Makkonen J., Ahonen-Myka H., and Salmenkivi M. “Applying semantic classes in event detection and tracking”
– Yang Y., Zhang J., Carbonell J., and Jin C. “Topic-conditioned novelty detection”

Questions?
Thanks for your patience... Any questions?