Monitoring Message Streams: Retrospective and Prospective Event Detection
Paul Kantor, Dave Lewis, David Madigan, Fred Roberts
DIMACS, Rutgers University

Presentation transcript:

1 Monitoring Message Streams: Retrospective and Prospective Event Detection Paul Kantor, Dave Lewis, David Madigan, Fred Roberts DIMACS, Rutgers University

2 DIMACS is a partnership of: Rutgers University, Princeton University, AT&T Labs, Bell Labs, NEC Research Institute, and Telcordia Technologies

3 Motivation: sniffing and monitoring traffic. OBJECTIVE: Monitor streams of textualized communication to detect pattern changes and "significant" events

4 Given a stream of text in any language, decide whether "new events" are present in the flow of messages. Event: a new topic, or a topic with an unusual level of activity. Retrospective or “Supervised” Event Identification: Classification into pre-existing classes. TECHNICAL PROBLEM:

5 More Complex Problem: Prospective Detection or “Unsupervised” Learning  Classes change - new classes or change meaning  A difficult problem in statistics  Recent new C.S. approaches 1) Algorithm suggests a new class 2) Human analyst labels it; determines its significance

6 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING (1). Compression of Text -- to meet storage and processing limitations; (2). Representation of Text -- put in form amenable to computation and statistical analysis; (3). Matching Scheme -- computing similarity between documents; (4). Learning Method -- build on judged examples to determine characteristics of document cluster (“event”) (5). Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.

7 Existing methods use some or all 5 automatic processing components, but don’t exploit the full power of the components and/or an understanding of how to apply them to text data. Dave Lewis' method at TREC filtering used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data. The combination still dominated competing approaches in the TREC-2001 batch filtering evaluation. OUR APPROACH: WHY WE CAN DO BETTER THAN STATE OF THE ART:

8 Existing methods aim at fitting into available computational resources without paying attention to upfront data compression. We hope to do better by a combination of:  more sophisticated statistical methods  sophisticated data compression in a pre-processing stage  optimization of component combinations OUR APPROACH: WHY WE CAN DO BETTER II:

9 COMPRESSION: Reduce the dimension before statistical analysis. Recent results: a “one-pass” reduction through the data (e.g., using random projections) can reduce volume significantly w/o degrading performance significantly, unlike feature-extracting dimension reduction, which can lead to bad results. We believe that sophisticated dimension reduction methods in a preprocessing stage followed by sophisticated statistical tools in a detection/filtering stage can be a very powerful approach.
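The one-pass random projection mentioned above can be sketched as follows. The sparse ±1 scheme (an Achlioptas-style projection) is an illustrative assumption; the slide does not name a specific projection matrix.

```python
import math
import random

def random_projection(d_in, d_out, seed=0):
    """Build a sparse Achlioptas-style projection matrix: entries are
    +1 with probability 1/6, 0 with probability 2/3, -1 with probability
    1/6, scaled by sqrt(3 / d_out).  This is one standard one-pass
    scheme, chosen here for illustration."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0 / d_out)
    return [[scale * rng.choice([1, 0, 0, 0, 0, -1]) for _ in range(d_in)]
            for _ in range(d_out)]

def project(vec, matrix):
    """Map a d_in-dimensional term vector down to d_out dimensions."""
    return [sum(row[i] * vec[i] for i in range(len(vec))) for row in matrix]
```

Because such projections approximately preserve pairwise distances (the Johnson-Lindenstrauss property), matching and clustering can then run in the reduced space.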

10 Representations: Boolean representations; weighting schemes Matching Schemes: Boolean matching; nonlinear transforms of individual feature values Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting; Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes MORE SOPHISTICATED STATISTICAL APPROACHES :
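As one concrete instance of the fusion component ("combining scores based on ranks"), here is a minimal rank-based score combiner. Reciprocal-rank fusion with the common default k = 60 is an illustrative assumption, not the project's chosen scheme.

```python
def fuse_by_rank(system_scores, k=60):
    """Rank-based fusion: each system ranks the documents by its own
    score, and a document's fused score is the sum of 1 / (k + rank)
    across systems.  Ranks start at 1; ties are broken by sort order."""
    fused = {}
    for scores in system_scores:
        ranking = sorted(scores, key=scores.get, reverse=True)
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused
```

Working on ranks rather than raw scores sidesteps the problem that different matching schemes produce scores on incomparable scales.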

11 Identify best combination of newer methods through careful exploration of variety of tools. Address issues of effectiveness (how well task is done) and efficiency (in computational time and space) Use combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives. OUTLINE OF THE APPROACH

12 Extend work to unsupervised learning. Still concentrate on new methods for the 5 components. Emphasize “semi-supervised learning” - human analysts help to focus on features most indicative of anomaly or change; algorithms assess incoming documents as to deviation on those features. Develop new techniques to represent data to highlight significant deviation:  Through an appropriately defined metric  With new clustering algorithms  Building on analyst-designated features IN LATER YEARS

13 Strong team: Statisticians: David Madigan, Rutgers Statistics; Ilya Muchnik, Rutgers CS Experts in Information Retrieval & Library Science & Text Classification: Paul Kantor, Rutgers Information and Library Science; David Lewis, Private Consultant THE PROJECT TEAM:

14 Learning Theorists/Operations Researchers: Endre Boros, Rutgers Operations Research Computer Scientists: Muthu Muthukrishnan, Rutgers CS; Martin Strauss, AT&T Labs; Rafail Ostrovsky, Telcordia Technologies Decision Theorists/Mathematical Modelers: Fred Roberts, Rutgers Math/DIMACS Homeland Security Consultants: David Goldschmidt, IDA-CCR THE PROJECT TEAM:

15 12 MONTHS: We will have established a state-of-the-art scheme for classification of accumulated documents in relation to known tasks/targets/themes, and for building profiles to track future relevant messages. We are optimistic that by end-to-end experimentation, we will discover synergies between new mathematical and statistical methods for addressing each of the component tasks and thus achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks. IMPACT:

16 3 YEARS: Prototype code for testing the concepts and a precise system specification for commercial or government development. We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents. This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring. IMPACT:

17 RISKS Data will not be realistic enough. We will find it harder than expected to combine good approaches to the 5 components. Multidisciplinary cooperation won’t work as well as we think.

18 TOP ACCOMPLISHMENTS TO DATE Infrastructure Work to Date (1 of 2) --Built platform for text filtering experiments *Modified CMU Lemur retrieval toolkit to support filtering *Created newswire test set with test information needs (250 topics, 240K documents) *Wrote evaluation and adaptive thresholding software
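The adaptive thresholding software is not described in detail on the slide; the following is a minimal sketch of one plausible update rule, where the step size and the update directions are assumptions for illustration only.

```python
def adapt_threshold(threshold, delivered, was_relevant, step=0.05):
    """One-step adaptive thresholding sketch: after relevance feedback
    on a delivered document, raise the threshold after a false alarm
    and lower it after a hit.  Undelivered documents yield no feedback,
    so the threshold is unchanged.  step=0.05 is an assumed value."""
    if delivered and not was_relevant:
        return threshold + step   # false alarm: be stricter
    if delivered and was_relevant:
        return threshold - step   # hit: deliver more liberally
    return threshold              # no feedback available
```

In a real filtering run the threshold would be applied to the classifier's score for each incoming message, and updated only on the messages actually shown to the analyst.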

19 TOP ACCOMPLISHMENTS TO DATE II Infrastructure Work to Date (2 of 2): --Implemented a fundamental adaptive linear classifier (Rocchio) --Benchmarked it using our data sets and submitted to NIST TREC evaluation
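The Rocchio classifier referred to above builds a linear profile from judged positive and negative examples. This sketch represents documents as sparse term-weight dicts; the beta/gamma weights are illustrative defaults, not the benchmarked configuration.

```python
def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Rocchio profile: beta * centroid(positive examples) minus
    gamma * centroid(negative examples).  Documents are sparse dicts
    mapping term -> weight (e.g., tf-idf)."""
    profile = {}
    for docs, coeff in ((pos_docs, beta / max(len(pos_docs), 1)),
                        (neg_docs, -gamma / max(len(neg_docs), 1))):
        for doc in docs:
            for term, weight in doc.items():
                profile[term] = profile.get(term, 0.0) + coeff * weight
    return profile

def profile_score(profile, doc):
    """Dot product of profile and document; the filter passes a document
    to the analyst when this score exceeds an (adaptive) threshold."""
    return sum(profile.get(term, 0.0) * w for term, w in doc.items())
```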

20 TOP ACCOMPLISHMENTS TO DATE III Developed a Formal Framework for Monitoring Message Streams: Cast Monitoring Message Streams as a multistage decision problem For each message, decide to send to an analyst or not Positive utility for sending an “interesting” message; else negative…but

21 A Formal Framework for Monitoring Message Streams Continued …positive “value of information” even for negative documents Use Influence Diagrams as a modeling framework Key input is the learning curve Building simple learning curve models BinWorld – discrete model of feature space
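Slides 20-21 frame filtering as a multistage decision problem. A minimal expected-utility rule for the per-message send/don't-send decision might look like the following; all numeric utility values, including the small positive "value of information" credited even for negative documents, are illustrative assumptions.

```python
def send_to_analyst(p_interesting, u_hit=1.0, u_false_alarm=-0.1,
                    value_of_info=0.02):
    """Send a message iff its expected utility is positive.
    u_hit rewards an interesting message; u_false_alarm charges for an
    uninteresting one; value_of_info credits even a negative document
    with a small positive amount, since the analyst's judgment helps
    train the filter.  All values here are assumed for illustration."""
    expected = (p_interesting * u_hit
                + (1.0 - p_interesting) * (u_false_alarm + value_of_info))
    return expected > 0.0
```

With these numbers the rule reduces to a probability threshold of about 0.074: any message the classifier judges at least that likely to be interesting gets forwarded.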

22 TOP ACCOMPLISHMENTS TO DATE IV In June, held a “data mining in homeland security” tutorial and workshop at IDA-CCR Princeton. Organized algorithmic approaches to compression/dimension reduction. Began work on nearest-neighbor search methods.

23 Prepare available corpora of data on which to uniformly test different combinations of methods Concentrate on supervised learning and detection Systematically explore & compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes Test combinations of methods on common data sets and exchange information among the team Develop and test promising dimension reduction (compression) methods S.O.W: FIRST 12 MONTHS:

24 Midterm Exam (by end of November): Reports on Algorithms: draft writeups Research Quality Code: Under Development Reports on Experimental Evaluation: Interim Project Report Dissemination: draft writeups, interim report plus website, workshop in June 2002 just prior to beginning of project S.O.W: FIRST 12 MONTHS:

25 Final Exam (by end of First 12 Months): Reports on Algorithms: formal writeups as technical reports and research papers Research Quality Code: Made available to sponsors and mission agencies on a web site Reports on Experimental Evaluation: Project Report Summarizing end-to-end studies on effectiveness of different components of our approach + their effectiveness in combination Dissemination: technical reports, conference papers, journal submissions, final reports on algorithms and experimental evaluation, refinement of websites, meetings with sponsors and mission agencies. End of Year 1 Workshop for Sponsors/Practitioners. S.O.W: FIRST 12 MONTHS:

26 Combine leading methods for supervised learning with promising upfront dimension reduction methods Develop research quality code for the leading identified methods for supervised learning Develop the extension to unsupervised learning: Detect suspicious message clusters before an event has occurred Use generalized stress measures indicating that a significant group of interrelated messages doesn’t fit into the known family of clusters Concentrate on semi-supervised learning. S.O.W: YEARS 2 AND 3:
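One simple way to realize a "generalized stress measure" for prospective detection is distance to the nearest known cluster centroid: a group of incoming messages that all sit far from every known centroid suggests a new, possibly significant cluster. Nearest-centroid Euclidean distance here is a stand-in assumption for whatever measure the project actually develops.

```python
import math

def cluster_stress(doc_vec, centroids):
    """Stress of a message vector relative to the known clusters,
    measured as Euclidean distance to the nearest centroid.  High
    stress over a batch of interrelated messages flags a group that
    doesn't fit the known family of clusters."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(euclidean(doc_vec, c) for c in centroids)
```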

27 The task we face is of great value in forensic activities. We are bringing to bear on this task a multidisciplinary approach with a large, enthusiastic, and experienced team. Preliminary results are very encouraging. Work is needed to make sure that our ideas are of use to analysts. WE ARE OFF TO A GOOD START