MONITORING MESSAGE STREAMS: RETROSPECTIVE AND PROSPECTIVE EVENT DETECTION
Rutgers/DIMACS: improve on existing methods for monitoring huge streams of textualized communications, to automatically identify clusters of messages relating to significant "events." Kantor © 2002 (May 8, 2002)

Research Components
(1) compression of text to meet storage and processing limitations;
(2) representation of text into a form amenable to computation and statistical analysis;
(3) a matching scheme for computing similarity between documents in terms of the representation chosen;
(4) a learning method building on a set of judged examples to determine the key characteristics of a document cluster or "event"; and
(5) a fusion scheme that combines methods that are "sufficiently different" to yield improved detection and clustering of documents.
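As a rough illustration of components (2) and (3), the sketch below represents each message as a bag of words and matches two messages by cosine similarity. Both choices are assumptions made only for illustration (the slides do not say which representation or matching scheme the project uses), and the names `represent` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def represent(text):
    """Assumed representation: lowercase whitespace tokens -> counts."""
    return Counter(text.lower().split())


def cosine(r1, r2):
    """Assumed matching scheme: cosine similarity of two count vectors."""
    dot = sum(count * r2[token] for token, count in r1.items() if token in r2)
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty message matches nothing
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    a = represent("oil prices rise sharply")
    b = represent("oil prices fall")
    c = represent("election results announced")
    # Messages sharing vocabulary score higher than unrelated ones.
    print(cosine(a, b), cosine(a, c))
```

Any other representation (for example a compressed or projected vector) could be swapped in, as long as the matching scheme is defined in terms of it.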

Approach/objectives
* Sophisticated dimension reduction methods in a preprocessing stage.
* Sophisticated statistical tools in later stages.
* Goal: identify the best combination of such newer methods through a careful exploration of a variety of tools.
* Efficiency (in computational time and space).
* Combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.
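One possible dimension reduction step for the preprocessing stage is random projection, which a later slide lists among the representations developed in the project. The sketch below uses a dense Gaussian projection matrix; that variant, the scaling, and the names `random_projection_matrix` and `project` are assumptions, not the project's actual method.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Build a d_out x d_in matrix of Gaussian entries scaled by 1/sqrt(d_out).

    The Gaussian construction is an assumed variant, chosen for simplicity.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def project(matrix, vec):
    """Map a d_in-dimensional vector to d_out dimensions (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


if __name__ == "__main__":
    matrix = random_projection_matrix(d_in=1000, d_out=50, seed=42)
    doc = [0.0] * 1000
    doc[3] = 2.0    # hypothetical term weights
    doc[917] = 1.0
    reduced = project(matrix, doc)
    print(len(reduced))
```

Because the projection is linear and fixed once the seed is chosen, each incoming message can be reduced independently, which fits one-pass processing of a stream.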

Approach/objective (2)
* “Semi-supervised” learning: human analysts help to focus on features most indicative of change or anomaly; algorithms assess whether incoming documents deviate “significantly” on those features.
* New techniques are needed to represent the data, facilitating flagging of significant deviation (“abnormality”) through an appropriately defined metric.
* New clustering algorithms build on the analyst-designated features.
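A minimal sketch of the flagging idea, assuming the analyst-designated features are simply a set of words and that "significant" deviation means a large z-score of the feature-word rate against a baseline sample. The z-score test, the threshold of 3, and the function names are all assumptions; the slides leave the metric undefined.

```python
import statistics


def feature_rate(text, features):
    """Fraction of tokens in the text that are analyst-designated features."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in features for token in tokens) / len(tokens)


def flag_deviations(baseline_docs, incoming_docs, features, z_threshold=3.0):
    """Return incoming documents whose feature rate deviates from the baseline.

    Deviation is measured as a z-score (an assumed metric, not the project's).
    """
    rates = [feature_rate(doc, features) for doc in baseline_docs]
    mean = statistics.mean(rates)
    std = statistics.pstdev(rates)
    flagged = []
    for doc in incoming_docs:
        rate = feature_rate(doc, features)
        if std > 0:
            z = (rate - mean) / std
        else:
            # Degenerate baseline: any difference counts as a deviation.
            z = 0.0 if rate == mean else float("inf")
        if abs(z) > z_threshold:
            flagged.append(doc)
    return flagged
```

For example, with `features = {"attack", "bomb"}` and a baseline in which such words are rare, a message made up mostly of those words would be flagged while routine traffic would not.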

The Team
* David Lewis: information retrieval; designing and evaluating operational text classification systems.
* David Madigan: statistical methods for massive data sets; powerful extensions of Bayes classifiers.
* Paul Kantor: combining multiple methods for classifying or ranking documents to beat an oracle.
* Ilya Muchnik: kernel methods for machine learning; fast statistical clustering algorithm (millions of cases in reasonable time).
* Endre Boros: methods for Boolean representation and rule learning.
* Martin Strauss, S. Muthukrishnan, Rafail Ostrovsky: fast compression methods useful in one pass through data.
* Fred Roberts (PI): methods for combining scores in software and hardware testing; decision-making methods.

Data Sets
* TREC data: 5 CDs. Some subsets time stamped. Scores available for filtering and routing tasks (10^5-10^6 messages).
* Reuters Volume 1: 8x10^5 messages.
* Google: potentially 10^7 messages (Usenet set).
* MEDLINE: 10^7.

Judges/Experts
* Existing collections often contain judgments regarding relevance to a query (TREC) or classification information that can be treated as surrogate judgments (Reuters; MEDLINE).
* We would benefit greatly from a day spent with real analysts, to understand the types of judgments and classifications that are salient to them.
* Some overlap with Strzalkowski/Kantor work for AQUAINT.

Work Phase Ia
* Prepare available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Systematically explore combinations of methods for the supervised learning task (choosing promising combinations of compression, representation method, matching scheme, learning scheme, and fusion method).
* Hold a related workshop at IDA-CCR in Princeton (paid for by IDA-CCR). (Boros, Kantor, Lewis, Madigan, Muchnik)
* Test combinations of methods on common data sets, starting with the smaller ones; exchange information among researchers developing/testing different combinations of methods. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop promising compression/dimension reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss)

Work Phase Ib
* Refine available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Extend the systematic exploration of combinations of methods for the supervised learning task. Establish the limits of these technologies, including rates of convergence and probabilities of success. Code combined methods for experimental purposes. (Boros, Kantor, Lewis, Madigan, Muchnik)
* Test combinations of methods on common data sets, working up to the larger ones; exchange information. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop and test promising compression/dimension reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss)

Years 2 and 3
* Refine leading methods for supervised learning and test on increasingly realistic datasets.
* Develop research-quality code for the leading methods for supervised learning.
* Extend towards unsupervised learning and detection of suspicious message clusters before an event, based on a generalized stress measure indicating that a significant group of interrelated documents cannot fit into the known family of clusters.
* “Semi-supervised” learning: start with the case where an analyst defines a list or class of words to emphasize, later generalizing to the case where the analyst identifies more complex “features” used in the definition of “anomaly.”
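The slides do not define the generalized stress measure. As a stand-in, the sketch below scores how poorly a batch of document vectors fits a known set of cluster centroids: the stress is the fraction of documents farther than a chosen radius from every centroid. The Euclidean distance, the radius, and the function names are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def misfit(doc, centroids):
    """Distance from a document to its nearest known cluster centroid."""
    return min(euclidean(doc, centroid) for centroid in centroids)


def stress(docs, centroids, radius):
    """Return (fraction of documents that fit no known cluster, those documents).

    A document 'fits' when it lies within `radius` of some centroid;
    this is an assumed stand-in for the stress measure named on the slide.
    """
    outliers = [doc for doc in docs if misfit(doc, centroids) > radius]
    return len(outliers) / len(docs), outliers


if __name__ == "__main__":
    known = [(0.0, 0.0), (10.0, 10.0)]
    batch = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0), (5.0, 6.0)]
    level, unexplained = stress(batch, known, radius=2.0)
    print(level, unexplained)
```

A fuller version would also check that the unexplained documents are close to one another, since the slide speaks of a group of interrelated documents rather than isolated outliers.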

Deliverables: a cross matrix of activities by areas of focus
* Activities (Aspect): Algorithms; Code; Evaluation; Dissemination
* Foci (Research topic): Compression; Representation; Matching; Learning; Fusion

Focus by Aspect
* ASPECT: Algorithm; Code; Evaluate; Dissemination
* FOCUS: Compression; Representation; Matching; Learning; Fusion (PBK)

Kantor: Fusion
* Algorithm: linear; quadratic; non-parametric
* Code: Perl/awk low-speed hacker code
* Evaluate: TREC scores; Reuters classifications, etc.
* Dissemination: papers; web site; source codes (low documentation)
Message entities {m} are treated with various representation methods, m -> Rm. Inter-entity similarity is computed using various methods, S m,m' -> S(Rm, Rm'). Fusion combines methods to form metamethods.

Fusion
* At the representation level: R, R', R'' -> R*, which combines them. Example: direct sum of vector spaces.
* At the similarity level: S, S', S'' -> S*, which produces a composite similarity score. Examples: weighted sums; maximum; minimum; non-linear forms.

The “Fusion Program”
Given a space of representation methods {R} and of similarity methods {S}, answer these questions:
* Which methods can be combined to give results better than the best single method?
* What forms of combination (fusion), for those methods, produce the best results?
* When is fusion called for and when should it be avoided?

Fusion: method
* Empirical exploration of the space of possibilities, guided by applicable principles from statistics, either directly or via machine learning.
* Evaluation using the existing sets of classified or judged message texts.
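A toy version of this empirical exploration, assuming two similarity methods, binary relevance judgments, and a linear fusion w*S + (1-w)*S' searched over a grid of weights. Quality is measured by pairwise ranking accuracy (AUC), which is an assumed evaluation measure; the scores and judgments in the example are made up.

```python
def auc(scores, labels):
    """Fraction of (relevant, non-relevant) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_linear_fusion(s1, s2, labels, steps=10):
    """Grid-search the weight w in w*s1 + (1-w)*s2; return (best_w, best_auc)."""
    best = None
    for i in range(steps + 1):
        w = i / steps
        fused = [w * a + (1 - w) * b for a, b in zip(s1, s2)]
        quality = auc(fused, labels)
        if best is None or quality > best[1]:
            best = (w, quality)
    return best


if __name__ == "__main__":
    judged = [1, 1, 0, 0]             # made-up relevance judgments
    method_a = [0.9, 0.2, 0.5, 0.1]   # made-up scores from one method
    method_b = [0.2, 0.9, 0.1, 0.5]   # made-up scores from another
    print(auc(method_a, judged), auc(method_b, judged))
    print(best_linear_fusion(method_a, method_b, judged))
```

In this contrived case each method misranks a different relevant message, so a mixed weight ranks all judged pairs correctly while either method alone does not. Weights tuned and evaluated on the same judged set will look optimistic, so a held-out set would be needed for a fair comparison.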

Fusion: Deliverables
* CONCEPTUAL: Systematic tabulation of the effectiveness of various fusion approaches, applied to the set of representations and similarity schemes (a) commonly available [the LEMUR toolkit] and (b) developed in this project [LAD, random projection, ..].
* USABLE: Simple codes for performing various kinds of fusion, both ad hoc and adaptive.
* DISSEMINATION: Reports, papers, web site.
