

1 MONITORING MESSAGE STREAMS: RETROSPECTIVE AND PROSPECTIVE EVENT DETECTION
Rutgers/DIMACS improve on existing methods for monitoring huge streams of textualized communications to automatically identify clusters of messages relating to significant "events." Kantor © 2002 May

2 Research Components
(1) compression of text to meet storage and processing limitations;
(2) representation of text in a form amenable to computation and statistical analysis;
(3) a matching scheme for computing similarity between documents in terms of the representation chosen;
(4) a learning method building on a set of judged examples to determine the key characteristics of a document cluster or "event"; and
(5) a fusion scheme that combines methods that are "sufficiently different" to yield improved detection and clustering of documents.

3 Data Sets
TREC Data: 5 CDs. Some subsets time stamped. Scores available for filtering and routing tasks (10^5-10^6 messages).
Reuters Volume 1: 8x10^5 messages.
Google: potentially 10^7 messages (Usenet set).
MEDLINE: 10^7

4 Data Sets (2)
Members of the project team are experienced in dealing with the TREC and Reuters data sets.
Fast "learning curve" at start of the project.
Filtering data simulate incoming streams.

5 Judges/Experts
Existing collections often contain judgments regarding relevance to a query (TREC) or classification information that can be treated as surrogate judgments (Reuters; MEDLINE).
We would benefit greatly from continuous interaction with real analysts, to understand the types of judgments and classifications that are salient to them.
Some overlap with Strzalkowski/Kantor work for AQUAINT (HITIQA project).

6 Work Phase I July 1-Nov
* Prepare available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Systematically explore combinations of methods for the supervised learning task (choosing promising combinations of compression, representation method, matching scheme, learning scheme, and fusion method).
* Hold a related workshop at IDA-CCR in Princeton (paid for by IDA-CCR). (ALL)
* Test combinations of methods on common data sets, starting with the smaller ones; exchange information among researchers developing/testing different combinations of methods. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop promising compression/dimension reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss, Abello*)

7 Work Phase II Dec 1 '02 – June 30 '03
* Refine available corpora of data on which to uniformly test different combinations of methods. (Kantor, Lewis)
* Extend the systematic exploration of combinations of methods for the supervised learning task. Establish the limits of these technologies, including rates of convergence and probabilities of success. Code combined methods for experimental purposes. (Boros, Kantor, Lewis, Madigan, Muchnik)
* Test combinations of methods on common data sets, working up to the larger ones; exchange information. Identify promising methods for further development. (Boros, Kantor, Lewis, Madigan, Muchnik, Roberts)
* Develop and test promising compression/dimension reduction methods, especially for "streaming data" analysis. (Muchnik, Muthukrishnan, Ostrovsky, Strauss)
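As one illustration of the compression/dimension reduction item for streaming data, here is a minimal Python sketch of a hashed random projection (often called feature hashing). It is an assumption chosen for illustration, not the project's own method: the names `update` and `sketch`, the use of CRC32, and the 64-bucket width are all hypothetical choices.

```python
import zlib


def update(vec, token):
    """Fold one incoming token into a fixed-size sketch vector, in place."""
    h = zlib.crc32(token.encode("utf-8"))
    # Hypothetical choice: one hash bit picks the sign, the remainder picks the bucket.
    sign = 1.0 if (h >> 16) & 1 == 0 else -1.0
    vec[h % len(vec)] += sign


def sketch(tokens, dim=64):
    """Compress a token stream into `dim` numbers, one token at a time."""
    vec = [0.0] * dim
    for tok in tokens:
        update(vec, tok)
    return vec


if __name__ == "__main__":
    print(sketch("oil prices rise as oil supply falls".split(), dim=8))
```

Because each token only touches one bucket, the sketch can be maintained as messages arrive, and the sketch of two concatenated streams is the element-wise sum of their sketches.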

8 Specific Deliverables
We can best discuss deliverables in terms of our individual responsibilities, with reference to a cross matrix of activities by areas of focus. Within each cell of the matrix the specific deliverables may be reports, research-quality code, or evaluation experiments.
Activities (aspect of the project), with deliverables listed for End of Part I; End of Part II:
Architecture: data structures
Algorithms: draft writeups; tech reports + research papers
Code: research-quality code
Evaluation: results; final summary
Dissemination: web site; code; reports; papers; Workshop 2
Focus (specific research topic): Compression, Representation, Matching, Learning, Fusion Combination

9 Allocation of Deliverables and of Responsibility

10 Kantor (details)
Fusion
Linear; quadratic; non-parametric
Perl/awk low speed hacker code
TREC scores; Reuters classifications etc.
Papers; web site; source codes (low documentation)
Message entities {m} are treated with various representation methods, m -> R_m.
Inter-entity similarity is computed using various methods, S_{m,m'} -> S(R_m, R_m').
Fusion combines methods to form metamethods.
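To make the mapping from a message m to a representation R_m, and from a pair of representations to a similarity score, concrete, here is a minimal Python sketch. It assumes one particular pair of choices, bag-of-words term counts and cosine similarity, which the slides do not specify; the function names `represent` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def represent(message):
    """R_m: bag-of-words term counts (one illustrative representation method)."""
    return Counter(message.lower().split())


def cosine(r1, r2):
    """S(R_m, R_m'): cosine similarity between two term-count representations."""
    dot = sum(r1[t] * r2[t] for t in r1.keys() & r2.keys())
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0  # an empty message is treated as similar to nothing
    return dot / (n1 * n2)


if __name__ == "__main__":
    a = represent("oil prices rise")
    b = represent("oil supply falls")
    print(cosine(a, b))
```

Swapping in a different `represent` or a different similarity function gives another (R, S) pair, which is the kind of variety that fusion is meant to combine.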

11 Fusion
At the representation level: R, R', R'' -> R*, which combines them. Example: direct sum of vector spaces.
At the similarity level: S, S', S'' -> S*, which produces a composite similarity score. Examples: weighted sums; maximum; minimum; non-linear forms.
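The two levels of fusion on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's code: representations are taken to be plain numeric lists, and the function names `direct_sum` and `fuse_scores` and the default of equal weights are hypothetical.

```python
def direct_sum(r, r_prime):
    """Representation-level fusion: concatenate two vectors (direct sum of the spaces)."""
    return list(r) + list(r_prime)


def fuse_scores(scores, mode="weighted", weights=None):
    """Similarity-level fusion: combine several similarity scores into one."""
    if mode == "max":
        return max(scores)
    if mode == "min":
        return min(scores)
    if mode == "weighted":
        if weights is None:
            # Assumed default: equal weight for every method.
            weights = [1.0 / len(scores)] * len(scores)
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError("unknown fusion mode: " + mode)


if __name__ == "__main__":
    print(direct_sum([1.0, 0.0], [0.5]))
    print(fuse_scores([0.2, 0.8], mode="weighted", weights=[0.25, 0.75]))
```

Non-linear forms would slot in as further `mode` branches; they are left out here because the slide does not say which ones are intended.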

12 The “Fusion Program”
Given a space of representation methods {R} and of similarity methods {S}, answer these questions:
Which methods can be combined to give results better than the best single method?
What forms of combination (fusion), for those methods, produce the best results?
When is fusion called for and when should it be avoided?

13 Fusion: method
Empirical exploration of the space of possibilities, guided by applicable principles from statistics either directly or via machine learning.
Evaluation using the existing sets of classified or judged message texts.
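A small Python sketch of what such an empirical exploration could look like: grid-search the weight of a two-method weighted-sum fusion and score each setting against judged examples. Everything specific here is an assumption for illustration, including the use of average precision as the measure, the function names, and the four-message toy data, which is made up and is not drawn from TREC or Reuters.

```python
def average_precision(scores, relevant):
    """Average precision of the ranking by descending score; `relevant` holds 0/1 judgments."""
    ranked = sorted(zip(scores, relevant), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, rel) in enumerate(ranked, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def best_weighted_fusion(s1, s2, relevant, steps=10):
    """Grid-search w in [0, 1] for w*s1 + (1-w)*s2; return (best_w, best_ap)."""
    best_w, best_ap = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        fused = [w * a + (1 - w) * b for a, b in zip(s1, s2)]
        ap = average_precision(fused, relevant)
        if ap > best_ap:
            best_w, best_ap = w, ap
    return best_w, best_ap


if __name__ == "__main__":
    # Made-up toy judgments and scores, for illustration only.
    relevant = [1, 1, 0, 0]
    s1 = [0.9, 0.2, 0.5, 0.1]
    s2 = [0.3, 0.8, 0.1, 0.6]
    print(average_precision(s1, relevant), average_precision(s2, relevant))
    print(best_weighted_fusion(s1, s2, relevant))
```

Since w = 0 and w = 1 reproduce the two single methods, the search can never do worse than the better of them on the data it was tuned on; whether the gain holds on unseen messages is the empirical question.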

14 Fusion: Deliverables
CONCEPTUAL: Systematic tabulation of the effectiveness of various fusion approaches, applied to the set of representations and similarity schemes (a) commonly available [the LEMUR toolkit] and (b) developed in this project [LAD, random projection, ..].
USABLE: Simple codes for performing various kinds of fusion, both ad hoc and adaptively.
DISSEMINATION: Reports, papers, web site.
END KANTOR

