1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University.

1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University

2 Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition) MMS: Goal Monitor huge communication streams, in particular, streams of textualized communication to automatically detect pattern changes and "significant" events

3 Synergistic improvements in -Performance in terms of space, time, effectiveness, and/or “insight” -Understanding the tradeoffs among these types of improvements -Compression for efficient resource use -Representation that aids fitting models -Efficient matching of text to text and model to text -Learning models from data and prior knowledge -Reduction in need for large amounts of training data or labor-intensive input -Fusion of complementary filtering approaches MMS: Overall Objectives

4 Emphasis on “Supervised” Filtering: -Given example documents, textbook descriptions, etc., find documents on this topic in incoming stream or past data Less Emphasis on “Unsupervised” Event Identification: -Detect emergent characteristics, anomalous patterns, etc. in incoming stream of text or historical statistics on the stream MMS: Approaches

5 Batch filtering: All training texts processed before any texts of active interest to user Adaptive filtering: User trains system during use -Value of examples for both information and training must be considered MMS Approaches: Supervised Filtering

6 Creating summary statistics on massive data streams -Detect outliers, heavy hitters (most frequent items), etc. -Allow us to return to past without keeping raw data Reducing need for labeled training examples in supervised classification -Bayesian priors from domain knowledge -Tuning on unlabeled data MMS Approaches: Dealing with Massive Data

7 Bayesian Logistic Regression Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad-hoc feature selection State of art text classification effectiveness -Recently: Highest score on TREC 2004 triage task Public release of our Bayesian Binary Regression (BBR) software (500 downloads) Accomplishments Phase II (Jan ‘04 – Sep ‘04) Thomas Bayes

8 Bayesian Logistic Regression (cont’d) Ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available New online algorithms: online updating of Bayesian models as new data become available Accomplishments Phase II (Jan ‘04 – Sep ‘04)

9 Streaming Algorithms New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams Rapid methods for finding changing trends, outliers and deviants, rare events, heavy hitters Initial results using summarized data to search for meaningful answers to queries about the past Initial work on textual and structural patterns in informal communication networks Accomplishments Phase II (Jan ‘04 – Sep ‘04) 1 0 1 1 1 0 1 0 0 1 1

10 Nearest neighbor classification: Fast implementation Continued development of heuristics for approximate neighbor finding with an in-memory inverted index Our results have reduced memory by 90% and time by 90 to 99% with minimal impact on effectiveness. Packaged and delivered kNN software Developing algorithms for speeding up slow but potentially highly effective “local learning” approach -Based on training a separate logistic regression on the neighbors of each test document! -Slow, but with many avenues to large speedups Accomplishments Phase II (Jan ‘04 – Sep ‘04)

11 Adaptive Filtering Models to Aid in Learning: When to act greedily (“exploit” -- submit documents we believe relevant) and when to take risks (“explore” -- submit documents that can be irrelevant) Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff Experiments show slight improvements in filtering effectiveness compared to greedy (exploit-only) approach Accomplishments Phase II (Jan ‘04 – Sep ‘04)

12 Bayesian methods assume prior beliefs about parameters before data is seen -Project Phase I: generic, vague priors -Project Phase II: Reference materials or intuitions about words may help predict class. Use these to set priors. (Material very unlike training examples) Goal: reduce need for training examples -Replace 1000’s of randomly sampled examples with few, possibly biased examples Some MMS Work in Depth: Bayesian Priors from Domain Knowledge

13 Reference texts have some non-topical words -Use words that discriminate among topics (use Inverted Document Frequency (IDF) weighting within reference collection) Small training sets increase problems with thresholding and text representation -Use unlabeled data to aid thresholding and to learn IDF weights -Use separate prior for intercept term of model Knowledge-Driven Priors: Issues

14 Topics: 27 Reuters Region categories Knowledge: CIA World Factbook (WFB) entries Examples: 10/topic Baseline results (F1 measure): WFB: 0.234, no WFB: 0.052 Better small training sets, improved algorithms WFB: 0.591, no WFB: 0.395 Knowledge-Driven Priors: Results

15 Reference materials of text type very different from documents to be classified can aid supervised filtering -In combination with tuning on unlabeled data, this technique can provide immediate practical benefits Current methods are crude and ad hoc – substantial improvements should be possible Knowledge-Driven Priors: Summary

16 Some MMS Work in Depth: Streaming Analysis Problem: Monitor fast, massive text streams and support both online tracking as well as historic analysis for events. Multidimensional data: source, destination, time sent or received, metadata (reply, language), text labels (words, phrases), links. Goal: To use highly compact summaries that are computed at stream speed and perform accurate analyses.

17 Streaming Analysis Tool: CM Sketch Theoretical: We have developed the CM Sketch that uses (1/  ) log 1/  space to approximate data distribution with error at most , and probability of success at least 1- . –All other previously known sample or sketch methods use space at least (1/   ). – CM Sketch is an order of magnitude better. Practical: Few 10's of KBs gives accurate summary of large data: Create summaries of data that allow historic queries to find –Heavy Hitters (Most Frequent Items) –Quantiles of a Distribution (Median, Percentiles etc.) –Finding items with large changes

18 Streaming Analysis: Using Web Logs Web logs (blogs) or regularly updated online journals provide informal, opinionated, candid data that is more like email than is the web. We have begun to automatically collect blogs, stripping formatting and tags, ads, etc., and outputting corresponding "bag of words" into streaming algorithms for analysis, archiving. 10’s to 100 GB scale. 3000:1 compression using CM Sketch methods. Allows accurate analysis of popular words, new emergent words, etc., including multilingual occurrences.

19 Classic method: Rocchio Classic method: Centroid kNN with IFH (inverted file heuristic) Sparse Bayesian (Bayesian with Laplace priors) Combinatorial PCA Homotopic Linking of Widely Varying Rocchio Methods aiSVM Fusion Deliverables: Phase I

20 Revised and extended version of kNN code, including scripts for running local learning experiments Substantially extended version of BBR, including use of domain knowledge to set priors CM Sketch (C library for count-min sketching) Code to use CM Sketch to find heavy hitters, quantiles, and large changes in streams Deliverables: Phase II

21 Bayesian Expand types of domain knowledge usable -For instance, making use of the taxonomies available in many subject areas Improve self-tuning of BBR software -Make it more effective for novice users -Surprisingly subtle questions: Cross- validation, calibration, scaling (e.g., when multiple features) Incorporate previous work on online Bayesian methods into BBR MMS: Future Directions

22 Streaming Systematically explore summarization methods such as sampling, bitmaps, sketches -Develop warehousing techniques for large scale sketch-based historical analyses Massiveness of data implies linear algorithms too inefficient. Seek sublinear methods. Develop sketch-based methods for link analysis in temporally changing multigraphs -From and To addresses in email, links between blogs, etc. Add modeling component to the sketch-based analysis: Exploit knowledge of distribution of the data. MMS: Future Directions

23 kNN kNN with small training sample for each of massive number of topics -maybe only 5 to 10 known relevant/irrelevant documents -Since small samples have little overlap, extend kNN approach to deal with partially labeled datasets Bayesian kNN -Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data). -More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics) MMS: Future Directions

24 Greedy Round Robin Feature Selection In phase I work: Explored greedy heuristic to choose subset of original set of terms as features -Did extremely well in TREC2002 “topic intersection tasks” Will develop a Greedy Round Robin (GRR) method -Applies if features fall into two or more “conceptually distinct” sets (e.g., metadata such as source/destination, genre or medium of the message) -Each list of features is consulted in turn. -Plan experimental analysis of GRR -Plan theoretical analysis of GRR using simulation MMS: Future Directions

25 Adaptive filtering Experiment with new adaptive thresholding methods (synergy with Bayesian thresholding work) -Scoring threshold is adjusted downward if judging too many irrelevant documents; upward if judging too few relevant documents Aim for algorithm with state-of-art effectiveness and provable theoretical properties Compare rate of convergence of various algorithms on real data. MMS: Future Directions

26 Paul Kantor, Rutgers Communic., Info.& Library Studies Dave Lewis, Consultant Michael Littman, Rutgers CS David Madigan, Rutgers Statistics S. Muthukrishnan, Rutgers CS Rafail Ostrovsky, Telcordia/UCLA Fred Roberts, Rutgers DIMACS/Math Martin Strauss, AT&T Labs/U. Michigan) Wen-Hua Ju, Avaya Labs (collaborator) Andrei Anghelescu, Graduate Student Suhrid Balakrishnan, Graduate Student Aynur Dayanik, Graduate Student Dmitry Fradkin, Graduate Student Peng Song, Graduate Student Graham Cormode, postdoc Alex Genkin, software developer Vladimir Menkov, software developer MMS PROJECT TEAM :

1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University.

Similar presentations

Presentation on theme: "1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University.

Similar presentations

Presentation on theme: "1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Fred Roberts, Rutgers University."— Presentation transcript:

Similar presentations

About project

Feedback