Rutgers/DIMACS MMS Project


1 Rutgers/DIMACS MMS Project
Investigators: Fred Roberts (PI), Paul Kantor, Dave Lewis (consultant), David Madigan, Endre Boros, Ilya Muchnik, Rafi Ostrovsky, Martin Strauss
Developers: Alex Genkin, Vladimir Menkov (consultant)
Students: Andrei Anghelescu, Aynur Dayanik, Dmitriy Fradkin, Peng Song
Presentation for MMS Site Visit, DIMACS, Rutgers Univ., P. Kantor & D. Lewis

2 Outline
Context
Software deliverables
Adaptive Filtering (Rocchio, Centroid, kNN)
Homotopy
BBR (Bayesian logistic regression)
libAML (cPCA, tuned SVM)
Fusion
Other results of project

3 Context
Roberts' presentation
Filter for items of known interest
Retrospectively look for precursors
Let ML algorithms learn from training examples
Models based on statistics and stochastic methods do much better than they "should" at classifying human products.

4 Statistics Meets Computation
Many powerful ideas originated in statistics but were long deemed computationally infeasible, such as:
Massive search
Massive Bayesian models
Highly non-linear optimization
Nearest neighbor modeling
Moore's Law + Ingenuity = New Ball Game

5 Dual Goal Structure
Understand Filtering:
Massive experimentation
Exploration of models
Study of theoretical issues
Move ultimately from Batch == Training to Online == Discovery
Improve Security:
Design to real problems
Attention to space and time effectiveness
Insights
Support migration: deliver prototypes, work with test beds

6 MMS’s View of Filtering
The MMS project is oriented around five abstract filtering "conceptual stages":
A. Turning texts into data for processing (1. Compression and 2. Representation)
B. Finding relations between different texts, and improving effectiveness as more data arrives (3. Matching and 4. Learning)
C. Combining alternate methods that accomplish the same step (5. Fusion)

7 One Program Sometimes Handles Several Components
This division into 3 (or 5) parts is not always reflected in the packages; one algorithm may involve multiple "components"
This was done to make the research more efficient:
Component processing may be needed in conditional, multiple, or iterative fashion
Running lots of experiments efficiently requires several components to live in the same "program"

8 MMS Deliverables
Research software (focus of this talk):
Source code
Experimenter-oriented documentation
Other deliverables include:
Experimental results
Insights

9 1. Adaptive Filtering Software (AFS)
Classic techniques for text retrieval, automated indexing, etc., applied to filtering
Term weighting to compensate for variations in corpus properties and document length (see the sketch below)
Size of in-memory index structures can be tuned: memory/time can be reduced 90% with little impact on effectiveness
Learning algorithms: Rocchio, Centroid, kNN
Powerful, but lots of settings to choose, so...
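As an illustration of the classic term weighting AFS relies on, here is a minimal TF-IDF sketch with cosine length normalization in Python. This is not the AFS code itself; the toy tokenization and the exact log-TF/IDF formula are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Classic TF-IDF with cosine length normalization.

    docs: list of token lists. Returns one {term: weight} dict per doc.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Log-TF damps long documents; IDF downweights corpus-wide terms
        vec = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

docs = [["attack", "plan", "attack"], ["plan", "meeting"], ["weather", "report"]]
print(tfidf_vectors(docs)[0])
```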

10 2. Homotopy
AFS (and classic information retrieval methods in general) has many system parameters to tune
The Homotopy software smoothly varies parameter values to find optimal tradeoffs (see the sketch below)
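The homotopy idea, reduced to its simplest form: walk a parameter smoothly along a path and record effectiveness at each point. The sketch below is a plain grid walk in Python, assuming a hypothetical `evaluate` callback that runs the filter at a given setting; the actual Homotopy software is considerably more sophisticated.

```python
import numpy as np

def parameter_sweep(evaluate, start, end, steps=20):
    """Walk a parameter smoothly from `start` to `end`, recording
    effectiveness at each point, and return the best setting.

    `evaluate` is a hypothetical callback: parameter value -> score
    (e.g. F1 of a Rocchio filter run with that setting).
    """
    best_val, best_score = start, float("-inf")
    for val in np.linspace(start, end, steps):
        score = evaluate(val)
        if score > best_score:
            best_val, best_score = val, score
    return best_val, best_score

# Toy objective with an interior optimum near 0.7
best, score = parameter_sweep(lambda b: -(b - 0.7) ** 2, 0.0, 2.0)
print(best, score)
```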

11 3. BBR (Bayesian Binary Regression)
Many algorithms can produce highly effective text filters from examples
But the resulting filters can be very inefficient, using tens of thousands of parameters
BBR uses sparseness-inducing priors: highly effective classifiers with tens to hundreds of parameters
Avoids the usual unpredictable feature selection "tricks"

12 4. libAML
Two aspects:
Learning better representations of documents from raw, unlabeled data
Self-tuning implementation of SVMs, a well-known, highly effective approach to learning filters from labeled examples

13 5. Fusion Code
Different information retrieval and machine learning algorithms work best on different topics being filtered
The fusion code permits batch and simulated-adaptive combination of several algorithms

14 Other Results of MMS Project
Technical publications (including several submissions under review at JICRD)
Seminar series
Exposure of the Rutgers/DIMACS community to problems important in intelligence
Spinoff project underway on Author Identification (determining who wrote a particular document)
Instrumental in establishing the Rutgers Center for Interdisciplinary Studies in Information Policy and Security

15 MMS Work in Progress
The DIMACS group is working on:
Making BBR "domain aware", using textbooks and taxonomies as prior knowledge
Combining BBR with kNN to produce a local logistic regression algorithm
Understanding the costs and benefits of learning strategies, e.g. when to submit doubtful documents
Understanding the space of filtering mechanisms
Applying "sketching" to retrospective forensics

16 Going Forward
The MMS project recognizes it is essential to gain:
More insight into how analysts work, e.g. NIMD*
Access to more realistic data sets, and/or statistics on characteristics of operational data, e.g. the Test Bed
Security clearances for several team members (in progress)
*Novel Intelligence from Massive Data, ARDA, Lucy Newell, Program Director

17 Summary
The project has run over 5,000 experiments
Papers are in preparation or under review
Research-quality software has been provided to the sponsors
The team is ready to work with users to improve the impact of what has already been done.

18 Technical Appendix
What to expect from research software
Technical details on each system
The following slides are intended to serve as a briefing about the software that could be given to potential adopters without requiring that members of the DIMACS team be present

19 Status
The Rutgers/DIMACS KDD Project addressed a set of interrelated research questions about monitoring message streams. To do this, the team built a great deal of research software, which has been delivered on CD. They want to know whether parts of this software match current needs, to help them decide whether to develop some of those parts further.

20 Research Software: Design
A key idea in this research software is that it has lots of adjustable options and parameters; the team uses it to experiment with many variations, including ones that didn't work
This is "batch mode" software: it assumes all the data is available at the start of a run
It often incorporates evaluation code, which looks at test data judgments only after the run
The code simulates processing a stream of incoming data; it cannot actually accept real-time input

21 Research Software: Robustness
This code has been tested by knowledgeable developers (alpha testing) and by other team members (non-developers) serving as beta testers
They ran some test cases, but not "product level" testing, so abnormal conditions (missing files, etc.) are not always handled gracefully
Each part of the code assumes that data has been converted to the appropriate format for it.

22 Research Software: Usability
The "heavy parts" are provided as source code (C++)
Some shell, awk, and Java code for running experiments and preprocessing data
The current documentation is written for a sophisticated, experimentation-oriented user
I/O formats vary among packages, according to the needs of different experiments and the data sets that drive them
The software makes use of libraries and other code subject to some license conditions

23 1. Adaptive Filtering Software (AFS)
This package is an end-to-end system which does:
Compression: simple feature selection (in Rocchio and Centroid)
Representation: classic term weighting
Matching: efficient finding of nearest neighbors
Learning: Rocchio, Centroid, kNN
Fusion: none
Note: throughout this presentation, feature selection is discussed as a "compression" technique rather than a representation technique; this assignment is essentially arbitrary.

24 AFS Software
Provided as source code (C++)
AFS uses the CMU Lemur toolkit for preprocessing and storage of documents
AFS can simulate batch and adaptive filtering environments; however, all documents must be available at the start of a run
AFS incorporates evaluation code

25 AFS: Rocchio
The Rocchio model is a classic learning algorithm for text classification (see the sketch below)
This implementation is designed for testing many design choices discussed in the literature, including:
Pseudofeedback (handling of unjudged data)
Thresholding (adaptive and batch)
Feature selection
Term weighting
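For reference, the textbook Rocchio computation that this implementation generalizes: the profile is a weighted combination of the query, the positive-example centroid, and the negative-example centroid. A minimal NumPy sketch, with the standard alpha/beta/gamma weights as assumptions (the package exposes many more choices):

```python
import numpy as np

def rocchio_profile(pos, neg, alpha=1.0, beta=0.75, gamma=0.15, query=None):
    """Textbook Rocchio: profile = alpha*q + beta*mean(pos) - gamma*mean(neg).

    pos, neg: 2-D arrays of (already term-weighted) document vectors.
    A document is accepted when its dot product with the profile
    exceeds a threshold tuned on training data.
    """
    q = np.zeros(pos.shape[1]) if query is None else query
    profile = alpha * q + beta * pos.mean(axis=0)
    if len(neg):
        profile -= gamma * neg.mean(axis=0)
    return profile

pos = np.array([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]])
neg = np.array([[0.0, 0.9, 0.7]])
print(rocchio_profile(pos, neg))
```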

26 AFS: Centroid
A "plain vanilla" method developed for this project
It is similar to the Rocchio method, but gives more emphasis to the contrast between positive and negative examples
The same design choices can be explored as for the Rocchio approach

27 AFS: kNN
The "k-nearest neighbor" method recognizes that there may not be a single division between relevant and irrelevant documents, so it classifies each document by looking at the ones most similar to it (its neighbors); see the sketch below
Experimental options can explore:
Matching: reducing space/time to find neighbors
Learning: adjusting neighborhood size, etc.
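A minimal sketch of the kNN scoring idea, assuming unit-normalized document vectors so that a dot product equals cosine similarity. The neighbor weighting used here (summing similarities to relevant neighbors) is one plausible scheme, not necessarily the package's:

```python
import numpy as np

def knn_score(doc, train_vecs, train_labels, k=5):
    """Score a document by its k nearest (cosine-most-similar) training
    documents: sum of similarities to the relevant neighbors.

    Vectors are assumed unit-normalized, so dot product == cosine.
    """
    sims = train_vecs @ doc
    top = np.argsort(sims)[-k:]          # indices of the k nearest neighbors
    return sum(sims[i] for i in top if train_labels[i] == 1)

rng = np.random.default_rng(0)
X = rng.random((20, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.integers(0, 2, 20)
print(knn_score(X[0], X, y))  # accept if the score exceeds a tuned threshold
```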

28 AFS kNN Matching Capabilities
Exact scoring with an adaptation of the inverted index used in all IR systems (see the sketch below)
Approximate scoring with an inverted index: increases speed and decreases space by "pruning" the representations of documents, and by "pruning" the whole index, as it learns
Random projection methods are provided, although the Rutgers experiments suggest that much larger datasets are needed to see real benefits
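A minimal sketch of exact scoring with an inverted index: only documents sharing at least one term with the query are ever touched. Pruning, in this picture, amounts to dropping low-weight postings from the lists, trading a little accuracy for large space/time savings.

```python
from collections import defaultdict

def build_index(doc_vectors):
    """Inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def score_all(query_vec, index):
    """Accumulate dot products, touching only docs that share a query term."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, ()):
            scores[doc_id] += qw * dw
    return scores

index = build_index([{"attack": 0.9, "plan": 0.4}, {"weather": 1.0}])
print(score_all({"attack": 1.0}, index))
```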

29 AFS kNN Learning Capabilities
The software includes the "classic kNN" approach and two plausible schemes for weighting the neighbors rather than just counting them
Classification thresholds are set to optimize a user-specified effectiveness measure, using cross-validation
kNN supports adaptive filtering (incorporating training data as it is seen)

30 2. Homotopy
This package wraps an end-to-end system (AFS) under program control so that many design options can be explored:
Compression: simple feature selection
Representation: classic term weighting
Matching: linear classifier application
Learning: Rocchio (explores variations in Rocchio's parameter settings)
Fusion: none

31 Homotopy Software and Algorithm
The Homotopy package is built on an early version of AFS and behaves similarly, but only includes the Rocchio matching rule
Its purpose is to investigate alternate parameterizations and variants of Rocchio
It includes a separate program for evaluating classification results

32 3. BBR (Bayesian Binary Regression)
This package comprises an end-to-end system:
Compression: feature selection via sparseness-inducing Bayesian priors
Representation: classic term weighting options
Matching: linear classifier application
Learning: Bayesian logistic regression, thresholding
Fusion: none

33 BBR Software
C++ source
Two programs:
1. Train a classifier on judged data, producing a classifier
2. Apply the classifier to other judged data, producing classified data and evaluating classification accuracy
Assumes inputs in a relatively standard sparse vector format {(attribute, value)}; see the example below
Uses code under the GNU GPL license
The Zhang-Oles patent may apply if it is used outside of research.
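A hypothetical example of such a sparse format (the exact syntax in the delivered package may differ): each line gives a class label followed by the nonzero (attribute, value) pairs, with attribute IDs indexing the vocabulary.

```
+1 3:0.41 17:0.12 942:0.08
-1 5:0.30 17:0.22
```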

34 BBR Algorithms
Logistic regression: in some tests, this performs as well as any other supervised learning algorithm for text classification
Effectiveness is apparently due to Bayesian incorporation of the "prior knowledge" that most terms probably should not appear in a useful classifier
The Gaussian option tends to keep many terms; the Laplace option tends to keep terms out (see the sketch below)
The choice is made by the user, or by cross-validation tests
Thresholding for a user-specified effectiveness measure
Optional control on the number of features
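The prior/penalty correspondence can be demonstrated with any standard toolkit: MAP estimation under a Laplace prior is L1-penalized (lasso) logistic regression, and under a Gaussian prior it is L2-penalized (ridge). A scikit-learn sketch on synthetic data, not the BBR code itself, showing the sparsity the Laplace option induces:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# MAP with a Laplace prior == L1-penalized logistic regression;
# MAP with a Gaussian prior == L2-penalized logistic regression.
rng = np.random.default_rng(0)
X = rng.random((200, 1000))                  # 1000 "terms", mostly irrelevant
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # only 2 terms actually matter

laplace = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
gauss = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

print("nonzero coefs, Laplace prior:", (laplace.coef_ != 0).sum())   # few
print("nonzero coefs, Gaussian prior:", (gauss.coef_ != 0).sum())    # ~all
```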

35 What's really going on?
These approaches both "understand documents" as bags of words: some words are important in finding useful documents, others are not
Finding the right ones is called "term selection"
The two packages approach term selection in quite different ways

36 Contrasting AFS to BBR
AFS imagines documents are vectors in some abstract space; all terms are included, and weights control their effects (although one option simply limits the number of terms)
BBR imagines terms have some probability of being useful; it begins with a reasonable assumption and adjusts as it sees more data

37 A different approach
Both AFS and BBR start from the knowledge that some documents are interesting ("relevant") and others are not; this is a kind of supervised learning
It is also possible to start with a mass of documents and simply ask whether they fall into several recognizable classes
The libAML package takes this approach

38 4. libAML
The package consists of a library plus programs using the library:
Compression: Combinatorial PCA (cPCA)
Representation: classic term weighting
Matching: linear classifier application
Learning: aiSVM
Fusion: none

39 libAML Software
C++ source
Two programs that use the libAML library:
dataFilterAndFeatureSelector: finds the term weighting and selects terms using cPCA
aiSVM: train, apply, and evaluate a classifier
Assumes sparse vector input; a utility is provided to convert from Lemur format
Uses SVM_Light (noncommercial use only)

40 libAML Algorithms
cPCA: selects a high-quality subset of features (terms) by simultaneous clustering of both the documents and the features
aiSVM: an SVM approach that produces highly effective linear text classifiers, and allows tuning to user effectiveness needs (utility scores); see the sketch below
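The utility-tuning idea can be approximated in standard toolkits by weighting the classes asymmetrically, so that missing a relevant document costs more than a false alarm. A scikit-learn sketch, analogous to (not identical with) aiSVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Asymmetric class weights mirror a user-specified utility measure:
# here, missing a relevant document is penalized 4x a false alarm.
rng = np.random.default_rng(1)
X = rng.random((300, 50))
y = (X[:, 0] > 0.8).astype(int)          # rare "relevant" class

clf = LinearSVC(class_weight={0: 1.0, 1: 4.0}).fit(X, y)
print((clf.predict(X) == y).mean())
```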

41 5. Fusion Code
This permits batch and simulated-adaptive combination of several methods
It is a collection of scripts
Fusion: various techniques for combining outputs of multiple classifiers for the same task

42 Fusion Software and Algorithms
A collection of scripts in shell, awk, GNU Octave, and R
The inputs are lists of scores/class labels assigned to documents by other classifiers
There are several fusion options: affine, linear, logistic, centroid (see the sketch below)
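A minimal sketch of the linear and logistic fusion options: each document gets a score from every classifier, and the fused score is a weighted combination, optionally squashed through a logistic. The weights here are placeholders; in practice they would be fit on held-out judgments.

```python
import numpy as np

def linear_fusion(scores, weights):
    """Weighted sum of per-classifier scores for each document."""
    return scores @ weights

def logistic_fusion(scores, weights, bias=0.0):
    """Squash the fused score to a probability-like value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(scores @ weights + bias)))

# Rows: documents; columns: scores from three different classifiers
scores = np.array([[0.9, 0.4, 0.7],
                   [0.1, 0.2, 0.3]])
w = np.array([0.5, 0.2, 0.3])            # placeholder combination weights
print(linear_fusion(scores, w))
print(logistic_fusion(scores, w))
```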

43 How can this be used?
The researchers have provided a CD containing all of this code, together with extensive beta-tested documentation
Can it be explored or exploited in a realistic setting? For example: locate some collections that are currently indexed and retrieved by other software, and see how these packages do.

44 Further uses
Find a setting in which incoming streams of comments are matched to a profile
Try this software, "instructing it" not by building a profile, but by giving it examples of both relevant and not-relevant documents
See how well it does

45 What the software cannot do
This software is “hacker friendly” rather than “user friendly” It cannot move onto the desktop and replace a familiar existing tool

46 … even so …
…but it might be more effective on task than existing tools, justifying converting it into a usable tool
The developers are eager to initiate that process, so that the work has a chance to have an impact on the problems that led us to fund it.

