MMS Software Deliverables: Year 1


1 MMS Software Deliverables: Year 1
Paul Kantor, Dave Lewis. Presentation for MMS Site Visit, DIMACS, Rutgers Univ., 26-Feb-2004

2 Outline
Overview of software deliverables. Software deliverables: Adaptive Filtering (Rocchio, Centroid, kNN); BBR (Bayesian logistic regression); libAML (cPCA, tuned SVM); Homotopy; Fusion.

3 MMS Deliverables
Research software (focus of this talk); source code; experimenter-oriented documentation; experimental results; insights.

4 Research Software: Design
Flexibly parameterized: easy to experiment with many variations, including ones we found didn’t work. Assumes all data is provided at start of run. Often incorporates evaluation code, which looks at test data judgments only after the run. Simulates processing a stream of incoming data; can’t accept real-time input.
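The design above (all data loaded up front, documents processed as a simulated stream, judgments consulted only after the run) can be sketched roughly as follows. This is a minimal illustration in Python, not the deliverables' C++ code; `run_filter`, `evaluate`, the toy `profile`, and the threshold value are all hypothetical names and numbers chosen for the example.

```python
def run_filter(docs, score, threshold):
    """Process documents in arrival order; judgments are never passed in."""
    decisions = []
    for doc in docs:
        decisions.append(score(doc) >= threshold)
    return decisions


def evaluate(decisions, judgments):
    """Compare decisions with relevance judgments once the run is over."""
    tp = sum(1 for d, j in zip(decisions, judgments) if d and j)
    fp = sum(1 for d, j in zip(decisions, judgments) if d and not j)
    fn = sum(1 for d, j in zip(decisions, judgments) if not d and j)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical profile: a sparse term-weight vector stored as a dict.
profile = {"oil": 1.0, "opec": 0.5}


def score(doc):
    return sum(profile.get(term, 0.0) * w for term, w in doc.items())


# The whole collection exists before the run starts.
docs = [{"oil": 1.0}, {"opec": 1.0}, {"goal": 1.0}, {"oil": 0.2}]
judgments = [True, True, False, False]

decisions = run_filter(docs, score, threshold=0.6)
precision, recall = evaluate(decisions, judgments)
```

Keeping `judgments` out of `run_filter` entirely is one simple way to guarantee that the run cannot peek at test labels.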

5 Research Software: Robustness
Alpha and beta testing by sophisticated nondevelopers. Test cases and regression testing are not systematic. Abnormal conditions (missing files, etc.) are not always handled gracefully. Assumes data has been converted to the appropriate format.

6 Research Software: Usability
Provided as source code (C++). Some shell, awk, and Java code for running experiments and preprocessing data. Documentation assumes a sophisticated, experimentation-oriented user. I/O formats vary among packages (driven by data), according to the needs of different experiments. Use of libraries and other code with license conditions. [Speaker note: say something about data sets (driving I/O formats).]

7 “Components” of Filtering
MMS project oriented around five abstract filtering “components”: Compression, Representation, Matching, Learning, Fusion.

8 One Program Sometimes Handles Several Components
One algorithm may accomplish the goals of multiple “components”. Component processing may be needed in conditional, multiple, or iterative fashion. Running large numbers of experiments efficiently sometimes requires incorporating several components in the same program.

9 1. Adaptive Filtering Software (AFS)
End-to-end system. Compression: simple feature selection (in Rocchio and Centroid). Representation: classic term weighting. Matching: efficient finding of nearest neighbors. Learning: Rocchio, Centroid, kNN. Fusion: none. [Note to MMS people: Throughout this presentation I’m discussing feature selection as a “compression” technique, rather than a representation technique. This is completely arbitrary.]

10 AFS Software
Available as source code (C++). CMU Lemur toolkit used for preprocessing and storage of documents. Can simulate batch and adaptive filtering environments. All documents must be provided at start of run. Incorporates evaluation code. [Speaker note: waiting to hear whether Vladimir is providing a Linux executable as well.]

11 AFS: Rocchio
Classic learning algorithm for text classification. Implementation designed for testing many design choices discussed in the literature: pseudofeedback (handling of unjudged data); thresholding (adaptive and batch); feature selection; term weighting.
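For readers unfamiliar with Rocchio, the textbook form of the algorithm builds a prototype vector as a weighted mean of positive examples minus a weighted mean of negative examples, then scores documents by dot product. The sketch below shows that textbook form only; it is not the AFS implementation, and the names `rocchio_prototype` and `score`, the `beta`/`gamma` defaults, and the choice to drop non-positive weights are assumptions made for illustration.

```python
from collections import defaultdict


def rocchio_prototype(positives, negatives, beta=1.0, gamma=0.25):
    """Build a prototype from sparse term-weight vectors (dicts)."""
    proto = defaultdict(float)
    for doc in positives:
        for term, w in doc.items():
            proto[term] += beta * w / len(positives)
    for doc in negatives:
        for term, w in doc.items():
            proto[term] -= gamma * w / len(negatives)
    # Assumed variant: keep only terms with positive weight.
    return {t: w for t, w in proto.items() if w > 0}


def score(proto, doc):
    """Dot product between the prototype and a document vector."""
    return sum(w * doc.get(t, 0.0) for t, w in proto.items())


positives = [{"oil": 1.0, "price": 1.0}, {"oil": 1.0}]
negatives = [{"price": 1.0, "sport": 1.0}]
proto = rocchio_prototype(positives, negatives)
```

A document is then accepted when `score(proto, doc)` meets a threshold; how that threshold is set (batch or adaptive) is one of the design choices listed on the slide.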

12 AFS: Centroid
Similar to Rocchio, but emphasizing contrast between positive and negative examples. Same design choices can be explored as for Rocchio.

13 AFS: kNN
Classic pattern recognition algorithm: put test document in same categories as most similar training documents. Experiments focused on Matching (reducing space/time to find neighbors) and Learning (adjusting neighborhood size, weighting, thresholds).
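The kNN idea described above can be sketched as follows: find the k most similar training documents, let each vote for its categories, and assign categories whose vote passes a threshold. This is a generic similarity-weighted variant written for illustration; the function names, the use of cosine similarity, and the voting scheme are assumptions, not a description of the AFS code.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_category_scores(test_doc, training, k):
    """training is a list of (vector, set_of_categories) pairs."""
    ranked = sorted(training, key=lambda ex: cosine(test_doc, ex[0]), reverse=True)
    scores = defaultdict(float)
    for vec, cats in ranked[:k]:
        sim = cosine(test_doc, vec)
        for cat in cats:
            scores[cat] += sim  # similarity-weighted vote
    return dict(scores)


def assign(scores, threshold):
    return {cat for cat, s in scores.items() if s >= threshold}


training = [
    ({"oil": 1.0}, {"energy"}),
    ({"oil": 1.0, "gas": 1.0}, {"energy"}),
    ({"goal": 1.0}, {"sport"}),
    ({"match": 1.0, "goal": 1.0}, {"sport"}),
]
scores = knn_category_scores({"oil": 1.0}, training, k=2)
labels = assign(scores, threshold=1.0)
```

The neighborhood size `k`, the vote weighting, and the threshold are the "Learning" knobs named on the slide; the exhaustive `sorted` call is the part the "Matching" work aims to make cheaper.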

14 AFS kNN Matching Capabilities
Exact scoring with inverted index (already faster than exhaustive matching). Approximate scoring with inverted index: training document pruning prior to indexing; test document pruning; classification-time inverted list pruning. Random projections: several variants of a theory-motivated method.
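To illustrate why an inverted index beats exhaustive matching, and how test document pruning turns exact scoring into approximate scoring, here is a minimal sketch. It assumes dot-product similarity over sparse dict vectors; `build_index`, `top_neighbors`, and the `max_query_terms` parameter are hypothetical names, and the pruning rule (keep the heaviest query terms) is just one plausible choice.

```python
import heapq
from collections import defaultdict


def build_index(training_vectors):
    """Map each term to a postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(training_vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def top_neighbors(index, query, k, max_query_terms=None):
    """Score only documents that share a term with the query."""
    terms = sorted(query.items(), key=lambda tw: -abs(tw[1]))
    if max_query_terms is not None:
        # Test document pruning: keep only the heaviest terms (approximate).
        terms = terms[:max_query_terms]
    acc = defaultdict(float)
    for term, qw in terms:
        for doc_id, dw in index.get(term, ()):
            acc[doc_id] += qw * dw
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])


training = [{"oil": 2.0, "gas": 1.0}, {"oil": 1.0}, {"goal": 3.0}]
index = build_index(training)
query = {"oil": 1.0, "gas": 0.5}

exact = top_neighbors(index, query, k=2)
approx = top_neighbors(index, query, k=2, max_query_terms=1)
```

In this toy case the pruned query still ranks the same neighbors first, but with a lower score for document 0; in general the ranking itself can change, which is the space/time versus accuracy trade-off the experiments explore.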

15 AFS kNN Learning Capabilities
Classic and two weighted kNN variants. Cross-validation-based thresholding, optimizing a user-specified effectiveness measure. Adaptive filtering (incorporating training data as seen); pseudofeedback not yet supported.
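Thresholding for a user-specified effectiveness measure can be pictured as a search over candidate thresholds on held-out scores. The sketch below assumes the scores come from held-out folds of a cross-validation run (the fold logic itself is omitted) and uses F1 as the example measure; `best_threshold` and `f1` are illustrative names, not functions from the deliverables.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, measure=f1):
    """Try each observed score as a threshold; return (threshold, value)."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        value = measure(tp, fp, fn)
        if value > best[1]:
            best = (t, value)
    return best


# Held-out scores and their true labels (toy data).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]
threshold, value = best_threshold(scores, labels)
```

Passing a different `measure` function (any function of true positives, false positives, and false negatives) changes which threshold wins, which is the point of making the measure user-specified.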

16 2. BBR (Bayesian Binary Regression)
End-to-end system. Compression: feature selection, sparseness-inducing Bayesian priors. Representation: some classic term weighting. Matching: only linear classifier application. Learning: Bayesian logistic regression, thresholding. Fusion: none (though logistic regression is a technique that can be used for fusion).

17 BBR Software
C++ source. Two programs: (1) train classifier on judged data, produce classifier; (2) apply classifier to judged data, produce classified data and evaluate classification accuracy. Assumes inputs formatted as sparse vectors. Uses code with GNU GPL license. Zhang-Oles patent may apply.

18 BBR Algorithms
Logistic regression: best (tied w/ several) supervised learning algorithm for text classification. Bayesian priors help avoid overfitting: Gaussian favors a dense classifier; Laplace favors a sparse classifier (few nonzero weights). Value of prior chosen by user or by cross-validation. Thresholding for user-specified effectiveness measure. Optional feature selection.
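The dense-versus-sparse contrast between Gaussian and Laplace priors can be seen on a toy problem. With MAP estimation, a Gaussian prior acts as a squared (L2) penalty on the weights and a Laplace prior as an absolute-value (L1) penalty. The sketch below uses plain gradient descent for the Gaussian case and a soft-thresholding (proximal) step for the Laplace case; this optimizer is swapped in for simplicity and is not claimed to be what BBR uses. The function name, the penalty strength `lam`, the step size, and the data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_map_logistic(X, y, prior, lam=0.15, lr=0.5, steps=500):
    """MAP logistic regression without intercept, on dense toy data."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        if prior == "gaussian":
            # L2 penalty: shrinks weights smoothly, rarely to exactly zero.
            w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad)]
        else:
            # Laplace / L1 penalty: soft-threshold after the gradient step.
            stepped = [wj - lr * gj for wj, gj in zip(w, grad)]
            w = [math.copysign(max(abs(s) - lr * lam, 0.0), s) for s in stepped]
    return w


# Feature 0 predicts the label; feature 1 is uninformative.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]

w_gauss = fit_map_logistic(X, y, prior="gaussian")
w_laplace = fit_map_logistic(X, y, prior="laplace")
```

On this data the Laplace fit leaves the weight of the uninformative feature at exactly zero, while the Gaussian fit gives it a small nonzero value, which is the sparse-versus-dense behavior the slide refers to.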

19 3. libAML
Library plus programs based on it. Compression: cPCA. Representation: classic term weighting. Matching: only linear classifier application. Learning: aiSVM. Fusion: none.

20 libAML Software
C++ source. Two programs that use the libAML library: dataFilterAndFeatureSelector (term weighting; shrinks vectors using feature selection with cPCA) and aiSVM (train, apply, evaluate classifier). Assumes sparse vector input. Utility provided to convert from Lemur format. Uses SVM_Light (noncommercial use only).

21 libAML Algorithms
cPCA: selects a high-quality subset of features by simultaneous clustering of documents and features. aiSVM: the SVM approach produces highly effective linear text classifiers; aiSVM allows tuning to user effectiveness needs.

22 4. Homotopy
End-to-end system. Compression: simple feature selection. Representation: classic term weighting. Matching: linear classifier application. Learning: Rocchio (explores variations in parameter settings of Rocchio). Fusion: none.

23 Homotopy Software and Algorithm
Built on an early version of AFS and behaves similarly, but only includes Rocchio. Purpose is to investigate alternate parameterizations and variants of Rocchio. Separate program for evaluating classification results.

24 5. Fusion Code
Collection of scripts. Compression: none. Representation: none. Matching: none. Learning: none. Fusion: techniques for combining outputs of multiple classifiers for the same task.

25 Fusion Software and Algorithms
Collection of scripts in shell, awk, GNU Octave, and R. Input is a list of scores/class labels assigned by other classifiers to documents. Several fusion algorithms: affine, linear, logistic, centroid.
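As a rough picture of score fusion, the sketch below combines the scores several classifiers assigned to one document, first as an affine combination (weighted sum plus offset) and then by passing that combination through a logistic function. The slides do not define how the actual scripts implement "affine", "linear", "logistic", or "centroid" fusion, so this is only a generic illustration; the function names, weights, and bias are hypothetical, and in practice the weights would be fit on training data.

```python
import math


def fuse_affine(scores, weights, bias=0.0):
    """Weighted sum of per-classifier scores plus an offset."""
    return bias + sum(w * s for w, s in zip(weights, scores))


def fuse_logistic(scores, weights, bias=0.0):
    """Squash the affine combination into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-fuse_affine(scores, weights, bias)))


# Scores two classifiers gave the same document (toy values).
doc_scores = [0.2, 0.8]
combined = fuse_affine(doc_scores, weights=[0.5, 0.5])
probability = fuse_logistic(doc_scores, weights=[2.0, 2.0], bias=-2.0)
```

Setting `bias=0.0` gives a purely linear combination; the logistic form is the one that connects to the earlier remark that logistic regression can itself serve as a fusion technique.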

