Using Random Forests Language Models in IBM RT-04 CTS
Peng Xu (1) and Lidia Mangu (2)
1. CLSP, The Johns Hopkins University
2. IBM T.J. Watson Research Center
March 24, 2005

n-gram Smoothing
- Smoothing: take some probability mass away from seen n-grams and redistribute it among unseen n-grams.
- More than 10 different smoothing techniques have been proposed in the literature.
- Interpolated Kneser-Ney consistently gives the best performance [Chen & Goodman, 1998]. A minimal sketch follows below.
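For reference, here is a minimal sketch of interpolated Kneser-Ney at the bigram order. It makes simplifying assumptions not stated in the talk (a single fixed discount, no count-of-count estimation, bigram order only); the systems discussed later use 4-gram KN models.

```python
from collections import defaultdict

def interpolated_kneser_ney_bigram(corpus, discount=0.75):
    """Toy interpolated Kneser-Ney bigram model over a list of tokens.

    Illustrative sketch only: one fixed discount, bigram order,
    no tuning of the discount from counts-of-counts."""
    bigram = defaultdict(int)         # c(h, w)
    context_total = defaultdict(int)  # c(h)
    followers = defaultdict(set)      # distinct words seen after history h
    preceders = defaultdict(set)      # distinct histories seen before word w

    for h, w in zip(corpus, corpus[1:]):
        bigram[(h, w)] += 1
        context_total[h] += 1
        followers[h].add(w)
        preceders[w].add(h)

    n_bigram_types = sum(len(s) for s in followers.values())

    def prob(w, h):
        # lower-order "continuation" probability of w
        p_cont = len(preceders[w]) / n_bigram_types
        c_h = context_total[h]
        if c_h == 0:                  # unseen history: use the lower order only
            return p_cont
        discounted = max(bigram[(h, w)] - discount, 0.0) / c_h
        lam = discount * len(followers[h]) / c_h   # leftover probability mass
        return discounted + lam * p_cont

    return prob

# e.g. p = interpolated_kneser_ney_bigram("the cat sat on the mat".split())
#      p("cat", "the")
```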

More Data…
There's no data like more data.
- [Berger & Miller, 1998] Just-in-time language modeling.
- [Zhu & Rosenfeld, 2001] Estimate n-gram counts from the web.
- [Banko & Brill, 2001] Effort should be directed toward data collection rather than learning algorithms.
- [Keller et al., 2002] n-gram counts from the web correlate reasonably well with BNC counts.
- [Bulyko et al., 2003] Web text sources are used for language modeling.
- [RT-04] University of Washington web data for language modeling.

More Data
Is more data the solution to data sparseness?
- The web has "everything": but web data is noisy.
- The web does NOT have everything: language models built from web data still suffer from data sparseness. [Zhu & Rosenfeld, 2001] In 24 random web news sentences, 46 out of 453 trigrams were not covered by AltaVista.
- In-domain training data is not always easy to get.
- Do better smoothing techniques still matter when the training data is millions of words?

Outline
- Motivation
- Random Forests for Language Modeling
  - Decision Tree Language Models
  - Random Forest Language Models
- Experiments
  - Perplexity
  - Speech Recognition: IBM RT-04 CTS
- Limitations
- Conclusions

Dealing With Sparseness in n-grams
- Clustering: combine words into groups of words. All components still need smoothing. [Goodman, 2001]
- Decision trees: cluster histories into equivalence classes. An appealing idea, but negative results were reported. [Potamianos & Jelinek, 1997]
- Maximum entropy: use n-grams as features in an exponential model. Almost no difference in performance from interpolated Kneser-Ney models. [Chen & Rosenfeld, 1999]
- Neural networks: represent words as real-valued vectors. The models rely on interpolation with Kneser-Ney models to achieve superior performance. [Bengio, 1999]

Our Motivation
- Better smoothing techniques are desirable.
- Better use of the available data is often important!
- Improvements in smoothing should also help the other means of dealing with data sparseness.

Our Approach
- Extend the appealing idea of history clustering from decision trees.
- Overcome the problems in decision tree construction… by using random forests!

Decision Tree Language Models
- Decision trees perform an equivalence classification of histories.
- Each leaf is specified by the answers to the series of questions that lead from the root to that leaf.
- Each leaf corresponds to a subset of the histories; the histories are thus partitioned (i.e., classified).

Construction of Decision Trees
Data driven: decision trees are constructed on the basis of training data. The construction requires:
1. The set of possible questions
2. A criterion evaluating the desirability of questions
3. A construction stopping rule or post-pruning rule

Decision Tree Language Models: An Example
- Example: trigrams (w-2, w-1, w0).
- Questions about positions: "Is w-i ∈ S?" and "Is w-i ∈ S^c?" There are two history positions for a trigram.
- Each pair (S, S^c) defines a possible split of a node, and therefore of the training data. S and S^c are complements with respect to the training data, so a node gets less data than its ancestors.
- (S, S^c) are obtained by an exchange algorithm (a greedy sketch follows below).
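The following is a greedy sketch of the kind of exchange algorithm referred to here, under stated assumptions that go beyond the slide: a single history position, the split scored by raw training-count log-likelihood, and an alternating (rather than random) initialization.

```python
import math
from collections import defaultdict

def split_log_likelihood(assignment, counts):
    """Training-data log-likelihood of a two-way split of a node.

    assignment: history word x -> basket 0 or 1
    counts: (history word x, predicted word w) -> count at this node
    """
    basket_word = [defaultdict(int), defaultdict(int)]
    basket_total = [0, 0]
    for (x, w), c in counts.items():
        b = assignment[x]
        basket_word[b][w] += c
        basket_total[b] += c
    ll = 0.0
    for b in (0, 1):
        for w, c in basket_word[b].items():
            ll += c * math.log(c / basket_total[b])
    return ll

def exchange_split(counts, max_iters=20):
    """Greedy exchange: move one history word at a time between S (basket 0)
    and S^c (basket 1) while the split log-likelihood improves."""
    words = sorted({x for (x, _) in counts})
    assignment = {x: i % 2 for i, x in enumerate(words)}   # alternating start
    best_ll = split_log_likelihood(assignment, counts)
    for _ in range(max_iters):
        improved = False
        for x in words:
            assignment[x] ^= 1                        # tentative move
            if len(set(assignment.values())) < 2:     # never empty a basket
                assignment[x] ^= 1
                continue
            ll = split_log_likelihood(assignment, counts)
            if ll > best_ll:
                best_ll, improved = ll, True          # keep the move
            else:
                assignment[x] ^= 1                    # undo
        if not improved:
            break
    S = {x for x, b in assignment.items() if b == 0}
    return S, set(words) - S, best_ll
```

For instance, the toy node in the next slide could be passed in as `counts = {('a','a'):2, ('b','a'):1, ('b','b'):2}` for the first-word position.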

Decision Tree Language Models: An Example
Training data (trigrams): aba, aca, bcb, bbb, ada.
- Root node: histories {ab, ac, bc, bb, ad}; predicted-word counts a:3, b:2.
- "Is the first word in {a}?" → leaf with histories {ab, ac, ad}; counts a:3, b:0.
- "Is the first word in {b}?" → leaf with histories {bc, bb}; counts a:0, b:2.
- New events 'bdb' and 'adb' in test can still be classified: first word b goes to the {bc, bb} leaf, first word a goes to the {ab, ac, ad} leaf.
- New event 'cba' in test: the first word is in neither {a} nor {b}. Stuck!

Construction of Decision Trees: Our Approach
- Grow a decision tree to maximum depth using the training data.
  - Questions are obtained automatically as the tree is constructed.
  - Use training-data likelihood to evaluate questions.
  - Perform no smoothing during growing.
- Prune the fully grown decision tree to maximize heldout-data likelihood.
  - Incorporate KN smoothing during pruning.
A grow-then-prune skeleton is sketched after this slide.
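Continuing the sketch above (it reuses `split_log_likelihood` and `exchange_split`), this is a minimal grow-then-prune skeleton. The heldout scoring function, and the KN smoothing it would apply to each candidate subtree, are assumed rather than shown.

```python
class Node:
    def __init__(self, counts):
        self.counts = counts        # (history word, predicted word) -> count
        self.question = None        # the set S behind "Is the history word in S?"
        self.yes = self.no = None

def grow(node, max_depth):
    """Grow greedily: split with the exchange algorithm while depth allows and
    the split improves training likelihood; no smoothing while growing."""
    if max_depth == 0 or not node.counts:
        return node
    no_split = {x: 0 for (x, _) in node.counts}          # everything in one basket
    base_ll = split_log_likelihood(no_split, node.counts)
    S, Sc, split_ll = exchange_split(node.counts)
    if not S or not Sc or split_ll <= base_ll:
        return node                                      # no useful split found
    node.question = S
    node.yes = grow(Node({k: c for k, c in node.counts.items() if k[0] in S}),
                    max_depth - 1)
    node.no = grow(Node({k: c for k, c in node.counts.items() if k[0] not in S}),
                   max_depth - 1)
    return node

def prune(node, heldout_ll):
    """Bottom-up pruning: collapse a subtree into a leaf whenever the collapsed
    tree does at least as well on heldout data. `heldout_ll` is an assumed
    function scoring a (smoothed) subtree on the heldout set."""
    if node.question is None:
        return node
    node.yes, node.no = prune(node.yes, heldout_ll), prune(node.no, heldout_ll)
    leaf = Node(node.counts)
    return leaf if heldout_ll(leaf) >= heldout_ll(node) else node
```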

Smoothing Decision Trees
- Use ideas similar to interpolated Kneser-Ney smoothing (a hedged reconstruction of the formula follows below).
- Note: not all histories in one node are smoothed in the same way.
- Only the leaves are used as equivalence classes.
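The equation on this slide did not survive the transcript. A plausible reconstruction, by direct analogy with interpolated Kneser-Ney (the symbols Φ, D, λ are notation assumed here, not taken from the slide), is:

```latex
P\bigl(w_0 \mid \Phi(w_{-2}, w_{-1})\bigr)
  = \frac{\max\!\bigl(C(\Phi, w_0) - D,\; 0\bigr)}{C(\Phi)}
  + \lambda(\Phi)\, P_{\mathrm{KN}}(w_0 \mid w_{-1})
```

Here Φ(w-2, w-1) is the leaf reached by the history, C(Φ, w0) the count of w0 in that leaf, D a discount, and λ(Φ) the leftover probability mass. Since the backoff term depends on w-1 itself, histories falling into the same leaf would indeed not all be smoothed identically, which is presumably what the note above refers to.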

Problems with Decision Trees
- Training-data fragmentation: as the tree is developed, questions are selected on the basis of less and less data.
- Optimality: the exchange algorithm is greedy, and so is the tree-growing algorithm.
- Overtraining and undertraining:
  - Deep trees fit the training data well but do not generalize well to new test data.
  - Shallow trees are not sufficiently refined.

Amelioration: Random Forests
Breiman applied the idea of random forests to relatively small problems [Breiman, 2001]:
- Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.
- Apply a test datum x to all K decision trees, producing classes y1, y2, …, yK.
- Accept the plurality decision (see the sketch below).
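The plurality decision is just a majority vote over the trees' outputs; a minimal sketch, assuming each tree is given as a function from a datum to a class label:

```python
from collections import Counter

def random_forest_classify(trees, x):
    """Plurality vote over an ensemble of classifiers.

    `trees` is any sequence of functions mapping a datum x to a class label;
    the forest's decision is the most frequent label (ties broken arbitrarily
    by Counter.most_common)."""
    votes = Counter(tree(x) for tree in trees)
    label, _count = votes.most_common(1)[0]
    return label
```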

Example of a Random Forest
[Figure: three decision trees T1, T2, T3 each classify the same example x; x receives the class chosen by the majority of the trees. The class symbols in the original figure did not survive the transcript.]

Random Forests for Language Modeling
- Two kinds of randomness:
  - Random selection of the positions to ask about (alternatives: position 1, position 2, or the better of the two).
  - Random initialization of the exchange algorithm.
- 100 decision trees: the i-th tree estimates P_DT^(i)(w0 | w-2, w-1).
- The final estimate is the average over all trees (see the sketch below).
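The averaging step is a one-liner; a sketch, assuming each element of `tree_lms` is the smoothed probability function of one decision tree:

```python
def random_forest_lm_prob(tree_lms, w0, w_minus2, w_minus1):
    """Random-forest LM estimate: the average of the per-tree (smoothed)
    decision-tree LM probabilities P_DT^(i)(w0 | w-2, w-1)."""
    return sum(lm(w0, w_minus2, w_minus1) for lm in tree_lms) / len(tree_lms)
```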

Experiments
Perplexity (PPL), computed as sketched below:
- UPenn Treebank portion of WSJ: about 1 million words for training and heldout (90%/10%), 82 thousand words for test.
- Normalized text.
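For completeness, perplexity over a trigram test set; a minimal sketch, assuming `lm_prob` returns P(w0 | w-2, w-1) and is never exactly zero:

```python
import math

def perplexity(lm_prob, test_trigrams):
    """Perplexity of a trigram LM over a list of (w-2, w-1, w0) test events."""
    log_prob = sum(math.log2(lm_prob(w0, w2, w1)) for (w2, w1, w0) in test_trigrams)
    return 2.0 ** (-log_prob / len(test_trigrams))
```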

Experiments: Aggregating
Considerable improvement already with 10 trees!

Embedded Random Forests
- Smoothing a decision tree requires a lower-order model to back off to.
- Better smoothing: embedding, i.e., using a random forest for the lower-order model as well. (The equations on the original slide did not survive the transcript.)

Speech Recognition Experiments
Word error rate by lattice rescoring, using the IBM 2004 Conversational Telephony System for Rich Transcription:
- Fisher data: 22 million words.
- WEB data: 525 million words, collected using frequent Fisher n-grams as queries.
- Other data: Switchboard, Broadcast News, etc.
- Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams.
- Test set: DEV04.

Speech Recognition Experiments
- Baseline: KN 4-gram.
- 110 random DTs; data sampled without replacement.
- Fisher+WEB: linear interpolation (a sketch follows below).
- Embedding in the Fisher RF, no embedding in the WEB RF.

Word error rate (p-value < 0.001):
        Fisher 4-gram    Fisher+WEB 4-gram
  KN    14.1%            13.7%
  RF    13.5%            13.1%
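The Fisher+WEB combination is a standard linear interpolation of two models; a sketch, assuming each model maps (word, history) to a probability and that the interpolation weight (not given in the talk) is tuned on heldout data:

```python
def interpolate_lms(p_fisher, p_web, fisher_weight):
    """Linear interpolation of two language models.

    `p_fisher` and `p_web` are assumed to map (word, history) -> probability;
    `fisher_weight` in [0, 1] would be tuned on heldout data."""
    def prob(word, history):
        return (fisher_weight * p_fisher(word, history)
                + (1.0 - fisher_weight) * p_web(word, history))
    return prob
```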

Practical Limitations of the RF Approach
- Memory: decision tree construction uses much more memory, so it is not easy to realize performance gains when the training data is very large.
- With over 100 trees, the final model becomes too large to fit into memory.
- Computing probabilities in parallel incurs extra online computation cost.
- Effective language model compression or pruning remains an open question.

Conclusions: Random Forests
- A new random forest language modeling approach.
- A more general LM: RF ⊇ DT ⊇ n-gram.
- Randomized history clustering.
- Good generalization: better n-gram coverage, less bias toward the training data.
- Significant improvements in the IBM RT-04 CTS system on DEV04.

Thank you!