
Course Summary LING 572 Fei Xia 03/06/07

Outline
Problem description
General approach
ML algorithms
Important concepts
Assignments
What's next?

Problem descriptions

Two types of problems
Classification problem
Sequence labeling problem
In both cases:
– A predefined set of labels: C = {c_1, c_2, …, c_n}
– Training data: {(x_i, y_i)}, where y_i ∈ C, and y_i is known or unknown
– Test data

NLP tasks
Classification problems:
– Document classification
– Spam detection
– Sentiment analysis
– …
Sequence labeling problems:
– POS tagging
– Word segmentation
– Sentence segmentation
– NE detection
– Parsing
– IGT detection
– …

General approach

Step 1: Preprocessing
Converting the NLP task to a classification or sequence labeling problem
Creating the attribute-value table:
– Define feature templates
– Instantiate feature templates and select features
– Decide what kind of feature values to use (e.g., binarizing features or not)
– Convert a multi-class problem to a binary problem (optional)
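Below is a minimal Python sketch (not the course's actual code) of the table-building step: a single feature template, word=w, is instantiated over a small labeled corpus to produce an attribute-value table with binary feature values. The function and variable names are placeholders.

```python
# Sketch: instantiate a "word=w" feature template over labeled documents and
# build an attribute-value table with binary feature values.
def build_table(labeled_docs):
    """labeled_docs: list of (list_of_tokens, label) pairs."""
    vocab = sorted({tok for tokens, _ in labeled_docs for tok in tokens})
    feat_index = {f"word={w}": i for i, w in enumerate(vocab)}
    table = []
    for tokens, label in labeled_docs:
        row = [0] * len(feat_index)          # binary feature values
        for tok in set(tokens):
            row[feat_index[f"word={tok}"]] = 1
        table.append((row, label))
    return feat_index, table
```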

Feature selection
Dimensionality reduction:
– Feature selection
  Wrapping methods
  Filtering methods: mutual info, χ2, information gain, …
– Feature extraction
  Term clustering: latent semantic indexing (LSI)
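As an illustration of one filtering method, here is a small sketch of the chi-square score for a single feature/class pair, computed from a 2x2 document-count contingency table; the helper name is hypothetical.

```python
# Sketch: chi-square score for feature f and class c from document counts.
def chi_square(A, B, C, D):
    """A: #docs with f in c,      B: #docs with f not in c,
       C: #docs without f in c,   D: #docs without f not in c."""
    N = A + B + C + D
    num = N * (A * D - B * C) ** 2
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return num / den if den else 0.0
```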

Multiclass → Binary
One-vs-all
All-pairs
Error-correcting output codes (ECOC)
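A rough sketch of the one-vs-all reduction, assuming a hypothetical binary learner that returns a model with a score(x) confidence method; it shows the idea, not a particular package's API.

```python
# Sketch: one-vs-all reduction of a multiclass problem to binary problems.
def train_one_vs_all(train_binary, X, y, labels):
    """train_binary(X, y_binary) -> model with a score(x) method (assumed)."""
    return {c: train_binary(X, [1 if yi == c else 0 for yi in y])
            for c in labels}

def predict_one_vs_all(models, x):
    # pick the class whose binary classifier is most confident about x
    return max(models, key=lambda c: models[c].score(x))
```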

Step 2: Training and decoding
Choose an ML learner
Train on the training data and test on the development set with different settings of the non-model parameters
Choose the setting that works best on the development set
Run the learner on the test data with that setting
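A small sketch of the tuning loop implied by this step: try each value of a non-model parameter (e.g., k for kNN or the beam width), evaluate on the dev set, and keep the best one. train_and_eval is a hypothetical helper.

```python
# Sketch: choose a non-model parameter on the dev set, then reuse it on test.
def tune(param_values, train_data, dev_data, train_and_eval):
    """train_and_eval(train_data, dev_data, value) -> dev-set accuracy (assumed)."""
    best_val, best_acc = None, -1.0
    for v in param_values:
        acc = train_and_eval(train_data, dev_data, v)
        if acc > best_acc:
            best_val, best_acc = v, acc
    return best_val   # run the learner on the test data with this setting
```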

Step 3: Post-processing
Label sequence → the output we want
System combination:
– Voting: majority voting, weighted voting
– More sophisticated models
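A minimal sketch of voting-based system combination; setting all weights to 1.0 gives plain majority voting. The inputs and names are placeholders.

```python
# Sketch: combine the outputs of several systems by (weighted) voting.
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: one label per system; weights: parallel list of weights."""
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w
    return max(score, key=score.get)   # majority voting = all weights 1.0
```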

Supervised algorithms

Main ideas
kNN and Rocchio: finding the nearest neighbors / prototypes
DT and DL: finding the right group
NB, MaxEnt: calculating P(y | x)
Bagging: reducing instability
Boosting: forming a committee
TBL: improving the current guess

ML learners
Modeling
Training
Testing (a.k.a. decoding)

Modeling
NB: assuming features are conditionally independent: P(x, y) = P(y) ∏_j P(f_j | y)
MaxEnt: p(y | x) = exp(∑_j λ_j f_j(x, y)) / Z(x), where Z(x) is a normalization term

Training
kNN: no training
Rocchio: calculate prototypes
DT: build a decision tree
– Choose a feature and then split the data
DL: build a decision list
– Choose a decision rule and then split the data
TBL: build a transformation list
– Choose a transformation and then update the current labels

Training (cont)
NB: calculate P(c_i) and P(f_j | c_i) by simple counting
MaxEnt: calculate the weights of the feature functions by iteration
Bagging: create bootstrap samples and learn base classifiers
Boosting: learn base classifiers and their weights
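For concreteness, here is a sketch of NB training by counting, using the multinomial variant with add-one smoothing; the exact formulation used in class may differ.

```python
# Sketch: multinomial Naive Bayes training by counting, with add-one smoothing.
from collections import defaultdict
import math

def train_nb(docs, vocab):
    """docs: list of (list_of_tokens, class_label) pairs."""
    class_count = defaultdict(int)
    tok_count = defaultdict(lambda: defaultdict(int))
    total_toks = defaultdict(int)
    for tokens, c in docs:
        class_count[c] += 1
        for t in tokens:
            tok_count[c][t] += 1
            total_toks[c] += 1
    log_prior = {c: math.log(n / len(docs)) for c, n in class_count.items()}
    log_cond = {c: {w: math.log((tok_count[c][w] + 1) /
                                (total_toks[c] + len(vocab)))
                    for w in vocab}
                for c in class_count}
    return log_prior, log_cond
```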

Testing
kNN: calculate the distances between x and each x_i, and find the closest neighbors
Rocchio: calculate the distances between x and the prototypes
DT: traverse the tree
DL: find the first matched decision rule
TBL: apply the transformations one by one
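A small kNN decoding sketch with Euclidean distance and a majority vote over the k nearest neighbors; the class also covered other distance/similarity measures, so treat this as one possible choice.

```python
# Sketch: kNN decoding with Euclidean distance and majority vote.
import math
from collections import Counter

def knn_classify(x, train, k):
    """train: list of (vector, label) pairs; x: a vector of the same length."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```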

Testing (cont)
NB: calculate argmax_y P(y) ∏_j P(f_j | y)
MaxEnt: calculate argmax_y p(y | x)
Bagging: run the base classifiers and choose the class with the most votes
Boosting: run the base classifiers and calculate the weighted sum
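A sketch of the boosting-style weighted sum at decoding time, with the base classifiers and their weights assumed to be given (AdaBoost-specific training details omitted).

```python
# Sketch: boosting-style decoding as a weighted vote of base classifiers.
from collections import defaultdict

def boosted_predict(x, base_classifiers, alphas, labels):
    """base_classifiers: list of functions x -> label; alphas: their weights."""
    score = defaultdict(float)
    for h, alpha in zip(base_classifiers, alphas):
        score[h(x)] += alpha
    return max(labels, key=lambda c: score[c])
```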

Sequence labeling problems
With classification algorithms:
– Having features that refer to previous tags
– Using beam search to find good sequences
With sequence labeling algorithms:
– HMM
– TBL
– MEMM
– CRF
– …
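A minimal beam-search sketch for applying a classifier to a sequence labeling task; score_fn (which may look at previous tags) and the beam width are assumptions, not a specific implementation from the course. With beam_size = 1 this degenerates to greedy left-to-right decoding.

```python
# Sketch: beam search over tag sequences, scoring each tag given its left context.
def beam_search(tokens, labels, score_fn, beam_size):
    """score_fn(tokens, i, prev_tags, tag) -> log P(tag | context) (assumed)."""
    beam = [([], 0.0)]                      # (tag sequence so far, log prob)
    for i in range(len(tokens)):
        candidates = [(prev + [t], lp + score_fn(tokens, i, prev, t))
                      for prev, lp in beam for t in labels]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]       # keep only the top hypotheses
    return beam[0][0]                       # best tag sequence found
```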

Semi-supervised algorithms
Self-training
Co-training
…
→ Adding some unlabeled data to the labeled data
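A self-training sketch: train on the labeled data, label the unlabeled pool, move the most confident predictions into the labeled set, and repeat. The predict_proba interface and the confidence threshold are made-up illustrations.

```python
# Sketch: self-training loop that grows the labeled set with confident predictions.
def self_train(train_fn, labeled, unlabeled, rounds=5, threshold=0.9):
    """train_fn(labeled) -> model with predict_proba(x) -> (label, confidence)."""
    for _ in range(rounds):
        model = train_fn(labeled)
        confident, rest = [], []
        for x in unlabeled:
            label, conf = model.predict_proba(x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break
        labeled = labeled + confident       # add newly labeled examples
        unlabeled = [x for x, _ in rest]
    return train_fn(labeled)
```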

Unsupervised algorithms
MLE
EM:
– General algorithm: E-step, M-step
– EM for PM models:
  Forward-backward for HMM
  Inside-outside for PCFG
  IBM models for MT
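The general E-step/M-step loop, illustrated here on the classic two-coin toy problem rather than one of the PM models from class: estimate the head probabilities of two coins from flip counts when we do not know which coin produced each trial.

```python
# Sketch: EM for a toy two-coin mixture (E-step: responsibilities; M-step: re-estimate).
def em_two_coins(trials, theta_a=0.6, theta_b=0.5, iters=20):
    """trials: list of (num_heads, num_tails) per sequence of flips."""
    for _ in range(iters):
        sums = {"a": [0.0, 0.0], "b": [0.0, 0.0]}     # [expected heads, tails]
        for h, t in trials:
            la = theta_a ** h * (1 - theta_a) ** t    # likelihood under coin A
            lb = theta_b ** h * (1 - theta_b) ** t    # likelihood under coin B
            wa = la / (la + lb)                       # E-step: responsibility for A
            sums["a"][0] += wa * h;       sums["a"][1] += wa * t
            sums["b"][0] += (1 - wa) * h; sums["b"][1] += (1 - wa) * t
        theta_a = sums["a"][0] / (sums["a"][0] + sums["a"][1])   # M-step
        theta_b = sums["b"][0] / (sums["b"][0] + sums["b"][1])
    return theta_a, theta_b
```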

Important concepts

Concepts
Attribute-value table
Feature templates vs. features
Weights:
– Feature weights
– Classifier weights
– Instance weights
– Feature values

Concepts (cont)
Maximum entropy vs. maximum likelihood
Maximize likelihood vs. minimize training error
Training time vs. test time
Training error vs. test error
Greedy algorithm vs. iterative approach

Concepts (cont)
Local optima vs. global optima
Beam search vs. Viterbi algorithm
Sample vs. resample
Model parameters vs. non-model parameters

Assignments

Read code:
– NB: binary features?
– DT: difference between DT and C4.5
– Boosting: AdaBoost and AdaBoostM2
– MaxEnt: binary features?
Write code:
– Info2Vectors
– BinVectors
– χ2
Complete two projects

Projects
Steps:
– Preprocessing
– Training and testing
– Postprocessing
Two projects:
– Project 1: Document classification
– Project 2: IGT detection

Project 1: Document classification
A typical classification problem
Data are already prepared:
– Feature template: words appearing in the doc
– Feature value: word frequency
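A tiny sketch of this feature extraction: one feature per word, with the word's frequency in the document as its value; the feature-naming scheme is made up.

```python
# Sketch: project-1 style features, one per word, valued by word frequency.
from collections import Counter

def doc_to_features(tokens):
    """Return {feature_name: value} with word frequencies as values."""
    return {f"word={w}": freq for w, freq in Counter(tokens).items()}
```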

Project 2: IGT detection
Can be framed as a sequence labeling problem:
– Preprocessing: define the label set
– Postprocessing: tag sequence → spans
Sequence labeling problem → solved with a classification algorithm plus beam search
To use classifiers:
– Preprocessing: define features, choose feature values, …

Project 2 (cont)
Preprocessing:
– Define the label set
– Define feature templates
– Decide on feature values
Training and decoding:
– Write beam search
Postprocessing:
– Convert label sequence → spans
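A sketch of the last step, converting a tag sequence into spans, assuming BIO-style tags (B-X / I-X / O); the project's actual label set may differ.

```python
# Sketch: convert a BIO tag sequence into (start, end, type) spans (inclusive).
def tags_to_spans(tags):
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):            # sentinel to flush last span
        inside = tag.startswith("I-") and cur == tag[2:]
        if not inside:                                # current span (if any) ends here
            if cur is not None:
                spans.append((start, i - 1, cur))
            if tag == "O":
                start, cur = None, None
            else:                                     # B-X (or a stray I-X) starts a span
                start, cur = i, tag[2:]
    return spans
```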

Project 2 (cont)
Presentation
Final report
A typical conference paper:
– Introduction
– Previous work
– Methodology
– Experiments
– Discussion
– Conclusion

Using Mallet
Difficulties:
– Java
– A large package
Benefits:
– Java
– A large package
– Many learning algorithms: comparing the implementation with "standard" algorithms

Bugs in Mallet?
In Hw9, include a new section:
– Bugs
– Complaints
– Things you like about Mallet

Course summary
9 weeks: 18 sessions
2 kinds of problems
9 supervised algorithms
1 semi-supervised algorithm
1 unsupervised algorithm
4 related issues: feature selection, multiclass → binary, system combination, beam search
2 projects
1 well-known package
9 assignments, including 1 presentation and 1 final report
N papers

What's next?
Learn more about the algorithms covered in class
Learn new algorithms:
– SVM, CRF, regression algorithms, graphical models, …
Try new tasks:
– Parsing, spam filtering, reference resolution, …

Misc
Hw7: due tomorrow 11pm
Hw8: due Thursday 11pm
Hw9: due 3/13 11pm
Presentation: no more than 15+5 minutes

What must be included in the presentation?
Label set
Feature templates
Effect of beam search
3+ ways to improve the system, and results on dev data (test_data/)
Best system: results on dev data and the setting
Results on test data (more_test_data/)

Grades, etc.
9 assignments + class participation
Hw1-Hw6:
– Total: 740
– Max:
– Min:
– Ave:
– Median: