Final review LING572 Fei Xia Week 10: 03/13/08

Topics covered
Supervised learning: six algorithms
– kNN, NB: training and decoding
– DT, TBL: training and decoding (with binary features)
– MaxEnt: training (GIS) and decoding
– SVM: decoding
Semi-supervised learning:
– Self-training
– Co-training

Other topics
Introduction to the classification task
Information theory: entropy, KL divergence, info gain
Feature selection: e.g., chi-square, feature frequency
Multi-class to binary conversion: e.g., one-against-all, pairwise
Sequence labeling problem and beam search

Assignments
Hw1: information theory
Hw2: decision tree
Hw3: Naïve Bayes
Hw4: kNN and feature selection
Hw5: MaxEnt decoder
Hw6: MaxEnt trainer
Hw7: beam search
Hw8: SVM decoder
Hw9: TBL trainer and decoder

Main steps for solving a classification task
Define features
Prepare training and test data
Select ML learners
Implement the learner (optional)
Run the learner
– Tune parameters on the dev data
– Error analysis
– Conclusion

Learning algorithms

Generative vs. discriminative models
Joint (generative) models estimate P(x,y) by maximizing the likelihood P(X,Y|θ):
– Ex: n-gram models, HMM, Naïve Bayes, PCFG
– Training is trivial: just use relative frequencies (see the counting sketch below).
Conditional (discriminative) models estimate P(y|x) by maximizing the conditional likelihood P(Y|X,θ):
– Ex: MaxEnt, SVM, etc.
– Training is harder.
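To make the "training is trivial" point concrete, here is a minimal Python sketch on made-up toy data (the feature/label pairs are hypothetical): a generative, Naïve Bayes-style model is trained purely by relative-frequency counting, whereas a discriminative model has to optimize the conditional likelihood iteratively (a GIS-style update is sketched a few slides below).

    # Hypothetical toy data: (active features, label) pairs.
    from collections import Counter, defaultdict

    data = [(["rain", "wet"], "weather"),
            (["goal", "score"], "sports"),
            (["rain", "game"], "sports")]

    # Generative (Naive Bayes-style) training: relative frequencies only.
    label_counts = Counter(y for _, y in data)
    feat_counts = defaultdict(Counter)
    for feats, y in data:
        feat_counts[y].update(feats)

    p_y = {y: c / len(data) for y, c in label_counts.items()}           # P(y)
    p_f_given_y = {y: {f: c / sum(fc.values()) for f, c in fc.items()}  # P(f|y)
                   for y, fc in feat_counts.items()}

    print(p_y["sports"], p_f_given_y["sports"]["rain"])   # 2/3 and 1/4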

Parametric vs. non-parametric models
Parametric model:
– The number of parameters does not change w.r.t. the number of training instances.
– Ex: NB, MaxEnt, linear SVM
Non-parametric model:
– More examples could potentially mean more complex classifiers.
– Ex: kNN, non-linear SVM, DT, TBL

DT and TBL
DT:
– Training: build the tree
– Testing: traverse the tree
TBL:
– Training: create a transformation list
– Testing: go through the list
Both use the greedy approach:
– DT chooses the split that maximizes info gain, etc. (see the info-gain sketch below).
– TBL chooses the transformation that reduces the errors the most.
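As an illustration of the greedy split choice on the DT side, here is a small information-gain sketch in Python (binary present/absent features, made-up labels); the TBL analogue would score candidate transformations by net error reduction instead.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(instances, feature):
        # instances: list of (feature_set, label) pairs; a binary feature is present or absent
        with_f = [y for feats, y in instances if feature in feats]
        without_f = [y for feats, y in instances if feature not in feats]
        n = len(instances)
        remainder = sum(len(part) / n * entropy(part)
                        for part in (with_f, without_f) if part)
        return entropy([y for _, y in instances]) - remainder

    toy = [({"f1"}, "A"), ({"f1", "f2"}, "A"), ({"f2"}, "B"), (set(), "B")]
    print(max(["f1", "f2"], key=lambda f: info_gain(toy, f)))   # "f1": the first greedy split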

kNN and SVM
Both work with data through “similarity” functions between vectors (decoding sketches for both follow below).
kNN:
– Training: nothing
– Testing: find the nearest neighbors
SVM:
– Training: estimate the weights of the training instances and b.
– Testing: calculate f(x), which uses all the SVs
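A decoding sketch for both in Python; the sparse vectors, support vectors, alphas, and b below are hypothetical placeholders. kNN simply searches the training set at test time, while the SVM decoder computes f(x) = Σ_i α_i y_i K(x_i, x) + b over the support vectors and returns its sign.

    import math
    from collections import Counter

    def cosine(u, v):
        # u, v: sparse vectors as dicts mapping feature index -> value
        dot = sum(u[i] * v.get(i, 0.0) for i in u)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    def knn_decode(x, train, k=3):
        # train: list of (sparse_vector, label); no training step, just search at test time
        neighbors = sorted(train, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    def svm_decode(x, svs, alphas, b, kernel=cosine):
        # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over the support vectors only
        fx = sum(a * y * kernel(sv, x) for (sv, y), a in zip(svs, alphas)) + b
        return 1 if fx >= 0 else -1

    train = [({0: 1.0, 1: 1.0}, "A"), ({0: 1.0}, "A"), ({2: 1.0}, "B")]
    print(knn_decode({0: 1.0, 1: 0.5}, train, k=1))   # "A"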

NB and MaxEnt
NB:
– Training: estimate P(c) and P(f|c)
– Testing: calculate P(y)P(x|y)
MaxEnt:
– Training: estimate the weight for each (f, c)
– Testing: calculate P(y|x)
Differences:
– generative vs. discriminative models
– MaxEnt does not assume features are conditionally independent
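A minimal decoding sketch for the two models just described (Python; the probability tables and weights are hypothetical inputs that would come from training): NB takes argmax_y P(y)·Π_f P(f|y) in log space, while MaxEnt computes P(y|x) ∝ exp(Σ λ_{f,y}) over the active features.

    import math

    def nb_decode(feats, p_y, p_f_given_y):
        # argmax_y P(y) * prod_f P(f|y), computed in log space to avoid underflow
        scores = {y: math.log(p_y[y])
                     + sum(math.log(p_f_given_y[y].get(f, 1e-10))  # 1e-10 stands in for real smoothing
                           for f in feats)
                  for y in p_y}
        return max(scores, key=scores.get)

    def maxent_decode(feats, classes, weights):
        # P(y|x) proportional to exp(sum over active features t of lambda_{t,y})
        scores = {y: math.exp(sum(weights.get((t, y), 0.0) for t in feats)) for y in classes}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}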

MaxEnt and SVM
Both are discriminative models.
Both start with an objective function and find the solution to an optimization problem by using:
– the Lagrangian, the dual problem, etc.
– an iterative approach: e.g., GIS (a GIS-style update is sketched below)
– quadratic programming → numerical optimization
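For the iterative side, here is a sketch of a single GIS-style weight update in Python. The expectation tables are assumed to be computed elsewhere (empirical expectations from the training data, model expectations as in the loop sketched a few slides below), and C is the maximum number of active features in any training instance.

    import math

    def gis_update(weights, empirical_expect, model_expect, C):
        # lambda_{f,y} += (1/C) * log(empirical expectation / model expectation)
        for key in empirical_expect:                    # key = (feature, class)
            if model_expect.get(key, 0.0) > 0.0:
                weights[key] = weights.get(key, 0.0) + \
                    (1.0 / C) * math.log(empirical_expect[key] / model_expect[key])
        return weights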

Comparison of three learners
Modeling:
– Naïve Bayes: maximize P(X,Y|λ)
– MaxEnt: maximize P(Y|X,λ)
– SVM: maximize the margin
Training:
– Naïve Bayes: learn P(c) and P(f|c)
– MaxEnt: learn λ_i for each feature function
– SVM: learn α_i for each (x_i, y_i)
Decoding:
– Naïve Bayes: calculate P(y)P(x|y)
– MaxEnt: calculate P(y|x)
– SVM: calculate the sign of f(x)
Things to decide:
– Naïve Bayes: delta for smoothing
– MaxEnt: regularization; training algorithm (iteration number, etc.)
– SVM: regularization; training algorithm; kernel function (gamma, etc.); C for penalty

Questions for each method
Modeling:
– What is the model?
– How does the decomposition work?
– What kind of assumption is made?
– How many model parameters?
– How many “tuned” (or non-model) parameters?
– How to handle the multi-class problem?
– How to handle non-binary features?
– ...

Questions for each method (cont)
Training: how to estimate parameters?
Decoding: how to find the “best” solution?
Weaknesses and strengths:
– parametric?
– generative/discriminative?
– performance?
– robust? (e.g., handling outliers)
– prone to overfitting?
– scalable?
– efficient in training time? test time?

Implementation issues

Implementation issue
– Take the log: turn products of probabilities into sums of log probabilities (see the sketch below).
– Ignore some constants: e.g., drop terms that are the same for every class when taking the argmax.
– Increase small numbers before dividing.
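A tiny sketch of why the log matters (Python, made-up probabilities): the raw product of many small probabilities underflows to zero, while the sum of their logs stays usable for comparing classes.

    import math

    probs = [1e-5] * 100
    product = 1.0
    for p in probs:
        product *= p
    print(product)                     # 0.0 -- the product has underflowed

    log_score = sum(math.log(p) for p in probs)
    print(log_score)                   # about -1151.3, still fine for comparing classes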

Implementation issue (cont)
– Reformulate the formulas: e.g., the entropy calculation (see the sketch below).
– Store the useful intermediate results.
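For the entropy example, one common reformulation (a sketch, not necessarily the exact one used in class) rewrites H = -Σ_i (c_i/N) log(c_i/N) as log N - (1/N) Σ_i c_i log c_i, so the computation works directly on raw counts and the Σ_i c_i log c_i term can be cached and updated incrementally as counts change.

    import math

    def entropy_from_counts(counts):
        # H = -sum_i (c_i/N) log2(c_i/N) = log2(N) - (1/N) * sum_i c_i log2(c_i)
        n = sum(counts)
        sum_c_log_c = sum(c * math.log2(c) for c in counts if c > 0)   # cacheable term
        return math.log2(n) - sum_c_log_c / n

    print(entropy_from_counts([2, 2]))   # 1.0, same as the direct -sum p log p formula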

An example: calculating model expectation in MaxEnt

for each instance x
    calculate P(y|x) for every y ∈ Y
    for each feature t in x
        for each y ∈ Y
            model_expect[t][y] += 1/N * P(y|x)
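A runnable Python version of this loop (a sketch: calc_p_y_given_x stands in for decoding with the current model, e.g. the MaxEnt decoder sketched earlier, and instances are assumed to be lists of active features).

    from collections import defaultdict

    def model_expectation(instances, classes, calc_p_y_given_x):
        # instances: list of active-feature lists; returns E_model[t][y] under the current weights
        n = len(instances)
        expect = defaultdict(lambda: defaultdict(float))
        for feats in instances:
            p = calc_p_y_given_x(feats, classes)      # dict mapping y -> P(y|x)
            for t in feats:
                for y in classes:
                    expect[t][y] += p[y] / n          # 1/N * P(y|x), as in the pseudocode
        return expect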

Another example: finding the best transformation (Hw9)

Conceptually:
    for each possible transformation
        go through the data and calculate the net gain

In practice:
    go through the data and calculate the net gain of all applicable transformations
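A sketch of the “in practice” version in Python; applicable_transforms and the gold/current fields are hypothetical stand-ins for the Hw9 data structures. One pass over the data accumulates, for every transformation that applies somewhere, how many errors it would fix minus how many it would introduce.

    from collections import Counter, namedtuple

    Instance = namedtuple("Instance", ["feats", "gold", "current"])   # assumed layout

    def find_best_transformation(instances, applicable_transforms):
        # applicable_transforms(inst) yields (transformation_id, proposed_new_label) pairs
        net_gain = Counter()
        for inst in instances:
            for trans, new_label in applicable_transforms(inst):
                if inst.current != inst.gold and new_label == inst.gold:
                    net_gain[trans] += 1        # this transformation would fix an error here
                elif inst.current == inst.gold and new_label != inst.gold:
                    net_gain[trans] -= 1        # this transformation would introduce an error here
        return net_gain.most_common(1)[0] if net_gain else None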

More advanced techniques

Basic approach:
    in each iteration
        calculate net gain for each transformation
        choose the best transformation and apply it

More advanced:
    calculate net gain for each transformation
    in each iteration
        choose the best transformation and apply it
        update net gain for each transformation

What’s next?

What’s next (for LING572)?
Today:
– Course evaluation
– 3-4pm: 4th GP meeting, Room AE 108
Hw9: due 11:45pm on 3/15
Your hw will be in your mail folder

What’s next?
Supervised learning:
– Covered algorithms: e.g., L-BFGS for MaxEnt, training for SVM, math proofs
– Other algorithms: e.g., CRF, graphical models
– Other approaches: e.g., Bayesian
Using algorithms:
– Select features
– Choose/compare ML algorithms

What’s next? (cont)
Semi-supervised learning:
– LU: labeled data and unlabeled data
– PU: labeled positive data and unlabeled data
Unsupervised learning
Using them for real applications: LING573