LING 696B: Midterm review: parametric and non-parametric inductive inference.

1 LING 696B: Midterm review: parametric and non-parametric inductive inference

2 Big question: How do people generalize?

3 Big question: How do people generalize? Examples related to language: Categorizing a new stimulus Assigning structure to a signal Telling whether a form is grammatical

4 Big question: How do people generalize? Examples related to language: Categorizing a new stimulus Assigning structure to a signal Telling whether a form is grammatical What is the nature of inductive inference?

5 Big question: How do people generalize? Examples related to language: Categorizing a new stimulus Assigning structure to a signal Telling whether a form is grammatical What is the nature of inductive inference? What role does statistics play?

6 Two paradigms of statistical learning (I) Fisher’s paradigm: inductive inference through likelihood -- p(X|θ) X: observed set of data θ: parameters of the probability density function p, or an interpretation of X We expect X to come from an infinite population following p(X|θ) Representational bias: the form of p(X|θ) constrains what kinds of things you can learn

7 Learning in Fisher’s paradigm Philosophy: finding the infinite population so that the chance of seeing X is large (idea from Bayes) Knowing the universe by seeing individuals Randomness is due to the finiteness of X Maximum likelihood: find θ so that p(X|θ) reaches its maximum Natural consequence: the more X you see, the better you learn about p(X|θ)
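A quick illustration, not from the slides (the data, seed, and values below are made up): for a single Gaussian model p(X|θ) with θ = (μ, σ²), maximum likelihood picks the sample mean and the 1/N sample variance, and the fit sharpens as more X is seen.

# Illustrative sketch only: ML estimation for a 1-D Gaussian p(X|theta), theta = (mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)      # "observed" data; true theta = (2.0, 2.25)

mu_hat = X.mean()                                 # argmax over mu of log p(X | mu, sigma^2)
var_hat = ((X - mu_hat) ** 2).mean()              # ML uses 1/N, not the unbiased 1/(N-1)

log_lik = (-0.5 * len(X) * np.log(2 * np.pi * var_hat)
           - 0.5 * ((X - mu_hat) ** 2).sum() / var_hat)
print(f"theta_hat = ({mu_hat:.3f}, {var_hat:.3f}), log p(X|theta_hat) = {log_lik:.1f}")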

8 Extending Fisher’s paradigm to complex situations Statisticians cannot specify p(X|θ) for you! It must come from an understanding of the structure that generates X, e.g. grammar Needs a supporting theory that guides the construction of p(X|θ) -- “language is special” Extending p(X|θ) to include hidden variables The EM algorithm Making bigger models from smaller models Iterative learning through coordinate-wise ascent

9 Example: unsupervised learning of categories X: instances of pre-segmented speech sounds θ: mixture of a fixed number of category models Representational bias: Discreteness Distribution of each category (bias from mixture components) Hidden variable: category membership Learning: EM algorithm
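A minimal sketch of the EM idea, not the course implementation (two 1-D Gaussian categories, toy data, my own variable names): the E-step fills in the hidden category memberships, the M-step re-estimates θ, and the two steps are iterated.

# Illustrative EM for a 2-component 1-D Gaussian mixture (not the course code).
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 0.8, 300), rng.normal(3.0, 1.2, 200)])

w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])  # initial theta

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior probability of each category (the hidden variable) for each token
    resp = w * gauss(X[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the mixture weights, means, and variances
    Nk = resp.sum(axis=0)
    w, mu = Nk / len(X), (resp * X[:, None]).sum(axis=0) / Nk
    var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", w.round(2), "means:", mu.round(2), "variances:", var.round(2))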

10 Example: unsupervised learning of phonological words X: instances of word-level signals θ: mixture model + phonotactic model + word segmentation Representational bias: Discreteness Distribution of each category (bias from mixture components) Combinatorial structure of phonological words Learning: coordinate-wise ascent

11 From Fisher’s paradigm to Bayesian learning Bayesian: wants to learn the posterior distribution p(θ|X) Bayesian formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ) Same as ML when p(θ) is uniform Still needs a theory guiding the construction of p(θ) and p(X|θ) More on this later
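A small sketch of the Bayesian formula on a grid, with a made-up coin-flip example (the counts and grid size are mine): multiply p(X|θ) by p(θ) over candidate θ values and normalize; with the uniform prior the posterior mode equals the ML estimate.

# Illustrative grid approximation of p(theta|X) for a coin's head probability theta.
import numpy as np

heads, tails = 7, 3                                  # the observed data X
theta = np.linspace(0.001, 0.999, 999)               # grid of candidate parameter values

likelihood = theta ** heads * (1 - theta) ** tails   # p(X | theta)
prior = np.ones_like(theta)                          # uniform p(theta)
posterior = likelihood * prior                       # p(theta | X), up to a constant
posterior /= posterior.sum()

print("posterior mode:", theta[posterior.argmax()])  # = the ML estimate 0.7 under the flat prior
print("posterior mean:", round((theta * posterior).sum(), 3))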

12 Attractions of generative modeling Has clear semantics p(X|θ) -- prediction/production/synthesis p(θ) -- belief/prior knowledge/initial bias p(θ|X) -- perception/interpretation

13 Attractions of generative modeling Has clear semantics p(X|θ) -- prediction/production/synthesis p(θ) -- belief/prior knowledge/initial bias p(θ|X) -- perception/interpretation Can make “infinite generalizations” Synthesizing from p(X, θ) can tell us something about the generalization

14 Attractions of generative modeling Has clear semantics p(X|θ) -- prediction/production/synthesis p(θ) -- belief/prior knowledge/initial bias p(θ|X) -- perception/interpretation Can make “infinite generalizations” Synthesizing from p(X, θ) can tell us something about the generalization A very general framework Theory of everything?

15 Challenges to generative modeling The representational bias can be wrong

16 Challenges to generative modeling The representational bias can be wrong But “all models are wrong”

17 Challenges to generative modeling The representational bias can be wrong But “all models are wrong” Unclear how to choose from different classes of models

18 Challenges to generative modeling The representational bias can be wrong But “all models are wrong” Unclear how to choose from different classes of models E.g. the choice of K (the number of mixture components)

19 Challenges to generative modeling The representational bias can be wrong But “all models are wrong” Unclear how to choose from different classes of models E.g. the choice of K (the number of mixture components) Simplicity is relative, e.g. f(x) = a*sin(bx) + c

20 Challenges to generative modeling The representational bias can be wrong But “all models are wrong” Unclear how to choose from different classes of models E.g. the choice of K (the number of mixture components) Simplicity is relative, e.g. f(x) = a*sin(bx) + c Computing max_θ p(X|θ) can be very hard Bayesian computation may help

21 Challenges to generative modeling Even finding X can be hard for language

22 Challenges to generative modeling Even finding X can be hard for language Probability distribution over what? Example: statistical syntax, choices of X String of words Parse trees Semantic interpretations Social interactions

23 Challenges to generative modeling Even finding X can be hard for language Probability distribution over what? Example: X for statistical syntax? String of words Parse trees Semantic interpretations Social interactions Hope: staying at low levels of language will make the choice of X easier

24 Two paradigms of statistical learning (II) Vapnik’s critique of generative modeling: “Why solve a more general problem before solving a specific one?” Example: Generative approach to 2-class classification (supervised) Likelihood ratio test: log[p(x|A)/p(x|B)] A, B are parametric models
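A sketch of this generative route, with assumed Gaussian class models and toy data (everything below is my illustration, not Vapnik’s example): fit p(x|A) and p(x|B) by ML, then classify by the sign of log[p(x|A)/p(x|B)].

# Illustrative generative 2-class classifier via a log-likelihood ratio test.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(0.0, 1.0, 200)                        # training data for class A
B = rng.normal(2.5, 1.0, 200)                        # training data for class B

def log_gauss(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

(muA, varA), (muB, varB) = (A.mean(), A.var()), (B.mean(), B.var())   # ML estimates per class

def classify(x):
    llr = log_gauss(x, muA, varA) - log_gauss(x, muB, varB)           # log[p(x|A)/p(x|B)]
    return "A" if llr > 0 else "B"

print([classify(x) for x in (-1.0, 1.2, 3.0)])       # expect roughly ['A', 'A', 'B']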

25 Non-parametric approach to inductive inference Main idea: don’t want to know the universe first and then generalize Universe is complicated, representational bias often inappropriate Very few data to learn from, compared to the dimensionality of the space Instead, want to generalize directly from old data to new data Rules vs. analogy?

26 Examples of non-parametric learning (I): Nearest neighbor classification: Analogy-based learning by dictionary lookup Generalizes to K-nearest neighbors
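A minimal sketch of the lookup idea (toy data; the function name knn_predict is mine): store the labeled examples, find the K most similar ones, and take a majority vote.

# Illustrative K-nearest-neighbor classifier: generalization by lookup over stored examples.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)        # distance to every stored example
    nearest = np.argsort(dists)[:k]                        # indices of the k closest ones
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote among the neighbors

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([0.9, 1.0]), k=3))   # -> "b"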

27 Examples of non-parametric learning (II) Radial Basis Function networks for supervised learning: F(x) = Σ_i a_i K(x, x_i), where K(x, x_i) is a non-linear similarity function centered at x_i, with tunable parameters Interpretation: “soft/smooth” dictionary lookup/analogy within a population Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem: min Σ_i [f(x_i) - y_i]^2 + λ||f||^2
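A sketch of this regularized regression, assuming a Gaussian kernel and a ridge-style penalty (the kernel width, λ, and target function below are made up): with one basis function per training point, the coefficients a_i solve (G + λI)a = y, where G is the matrix of K(x_i, x_j) values.

# Illustrative RBF-network regression solved as regularized (kernel ridge) regression.
import numpy as np

def K(x, c, width=0.5):
    return np.exp(-((x - c) ** 2) / (2 * width ** 2))     # Gaussian radial basis at center c

rng = np.random.default_rng(3)
x_train = np.linspace(0, 2 * np.pi, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.1, x_train.size)   # noisy samples of an unknown F

lam = 1e-3                                                # weight on the ||f||^2 penalty
G = K(x_train[:, None], x_train[None, :])                 # Gram matrix of K(x_i, x_j)
a = np.linalg.solve(G + lam * np.eye(len(x_train)), y_train)    # the coefficients a_i

def F(x):
    return K(x, x_train) @ a                              # F(x) = sum_i a_i K(x, x_i)

print(round(float(F(np.pi / 2)), 3), "~ sin(pi/2) = 1")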

28 Radial basis functions/networks Each data point x_i is associated with a K(x, x_i) -- a radial basis function Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R Universal approximation property Network interpretation (see demo)

29 How is this different from generative modeling? Do not assume a fixed space to search for the best hypothesis Instead, this space grows with the amount of data Basis of the space: K(x, x_i) Interpretation: local generalization from old data x_i to new data x F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x

30 Examples of non-parametric learning (III) Support Vector Machines (last time): linear separation f(x) = sign(⟨w, x⟩ + b)

31 Max margin classification The solution is also a direct generalization from old data, but sparse: f(x) = sign(Σ_i a_i ⟨x_i, x⟩ + b), with the coefficients a_i mostly zero

32 Interpretation of support vectors Support vectors are the old data points with a non-zero contribution a_i to the generalization f(x) = sign(Σ_i a_i ⟨x_i, x⟩ + b); they act as “prototypes” for analogical learning, and the remaining coefficients are mostly zero

33 Kernel generalization of SVM The solution looks very much like RBF networks: RBF net: F(x) = Σ_i a_i K(x, x_i) Many old data points contribute to the generalization SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) Relatively few old data points contribute The dense/sparse solutions are due to different goals (see demo)
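A small sketch of the shared functional form (the coefficients below are hand-picked for illustration, not a trained model): both predictors are kernel expansions over old data; the SVM coefficient vector is mostly zero, so only the support vectors contribute.

# Illustrative comparison of the two kernel expansions (untrained, hand-picked coefficients).
import numpy as np

def K(x, c, width=1.0):
    return np.exp(-np.sum((x - c) ** 2, axis=-1) / (2 * width ** 2))

X_old = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # stored old data x_i
a_rbf = np.array([0.4, -0.2, 0.3, 0.5])     # dense: every old point contributes
a_svm = np.array([0.0, -0.8, 0.0, 0.9])     # sparse: only the support vectors contribute
b = -0.1

x_new = np.array([1.5, 1.5])
rbf_out = K(x_new, X_old) @ a_rbf                    # F(x) = sum_i a_i K(x, x_i)
svm_out = np.sign(K(x_new, X_old) @ a_svm + b)       # F(x) = sign(sum_i a_i K(x, x_i) + b)
print("RBF net output:", round(float(rbf_out), 3), " SVM decision:", int(svm_out))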

34 Transductive inference with support vectors One more wrinkle: now I’m putting two points there, but won’t tell you their color

35 Transductive SVM Not only do the old data affect the generalization; the new data also affect each other

36 A general view of non-parametric inductive inference A function approximation problem: knowing that (x_1, y_1), …, (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x? Linguistics: find the universe for F Psychology: find the best model that “behaves” like F In realistic terms, non-parametric methods often win

37 Who’s got the answer? The parametric approach can also approximate functions Model the joint distribution p(x, y|θ)

38 Who’s got the answer? The parametric approach can also approximate functions Model the joint distribution p(x, y|θ) But the model is often difficult to build E.g. a realistic experimental task

39 Who’s got the answer? The parametric approach can also approximate functions Model the joint distribution p(x, y|θ) But the model is often difficult to build E.g. a realistic experimental task Before reaching a conclusion, we need to know how people learn They may be doing both

40 Where do neural nets fit? Clearly not generative: they do not reason with probability

41 Where do neural nets fit? Clearly not generative: they do not reason with probability Somewhat different from the analogy type of non-parametric learning: the network does not directly reason from old data Difficult to interpret the generalization

42 Where do neural nets fit? Clearly not generative: they do not reason with probability Somewhat different from the analogy type of non-parametric learning: the network does not directly reason from old data Difficult to interpret the generalization Some results are available for limiting cases Similar to non-parametric methods when the number of hidden units is infinite

43 A point that nobody gets right Small sample dilemma: people learn from very few examples (compared to the dimensionality of the data), yet any statistical machinery needs many Parametric: the ML estimate approaches the true distribution only with an infinite sample Non-parametric: universal approximation requires an infinite sample The limit is taken in the wrong direction