1. LING 696B: Midterm review: parametric and non-parametric inductive inference

2. Big question: How do people generalize?

3. Big question: How do people generalize? Examples related to language: categorizing a new stimulus, assigning structure to a signal, telling whether a form is grammatical.

4. Big question: How do people generalize? Examples related to language: categorizing a new stimulus, assigning structure to a signal, telling whether a form is grammatical. What is the nature of inductive inference?

5. Big question: How do people generalize? Examples related to language: categorizing a new stimulus, assigning structure to a signal, telling whether a form is grammatical. What is the nature of inductive inference? What role does statistics play?

6. Two paradigms of statistical learning (I). Fisher's paradigm: inductive inference through the likelihood p(X|θ). X: the observed set of data; θ: the parameters of the probability density function p, or an interpretation of X. We expect X to come from an infinite population following p(X|θ). Representational bias: the form of p(X|θ) constrains what kinds of things you can learn.

7. Learning in Fisher's paradigm. Philosophy: find the infinite population so that the chance of seeing X is large (an idea from Bayes); knowing the universe by seeing individuals; randomness is due to the finiteness of X. Maximum likelihood: find θ so that p(X|θ) reaches its maximum. Natural consequence: the more X you see, the better you learn about p(X|θ).
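
A minimal sketch (not from the slides) of maximum likelihood in Fisher's paradigm: fit a Gaussian p(X|θ) with θ = (μ, σ) to simulated data. The data and parameter values are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Toy data: X drawn from an unknown Gaussian "universe".
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=200)

# Maximum likelihood: choose theta = (mu, sigma) maximizing p(X | theta).
# For a Gaussian these maximizers have closed forms: the sample mean and
# the (biased) sample standard deviation.
mu_hat = X.mean()
sigma_hat = X.std()

# Log-likelihood of the data under the fitted model: the more data we see,
# the closer (mu_hat, sigma_hat) gets to the true parameters.
log_lik = norm.logpdf(X, loc=mu_hat, scale=sigma_hat).sum()
print(mu_hat, sigma_hat, log_lik)
```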

8. Extending Fisher's paradigm to complex situations. Statisticians cannot specify p(X|θ) for you! It must come from an understanding of the structure that generates X, e.g. a grammar; this needs a supporting theory that guides the construction of p(X|θ) -- “language is special”. Extending p(X|θ) to include hidden variables: the EM algorithm. Making bigger models from smaller models: iterative learning through coordinate-wise ascent.

9. Example: unsupervised learning of categories. X: instances of pre-segmented speech sounds; θ: a mixture of a fixed number of category models. Representational bias: discreteness, and the distribution of each category (bias from the mixture components). Hidden variable: category membership. Learning: the EM algorithm.
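
A minimal sketch of this setup using scikit-learn's EM-based Gaussian mixture; the one-dimensional "acoustic" values and the choice K = 2 are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for pre-segmented speech sounds: 1-D "formant" values
# drawn from two underlying categories (the learner never sees the labels).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(300, 30, 150),     # an /i/-like category
                    rng.normal(700, 40, 150)])    # an /a/-like category
X = X.reshape(-1, 1)

# theta: a mixture of K Gaussian category models, with K fixed in advance.
# EM alternates between guessing the hidden category memberships (E-step)
# and re-estimating each category's parameters (M-step).
model = GaussianMixture(n_components=2, random_state=0).fit(X)

print(model.means_.ravel())      # learned category centers
print(model.predict(X[:5]))      # inferred (hidden) category memberships
```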

10. Example: unsupervised learning of phonological words. X: instances of word-level signals; θ: mixture model + phonotactic model + word segmentation. Representational bias: discreteness, the distribution of each category (bias from the mixture components), and the combinatorial structure of phonological words. Learning: coordinate-wise ascent.

11. From Fisher's paradigm to Bayesian learning. A Bayesian wants to learn the posterior distribution p(θ|X). Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ). This is the same as ML when p(θ) is uniform. It still needs a theory guiding the construction of p(θ) and p(X|θ). More on this later.
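
To make the posterior/ML relationship concrete, here is a small hypothetical sketch (not from the slides) using a Beta-Bernoulli model; with a uniform prior, the posterior mode coincides with the maximum-likelihood estimate.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical example: theta is the probability that a form is grammatical,
# X is a small set of yes/no judgments (1 = judged grammatical).
X = np.array([1, 1, 0, 1, 1, 0, 1])

# Prior p(theta): Beta(a0, b0). A uniform prior (a0 = b0 = 1) makes the
# posterior mode coincide with the maximum-likelihood estimate.
a0, b0 = 1.0, 1.0

# By Bayes' rule p(theta|X) ∝ p(X|theta) p(theta), and conjugacy gives
# the posterior Beta(a0 + #successes, b0 + #failures).
a_post = a0 + X.sum()
b_post = b0 + (len(X) - X.sum())

print(beta(a_post, b_post).mean())   # posterior mean of theta
print(X.mean())                      # ML estimate, for comparison
```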

12. Attractions of generative modeling. Has clear semantics: p(X|θ) -- prediction/production/synthesis; p(θ) -- belief/prior knowledge/initial bias; p(θ|X) -- perception/interpretation.

13. Attractions of generative modeling. Has clear semantics: p(X|θ) -- prediction/production/synthesis; p(θ) -- belief/prior knowledge/initial bias; p(θ|X) -- perception/interpretation. Can make “infinite generalizations”: synthesizing from p(X, θ) can tell us something about the generalization.

14. Attractions of generative modeling. Has clear semantics: p(X|θ) -- prediction/production/synthesis; p(θ) -- belief/prior knowledge/initial bias; p(θ|X) -- perception/interpretation. Can make “infinite generalizations”: synthesizing from p(X, θ) can tell us something about the generalization. A very general framework: a theory of everything?
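
As a hypothetical illustration of these semantics (reusing the invented two-category mixture from the earlier sketch), one fitted model supports both synthesis and interpretation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Same invented two-category "acoustic" data as before.
rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(300, 30, 150),
                    rng.normal(700, 40, 150)]).reshape(-1, 1)
model = GaussianMixture(n_components=2, random_state=0).fit(X)

# p(X|theta): synthesis -- generate new "productions" from the model.
synth, _ = model.sample(5)
print(synth.ravel())

# Perception/interpretation: posterior probability of each (hidden) category
# for a new observation, computed from the fitted model.
print(model.predict_proba([[450.0]]))
```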

15. Challenges to generative modeling. The representational bias can be wrong.

16. Challenges to generative modeling. The representational bias can be wrong -- but “all models are wrong”.

17. Challenges to generative modeling. The representational bias can be wrong -- but “all models are wrong”. Unclear how to choose from different classes of models.

18. Challenges to generative modeling. The representational bias can be wrong -- but “all models are wrong”. Unclear how to choose from different classes of models, e.g. the destiny of K (the number of mixture components).

19. Challenges to generative modeling. The representational bias can be wrong -- but “all models are wrong”. Unclear how to choose from different classes of models, e.g. the destiny of K (the number of mixture components); simplicity is relative, e.g. f(x) = a*sin(bx) + c.

20. Challenges to generative modeling. The representational bias can be wrong -- but “all models are wrong”. Unclear how to choose from different classes of models, e.g. the destiny of K (the number of mixture components); simplicity is relative, e.g. f(x) = a*sin(bx) + c. Computing max_θ p(X|θ) can be very hard; Bayesian computation may help.

21. Challenges to generative modeling. Even finding X can be hard for language.

22. Challenges to generative modeling. Even finding X can be hard for language: a probability distribution over what? Example: for statistical syntax, the choices of X include strings of words, parse trees, semantic interpretations, and social interactions.

23. Challenges to generative modeling. Even finding X can be hard for language: a probability distribution over what? Example: for statistical syntax, the choices of X include strings of words, parse trees, semantic interpretations, and social interactions. Hope: staying at low levels of language will make the choice of X easier.

24. Two paradigms of statistical learning (II). Vapnik's critique of generative modeling: “Why solve a more general problem before solving a specific one?” Example: the generative approach to two-class classification (supervised) is a likelihood ratio test, log[p(x|A)/p(x|B)], where A and B are parametric models.
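
A minimal sketch of that generative route on invented one-dimensional data: fit a parametric model to each class, then classify by the sign of the log-likelihood ratio. The class means and variances are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D two-class data (labels are known: supervised).
rng = np.random.default_rng(2)
x_a = rng.normal(0.0, 1.0, 100)   # class A
x_b = rng.normal(2.0, 1.0, 100)   # class B

# Generative route: first fit a parametric model p(x|A) and p(x|B)...
mu_a, sd_a = x_a.mean(), x_a.std()
mu_b, sd_b = x_b.mean(), x_b.std()

# ...then classify a new point by the log-likelihood ratio
# log[p(x|A) / p(x|B)].
def classify(x):
    llr = norm.logpdf(x, mu_a, sd_a) - norm.logpdf(x, mu_b, sd_b)
    return "A" if llr > 0 else "B"

print(classify(-0.5), classify(2.5))
```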

25. Non-parametric approach to inductive inference. Main idea: don't try to know the universe first and then generalize -- the universe is complicated, the representational bias is often inappropriate, and there are very few data to learn from compared to the dimensionality of the space. Instead, generalize directly from old data to new data. Rules vs. analogy?

26. Examples of non-parametric learning (I). Nearest neighbor classification: analogy-based learning by dictionary lookup; generalizes to K-nearest neighbors.
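
A minimal sketch with invented 2-D data (the cluster locations and K = 5 are assumptions): a new point is classified directly from the stored examples, with no generative model of how the data arose.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D stimuli with known category labels (the "dictionary").
rng = np.random.default_rng(3)
X_old = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
                   rng.normal([3, 3], 1.0, (50, 2))])
y_old = np.array([0] * 50 + [1] * 50)

# K = 1 is pure dictionary lookup: copy the label of the single most similar
# stored item. Larger K votes over several analogs.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_old, y_old)

# Generalize to a new stimulus directly from the stored examples.
print(knn.predict([[2.5, 2.0]]))
```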

27. Examples of non-parametric learning (II). Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i), where K(x, x_i) is a non-linear similarity function centered at x_i, with tunable parameters. Interpretation: “soft/smooth” dictionary lookup/analogy within a population. Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem, min_f Σ_i [f(x_i) - y_i]^2 + λ ||f||^2.
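
A self-contained sketch of an RBF network trained by regularized least squares; the Gaussian kernel width, the regularization weight, and the sin target are all invented for illustration, and the penalty here is on the coefficients a_i rather than on the function norm.

```python
import numpy as np

# Hypothetical 1-D regression data from a noisy unknown function F.
rng = np.random.default_rng(4)
x_old = np.linspace(0, 2 * np.pi, 30)
y_old = np.sin(x_old) + rng.normal(0, 0.1, x_old.size)

# One Gaussian radial basis function K(x, x_i) per stored data point x_i.
def K(x, centers, width=0.5):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

# Learning the a_i is regularized least squares (ridge) on the design matrix:
# minimize sum_i [F(x_i) - y_i]^2 + lam * (penalty on the coefficients).
lam = 1e-3
G = K(x_old, x_old)                  # G[i, j] = K(x_i, x_j)
a = np.linalg.solve(G.T @ G + lam * np.eye(len(x_old)), G.T @ y_old)

# Generalize: F(x_new) = sum_i a_i K(x_new, x_i), a smooth "analogy" to old data.
x_new = np.array([1.0, 4.0])
print(K(x_new, x_old) @ a)           # approximately sin(1.0), sin(4.0)
```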

28. Radial basis functions/networks. Each data point x_i is associated with K(x, x_i), a radial basis function. Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n → R: the universal approximation property. Network interpretation (see demo).

29. How is this different from generative modeling? It does not assume a fixed space in which to search for the best hypothesis; instead, this space grows with the amount of data, and the basis of the space is the set of K(x, x_i). Interpretation: local generalization from the old data x_i to the new data x; F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x.

30. Examples of non-parametric learning (III). Support vector machines (last time): linear separation, f(x) = sign(⟨w, x⟩ + b).

31. Max margin classification. The solution is also a direct generalization from the old data, but sparse: f(x) = sign(Σ_i a_i ⟨x, x_i⟩ + b), with the a_i mostly zero.

32. Interpretation of support vectors. Support vectors are the data points with a non-zero contribution to the generalization -- “prototypes” for analogical learning: f(x) = sign(Σ_i a_i ⟨x, x_i⟩ + b), with the a_i mostly zero.

33. Kernel generalization of SVM. The solution looks very much like an RBF network. RBF net: F(x) = Σ_i a_i K(x, x_i), where many old data points contribute to the generalization. SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b), where relatively few old data points contribute. The dense/sparse difference in the solutions is due to the different goals (see demo).
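
A small sketch of that dense/sparse contrast on invented data (the kernel, regularization, and C values are assumptions): kernel ridge regression plays the role of the RBF net, while the SVM keeps only its support vectors.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

# Hypothetical 2-D two-class data.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# RBF-net-style solution: kernel ridge regression on the +/-1 labels.
# Essentially every training point gets a non-zero coefficient a_i (dense).
ridge = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
print("dense coefficients:", np.sum(np.abs(ridge.dual_coef_) > 1e-8))

# SVM with the same kernel: only the support vectors get non-zero a_i (sparse).
svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print("support vectors:", svm.n_support_.sum())
```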

34. Transductive inference with support vectors. One more wrinkle: now I'm putting two points there, but I don't tell you their color.

35. Transductive SVM. Not only do the old data affect the generalization; the new data affect each other too.

36. A general view of non-parametric inductive inference. A function approximation problem: knowing that (x_1, y_1), …, (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x? Linguistics: find the universe for F. Psychology: find the best model that “behaves” like F. In realistic terms, non-parametric methods often win.

37. Who's got the answer? The parametric approach can also approximate functions: model the joint distribution p(x, y|θ).

38. Who's got the answer? The parametric approach can also approximate functions: model the joint distribution p(x, y|θ). But the model is often difficult to build, e.g. for a realistic experimental task.

39. Who's got the answer? The parametric approach can also approximate functions: model the joint distribution p(x, y|θ). But the model is often difficult to build, e.g. for a realistic experimental task. Before reaching a conclusion, we need to know how people learn; they may be doing both.

40. Where do neural nets fit? Clearly not generative: they do not reason with probability.

41. Where do neural nets fit? Clearly not generative: they do not reason with probability. Somewhat different from analogy-type non-parametric methods: the network does not directly reason from old data, and the generalization is difficult to interpret.

42. Where do neural nets fit? Clearly not generative: they do not reason with probability. Somewhat different from analogy-type non-parametric methods: the network does not directly reason from old data, and the generalization is difficult to interpret. Some results are available for limiting cases: with infinitely many hidden units, the network behaves like non-parametric methods.

43. A point that nobody gets right. The small-sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many. Parametric: the ML estimate approaches the true distribution only with an infinite sample. Non-parametric: universal approximation requires an infinite sample. The limit is taken in the wrong direction.

