Extending linear models by transformation (section 3.4 in text) (lectures 3&4 on amlbook.com)


1 Extending linear models by transformation (section 3.4 in text) (lectures 3&4 on amlbook.com)

2 Usually the only way to determine whether data are linearly separable is to try a linear model: when the number of attributes exceeds 2, viewing the training data as a scatter plot is not practical.

3 A linear model that has a small E_in(g) means the bulk of the training data is linearly separable. Since linear models usually generalize well, a linear model with small E_in(g) is probably the best choice.

4 When members of a class tend to cluster, an elliptical transformation, z = Φ(x) = (1, x₁², x₂²), might lead to linearly separable features. [Figure: the same data shown in attribute space and in feature space]
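A minimal sketch of this transform in Python (the function name is hypothetical):

import numpy as np

def phi_elliptical(X):
    # Map 2D attributes x = (x1, x2) to features z = (1, x1^2, x2^2).
    # X has shape (N, 2) in attribute space; the result has shape (N, 3).
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2, x2**2])

An elliptical boundary x₁²/a² + x₂²/b² = 1 in attribute space becomes a plane in (1, x₁², x₂²) space, which is why a clustered class can become linearly separable there.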

5 When a linear model in attribute space separates most of the data, the transform to a feature space where E_in(g) = 0 (linearly separable) is likely to be complex. [Figure: attribute space showing a linear boundary and complex boundaries back-transformed from feature space]

6 Data snooping: choosing a transform by looking at a scatter plot can be dangerous, because characteristics seen in the plot may apply only to this particular dataset.

7 A non-linear transform is usually discovered as an improvement on a linear model. To find the optimum weight vector w, replace the attribute vectors x_n in the X matrix by the corresponding feature vectors z_n = Φ(x_n): minimizing E_in then gives ZᵀZ w_lin = Zᵀy in place of XᵀX w_lin = Xᵀy.
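As a sketch (function name hypothetical), the one-step least-squares solution is the same for either matrix; np.linalg.lstsq solves the normal equations without forming an explicit inverse:

import numpy as np

def fit_linear(Z, y):
    # Rows of Z are feature vectors z_n = phi(x_n) (or raw attribute
    # vectors x_n); y holds the N targets. Solves Z^T Z w = Z^T y.
    w_lin, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w_lin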

8 Learning curves: simple vs. complex models. Complex models require more data points for good performance: for N smaller than the crossover (dotted line), the simple model is better. Even at large N, the expected error is still larger than the bound set by noise.
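A minimal sketch of how such learning curves can be generated, assuming an illustrative noisy 1D target (the target function, noise level, and polynomial degrees are assumptions, not from the slides): fit a simple and a complex model at increasing N and estimate E_out on a large fresh sample.

import numpy as np

rng = np.random.default_rng(0)

def make_data(N, sigma=0.3):
    # Hypothetical noisy target y = sin(pi*x) + noise on [-1, 1].
    x = rng.uniform(-1, 1, N)
    return x, np.sin(np.pi * x) + sigma * rng.normal(size=N)

def e_out(degree, N, n_test=10_000):
    x, y = make_data(N)
    coef = np.polyfit(x, y, degree)   # fit model of given complexity
    x_t, y_t = make_data(n_test)      # large sample approximates E_out
    return np.mean((np.polyval(coef, x_t) - y_t) ** 2)

for N in (10, 20, 50, 200, 1000):
    print(f"N={N:5d}  simple (deg 1): {e_out(1, N):.3f}  "
          f"complex (deg 9): {e_out(9, N):.3f}")

For small N the degree-9 model overfits badly; as N grows it approaches the noise floor sigma² = 0.09, while the simple model plateaus higher because of its bias.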

9 Extending linear models by transforms can lead to “over-fitting” (smaller E_in but larger E_out). The VC dimension is a measure of complexity; for linear models it equals the number of free parameters, so the 2D linear model has d_VC = 3 and the 2D full quadratic model has d_VC = 6. The model with the optimal d_VC* has the minimum E_out, not the smallest E_in.

10 d_VC as a measure of complexity is usually not known. What are some more useful measures of complexity? How do we estimate a good level of complexity?

11 An “elbow” in the estimate of E_out indicates the best complexity. The approach used for 1D polynomial fitting applies to any measure of complexity: use a validation set to estimate E_out. [Figure from Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press (V1.0)]

12 The number of features expands rapidly in multivariate polynomial models: z_2Dquad = Φ(x) = (1, x₁, x₂, x₁², x₂², x₁x₂). Add terms sequentially and see how E_val changes, as in the sketch below.
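A sketch of this sequential procedure, assuming class labels in {−1, +1} and the least-squares fit from slide 7 (all helper names hypothetical):

import numpy as np

def fit(Z, y):
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

def e_class(Z, y, w):
    # Fraction of sign disagreements with labels in {-1, +1}.
    return np.mean(np.sign(Z @ w) != y)

# Candidate columns of the full 2D quadratic transform, in the order added.
terms = [
    ("x1",    lambda X: X[:, 0]),
    ("x2",    lambda X: X[:, 1]),
    ("x1^2",  lambda X: X[:, 0] ** 2),
    ("x2^2",  lambda X: X[:, 1] ** 2),
    ("x1*x2", lambda X: X[:, 0] * X[:, 1]),
]

def sequential_e_val(X_tr, y_tr, X_val, y_val):
    cols_tr, cols_val = [np.ones(len(X_tr))], [np.ones(len(X_val))]
    for name, f in terms:
        cols_tr.append(f(X_tr))
        cols_val.append(f(X_val))
        w = fit(np.column_stack(cols_tr), y_tr)
        e_val = e_class(np.column_stack(cols_val), y_val, w)
        print(f"+{name:6s} E_val = {e_val:.3f}")

Terms whose addition does not lower E_val are candidates to drop, which is the pruning idea on the next slide.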

13 Extending the linear beer-bottle classifier to full quadratic changes the size of the Z matrix from 9 to 81. Some quadratic terms are more important than others: ignore terms that do not decrease E_val significantly. A large validation set makes this technique more effective.

Curse of dimensionality: glass data (first three rows; columns are the refractive index RI, the oxide measurements Na, Mg, Al, Si, K, Ca, Ba, Fe, and the glass type):

#  RI      Na     Mg    Al    Si     K     Ca    Ba  Fe  type
1  1.521   13.64  4.49  1.1   71.78  0.06  8.75  0   0   1
2  1.5171  13.89  3.6   1.36  72.73  0.48  7.83  0   0   1
3  1.516   13.53  3.55  1.54  72.99  0.39  7.78  0   0   1

14 Classification for digit recognition. [Figure: examples of hand-written digits from zip codes]

15 2-attribute digit model: intensity and symmetry. Intensity: how much black is in the image. Symmetry: how similar the image is to its mirror image. [Figure: digits plotted in the (intensity, symmetry) plane]
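The slides do not give formulas, but a minimal sketch of the two features, assuming each image is a 2D numpy array where larger values mean darker pixels and using only the left-right mirror (both choices are assumptions):

import numpy as np

def intensity(img):
    # Average darkness: how much "black" is in the image.
    return img.mean()

def symmetry(img):
    # Negative mean absolute difference between the image and its
    # left-right mirror; values nearer 0 mean a more symmetric digit.
    return -np.abs(img - np.fliplr(img)).mean()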

16 The linear classifier has accuracy ~ 0.99. [Figure: linear boundary separating ones from fives in the (intensity, symmetry) plane]

17 One vs. not-one: linear is good; cubic is slightly better.

18 One vs. not-one: finding the best complexity. [Figure: error vs. additional terms beyond linear (L, +x₁², +x₂², +x₁x₂, +x₁³, +x₂³, +x₁x₂², +x₁²x₂); curves show E_val on 8798 samples and E_in on 500 samples]

19 Discriminants in 2D binary classification. [Figure: ones and fives in the (intensity, symmetry) plane with discriminant curves]

20 Discriminants: linear 2D binary classifier. y_fit(x) = w₀ + w₁x₁ + w₂x₂, where r₁ and r₂ are numerical class labels. Setting y_fit(x) = (r₁ + r₂)/2 defines the function of x₁ and x₂ that is the discriminant. Solve this equation for x₂ as a function of x₁, as sketched below.
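Concretely, solving w₀ + w₁x₁ + w₂x₂ = (r₁ + r₂)/2 for x₂ gives x₂ = ((r₁ + r₂)/2 − w₀ − w₁x₁)/w₂. A minimal sketch (function name hypothetical):

def discriminant_x2(x1, w, r1, r2):
    # Boundary of the linear classifier: solve
    # w0 + w1*x1 + w2*x2 = (r1 + r2)/2 for x2 (requires w2 != 0).
    w0, w1, w2 = w
    return ((r1 + r2) / 2 - w0 - w1 * x1) / w2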

21 Discriminants: non-linear binary classifiers

22 By analogy with the linear 2D case: y_fit = wᵀΦ(x) and r_b = (r₁ + r₂)/2, so y_fit = r_b defines the discriminant. For a given x₁, define f(x₂) = wᵀΦ(x) − r_b and find the zeros of f(x₂); the resulting points (x₁, x₂) lie on the discriminant. A root-finding sketch follows.
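A minimal sketch of that zero-finding step, assuming a feature map phi and a bracketing interval [x2_lo, x2_hi] are supplied (all names hypothetical); scipy.optimize.brentq locates a zero on any interval where f changes sign:

import numpy as np
from scipy.optimize import brentq

def discriminant_points(w, phi, r_b, x1_grid, x2_lo, x2_hi):
    # For each x1 on the grid, solve w . phi((x1, x2)) = r_b for x2.
    # x1 values whose interval shows no sign change are skipped.
    points = []
    for x1 in x1_grid:
        f = lambda x2, x1=x1: w @ phi(np.array([x1, x2])) - r_b
        if f(x2_lo) * f(x2_hi) < 0:
            points.append((x1, brentq(f, x2_lo, x2_hi)))
    return points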

