# 1 Machine Learning in Natural Language 1.No Lecture on Thursday. 2.Instead: Monday, 4pm, 1404SC Mark Johnson lectures on: Bayesian Models of Language Acquisition.

1 Machine Learning in Natural Language. Announcements: (1) No lecture on Thursday. (2) Instead: Monday, 4pm, 1404SC, Mark Johnson lectures on Bayesian Models of Language Acquisition.

2 Machine Learning in Natural Language: Features and Kernels. 1. The idea of kernels: Kernel Perceptron. 2. Structured kernels: tree and graph kernels. 3. Lessons. Multi-class classification.

3 Embedding. Example: Weather vs. Whether. The new discriminator is functionally simpler. Embedding can be done explicitly (generate expressive features) or implicitly (use kernels).

4 A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. Computing the weight vector is done in the original space. Notice: this pertains only to efficiency. Generalization is still relative to the real dimensionality. This is the main trick in SVMs (the algorithm is different), although many applications actually use linear kernels. Kernel Based Methods

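5 Let I be the set $t_1, t_2, t_3, \dots$ of monomials (conjunctions) over the feature space $x_1, x_2, \dots, x_n$. Then we can write a linear function over this new feature space: $f(x) = \mathrm{Th}_\theta\big(\sum_{i \in I} w_i\, t_i(x)\big)$, where $\mathrm{Th}_\theta$ thresholds at $\theta$ and $w_i$ is the weight of monomial $t_i$. Kernel Based Methods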

6 Great increase in expressivity. Can run Perceptron, Winnow, or logistic regression, but the convergence bound may suffer exponential growth: an exponential number of monomials are true in each example. Also, we will have to keep many weights. Kernel Based Methods

7 Consider the value of w used in the prediction. Each previous mistake, on example z, makes an additive contribution of +/-1 to w, iff t(z) = 1. The value of w is determined by the number of mistakes on which t() was satisfied. The Kernel Trick(1)

8 P: set of examples on which we promoted. D: set of examples on which we demoted. $M = P \cup D$. The Kernel Trick(2)

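9 P: set of examples on which we promoted. D: set of examples on which we demoted. $M = P \cup D$. With $\mathrm{Th}_\theta$ denoting thresholding at $\theta$: $f(x) = \mathrm{Th}_\theta\big(\sum_{i \in I} \big[\sum_{z \in M} S(z)\, t_i(z)\big]\, t_i(x)\big)$, where $S(z)=1$ if $z \in P$ and $S(z) = -1$ if $z \in D$. Reordering: $f(x) = \mathrm{Th}_\theta\big(\sum_{z \in M} S(z) \sum_{i \in I} t_i(z)\, t_i(x)\big)$. The Kernel Trick(3)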

10 $S(y)=1$ if $y \in P$ and $S(y) = -1$ if $y \in D$. A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that are satisfied by both x and z. Define a dot product in the t-space: $K(x,z) = \sum_{i \in I} t_i(z)\, t_i(x)$. We get the standard notation: $f(x) = \mathrm{Th}_\theta\big(\sum_{z \in M} S(z)\, K(x,z)\big)$. The Kernel Trick(4)

11 What does this representation give us? We can view this kernel as the similarity (the dot product) between x and z measured in the t-space. But K(x,z) can be computed in the original space, without explicitly writing the t-representation of x and z. Kernel Based Methods

12 Consider the space of all $3^n$ monomials (allowing both positive and negative literals). Then, if same(x,z) is the number of features that have the same value for both x and z, we get: $K(x,z) = 2^{\mathrm{same}(x,z)}$. Example: take n=2; x=(00), z=(01), …. Proof: let k=same(x,z); for each of these k features, choose to (1) include the literal with the right polarity in the monomial, or (2) not include it at all. Other kernels can be used. Kernel Based Methods
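
The counting argument on this slide can be checked by brute force. The sketch below is illustrative only: the function names and the encoding of a monomial as a tuple of `None`/`0`/`1` entries are assumptions, not something from the lecture. It enumerates all 3^n monomials, counts those satisfied by both x and z, and compares the count with 2^same(x,z).

```python
from itertools import product


def same(x, z):
    """Number of positions where x and z have the same value."""
    return sum(1 for a, b in zip(x, z) if a == b)


def satisfies(monomial, x):
    """monomial[i] is None (variable absent), 1 (positive literal) or 0 (negated literal)."""
    return all(lit is None or lit == xi for lit, xi in zip(monomial, x))


def brute_force_kernel(x, z):
    """Count the monomials, out of all 3^n, that are satisfied by both x and z."""
    n = len(x)
    return sum(
        1
        for m in product((None, 0, 1), repeat=n)
        if satisfies(m, x) and satisfies(m, z)
    )


def monomial_kernel(x, z):
    """Closed form from the slide: 2 raised to the number of agreeing features."""
    return 2 ** same(x, z)


# The slide's example: n = 2, x = (00), z = (01). They agree on one feature,
# so two monomials (the empty one and "not x1") are satisfied by both.
x, z = (0, 0), (0, 1)
print(brute_force_kernel(x, z), monomial_kernel(x, z))  # prints: 2 2
```

The brute-force count costs 3^n per pair, while the closed form costs n, which is the efficiency point the slides make.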

13 Simply run Perceptron in an on-line mode, but keep track of the set M. Keeping the set M allows us to keep track of S(z). Rather than remembering the weight vector w, remember the set M (P and D): all those examples on which we made mistakes. Dual Representation Implementation
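
A minimal sketch of this dual implementation follows, assuming labels in {+1, -1}, a threshold of 0, and the monomial kernel 2^same(x,z) from the earlier slide. The function names, the `epochs` cap, and the XOR data are illustrative assumptions. The learner stores only the mistake set M as pairs (z, S(z)) and never builds the weight vector.

```python
def monomial_kernel(x, z):
    """K(x, z) = 2^same(x, z): number of monomials satisfied by both x and z."""
    return 2 ** sum(1 for a, b in zip(x, z) if a == b)


def score(M, kernel, x):
    """Dual form of w . t(x): sum over stored mistakes of S(z) * K(x, z)."""
    return sum(s * kernel(x, z) for z, s in M)


def predict(M, kernel, x, theta=0.0):
    return 1 if score(M, kernel, x) > theta else -1


def kernel_perceptron(examples, kernel, epochs=10, theta=0.0):
    """Run Perceptron on-line, remembering only the mistake set M = P u D.

    examples: list of (x, y) with y in {+1, -1}.
    Returns M as a list of (z, S(z)); S(z) = +1 for promotions, -1 for demotions.
    """
    M = []
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if predict(M, kernel, x, theta) != y:
                M.append((x, y))  # promoted if y = +1, demoted if y = -1
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: stop
            break
    return M


# XOR is not linearly separable over x1, x2, but it is over the monomials.
xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
M = kernel_perceptron(xor, monomial_kernel)
print([predict(M, monomial_kernel, x) for x, _ in xor])  # prints: [-1, 1, 1, -1]
```

Each prediction costs one kernel evaluation per stored mistake, which is why the cost of the dual form grows with the number of examples rather than with the number of features.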

14 A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. Computing the weight vector can still be done in the original feature space. Notice: this pertains only to efficiency: The classifier is identical to the one you get by blowing up the feature space. Generalization is still relative to the real dimensionality. This is the main trick in SVMs. (Algorithm - different) (although most applications actually use linear kernels) Summary – Kernel Based Methods I

15 Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature-based representation of examples. We want to define a dot product in a high dimensional space. Given two examples $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ we want to map them to a high dimensional space [example: quadratic]: $\Phi(x_1,\dots,x_n) = (x_1,\dots,x_n,\; x_1^2,\dots,x_n^2,\; x_1 x_2,\dots,x_{n-1} x_n)$ and $\Phi(y_1,\dots,y_n) = (y_1,\dots,y_n,\; y_1^2,\dots,y_n^2,\; y_1 y_2,\dots,y_{n-1} y_n)$, and compute the dot product $A = \Phi(x) \cdot \Phi(y)$ [takes time $O(n^2)$]. Instead, in the original space, compute $B = f(x \cdot y) = [1 + (x_1,\dots,x_n) \cdot (y_1,\dots,y_n)]^2$. Theorem: A = B. (Coefficients do not really matter; can be done for other functions.) Summary – Kernel Trick
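
The theorem can be checked numerically. In the sketch below the explicit map `phi` carries the coefficients (a constant 1 and factors of sqrt(2)) that make the identity exact; these coefficients are what the slide waves away with "coefficients do not really matter". The function names and test vectors are illustrative assumptions.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)."""
    r2 = math.sqrt(2.0)
    return (
        [1.0]
        + [r2 * xi for xi in x]
        + [xi * xi for xi in x]
        + [r2 * xi * xj for xi, xj in combinations(x, 2)]
    )


def quadratic_kernel(x, y):
    """B = (1 + x . y)^2, computed in the original n-dimensional space."""
    return (1.0 + dot(x, y)) ** 2


x = [1.0, 2.0, -0.5]
y = [0.5, -1.0, 3.0]
A = dot(phi(x), phi(y))      # O(n^2) work in the expanded space
B = quadratic_kernel(x, y)   # O(n) work in the original space
print(math.isclose(A, B))    # prints: True
```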

16 There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier. For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown up space is more efficient than using kernels. Next: More Complicated Kernels Efficiency-Generalization Tradeoff

17 Structured Input. [Figure: annotated text examples. The sentence S = John will join the board as a director, with per-word attributes Word=, POS=, IS-A=, …; the chunked question [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?); and the fragment "afternoon, Dr. Ab C … in Ms. De. F class.."] Knowledge Representation

18 We want to extract features from structured domain elements; their internal (hierarchical) structure should be encoded. A feature is a mapping from the instance space to {0,1} or [0,1]. With an appropriate representation language it is possible to represent expressive features that constitute an infinite dimensional space [FEX]. Learning can be done in the infinite attribute domain. What does it mean to extract features? Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements). Computationally: some kind of graph matching process. Challenge: provide the expressivity necessary to deal with large scale and highly structured domains, and meet the strong tractability requirements for these tasks. Learning From Structured Input

19 Only those descriptions that are ACTIVE in the input are listed. Michael Collins developed kernels over parse trees. Cumby/Roth developed parameterized kernels over structures. When is it better to use a kernel vs. using the primal representation? Example: D = (AND word (before tag)). Explicit features.

20 Overview – Goals (Cumby&Roth 2003). Applying kernel learning methods to structured domains. Develop a unified formalism for structured kernels (Collins & Duffy, Gaertner & Lloyd, Haussler): a flexible language that measures distance between structures with respect to a given ‘substructure’. Examine complexity & generalization between different feature sets and learners: when does each type of feature set perform better, and with what learners? Exemplify with experiments from bioinformatics & NLP: mutagenesis, named-entity prediction.

21 A flexible knowledge representation for feature extraction from structured data. Domain elements are represented as labeled graphs: concept graphs that correspond to FDL expressions. FDL is formed from an alphabet of attribute, value, and role symbols. Well defined syntax and equivalent semantics; e.g., descriptions are defined inductively with sensors as primitives. Sensor: a basic description, a term of the form a(v) or a, where a is an attribute symbol and v is a value symbol. a(v) is a ground sensor; the existential sensor a describes an object that has some value for attribute a. AND clauses, and (role D) clauses for relations between objects. Expressive and efficient feature extraction. Feature Description Logic Knowledge Representation

22 Example (Cont.) Features; Feature Generation Functions; extensions; Subsumption… (see paper). Basically: only those descriptions that are ACTIVE in the input are listed. The language is expressive enough to generate linguistically interesting features such as agreements, etc. D = (AND word (before tag)); {D_θ} = {(AND word(the) (before tag(N))), (AND word(dog) (before tag(V))), (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))}. Explicit features
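
To make the feature generation concrete, here is a small sketch that instantiates D = (AND word (before tag)) on a tagged sentence and lists only the active descriptions. It is not the FEX/FDL implementation: the function name, the flat word/tag lists, and the DET tag assigned to "the" are assumptions made for illustration.

```python
def before_tag_features(words, tags):
    """Instantiate D = (AND word (before tag)) at every position that has a successor.

    Each feature pairs the word at position i with the tag of the word at i + 1.
    """
    return [
        f"(AND word({words[i]}) (before tag({tags[i + 1]})))"
        for i in range(len(words) - 1)
    ]


# "The dog ran very fast"; the tag for "the" (DET) is an assumed value.
words = ["the", "dog", "ran", "very", "fast"]
tags = ["DET", "N", "V", "ADV", "ADJ"]

for feature in before_tag_features(words, tags):
    print(feature)
```

The four printed features match the set {D_θ} listed on the slide; the last word produces no feature because nothing follows it.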

23 Kernels. It’s possible to define FDL-based kernels for structured data. When using linear classifiers it is important to enhance the set of features to gain expressivity. A common way: blow up the feature space by generating functions of primitive features. For some algorithms (SVM, Perceptron), kernel functions can be used to expand the feature space while still working in the original space. Is it worth doing in structured domains? Answers are not clear so far. Computationally: yes, when we simulate a huge space. Generalization: not always [Khardon, Roth, Servedio, NIPS’01; Ben David et al.]

24 Kernels in Structured Domains. We define a kernel family K parameterized by FDL descriptions. The definition is recursive on the definition of D [sensor, existential sensor; role description; AND]. Key: many previous structured kernels considered all substructures (e.g., Collins&Duffy02, Tree Kernels); this is analogous to an exponential feature space and risks overfitting. If the feature space is explicitly expanded, we can use algorithms such as Winnow (SNoW) [complexity and experimental results]. Generalization issues & computation issues [if # of examples large]. Kernels

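25 FDL Kernel Definition. Kernel family K parameterized by feature type descriptions. For a description D and a pair of nodes $n_1$, $n_2$ from the two concept graphs: If D is a sensor s(v) and s(v) is a label of both $n_1$ and $n_2$, then $K_D(n_1,n_2) = 1$. If D is an existential sensor s and sensor descriptions $s(v_1), s(v_2), \dots, s(v_j)$ are labels of both $n_1$ and $n_2$, then $K_D(n_1,n_2) = j$. If D is a role description (r D’), then $K_D(n_1,n_2) = \sum_{n_1'} \sum_{n_2'} K_{D'}(n_1', n_2')$, with $n_1'$, $n_2'$ those nodes that have an r-labeled edge from $n_1$, $n_2$. If D is a description (AND $D_1$ $D_2$ ... $D_n$) with $l_i$ repetitions of any $D_i$, then $K_D(n_1,n_2) = \prod_i \binom{K_{D_i}(n_1,n_2)}{l_i}$. Kernels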

26 Kernel Example. D = (AND word (before word)). G1: The dog ran very fast. G2: The dog ran quickly. Etc.; the final output is 2 since there are 2 matching collocations. Can simulate Boolean kernels as seen in Khardon, Roth et al. Kernels
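
The count of 2 can be reproduced with a simplified matching kernel for this one description. The sketch treats each sentence as a chain of word nodes joined by "before" edges and sums, over all node pairs, the product of a word match and a next-word match. It is a hand-specialized illustration for D = (AND word (before word)), not the general recursive FDL kernel, and the function name and lowercasing are assumptions.

```python
def collocation_kernel(sent1, sent2):
    """Matching kernel for D = (AND word (before word)) on two word chains.

    For each pair of nodes (i, j): the existential sensor `word` contributes 1
    when the words match, and the role (before word) contributes 1 when the
    following words match. The AND multiplies the two; node pairs are summed.
    """
    total = 0
    for i in range(len(sent1) - 1):        # the last node has no `before` edge
        for j in range(len(sent2) - 1):
            word_match = 1 if sent1[i] == sent2[j] else 0
            next_match = 1 if sent1[i + 1] == sent2[j + 1] else 0
            total += word_match * next_match
    return total


g1 = "the dog ran very fast".split()
g2 = "the dog ran quickly".split()
print(collocation_kernel(g1, g2))  # prints: 2  ("the dog" and "dog ran")
```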

27 Complexity & Generalization. How do these kernels compare in complexity and generalization to other kernels for structured data? For m examples, with average example size g, and time to evaluate the kernel $t_1$, kernel Perceptron takes $O(m^2 g^2 t_1)$. If extracting a feature explicitly takes $t_2$, Perceptron takes $O(m g t_2)$. Most kernels that simulate a well defined feature space have $t_1 \ll t_2$. By restricting the size of the expanded feature space we avoid overfitting; even SVM suffers under many irrelevant features (Weston). Margin argument: the margin goes down when you have more features. Given a linearly separable set of points $S = \{x_1, \dots, x_m\} \subset R^n$ with separator $w \in R^n$, embed S into an n’ > n dimensional space by adding zero-mean random noise e to the additional n’-n dimensions, s.t. $w' = (w, 0) \in R^{n'}$ still separates S. Now margin … but … & Analysis
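
One way to read the margin argument is sketched below. Because the slide's formulas did not survive, the margin definition used here (the smallest value of y (w · x) / (||w|| ||x||), i.e. normalized by the example norm) is an assumption, as are the data, the noise level, and all names. Under that reading, padding the examples with noise leaves w' · x' unchanged but increases ||x'||, so the normalized margin can only shrink.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def normalized_margin(w, points):
    """Smallest y * (w . x) / (||w|| * ||x||) over the labeled points."""
    return min(
        y * sum(a * b for a, b in zip(w, x)) / (norm(w) * norm(x))
        for x, y in points
    )


def embed_with_noise(points, extra_dims, sigma, rng):
    """Append zero-mean Gaussian noise coordinates to every example."""
    return [
        (list(x) + [rng.gauss(0.0, sigma) for _ in range(extra_dims)], y)
        for x, y in points
    ]


rng = random.Random(0)
w = [1.0, -1.0]

# Build a linearly separable sample in R^2, labeled by the sign of w . x.
points = []
while len(points) < 50:
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    s = x[0] - x[1]
    if abs(s) > 0.2:
        points.append((x, 1 if s > 0 else -1))

extra = 20
noisy = embed_with_noise(points, extra, sigma=1.0, rng=rng)
w_prime = w + [0.0] * extra  # w' = (w, 0) ignores the noise and still separates

before = normalized_margin(w, points)
after = normalized_margin(w_prime, noisy)
print(f"normalized margin before: {before:.3f}, after: {after:.3f}")
```

Since Perceptron-style mistake bounds scale with the squared ratio of example norm to margin, a smaller normalized margin means a weaker guarantee even though the same separator still works.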

28 Experiments. Serve as comparison: our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features. Bioinformatics experiment in mutagenesis prediction: 188 compounds with atom-bond data, binary prediction; 10-fold cross validation with 12 runs training. NLP experiment in classifying detected NEs: 4700 training and 1500 test phrases from MUC-7 (person, location, & organization). Trained and tested with kernel Perceptron and Winnow (SNoW) classifiers with the FDL kernel & respective features; also an all-subtrees kernel based on Collins & Duffy work. [Figures: mutagenesis concept graph; features simulated with all-subtrees kernel]

29 Discussion. [Results table: microaveraged accuracy] We have a kernel that simulates the features obtained with FDL, but quadratic training time means it is cheaper to extract features and learn explicitly than to use kernel Perceptron. SVM could take (slightly) even longer, but may perform better. But restricted features might work better than larger spaces simulated by other kernels. Can we improve on the benefits of useful features? Compile examples together? More sophisticated kernels than the matching kernel? Still provides a metric for similarity-based approaches.

30 Conclusion. Kernels for learning from structured data are an interesting idea. Different kernels may expand/restrict the hypothesis space in useful ways. Need to know the benefits and hazards: to justify these methods we must embed in a space much larger than the training set size, which can decrease the margin. Expressive knowledge representations can be used to create features explicitly or in implicit kernel spaces. The data representation could allow us to plug in different base kernels to replace the matching kernel. A parameterized kernel allows us to direct the way the feature space is blown up to encode background knowledge.
