
1 Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data Sauro Menchetti, Fabrizio Costa, Paolo Frasconi Department of Systems and Computer Science Università di Firenze, Italy http://www.dsi.unifi.it/neural/ Massimiliano Pontil Department of Computer Science University College London, UK

2 ANNPR 2003, Florence, 12-13 September 2003. Structured Data: in many applications it is useful to represent the objects of the domain by structured data (trees, graphs, …), which better capture the important relationships between the sub-parts that compose an object.

3 Natural Language: Parse Trees. [Figure: parse tree of the sentence “He was previously vice president.”, with nonterminals S, VP, NP, ADVP and part-of-speech tags PRP, VBD, RB, NN.]

4 Structural Genomics: Protein Contact Maps

5 Document Processing: XY-Trees. [Figure: recursive XY-cut of a document page; each tree node stores the normalized coordinates of its block.]

6 Predictive Toxicology, QSAR: Chemical Compounds as Graphs. [Figure: the compound CH3(CH(CH3,CH2(CH2(CH3)))) drawn as a tree of atom groups, with each atom label encoded as a vector: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1])))).]

7 Ranking vs. Preference. [Figure: two orderings of elements 1-5, contrasting a full ranking with a preference that only singles out the best alternative.]

8 Preference on Structured Data

9 Classification, Regression and Ranking. Supervised learning task f: X → Y; the target space Y characterizes the task. Classification: Y is a finite, unordered, non-metric space. Regression: Y is a metric space (e.g. the reals). Ranking and preference: Y is a finite, ordered, non-metric space.

10 Learning on Structured Data. Learning algorithms on discrete structures often derive from vector-based methods. Both kernel machines and RNNs are suitable for learning on structured domains.

11 Kernels vs. RNNs. Kernel machines: very high-dimensional feature space; how to choose the kernel? (prior knowledge, fixed representation); minimize a convex functional (SVM). Recursive neural networks: low-dimensional space; task-driven, the representation depends on the specific learning task; learn an implicit encoding of the relevant information; problem of local minima.

12 A Kernel for Labeled Trees. Feature space: the set of all tree fragments (subtrees), with the only constraint that a father cannot be separated from his children. Φn(t) = number of occurrences of tree fragment n in t (a bag-of-“something” representation). A tree is represented by Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …], and K(t,s) = Φ(t)·Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001).
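The dynamic program behind K(t,s) can be sketched as follows. This is a minimal, unweighted reading of the Collins & Duffy recursion (no decay factor), with trees encoded as (label, children) tuples; the encoding and function names are illustrative, not the authors' implementation.

```python
def delta(n1, n2):
    """Number of common fragments rooted at nodes n1 and n2."""
    l1, c1 = n1
    l2, c2 = n2
    # productions must match: same label and same sequence of child labels
    # (a father cannot be separated from his children)
    if l1 != l2 or [c[0] for c in c1] != [c[0] for c in c2]:
        return 0
    if not c1:                       # two matching leaves: one fragment
        return 1
    prod = 1
    for a, b in zip(c1, c2):
        prod *= 1 + delta(a, b)      # each child may stop or expand
    return prod

def nodes(t):
    yield t
    for c in t[1]:
        yield from nodes(c)

def tree_kernel(t, s):
    """K(t,s) = sum of common fragments over all node pairs."""
    return sum(delta(n1, n2) for n1 in nodes(t) for n2 in nodes(s))
```

The double sum over node pairs is what dynamic programming makes efficient: delta is computed once per pair instead of enumerating the exponentially many fragments.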

13 Recursive Neural Networks. Composition of two adaptive functions: a transition function φw: X → R^n and an output function ow′: R^n → O, both implemented by feedforward neural networks. Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data.
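A minimal sketch of the recursive composition, assuming a tanh transition over one-hot node labels and a linear output; all sizes and weight names (W_label, W_child, w_out) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE, LABELS, MAX_CHILDREN = 4, 3, 2           # illustrative dimensions
W_label = rng.normal(size=(STATE, LABELS)) * 0.1
W_child = rng.normal(size=(STATE, MAX_CHILDREN * STATE)) * 0.1
w_out = rng.normal(size=STATE) * 0.1

def phi(tree):
    """Transition function: encode a labeled tree bottom-up into R^STATE."""
    label, children = tree
    states = [phi(c) for c in children]
    states += [np.zeros(STATE)] * (MAX_CHILDREN - len(children))  # pad
    return np.tanh(W_label @ label + W_child @ np.concatenate(states))

def utility(tree):
    """Output function: map the root state to a scalar score U(x)."""
    return float(w_out @ phi(tree))
```

The same weights are applied at every node, so the network is "unfolded" over the structure of each input tree.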

14 Recursive Neural Networks. [Figure: a labeled tree, the network unfolded over its structure, the prediction phase, and error correction through the output network.]

15 Preference Models. Kernel preference model: binary classification of pairwise differences between instances. RNN preference model: a probabilistic model to find the best alternative. Both models use a utility function to evaluate the importance of an element.

16 Utility Function Approach. Model the importance of an object with a utility function U: X → R, so that x > z ↔ U(x) > U(z). If U is linear, U(x) > U(z) ↔ w^T x > w^T z; U can also be modeled by a neural network. In ranking and preference problems: learn U, then sort by U(x). [Example: U(x) = 11 > U(z) = 3, so x is preferred to z.]
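With a linear U, ranking reduces to sorting candidates by their scores. A toy sketch, with a made-up weight vector w and made-up candidate vectors:

```python
import numpy as np

w = np.array([1.0, -0.5])                       # made-up learned weights

def U(x):
    """Linear utility: U(x) = w^T x."""
    return float(w @ x)

candidates = [np.array([3.0, 2.0]),             # U = 2.0
              np.array([1.0, 0.0]),             # U = 1.0
              np.array([5.0, 2.0])]             # U = 4.0
ranked = sorted(candidates, key=U, reverse=True)  # best alternative first
```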

17 Kernel Preference Model. Let x1 = best of (x1,…,xr) and create a set of pairs between x1 and x2,…,xr. If U is linear this yields the constraints U(x1) > U(xj) ↔ w^T x1 > w^T xj ↔ w^T (x1 − xj) > 0 for j = 2,…,r, so each x1 − xj can be seen as a positive example: binary classification of differences between instances. Via x → Φ(x) the process is easily kernelized. Note: this model does not take all the alternatives into consideration together, but only two at a time.
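The pairwise reduction can be sketched as follows, with a plain perceptron standing in for the voted perceptron used in the experiments; the vectors are made up.

```python
import numpy as np

def preference_pairs(best, others):
    """One alternative set (x1 best) yields positives x1 - xj, j = 2..r."""
    return [best - xj for xj in others]

def perceptron_train(diffs, epochs=10):
    """Plain perceptron on the difference vectors: enforce w.d > 0."""
    w = np.zeros(len(diffs[0]))
    for _ in range(epochs):
        for d in diffs:
            if w @ d <= 0:           # constraint w^T (x1 - xj) > 0 violated
                w = w + d
    return w

diffs = preference_pairs(np.array([2.0, 1.0]),
                         [np.array([1.0, 1.0]), np.array([0.0, 2.0])])
w = perceptron_train(diffs)
```

Note how each set of r alternatives is split into r-1 independent binary examples, which is exactly the pairwise limitation the slide points out.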

18 RNN Preference Model. Given a set of alternatives (x1, x2,…, xr), U is modeled by a recursive neural network architecture: compute U(xi) = o(φ(xi)) for i = 1,…,r and combine the utilities with a softmax function. The error (yi − oi) is backpropagated through the whole network. Note: the softmax function compares all the alternatives together at once.
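The output layer of this model can be sketched as a softmax over the r utilities; the utility values below are made up, and the last line shows the usual cross-entropy error on the utilities that would be backpropagated through the whole network.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))        # shift for numerical stability
    return e / e.sum()

u = np.array([2.0, 0.5, -1.0])       # U(x_i) for the r alternatives (made up)
p = softmax(u)                       # probability that each x_i is the best
y = np.array([1.0, 0.0, 0.0])        # alternative 1 is the correct one
grad = p - y                         # cross-entropy error on the utilities
```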

19 Learning Problems. First-pass attachment: modeling of a psycholinguistic phenomenon. Reranking task: reranking the parse trees output by a statistical parser.

20 First-Pass Attachment (FPA). The grammar introduces ambiguities: each word has a set of alternative attachments, but only one is correct. First-pass attachment can therefore be modeled as a preference problem. [Figure: alternative partial parse trees for the sentence “It has no bearing on …”.]

21 Heuristics for Prediction Enhancement. Specialize the FPA prediction for each class of word: group the words into 10 classes (verbs, articles, …) and learn a different classifier for each class. Tree reduction: remove from the parse tree the nodes that are not important for choosing between the different alternatives. Evaluation measure = (# correct trees ranked in first position) / (total number of sets).

22 Experimental Setup. Wall Street Journal (WSJ) section of the Penn Treebank: a realistic corpus of natural language with 40,000 sentences (about 1 million words; average sentence length: 25 words), a standard benchmark in computational linguistics. Training on sections 2-21, test on section 23, validation on section 24.

23 Voted Perceptron (VP). FPA on the WSJ yields about 100 million trees for training, so the voted perceptron is used instead of an SVM (Freund & Schapire, Machine Learning 1999): an online algorithm for binary classification based on the perceptron (simple and efficient). The prediction value is a weighted sum over all the training weight vectors, with performance comparable to maximal-margin classifiers (SVM).
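A sketch of the voted perceptron of Freund & Schapire: every intermediate weight vector is kept together with a survival count, and prediction is a weighted vote of their signs. The toy data in the test are made up; a kernelized version would replace the dot products with K(·,·).

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=5):
    """Keep every intermediate weight vector with its survival count c."""
    w, c, vectors = np.zeros(X.shape[1]), 0, []
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:        # mistake: freeze the old vector
                vectors.append((w.copy(), c))
                w, c = w + label * x, 1
            else:
                c += 1                      # vector survives one more example
    vectors.append((w, c))
    return vectors

def voted_predict(vectors, x):
    """Weighted vote of the signs of all stored perceptrons."""
    s = sum(c * np.sign(w @ x) for w, c in vectors)
    return 1 if s > 0 else -1
```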

24 Kernel VP vs. RNNs

25 Kernel VP vs. RNNs

26 Kernel VP vs. RNNs: Modularization

27 Small Datasets, No Modularization

28 Complexity Comparison. VP does not scale linearly with the number of training examples, as the RNNs do. On the small datasets (5 splits of 100 sentences, about a week on a 2 GHz CPU), CPU(VP) ≈ CPU(RNN); on the large dataset (all 40,000 sentences), VP took over 2 months to complete one epoch on a 2 GHz CPU, while the RNN learns in 1-2 epochs, about 3 days on a 2 GHz CPU. VP is smooth with respect to training iterations.

29 Reranking Task. Rerank the parse trees generated by a statistical parser. Same problem setting as FPA (preference on forests), but with 1 forest per sentence instead of 1 forest per word (less computational cost involved).

30 Evaluation: Parseval Measures. Standard evaluation measures: Labeled Precision (LP), Labeled Recall (LR) and Crossing Brackets (CBs). They compare a parse produced by the parser with a hand-annotated parse of the same sentence.
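LP and LR can be sketched by treating each parse as a set of labeled spans (label, start, end); the spans below are made up for illustration.

```python
def parseval(predicted, gold):
    """Labeled precision and recall over sets of (label, start, end) spans."""
    correct = len(predicted & gold)          # spans found in both parses
    return correct / len(predicted), correct / len(gold)

predicted = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
gold = {("NP", 0, 2), ("VP", 2, 4), ("S", 0, 5)}
lp, lr = parseval(predicted, gold)           # 2 of 3 spans match
```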

31 Reranking Task

Model ≤ 40 Words (2245 sentences)
        LR     LP     CBs    0 CBs   2 CBs
VP      89.1   89.4   0.85   69.3    88.2
RNN     89.2   89.5   0.84   67.9    88.4

Model ≤ 100 Words (2416 sentences)
        LR     LP     CBs    0 CBs   2 CBs
VP      88.6   88.9   0.99   66.5    86.3
RNN     88.6   88.9   0.98   64.8    86.3

32 Why Do RNNs Outperform Kernel VP? Hypothesis 1: the kernel function induces a feature space that is not focused on the specific learning task. Hypothesis 2: the kernel preference model is worse than the RNN preference model.

33 Linear VP on RNN Representation. Checking hypothesis 1: train VP on the RNN representation. The tree kernel is replaced by a linear kernel, and the state-vector representations of the parse trees generated by the RNN are given as input to VP, so linear VP is trained on RNN state vectors.

34 Linear VP on RNN Representation

35 Conclusions. RNNs show better generalization properties, also on small datasets, and at a smaller computational cost. The problem is neither the kernel function nor the VP algorithm (as the linear-VP-on-RNN-representation experiment shows): it is the preference model. The kernel preference model does not take all the alternatives into consideration together, but only two at a time, as opposed to the RNN model.

36 Acknowledgements. Thanks to: Alessio Ceroni, Alessandro Vullo, Andrea Passerini, Giovanni Soda.

