
1 CS 388: Natural Language Processing: Semantic Parsing Raymond J. Mooney University of Texas at Austin

2 Representing Meaning Representing the meaning of natural language is ultimately a difficult philosophical question, i.e. the “meaning of meaning”. The traditional approach is to map ambiguous NL to unambiguous logic in first-order predicate calculus (FOPC). Standard inference (theorem-proving) methods exist for FOPC that can determine when one statement entails (implies) another. Questions can be answered by determining which potential responses are entailed by the given NL statements and background knowledge, all encoded in FOPC.

3 Model Theoretic Semantics The meaning of traditional logic is based on model-theoretic semantics, which defines meaning in terms of a model (a.k.a. possible world): a set-theoretic structure that defines a (potentially infinite) set of objects with properties and relations between them. A model is a connecting bridge between language and the world, representing the abstract objects and relations that exist in a possible world. An interpretation is a mapping from logic to the model that defines predicates extensionally, in terms of the set of tuples of objects that make them true (their denotation or extension). –The extension of Red(x) is the set of all red things in the world. –The extension of Father(x,y) is the set of all pairs of objects (A,B) such that A is B’s father.
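The extensional view above can be made concrete with a small Python sketch. The model and its objects below are hypothetical toy data; the point is only that a predicate is represented as the set of argument tuples that satisfy it.

```python
# A toy model-theoretic interpretation: each predicate is defined
# extensionally as the set of tuples of domain objects in its extension
# (all objects here are hypothetical).
model = {
    "Red": {("apple1",), ("firetruck1",)},
    "Father": {("homer", "bart"), ("homer", "lisa")},
}

def holds(predicate, *args):
    """A ground atom is true iff its argument tuple is in the
    predicate's extension."""
    return tuple(args) in model.get(predicate, set())

print(holds("Red", "apple1"))            # True
print(holds("Father", "homer", "bart"))  # True
print(holds("Father", "bart", "homer"))  # False: order matters in pairs
```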

4 Truth-Conditional Semantics Model-theoretic semantics gives the truth conditions for a sentence, i.e. a model satisfies a logical sentence iff the sentence evaluates to true in the given model. The meaning of a sentence is therefore defined as the set of all possible worlds in which it is true.

5 Semantic Parsing Semantic Parsing: Transforming natural language (NL) sentences into completely formal logical forms or meaning representations (MRs). Sample application domains, where MRs are directly executable by another computer system to perform some task: –CLang: RoboCup Coach Language –Geoquery: A Database Query Application

6 CLang: RoboCup Coach Language In the RoboCup Coach competition, teams compete to coach simulated soccer players [http://www.robocup.org]. The coaching instructions are given in a formal language called CLang [Chen et al. 2003]. Semantic parsing maps an NL instruction to CLang, e.g.: If the ball is in our goal area then player 1 should intercept it. → (bpos (goal-area our) (do our {1} intercept))

7 Geoquery: A Database Query Application Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996]. Semantic parsing maps the question to a query, which is executed to produce the answer, e.g.: Which rivers run through the states bordering Texas? → answer(traverse(next_to(stateid('texas')))) → Arkansas, Canadian, Cimarron, Gila, Mississippi, Rio Grande …

8 Procedural Semantics The meaning of a sentence is a formal representation of a procedure that performs some action that is an appropriate response. –Answering questions –Following commands In philosophy, the “late” Wittgenstein was known for the “meaning as use” view of semantics, in contrast to the model-theoretic view of the “early” Wittgenstein and other logicians.

9 Predicate Logic Query Language Most existing work on computational semantics is based on predicate logic, e.g.: What is the smallest state by area? answer(x1, smallest(x2, (state(x1), area(x1,x2)))) x1 is a logical variable that denotes “the smallest state by area”.

10 Functional Query Language (FunQL) Transform a logical language into a functional, variable-free language (Kate et al., 2005): What is the smallest state by area? answer(x1, smallest(x2, (state(x1), area(x1,x2)))) becomes answer(smallest_one(area_1(state(all))))
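Because FunQL is variable-free, an MR can be evaluated by straightforward function composition. A minimal sketch, assuming a tiny hypothetical geography database (the real Geoquery database has about 800 facts, and these state areas are illustrative):

```python
# Toy database: state name -> area in square miles (hypothetical subset).
STATES = {"texas": 268596, "alaska": 665384, "rhode island": 1545}

def state(_):            # state(all): the set of all states
    return set(STATES)

def area_1(states):      # pair each state with its area
    return {(s, STATES[s]) for s in states}

def smallest_one(pairs): # entity whose paired value is smallest
    return min(pairs, key=lambda p: p[1])[0]

def answer(x):
    return x

# Evaluate answer(smallest_one(area_1(state(all)))) by direct composition.
print(answer(smallest_one(area_1(state("all")))))  # rhode island
```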

11 Learning Semantic Parsers Manually programming robust semantic parsers is difficult due to the complexity of the task. Semantic parsers can be learned automatically from sentences paired with their logical forms: NL→MR training examples are fed to a semantic-parser learner, which produces a semantic parser mapping natural language to meaning representations.

12 Engineering Motivation Most computational language-learning research strives for broad coverage while sacrificing depth. –“Scaling up by dumbing down” Realistic semantic parsing currently entails domain dependence. Domain-dependent natural-language interfaces have a large potential market. Learning makes developing specific applications more tractable. Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.

13 Cognitive Science Motivation Most natural-language learning methods require supervised training data that is not available to a child. –General lack of negative feedback on grammar. –No POS-tagged or treebank data. Assuming a child can infer the likely meaning of an utterance from context, NL→MR pairs are more cognitively plausible training data.

14 Our Semantic-Parser Learners CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003) –Separates parser-learning and semantic-lexicon learning. –Learns a deterministic parser using ILP techniques. COCKTAIL (Tang & Mooney, 2001) –Improved ILP algorithm for CHILL. SILT (Kate, Wong & Mooney, 2005) –Learns symbolic transformation rules for mapping directly from NL to LF. SCISSOR (Ge & Mooney, 2005) –Integrates semantic interpretation into Collins’ statistical syntactic parser. WASP (Wong & Mooney, 2006) –Uses syntax-based statistical machine translation methods. KRISP (Kate & Mooney, 2006) –Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.

15 CHILL (Zelle & Mooney, 1992-96) Semantic parser acquisition system using Inductive Logic Programming (ILP) to induce a parser written in Prolog. Starts with a deterministic parsing “shell” written in Prolog and learns to control the operators of this parser to produce the given I/O pairs. Requires a semantic lexicon, which for each word gives one or more possible meaning representations. The parser must disambiguate words, introduce proper semantic representations for each, and then put them together in the right way to produce a proper representation of the sentence.

16 CHILL Example U.S. Geographical database –Sample training pair (in Spanish): ¿Cuál es la capital del estado con la población más grande? answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))) –Sample semantic lexicon: cuál : answer(_,_) capital: capital(_,_) estado: state(_) más grande: largest(_,_) población: population(_,_)

17 WOLFIE (Thompson & Mooney, 1995-1999) Learns a semantic lexicon for CHILL from the same corpus of semantically annotated sentences. Determines hypotheses for word meanings by finding the largest isomorphic common subgraphs shared by meanings of sentences in which the word appears. Uses a greedy-covering style algorithm to learn a small lexicon sufficient to allow compositional construction of the correct representation from the words in a sentence.

18 WOLFIE + CHILL Semantic Parser Acquisition NL→MR training examples are given to the WOLFIE lexicon learner, which produces a semantic lexicon; the lexicon and the same training examples are then given to the CHILL parser learner, which produces a semantic parser mapping natural language to meaning representations.

19 Compositional Semantics Approach to semantic analysis based on building up an MR compositionally, following the syntactic structure of a sentence. Build the MR recursively, bottom-up, from the parse tree:

BuildMR(parse-tree):
  If parse-tree is a terminal node (word), then
    return an atomic lexical meaning for the word.
  Else:
    For each child subtree_i of parse-tree:
      create its MR by calling BuildMR(subtree_i).
    Return an MR by properly combining the resulting MRs
    for the children into an MR for the overall parse-tree.
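The BuildMR procedure above can be sketched in Python. This is a minimal sketch: the lexicon entries and the simple “apply the functional child to the other child” composition rule are hypothetical simplifications of how MRs are actually combined.

```python
# A minimal BuildMR sketch. Words map to atomic meanings; a callable
# meaning awaits an argument; None marks semantically vacuous words.
LEXICON = {
    "capital": lambda x: ("capital", x),    # awaits its argument
    "of": lambda x: x,                      # semantically transparent
    "Ohio": "stateid('ohio')",
    "the": None, "is": None, "What": None,  # semantically vacuous
}

def build_mr(tree):
    """tree is either a word (str) or a (label, children) pair."""
    if isinstance(tree, str):                  # terminal: look up the word
        return LEXICON.get(tree)
    child_mrs = [build_mr(c) for c in tree[1]]
    child_mrs = [m for m in child_mrs if m is not None]
    if len(child_mrs) == 1:
        return child_mrs[0]
    f, arg = child_mrs                         # assume binary composition
    return f(arg) if callable(f) else arg(f)

# "the capital of Ohio" as a toy parse tree:
tree = ("NP", [("N", ["capital"]),
               ("PP", [("IN", ["of"]), ("NP", ["Ohio"])])])
print(build_mr(tree))  # ('capital', "stateid('ohio')")
```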

20 Composing MRs from Parse Trees What is the capital of Ohio? Working bottom-up over the parse tree: Ohio → stateid('ohio'); of → loc_2(); the capital of Ohio → capital(loc_2(stateid('ohio'))); the full sentence → answer(capital(loc_2(stateid('ohio')))).

21 Disambiguation with Compositional Semantics The composition function that combines the MRs of the children of a node can return ⊥ (null) if there is no sensible way to compose the children’s meanings. Could compute all parse trees up-front and then compute semantics for each, eliminating any that ever generate a ⊥ semantics for any constituent. More efficient method: –When filling the (CKY) chart of syntactic phrases, also compute all possible compositional semantics of each phrase as it is constructed, and make an entry for each. –If a given phrase only gives ⊥ semantics, then remove this phrase from the table, thereby eliminating any parse that includes this meaningless phrase.
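The pruning idea can be illustrated with a small sketch: a composition function that returns None (standing in for the null meaning ⊥) when children cannot sensibly combine, so those chart entries are dropped. The type tags and meanings here are hypothetical simplifications.

```python
# Sketch of semantic pruning during chart parsing: compositions that
# fail return None (the null meaning), and such entries are discarded.

def compose(f, arg):
    """Apply f to arg only if arg's type matches f's expected type."""
    func, expected_type = f
    typ, val = arg
    return func(val) if typ == expected_type else None  # None stands for null

# loc_2() expects a state-typed argument in this toy setup.
loc_2 = (lambda x: f"loc_2({x})", "state")
state_ohio = ("state", "stateid('ohio')")   # Ohio as a state
river_ohio = ("river", "riverid('ohio')")   # Ohio as a river

entries = [compose(loc_2, state_ohio), compose(loc_2, river_ohio)]
chart_cell = [e for e in entries if e is not None]  # prune null entries
print(chart_cell)  # ["loc_2(stateid('ohio'))"]
```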

22 Composing MRs from Parse Trees What is the capital of Ohio? If Ohio is instead interpreted as riverid('ohio'), the composition with loc_2() eventually produces ⊥, so this reading is eliminated.

23 Composing MRs from Parse Trees What is the capital of Ohio? Likewise, other incorrect interpretations of capital() produce ⊥ when combined with loc_2(stateid('ohio')), and the parses containing them are discarded.

24 SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations

25 SCISSOR An integrated syntax-based approach –Allows both syntax and semantics to be used simultaneously to build meaning representations A statistical parser is used to generate a semantically augmented parse tree (SAPT), in which each node carries both a syntactic and a semantic label, e.g. for “our player 2 has the ball”: S-bowner dominating NP-player (PRP$-team our, NN-player player, CD-unum 2) and VP-bowner (VB-bowner has, NP-null the ball). A SAPT is then translated into a complete formal meaning representation (MR) using a meaning-composition process. MR: bowner(player(our,2))

26 Semantic Composition Example For “our player 2 has the ball”, the leaf labels are PRP$-our, NN-player(_,_), CD-2, VB-bowner(_), DT-null, NN-null. player(team,unum) requires two arguments, filled by our and 2 to give NP-player(our,2); bowner(player) requires a player argument, filled by the subject NP to give S-bowner(player(our,2)); the and ball are semantically vacuous and require no arguments.

29 SCISSOR (continued) The integrated approach also allows statistical modeling of semantic selectional constraints in application domains, e.g. (AGENT pass) = PLAYER: the AGENT of pass must be a PLAYER.

30 Overview of SCISSOR Training: SAPT training examples are given to the integrated semantic parser learner. Testing: an NL sentence is parsed into a SAPT, and ComposeMR turns the SAPT into an MR.

31 Extending Collins’ (1997) Syntactic Parsing Model Collins (1997) introduced a lexicalized head-driven syntactic parsing model. Bikel (2004) provides an easily extended open-source version of the Collins statistical parser. We extend the parsing model to generate semantic labels simultaneously with syntactic labels, constrained by the semantic constraints of the application domain.

32 Integrating Semantics into the Model Use the same Markov processes. Add a semantic label to each node, e.g. S(has) becomes S-bowner(has), NP(player) becomes NP-player(player). Add semantic subcat frames –Give semantic subcategorization preferences –e.g. bowner takes a player as its argument.

33 Adding Semantic Labels into the Model The head-driven model generates each node’s children as a product of factors; for the root S-bowner(has) of “our player 2 has the ball”:

P_h(VP-bowner | S-bowner, has)
× P_lc({NP}-{player} | S-bowner, VP-bowner, has)
× P_rc({}-{} | S-bowner, VP-bowner, has)
× P_d(NP-player(player) | S-bowner, VP-bowner, has, LEFT, {NP}-{player})
× P_d(STOP | S-bowner, VP-bowner, has, LEFT, {}-{})
× P_d(STOP | S-bowner, VP-bowner, has, RIGHT, {}-{})

First the head child VP-bowner is generated (P_h), then the left and right subcat frames (P_lc, P_rc), then the dependents on each side (P_d), ending each side with STOP.

39 SCISSOR Parser Implementation Supervised training on annotated SAPTs is just frequency counting. An augmented smoothing technique is employed to account for the additional data sparsity created by semantic labels. Parsing of test sentences to find the most probable SAPT is performed using a variant of the standard CKY chart-parsing algorithm.

40 Smoothing Each label in a SAPT is the combination of a syntactic label and a semantic label, which increases data sparsity. Use the chain rule to break the parameters down: P_h(H | P, w) = P_h(Hsyn, Hsem | P, w) = P_h(Hsyn | P, w) × P_h(Hsem | P, w, Hsyn)
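The decomposition can be sketched with maximum-likelihood estimates from counts; the event tuples below are hypothetical toy data, and the point is that the two factors multiply back to the joint estimate while each factor can be smoothed separately.

```python
from collections import Counter

# Hypothetical (parent, head-word, syntactic label, semantic label) events.
events = [("S-bowner", "has", "VP", "bowner")] * 8 + \
         [("S-bowner", "has", "VP", "null")] * 2

joint = Counter((p, w, syn, sem) for p, w, syn, sem in events)
syn_given_pw = Counter((p, w, syn) for p, w, syn, _ in events)
pw = Counter((p, w) for p, w, _, _ in events)

def p_head(syn, sem, p, w):
    """P(Hsyn | P, w) * P(Hsem | P, w, Hsyn): equals the joint MLE here,
    but each factor can now be smoothed separately."""
    p_syn = syn_given_pw[(p, w, syn)] / pw[(p, w)]
    p_sem = joint[(p, w, syn, sem)] / syn_given_pw[(p, w, syn)]
    return p_syn * p_sem

print(p_head("VP", "bowner", "S-bowner", "has"))  # 0.8
```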

41 Learning Semantic Parsers with a Formal Grammar for Meaning Representations Our other techniques assume that meaning representation languages (MRLs) have deterministic context-free grammars. –True for almost all computer languages –MRs can be parsed unambiguously

42 NL: Which rivers run through the states bordering Texas? MR: answer(traverse(next_to(stateid('texas')))) Parse tree of MR: Non-terminals: ANSWER, RIVER, TRAVERSE, STATE, NEXT_TO, STATEID Terminals: answer, traverse, next_to, stateid, 'texas' Productions: ANSWER → answer(RIVER), RIVER → TRAVERSE(STATE), STATE → NEXT_TO(STATE), STATE → STATEID, TRAVERSE → traverse, NEXT_TO → next_to, STATEID → 'texas'
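Because the MRL grammar is deterministic, an MR string can be parsed unambiguously. A minimal sketch of such a parser, assuming only unary productions as in this example (real Geoquery MRs also have multi-argument productions, which this toy parser does not handle):

```python
import re

def parse_mr(s):
    """Parse e.g. "answer(traverse(next_to(stateid('texas'))))" into a
    (functor, [children]) tree by recursive descent."""
    s = s.strip()
    m = re.match(r"(\w+)\((.*)\)$", s)
    if not m:                       # a terminal such as 'texas'
        return (s, [])
    functor, inner = m.groups()
    return (functor, [parse_mr(inner)])  # assumes unary productions

tree = parse_mr("answer(traverse(next_to(stateid('texas'))))")
print(tree)
# ('answer', [('traverse', [('next_to', [('stateid', [("'texas'", [])])])])])
```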

43 KRISP: Kernel-based Robust Interpretation for Semantic Parsing Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar. Productions of the MRL are treated like semantic concepts. An SVM classifier with a string subsequence kernel is trained for each production to identify whether an NL substring represents that semantic concept. These classifiers are used to compositionally build the MRs of sentences.

44 Overview of KRISP Training: from NL sentences with MRs and the MRL grammar, collect positive and negative examples and train string-kernel-based SVM classifiers, yielding a semantic parser; the parser’s best MRs (correct and incorrect) feed back into example collection for further iterations. Testing: the semantic parser maps novel NL sentences to their best MRs.

46 KRISP’s Semantic Parsing We first define the semantic derivation of an NL sentence. We next define the probability of a semantic derivation. Semantic parsing of an NL sentence involves finding its most probable semantic derivation. It is straightforward to obtain the MR from a semantic derivation.

47 Semantic Derivation of an NL Sentence Which rivers run through the states bordering Texas? MR parse with non-terminals on the nodes: ANSWER(answer, RIVER(TRAVERSE(traverse), STATE(NEXT_TO(next_to), STATE(STATEID('texas')))))

48 Semantic Derivation of an NL Sentence Which rivers run through the states bordering Texas? MR parse with productions on the nodes: ANSWER → answer(RIVER); RIVER → TRAVERSE(STATE); TRAVERSE → traverse; STATE → NEXT_TO(STATE); NEXT_TO → next_to; STATE → STATEID; STATEID → 'texas'

49 Semantic Derivation of an NL Sentence Which rivers run through the states bordering Texas? Semantic derivation: each node covers an NL substring: ANSWER → answer(RIVER); RIVER → TRAVERSE(STATE); TRAVERSE → traverse; STATE → NEXT_TO(STATE); NEXT_TO → next_to; STATE → STATEID; STATEID → 'texas'

50 Semantic Derivation of an NL Sentence Which rivers run through the states bordering Texas? (word positions 1..9) Semantic derivation: each node contains a production and the substring of the NL sentence it covers: (ANSWER → answer(RIVER), [1..9]); (RIVER → TRAVERSE(STATE), [1..9]); (TRAVERSE → traverse, [1..4]); (STATE → NEXT_TO(STATE), [5..9]); (NEXT_TO → next_to, [5..7]); (STATE → STATEID, [8..9]); (STATEID → 'texas', [8..9])

51 Semantic Derivation of an NL Sentence Through the states that border Texas which rivers run? Substrings in the NL sentence may be in a different order: ANSWER → answer(RIVER); RIVER → TRAVERSE(STATE); TRAVERSE → traverse; STATE → NEXT_TO(STATE); NEXT_TO → next_to; STATE → STATEID; STATEID → 'texas'

52 Semantic Derivation of an NL Sentence Through the states that border Texas which rivers run? (word positions 1..10) Nodes are allowed to permute the children productions from the original MR parse: (ANSWER → answer(RIVER), [1..10]); (RIVER → TRAVERSE(STATE), [1..10]); (TRAVERSE → traverse, [7..10]); (STATE → NEXT_TO(STATE), [1..6]); (NEXT_TO → next_to, [1..5]); (STATE → STATEID, [6..6]); (STATEID → 'texas', [6..6])

53 Probability of a Semantic Derivation Let P_π(s[i..j]) be the probability that production π covers the substring s[i..j] of sentence s, e.g. P_{NEXT_TO → next_to}(“the states bordering”) = 0.99. These probabilities are obtained from the string-kernel-based SVM classifiers trained for each production π. Assuming independence, the probability of a semantic derivation D is the product of the probabilities of its nodes: P(D) = ∏_{(π, [i..j]) ∈ D} P_π(s[i..j])
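Under the independence assumption, computing a derivation’s probability is just a product over its nodes. A minimal sketch; the assignment of the slide’s illustrative probabilities to particular productions below is a hypothetical choice.

```python
import math

# Classifier probabilities for the nodes of one derivation of
# "Which rivers run through the states bordering Texas?"
# (illustrative values; the production -> value pairing is assumed).
node_probs = {
    "ANSWER -> answer(RIVER)":  0.98,
    "RIVER -> TRAVERSE(STATE)": 0.90,
    "TRAVERSE -> traverse":     0.95,
    "STATE -> NEXT_TO(STATE)":  0.89,
    "NEXT_TO -> next_to":       0.99,
    "STATE -> STATEID":         0.93,
    "STATEID -> 'texas'":       0.98,
}

def derivation_probability(probs):
    """P(D) = product of the per-node classifier probabilities."""
    return math.prod(probs.values())

print(derivation_probability(node_probs))
```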

54 Probability of a Semantic Derivation (contd.) Which rivers run through the states bordering Texas? The derivation’s probability is the product of the per-node classifier probabilities (0.98, 0.9, 0.89, 0.95, 0.99, 0.93, 0.98 in this example): (ANSWER → answer(RIVER), [1..9]); (RIVER → TRAVERSE(STATE), [1..9]); (TRAVERSE → traverse, [1..4]); (STATE → NEXT_TO(STATE), [5..9]); (NEXT_TO → next_to, [5..7]); (STATE → STATEID, [8..9]); (STATEID → 'texas', [8..9])

55 Computing the Most Probable Semantic Derivation The task of semantic parsing is to find the most probable semantic derivation of the NL sentence, given all the probabilities P_π(s[i..j]). Implemented by extending Earley’s [1970] context-free grammar parsing algorithm. Resembles PCFG parsing, but differs because: –The probability of a production depends on which substring of the sentence it covers –Leaves are not terminals but substrings of words

56 Computing the Most Probable Semantic Derivation (contd.) Does a greedy approximation search with beam width ω = 20, and returns the ω most probable derivations it finds. Uses a threshold θ = 0.05 to prune low-probability trees.

57 Overview of KRISP (contd.) In training, the classifiers’ probabilities P_π(s[i..j]) are used by the semantic parser to produce the best semantic derivations (correct and incorrect) of the training sentences, which feed back into collecting positive and negative examples for the next iteration.

58 KRISP’s Training Algorithm Takes NL sentences paired with their respective MRs as input. Obtains the MR parses. Induces the semantic parser using an SVM with a string subsequence kernel, and refines it over iterations. In the first iteration, for every production π: –Call those sentences positives whose MR parses use that production –Call the remaining sentences negatives
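The first-iteration split can be sketched directly: a sentence is a positive for a production iff that production appears in its MR parse. The sentence/production pairs below are hypothetical toy data in the Geoquery style.

```python
# Toy corpus: (sentence, set of productions used by its MR parse).
data = [
    ("which rivers run through the states bordering texas ?",
     {"STATE -> NEXT_TO(STATE)", "TRAVERSE -> traverse"}),
    ("what state has the highest population ?",
     {"STATE -> state(all)"}),
]

def split_examples(production, pairs):
    """First-iteration positives/negatives for one production."""
    positives = [s for s, prods in pairs if production in prods]
    negatives = [s for s, prods in pairs if production not in prods]
    return positives, negatives

pos, neg = split_examples("STATE -> NEXT_TO(STATE)", data)
print(len(pos), len(neg))  # 1 1
```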

59 Support Vector Machines An approach based on extending neural-network methods like the perceptron. Finds the linear separator that maximizes the margin between the classes. Grounded in computational learning theory, which explains why max-margin is a good approach (Vapnik, 1995). Good at avoiding over-fitting in high-dimensional feature spaces. Performs well on various text and language problems, which tend to be high-dimensional.

60 Picking a Linear Separator Which of the alternative linear separators is best?

61 Classification Margin Consider the distance of points from the separator. Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the width of separation between the classes.

62 SVM Algorithms Finding the max-margin separator is an optimization problem known as quadratic programming. Algorithms that guarantee an optimal margin take at least O(n²) time and do not scale well to large data sets. Approximation algorithms like SVM-light (Joachims, 1999) and SMO (Platt, 1999) allow scaling to realistic problems.

63 Kernels SVMs can be extended to learning non-linear separators by using kernel functions. A kernel function is a similarity function between two instances, K(x1, x2), that must satisfy certain mathematical constraints. A kernel function implicitly maps instances into a higher-dimensional feature space where (hopefully) the categories are linearly separable. A kernel-based method (like an SVM) can use a kernel to implicitly operate in this higher-dimensional space without having to explicitly map instances into this much larger (perhaps infinite) space (the “kernel trick”). Kernels can be defined on non-vector data like strings, trees, and graphs, allowing the application of kernel-based methods to complex, unbounded, non-vector data structures.

64 Non-linear SVMs: Feature Spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

65 String Subsequence Kernel Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002]. For example, with s = “states that are next to” and t = “the states next to”, the common subsequences are: states; next; to; states next; states to; next to; states next to. So K(s,t) = 7.
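The count above can be reproduced with a brute-force sketch that enumerates word subsequences of one string and checks which also occur in the other. This is a minimal, unweighted version; Lodhi et al.’s actual kernel also discounts subsequences by their gap lengths and is computed with dynamic programming rather than enumeration.

```python
from itertools import combinations

def subsequences(words):
    """All distinct non-empty (order-preserving) word subsequences."""
    return {tuple(c) for r in range(1, len(words) + 1)
            for c in combinations(words, r)}

def subseq_kernel(s, t):
    """Unweighted kernel: number of common word subsequences."""
    return len(subsequences(s.split()) & subsequences(t.split()))

print(subseq_kernel("states that are next to", "the states next to"))  # 7
```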

73 KRISP’s Training Algorithm (contd.) First iteration, for the production STATE → NEXT_TO(STATE). Positives: which rivers run through the states bordering texas?; what is the most populated state bordering oklahoma?; what is the largest city in states that border california?; … Negatives: what state has the highest population?; what states does the delaware river run through?; which states have cities named austin?; what is the lowest point of the state with the largest area?; … These are used to train a string-kernel-based SVM classifier.

74 String Subsequence Kernel The examples are implicitly mapped to the feature space of all subsequences, and the kernel computes the dot products there; phrases like “states bordering”, “states that border”, “states that share border”, and “the states next to” cluster together, while phrases like “states with area larger than”, “states through which”, and “state with the capital of” fall elsewhere.

75 Support Vector Machines SVMs find a separating hyperplane such that the margin is maximized, separating phrases like “states bordering”, “states that border”, and “the states next to” from phrases like “state with the capital of”, “states with area larger than”, and “states through which”. A probability estimate of an example belonging to a class can be obtained from its distance from the hyperplane [Platt, 1999], e.g. 0.97 for an example far on the positive side.

76 KRISP’s Training Algorithm (contd.) The string-kernel-based SVM classifier trained from these positives and negatives now provides the probability P_{STATE → NEXT_TO(STATE)}(s[i..j]) that any substring s[i..j] represents that concept.

79 KRISP’s Training Algorithm (contd.) Using these classifiers P_π(s[i..j]), obtain the ω best semantic derivations of each training sentence. Some of these derivations give the correct MR (correct derivations); others give incorrect MRs (incorrect derivations). For the next iteration, collect positives from the most probable correct derivation. The extended Earley’s algorithm can be forced to derive only correct derivations by making sure all subtrees it generates exist in the correct MR parse. Collect negatives from incorrect derivations with higher probability than the most probable correct derivation.

80 KRISP’s Training Algorithm (contd.) Most probable correct derivation of “Which rivers run through the states bordering Texas?”: (ANSWER → answer(RIVER), [1..9]); (RIVER → TRAVERSE(STATE), [1..9]); (TRAVERSE → traverse, [1..4]); (STATE → NEXT_TO(STATE), [5..9]); (NEXT_TO → next_to, [5..7]); (STATE → STATEID, [8..9]); (STATEID → 'texas', [8..9])

81 KRISP’s Training Algorithm (contd.) Collect positive examples from the most probable correct derivation: each production becomes a positive example for the substring its node covers, e.g. (TRAVERSE → traverse, [1..4]) yields “Which rivers run through” and (NEXT_TO → next_to, [5..7]) yields “the states bordering”.

82 KRISP’s Training Algorithm (contd.) Incorrect derivation with probability greater than the most probable correct derivation: (ANSWER → answer(RIVER), [1..9]); (RIVER → TRAVERSE(STATE), [1..9]); (TRAVERSE → traverse, [1..7]); (STATE → STATEID, [8..9]); (STATEID → 'texas', [8..9]) Incorrect MR: answer(traverse(stateid('texas')))

83 KRISP’s Training Algorithm (contd.) Collect negative examples from any incorrect derivation whose probability exceeds that of the most probable correct derivation.

84 KRISP’s Training Algorithm (contd.) Compare the most probable correct derivation with the incorrect derivation: traverse both trees in breadth-first order until the first nodes where their productions differ are found. Here the correct derivation has (TRAVERSE → traverse, [1..4]) and (STATE → NEXT_TO(STATE), [5..9]) where the incorrect derivation has (TRAVERSE → traverse, [1..7]) and (STATE → STATEID, [8..9]).

89 89 KRISP’s Training Algorithm contd. Which rivers run through the states bordering Texas? (ANSWER  answer(RIVER), [1..9]) (RIVER  TRAVERSE(STATE), [1..9]) (TRAVERSE  traverse, [1..7]) (STATE  STATEID, [8..9]) (STATEID  ‘ texas ’,[8..9]) Which rivers run through the states bordering Texas? (ANSWER  answer(RIVER), [1..9]) (RIVER  TRAVERSE(STATE), [1..9]) (TRAVERSE  traverse, [1..4]) (STATE  NEXT_TO (STATE), [5..9]) (STATE  STATEID, [8..9]) (STATEID  ‘ texas ’, [8..9]) (NEXT_TO  next_to, [5..7]) Most Probable Correct derivation: Incorrect derivation: Mark the words under these nodes.

91 91 Consider all the productions covering the marked words. Collect negatives for productions which cover any marked word in incorrect derivation but not in the correct derivation. 91 KRISP’s Training Algorithm contd. Which rivers run through the states bordering Texas? (ANSWER  answer(RIVER), [1..9]) (RIVER  TRAVERSE(STATE), [1..9]) (TRAVERSE  traverse, [1..7]) (STATE  STATEID, [8..9]) (STATEID  ‘ texas ’,[8..9]) Which rivers run through the states bordering Texas? (ANSWER  answer(RIVER), [1..9]) (RIVER  TRAVERSE(STATE), [1..9]) (TRAVERSE  traverse, [1..4]) (STATE  NEXT_TO (STATE), [5..9]) (STATE  STATEID, [8..9]) (STATEID  ‘ texas ’, [8..9]) (NEXT_TO  next_to, [5..7]) Most Probable Correct derivation: Incorrect derivation:

93 93 KRISP’s Training Algorithm contd. STATE  NEXT_TO(STATE) the states bordering texas? state bordering oklahoma ? states that border california ? states which share border next to state of iowa … what state has the highest population ? what states does the delaware river run through ? which states have cities named austin ? what is the lowest point of the state with the largest area ? which rivers run through states bordering … PositivesNegatives P STATE  NEXT_TO(STATE) (s[i..j]) String-kernel-based SVM classifier Next Iteration: more refined positive and negative examples
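KRISP's classifiers score how well a substring of the sentence represents a production using a string kernel. A minimal sketch, assuming a word-level kernel that simply counts common subsequences of a fixed length (KRISP's actual kernel is gap-weighted and normalized):

```python
def common_subseq_count(s, t, n):
    """Count pairs of positions picking out a common length-n subsequence
    of the word lists s and t (a simplified, unweighted string kernel)."""
    m, l = len(s), len(t)
    C = [[[0.0] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(m + 1):
        for j in range(l + 1):
            C[0][i][j] = 1.0  # the empty subsequence matches everywhere
    for p in range(1, n + 1):
        for i in range(1, m + 1):
            for j in range(1, l + 1):
                # inclusion-exclusion over dropping the last word of s or t
                C[p][i][j] = (C[p][i - 1][j] + C[p][i][j - 1]
                              - C[p][i - 1][j - 1])
                if s[i - 1] == t[j - 1]:
                    C[p][i][j] += C[p - 1][i - 1][j - 1]
    return C[n][m][l]

# A positive phrase for NEXT_TO shares subsequences with other positives
pos = common_subseq_count("the states bordering texas".split(),
                          "states bordering oklahoma".split(), 2)
neg = common_subseq_count("the states bordering texas".split(),
                          "what is the lowest point".split(), 2)
```

Here `pos` is 1.0 (the shared subsequence "states bordering") while `neg` is 0.0, so an SVM over this kernel separates the two phrase classes.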

94 94 Overview of KRISP Train string-kernel-based SVM classifiers Semantic Parser Collect positive and negative examples MRL Grammar NL sentences with MRs Novel NL sentences Best MRs Best semantic derivations (correct and incorrect) Training Testing P π (s[i..j])

95 95 WASP A Machine Translation Approach to Semantic Parsing Uses statistical machine translation techniques –Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005) –Word alignments (Brown et al., 1993; Och & Ney, 2003) Hence the name: Word Alignment-based Semantic Parsing

96 96 A Unifying Framework for Parsing and Generation Natural Languages Machine translation

97 97 A Unifying Framework for Parsing and Generation Natural Languages Formal Languages Semantic parsing Machine translation

98 98 A Unifying Framework for Parsing and Generation Natural Languages Formal Languages Semantic parsing Tactical generation Machine translation

99 99 A Unifying Framework for Parsing and Generation Natural Languages Formal Languages Semantic parsing Tactical generation Machine translation Synchronous Parsing

100 100 A Unifying Framework for Parsing and Generation Natural Languages Formal Languages Semantic parsing Tactical generation Machine translation Compiling: Aho & Ullman (1972) Synchronous Parsing

101 101 Synchronous Context-Free Grammars (SCFG) Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase Generates a pair of strings in a single derivation

102 102 QUERY  What is CITY CITY  the capital CITY CITY  of STATE STATE  Ohio Context-Free Semantic Grammar Ohio of STATE QUERY CITY What is CITY the capital

103 103 QUERY  What is CITY / answer(CITY) Productions of Synchronous Context-Free Grammars Natural languageFormal language

104 104 STATE  Ohio / stateid('ohio') QUERY  What is CITY / answer(CITY) CITY  the capital CITY / capital(CITY) CITY  of STATE / loc_2(STATE) What is the capital of Ohio Synchronous Context-Free Grammar Derivation Ohio of STATE QUERY CITY What is QUERY answer ( CITY ) capital ( CITY ) loc_2 ( STATE ) stateid ( 'ohio' ) answer(capital(loc_2(stateid('ohio')))) CITY the capital
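The synchronous derivation above can be sketched by expanding the leftmost occurrence of each rule's left-hand side on both sides in lockstep. A toy sketch using string substitution, not WASP's actual chart-based implementation:

```python
# Rules as (LHS, NL right-hand side, MR right-hand side); the two sides
# share nonterminals, so one derivation yields a paired (NL, MR) string.
rules = [
    ("QUERY", "What is CITY", "answer(CITY)"),
    ("CITY", "the capital CITY", "capital(CITY)"),
    ("CITY", "of STATE", "loc_2(STATE)"),
    ("STATE", "Ohio", "stateid('ohio')"),
]

def derive(rules, start="QUERY"):
    """Top-down, leftmost synchronous derivation over plain strings."""
    nl = mr = start
    for lhs, nl_rhs, mr_rhs in rules:
        nl = nl.replace(lhs, nl_rhs, 1)  # rewrite leftmost LHS on NL side
        mr = mr.replace(lhs, mr_rhs, 1)  # ... and in lockstep on MR side
    return nl, mr

nl, mr = derive(rules)
# nl == "What is the capital of Ohio"
# mr == "answer(capital(loc_2(stateid('ohio'))))"
```

The single sequence of rule applications produces the sentence and its meaning representation simultaneously, which is exactly what makes the same grammar usable for both parsing and generation.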

105 105 Probabilistic Parsing Model Ohio of STATE CITY capital ( CITY ) loc_2 ( STATE ) stateid ( 'ohio' ) capital CITY STATE  Ohio / stateid('ohio') CITY  capital CITY / capital(CITY) CITY  of STATE / loc_2(STATE) d1d1

106 106 Probabilistic Parsing Model Ohio of RIVER CITY capital ( CITY ) loc_2 ( RIVER ) riverid ( 'ohio' ) capital CITY RIVER  Ohio / riverid('ohio') CITY  capital CITY / capital(CITY) CITY  of RIVER / loc_2(RIVER) d2d2

107 107 Probabilistic Parsing Model Two competing derivations for "capital of Ohio": d 1 uses STATE  Ohio / stateid('ohio'), CITY  capital CITY / capital(CITY), CITY  of STATE / loc_2(STATE); d 2 uses RIVER  Ohio / riverid('ohio'), CITY  capital CITY / capital(CITY), CITY  of RIVER / loc_2(RIVER). Summing the rule weights λ gives 1.3 for d 1 and 1.05 for d 2, so Pr(d 1 | capital of Ohio) = exp(1.3) / Z and Pr(d 2 | capital of Ohio) = exp(1.05) / Z, where Z is a normalization constant.
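A sketch of this log-linear scoring; the per-rule weight split below is hypothetical, chosen only so that the two derivation scores come out to 1.3 and 1.05 as on the slide:

```python
import math

# Hypothetical rule weights (lambda); only the sums 1.3 and 1.05 are given
# on the slide, the individual values here are assumed for illustration.
weights = {
    "STATE -> Ohio / stateid('ohio')": 0.5,
    "RIVER -> Ohio / riverid('ohio')": 0.05,
    "CITY -> capital CITY / capital(CITY)": 0.3,
    "CITY -> of STATE / loc_2(STATE)": 0.5,
    "CITY -> of RIVER / loc_2(RIVER)": 0.7,
}
d1 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of STATE / loc_2(STATE)",
      "STATE -> Ohio / stateid('ohio')"]          # weights sum to 1.3
d2 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of RIVER / loc_2(RIVER)",
      "RIVER -> Ohio / riverid('ohio')"]          # weights sum to 1.05

def probs(derivations):
    """Pr(d | sentence) = exp(sum of rule weights in d) / Z."""
    scores = [math.exp(sum(weights[r] for r in d)) for d in derivations]
    z = sum(scores)  # normalization constant over competing derivations
    return [s / z for s in scores]

p1, p2 = probs([d1, d2])  # p1 > p2: "Ohio" is more likely a state here
```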

108 108 Overview of WASP Lexical acquisition Parameter estimation Semantic parsing Unambiguous CFG of MRL Training set, {(e,f)} Lexicon, L Parsing model parameterized by λ Input sentence, e' Output MR, f' Training Testing

109 109 Lexical Acquisition Transformation rules are extracted from word alignments between an NL sentence, e, and its correct MR, f, for each training example, (e, f)

110 110 Word Alignments A mapping from French words to their meanings expressed in English And the program has been implemented Le programme a été mis en application

111 111 Lexical Acquisition Train a statistical word alignment model (IBM Model 5) on training set Obtain most probable n-to-1 word alignments for each training example Extract transformation rules from these word alignments Lexicon L consists of all extracted transformation rules
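Once an n-to-1 alignment is fixed, the NL phrase linked to each MRL production can be read off directly. A sketch with a hypothetical alignment for the CLang example used on the following slides (the alignment itself would come from the trained word alignment model):

```python
from collections import defaultdict

sentence = "the goalie should always stay in our half".split()

# Hypothetical n-to-1 alignment: NL word -> the MRL production it is
# linked to (unaligned words like "the" and "always" are simply absent).
alignment = {
    "stay": "ACTION -> (pos REGION)",
    "in":   "ACTION -> (pos REGION)",
    "our":  "TEAM -> our",
    "half": "REGION -> (half TEAM)",
}

# Collect, for each production, the NL words aligned to it, in order.
phrases = defaultdict(list)
for word in sentence:
    if word in alignment:
        phrases[alignment[word]].append(word)
```

Each entry corresponds to a transformation rule on the following slides, e.g. `phrases["TEAM -> our"]` yields the rule TEAM  our / our, and `phrases["ACTION -> (pos REGION)"]` yields ACTION  stay in REGION / (pos REGION).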

112 112 Word Alignment for Semantic Parsing How to introduce syntactic tokens such as parens? ( ( true ) ( do our { 1 } ( pos ( half our ) ) ) ) The goalie should always stay in our half

113 113 Use of MRL Grammar The goalie should always stay in our half RULE  (CONDITION DIRECTIVE) CONDITION  (true) DIRECTIVE  (do TEAM {UNUM} ACTION) TEAM  our UNUM  1 ACTION  (pos REGION) REGION  (half TEAM) TEAM  our top-down, left-most derivation of an unambiguous CFG n-to-1

114 114 TEAM Extracting Transformation Rules The goalie should always stay in our half RULE  (CONDITION DIRECTIVE) CONDITION  (true) DIRECTIVE  (do TEAM {UNUM} ACTION) TEAM  our UNUM  1 ACTION  (pos REGION) REGION  (half TEAM) TEAM  our TEAM  our / our

115 115 REGION TEAM REGION  TEAM half / (half TEAM) Extracting Transformation Rules The goalie should always stay in half RULE  (CONDITION DIRECTIVE) CONDITION  (true) DIRECTIVE  (do TEAM {UNUM} ACTION) TEAM  our UNUM  1 ACTION  (pos REGION) REGION  (half TEAM) TEAM  our REGION  (half our)

116 116 ACTION ACTION  (pos (half our)) REGION ACTION  stay in REGION / (pos REGION) Extracting Transformation Rules The goalie should always stay in RULE  (CONDITION DIRECTIVE) CONDITION  (true) DIRECTIVE  (do TEAM {UNUM} ACTION) TEAM  our UNUM  1 ACTION  (pos REGION) REGION  (half our)

117 117 Based on maximum-entropy model: Features f i (d) are number of times each transformation rule is used in a derivation d Output translation is the yield of most probable derivation Probabilistic Parsing Model

118 118 Parameter Estimation Maximum conditional log-likelihood criterion Since correct derivations are not included in training data, parameters λ * are learned in an unsupervised manner EM algorithm combined with improved iterative scaling, where hidden variables are correct derivations (Riezler et al., 2000)

119 119 Experimental Corpora CLang –300 randomly selected pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition –22.52 words on average in NL sentences –14.24 tokens on average in formal expressions GeoQuery [Zelle & Mooney, 1996] –250 queries for the given U.S. geography database –6.87 words on average in NL sentences –5.32 tokens on average in formal expressions –Also translated into Spanish, Turkish, & Japanese.

120 120 Experimental Methodology Evaluated using standard 10-fold cross validation Correctness –CLang: output exactly matches the correct representation –Geoquery: the resulting query retrieves the same answer as the correct representation Metrics
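Both metrics reduce to simple ratios per test fold; a minimal sketch (the counts below are made up for illustration):

```python
def metrics(n_correct, n_with_output, n_total):
    """Precision: fraction of produced parses that are correct.
    Recall: fraction of all test sentences parsed correctly."""
    precision = n_correct / n_with_output
    recall = n_correct / n_total
    return precision, recall

# e.g. 24 correct MRs out of 25 produced, on a 30-sentence test fold
p, r = metrics(24, 25, 30)  # p = 0.96, r = 0.8
```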

121 121 Precision Learning Curve for CLang

122 122 Recall Learning Curve for CLang

123 123 Precision Learning Curve for GeoQuery

124 124 Recall Learning Curve for Geoquery

125 125 Precision Learning Curve for GeoQuery (WASP)

126 126 Recall Learning Curve for GeoQuery (WASP)

127 127 λWASP Logical forms can be made more isomorphic to NL sentences than FunQL and allow for better compositionality and generalization. Version of WASP that uses λ calculus to introduce and bind logical variables. –Standard in compositional formal semantics, e.g. Montague semantics. Modify SCFG to λ-SCFG

128 SCFG Derivations QUERY

129 QUERY  What is FORM / answer(x 1,FORM) SCFG Derivations QUERY FORMWhat is QUERY answer(x 1, FORM )

130 FORM  the smallest FORM FORM / smallest(x 2,(FORM,FORM)) SCFG Derivations QUERY FORMWhat is QUERY answer(x 1, FORM ) FORMthe smallestFORMsmallest(x 2,( FORM, FORM ))

131 FORM  state / state(x 1 ) SCFG Derivations QUERY FORMWhat is QUERY answer(x 1, FORM )state(x 1 ) FORMthe smallestFORM state smallest(x 2,( FORM, FORM ))

132 FORM  by area / area(x 1,x 2 ) SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM )state(x 1 ) FORMthe smallestFORM state smallest(x 2,( FORM, FORM )) area(x 1,x 2 )

133 What is the smallest state by area SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM )state(x 1 ) FORMthe smallestFORM state smallest(x 2,( FORM, FORM )) area(x 1,x 2 ) answer(x 1,smallest(x 2,(state(x 1 ),area(x 1,x 2 ))))

134 What is the smallest state by area SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM )state(x 1 ) FORMthe smallestFORM state smallest(x 2,( FORM, FORM )) area(x 1,x 2 ) answer(x 1,smallest(x 2,(state(x 1 ),area(x 1,x 2 )))) ???

135 What is the smallest state by area λ-SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM ) λx 1.state(x 1 ) FORMthe smallestFORM state λx 1.smallest(x 2,( FORM, FORM )) λx 1.λx 2.area(x 1,x 2 ) answer(x 1,smallest(x 2,(state(x 1 ),area(x 1,x 2 ))))

136 What is the smallest state by area λ-SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM(x 1 ) ) λx 1.state(x 1 ) FORMthe smallestFORM state λx 1.smallest(x 2,( FORM(x 1 ), FORM(x 1,x 2 ) )) λx 1.λx 2.area(x 1,x 2 ) answer(x 1,smallest(x 2,(state(x 1 ),area(x 1,x 2 ))))

138 138 FORM  smallest FORM FORM / λ-SCFG Production Rules NL string: MR string: λx 1.smallest(x 2,( FORM(x 1 ), FORM(x 1,x 2 ) )) Variable-binding λ-operator: Binds occurrences of x 1 in the MR string Argument lists: For function applications

139 What is the smallest state by area Yield of λ-SCFG Derivations by area QUERY FORMWhat is QUERY answer(x 1, FORM(x 1 ) ) λx 1.state(x 1 ) FORMthe smallestFORM state λx 1.smallest(x 2,( FORM(x 1 ), FORM(x 1,x 2 ) )) λx 1.λx 2.area(x 1,x 2 ) answer(x 1,smallest(x 2,(state(x 1 ),area(x 1,x 2 )))) ???

140 Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) ) λx 1.state(x 1 ) λx 2.smallest(x 1,( FORM(x 2 ), FORM(x 2,x 1 ) )) λx 1.λx 2.area(x 1,x 2 )

141 λx 1.state(x 1 ) λx 1.λx 2.area(x 1,x 2 ) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) ) λx 2.smallest(x 1,( FORM(x 2 ), FORM(x 2,x 1 ) )) Lambda functions

142 λx 2.smallest(x 1,( (λx 1.state(x 1 ))(x 2 ), (λx 1.λx 2.area(x 1,x 2 ))(x 2,x 1 ) )) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) )

143 λx 2.smallest(x 1,( (λx 1.state(x 1 ))(x 2 ), (λx 1.λx 2.area(x 1,x 2 ))(x 2,x 1 ) )) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) ) Function application: Replace bound occurrences of x 1 with x 2 (λx 1.f(x 1 ))(x 2 ) = f(x 2 )

144 λx 2.smallest(x 1,( state(x 2 ), (λx 1.λx 2.area(x 1,x 2 ))(x 2,x 1 ) )) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) )

145 λx 2.smallest(x 1,( state(x 2 ), area(x 2,x 1 ) )) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) )

146 λx 2.smallest(x 1,( state(x 2 ), area(x 2,x 1 ) )) Computing Yield with Lambda Calculus QUERY answer(x 1, FORM(x 1 ) ) Lambda function

147 Computing Yield with Lambda Calculus QUERY answer(x 1, λx 2.smallest(x 1,(state(x 2 ),area(x 2,x 1 )))(x 1 ) )

148 Computing Yield with Lambda Calculus QUERY answer(x 1, smallest(x 3,(state(x 1 ),area(x 1,x 3 ))) )

150 Computing Yield with Lambda Calculus answer(x 1,smallest(x 3,(state(x 1 ),area(x 1,x 3 )))) Logical form free of λ-operators with logical variables properly named
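The β-reductions in the slides above can be mimicked with Python lambdas over variable-name strings. Note the inner bound variable is renamed to x3 by hand here to avoid capturing the outer x1; λWASP performs this renaming automatically when computing the yield:

```python
# Translations of the FORM rules as functions over variable-name strings
state = lambda x1: f"state({x1})"
area = lambda x1, x2: f"area({x1},{x2})"

# lambda x1.smallest(x2,(FORM(x1), FORM(x1,x2))), with both argument FORMs
# already applied; the bound variable is renamed x1 -> x3 to avoid capture.
smallest = lambda x2: f"smallest(x3,({state(x2)},{area(x2, 'x3')}))"

# QUERY -> answer(x1, FORM(x1)): apply the remaining lambda function to x1
yield_mr = f"answer(x1,{smallest('x1')})"
# yield_mr == "answer(x1,smallest(x3,(state(x1),area(x1,x3))))"
```

The result is the λ-free logical form of the final slide, with logical variables properly named.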

151 Learning in λWASP The induction of SCFG rules must be updated to introduce λ functions, producing a λ-SCFG.

152 152 λWASP Results on Geoquery 152

153 153 Tactical Natural Language Generation Mapping a formal MR into NL Can be done using statistical machine translation –Previous work focuses on using generation in interlingual MT (Hajič et al., 2004) –There has been little, if any, research on exploiting statistical MT methods for generation

154 154 Tactical Generation Can be seen as inverse of semantic parsing ((true) (do our {1} (pos (half our)))) The goalie should always stay in our half Semantic parsing Tactical generation

155 155 Generation by Inverting WASP The same synchronous grammar is used for both generation and semantic parsing QUERY  What is CITY / answer(CITY) Semantic parsing: NL is the input, MRL the output; tactical generation reverses the two.

156 156 Generation by Inverting WASP Same procedure for lexical acquisition Chart generator very similar to chart parser, but treats MRL as input Log-linear probabilistic model inspired by Pharaoh (Koehn et al., 2003), a phrase- based MT system Uses a bigram language model for target NL Resulting system is called WASP -1
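A minimal sketch of the bigram language model used to rank candidate NL outputs; the toy corpus and add-one smoothing are assumptions for illustration, not WASP -1 's actual model or training data:

```python
import math
from collections import Counter

corpus = ["the goalie should always stay in our half",
          "our players should stay in our half"]

# Collect bigram and context counts, with a start-of-sentence marker.
bigrams, contexts, vocab = Counter(), Counter(), set()
for line in corpus:
    words = ["<s>"] + line.split()
    vocab.update(words)
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a candidate NL string."""
    words = ["<s>"] + sentence.split()
    return sum(math.log((bigrams[(a, b)] + 1)
                        / (contexts[a] + len(vocab)))
               for a, b in zip(words, words[1:]))

fluent = log_prob("the goalie should stay in our half")
scrambled = log_prob("half our in stay should goalie the")
```

The fluent candidate scores higher than the scrambled one, which is how the language model steers the chart generator toward natural word order.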

157 157 Geoquery (NIST score; English)

158 158 RoboCup (NIST score; English; contiguous phrases only) Similar human evaluation results in terms of fluency and adequacy

159 Conclusions Semantic parsing maps NL sentences to completely formal MRs. Semantic parsers can be effectively learned from supervised corpora consisting of only sentences paired with their formal MRs (and possibly also SAPTs). Learning methods can be based on: –Adding semantics to an existing statistical syntactic parser and then using compositional semantics. –Using SVM with string kernels to recognize concepts in the NL and then composing them into a complete MR using the MRL grammar. –Using probabilistic synchronous context-free grammars to learn an NL/MR grammar that supports both semantic parsing and generation.

