Presentation transcript: "Probabilistic Context Free Grammar"

1

2 Probabilistic Context Free Grammar

3 Language structure is not linear: "The velocity of seismic waves rises to…"

4 Context free grammars – a reminder. A CFG G consists of –
- A set of terminals {w_k}, k = 1, …, V
- A set of nonterminals {N_i}, i = 1, …, n
- A designated start symbol, N_1
- A set of rules, {N_i → π_j} (where π_j is a sequence of terminals and nonterminals)

5 A very simple example. G's rewrite rules –
S → aSb
S → ab
Possible derivations –
S → aSb → aabb
S → aSb → aaSbb → aaabbb
In general, G creates the language aⁿbⁿ.
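As a side note, here is a minimal Python sketch (not from the slides) of this toy grammar: derive(n) expands S with S → aSb (n - 1) times and then closes with S → ab, and in_language checks membership in aⁿbⁿ.

def derive(n):
    """Derive a^n b^n by applying S -> aSb (n - 1 times) and then S -> ab."""
    s = "S"
    for _ in range(n - 1):
        s = s.replace("S", "aSb")
    return s.replace("S", "ab")

def in_language(s):
    """Check membership in {a^n b^n : n >= 1}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "a" * n + "b" * n

assert derive(3) == "aaabbb"
assert in_language("aaabbb") and not in_language("aabbb")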

6 Modeling natural language. G is given by the rewrite rules –
S → NP VP
NP → the N | a N
N → man | boy | dog
VP → V NP
V → saw | heard | sensed | sniffed

7 Recursion can be included. G is given by the rewrite rules –
S → NP VP
NP → the N | a N
N → man CP | boy CP | dog CP
VP → V NP
V → saw | heard | sensed | sniffed
CP → that VP | ε

8 Probabilistic Context Free Grammars. A PCFG G consists of –
- A set of terminals {w_k}, k = 1, …, V
- A set of nonterminals {N_i}, i = 1, …, n
- A designated start symbol, N_1
- A set of rules, {N_i → π_j} (where π_j is a sequence of terminals and nonterminals)
- A corresponding set of probabilities on rules, such that for each nonterminal N_i the probabilities of its rules sum to 1

9 Example

10 astronomers saw stars with ears. P(t_1) = 0.0009072

11 astronomers saw stars with ears. P(t_2) = 0.0006804; P(w_15) = P(t_1) + P(t_2) = 0.0015876
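These numbers can be reproduced with a few lines of Python. The rule probabilities below are not shown in this transcript (the grammar itself was on the slide); they are assumed to be the standard textbook grammar for this example. A tree's probability is simply the product of the probabilities of the rules it uses.

from functools import reduce

# Assumed rule probabilities (standard 'astronomers saw stars with ears' example grammar).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18,
    ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18,
    ("NP", ("telescopes",)): 0.1,
}

def tree_prob(rules_used):
    """Probability of a parse tree = product of the probabilities of the rules it uses."""
    return reduce(lambda p, r: p * rule_prob[r], rules_used, 1.0)

# t1: the PP 'with ears' attaches to the object NP 'stars'.
t1 = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("V", "NP")), ("V", ("saw",)),
      ("NP", ("NP", "PP")), ("NP", ("stars",)), ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]

# t2: the PP attaches to the VP 'saw stars'.
t2 = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("VP", "PP")), ("VP", ("V", "NP")),
      ("V", ("saw",)), ("NP", ("stars",)), ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]

print(tree_prob(t1))                  # ≈ 0.0009072
print(tree_prob(t2))                  # ≈ 0.0006804
print(tree_prob(t1) + tree_prob(t2))  # ≈ 0.0015876 = P(w_15)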

12 Training PCFGs. Given a corpus, it is possible to estimate rule probabilities so as to maximize the corpus likelihood. This is regarded as a form of 'grammar induction'. However, the rules of the grammar must be pre-given.

13 Questions for PCFGs
- What is the probability of a sentence w_1n given a grammar G, i.e. P(w_1n | G)? Calculated using dynamic programming.
- What is the most likely parse for a given sentence, argmax_t P(t | w_1n, G)? Likewise calculated using dynamic programming.
- How can we choose rule probabilities for the grammar G that maximize the probability of a given corpus? The inside-outside algorithm.

14 Chomsky Normal Form. We will be dealing only with PCFGs of the above-mentioned form. That means that there are exactly two types of rules –
N_i → N_j N_k
N_i → w_j

15 Estimating string probability. Define 'inside probabilities'; we would like to calculate P(w_1n | G) using a dynamic programming algorithm. The definition and the base step are given below.
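In the standard formulation for a PCFG in Chomsky Normal Form (assumed here, following the usual textbook treatment), the inside probability

\beta_j(p, q) = P(N_j \Rightarrow^* w_p \dots w_q \mid G)

is the probability that nonterminal N_j derives the words w_p … w_q, and the quantity we want is P(w_1n | G) = \beta_1(1, n). The base step covers spans of a single word:

\beta_j(k, k) = P(N_j \to w_k)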

16 Estimating string probability. The induction step is given below.
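Again in the standard (assumed) formulation for Chomsky Normal Form, the induction step sums over all pairs of child nonterminals and all split points d of the span:

\beta_j(p, q) = \sum_{r, s} \sum_{d = p}^{q - 1} P(N_j \to N_r N_s)\, \beta_r(p, d)\, \beta_s(d + 1, q)

and the string probability is then obtained as P(w_1n | G) = \beta_1(1, n).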

17 Drawbacks of PCFGs
- Do not factor in lexical co-occurrence.
- Rewrite rules must be pre-given according to human intuitions (the ATIS-CFG fiasco).
- The capacity of a PCFG to determine the most likely parse is very limited: as grammars grow larger, they become increasingly ambiguous. The following sentences look the same to a PCFG, although they suggest different parses:
  I saw the boat with the telescope
  I saw the man with the scar

18 PCFGs – some more drawbacks
- Have some inappropriate biases: in general, the probability of a smaller tree will be larger than that of a larger one, whereas the most frequent length for Wall Street Journal sentences is around 23 words.
- Training is slow and problematic, and converges only to a local optimum.
- Non-terminals do not always resemble true syntactic classes.

19 PCFGs and language models. Because they ignore lexical co-occurrence, PCFGs are not good as language models. However, some work has been done on combining PCFGs with n-gram models: the PCFGs modeled long-range syntactic constraints, and performance generally improved.

20 Is natural language a CFG? There is an ongoing debate about the CFG-ness of English. Some languages can be shown to be more complex than context-free, for example Dutch –

21 Dutch oddities
Dat Jan Marie Pieter Arabisch laat zien schrijven
THAT JAN MARIE PIETER ARABIC LET SEE WRITE
"that Jan let Marie see Pieter write Arabic"
However, from a purely syntactic viewpoint, this is just – dat Pⁿ Vⁿ

22 Other languages. Bambara (a Malian language) has non-CF features, in the form of AⁿBᵐCⁿDᵐ; Swiss German does as well. However, CFGs seem to be a good approximation for most phenomena in most languages.

23 Grammar Induction With ADIOS (“Automatic DIstillation Of Structure”)

24 Previous work
- Probabilistic Context Free Grammars
- 'Supervised' induction methods
- Little work on raw data
- Mostly work on artificial CFGs
- Clustering

25 Our goal. Given a corpus of raw text separated into sentences, we want to derive a specification of the underlying grammar. This means we want to be able to:
- Create new unseen grammatically correct sentences
- Accept new unseen grammatically correct sentences and reject ungrammatical ones

26 What do we need to do? G is given by the rewrite rules –
S → NP VP
NP → the N | a N
N → man | boy | dog
VP → V NP
V → saw | heard | sensed | sniffed

27 ADIOS in outline. Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability
We will consider each of these in turn.

28 The Model: graph representation with words as vertices and sentences as paths (figure). The corpus sentences "Is that a dog?", "Is that a cat?", "Where is the dog?" and "And is that a horse?" are each loaded as a path of word vertices running from a BEGIN node to an END node, so shared words become shared vertices.

29 ADIOS in outline. Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

30 Toy problem – Alice in Wonderland a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n

31 Detecting significant patterns. Identifying patterns becomes easier on a graph: sub-paths are automatically aligned.

32 Motif EXtraction

33 The Markov Matrix. The top right triangle defines the P_L probabilities, the bottom left triangle the P_R probabilities. The matrix is path-dependent.
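The exact MEX quantities appear on slides not reproduced here, so the following Python sketch is only an assumed reading, based on the published description of MEX: for a path e_1 … e_k, P_R(e_i; e_j) is the fraction of corpus paths that, having traversed e_i … e_{j-1}, continue through e_j, and P_L is the mirror-image quantity computed leftward. Sharp drops in these probabilities mark candidate pattern boundaries.

from collections import Counter

def subpath_counts(corpus):
    """Count occurrences of each contiguous sub-path (tuple of vertices) across the corpus."""
    counts = Counter()
    for path in corpus:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                counts[tuple(path[i:j])] += 1
    return counts

def p_right(counts, path, i, j):
    """P_R(e_i; e_j): probability of continuing to e_j after traversing e_i .. e_{j-1}."""
    prefix, full = tuple(path[i:j]), tuple(path[i:j + 1])
    return counts[full] / counts[prefix] if counts[prefix] else 0.0

def p_left(counts, path, j, i):
    """P_L(e_j; e_i): probability that the sub-path e_{i+1} .. e_j is preceded by e_i."""
    suffix, full = tuple(path[i + 1:j + 1]), tuple(path[i:j + 1])
    return counts[full] / counts[suffix] if counts[suffix] else 0.0

corpus = [["is", "that", "a", "dog"], ["is", "that", "a", "cat"],
          ["where", "is", "the", "dog"], ["and", "is", "that", "a", "horse"]]
counts = subpath_counts(corpus)
path = ["is", "that", "a", "dog"]
print(p_right(counts, path, 0, 1))  # 'is' -> 'that': 3/4
print(p_right(counts, path, 0, 3))  # 'is that a' -> 'dog': 1/3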

34

35 Example of a probability matrix

36 Rewiring the graph. Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.

37 MEX at work

38 ALICE motifs (Motif / Weight / Occurrences / Length):
curious 1.00 19 6
hadbeen 1.00 17 6
however 1.00 20 6
perhaps 1.00 16 6
hastily 1.00 16 6
herself 1.00 78 6
footman 1.00 14 6
suppose 1.00 12 6
silence 0.99 14 6
witness 0.99 10 6
gryphon 0.97 54 6
serpent 0.97 11 6
angrily 0.97 8 6
croquet 0.97 8 6
venture 0.95 12 6
forsome 0.95 12 6
timidly 0.95 9 6
whisper 0.95 9 6
rabbit 1.00 27 5
course 1.00 25 5
eplied 1.00 22 5
seemed 1.00 26 5
remark 1.00 28 5

39 ADIOS in outline. Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

40 Generalization

41 Bootstrapping

42 Determining L. Involves a tradeoff:
- A larger L will demand more context sensitivity in the inference, which will hamper generalization.
- A smaller L will detect more patterns, but many might be spurious.

43 The ADIOS algorithm
Initialization – load all data into a pseudograph.
Until no more patterns are found:
  For each path P:
    Create generalized search paths from P
    Detect significant patterns using MEX
    If found, add the best new pattern and its equivalence classes and rewire the graph
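A minimal, runnable Python sketch of this outer loop, under strong simplifying assumptions: the generalization step is omitted, and MEX is replaced by a crude stand-in that promotes the most frequent bigram whose count exceeds a threshold. It is meant only to show the control flow (detect a pattern, create a pattern vertex, rewire), not the real significance criterion.

from collections import Counter
from itertools import count

def find_pattern(paths, min_count=2):
    """Crude stand-in for MEX: return the most frequent bigram, if frequent enough."""
    bigrams = Counter()
    for p in paths:
        bigrams.update(zip(p, p[1:]))
    if not bigrams:
        return None
    pattern, n = bigrams.most_common(1)[0]
    return pattern if n >= min_count else None

def rewire(paths, pattern, new_symbol):
    """Merge every occurrence of the pattern's sub-path into a single new vertex."""
    a, b = pattern
    rewired = []
    for p in paths:
        out, i = [], 0
        while i < len(p):
            if i + 1 < len(p) and p[i] == a and p[i + 1] == b:
                out.append(new_symbol)
                i += 2
            else:
                out.append(p[i])
                i += 1
        rewired.append(out)
    return rewired

def adios_outline(sentences):
    """Initialization: load all sentences as paths; then iterate until no pattern is found."""
    paths = [s.split() for s in sentences]
    fresh = (f"P{k}" for k in count(1))      # names for new pattern vertices
    patterns = {}
    while True:
        pattern = find_pattern(paths)
        if pattern is None:
            break
        name = next(fresh)
        patterns[name] = pattern
        paths = rewire(paths, pattern, name)
    return paths, patterns

paths, patterns = adios_outline(["is that a dog", "is that a cat", "where is the dog"])
print(patterns)   # e.g. {'P1': ('is', 'that'), 'P2': ('P1', 'a')}
print(paths)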

44 1205 567321120132234621987 321 234987 1203 567321120132 234 621 987 2000 321234 987 1203 3211203 234987 1204 987 2001 1204 The Model : The training process

45 1205 567321120132234621987 1203 567120132621 2000 321 1203 1204 987 2001 1204

46 1205 567 321 120132234621987 567 120132621 2000 321 1203 987 2001 1204

47 Example

48 More Patterns

49 Evaluating performance. In principle, we would like to compare ADIOS-generated parse trees with the true parse trees for given sentences. Alas, the 'true parse trees' are a matter of opinion, and some approaches do not even assume parse trees.

50 Evaluating performance. Define:
- Recall – the probability of ADIOS recognizing an unseen grammatical sentence
- Precision – the proportion of grammatical ADIOS productions
Recall can be assessed by leaving out some of the training corpus. Precision is trickier, unless we are learning a known CFG.
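Assuming a trained learner object with an accepts() method and a list of generated sentences (a hypothetical interface; the slides do not define one), these two measures could be estimated along the following lines in Python:

def held_out_recall(learner, held_out_sentences):
    """Fraction of unseen grammatical sentences that the learner accepts."""
    return sum(1 for s in held_out_sentences if learner.accepts(s)) / len(held_out_sentences)

def production_precision(productions, judged_grammatical):
    """Proportion of learner-generated sentences judged grammatical (e.g. by human raters)."""
    return sum(1 for s in productions if judged_grammatical(s)) / len(productions)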

51 The ATIS experiments. ATIS-NL is a 13,043-sentence corpus of natural language: transcribed phone calls to an airline reservation service. ADIOS was trained on 12,700 sentences of ATIS-NL; the remaining 343 sentences were used to assess recall. Precision was determined with the help of 8 graduate students from Cornell University.

52 The ATIS experiments. ADIOS' performance scores –
Recall – 40%
Precision – 70%
For comparison, ATIS-CFG reached –
Recall – 45%
Precision – <1% (!)

53 ADIOS/ATIS-N comparison

54 An ADIOS drawback. ADIOS is inherently a heuristic and greedy algorithm. Once a pattern is created it remains forever, so errors accumulate. Sentence ordering affects the outcome: running ADIOS with different orderings gives patterns that 'cover' different parts of the grammar.

55 An ad-hoc solution. Train multiple learners on the corpus, each on a different sentence ordering, creating a 'forest' of learners.
To create a new sentence: pick one learner at random and use it to produce the sentence.
To check grammaticality of a given sentence: if any learner accepts the sentence, declare it grammatical.
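A small Python sketch of this ensemble scheme, assuming a learner factory and trained-learner objects with produce() and accepts() methods (a hypothetical interface, not specified on the slides):

import random

def train_forest(make_learner, sentences, n_learners=10, seed=0):
    """Train several learners, each on a different random ordering of the corpus."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_learners):
        ordering = sentences[:]
        rng.shuffle(ordering)
        forest.append(make_learner(ordering))
    return forest

def generate_sentence(forest, rng=random):
    """Create a new sentence: pick one learner at random and let it produce a sentence."""
    return rng.choice(forest).produce()

def is_grammatical(forest, sentence):
    """Declare a sentence grammatical if any learner in the forest accepts it."""
    return any(learner.accepts(sentence) for learner in forest)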

56 The effects of context window width

57 Meta-analysis of ADIOS results. Define a pattern spectrum as the histogram of pattern types for an individual learner; a pattern type is determined by its contents, e.g. TT, TET, EE, PE… A single ADIOS learner was trained on each of 6 translations of the Bible.
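A short Python sketch of computing such a spectrum. It assumes (an interpretation not spelled out in this transcript) that T, E and P stand for terminal, equivalence class and pattern, and that each extracted pattern is available as a sequence of such element tags:

from collections import Counter

def pattern_spectrum(patterns):
    """Histogram of pattern types, where a type is the concatenation of its element tags."""
    return Counter("".join(p) for p in patterns)

# Toy input: 'T' = terminal, 'E' = equivalence class, 'P' = pattern.
patterns = [["T", "T"], ["T", "E", "T"], ["E", "E"], ["P", "E"], ["T", "T"]]
print(pattern_spectrum(patterns))   # Counter({'TT': 2, 'TET': 1, 'EE': 1, 'PE': 1})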

58 Pattern spectra

59 Language dendrogram

60 To be continued…

