1 Sentence Processing using a Simple Recurrent Network EE 645 Final Project Spring 2003 Dong-Wan Kang 5/14/2003

2 Contents
1. Introduction - Motivations
2. Previous & Related Works
   a) McClelland & Kawamoto (1986)
   b) Elman (1990, 1993, & 1999)
   c) Miikkulainen (1996)
3. Algorithms (Williams and Zipser, 1989) - Real Time Recurrent Learning
4. Simulations
5. Data & Encoding Schemes
6. Results
7. Discussion & Future Work

3 Motivations
Can a neural network recognize the lexical classes in sentences and learn various types of sentences?
From a cognitive science perspective:
- comparison between human language learning and neural network learning patterns
- e.g., learning the English past tense (Rumelhart & McClelland, 1986), grammaticality judgment (Allen & Seidenberg, 1999), embedded sentences (Elman, 1993; Miikkulainen, 1996; etc.)

4 Related Works
McClelland & Kawamoto (1986)
- Sentences with case-role assignments and semantic features, learned with the backpropagation algorithm
- Output: 2,500 case-role units for each sentence
- e.g., input: "the boy hit the wall with the ball"
  output: [Agent Verb Patient Instrument] + [other features]
- Limitation: imposes a hard limit on the size of the input.
- Alternative: instead of detecting input patterns displaced in space, detect patterns displaced in time (sequential inputs).

5 Related Works (continued)
Elman (1990, 1993, & 1999)
- Simple Recurrent Network: a partially recurrent network using context units
- A network with a dynamic memory
- The context units at time t hold a copy of the hidden unit activations from the previous time step (t-1).
- The network can recognize sequences.
  input:  Many years ago boy and girl ...
  output: years ago boy and girl ...

6 Related Works (continued)
Miikkulainen (1996)
- SPEC architecture (Subsymbolic Parser for Embedded Clauses), a recurrent network
- Parser, Segmenter, and Stack: process center- and tail-embedded sentences; 98,100 sentences with 49 different sentence structures, using case-role assignments
- e.g., sequential inputs
  input:  ..., the girl, who, liked, the dog, saw, the boy, ...
  output: ..., [the girl, saw, the boy] [the girl, liked, the dog]
  case roles: (agent, act, patient) (agent, act, patient)

7 Algorithms
Recurrent networks
- Unlike feedforward networks, they allow connections in both directions between a pair of units, and even from a unit to itself.
- Backpropagation Through Time (BPTT): unfolds the temporal operation of the network into a layered feedforward network, one layer per time step (Rumelhart et al., 1986).
- Real Time Recurrent Learning (RTRL), two versions (Williams and Zipser, 1989):
  1) update the weights after processing of the sequence is completed;
  2) on-line: update the weights while the sequence is being presented.
- Simple Recurrent Network (SRN): a partially recurrent network in terms of time and space. It has context units which store the outputs of the hidden units (Elman, 1990). (It can be obtained as a modification of the RTRL algorithm.)

8 Real Time Recurrent Learning
Williams and Zipser (1989)
- This algorithm computes the derivatives of states and outputs with respect to all weights as the network processes the sequence.

Summary of the algorithm: In a recurrent network where any unit may be connected to any other, let $y_k(t)$ be the output of unit $k$ and $x_i(t)$ the external input at node $i$ at time $t$, with $z_l(t)$ denoting either an input $x_l(t)$ or a unit output $y_l(t)$. The dynamic update rule is:

$$ y_k(t+1) = f_k\big(s_k(t)\big), \qquad s_k(t) = \sum_{l} w_{kl}\, z_l(t) $$

9 RTRL (continued)
Error measure, with target outputs $d_k(t)$ defined for some $k$'s and $t$'s:

$$ e_k(t) = d_k(t) - y_k(t) \ \text{ if } d_k(t) \text{ is defined at time } t; \quad \text{otherwise } e_k(t) = 0 $$

Total cost function over $t = 0, 1, \ldots, T$:

$$ J_{\text{total}} = \sum_{t=0}^{T} J(t), \qquad \text{where } J(t) = \tfrac{1}{2} \sum_{k} e_k(t)^2 $$

10 RTRL (continued)
The gradient of the total error separates in time. To do gradient descent, we define the sensitivities:

$$ p^{k}_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}} $$

The derivative of the update rule gives the recursion:

$$ p^{k}_{ij}(t+1) = f'_k\big(s_k(t)\big)\left[\sum_{l} w_{kl}\, p^{l}_{ij}(t) + \delta_{ik}\, z_j(t)\right] $$

with initial condition $p^{k}_{ij}(0) = 0$ at $t = 0$. The gradient-descent weight change at time $t$ is

$$ \Delta w_{ij}(t) = -\eta\, \frac{\partial J(t)}{\partial w_{ij}} = \eta \sum_{k} e_k(t)\, p^{k}_{ij}(t) $$

11 RTRL (continued)
Depending on how the weights are updated, there are two versions of RTRL:
1) Update the weights after the sequence is completed (at t = T).
2) Update the weights after each time step (on-line).
Elman's "tlearn" simulator for the Simple Recurrent Network (used in this project) is implemented with the classical backpropagation algorithm and a modification of this RTRL algorithm.
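
To make the on-line version concrete, here is a minimal NumPy sketch of RTRL for a small fully recurrent network. It only illustrates the algorithm summarized on the previous slides and is not the tlearn implementation; the function name, the logistic activation, and the NaN convention for undefined targets are assumptions.

```python
import numpy as np

def rtrl_online(inputs, targets, n_units, lr=0.1, seed=0):
    """On-line RTRL for a small fully recurrent network (sketch).

    inputs:  sequence of input vectors x(t), each of length n_in
    targets: sequence of target vectors d(t), length n_units,
             with np.nan where no target is defined at that time step
    """
    rng = np.random.default_rng(seed)
    n_in = len(inputs[0])
    n_z = n_units + n_in                       # z(t) = [y(t), x(t)]
    W = rng.normal(0.0, 0.1, (n_units, n_z))   # weights w_kl
    y = np.zeros(n_units)                      # unit outputs y_k(t)
    p = np.zeros((n_units, n_units, n_z))      # p^k_ij = dy_k / dw_ij, p(0) = 0

    f = lambda s: 1.0 / (1.0 + np.exp(-s))     # logistic activation
    for x, d in zip(inputs, targets):
        z = np.concatenate([y, x])
        y_new = f(W @ z)                       # y_k(t+1) = f(sum_l w_kl z_l(t))
        fprime = y_new * (1.0 - y_new)         # f'(s_k(t)) for the logistic

        # p^k_ij(t+1) = f'(s_k) * [ sum_l w_kl p^l_ij(t) + delta_ik * z_j(t) ]
        p_new = np.einsum('kl,lij->kij', W[:, :n_units], p)
        p_new[np.arange(n_units), np.arange(n_units), :] += z
        p_new *= fprime[:, None, None]

        # e_k(t) = d_k(t) - y_k(t) where a target is defined, else 0
        e = np.where(np.isnan(d), 0.0, np.asarray(d) - y_new)

        # on-line weight change: delta w_ij = lr * sum_k e_k(t) p^k_ij(t)
        W += lr * np.einsum('k,kij->ij', e, p_new)

        y, p = y_new, p_new
    return W
```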

12 Simulation
Based on Elman's data and Simple Recurrent Network (1990, 1993, & 1999), simple sentences and embedded sentences are simulated using the "tlearn" neural network program (BP + a modified version of the RTRL algorithm), available at http://crl.ucsd.edu/innate/index.shtml.
Questions:
1. Can the network discover the lexical classes from word order?
2. Can the network recognize the relative pronouns and predict them?

13 Network Architecture
- 31 input nodes
- 150 hidden nodes
- 150 context nodes
- 31 output nodes
* black arrow: distributed and learnable connections
* dotted blue arrow: linear, one-to-one connection with the hidden nodes (context copy)
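
A minimal sketch of one forward step of this architecture (layer sizes from the slide; the weight names and logistic activation are assumptions, not tlearn internals):

```python
import numpy as np

N_IN, N_HID, N_OUT = 31, 150, 31               # layer sizes from the slide

rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.1, (N_HID, N_IN))       # input   -> hidden (learnable)
W_ch = rng.normal(0, 0.1, (N_HID, N_HID))      # context -> hidden (learnable)
W_ho = rng.normal(0, 0.1, (N_OUT, N_HID))      # hidden  -> output (learnable)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def srn_step(x, context):
    """One time step: the hidden units see the current input plus a copy of
    the previous hidden activations held in the context units."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden.copy()               # new context = copy of hidden
```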

14 Training Data

Lexicon (31 words)
  NOUN-HUM      man woman boy girl
  NOUN-ANIM     cat mouse dog lion
  NOUN-INANIM   book rock car
  NOUN-AGRESS   dragon monster
  NOUN-FRAG     glass plate
  NOUN-FOOD     cookie bread sandwich
  VERB-INTRAN   think sleep exist
  VERB-TRAN     see chase like
  VERB-AGPAT    move break
  VERB-PERCEPT  smell see
  VERB-DESTROY  break smash
  VERB-EAT      eat
  -----------------------------
  RELAT-HUM     who
  RELAT-INHUM   which

Grammar (16 templates)
  NOUN-HUM     VERB-EAT      NOUN-FOOD
  NOUN-HUM     VERB-PERCEPT  NOUN-INANIM
  NOUN-HUM     VERB-DESTROY  NOUN-FRAG
  NOUN-HUM     VERB-INTRAN
  NOUN-HUM     VERB-TRAN     NOUN-HUM
  NOUN-HUM     VERB-AGPAT    NOUN-INANIM
  NOUN-HUM     VERB-AGPAT
  NOUN-ANIM    VERB-EAT      NOUN-FOOD
  NOUN-ANIM    VERB-TRAN     NOUN-ANIM
  NOUN-ANIM    VERB-AGPAT    NOUN-INANIM
  NOUN-ANIM    VERB-AGPAT
  NOUN-INANIM  VERB-AGPAT
  NOUN-AGRESS  VERB-DESTROY  NOUN-FRAG
  NOUN-AGRESS  VERB-EAT      NOUN-HUM
  NOUN-AGRESS  VERB-EAT      NOUN-ANIM
  NOUN-AGRESS  VERB-EAT      NOUN-FOOD
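
The lexicon and templates above can be turned into training sentences with a small generator like the following sketch (a hypothetical helper, not the original data generator; only a few of the 16 templates are spelled out):

```python
import random

LEXICON = {
    "NOUN-HUM": ["man", "woman", "boy", "girl"],
    "NOUN-ANIM": ["cat", "mouse", "dog", "lion"],
    "NOUN-INANIM": ["book", "rock", "car"],
    "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],
    "NOUN-FOOD": ["cookie", "bread", "sandwich"],
    "VERB-INTRAN": ["think", "sleep", "exist"],
    "VERB-TRAN": ["see", "chase", "like"],
    "VERB-AGPAT": ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT": ["eat"],
}

TEMPLATES = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-INTRAN"],
    ["NOUN-AGRESS", "VERB-DESTROY", "NOUN-FRAG"],
    # ... the remaining 13 templates from the table above
]

def generate_sentence():
    """Pick a template and fill each slot with a random word of that class."""
    template = random.choice(TEMPLATES)
    return [random.choice(LEXICON[cls]) for cls in template]

print(" ".join(generate_sentence()))   # e.g. "boy eat cookie"
```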

15 Sample Sentences & Mapping
Simple sentences - 2 types
- man think (2 words)
- girl see dog (3 words)
- man break glass (3 words)
Embedded sentences - 3 types (RP = relative pronoun)
1. monster eat man who sleep         (RP-sub, VERB-INTRAN)
2. dog see man who eat sandwich      (RP-sub, VERB-TRAN)
3. woman eat cookie which cat chase  (RP-obj, VERB-TRAN)
Input-output mapping (predict the next word in the sequential input):
  INPUT:  girl see dog man break glass cat ...
  OUTPUT: see dog man break glass cat ...

16 Encoding Scheme
Random word representation
- 31-bit vector for each lexical item; each word is assigned its own randomly chosen bit (a localist code, not a semantic feature encoding).
  sleep  0000000000000000000000000000001
  dog    0000100000000000000000000000000
  woman  0000000000000000000000000010000
  ...
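
A short sketch of this localist encoding and of pairing each input word with the next word as its prediction target (the ordering of the bit assignments here is arbitrary, not the one used in the project):

```python
import numpy as np

# Assign each of the 31 words its own bit (the order here is hypothetical).
WORDS = ["man", "woman", "boy", "girl", "cat", "mouse", "dog", "lion",
         "book", "rock", "car", "dragon", "monster", "glass", "plate",
         "cookie", "bread", "sandwich", "think", "sleep", "exist",
         "see", "chase", "like", "move", "break", "smell", "smash",
         "eat", "who", "which"]
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode(word):
    """31-bit localist vector: exactly one bit set per word."""
    v = np.zeros(len(WORDS))
    v[INDEX[word]] = 1.0
    return v

def prediction_pairs(word_stream):
    """Pair each word with the next word in the stream (the prediction target)."""
    return [(encode(w), encode(nxt))
            for w, nxt in zip(word_stream, word_stream[1:])]

pairs = prediction_pairs("girl see dog man break glass".split())
```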

17 Training a Network
Incremental input (Elman, 1993): "starting small" strategy
Phase I: simple sentences (Elman, 1990, used 10,000 sentences)
- 1,564 sentences generated (4,636 31-bit vectors)
- train on all patterns: learning rate = 0.1, 23 epochs
Phase II: embedded sentences (Elman, 1993, used 7,500 sentences)
- 5,976 sentences generated (35,688 31-bit vectors)
- network loaded with the weights from Phase I
- train on all (1,564 + 5,976) sentences together: learning rate = 0.1, 4 epochs
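
The two-phase "starting small" schedule might be expressed as in the sketch below; srn and train_epoch are hypothetical stand-ins for the network object and for one tlearn-style training pass, not tlearn's actual API.

```python
def starting_small(srn, train_epoch, simple_pairs, embedded_pairs, lr=0.1):
    """Hypothetical sketch of the incremental ("starting small") schedule.

    train_epoch(srn, pairs, lr) is assumed to run one training pass
    over a list of (input, target) pairs.
    """
    # Phase I: simple sentences only (23 epochs in this project).
    for _ in range(23):
        train_epoch(srn, simple_pairs, lr)

    # Phase II: keep the Phase I weights and train on the combined
    # corpus of simple + embedded sentences (4 epochs in this project).
    for _ in range(4):
        train_epoch(srn, simple_pairs + embedded_pairs, lr)
    return srn
```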

18 Performance
Network performance was measured by the root mean squared (RMS) error:

$$ \text{RMS error} = \sqrt{\frac{1}{N} \sum_{p=1}^{N} \lVert \mathbf{d}_p - \mathbf{y}_p \rVert^{2}} $$

where $N$ is the number of input patterns, $\mathbf{d}_p$ the target output vector, and $\mathbf{y}_p$ the actual output vector.

Phase I: after 23 epochs, RMS ≈ 0.91
Phase II: after 4 epochs, RMS ≈ 0.84

Why can't the RMS error be lowered further? The prediction task is nondeterministic, so the network cannot produce a unique output for each input. For this simulation, RMS is NOT the best measure of performance.
Elman's simulations: RMS = 0.88 (1990), mean cosine = 0.852 (1993)
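
A short NumPy sketch of the RMS measure as written above (one plausible reading of the slide's formula, not tlearn's error report):

```python
import numpy as np

def rms_error(targets, outputs):
    """RMS error: square root of the mean (over input patterns) of the
    squared distance between target and actual output vectors.

    targets, outputs: arrays of shape (n_patterns, n_output_units)
    """
    diff = np.asarray(targets) - np.asarray(outputs)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))
```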

19 Phase I: RMS ≈ 0.91 after 23 epochs

20 Phase II: RMS ≈ 0.84 after 4 epochs

21 Results and Analysis

Output (target); the arrow (→) marks the start of a sentence:
  ...
  which  (target: which)
  ?      (target: lion)
  ?      (target: see)
  ?      (target: boy)
→ ?      (target: move)
  ?      (target: sandwich)
  which  (target: which)
  ?      (target: cat)
  ?      (target: see)
  ...
  ?      (target: book)
  which  (target: which)
  ?      (target: man)
  see    (target: see)
  ...
  ?      (target: dog)
→ ?      (target: chase)
  ?      (target: man)
  ?      (target: who)
  ?      (target: smash)
  ?      (target: glass)
  ...

In all positions the word "which" is predicted correctly! But most other words, including "who", are not predicted. Why? → Training Data
Since the prediction task is non-deterministic, predicting the exact next word cannot be the best performance measure. We need to look at the hidden unit activations for each input, since they reflect what the network has learned about classes of inputs with regard to what they predict. → Cluster Analysis, PCA

22 Cluster Analysis
- The network successfully recognizes VERB, NOUN, and some of their subcategories.
- WHO and WHICH end up at different distances.
- VERB-INTRAN fails to group with the other VERB classes.
<Hierarchical cluster diagram of hidden unit activation vectors>
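
Such a diagram can be produced from recorded hidden unit activations with standard hierarchical clustering. The sketch below uses SciPy and random placeholder activations; it is an assumption about the analysis pipeline, not the tool used in the project.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# hidden_by_word: one 150-unit hidden activation vector per word, e.g. the
# average activation over every context in which that word appeared.
# Random placeholder data here; in the project it would come from recording
# the trained network's hidden layer.
words = ["man", "woman", "dog", "lion", "eat", "see", "who", "which"]
hidden_by_word = np.random.default_rng(0).random((len(words), 150))

tree = dendrogram(linkage(hidden_by_word, method="average"),
                  labels=words, no_plot=True)
print(tree["ivl"])   # leaf order of the hierarchical cluster diagram
```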

23

24 Discussion & Conclusion
1. The network can discover the lexical classes from word order. NOUN and VERB form distinct classes, except for VERB-INTRAN. The subclasses of NOUN are classified correctly, but some subclasses of VERB are mixed; this is related to the training examples.
2. The network can recognize and predict the relative pronoun "which", but not "who". Why? Because the sentences containing "who" are not of the RP-obj type, "who" is treated as just another subject, as in a simple sentence.
3. The organization of the input data is important, and the recurrent network is sensitive to it, since it processes the input sequentially and on-line.
4. In general, recurrent networks using RTRL recognized the sequential input, but they require more training time and computational resources.

25 Future Studies
Recurrent Least Squares Support Vector Machines (Suykens & Vandewalle, 2000)
- provide new perspectives for time-series prediction and nonlinear modeling
- seem more efficient than BPTT, RTRL & SRN

26 References
Allen, J., & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), The emergence of language (pp. 115-151). Hillsdale, NJ: Lawrence Erlbaum.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.
Elman, J. L. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.), The emergence of language. Hillsdale, NJ: Lawrence Erlbaum.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
McClelland, J. L., & Kawamoto, A. H. (1986). Mechanisms of sentence processing: Assigning roles to constituents of sentences. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 273-325). Cambridge, MA: MIT Press.
Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47-73.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1 (pp. 318-362). Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tense of English verbs. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.

