Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York, Fall 2001
William Gregory Sakas, Hunter College, Department of Computer Science; Graduate Center, PhD Programs in Computer Science and Linguistics, The City University of New York
Elman's Simple Recurrent Network (SRN)
[Figure: input and output layers with one unit per word (boy, dog, run, book, rock, see, eat). The hidden layer feeds a context layer through a 1-to-1 exact copy of activations; all other connections are "regular" trainable weight connections.]
1) Activate from input to output as usual (one input word at a time), but copy the hidden activations to the context layer.
2) Repeat 1 over and over, but activate from the input AND context (copy) layers to the output layer.
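The two steps above can be sketched as a forward pass in plain Python. This is a minimal illustration, not Elman's implementation: the class name, the sigmoid activation, the random weight range, and the zero initial context are all assumptions, and training (weight updates) is left out.

```python
import math
import random

def make_matrix(rows, cols, rng):
    # Small random weights; the range is an arbitrary choice for this sketch.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class SimpleRecurrentNetwork:
    """Forward pass only (hypothetical helper class, no learning)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(n_hidden, n_in, rng)           # input -> hidden
        self.w_context = make_matrix(n_hidden, n_hidden, rng)  # context -> hidden
        self.w_out = make_matrix(n_out, n_hidden, rng)         # hidden -> output
        self.context = [0.0] * n_hidden                        # assumed start state

    def step(self, x):
        # Hidden units see the current input AND the context layer.
        from_input = matvec(self.w_in, x)
        from_context = matvec(self.w_context, self.context)
        hidden = [sigmoid(a + b) for a, b in zip(from_input, from_context)]
        output = [sigmoid(n) for n in matvec(self.w_out, hidden)]
        # 1-to-1 exact copy of the hidden activations into the context layer.
        self.context = list(hidden)
        return hidden, output
```

Feeding the same word twice gives different hidden activations, because the second step also sees the context copied from the first.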
From Elman (1990)
Templates were set up and lexical items were chosen at random from "reasonable" categories.

Templates for sentence generator
NOUN-HUM     VERB-EAT      NOUN-FOOD
NOUN-HUM     VERB-PERCEPT  NOUN-INANIM
NOUN-HUM     VERB-DESTROY  NOUN-FRAG
NOUN-HUM     VERB-INTRAN
NOUN-HUM     VERB-TRAN     NOUN-HUM
NOUN-HUM     VERB-AGPAT    NOUN-INANIM
NOUN-HUM     VERB-AGPAT
NOUN-ANIM    VERB-EAT      NOUN-FOOD
NOUN-ANIM    VERB-TRAN     NOUN-ANIM
NOUN-ANIM    VERB-AGPAT    NOUN-INANIM
NOUN-ANIM    VERB-AGPAT
NOUN-INANIM  VERB-AGPAT
NOUN-AGRESS  VERB-DESTROY  NOUN-FRAG
NOUN-AGRESS  VERB-EAT      NOUN-HUM
NOUN-AGRESS  VERB-EAT      NOUN-ANIM
NOUN-AGRESS  VERB-EAT      NOUN-FOOD

Categories of lexical items
NOUN-HUM      man, woman
NOUN-ANIM     cat, mouse
NOUN-INANIM   book, rock
NOUN-AGRESS   dragon, monster
NOUN-FRAG     glass, plate
NOUN-FOOD     cookie, sandwich
VERB-INTRAN   think, sleep
VERB-TRAN     see, chase
VERB-AGPAT    move, break
VERB-PERCEPT  smell, see
VERB-DESTROY  break, smash
VERB-EAT      eat
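A sentence generator of this kind can be sketched in a few lines of Python. The templates and lexicon are the ones listed on this slide; the function names and the uniform random choice over templates and words are assumptions made for illustration.

```python
import random

LEXICON = {
    "NOUN-HUM": ["man", "woman"],
    "NOUN-ANIM": ["cat", "mouse"],
    "NOUN-INANIM": ["book", "rock"],
    "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],
    "NOUN-FOOD": ["cookie", "sandwich"],
    "VERB-INTRAN": ["think", "sleep"],
    "VERB-TRAN": ["see", "chase"],
    "VERB-AGPAT": ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT": ["eat"],
}

TEMPLATES = [
    ("NOUN-HUM", "VERB-EAT", "NOUN-FOOD"),
    ("NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"),
    ("NOUN-HUM", "VERB-DESTROY", "NOUN-FRAG"),
    ("NOUN-HUM", "VERB-INTRAN"),
    ("NOUN-HUM", "VERB-TRAN", "NOUN-HUM"),
    ("NOUN-HUM", "VERB-AGPAT", "NOUN-INANIM"),
    ("NOUN-HUM", "VERB-AGPAT"),
    ("NOUN-ANIM", "VERB-EAT", "NOUN-FOOD"),
    ("NOUN-ANIM", "VERB-TRAN", "NOUN-ANIM"),
    ("NOUN-ANIM", "VERB-AGPAT", "NOUN-INANIM"),
    ("NOUN-ANIM", "VERB-AGPAT"),
    ("NOUN-INANIM", "VERB-AGPAT"),
    ("NOUN-AGRESS", "VERB-DESTROY", "NOUN-FRAG"),
    ("NOUN-AGRESS", "VERB-EAT", "NOUN-HUM"),
    ("NOUN-AGRESS", "VERB-EAT", "NOUN-ANIM"),
    ("NOUN-AGRESS", "VERB-EAT", "NOUN-FOOD"),
]

def generate_sentence(rng):
    # Pick a template, then fill each slot with a random word of that category.
    template = rng.choice(TEMPLATES)
    return [rng.choice(LEXICON[category]) for category in template]

def generate_corpus(n_sentences, seed=0):
    # Sentences are concatenated into one word stream with no boundary markers.
    rng = random.Random(seed)
    words = []
    for _ in range(n_sentences):
        words.extend(generate_sentence(rng))
    return words
```

Calling `generate_corpus(10000)` yields a single unbroken stream of two- and three-word sentences, similar in shape to the training file described on the next slide.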
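Resulting training and supervisor files.
Training data (excerpt): woman smash plate cat move man break car boy move girl eat bread dog
Supervisor's answers (excerpt): smash plate cat move man break car boy move girl eat bread dog move
The supervisor file is the training file shifted by one word, so the target for each input word is the word that follows it. Files were 27,354 words long, made up of 10,000 two- and three-word "sentences."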
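The relation between the two files can be written down directly: each input word is paired with the word that follows it. The function name below is hypothetical; the word stream is the excerpt shown on this slide.

```python
def make_prediction_pairs(words):
    # Input at position t is paired with the word at position t + 1.
    return list(zip(words[:-1], words[1:]))

stream = ("woman smash plate cat move man break car "
          "boy move girl eat bread dog move").split()
pairs = make_prediction_pairs(stream)
training_data = [inp for inp, _ in pairs]
supervisor_answers = [target for _, target in pairs]
```

Here `training_data` starts with `woman` and `supervisor_answers` starts with `smash`, matching the two columns of the slide.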
Cluster (similarity) analysis
After the SRN was trained, the file was run through the network and the activations at the hidden nodes were recorded. For simplicity assume only 3 hidden nodes (in fact there were 150).
[Table: one hidden activation vector per word occurrence in "boy smash plate ... dragon eat boy ... boy eat cookie". I made up these numbers for the example; they are not preserved in this transcript.]
Now the average was taken for every word, giving one vector each for boy, smash, plate, dragon, eat and cookie. Each of these vectors represents a point in 3-D space; some vectors are close together, some further apart.
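The averaging step can be sketched as follows. The function name and the activation numbers in the usage example are made up for illustration, just as the numbers on the slide were.

```python
def average_by_word(words, activations):
    """Average the recorded hidden vectors over all occurrences of each word."""
    sums = {}
    counts = {}
    for word, vector in zip(words, activations):
        if word not in sums:
            sums[word] = [0.0] * len(vector)
            counts[word] = 0
        sums[word] = [s + v for s, v in zip(sums[word], vector)]
        counts[word] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}

# Hypothetical recording with 3 hidden nodes: "boy" occurs twice.
words = ["boy", "smash", "plate", "boy"]
activations = [
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 1.0],
    [0.5, 0.0, 0.0],
    [1.0, 0.0, 0.5],
]
averages = average_by_word(words, activations)
```

With these made-up values, the two occurrences of `boy` collapse into one averaged vector, which is the point that later goes into the cluster analysis.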
Each of these words represents a point in 150-dimensional space, averaged from all activations generated by the network when processing that word. In the cluster tree, each joint (where there is a connection) represents the distance between clusters. So, for example, the distance between animals and humans is approx. 0.85 and the distance between ANIMATES and INANIMATES is approx. 1.5.
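To make the idea of a "joint" concrete, here is a small agglomerative clustering sketch. It assumes Euclidean distance and average linkage, which the slides do not specify, and the 2-D vectors in the usage example are hypothetical stand-ins for the 150-dimensional averages.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, vectors):
    # Mean pairwise distance between the members of two clusters.
    total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; record each joint's distance."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical averaged vectors (made up for this example).
vectors = {
    "man": [0.0, 0.0],
    "woman": [0.0, 1.0],
    "book": [10.0, 0.0],
    "rock": [10.0, 1.0],
}
merges = agglomerate(vectors)
```

Each entry of `merges` corresponds to one joint in the tree: the two clusters being connected and the distance at which they join. In this toy data the two humans and the two inanimates join early, and the two groups join last at a much larger distance.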
Seems to correctly discover nouns vs. verbs, verb subcategorization, animates/inanimates, etc. Cool, eh?
Remarks:
No category information is represented in the input (the word encodings are localist and orthogonal).
There are no "rules" in the traditional sense. The categories are learned from statistical regularities in the sentences; there is no structure being provided to the network (more on this in a bit).
There are no "symbols" in the traditional sense. Classic symbol-manipulating systems use names for well-defined classes of entities (N, V, adj, etc.). In an SRN the representation of the concept of, say, boy, is: 1. distributed (as a vector of activations), and 2. represented over context with respect to the words that come before (e.g. boy is represented one way when used as an object and another when used in subject position). Note, though, that when a cluster analysis is performed on specific occurrences of a word, the cluster is very tight, but there is some variation based on the word's context.
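The first remark can be checked with a small example. In a localist (one-unit-per-word) encoding, every pair of distinct words has dot product zero, so the input vectors themselves say nothing about which words are similar. The vocabulary and helper names below are chosen for illustration only.

```python
def one_hot(word, vocab):
    # Localist encoding: exactly one unit is on for each word.
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vocab = ["boy", "girl", "dog", "eat", "see"]
boy = one_hot("boy", vocab)
girl = one_hot("girl", vocab)
eat = one_hot("eat", vocab)
```

Here `dot(boy, girl)` and `dot(boy, eat)` are both 0: at the input, boy is exactly as unrelated to girl as it is to eat. Any noun/verb grouping therefore has to come from the hidden layer.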
From Elman (1991): constituency, long distance relations, optionality.
A simple context-free grammar was used:
S -> NP VP "."
NP -> PropN | N | N RC
VP -> V (NP)
RC -> who NP VP | who VP (NP)
N -> boy | girl | cat | dog | boys | girls | cats | dogs
V -> chase | feed | see | hear | walk | live | chases | feeds | sees | hears | walks | lives
PropN -> John | Mary
Plus constraints on number agreement and verb argument subcategorization.
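The context-free part of this grammar can be sketched as a random generator. Note the loud assumption: this sketch deliberately ignores the number agreement and subcategorization constraints, so it will also produce strings that were excluded from the real training set. The depth limit, the 50/50 choices, and all function names are illustrative choices, not taken from Elman (1991).

```python
import random

NOUNS = ["boy", "girl", "cat", "dog", "boys", "girls", "cats", "dogs"]
VERBS = ["chase", "feed", "see", "hear", "walk", "live",
         "chases", "feeds", "sees", "hears", "walks", "lives"]
PROPER_NOUNS = ["John", "Mary"]

def gen_np(rng, depth):
    # NP -> PropN | N | N RC  (RC only allowed while depth remains)
    options = ["PropN", "N"] + (["N RC"] if depth > 0 else [])
    choice = rng.choice(options)
    if choice == "PropN":
        return [rng.choice(PROPER_NOUNS)]
    if choice == "N":
        return [rng.choice(NOUNS)]
    return [rng.choice(NOUNS)] + gen_rc(rng, depth - 1)

def gen_vp(rng, depth):
    # VP -> V (NP)
    words = [rng.choice(VERBS)]
    if rng.random() < 0.5:
        words += gen_np(rng, depth)
    return words

def gen_rc(rng, depth):
    # RC -> who NP VP | who VP (NP)
    if rng.random() < 0.5:
        return ["who"] + gen_np(rng, depth) + gen_vp(rng, depth)
    words = ["who"] + gen_vp(rng, depth)
    if rng.random() < 0.5:
        words += gen_np(rng, depth)
    return words

def generate_sentence(rng, max_depth=2):
    # S -> NP VP "."  (agreement and subcategorization are NOT enforced here)
    return gen_np(rng, max_depth) + gen_vp(rng, max_depth) + ["."]
```

A usage example: `" ".join(generate_sentence(random.Random(0)))` produces one string from the grammar. Filtering the output for agreement and verb argument structure would be the next step toward the actual training data.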
This allows a variety of interesting sentences that were used for training. (Note: *'d items were not used for training. For you CS people out there, * frequently means ungrammatical.)
Intransitive: Dogs live. *Dogs live cats.
Optionally transitive: Boys see. Boys see dogs. Boys see dog.
Transitive: Boys hit dogs. *Boys hit.
Long distance number agreement: Dog who chases cat sees girl. *Dog who chase cat sees girl. Dog who cat chases sees girl. Boys who girls who dogs chase see hear.
Ambiguous sentence boundaries: Boys see dogs who see girls who hear. Boys see dogs who see girls. Boys see dogs. Boys see.
Boys who Mary chases feed cats.
This is much, much more difficult input than in Elman (1990).
Long distance agreement: feed agrees with Boys, but who Mary chases is in the way.
Subcategorization: chases is mandatorily transitive, but inside the relative clause the network has to NOT mistake Mary chases for an independent sentence (which would still need an object).
Analysis of results: Principal Component Analysis (PCA).
Suppose you have 3 hidden nodes and four vectors of activation that correspond to: boy subj, boy obj, girl subj, girl obj.
[Figure, adapted from Crocker (2001): the four vectors (boy subj, boy obj, girl subj, girl obj) plotted against activation at hidden node 1, hidden node 2 and hidden node 3.]
Hierarchical clustering gives you this: [figure not preserved in this transcript]
But PCA gives you more info: [figure not preserved in this transcript]
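As a rough illustration of what PCA adds, the sketch below finds the first principal component of four made-up 3-node activation vectors using power iteration on the covariance matrix. The numbers are hypothetical (the figure's values are not available), and power iteration is just one simple way to get the leading component; nothing here is taken from the slides or from Crocker (2001).

```python
import math

def mean_center(vectors):
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[v[i] - means[i] for i in range(dim)] for v in vectors]

def covariance(centered):
    n, dim = len(centered), len(centered[0])
    return [[sum(v[i] * v[j] for v in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def first_component(cov, iterations=100):
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * x for c, x in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical activations: node 1 mostly tracks subj/obj, node 2 boy/girl.
labels = ["boy subj", "boy obj", "girl subj", "girl obj"]
data = [[0.9, 0.6, 0.5], [0.1, 0.6, 0.5], [0.9, 0.4, 0.5], [0.1, 0.4, 0.5]]

centered = mean_center(data)
pc1 = first_component(covariance(centered))
projection = {label: sum(c * p for c, p in zip(row, pc1))
              for label, row in zip(labels, centered)}
```

In this toy data the first component lines up with hidden node 1, and projecting onto it separates the subject uses from the object uses. That is the kind of information PCA exposes: which direction in hidden space carries a distinction, which a cluster tree alone does not show.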