1 Learning Processes

2 Introduction A property of primary significance in a neural network (NN): it can learn from its environment and improve its performance through learning, by iterative adjustment of its synaptic weights. Learning is hard to define. One definition, by Mendel and McClaren: Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

3 Learning
The sequence of events in NN learning:
1. The NN is stimulated by the environment.
2. The NN undergoes changes in its free parameters as a result of this stimulation.
3. The NN responds in a new way to the environment because of the changes that have occurred in its internal structure.
A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm. The manner in which a NN relates to the environment dictates the learning paradigm, which refers to a model of the environment operated on by the NN.

4 Overview
Five basic learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann learning.
Learning paradigms: the credit assignment problem, supervised learning, unsupervised learning.
Learning tasks, memory, and adaptation.
Bias-variance tradeoff.

5 Error Correction Learning
Input $x(n)$, output $y_k(n)$, and desired response or target output $d_k(n)$.
Error signal: $e_k(n) = d_k(n) - y_k(n)$
$e_k(n)$ actuates a control mechanism that gradually adjusts the synaptic weights, to minimize the cost function (or index of performance): $E(n) = \frac{1}{2} e_k^2(n)$
When the synaptic weights reach a steady state, learning is stopped.

6 Error Correction Learning: Delta Rule
Widrow-Hoff rule, with learning rate $\eta$: $\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)$
With that, we can update the weights: $w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$
Note: the rule follows from the partial derivative of the cost $E(n)$ with respect to the weight, i.e., gradient descent.
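Below is a minimal sketch of the delta rule for a single linear output neuron, assuming NumPy; the linear activation, the function name, and the eta and n_epochs defaults are illustrative choices, not part of the original slides.

```python
import numpy as np

def train_delta(X, d, eta=0.1, n_epochs=50):
    """X: (N, m) inputs; d: (N,) targets. Returns the learned weights."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, d_n in zip(X, d):
            y_n = w @ x_n          # output y_k(n) of a linear neuron
            e_n = d_n - y_n        # error signal e_k(n) = d_k(n) - y_k(n)
            w += eta * e_n * x_n   # delta rule: eta * e_k(n) * x_j(n)
    return w
```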

7 Memory-Based Learning or Lazy Learning
All (or most) past experiences are explicitly stored as input-target pairs $\{(x_i, d_i)\}_{i=1}^{N}$.
Given a new input $x_{\text{test}}$, determine its class based on the local neighborhood of $x_{\text{test}}$. Two ingredients:
- the criterion used for determining the neighborhood, and
- the learning rule applied to the neighborhood of the input, within the set of training examples.
Examples: CMAC, case-based learning.

8 Memory-Based Learning: Nearest Neighbor
A set of instances observed so far: $X = \{x_1, x_2, \ldots, x_N\}$.
The nearest neighbor $x' \in X$ of $x_{\text{test}}$ satisfies $\min_i d(x_i, x_{\text{test}}) = d(x', x_{\text{test}})$, where $d(\cdot, \cdot)$ is the Euclidean distance.
$x_{\text{test}}$ is assigned the same class as $x'$.
Cover and Hart (1967): the nearest-neighbor error is at most twice the optimal (Bayes) probability of error, given that
- the classified examples are independently and identically distributed, and
- the sample size $N$ is infinitely large.
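A minimal 1-nearest-neighbor sketch under the Euclidean-distance criterion above, assuming NumPy; the function name and array layout are illustrative.

```python
import numpy as np

def nearest_neighbor_classify(X, labels, x_test):
    """X: (N, m) stored instances; labels: (N,); x_test: (m,)."""
    dists = np.linalg.norm(X - x_test, axis=1)  # d(x_i, x_test) for all i
    return labels[np.argmin(dists)]             # class of the nearest x'
```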

9 Hebbian Synapses Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." Hebbian synapse: if two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened. If they are activated asynchronously, the synapse is weakened or eliminated. (This second part was not mentioned by Hebb.)

10 Hebbian Learning
[Figure: input signals entering a layer and the resulting output signals.]
Hebb's Law provides the basis for learning without a teacher. Learning here is a local phenomenon, occurring without feedback from the environment.

11 Hebbian Learning
Using Hebb's Law, we can express the adjustment applied to the weight $w_{ij}$ at iteration $n$: $\Delta w_{ij}(n) = F(y_j(n), x_i(n))$
As a special case, we can represent Hebb's Law as follows (the activity product rule): $\Delta w_{ij}(n) = \alpha\, y_j(n)\, x_i(n)$, where $\alpha$ is the learning rate parameter.
A nonlinear forgetting factor can be introduced into Hebb's Law: $\Delta w_{ij}(n) = \alpha\, y_j(n)\, x_i(n) - \varphi\, y_j(n)\, w_{ij}(n)$, where $\varphi$ is the (usually small positive) forgetting factor.
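A minimal sketch of the activity product rule with the forgetting term, assuming NumPy; the alpha and phi defaults and the W[i, j] layout are illustrative.

```python
import numpy as np

def hebbian_update(W, x, y, alpha=0.01, phi=0.001):
    """W[i, j]: weight from input i to output j; x: (m,); y: (k,)."""
    # Delta w_ij = alpha * y_j * x_i - phi * y_j * w_ij; the forgetting
    # term keeps the weights bounded under repeated stimulation.
    return W + alpha * np.outer(x, y) - phi * W * y[None, :]
```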

12 Hebbian Learning
Hebbian learning is time-dependent, highly local, and heavily interactive.
Positively correlated activity: strengthen. Negatively correlated activity: weaken.
There is strong evidence for Hebbian plasticity in the hippocampus (a brain region).
Correlation vs. covariance rule. Covariance rule (Sejnowski, 1977): $\Delta w_{ij}(n) = \eta\, (y_j(n) - \bar{y})(x_i(n) - \bar{x})$

13 Competitive Learning In competitive learning, neurons compete among themselves to be activated. While in Hebbian learning several output neurons can be activated simultaneously, in competitive learning only a single output neuron is active at any time. The output neuron that wins the "competition" is called the winner-takes-all neuron. The basic idea of competitive learning was introduced in the early 1970s. In the late 1980s, Teuvo Kohonen introduced a special class of artificial neural networks called self-organizing feature maps, which are based on competitive learning.

14 Competitive Learning
Highly suited to discovering statistically salient features (that may aid in classification). Three basic elements:
- the same type of neurons with different weight sets, so that they respond differently to a given set of inputs;
- a limit imposed on the strength of each neuron;
- a competition mechanism, to choose one winner: the winner-takes-all neuron.

15 Competitive Learning
Inputs and weights can be seen as vectors, $x$ and $w_k$; note that each weight vector belongs to a certain output neuron $k$, hence the index.
Single layer, with feedforward excitatory connections and lateral inhibitory connections.
Winner selection: $y_k = 1$ if $v_k > v_j$ for all $j \neq k$; $y_k = 0$ otherwise.
Limit: $\sum_j w_{kj} = 1$ for all $k$.
Adaptation: $\Delta w_{kj} = \eta\,(x_j - w_{kj})$ if $k$ is the winner or a neighbor of it; $0$ otherwise.
The winner's weight vector $w_k = (w_{k1}, w_{k2}, \ldots, w_{kn})$ is moved toward the input vector.

16 Competitive Learning
$\Delta w_{kj} = \eta\,(x_j - w_{kj})$ if $k$ is the winner or a neighbor of it; $0$ otherwise.
Weight vectors converge toward local input clusters: clustering.
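A minimal winner-takes-all sketch of one adaptation step (no neighborhood function), assuming NumPy; the eta default and the function name are illustrative.

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """W: (k, m), one weight vector w_k per output neuron; x: (m,)."""
    v = W @ x                            # induced local fields v_k
    winner = np.argmax(v)                # y_k = 1 only for the largest v_k
    W[winner] += eta * (x - W[winner])   # move w_k toward the input x
    return winner
```

Applied repeatedly over a stream of inputs, each row of W drifts toward the center of the input cluster it wins.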

17 Boltzmann Learning
A stochastic learning algorithm rooted in statistical mechanics.
Recurrent network, binary neurons (on: +1, off: -1).
Energy function: $E = -\frac{1}{2} \sum_j \sum_{k,\,k \neq j} w_{kj}\, x_k\, x_j$
Activation: choose a random neuron $k$ and flip its state with probability (given temperature $T$)
$P(x_k \to -x_k) = \frac{1}{1 + \exp(-\Delta E_k / T)}$
where $\Delta E_k$ is the change in $E$ due to the flip.
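A minimal sketch of one such stochastic update, assuming NumPy, states in {-1, +1}, and a symmetric weight matrix W with zero diagonal; the flip probability follows the slide's formula.

```python
import numpy as np

rng = np.random.default_rng()

def boltzmann_step(x, W, T=1.0):
    """x: (n,) states in {-1, +1}; W: (n, n) symmetric, zero diagonal."""
    k = rng.integers(len(x))
    # Energy change from flipping x_k: Delta E_k = 2 * x_k * sum_j w_kj x_j
    delta_E = 2.0 * x[k] * (W[k] @ x)
    if rng.random() < 1.0 / (1.0 + np.exp(-delta_E / T)):
        x[k] = -x[k]                     # accept the flip
    return x
```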

18 Boltzmann Machine
Two types of neurons:
- Visible neurons: can be affected by the environment.
- Hidden neurons: isolated from the environment.
Two modes of operation:
- Clamped: visible neuron states are fixed by the environmental input and held constant.
- Free-running: all neurons are allowed to update their activity freely.

19 Boltzmann Machine: Learning
Correlation of activity during the clamped condition: $\rho_{kj}^{+}$. Correlation of activity during the free-running condition: $\rho_{kj}^{-}$.
Weight update: $\Delta w_{kj} = \eta\,(\rho_{kj}^{+} - \rho_{kj}^{-})$
Train the weights $w_{kj}$ with various clamping input patterns. After training is completed, present a new clamping input pattern that is a partial version of one of the known vectors. Let the network run clamped on the new input (a subset of the visible neurons), and eventually it will complete the pattern (pattern completion).
Recall that $\mathrm{Corr}(x, y) = \mathrm{Cov}(x, y) / (\sigma_x \sigma_y)$, so both $\rho_{kj}^{+}$ and $\rho_{kj}^{-}$ range in value from -1 to +1.
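A minimal sketch of this update, assuming NumPy; in practice $\rho_{kj}^{+}$ and $\rho_{kj}^{-}$ would be estimated by averaging $x_k x_j$ over samples drawn at thermal equilibrium in the clamped and free-running modes, as sketched in estimate_rho.

```python
import numpy as np

def estimate_rho(samples):
    """samples: (S, n) array of +/-1 states; returns the mean of x x^T."""
    return (samples.T @ samples) / len(samples)

def boltzmann_weight_update(W, rho_plus, rho_minus, eta=0.01):
    """Delta w_kj = eta * (rho+_kj - rho-_kj), applied to all weights."""
    return W + eta * (rho_plus - rho_minus)
```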

20 Learning Paradigms
How neural networks relate to their environment:
- the credit assignment problem
- learning with a teacher
- learning without a teacher

21 Credit Assignment Problem
How to assign credit or blame for an overall outcome to the individual decisions made by the learning machine. In many cases, the outcomes depend on a sequence of actions.
Assignment of credit for outcomes to actions (the temporal credit-assignment problem): when does a particular action deserve credit?
Assignment of credit for actions to internal decisions (the structural credit-assignment problem): assign credit to the internal structures of actions generated by the system.
The credit-assignment problem routinely arises in neural network learning: which neuron, which connection, to credit or blame?

22 Learning Paradigms: Supervised Learning
The teacher has knowledge of the environment, represented as input-output examples; the environment itself is unknown to the network. The network tries to emulate the teacher gradually. Error-correction learning is one way to achieve this: error surface, gradient, steepest descent, etc.

23 Learning Paradigms
Learning without a teacher: no labeled examples of the function to be learned are available.
1) Reinforcement learning
2) Unsupervised learning

24 Learning Paradigms
1) Reinforcement learning: the learning of an input-output mapping is performed through continued interaction with the environment, in order to minimize a scalar index of performance.

25 Learning Paradigms
Delayed reinforcement means that the system observes a temporal sequence of stimuli. It is difficult to perform for two reasons:
- There is no teacher to provide a desired response at each step of the learning process.
- The delay incurred in the generation of the primary reinforcement signal implies that the machine must solve a temporal credit-assignment problem.
Reinforcement learning is closely related to dynamic programming.

26 Learning Paradigms
Unsupervised learning: there is no external teacher or critic to oversee the learning process. Provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the network forms internal representations for encoding features of the input space. A competitive learning rule, such as winner-takes-all, is needed.

27 The Issues of Learning Tasks: Pattern Association
An associative memory is a brain-like distributed memory that learns by association.
Autoassociation: a neural network is required to store a set of patterns by repeatedly presenting them to the network. The network is then presented a partial description of an original stored pattern, and the task is to retrieve that particular pattern.
Heteroassociation: differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns.

28 The Issues of Learning Tasks
Let $x_k$ denote a key pattern and $y_k$ denote a memorized pattern. The pattern association is described by $x_k \to y_k$, $k = 1, 2, \ldots, q$.
In an autoassociative memory, $x_k = y_k$; in a heteroassociative memory, $x_k \neq y_k$.
Storage phase vs. recall phase. $q$ is a direct measure of the storage capacity.

29 The Issues of Learning Tasks: Classification
Pattern recognition: the process whereby a received pattern/signal is assigned to one of a prescribed number of classes. Two general types:
- feature extraction (observation space to feature space; cf. dimensionality reduction), then classification (feature space to decision space);
- single step (observation space directly to decision space).

30 The Issues of Learning Tasks
Function approximation: consider a nonlinear input-output mapping $d = f(x)$, where the vector $x$ is the input and the vector $d$ is the output. The function $f(\cdot)$ is assumed to be unknown. The requirement is to design a neural network $F$ that approximates the unknown function: $\|F(x) - f(x)\| < \epsilon$ for all $x$.
System identification: learn the forward mapping $d = f(x)$. Inverse system: learn $x = f^{-1}(d)$.

31 The Issues of Learning Tasks
Control: the controller has to invert the plant's input-output behavior. A feedback controller adjusts the plant input $u$ so that the plant output $y$ tracks the reference signal $d$. Learning takes the form of free-parameter adjustment in the controller.

32 The Issues of Learning Tasks
Filtering: estimate a quantity at time $n$, based on measurements up to time $n$.
Smoothing: estimate a quantity at time $n$, based on measurements up to time $n + \alpha$ ($\alpha > 0$).
Prediction: estimate a quantity at time $n + \alpha$, based on measurements up to time $n$ ($\alpha > 0$).

33 The Issues of Learning Tasks
Blind source separation: recover the source signal $u(n)$ from the distorted (mixed) signal $x(n) = A\,u(n)$ when the mixing matrix $A$ is unknown.
Nonlinear prediction: given $x(n-T), x(n-2T), \ldots$, estimate $\hat{x}(n)$, the estimated value of $x(n)$.
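To make the mixing model concrete, here is a tiny sketch of $x(n) = A\,u(n)$, assuming NumPy; the two sources and the 2x2 matrix are illustrative, and the "blind" part of the task is precisely that a separation algorithm (e.g., ICA) must recover u without seeing A.

```python
import numpy as np

n = np.arange(1000)
u = np.vstack([np.sin(0.05 * n),               # source 1: sinusoid
               np.sign(np.sin(0.013 * n))])    # source 2: square wave
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                     # mixing matrix, unknown to the learner
x = A @ u                                      # observed mixtures x(n)
```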

34 The Issues of Learning Tasks: Associative Memory
$q$ pattern pairs: $(x_k, y_k)$ for $k = 1, 2, \ldots, q$.
Input (key vector): $x_k = (x_{k1}, x_{k2}, \ldots, x_{km})^t$. Output (memorized vector): $y_k = (y_{k1}, y_{k2}, \ldots, y_{km})^t$.
The weights can be represented as a weight matrix (see the next slide): $y_k = W(k)\, x_k$ for $k = 1, 2, \ldots, q$, that is, $y_{ki} = \sum_{j=1}^{m} w_{ij}(k)\, x_{kj}$ for $i = 1, 2, \ldots, m$: each output element is the inner product of the row $(w_{i1}(k), w_{i2}(k), \ldots, w_{im}(k))$ with the key vector $(x_{k1}, \ldots, x_{km})^t$.

35 Associative Memory
$\begin{pmatrix} y_{k1} \\ \vdots \\ y_{km} \end{pmatrix} = \begin{pmatrix} w_{11}(k) & w_{12}(k) & \cdots & w_{1m}(k) \\ w_{21}(k) & w_{22}(k) & \cdots & w_{2m}(k) \\ \vdots & & & \vdots \\ w_{m1}(k) & w_{m2}(k) & \cdots & w_{mm}(k) \end{pmatrix} \begin{pmatrix} x_{k1} \\ \vdots \\ x_{km} \end{pmatrix}$
With a single $W(k)$ we can represent only one mapping, $x_k \to y_k$. For all pairs $(x_k, y_k)$, $k = 1, 2, \ldots, q$, we need $q$ such weight matrices: $y_k = W(k)\, x_k$ for $k = 1, 2, \ldots, q$.
One strategy is to combine all the $W(k)$ into a single memory matrix $M$ by simple summation: $M = \sum_{k=1}^{q} W(k)$
Will such a simple strategy work? That is, can the following hold with $M$: $y_k \approx M x_k$ for $k = 1, 2, \ldots, q$?

36 Correlation Matrix Memory
With a fixed set of key vectors $x_k$, an $m \times m$ matrix can store $m$ arbitrary output vectors $y_k$. Let $x_k = (0, 0, \ldots, 1, \ldots, 0)^t$, where only the $k$-th element is 1 and all the rest are 0. Construct a memory matrix $M$ whose columns are the arbitrary output vectors: $M = (y_1, y_2, \ldots, y_m)$. Then $y_k = M x_k$ for $k = 1, 2, \ldots, m$.
But we want the $x_k$ to be arbitrary too!
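A quick check of this one-hot key case, assuming NumPy; with $x_k$ the $k$-th standard basis vector, $M x_k$ simply selects column $k$ of $M$. The size m is illustrative.

```python
import numpy as np

m = 4
Y = np.random.randn(m, m)   # column k holds the arbitrary output y_k
M = Y                       # memory matrix: columns are the outputs
for k in range(m):
    x_k = np.eye(m)[:, k]   # one-hot key vector x_k
    assert np.allclose(M @ x_k, Y[:, k])   # perfect recall: y_k = M x_k
```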

37 Correlation Matrix Memory
With $q$ pairs $(x_k, y_k)$, $k = 1, 2, \ldots, q$, we can construct a candidate memory matrix that stores all $q$ mappings as:
$\hat{M} = \sum_{k=1}^{q} y_k x_k^t = y_1 x_1^t + y_2 x_2^t + \cdots + y_q x_q^t = (y_1, y_2, \ldots, y_q)\,(x_1, x_2, \ldots, x_q)^t = Y X^t$
This can be verified easily using partitioned matrices.

38 Correlation Matrix Memory
First, consider $W(k) = y_k x_k^t$ only, and check whether $W(k)\, x_k = y_k$:
$W(k)\, x_k = y_k x_k^t x_k = y_k (x_k^t x_k) = c\, y_k$, where $c = x_k^t x_k$ is a scalar (the squared length of the vector $x_k$).
If all $x_k$ are normalized to have length 1, then $W(k)\, x_k = y_k$ holds!
Now, back to $\hat{M}$: under what condition will $\hat{M} x_j$ give $y_j$ for all $j$? Assume $x_k^t x_k = 1$ (key vectors are normalized). We can decompose $\hat{M} x_j$ as follows:
$\hat{M} x_j = \sum_{k=1}^{q} y_k x_k^t x_j = y_j x_j^t x_j + \sum_{k=1,\,k \neq j}^{q} y_k x_k^t x_j = y_j + \sum_{k=1,\,k \neq j}^{q} y_k x_k^t x_j$
If all keys are mutually orthogonal, the last (noise) term vanishes, and $\hat{M} x_j = y_j$.
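A minimal numerical check of the orthonormal-key condition, assuming NumPy; the sizes m and q are illustrative, and np.linalg.qr is used only as a convenient way to obtain orthonormal key vectors.

```python
import numpy as np

m, q = 8, 5
X, _ = np.linalg.qr(np.random.randn(m, q))   # columns: orthonormal keys x_k
Y = np.random.randn(m, q)                    # columns: memorized vectors y_k
M_hat = Y @ X.T                              # M^ = sum_k y_k x_k^t
for k in range(q):
    assert np.allclose(M_hat @ X[:, k], Y[:, k])   # noise term vanishes
```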

39 Correlation Matrix Memory
We can also ask how many items can be stored in $\hat{M}$, i.e., its capacity. The capacity is closely related to the rank of the matrix $\hat{M}$, where the rank is the number of linearly independent column vectors (or row vectors) in the matrix.

40 Adaptation When the environment is stationary (its statistical characteristics do not change over time), supervised learning can be used to obtain a relatively stable set of parameters. If the environment is nonstationary, the parameters need to be adapted over time, on an ongoing basis (continuous learning, or learning-on-the-fly). If the signal is locally stationary (pseudostationary), the parameters can be repeatedly retrained on a small window of samples, assuming these are stationary: continual training with time-ordered samples.

41 Probabilistic and Statistical Aspects of the Learning Process
We do not have knowledge of the exact functional relationship between $X$ and $D$, so we posit $D = f(X) + \varepsilon$, a regressive model.
The mean value of the expectational error $\varepsilon$, given any realization of $X$, is zero.
The expectational error $\varepsilon$ is uncorrelated with the regression function $f(X)$.

42 Probabilistic and Statistical Aspects of the Learning Process
Bias/variance dilemma:
$L_{av}(f(x), F(x, \mathcal{T})) = B^2(w) + V(w)$
$B(w) = E_{\mathcal{T}}[F(x, \mathcal{T})] - E[D \mid X = x]$ (an approximation error)
$V(w) = E_{\mathcal{T}}[(F(x, \mathcal{T}) - E_{\mathcal{T}}[F(x, \mathcal{T})])^2]$ (an estimation error)
Neural networks tend to have small bias and large variance; introducing bias can reduce variance. See other sources for details.
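A small Monte Carlo sketch of estimating the two terms at a single point $x_0$, assuming NumPy; the target function, the noise level, and the deliberately high-variance 1-nearest-neighbor estimator are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                   # true regression function
x0, n_train, n_trials = 1.0, 20, 2000

preds = []
for _ in range(n_trials):                    # draw many training sets T
    x = rng.uniform(0, 2, n_train)
    d = f(x) + rng.normal(0.0, 0.3, n_train) # D = f(X) + eps
    # F(x0, T): 1-nearest-neighbor regression (low bias, high variance)
    preds.append(d[np.argmin(np.abs(x - x0))])

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2          # B^2: approximation error
var = preds.var()                            # V: estimation error
print(f"bias^2 = {bias2:.4f}, variance = {var:.4f}")
```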

