Presentation on theme: "Visualizing and Understanding Recurrent Neural Networks"— Presentation transcript:

1 Visualizing and Understanding Recurrent Neural Networks
Presented by: Collin Watts. Written by: Andrej Karpathy, Justin Johnson, Li Fei-Fei

2 Plan Of Attack What we’re going to cover: Overview, Some Definitions,
Experimental Analysis, Lots of Results, The Implications of the Results, Case Studies, Meta-Analysis

3 So, what would you say you do here...
This paper set out both to determine which RNN variant (we’ll get there) performs best and to identify the internal mechanisms by which they achieve their results. The authors chose three variants: basic RNNs, LSTM RNNs, and GRU RNNs. They used character-level language modelling as their test problem, as it is apparently strongly representative of other analyses.

4 Definitions RECURRENT NEURAL NETWORK
A subset of artificial neural networks. Still uses feedforward computation and backpropagation. Allows nodes to form cycles, creating the potential to store information within the network. Used in applications such as handwriting analysis, video analysis, translation, and other interpretation of various human tasks. Difficult to train.

6 Definitions RECURRENT NEURAL NETWORK (Cont.)
Uses a two-dimensional node setup, with time as one axis and depth of the nodes as the other. Hidden vectors are referred to as h_t^l, with l = 0 being the input nodes and l = L being the output nodes. Intermediate vectors are calculated as a function of both the previous time step and the previous layer, which results in the recurrence shown below.
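
The recurrence itself was shown as an image on the slide; reconstructed from the paper, it is approximately:

    h_t^l = \tanh\left( W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \right)

where the vector from the layer below and the vector from the previous time step are stacked into one input, W^l is the weight matrix of layer l, and tanh is applied elementwise.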

7 More DEFINITIONS! LONG SHORT-TERM MEMORY VARIANT
A variant of the RNN designed to mitigate problems with backpropagation through an RNN. Adds a memory cell vector to each node. At every time step, an LSTM can choose to read from, write to, or reset the memory vector, following a series of gating mechanisms; this has the effect of preserving gradients across memory cells for long periods. The gates i, f, and o control whether the memory cell is updated, reset, or read, respectively, while g supplies the additive candidate written into the memory cell (equations below).
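
The gating formulas appear as an image on the slide; as given in the paper (approximately), with sigm the logistic sigmoid and \odot elementwise multiplication:

    \begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} =
    \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
    W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}

    c_t^l = f \odot c_{t-1}^l + i \odot g
    h_t^l = o \odot \tanh(c_t^l)

The forget gate f decides how much of the old cell c_{t-1}^l to keep, i gates how much of the candidate g is written in, and o gates how much of the cell is exposed as the hidden state.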

8 HALF A DEFINITION... GATED RECURRENT UNIT
Not well elaborated on in the paper... The given explanation is that “The GRU has the interpretation of computing a candidate hidden vector and then smoothly interpolating towards it, as gated by z.” My interpretation: rather than having explicit access and control gates, this follows a more analog approach (see the equations below).
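
For reference, the GRU update in the paper is approximately the following, writing x for the input h_t^{l-1} from the layer below and h_{t-1} for the previous hidden state of the same layer:

    r = \mathrm{sigm}(W_{xr} x + W_{hr} h_{t-1})
    z = \mathrm{sigm}(W_{xz} x + W_{hz} h_{t-1})
    \tilde{h} = \tanh(W_{xh} x + W_{hh} (r \odot h_{t-1}))
    h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}

so the update gate z smoothly interpolates between the previous hidden state and the candidate \tilde{h}, as the quoted sentence describes.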

9 Experimental analysis (Science!)
As previously stated, the researchers used character-level language modelling as the basis of comparison. Each network was trained to predict the next character in a sequence, using a Softmax classifier at each time step. Characters are encoded as vectors over all possible characters and fed through the network; the hidden vector in the last layer at each time step is turned into outputs that represent the log probabilities of each character being the next character in the sequence. A minimal sketch of this per-step prediction follows.
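
A minimal NumPy sketch of the per-step prediction (not the authors’ Torch code; the vocabulary, hidden size, and weights below are hypothetical placeholders):

    import numpy as np

    vocab = sorted(set("hello world"))               # characters the model can emit
    hidden_size = 8
    h = np.random.randn(hidden_size)                 # last-layer hidden vector at one time step
    W_y = np.random.randn(len(vocab), hidden_size)   # Softmax classifier weights
    b_y = np.zeros(len(vocab))

    logits = W_y @ h + b_y                           # one score per possible next character
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # Softmax: probability of each next character

    true_idx = vocab.index("l")                      # suppose the true next character is 'l'
    loss = -np.log(probs[true_idx])                  # cross-entropy loss at this time step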

10 Experimental analysis (Science!)
Rejected the use of two other datasets (the Penn Treebank dataset and the Hutter Prize 100 MB Wikipedia dataset) on the basis that they contain a mix of standard English and markup. The stated intention was to use a controlled setting for all types of neural networks, rather than to compete for the best results on those datasets. Decided instead on Leo Tolstoy’s War and Peace, consisting of 3,258,246 characters, and the source code of the Linux kernel (files randomized in order and then concatenated into a single 6,206,996-character file).

11 Experimental analysis (Science!)
War and Peace was split 80/10/10 into training/validation/testing; the Linux kernel was split 90/5/5. The following properties were tested for each of the three RNN variants: number of layers (1, 2, or 3) and number of parameters (64, 128, 256, or 512 cells per layer). A minimal splitting sketch follows.
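
A minimal sketch of such a contiguous character-level split (the text variables are hypothetical placeholders; the fractions are from the slide):

    def split_chars(text, train_frac, val_frac):
        # Contiguously split a character sequence into train/val/test portions.
        n = len(text)
        n_train = int(n * train_frac)
        n_val = int(n * val_frac)
        return text[:n_train], text[n_train:n_train + n_val], text[n_train + n_val:]

    # War and Peace: train, val, test = split_chars(war_and_peace_text, 0.80, 0.10)
    # Linux kernel:  train, val, test = split_chars(kernel_text, 0.90, 0.05)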

12 Results (and the winner is...)
Test set cross entropy loss:
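
The table itself is an image and is not reproduced here. For reference, the test-set cross-entropy loss reported in it is the average negative log probability the trained model assigns to the true next character on held-out text (lower is better):

    L = -\frac{1}{N} \sum_{t=1}^{N} \log p(x_{t+1} \mid x_{\le t})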

13 Results (and the winner is...)

14 Results (and the winner is...)

15 Implications of results (But why...)
The researchers paid attention to several characteristics beyond the raw results. One of their stated goals was to explain why these emergent properties exist. Interpretable, long-range LSTM cells had been theorized to exist but never demonstrated; this work shows them directly. Truncated backpropagation through time (used for performance gains as well as for combatting overfitting) limits learning of dependencies more than X characters away, where X is the truncation length. These LSTM cells were nonetheless able to track longer-range structure while retaining performance and fitting characteristics.

16 Visualizations of results (But why...)
Text color is a visualization of tanh(c) where -1 is red and +1 is blue.

17 Visualizations of results (But why...)

18 Visualizations of results (But why...)

19 Visualizations of results (But why...)

20 Implications of results (But why...)
Also paid attention to gate activations (remember, the gates are what cause interactions with the memory cell) in LSTMs. Defined the notions of “left saturated” and “right saturated”: a gate is left saturated when its activation is below 0.1 and right saturated when its activation is above 0.9, and the fraction of time steps each gate spends in each regime is measured. Of particular note: there are forget gates that are right saturated much of the time (cells remembering values), and no forget gates that are left saturated nearly all of the time (no cells acting purely feed-forward). Also found that gate activations in the first layer are diffuse, which the researchers found very strange and could not explain. A small sketch of the saturation bookkeeping follows.
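
A minimal NumPy sketch of that bookkeeping (the activations array below is a hypothetical stand-in for recorded gate values):

    import numpy as np

    def saturation_fractions(gate_acts, lo=0.1, hi=0.9):
        # gate_acts: [timesteps, cells] activations of one gate, each in [0, 1].
        # Returns, per cell, the fraction of time steps spent left- and right-saturated.
        left = (gate_acts < lo).mean(axis=0)
        right = (gate_acts > hi).mean(axis=0)
        return left, right

    forget_acts = np.random.rand(1000, 128)          # placeholder for real recorded activations
    left_frac, right_frac = saturation_fractions(forget_acts)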

21 Visualizations of results (But why... LSTMs)

22 Visualizations of results (But why... GRUs)

23 Error Analysis of Results
Compared the LSTM against two standard n-gram models to analyze its effectiveness. An error was defined as any character to which the model assigned a probability below 0.5 of being the character that was actually there. Found that while the models shared many of the same errors, there were distinct categories of errors on which each model failed differently. A sketch of the error criterion is below.
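
A minimal sketch of that error criterion (probs and targets below are hypothetical arrays of per-step predictions and true next-character indices):

    import numpy as np

    def error_mask(probs, targets, threshold=0.5):
        # probs: [timesteps, vocab] predicted distributions over the next character.
        # targets: [timesteps] index of the character that actually came next.
        # A time step counts as an error if the true character received probability < 0.5.
        p_true = probs[np.arange(len(targets)), targets]
        return p_true < threshold

    # error_rate = error_mask(probs, targets).mean()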

24 Error analysis of results
Figure panels: Linux Kernel and War and Peace.

25 Error analysis of results
Found that the LSTM has significant advantages over standard n-gram models when computing the probability of special characters. In the Linux kernel model, brackets and whitespace are predicted significantly better than by the n-gram model, because of the LSTM’s ability to keep track of the relationship between opening and closing brackets. Similarly, in War and Peace, the LSTM was able to predict carriage returns more accurately, because that relationship lies outside the n-gram models’ effective range of relationship prediction.

26 Case study { Look, braces! }
When it comes specifically to closing braces (“}”) in the Linux kernel, the researchers were able to analyze the performance of the LSTM versus the n-gram models as a function of the distance to the matching opening brace. Found that the LSTM did better than the n-gram models for distances of up to 60 characters; after that, the performance gains levelled off.

27 Meta-analysis (The good)
The researchers were able to capture and elucidate their point very effectively through their visualizations and the implications they drew. They appear to have demonstrated several previously only-theorized ideas about how RNNs work in data analysis.

28 Meta-analysis (THE BAD)
I would have appreciated a more in-depth explanation of why they rejected the standard competitive datasets. It would seem that those would be a truer measure of capability, which is why they are chosen in the first place. There wasn’t much explanation of why the particular parameters were chosen for each RNN, or of what the evaluation criteria were (what is test-set cross-entropy loss?). The data was split differently across the two texts so that the total character counts for validation and testing were the same; I don’t see what this offers. If anything, you would want the training counts to be the same.

29 META-ANALYSIS (The ugly)
This paper does not ease the reader into the ideas involved. It required reading several additional papers to grasp the implications of things the authors assumed the reader knew. Some ideas were not clearly explained even after researching the related works.

30 Final slide Questions? Comments? Concerns? Corrections?

