
1
**Hybrid computing using a neural network with dynamic external memory**

Alex Graves, Greg Wayne, et al., *Nature*, 2016
Presented by Youngnam Kim

2
Outline

This paper proposes an improved version of the Neural Turing Machine, called the Differentiable Neural Computer (DNC). The three main differences are:
- dynamic memory allocation
- improved location-based addressing via temporal memory linkage
- the agent learns how much to write

This presentation covers:
- Neural Turing Machines, briefly
- the differences between NTMs and DNCs
- experimental results

3
**Neural Turing machines (Alex Graves et al., 2014)**

NTMs imitate Turing machines with a memory network consisting of:
- an external memory $M_t \in \mathbb{R}^{N \times d}$, where $N$ is the number of memory locations and $d$ is the memory vector dimension
- read and write heads; all interaction with memory must be differentiable
- a controller that learns what and where to read and write; RNNs are generally used

4
**Neural Turing machines – read and write**

To be differentiable, reading and writing use attention: every location is read and written, each to a different extent, according to a weighting with $\sum_i w_t(i) = 1$ and $0 \le w_t(i) \le 1$.

Reading:
$$\mathbf{r}_t \leftarrow \sum_i w_t(i)\,\mathbf{M}_t(i)$$

Writing:
$$\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i) \odot \left(\mathbf{1} - w_t(i)\,\mathbf{e}_t\right) + w_t(i)\,\mathbf{a}_t$$

where $\mathbf{e}_t \in \mathbb{R}^d$ is an erase vector and $\mathbf{a}_t \in \mathbb{R}^d$ is an add vector.
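The read and write updates above can be sketched in NumPy. This is a minimal illustration of the formulas, not the paper's implementation; the shapes and values are made up for the example:

```python
import numpy as np

N, d = 4, 3                                        # memory locations, vector dimension
M = np.arange(N * d, dtype=float).reshape(N, d)    # memory M_{t-1}

w = np.array([0.0, 1.0, 0.0, 0.0])                 # head weighting, sums to 1

# Read: r_t = sum_i w_t(i) M_t(i) -- a convex combination of memory rows
r = w @ M

# Write: erase then add at each location, in proportion to w_t(i)
e = np.ones(d)                                     # erase vector in [0,1]^d (erase fully)
a = np.array([9.0, 9.0, 9.0])                      # add vector
M_new = M * (1 - np.outer(w, e)) + np.outer(w, a)
```

Because the weighting here is one-hot on location 1, the read returns exactly row 1, and the write fully replaces row 1 with the add vector while leaving the other rows untouched; a soft weighting would blend these effects across locations.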

5
**Neural Turing machines – addressing**

Addressing: how to produce the weights for the read and write operations. There are two mechanisms, content-based and location-based addressing.

For content-based addressing, the controller produces a key vector $\mathbf{k}_t$ and a key strength $\beta_t \ge 1$; the content weighting $\mathbf{w}_t^c$ is
$$w_t^c(i) = \frac{\exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}(i))\}}{\sum_j \exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}(j))\}}, \qquad S(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$
where $S$ is a similarity function, generally cosine similarity. The DNC uses the same content-based addressing as the NTM.
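Content-based addressing is just a softmax over cosine similarities, sharpened by the key strength. A small sketch (the epsilon and max-subtraction are standard numerical-stability additions, not part of the slide's formula):

```python
import numpy as np

def content_addressing(M, k, beta):
    """Softmax over beta-scaled cosine similarity between key k and each memory row."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    scores = beta * sims
    scores -= scores.max()          # numerical stability before exponentiating
    w = np.exp(scores)
    return w / w.sum()

M = np.eye(4)                       # 4 locations, 4-dim vectors
k = np.array([1.0, 0.0, 0.0, 0.0])  # key matching row 0 exactly
w = content_addressing(M, k, beta=10.0)
```

A larger `beta` concentrates the weighting on the best-matching location; with `beta` near 1 the read blends many locations.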

6
**Neural Turing machines – addressing**

Location-based addressing (different from the DNC). In the NTM, the content weighting $\mathbf{w}_t^c$ is first interpolated with the previous weighting $\mathbf{w}_{t-1}$ before shifting:
$$\mathbf{w}_t^g \leftarrow g_t\, \mathbf{w}_t^c + (1 - g_t)\, \mathbf{w}_{t-1}$$
where the interpolation gate $g_t$ is a scalar in $(0, 1)$. After interpolation, the gated weighting $\mathbf{w}_t^g$ is shifted by circular convolution with a shift distribution $\mathbf{s}_t$:
$$w_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$$
To avoid leakage and dispersion of the weighting, a sharpening parameter $\gamma_t \ge 1$ is applied:
$$w_t(i) \leftarrow \frac{w_t(i)^{\gamma_t}}{\sum_j w_t(j)^{\gamma_t}}$$
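The shift-and-sharpen step can be sketched directly from the two formulas above (a toy illustration with hand-picked weightings, not the paper's code):

```python
import numpy as np

def shift_and_sharpen(w_g, s, gamma):
    """Circular convolution of the gated weighting with a shift
    distribution, followed by sharpening (NTM location addressing)."""
    N = len(w_g)
    w = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                  for i in range(N)])
    w = w ** gamma
    return w / w.sum()

w_g = np.array([0.0, 1.0, 0.0, 0.0])   # head focused on location 1
s   = np.array([0.0, 1.0, 0.0, 0.0])   # distribution putting all mass on shift +1
w = shift_and_sharpen(w_g, s, gamma=1.0)
```

With a one-hot shift distribution the head moves deterministically to the next location; a soft shift distribution would smear the weighting, which is exactly what the sharpening exponent counteracts.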

7
**Neural Turing machines – addressing**

An example of shift weightings. Disadvantage: we can only iterate over adjacent memory locations, since a shift moves the head by a fixed offset.

8
**Differentiable Neural Computers – architecture**

9
**Differentiable Neural Computers – write operation**

Dynamic memory allocation: the agent learns to decide whether a location it has read from should be freed. To do this, each read head produces a free gate, and the write head uses an allocation weighting $\mathbf{a}_t \in [0,1]^N$. When the usage $u_t[i]$ is close to 0, the $i$-th memory location is free.
$$\boldsymbol{\psi}_t = \prod_{i=1}^{R} \left(\mathbf{1} - f_t^i\, \mathbf{w}_{t-1}^{r,i}\right)$$
$$\mathbf{u}_t = \left(\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w - \mathbf{u}_{t-1} \odot \mathbf{w}_{t-1}^w\right) \odot \boldsymbol{\psi}_t$$
$$\mathbf{a}_t[\boldsymbol{\phi}_t[j]] = \left(1 - \mathbf{u}_t[\boldsymbol{\phi}_t[j]]\right) \prod_{i=1}^{j-1} \mathbf{u}_t[\boldsymbol{\phi}_t[i]]$$
where $\boldsymbol{\psi}_t$ is the retention vector, $f_t^i$ is the free gate of read head $i$, $\mathbf{w}_{t-1}^{r,i}$ is the read weighting of head $i$ at the previous time step, $\mathbf{w}_{t-1}^w$ is the previous write weighting, and the free list $\boldsymbol{\phi}_t$ is the list of memory indices sorted in ascending order of usage. The term $\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w - \mathbf{u}_{t-1} \odot \mathbf{w}_{t-1}^w$ raises usage where we have just written (overwriting), $1 - \mathbf{u}_t$ measures how free a location is, and the running product over the free list forces allocation toward the freest locations first.
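The allocation weighting alone is easy to compute once the usage vector is known. A minimal sketch of just that last formula (usage values are made up for the example):

```python
import numpy as np

def allocation_weighting(u):
    """DNC allocation: a[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]],
    where phi sorts locations by ascending usage."""
    phi = np.argsort(u)          # free list: least-used locations first
    a = np.zeros_like(u)
    prod = 1.0
    for j in phi:
        a[j] = (1.0 - u[j]) * prod
        prod *= u[j]             # later entries are discounted by earlier usage
    return a

u = np.array([0.9, 0.0, 0.5])   # location 1 is completely free
a = allocation_weighting(u)
```

Because location 1 has usage 0, it receives all of the allocation mass; the running product then zeroes out the allocation for every later entry in the free list.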

10
**Differentiable Neural Computers – write operation**

The write weighting interpolates the content weighting $\mathbf{c}_t^w$ and the allocation weighting $\mathbf{a}_t$:
$$\mathbf{w}_t^w = g_t^w \left[ g_t^a\, \mathbf{a}_t + (1 - g_t^a)\, \mathbf{c}_t^w \right]$$
where $g_t^w$ is the write gate and $g_t^a$ is the allocation gate.
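The interpolation is a one-liner; a tiny sketch with made-up weightings, to show how the two gates interact:

```python
import numpy as np

def write_weighting(c_w, a, g_w, g_a):
    """Blend allocation and content weightings, scaled by the write gate."""
    return g_w * (g_a * a + (1 - g_a) * c_w)

c_w = np.array([0.2, 0.8])   # content weighting (where the key matches)
a   = np.array([1.0, 0.0])   # allocation weighting (where memory is free)
w = write_weighting(c_w, a, g_w=1.0, g_a=1.0)
```

With the allocation gate fully open the write goes where memory is free; with it closed the write goes where the content matches; and a write gate of 0 suppresses writing entirely.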

11
**Differentiable Neural Computers – write operation**

Copy task: 10 sequences of length 5, with memory size 10, so locations must be freed and reused across sequences.

12
**Differentiable Neural Computers – read operation**

Temporal memory linkage: after each write operation, we can store information about the order in which the data were written. Let the linkage matrix be $\mathbf{L}_t \in [0,1]^{N \times N}$, where $L_t[i,j]$ represents the degree to which location $i$ was written to just after location $j$. With the precedence weighting $\mathbf{p}_t$:
$$\mathbf{p}_t = \left(1 - \sum_i w_t^w[i]\right)\mathbf{p}_{t-1} + \mathbf{w}_t^w, \qquad \mathbf{p}_0 = \mathbf{0}$$
$$L_t[i,j] = \left(1 - w_t^w[i] - w_t^w[j]\right) L_{t-1}[i,j] + w_t^w[i]\, p_{t-1}[j]$$
$$L_0[i,j] = 0 \;\; \forall i,j, \qquad L_t[i,i] = 0 \;\; \forall i$$
The term $\left(1 - \sum_i w_t^w[i]\right)\mathbf{p}_{t-1}$ goes to 0 when a full write occurs, so $\mathbf{p}_t$ tracks the degree to which the latest valid write operation attended to each location $j$; and when $w_t^w[i]$ or $w_t^w[j]$ is close to 1, the old links into and out of those locations are cut.
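One step of the linkage and precedence updates can be sketched as follows (a toy trace with one-hot write weightings, not the paper's implementation):

```python
import numpy as np

def update_linkage(L, p, w):
    """One step of the DNC temporal-link update for write weighting w."""
    # L[i,j] <- (1 - w[i] - w[j]) L[i,j] + w[i] p[j]
    L = (1 - w[:, None] - w[None, :]) * L + np.outer(w, p)
    np.fill_diagonal(L, 0.0)          # no self-links: L[i,i] = 0
    p = (1 - w.sum()) * p + w         # precedence weighting update
    return L, p

N = 3
L, p = np.zeros((N, N)), np.zeros(N)
L, p = update_linkage(L, p, np.array([1.0, 0.0, 0.0]))  # write location 0
L, p = update_linkage(L, p, np.array([0.0, 1.0, 0.0]))  # then location 1
```

After these two writes, `L[1, 0]` is 1: the matrix records that location 1 was written immediately after location 0, regardless of where the two locations sit in memory.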

13
**Differentiable Neural Computers – read operation**

Temporal memory linkage lets the agent choose which direction to read. The forward weighting $\mathbf{f}_t^i$ and backward weighting $\mathbf{b}_t^i$ of read head $i$ are
$$\mathbf{f}_t^i = \mathbf{L}_t\, \mathbf{w}_{t-1}^{r,i}, \qquad \mathbf{b}_t^i = \mathbf{L}_t^{\top}\, \mathbf{w}_{t-1}^{r,i}$$

14
**Differentiable Neural Computers – read operation**

Read mode: each read head chooses how to read using a mode distribution $\boldsymbol{\pi}_t^i \in [0,1]^3$. The resulting read weighting of read head $i$ is
$$\mathbf{w}_t^{r,i} = \pi_t^i[1]\, \mathbf{b}_t^i + \pi_t^i[2]\, \mathbf{c}_t^{r,i} + \pi_t^i[3]\, \mathbf{f}_t^i$$
The head can therefore iterate over written sequences forward and backward, regardless of their actual locations in memory.
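Putting the forward, backward, and content weightings together gives the read weighting. A small sketch, with a hand-built link matrix standing in for the learned one:

```python
import numpy as np

def read_weighting(L, w_prev, c, pi):
    """DNC read weighting: mix backward, content, and forward read modes."""
    f = L @ w_prev        # forward: follow links out of the last read location
    b = L.T @ w_prev      # backward: follow links into the last read location
    return pi[0] * b + pi[1] * c + pi[2] * f

N = 3
L = np.zeros((N, N))
L[1, 0] = 1.0                           # location 1 was written after location 0
w_prev = np.array([1.0, 0.0, 0.0])      # head last read location 0
c = np.zeros(N)                         # content mode unused in this example
w = read_weighting(L, w_prev, c, pi=np.array([0.0, 0.0, 1.0]))  # pure forward
```

In pure forward mode the head steps from location 0 to location 1, following the write order rather than the physical memory layout.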

15
**Differentiable Neural Computers – controller**

The DNC uses a deep (multi-layer) LSTM as its controller. Its input at each time step is $\mathbf{x}_t$ together with the read vectors $\mathbf{r}_{t-1}^i$ of each read head $i$ from the previous time step; its outputs are $\mathbf{v}_t$ and $\boldsymbol{\xi}_t$, where $\boldsymbol{\xi}_t$ is the interface vector that parameterises the memory operations.

16
**Differentiable Neural Computers – experiments**

- bAbI question answering: a dataset covering 20 types of reasoning; 10,000 training examples and 1,000 test examples
- Graph tasks: training on inference, shortest-path, and traversal queries over randomly generated graphs; testing on the London Underground map and a family tree
- Mini-SHRDLU: moving blocks to satisfy given constraints, trained with reinforcement learning

17
**Differentiable Neural Computers – bAbI**

Example: "mary journeyed to the kitchen. mary moved to the bathroom. john went back to the hallway. john picked up the milk there. Q: what is john carrying?" The answer is "milk". The lexicon contains 159 unique words, encoded as one-hot vectors, and the DNC acts as a classifier over this lexicon.

18
**Differentiable Neural Computers – bAbI**

19
**Differentiable Neural Computers – Graph task**

[Figure: graph-task encoding and training. Labels range over 0-999; labels 0-9 are given directly as input, while relation labels 10-410 are not part of the input. The network is trained to 1) regress the optimal policy, with 2) 10 time steps of planning, to check that the DNC remembers the graph.]

20
**Differentiable Neural Computers – Graph task**

Decoding analysis: a logistic regressor is trained with the write vector as input and the input triple at that time step as target, to test what the DNC stores in memory.

21
**Differentiable Neural Computers – Graph task**

22
**Differentiable Neural Computers – Graph task**

23
**Differentiable Neural Computers – Graph task**

24
**Differentiable Neural Computers – extra experiments**

A DNC trained with memory size 256 on the traversal task: the plot shows the fraction of completed runs over 100 traversal tasks, where each answer is a (source node, edge, destination node) triple.

25
**Differentiable Neural Computers – mini SHRDLU**

Reward: the number of satisfied constraints. Penalty: incurred when taking an invalid action (of the 7 possible actions). Decoding analysis: a logistic regressor whose input is the average of the memory content vectors (input dimension ×9) and whose target is the first 5 actions taken by the agent.

26
**Differentiable Neural Computers – mini SHRDLU**

- Perfect: all constraints satisfied in the minimal number of moves
- Success: all constraints satisfied, in any number of moves
- Incomplete: failed to satisfy all constraints

27
**Differentiable Neural Computers – conclusion**

Reasoning about and representing complex data structures is important. DNCs can handle the variability of tasks while maintaining domain regularity: the controller learns the domain's regularities, while task-specific variability is written to memory. A future direction is solving new tasks without adapting the model's parameters.
