Learning to make specific predictions using Slow Feature Analysis.

Memory/prediction hierarchy with temporal invariances
Slow: temporally invariant abstractions
Fast: quickly changing input
But… how does each module work: learn, map, and predict?

My (old) module:
1. Quantize high-dim input space
2. Map to low-dim output space
3. Discover temporal sequences in input space
4. Map sequences to low-dim sequence language
5. Feedback = same map run backwards
Problems:
– Sequence-mapping (step #4) depends on several previous steps → brittle, not robust
– Sequence-mapping not well-defined statistically

New module design: Slow Feature Analysis (SFA)
Pros of SFA:
– Nearly guaranteed to find some slow features
– No quantization
– Defined over entire input space
– Hierarchical “stacking” is easy
– Statistically robust building blocks (simple polynomials, Principal Components Analysis, variance reduction, etc.)
→ a great way to find invariant functions
→ invariants change slowly, hence easily predictable
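
For readers who have not seen SFA, here is a minimal sketch of the building blocks named above (polynomial expansion, PCA whitening, variance reduction on the time derivative). It assumes NumPy; the function names `quadratic_expand` and `sfa` and the toy sine mixture are illustrative and not taken from the slides.

```python
import numpy as np

def quadratic_expand(x):
    """Append all degree-2 monomials x_i * x_j (i <= j) to the raw inputs."""
    i, j = np.triu_indices(x.shape[1])
    return np.hstack([x, x[:, i] * x[:, j]])

def sfa(x, n_features):
    """Return (slow signals, slowness values) for a time series x of shape (T, d)."""
    z = quadratic_expand(x)
    z = z - z.mean(axis=0)
    # Whiten with PCA, discarding near-degenerate directions.
    evals, evecs = np.linalg.eigh(np.cov(z, rowvar=False))
    keep = evals > 1e-8 * evals.max()
    white = evecs[:, keep] / np.sqrt(evals[keep])
    zw = z @ white
    # Slow features = directions of least variance in the time derivative.
    dz = np.diff(zw, axis=0)
    d_evals, d_evecs = np.linalg.eigh(np.cov(dz, rowvar=False))  # ascending order
    return zw @ d_evecs[:, :n_features], d_evals[:n_features]

# Toy input (made up for illustration): a slow sine mixed with a fast sine.
t = np.linspace(0, 2 * np.pi, 2000)
slow, fast = np.sin(t), np.sin(37 * t)
x = np.column_stack([slow + 0.5 * fast, fast - 0.3 * slow])
s, slowness = sfa(x, 1)
corr = abs(np.corrcoef(s[:, 0], slow)[0, 1])  # how well the slow source is recovered
```

The sign of an extracted feature is arbitrary, which is why the comparison uses the absolute correlation.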

BUT… no feedback!
– Can’t get specific output from invariant input
– It’s hard to take a low-dim signal and turn it into the right high-dim one (underdetermined)
Here’s my solution (straightforward, probably done before somewhere): do feedback with a separate map

First, show it working… then, show how & why
Input space: 20-dim “retina” (wrapped: pixel 21 = pixel 1)
Input shapes: Gaussian blurs (wrapped) of 3 different widths
Input sequences: constant-velocity motion (0.3 pixels/step)
[Figure: example frames at T = 0 … 2 … 4 and T = 23 … 25 … 27]
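
A sketch of how such input could be generated, assuming NumPy. The retina size and velocity come from the slide; the three widths in `WIDTHS`, the sequence length, and the use of shortest ring distance (rather than a true sum over wrapped copies) are assumptions made for illustration.

```python
import numpy as np

N_PIXELS = 20             # 20-dim "retina" (from the slide)
VELOCITY = 0.3            # pixels per time step (from the slide)
WIDTHS = (1.0, 2.0, 3.0)  # hypothetical std devs; the slide only says "3 different widths"

def wrapped_gaussian(center, width, n_pixels=N_PIXELS):
    """Gaussian blur on a circular retina, using distance measured around the ring."""
    pixels = np.arange(n_pixels)
    d = np.abs(pixels - center) % n_pixels
    d = np.minimum(d, n_pixels - d)   # shortest way around the ring
    return np.exp(-0.5 * (d / width) ** 2)

def make_sequence(width, n_steps, start=0.0):
    """Frames of one shape drifting at constant velocity; shape (n_steps, N_PIXELS)."""
    centers = (start + VELOCITY * np.arange(n_steps)) % N_PIXELS
    return np.array([wrapped_gaussian(c, width) for c in centers])

# One 200-step sequence per width, stacked into a single training set.
frames = np.vstack([make_sequence(w, 200) for w in WIDTHS])
```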

Sanity-check: slow features extracted match generating parameters
– “What”: Gaussian std dev. ↔ slow feature #1
– “Where”: Gaussian center pos’n ↔ slow feature #2
(… so far, this is plain vanilla SFA, nothing new…)

New contribution: predict all pixels of the next image, given previous images…
[Figure: frames at T = 0 … 2 … 4, then the unknown frame at T = 5]
Reference prediction is to use the previous image, T = 4 → T = 5 (“tomorrow’s weather is just like today’s”)

Plot ratio: (mean-squared prediction error) / (mean-squared reference error)
Median ratio over all points = 0.06 (including discontinuities)
…over high-confidence points = 0.03 (toss worst 20%)
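
The ratio above can be computed along these lines. This is a sketch assuming NumPy, with frames stored as rows; the function names and the `confidence` array (higher = more confident) are hypothetical, since the slides do not say how confidence is represented.

```python
import numpy as np

def error_ratio(x_true, x_pred, x_prev):
    """Per-frame ratio of prediction MSE to reference MSE (reference = previous frame).

    Assumes the reference error is nonzero, i.e. the image actually changed.
    """
    pred_mse = np.mean((x_pred - x_true) ** 2, axis=1)
    ref_mse = np.mean((x_prev - x_true) ** 2, axis=1)
    return pred_mse / ref_mse

def median_ratios(ratio, confidence, toss_fraction=0.2):
    """Median over all frames, and over the frames left after tossing the least confident."""
    cutoff = np.quantile(confidence, toss_fraction)
    return np.median(ratio), np.median(ratio[confidence >= cutoff])
```

A ratio below 1 means the prediction beats the “same as today” reference.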

Take-home messages:
– SFA can be inverted
– SFA can be used to make specific predictions
– The prediction works very well
– The prediction can be further improved by using confidence estimates
So why is it hard, and how is it done?

Why it’s hard:
High-dim: x1 x2 x3 … x20
Low-dim slow features (easy direction): S1 = 0.3 x1 + 0.1 x1^2 + 1.4 x2 x3 + 1.1 x4^2 + … + 0.5 x5 x9 + …
But given S1 = 1.4, S2 = -0.33 (HARD direction): x1 = ? x2 = ? x3 = ? x4 = ? x5 = ? x6 = ? … x20 = ?
– Infinitely many possibilities of x’s
– Vastly under-determined
– No simple polynomial-inverse formula (e.g. “quadratic formula”)

Very simple, graphable example:
(x1, x2) 2-dim → S1 1-dim
x1(t), x2(t): approx circular motion in plane
S1(t) = x1^2 + x2^2 nearly constant, i.e. slow
Illustrate a series of six clue/trick pairs for learning specific-prediction mapping
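
A small numerical check of this example, assuming NumPy. The angular step, the slow radius drift, and the `slowness` measure (mean squared step-to-step change of the normalized signal) are choices made for this sketch, not values from the slides.

```python
import numpy as np

t = np.arange(500)
angle = 0.2 * t                          # hypothetical angular step per time step
radius = 1.0 + 0.05 * np.sin(0.002 * t)  # hypothetical slow drift: "approx" circular
x1, x2 = radius * np.cos(angle), radius * np.sin(angle)

s1 = x1 ** 2 + x2 ** 2                   # the slide's slow feature

def slowness(signal):
    """Mean squared step-to-step change of the zero-mean, unit-variance signal."""
    z = (signal - signal.mean()) / signal.std()
    return np.mean(np.diff(z) ** 2)

slow_s1, slow_x1 = slowness(s1), slowness(x1)  # smaller value = slower signal
```

Here `slow_s1` comes out far smaller than `slow_x1`: the inputs sweep around the circle quickly while S1 only tracks the slowly drifting radius.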

Clue #1: The actual input data is a small subset of all possible input data (i.e. on a “manifold”)
Trick #1: Find a set of points which represent where the actual input data is → 20-80 “anchor points” A_i (found using k-means, k-medoids, etc. This is quantization, but only for feedback)
[Figure: actual data ≠ possible data]
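
One way to find anchor points is plain k-means (Lloyd's algorithm), sketched here with NumPy only. The function name `find_anchors`, the iteration count, and the toy circular data are assumptions; the slides equally allow k-medoids or other methods.

```python
import numpy as np

def find_anchors(data, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (anchors of shape (k, d), label per sample)."""
    rng = np.random.default_rng(seed)
    anchors = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every sample to its nearest anchor.
        dist = np.linalg.norm(data[:, None, :] - anchors[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each anchor to the mean of its samples (empty clusters stay put).
        for i in range(k):
            members = data[labels == i]
            if len(members):
                anchors[i] = members.mean(axis=0)
    # Final assignment so the labels match the returned anchors.
    dist = np.linalg.norm(data[:, None, :] - anchors[None, :, :], axis=2)
    return anchors, dist.argmin(axis=1)

# Toy manifold (illustrative data only): noisy points on a circle in the plane.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 1000)
data = np.column_stack([np.cos(theta), np.sin(theta)])
data = data + 0.02 * rng.standard_normal((1000, 2))
anchors, labels = find_anchors(data, k=20)
```

The anchors end up on the circle, not spread over the whole plane, which is the point of Clue #1.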

Clue #2: The actual input data is not distributed evenly about those anchor points
Trick #2: Calculate covariance matrix C_i of data around A_i
[Figure: data scatter about an anchor, with the eigenvectors of C_i]
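
A sketch of the per-anchor covariance and its eigenvectors, assuming NumPy. Assigning each sample to its nearest anchor and measuring scatter about the anchor itself (rather than about the cluster mean) are assumptions of this sketch.

```python
import numpy as np

def local_covariances(data, anchors):
    """Covariance C_i of the samples nearest each anchor A_i, plus its eigen-decomposition."""
    dist = np.linalg.norm(data[:, None, :] - anchors[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    covs, eigvals, eigvecs = [], [], []
    for i, a in enumerate(anchors):
        diff = data[labels == i] - a
        c = diff.T @ diff / max(len(diff), 1)  # scatter about the anchor itself
        w, v = np.linalg.eigh(c)               # eigenvalues in ascending order
        covs.append(c)
        eigvals.append(w[::-1])                # reorder: largest first
        eigvecs.append(v[:, ::-1])             # columns reordered to match
    return np.array(covs), np.array(eigvals), np.array(eigvecs)

# Toy data (illustrative): points spread along x1 with tiny wiggle in x2.
data = np.column_stack([np.linspace(-1, 1, 101), 0.01 * np.sin(np.arange(101))])
anchors = np.array([[0.0, 0.0]])
covs, eigvals, eigvecs = local_covariances(data, anchors)
```

In the toy case the leading eigenvector points along x1, the direction in which the data actually vary.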

Clue #3: S(x) is locally linear about each anchor point
Trick #3: Construct linear (affine) Taylor-series mappings SL_i approximating S(x) about each A_i (NB: this doesn’t require polynomial SFA, just differentiable)
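
A first-order Taylor map needs only S(A_i) and the Jacobian of S at A_i. The sketch below, assuming NumPy, estimates the Jacobian by central differences so that it works for any differentiable S; the helper names and the step size `eps` are illustrative.

```python
import numpy as np

def numerical_jacobian(s_func, a, eps=1e-5):
    """Central-difference Jacobian of s_func at point a; shape (dim_S, dim_X)."""
    a = np.asarray(a, dtype=float)
    cols = []
    for j in range(a.size):
        step = np.zeros_like(a)
        step[j] = eps
        hi = np.atleast_1d(s_func(a + step))
        lo = np.atleast_1d(s_func(a - step))
        cols.append((hi - lo) / (2 * eps))
    return np.stack(cols, axis=1)

def local_linear_map(s_func, anchor):
    """Return SL_i(x) = S(A_i) + J_i (x - A_i), together with S(A_i) and J_i."""
    anchor = np.asarray(anchor, dtype=float)
    s_a = np.atleast_1d(s_func(anchor))
    jac = numerical_jacobian(s_func, anchor)
    return (lambda x: s_a + jac @ (np.asarray(x) - anchor)), s_a, jac

# The slides' 2-dim example, S1 = x1^2 + x2^2, linearized at a point on the unit circle.
s_func = lambda x: x[0] ** 2 + x[1] ** 2
sl, s_a, jac = local_linear_map(s_func, np.array([1.0, 0.0]))
approx = sl(np.array([1.01, 0.0]))
```

With polynomial SFA the Jacobian could also be written down analytically; finite differences just keep the sketch independent of the form of S.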

Clue #4: Covariance eigenvectors tell us about the local data manifold
Good news: Linear SL_i can be pseudo-inverted (SVD)
Bad news: We don’t want any old (x1,x2), we want (x1,x2) on the data manifold
Trick #4:
1. Get SVD pseudo-inverse: ΔX = SL_i^(-1) (S_new - S(A_i))
2. Then stretch ΔX onto manifold by multiplying by chopped* C_i
* Projection matrix, keeping only as many eigenvectors as dimensions of S
[Figure: ΔS = S_new - S(A_i) maps to ΔX, which is then stretched onto the manifold]
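
A sketch of Trick #4, assuming NumPy. It reads “chopped C_i” literally as the footnote describes it: a projection matrix built from the leading eigenvectors of C_i, as many as S has dimensions. That reading, the function names, and the toy numbers are assumptions; the slides may intend a different scaling.

```python
import numpy as np

def chopped_projection(cov, n_keep):
    """Projection onto the n_keep leading eigenvectors of a covariance matrix."""
    w, v = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = v[:, -n_keep:]         # eigenvectors with the largest eigenvalues
    return top @ top.T

def invert_locally(s_new, s_anchor, jac, cov):
    """Delta-X from the SVD pseudo-inverse, then projected onto the local data directions."""
    delta_s = np.atleast_1d(s_new) - np.atleast_1d(s_anchor)
    delta_x = np.linalg.pinv(jac) @ delta_s   # minimum-norm solution (pinv uses the SVD)
    return chopped_projection(cov, len(delta_s)) @ delta_x

# Toy case: locally S = x1 + x2, and the data vary almost only along x1.
jac = np.array([[1.0, 1.0]])
cov = np.diag([1.0, 0.01])
step = invert_locally(s_new=1.2, s_anchor=1.0, jac=jac, cov=cov)
```

The raw pseudo-inverse splits the step evenly between x1 and x2; after projection only the x1 component survives. Under this literal reading the projected step no longer reproduces ΔS exactly, so a rescaling may be needed in practice.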

Good news: Given A_i and C_i, we can invert S_new → X_new
Bad news: How do we choose which A_i and SL_i^(-1) to use?
[Figure: three candidate points, all with the same value of S_new]

Clue #5:
a) We need an anchor A_i such that S(A_i) is close to S_new
b) Need a “hint” of which anchors are close in X-space
Trick #5: Choose anchor A_i such that
– A_i is “close to” the hint AND
– S(A_i) is close to S_new
[Figure: S_new, S(A_i), close candidates, hint region]

All tricks together: Map local linear inverse about each anchor point
[Figure: local linear inverse maps; legend: anchors, S(A_i) neighbors]

Clue #6: The local data scatter can decide if a given point is probable (“on the manifold”) or not
Trick #6: Use Gaussian hyper-ellipsoid probabilities about closest A_i (this can tell if a prediction makes sense or not)
[Figure: probable vs. improbable points]

Estimated uncertainty, -log(P), increases away from anchor points
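
The confidence estimate can be sketched as the negative log of a Gaussian density centred on the chosen anchor, assuming NumPy. The small `ridge` term that keeps C_i invertible is an addition of this sketch, not something the slides mention.

```python
import numpy as np

def neg_log_prob(x, anchor, cov, ridge=1e-9):
    """-log of a Gaussian density centred on the anchor with covariance cov."""
    d = x.size
    c = cov + ridge * np.eye(d)             # hypothetical ridge keeps C_i invertible
    diff = x - anchor
    maha = diff @ np.linalg.solve(c, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(c)
    return 0.5 * (maha + logdet + d * np.log(2 * np.pi))

# Toy case: data scatter is wide along x1 and narrow along x2.
anchor = np.zeros(2)
cov = np.diag([1.0, 0.01])
on_manifold = neg_log_prob(np.array([0.5, 0.0]), anchor, cov)
off_manifold = neg_log_prob(np.array([0.0, 0.5]), anchor, cov)
```

Both test points are the same Euclidean distance from the anchor, but the one displaced across the narrow direction gets a much larger -log(P), i.e. it is flagged as improbable.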

Summary of SFA inverse/prediction method:
We have X(t-2), X(t-1), X(t)… we want X(t+1)
1. Calculate slow features S(t-2), S(t-1), S(t)
2. Extrapolate that trend linearly to S_new (NB: S varies slowly/smoothly in time)
3. Find candidate S(A_i)’s close to S_new, e.g. candidate i = {1, 16, 3, 7}
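
Step 2 can be sketched as a straight-line fit per slow feature, evaluated one step ahead. This assumes NumPy and a least-squares fit over the available history; the slides say only “linearly”, so a simple two-point extrapolation would be an equally valid reading.

```python
import numpy as np

def extrapolate_linear(s_history):
    """Fit a straight line to each slow feature over time and evaluate one step ahead."""
    s_history = np.asarray(s_history, dtype=float)       # shape (n_steps, dim_S)
    t = np.arange(len(s_history))
    slope, intercept = np.polyfit(t, s_history, deg=1)   # one fit per column
    return slope * len(s_history) + intercept

# S(t-2), S(t-1), S(t) for two slow features (made-up numbers).
s_new = extrapolate_linear([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
```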

Summary cont’d
4. Take X(t) as “hint,” and find candidate A_i’s close to it, e.g. candidate i = {8, 3, 5, 17}
5. Find “best” candidate A_i, whose index is high on both candidate lists:

S(A_i)’s close to S_new    A_i close to X(t)
i = 1                      i = 8
i = 16                     i = 3
i = 3                      i = 5
i = 7                      i = 17
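
Steps 3 to 5 can be sketched by ranking all anchors on both closeness criteria and picking the best combined rank. Summing the two ranks is an assumption of this sketch (the slides say only that the index should be high on both lists), and the function name and toy numbers are illustrative. NumPy is assumed.

```python
import numpy as np

def choose_anchor(s_new, x_hint, anchors, s_anchors):
    """Index of the anchor with the best combined rank on both closeness lists."""
    d_s = np.linalg.norm(s_anchors - s_new, axis=1)  # S(A_i) vs S_new
    d_x = np.linalg.norm(anchors - x_hint, axis=1)   # A_i vs the hint X(t)
    rank_s = d_s.argsort().argsort()                 # rank 0 = closest
    rank_x = d_x.argsort().argsort()
    return int(np.argmin(rank_s + rank_x))

# Toy case: anchor 1 is nearest in S-space but on the far side of X-space.
anchors = np.array([[1.0, 0.0], [-1.0, 0.0], [0.9, 0.3]])
s_anchors = np.array([[1.0], [1.02], [3.0]])
best = choose_anchor(np.array([1.03]), np.array([0.95, 0.05]), anchors, s_anchors)
```

In the toy case closeness in S alone would pick anchor 1; the hint steers the choice to anchor 0, which is nearly as close in S and much closer in X.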

6. Use chosen A_i and pseudo-inverse (i.e. SL_i^(-1) (S_new - S(A_i)) with SVD) to get ΔX
7. Stretch ΔX onto low-dim manifold using chopped C_i
8. Add stretched ΔX back onto A_i to get final prediction
[Figure: ΔX from S(A_i), stretched onto the manifold, then added to A_i]

9. Use covariance hyper-ellipsoids to estimate confidence in this prediction
[Figure: probable vs. improbable predictions]
This method uses virtually everything we know about the data; any improvements presumably would need further clues…
– Discrete sub-manifolds
– Discrete sequence steps
– Better nonlinear mappings

Next steps
Online learning
– Adjust anchor points and covariance as new data arrive
– Use weighted k-medoid clusters to mix in old with new data
Hierarchy
– Set output of one layer as input to next
– Enforce ever-slower features up the hierarchy
Test with more complex stimuli and natural movies
Let feedback from above modify slow feature polynomials
Find slow features in the unpredicted input (input - prediction)