
1 Latent Factor Models Geoff Gordon Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy

2 Motivation. A key component of a cognitive tutor is the student cognitive model, which tracks what skills the student currently knows (latent factors). [Diagram: nodes labeled circle-area, rectangle-area, decompose-area, right-answer.]

3 Motivation. Student models are a key bottleneck in cognitive tutor authoring and performance. Rough estimate: 20-80 hours to hand-code the model for 1 hour of content, and the result may be too simple and not rigorously verified. But there are demonstrated improvements in learning from better models. E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor with a more accurate model. This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis.

4 DataShop
  Subject area: transactions
  Math (total): 16.3 M (Algebra 11.2 M, Geometry 5.1 M)
  Language (total): 2.5 M (French 0.5 M, English 0.2 M, Chinese 1.8 M)
  Science (total): 3.3 M (Chemistry 1.2 M, Physics 2.1 M)
  Other (total): 3.2 M
  Total: 25.3 M
  Representing ~112K total hours across ~15K students.

5 Score of student i on item j. Simple case: snapshot, no side information.
        Items:  1 2 3 4 5 6 ...
  Student A:    1 1 0 0 1 0 ...
  Student B:    0 1 1 0 0 0 ...
  Student C:    1 1 0 1 1 0 ...
  Student D:    1 0 0 1 1 0 ...
  ...

6 Missing data
        Items:  1 2 3 4 5 6 ...
  Student A:    1 ? ? ? 1 0 ...
  Student B:    0 ? 1 0 ? ? ...
  Student C:    1 1 ? ? ? 0 ...
  Student D:    1 0 0 1 ? ? ...
  ...

7 Data matrix X. [Figure: matrix X with columns x_1, x_2, x_3, ..., x_n; axes labeled students and items.]

8 Simple case: model. U: student latent factors (unobserved). V: item latent factors (unobserved). X: observed performance. n students, m items, k latent factors.

9 Linear-Gaussian version. U: student factors, Gaussian (0 mean, fixed variance). V: item factors, Gaussian (0 mean, fixed variance). X: Gaussian (fixed variance, mean U_i ⋅ V_j). n students, m items, k latent factors.

10 Matrix form: Principal Components Analysis. Data matrix X ≈ U V^T, where X has columns x_1, ..., x_n, the compressed matrix U holds u_1, ..., u_n, and the basis matrix V^T holds the basis vectors v_1, ..., v_k.

11 PCA: the picture

12 PCA: matrix form. X ≈ U V^T: data matrix X, compressed matrix U, basis matrix V^T. The columns of V span the low-rank space.
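
To make the factorization concrete, here is a minimal sketch (not from the talk) of computing a rank-k decomposition X ≈ U V^T with a truncated SVD in numpy; the data and variable names are illustrative:

```python
import numpy as np

def pca_factor(X, k):
    """Rank-k factorization X ≈ U @ V.T, as on this slide.

    Rows of X are students (or images), columns are items (or pixels).
    """
    Xc = X - X.mean(axis=0)            # center each column (item)
    U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = U_svd[:, :k] * s[:k]           # compressed matrix: one row of weights per student
    V = Vt[:k].T                       # basis matrix: its columns span the low-rank space
    return U, V

# Example: 100 synthetic students on 30 items, 3 latent skills
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 30))
U, V = pca_factor(X, k=3)
print(np.allclose(X - X.mean(axis=0), U @ V.T))   # True: rank-3 data is recovered exactly
```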

13 Interpretation of factors. U holds basis weights (one row per student); V holds basis vectors (indexed by items). The basis vectors are candidate "skills" or "knowledge components"; the weights are students' knowledge levels.

14 PCA is a widely successful model. [Face images from Groundhog Day, extracted by the Cambridge face DB project.]

15 Data matrix: face images. [Figure: data matrix with one column x_1, ..., x_n per image; rows are pixels.]

16 Result of factoring. U holds basis weights (one row per image); V holds basis vectors over pixels. The basis vectors are often called "eigenfaces".

17 Eigenfaces. [Image credit: AT&T Labs Cambridge.]

18 PCA: the good. Unsupervised: needs no human labels of latent state! No worry about "expert blind spot". Of course, labels are helpful if available. Post-hoc human interpretation of the latents is nice too, e.g., for intervention design.

19 PCA: the bad. Linear and Gaussian: PCA assumes E(X) is linear in U and V, and that X - E(X) is i.i.d. Gaussian.

20 Nonlinearity: conjunctive skills. [Surface plot: P(correct) as a function of skill 1 and skill 2.]

21 Nonlinearity: disjunctive skills. [Surface plot: P(correct) as a function of skill 1 and skill 2.]

22 Nonlinearity: "other". [Surface plot: P(correct) as a function of skill 1 and skill 2.]
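
These surfaces can be given simple algebraic forms. Below is a hedged sketch of the two standard choices, conjunctive (all skills required, product of success probabilities) and disjunctive (any skill suffices, noisy-OR); the slides' exact surfaces are not specified, so the numbers are purely illustrative:

```python
import numpy as np

def p_correct_conjunctive(p_skills):
    """All skills required: succeed only if every skill is applied correctly."""
    return float(np.prod(p_skills))

def p_correct_disjunctive(p_skills):
    """Any skill suffices (noisy-OR): fail only if every skill fails."""
    return float(1.0 - np.prod(1.0 - np.asarray(p_skills)))

skills = [0.9, 0.6]   # per-skill success probabilities (skill 1, skill 2)
print(p_correct_conjunctive(skills))   # 0.54: lower than either skill alone
print(p_correct_disjunctive(skills))   # 0.96: higher than either skill alone
```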

23 Non-Gaussianity. Typical hand-developed skill-by-item matrix (rows are skills, columns are items):
        Items:  1 2 3 4 5 6 ...
  Skill 1:      1 1 0 0 1 1 ...
  Skill 2:      0 0 1 1 0 1 ...

24 Result of Gaussian assumption. [Plot: rows of the true and recovered V matrices.]

25 Result of Gaussian assumption. [Plot: rows of the true and recovered V matrices.]

26 The ugly: MLE only. PCA yields a maximum-likelihood estimate. Good, right? Sadly, the usual reasons to want the MLE don't apply here. E.g., consistency: the variance and bias of the estimates of U and V do not approach 0 (unless #items/student and #students/item go to infinity). Result: the MLE is typically far too confident of itself.

27 Too certain: example. [Plots: learned coefficients (e.g., a row of U) and the resulting predictions.]

28 Result: the "fold-in problem". Nonsensical results when trying to apply the learned model to a new student or item. Similar to the overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples. Unlike overfitting, the fold-in problem doesn't necessarily go away with more data.

29 Summary: 3 problems with PCA. Can't handle nonlinearity. Can't handle non-Gaussian distributions. Uses MLE only (hence the fold-in problem). Let's look at each problem in turn.

30 Nonlinearity. In PCA, we had X_ij ≈ U_i ⋅ V_j. What if X_ij ≈ exp(U_i ⋅ V_j), or X_ij ≈ sigmoid(U_i ⋅ V_j), ...?

31 Non-Gaussianity. In PCA, we had X_ij ~ Normal(μ), μ = U_i ⋅ V_j. What if X_ij ~ Poisson(μ), or X_ij ~ Binomial(p), ...?

32 Exponential family review. Exponential family of distributions: P(X | θ) = P_0(X) exp(X ⋅ θ - G(θ)). G(θ) is always strictly convex and differentiable on the interior of its domain, which means G' is strictly monotone (strictly generalized monotone in 2D or higher).

33 Exponential family review. Exponential family PDF: P(X | θ) = P_0(X) exp(X ⋅ θ - G(θ)). Surprising result: G'(θ) = g(θ) = E(X | θ). g and g^-1 are the "link function"; θ is the "natural parameter"; E(X | θ) is the "expectation parameter".

34 Examples. Normal (mean): g = identity. Poisson (log rate): g = exp. Binomial (log odds): g = sigmoid.
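
A small sketch of these mean functions g, mapping the natural parameter θ to E(X | θ) (illustrative code, not from the talk):

```python
import numpy as np

# g maps the natural parameter theta to the mean E(X | theta), as on the previous slide
g_normal   = lambda theta: theta                         # identity (theta is the mean)
g_poisson  = lambda theta: np.exp(theta)                 # theta = log rate
g_binomial = lambda theta: 1.0 / (1.0 + np.exp(-theta))  # sigmoid; theta = log odds

theta = 0.5
print(g_normal(theta), g_poisson(theta), g_binomial(theta))
# 0.5, ~1.649, ~0.622
```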

35 Nonlinear & non-Gaussian. Let P(X | θ) be an exponential family with natural parameter θ. Predict X_ij ~ P(X | θ_ij), where θ_ij = U_i ⋅ V_j. E.g., for the Poisson, E(X_ij) = exp(θ_ij); for the Binomial, E(X_ij) = sigmoid(θ_ij).

36 Optimization problem: max over U, V of Σ_ij log P(X_ij | θ_ij) + log P(U) + log P(V), s.t. θ_ij = U_i ⋅ V_j. This is "generalized linear" or "exponential family" PCA: all P(⋅) terms are exponential families, by analogy to GLMs. [Collins et al., 2001] [Gordon, 2002] [Roy & Gordon, 2005]

37 Special cases: PCA and probabilistic PCA; Poisson PCA; k-means clustering; max-margin matrix factorization (MMMF). Almost: pLSI, pHITS, NMF.

38 Comparison to AFM. AFM models logit(p_ij) = θ_i + Σ_k Q_jk (β_k + γ_k T_ik), where p = probability correct, θ = student overall performance, β = skill difficulty, Q = item x skill matrix, γ = skill practice slope, and T = number of practice opportunities.
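
For reference, a minimal sketch of an AFM prediction in the standard additive-factor form; the parameter values are hypothetical and the function name is mine:

```python
import numpy as np

def afm_p_correct(theta_i, beta, gamma, q_j, T_i):
    """AFM: logit(p) = theta_i + sum_k q_jk * (beta_k + gamma_k * T_ik).

    theta_i : overall performance of student i
    beta    : per-skill difficulty/easiness parameters
    gamma   : per-skill practice slopes
    q_j     : row of the Q (item x skill) matrix for item j
    T_i     : student i's practice-opportunity counts for each skill
    """
    logit = theta_i + np.dot(q_j, beta + gamma * T_i)
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical numbers: item j uses skills 1 and 3 of 3
print(afm_p_correct(theta_i=0.2,
                    beta=np.array([-0.5, 0.1, 0.3]),
                    gamma=np.array([0.15, 0.05, 0.10]),
                    q_j=np.array([1, 0, 1]),
                    T_i=np.array([4, 0, 2])))
```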

39 Theorem. In GL PCA, finding the U which maximizes likelihood (holding V fixed) is a convex optimization problem, and finding the best V (holding U fixed) is a convex problem. Further, the Hessian is block diagonal. So we get an efficient and effective optimization algorithm: alternately improve U and V.
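
A minimal sketch of this alternate-and-improve scheme for the Bernoulli (logistic) case, using plain gradient steps on U and then V; this is an illustration of the idea, not the optimizer used in the talk, and the learning rate, regularization, and iteration count are arbitrary:

```python
import numpy as np

def logistic_epca(X, k, iters=200, lr=0.05, reg=0.1, seed=0):
    """Alternately improve U then V on the penalized Bernoulli log-likelihood.

    X is an n x m 0/1 matrix; theta_ij = U_i . V_j; E(X_ij) = sigmoid(theta_ij).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # predicted means
        G = X - P                              # gradient of log-likelihood w.r.t. theta
        U += lr * (G @ V - reg * U)            # improve U with V held fixed
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))
        G = X - P
        V += lr * (G.T @ U - reg * V)          # improve V with U held fixed
    return U, V

# Toy data: a noisy rank-2 binary matrix
rng = np.random.default_rng(1)
theta_true = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 25))
X = (rng.random((40, 25)) < 1 / (1 + np.exp(-theta_true))).astype(float)
U, V = logistic_epca(X, k=2)
```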

40 Example: compressing histograms with Poisson PCA. Points: observed frequencies in ℝ^3. Hidden manifold: a 1-parameter family of multinomials. [Figure: simplex with corners labeled A, B, C.]

41 Example: iteration 1

42 Example: iteration 2

43 Example: iteration 3

44 Example: iteration 4

45 Example: iteration 5

46 Example: iteration 9

47 Remaining problem: MLE. A well-known rule of thumb: if the MLE gets you in trouble due to overfitting, move to fully Bayesian inference. The typical problem is computation; in our case, the computation is just fine if we're a little clever. Additional wrinkle: switch to a hierarchical model.

48 Bayesian hierarchical exponential-family PCA. U: student latent factors; V: item latent factors; X: observed performance; R: shared prior for student latents; S: shared prior for item latents. n students, m items, k latent factors; only X is observed.

49 A little clever: MCMC. [Figure: latent variable Z and P(X).]
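
To illustrate why the computation can be kept cheap, here is a hedged Metropolis-within-Gibbs sketch that resamples one student's latent row U_i given V under a Bernoulli likelihood and Gaussian prior; it is not the sampler from the talk, just an indication of what one conditional update involves:

```python
import numpy as np

def log_post_row(u_i, x_i, V, prior_var=1.0):
    """log P(u_i | x_i, V) up to a constant: Bernoulli likelihood plus Gaussian prior."""
    theta = V @ u_i                                   # natural parameters for student i's row
    loglik = np.sum(x_i * theta - np.log1p(np.exp(theta)))
    logprior = -0.5 * np.dot(u_i, u_i) / prior_var
    return loglik + logprior

def metropolis_step_row(u_i, x_i, V, rng, step=0.1):
    """One random-walk Metropolis update of student i's latent factors, V held fixed."""
    proposal = u_i + step * rng.standard_normal(u_i.shape)
    log_accept = log_post_row(proposal, x_i, V) - log_post_row(u_i, x_i, V)
    return proposal if np.log(rng.random()) < log_accept else u_i

# Tiny usage: 6 items, 2 latent factors
rng = np.random.default_rng(0)
V = rng.standard_normal((6, 2))
x_i = np.array([1., 1., 0., 0., 1., 0.])
u_i = np.zeros(2)
for _ in range(100):
    u_i = metropolis_step_row(u_i, x_i, V, rng)
```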

50 Experimental comparison. Geometry Area 1996-1997 data. Geometry tutor: 139 items presented to 59 students; on average, each student was tested on 60 items.

51 Results: hold-out error. [Plot; embedding dimension for *EPCA is K = 15. Credit: Ajit Singh.]
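
The talk reports hold-out error; as a generic illustration (not necessarily the exact protocol or metric used), one can hide a fraction of observed entries, fit on the rest, and score predictions on the hidden cells:

```python
import numpy as np

def holdout_split(X, frac=0.2, seed=0):
    """Hide a random fraction of entries; NaN marks cells the model must not see."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < frac          # True = held out
    X_train = np.where(mask, np.nan, X)
    return X_train, mask

def holdout_error(X, P, mask):
    """Mean absolute error of predicted probabilities P on the held-out cells."""
    return np.abs(X[mask] - P[mask]).mean()
```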

52 Extensions: relational models and temporal models.

53 Relational models
  Students x items:
        Items: 1 2 3 4 5 6
    john:      1 1 0 0 1 0
    sue:       0 1 1 0 0 0
    tom:       1 1 0 1 1 0
  Tags x items:
        Items: 1 2 3 4 5 6
    trig:      1 1 0 0 1 0
    story:     0 1 1 0 0 0
    hard:      1 1 0 1 1 0

54 Relational hierarchical Bayesian exponential-family PCA. X, Y: observed data; U: student latent factors; V: item latent factors; Z: tag latent factors; R, S, T: shared priors. n students, m items, p tags, k latent factors. X ≈ f(U V^T), Y ≈ g(V Z^T).
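
The structural point is that the item factors V are shared between the two factorizations. Here is a hedged sketch of the joint log-likelihood, treating both relations as Bernoulli for simplicity (the talk's f and g could be other exponential-family links):

```python
import numpy as np

def relational_loglik(U, V, Z, X, Y):
    """Joint Bernoulli log-likelihood for X ~ f(U V^T) and Y ~ g(V Z^T), sharing V.

    X: students x items; Y is given here as tags x items, so it is scored
    against Z V^T (the transpose of the slide's V Z^T).
    """
    theta_X = U @ V.T                              # student-item natural parameters
    theta_Y = Z @ V.T                              # tag-item natural parameters
    ll_X = np.sum(X * theta_X - np.log1p(np.exp(theta_X)))
    ll_Y = np.sum(Y * theta_Y - np.log1p(np.exp(theta_Y)))
    return ll_X + ll_Y                             # V's gradient gets terms from both relations
```

Because V appears in both terms, evidence from the tag relation shapes the item factors used for student predictions.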

55 Example: brain imaging. 2000 dictionary words, 60 stimulus words, 500 brain voxels. X = co-occurrence of (dictionary word, stimulus word) on the web; Y = activation of a voxel when presented with a stimulus. Task: predict X. [Bar chart: mean squared error for EPCA, H-EPCA, HB-EPCA and their relational versions. Credit: Ajit Singh.]

56 fMRI data. The subject reads a word on screen and thinks about it for 15–30 s. fMRI measures blood flow in ~16,000 voxels of a few mm^3 each, at ~1 Hz.

57 fMRI data. Blood flow is a proxy for energy consumption, which is a proxy for amount of activity. But it is delayed and time-averaged (4–10 s), and we further average over time to reduce noise.

58 Example data. Slice 2 is at the bottom, slice 15 at the top; the front of the brain is at the bottom of each slice, and the subject's left is at the left of each slice.

59 Image factorization. Express each image in terms of a basis of "eigenimages". Basis images capture spatial patterns of activity over many voxels. [Figure: image ≈ weighted combination of basis images.]

60 Temporal models. So far: latent factors of students and content, e.g., knowledge components (for a student: skill at the KC; for a problem: need for the KC), or student affect. But only a limited idea of evolution through time, e.g., fixed-structure models: proficiency = a + b x, where x = # practice opportunities, a = initial skill level, b = skill learning rate.

61 Temporal models. For evolving factors, we expect far better results if we learn about time explicitly: learning curves, gaming state, affective state, motivational state, self-efficacy, ... [Graphical model: for each transaction t = 1, 2, 3, ..., a latent state X_t, observed properties of the transaction Y_t, and instructional decisions U_t; the latent state evolves from one transaction to the next.]
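
A minimal sketch of the kind of temporal latent-state model this diagram describes, written as a linear-Gaussian dynamical system with inputs; the talk does not commit to this specific parametric form here, so treat it as illustrative:

```python
import numpy as np

def simulate_lds(A, B, C, x0, inputs, noise=0.1, seed=0):
    """Latent state x_t evolves with instructional decisions u_t:
    x_{t+1} = A x_t + B u_t + noise; the observed transaction properties
    are y_t = C x_t + noise."""
    rng = np.random.default_rng(seed)
    x, states, obs = x0, [], []
    for u in inputs:
        y = C @ x + noise * rng.standard_normal(C.shape[0])
        states.append(x)
        obs.append(y)
        x = A @ x + B @ u + noise * rng.standard_normal(x.shape)
    return np.array(states), np.array(obs)

# Tiny example: 2-D latent state, one scalar decision, 3 observed properties per transaction
A = np.array([[0.95, 0.10], [0.00, 0.90]])
B = np.array([[0.20], [0.05]])
C = np.random.default_rng(1).standard_normal((3, 2))
states, obs = simulate_lds(A, B, C, x0=np.zeros(2), inputs=[np.array([1.0])] * 10)
```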

62 Example: Bayesian Evaluation & Assessment [Beck et al., 2008]. [Graphical model with latent state, properties of transactions, and instructional decisions.]

63 The hope. Fit a temporal model; examine the learned parameters and latent states; discover important evolving factors which affect performance (learning curve, affective state, gaming state, ...); discover how they evolve.

64 The hope. Reduce assumptions about what the factors are. Explore a wider variety of models. Model search guided by data → discover factors we might otherwise have missed.

65 Walking: original data. [Thanks: Byron Boots, Sajid Siddiqi.]

66 Walking: original data. [Thanks: Byron Boots, Sajid Siddiqi. Graphical model: for each transaction t, a latent state X_t, observed joint angles Y_t, and desired direction U_t.]

67 Walking: learned model

68 Steam: original data

69 [Graphical model: for each transaction t, a latent state X_t and observed pixels Y_t; there are no inputs U_t (empty).]

70 Steam: learned model

