
1 Towards Assessing Students’ Fine Grained Knowledge: Using an Intelligent Tutor for Assessing
Mingyu Feng, August 18th, 2009. (Where are these people from? Put on the CMU logo; say Joe's and Ken's first names, they are on my committee.) Ph.D. Dissertation Committee: Prof. Neil T. Heffernan (WPI), Prof. Carolina Ruiz (WPI), Prof. Joseph E. Beck (WPI), Prof. Kenneth R. Koedinger (CMU)

2 Motivation – the need. Concerns about poor student performance on new state tests: high-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act, and student performance is not satisfactory (Massachusetts, 2003: 20% failed 10th grade math on the first try; Worcester). Secondary teachers are asked to be data-driven: MCAS test reports, formative assessments and practice tests, provided by the Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc. In many states there are concerns about poor student performance on the new high-stakes standards-based tests required by the No Child Left Behind Act. In 2003, 20% of students failed the 10th grade math test on the first try; in Worcester, an industrial city, 1 out of 5 seniors failed that year. Partly because of this pressure, secondary schools seek to use assessment data in a data-driven manner to provide regular and ongoing feedback to teachers and students on progress towards instructional objectives. MCAS reports have been used extensively for this purpose. There has been intense interest in "formative assessment" in K-12 education and in predicting student performance on end-of-year tests, with many companies providing practice-test services. Many teachers make extensive use of these practice tests and released test problems to help identify learning deficits.

3 Motivation – the problems
I: Formative assessment steals time from instruction. NCLB or NCLU (No Child Left Untested)? Every hour spent assessing students is an hour lost from instruction, and limited classroom time compels teachers to make a choice. However, accompanying the great need, there are problems with formative assessment data and the reports that are currently provided. As just mentioned, some teachers make extensive use of these practice tests and released test problems to help identify learning deficits for individual students and the class as a whole. At the same time, critics of the No Child Left Behind legislation are calling the bill "No Child Left Untested". Among other things, critics point out that every hour spent assessing students is an hour lost from instruction. Because assessment takes time away from instruction, how can teachers be sure that the time they spend on assessing will improve instruction enough to justify the cost of the lost instructional time?

4 Motivation – the problems
II: Performance reports are not satisfactory. Teachers want more frequent, and more detailed, reports. One issue with the performance reports is that they do not provide enough cognitive diagnostic information. For instance, although the number of skills may well be on the order of hundreds, MCAS reports in only 5 categories. (Show school report.) In fact, Confrey and colleagues conducted a detailed analysis of state tests in Texas and concluded that such topic reporting is not reliable; thus, a teacher cannot trust that putting more effort into a particular low-scoring area will indeed pay off in the next round. To get some intuition on why this is the case, I encourage all of you to try problem 19 from the 2003 MCAS test, then ask yourself "What are the important things that make this item difficult?" Clearly, this item includes elements from Algebra (equation solving), Geometry (congruence), and Measurement (perimeter). Ignoring this obvious overlap, the state chose just one strand, Geometry, to classify the item, which might also be most people's first instinct. However, we have found evidence that there is more to this problem. Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action.

5 Main Contributions. Propose a novel approach that assesses better by taking into account how much assistance students need (WWW'06; ITS'06; EDM'08; UMUAI Journal'09). Establish a way to track and predict performance longitudinally (WWW'06). Rigorously evaluate the effectiveness of skill models of various granularities (AAAI'06 EDM Workshop; TICL'07; IEEE Journal'09). Propose using a data mining approach to evaluate the effectiveness of individual content (AIED'09). Propose using data mining results to help refine existing skill models (EDM'09; in preparation). An online reporting system deployed and used by real teachers (AIED'05; Book chapter'07; TICL Journal'06; JILR Journal'07). Towards solving the previously listed problems, this dissertation makes the following contributions. I propose a novel approach that assesses better by taking into account how much assistance students need. This work is novel because traditional assessment usually focuses on students' responses to test items and whether they are answered correctly or incorrectly, but ignores all other student behaviors during the test (e.g., response time). My results show that the model based solely upon assistance information predicts reliably better than the model based only upon correctness. Not only can we hit a moving target, but we can also do it over time; I argue this is novel since no existing systems longitudinally track student knowledge over time and use that to predict students' state test scores. I also demonstrate the value of a very fine-grained model versus more coarse-grained models.

6 Roadmap Motivation Contributions Background - ASSISTments
Using tutoring system as an assessor Dynamic assessment Longitudinal modeling Cognitive diagnostic modeling Conclusion & general implications

7 ASSISTments System. A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students' progress. Teachers like ASSISTments; students like ASSISTments. ASSISTments is the project that the dissertation is based upon. I have been working on the ASSISTment project since 2004, when we first started to build the system. Teachers like using such a system: they can have students practice on MCAS items, save their grading time, and get feedback. Students like the system not only because they get more confident about the MCAS after practice, but also because they can get away from the "boring" sit-in classroom.

8 An ASSISTment. We break multi-step items (original questions) into scaffolding questions. Attempt: the student takes an action to answer a question. Response: the correctness of the student's answer (1/0). Hint messages: given on demand; they give hints about what step to do next. Buggy message: a context-sensitive feedback message. Skill: a piece of knowledge required to answer a question. Now let's take a closer look at how an ASSISTment works (try to do it online). There are more than 1000 items like this in the system now.
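To make these terms concrete, here is a minimal sketch of how an ASSISTment-style item could be represented in code. The class names, fields, and the example item are purely illustrative assumptions, not the actual ASSISTments data model.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data model for an ASSISTment-style item; not the real system schema.
@dataclass
class Question:
    text: str
    answer: str
    skills: List[str]                                    # knowledge components required to answer
    hints: List[str] = field(default_factory=list)       # on-demand hints; the last is the bottom-out hint
    buggy_messages: dict = field(default_factory=dict)   # common wrong answer -> feedback message

@dataclass
class Assistment:
    original: Question                                   # the multi-step original question
    scaffolds: List[Question] = field(default_factory=list)  # one-skill scaffolding questions

# Hypothetical example loosely modeled on the item 19 discussion earlier.
item19 = Assistment(
    original=Question("Perimeter / congruence / equation item", "10",
                      skills=["congruence", "perimeter", "equation-solving"]),
    scaffolds=[Question("Which sides are congruent?", "AB and DE", skills=["congruence"])],
)
print(len(item19.original.skills), len(item19.scaffolds))
```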

9 Facts about ASSISTments
5000+ students have used the system regularly. More than 10 million data records collected. Other features: learning experiments; authoring tools; account and class management toolkit; ... In the past school year, students used the system regularly as a part of their math class. While they work in the system, the background logging system collects data on student actions and stores it in the database. More than 10 million data records have been collected since 2004. There are other interesting features I won't talk about. The large amount of data provides an ample resource for my dissertation work. All the studies in this work were done using data sets from the 2004-2005 and 2005-2006 school years; more than 1000 students are involved. AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education. Amsterdam: IOS Press. Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds.), Intelligent Educational Machines, Intelligent Systems Engineering Book Series. Springer Berlin / Heidelberg.

10 Roadmap Motivation Contributions Background - ASSISTments
Using tutoring system as an assessor Dynamic assessment Longitudinal modeling Cognitive diagnostic modeling Conclusion & general implications

11 Where does this score come from?
A Grade Book Report. Where does this score come from? Here is a report I built, called the grade book, that has been frequently used by teachers. One of our collaborating teachers liked sitting in front of the computer and kept hitting the refresh button while his students were working in the system. We can see in this report one student, Tom. When the report was developed back in 2004, I presented to teachers an estimate of students' "expected" MCAS test scores. Teachers liked it, and the correlation with actual MCAS scores was 0.7. The score was based solely on the student's average percent correct on the original questions. There are some problems with it. See another kid, Jack, who was predicted to have the same score but obviously asked for far fewer hints than Tom; intuitively, we should pay attention to this and distinguish the performance levels of the two students. Another issue is that it is an average, so it ignores development over time. The third problem is that the MCAS score estimate is uninformative for teachers' classroom instruction. JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2). Chesapeake, VA: AACE. TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA.

12 Automated assessment. Big idea: use data collected while a student uses ASSISTments to assess him. Lots of types of data are available (the last screen just used % correct on original questions). Lots of other possible measures. Why should we be more complicated?

13 A Grade Book Report Static – does not distinguish “Tom” and “Jack”
Static – does not distinguish "Tom" and "Jack" (addressed by dynamic assessment). Average – ignores development over time (addressed by longitudinal modeling). Uninformative – not informative for classroom instruction (addressed by cognitive diagnostic assessment).

14 Dynamic Assessment – the idea
The contrast between the help-seeking behavior of Tom and Jack reflects the idea of dynamic testing. The idea is not new: back in 1983, Brown and colleagues compared traditional testing paradigms against a dynamic testing paradigm. In the dynamic testing paradigm, a student would be presented with an item and, when the student appeared not to be making progress, would be given a prewritten hint. That was even before computerized testing. Such detailed assistance information and performance during a tutoring session is normally not available in traditional practice tests. However, a computerized tutoring system has the potential to use far more: there is rich data about the nature and amount of help that the student was given, which I hypothesized would be of great value in judging a student's mastery of knowledge. So I computerized Brown's idea; actually, they suggested doing it this way, but they did not do it themselves. Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983). Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children's learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.

15 Dynamic vs. Static Assessment
Developing dynamic testing metrics: # attempts; # minutes to come up with an answer; # minutes to complete an ASSISTment; # hint requests; # hint-before-attempt requests; # bottom-out hints; % correct on scaffolds; # problems solved. "Static" measure: correct/wrong on the original questions. I developed groups of "dynamic" testing metrics to supplement accuracy data (wrong/right scores). The first group is the number of attempts a student needs to finally get a correct answer, which indicates response efficiency. The second group has to do with students' response time and problem-solving time. The third group is about how often they ask for a hint, how often they request a hint before even making an attempt, and how often they reach the bottom-out hints (the last hint in a sequence, which essentially gives away the answer); these metrics on help-seeking behavior reflect how much assistance a student needs to complete a problem. % correct on scaffolds captures performance on each single step, and attendance is reflected by the number of problems solved. These are called "dynamic" assessment metrics, in contrast to the "static" measure of correct or wrong on the original questions, because they depend on individual students' behaviors while they interact with the system.
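A minimal sketch of how such metrics could be computed from tutor logs. The Action record and its field names are hypothetical, not the real ASSISTments log schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical log record for a single student action.
@dataclass
class Action:
    student: str
    item: str
    kind: str          # "attempt", "hint", or "bottom_out_hint"
    correct: bool = False
    seconds: float = 0.0

def dynamic_metrics(actions):
    """Aggregate per-student dynamic assessment metrics from raw log actions."""
    stats = defaultdict(lambda: {"attempts": 0, "hints": 0, "bottom_out": 0,
                                 "minutes": 0.0, "problems": set()})
    for a in actions:
        s = stats[a.student]
        s["minutes"] += a.seconds / 60.0
        s["problems"].add(a.item)
        if a.kind == "attempt":
            s["attempts"] += 1
        elif a.kind == "hint":
            s["hints"] += 1
        elif a.kind == "bottom_out_hint":
            s["hints"] += 1
            s["bottom_out"] += 1
    # Report the number of distinct problems worked on rather than the set itself.
    return {k: {**v, "problems": len(v["problems"])} for k, v in stats.items()}

# Example usage with two fabricated actions.
log = [Action("tom", "item19", "attempt", correct=False, seconds=40),
       Action("tom", "item19", "hint", seconds=10)]
print(dynamic_metrics(log))
```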

16 Dynamic Assessment – data
Data, Sept. 2004 – May 2005: 391 students; online data: 267 minutes (sd. = 79), 9 days, 147 items (sd. = 60); 8th grade MCAS scores (May 2005). Data, Sept. 2005 – May 2006: 616 students; 196 minutes (sd. = 76), 6 days, 88 items (sd. = 42); 8th grade MCAS scores (May 2006).

17 Dynamic Assessment - modeling
Three linear stepwise regression models. The standard test model: 1-parameter IRT proficiency estimate → MCAS score. The assistance model: all online metrics → MCAS score. The mixed model: 1-parameter IRT proficiency estimate + all online metrics → MCAS score. I built 3 linear regression models to predict students' actual MCAS scores using the metrics described above. In all 3 models the dependent variable is always the MCAS score, but the independent variables differ. For the standard test model, I train a 1-parameter item response model and use the estimated student proficiency as the independent variable. The assistance model is special in that it does not use any assessment information on the original questions, only the online metrics described above. People might laugh: what are you people thinking? How could a model with no assessment information be useful for testing purposes? But I built the model to investigate the predictive power of the dynamic metrics. The two models are combined in the mixed model. (1-parameter IRT: one-parameter item response theory model.)
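A sketch of the three-model contrast using ordinary least squares from statsmodels on synthetic stand-in data; the feature names are illustrative and the stepwise selection step is omitted for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data; column names are illustrative, not the real ASSISTments features.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "irt_theta": rng.normal(size=n),          # 1-PL IRT proficiency estimate
    "hints": rng.poisson(3, size=n),          # hint requests
    "attempts": rng.poisson(2, size=n),       # attempts per item
    "scaffold_pct_correct": rng.uniform(0, 1, size=n),
})
df["mcas"] = (30 + 8 * df["irt_theta"] - 1.5 * df["hints"]
              - 1.0 * df["attempts"] + 10 * df["scaffold_pct_correct"]
              + rng.normal(scale=4, size=n))

def fit_linear(predictors):
    """OLS with an intercept; returns the fitted model."""
    X = sm.add_constant(df[predictors])
    return sm.OLS(df["mcas"], X).fit()

online = ["hints", "attempts", "scaffold_pct_correct"]
models = {"standard": fit_linear(["irt_theta"]),        # correctness-based proficiency only
          "assistance": fit_linear(online),             # dynamic/assistance metrics only
          "mixed": fit_linear(["irt_theta"] + online)}  # both

for name, m in models.items():
    print(f"{name:10s}  R2={m.rsquared:.3f}  BIC={m.bic:.1f}")
```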

18 Dynamic Assessment - evaluation
Bayesian Information Criterion (BIC): a widely used model selection criterion; it addresses overfitting by introducing a penalty term for the number of parameters. Formula: BIC = k·ln(n) − 2·ln(L), where L is the maximized likelihood, k the number of parameters, and n the number of observations. The model with the lower BIC is preferred. Mean Absolute Deviation (MAD): the cross-validated prediction error, MAD = (1/n)·Σ|actual − predicted|. The model with the lower MAD is preferred. When estimating model parameters using maximum likelihood estimation, it is possible to increase the likelihood by adding additional parameters, which may result in overfitting. The BIC resolves this problem by introducing a penalty term for the number of parameters in the model; this penalty is stronger than that of the AIC. Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25.
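A small sketch of both criteria, with a leave-one-out loop for the cross-validated MAD; the plain least-squares helpers are only there to keep the example self-contained, and are not the dissertation's actual fitting procedure.

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """BIC = k*ln(n) - 2*ln(L); the lower, the better."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood

def loo_mad(X, y, fit_fn, predict_fn):
    """Leave-one-out mean absolute deviation between actual and predicted scores."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit_fn(X[keep], y[keep])                   # refit without student i
        errors.append(abs(y[i] - predict_fn(model, X[i:i + 1])[0]))
    return float(np.mean(errors))

# Minimal least-squares helpers so the example runs without extra libraries.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=30)
print("LOO MAD:", round(loo_mad(X, y, fit_ols, predict_ols), 3))
```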

19 Dynamic Assessment - results
The standard test model (1-PL IRT proficiency estimate), the assistance model (all online metrics), and the mixed model (1-PL IRT proficiency estimate + all online metrics). Here are the results of the modeling process. The predicted scores are adjusted scores, generated by SPSS leave-one-out cross-validation. The primary contrast is between the assistance model and the standard test model. The interesting, maybe shocking, finding is that the assistance model, which does not use assessment data on the original questions but only features reflecting students' assistance requirements, effort, attendance, etc., makes significantly better predictions than the standard test model that is based on the assessment results alone (P: t-test). Traditional assessment usually focuses on whether students answer test items correctly or incorrectly, but ignores all other student behaviors during the test; this result suggests there is great value in those behaviors, probably even more than in the correctness data. It is not surprising that, when we put the two models together, the mixed model did best at predicting MCAS scores. The Bayesian information criterion (BIC) is a criterion for model selection among a class of parametric models with different numbers of parameters; when estimating parameters using maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in overfitting, and the BIC resolves this by introducing a penalty term for the number of parameters. R2 is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information; it is the proportion of variability in a data set that is accounted for by the statistical model, and provides a measure of how well future outcomes are likely to be predicted by the model. There are several definitions of R2 which are only sometimes equivalent; in linear regression, R2 is simply the square of the sample correlation coefficient between the outcomes and their predicted values (or, in simple linear regression, between the outcome and the value used for prediction), so the values vary from 0 to 1. In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (Max MCAS score: 54)

20 Dynamic Assessment - results
Which metrics are selected in the assistance model? Cross-validation suggests this approach is robust: same year and cross-year. Stepwise regression was used; all coefficients are significant; note the negative coefficients: the more attempts students make, the longer they need to finish a problem, and the more help they ask for, the lower their predicted score will be. (Show the assistance model.) More recently, I cross-validated the models by splitting data from the same year into training and testing parts, and by using data from two different years. It turned out the models were robust: even though the models constructed from data from different years are not quite the same, they are both quite predictive of students' end-of-year exam scores. I wrote an article on that; it was accepted by the User Modeling journal and was well received by the reviewers and editor. In statistics, standardized coefficients or beta coefficients are the estimates resulting from an analysis performed on variables that have been standardized so that they have variances of 1. This is usually done to answer the question of which of the independent variables have a greater effect on the dependent variable in a multiple regression analysis, when the variables are measured in different units of measurement (for example, income measured in dollars and family size measured in number of individuals).

21 Compare Models from Two Years
Mixed model coefficients by stepwise entry order, 04-05 data vs. 05-06 data:
(Constant): 32.414 / 3.284
1. IRT_Proficiency_Estimate: 26.800 / 32.944
2. Scaffold_Percent_Correct: 20.427 / 21.327
3. Avg_Question_Time -0.170; Question_Count 0.072
4. Avg_Attempt -0.102
5. Avg_Hint_Request -3.217; Avg_Item_Time 0.045
6. Total_Attempt -0.044
Which metrics are stable across years?

22 Dynamic Assessment - conclusion
ASSISTments data enables us to assess more accurately. The relative success of the assistance model over the standard test model highlights the power of the dynamic measures. Is this a fair comparison? In this work, I addressed the assessment challenge of ASSISTments by mining the log data. 1. The online assessment system enables us to do a better job of predicting student knowledge: we took advantage of the computer system to collect interaction data on how much tutoring assistance was needed, how fast a student solves a problem, and how many attempts were needed to finish a problem. 2. We got the best prediction when we put together assessment data and assistance data; but the relative success of the assistance model over the standard test model highlights the power of the dynamic measures. Not only is it possible to get good test information while "teaching on the test", data from the teaching process can actually help improve prediction accuracy. 3. A critic may argue that it is not fair to have the standard test model as the contrast case, since students were not spending all their time on assessment. Whether or not the ASSISTments system would yield better predictions than a tougher contrast case, where students spend on-line time only on assessment and not on instruction, is an open question worthy of further research. However, I would remind the critic that such a contrast would leave out the instructional benefit of the ASSISTments system and, moreover, might not be as well received by teachers and students. Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. New York, NY: ACM Press. Best Student Paper Nominee. Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal), 19(3), 2009.

23 Roadmap Motivation Contributions Background - ASSISTments
Using tutoring system as an assessor Dynamic assessment Longitudinal modeling Cognitive diagnostic modeling Conclusion & general implications As mentioned before, my solution to the "average" issue in the grade book report is longitudinal modeling. Because of time, I won't be talking about it here; please refer to my dissertation for details, or ask me questions later.

24 Can we have our cake and eat it, too?
Most large standardized tests are unidimensional or low-dimensional. Yet teachers need fine-grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie & Ciofalo, 2008; Stiggins, 2005). Can we have our cake and eat it, too? As a matter of fact, most large standardized tests are unidimensional or low-dimensional, as if they are sampling one or only a few knowledge components. The dynamic assessment work shows that we can do a good job of predicting students' total scores. However, in addition to overall performance, teachers also want a detailed analysis of their students' knowledge. Instead of performance reports that break math knowledge into only a few components, teachers want more fine-grained diagnostic reports to inform their everyday classroom practice; this is called "assessment for learning". Can we have our cake and eat it, too? That is, can we have a good overall prediction of a high-stakes test while at the same time being able to tell teachers meaningful information about fine-grained knowledge components? In the following part, I explore this question. Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY. Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. Retrieved October 13, 2008. Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4).

25 Cognitive Diagnostic Assessment
McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring. Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis. Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities? A lot of people care about this topic of building cognitive models. In particular, Gierl, Wang & Zhou (2008) proposed that one direction for future research on cognitive assessment is to increase understanding of how to specify an appropriate grain size or level of analysis with a cognitive diagnostic assessment. In this dissertation, I evaluate the effect of model granularity on understanding student knowledge development by examining how models of different granularities do at predicting external test scores. I won't go into all the details of the approach here. Intercept: incoming knowledge on a skill; slope: learning rate of a skill. Given these, I can estimate students' scores. McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E. and McCalla, G. I. (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction. Springer-Verlag, Berlin. Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).

26 Building Skill Models: Math (WPI-1) → 5 strands (WPI-5) → 39 learning standards (WPI-39) → fine-grained skills (WPI-78).
WPI-5 strands: Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense and Operations; Data Analysis, Statistics and Probability. WPI-39 examples: Using-measurement-formulas-and-techniques; Setting-up-and-solving-equation; Understanding-pattern; Understanding-data-presentation-techniques; Understanding-and-applying-congruence-and-similarity; Converting-from-one-measure-to-another; Understanding-number-representations. WPI-78 examples: Ordering-fractions; Equation-solving; Equation-concept; Inducing-function; Plot-graph; XY-graph; Congruence; Similar-triangles; Perimeter; Area; Circle-graph; Unit-conversion; Equivalent-Fractions-Decimals-Percents. As the first step, with help from subject-matter experts, we developed a fine-grained model of 106 knowledge components. We built four cognitive models of different granularities, including a unidimensional model and the fine-grained model developed at WPI. We started by building the fine-grained model; after that, we took the 39 learning standards in the MA curriculum framework and associated each fine-grained skill with one learning standard. The 39 learning standards are then nested inside the 5 strands, using the names used by the National Council of Teachers of Mathematics. At the top of the tree is the unidimensional model that has been used in most traditional tests.

27 Building Skill Models: Math (WPI-1) → 5 strands (WPI-5) → 39 learning standards (WPI-39) → fine-grained skills (WPI-78).
WPI-5 strands: Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense and Operations; Data Analysis, Statistics and Probability. WPI-39 examples: Using-measurement-formulas-and-techniques; Setting-up-and-solving-equation; Understanding-pattern; Understanding-data-presentation-techniques; Understanding-and-applying-congruence-and-similarity; Converting-from-one-measure-to-another; Understanding-number-representations. WPI-78 examples: Ordering-fractions; Equation-solving; Equation-concept; Inducing-function; Plot-graph; XY-graph; Congruence; Similar-triangles; Perimeter; Area; Circle-graph; Unit-conversion; Equivalent-Fractions-Decimals-Percents. More than 1000 questions in the ASSISTments system were tagged and associated with one or more of the skills. Recall item 19: it is tagged in WPI-106 with XXXX and correspondingly XXX; yet in MCAS reports it is XXX. It is easier to list component concepts and skills than it is to determine which are the hardest for students to learn. Yet it is important to find the problem-causing skills, because these skills should be the focus of assessment and instruction. We use scaffolding questions to help with this problem: scaffolding questions are tagged with only one skill, so that we can assess each component of knowledge separately and give the skills "identifiability".

28 Cognitive Diagnostic Assessment – data
Data, Sept. 2004 – May 2005: 447 students; online data: 7.3 days, 87 items (sd. = 35); item-level responses on the 8th grade MCAS test (May 2005). Data, Sept. 2005 – May 2006: 474 students; online data: 5 days, 51 items (sd. = 24); item-level 8th grade MCAS scores (May 2006). All online and MCAS items have been tagged in all four skill models.

29 Cognitive Diagnostic Assessment - modeling
Fit a mixed-effects logistic regression model. Predict the total MCAS score: extrapolate the fitted model in time to the month of the MCAS test; obtain the probability of getting each MCAS question correct, based upon the skill tagging of the MCAS item; sum up the probabilities to get the total score. Longitudinal model (e.g. Singer & Willett, 2003): X_ijkt is the 0/1 response of student i on question j tapping skill k in month t; Month_t is the elapsed month in the study (0 for September, 1 for October, and so on); β_0k and β_1k are the respective fixed effects for the baseline and the rate of change in the probability of correctly answering a question tapping skill k; β_00 and β_10 are the group-average baseline level of achievement and rate of change; β_0 and β_1 are the baseline level of achievement and rate of change of the individual student.
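The model equation itself appears as a figure on the original slide; a plausible reconstruction from the variable definitions above, following the logistic growth-model form in Singer & Willett (2003), is:

```latex
% Reconstructed from the definitions above; the slide's exact parameterization may differ.
\[
\operatorname{logit}\Pr(X_{ijkt}=1)
  = \left(\beta_{0k} + \beta_{0i}\right) + \left(\beta_{1k} + \beta_{1i}\right)\mathrm{Month}_t
\]
% beta_0k, beta_1k: per-skill fixed effects for baseline and rate of change,
%                   centered on the group averages beta_00 and beta_10;
% beta_0i, beta_1i: the student's deviations (random intercept and slope).
```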

30 How do I Evaluate Models?
04-05 Data: real MCAS score vs. ASSISTment predicted score under each skill model

Student   Real MCAS   WPI-1   WPI-5   WPI-39   WPI-78
Mary      25.00       23.31   22.85   22.18    20.47
Tom       32.00       29.66   29.15   28.67    27.13
Sue       29.00       28.46   28.23   27.85    26.26
Dick      28.00       27.41   26.70   26.12    24.30
Harry     22.00       23.33   22.58   22.02    20.14

Absolute difference |real - predicted|

Student   WPI-1   WPI-5   WPI-39   WPI-78
Mary      1.69    2.15    2.82     4.53
Tom       2.34    2.85    3.33     4.87
Sue       0.54    0.77    1.15     2.74
Dick      0.59    1.30    1.88     3.70
Harry     1.33    0.58    0.02     1.86

MAD       4.42    4.37    4.22     4.11
%Error    13.00%  12.85%  12.41%   12.09%
Paired two-sample t-test
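A small sketch that recomputes the absolute-difference rows of the table above from the listed predictions (the MAD and %Error rows on the slide are evidently computed over the full student sample, not just these five example students):

```python
# Recompute the absolute-difference rows of the table above from the slide's values.
real = {"Mary": 25.00, "Tom": 32.00, "Sue": 29.00, "Dick": 28.00, "Harry": 22.00}
predicted = {  # columns: WPI-1, WPI-5, WPI-39, WPI-78
    "Mary":  [23.31, 22.85, 22.18, 20.47],
    "Tom":   [29.66, 29.15, 28.67, 27.13],
    "Sue":   [28.46, 28.23, 27.85, 26.26],
    "Dick":  [27.41, 26.70, 26.12, 24.30],
    "Harry": [23.33, 22.58, 22.02, 20.14],
}

for student, preds in predicted.items():
    diffs = [round(abs(real[student] - p), 2) for p in preds]
    print(student, diffs)   # e.g. Mary [1.69, 2.15, 2.82, 4.53]
```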

31 Comparing Models of Different Granularities
Comparison against the 1-parameter IRT model.

04-05 Data   WPI-1    WPI-5    WPI-39   WPI-78   (1-PL IRT)
MAD          4.42     4.37     4.22     4.11     4.36
%Error       13.00%   12.85%   12.41%   12.09%   12.83%
p-values: P = 0.006, P < 0.001, P = 0.21, P = 0.10

05-06 Data   WPI-1    WPI-5    WPI-39   WPI-78   (1-PL IRT)
MAD          6.58     6.51     4.83     4.99     4.67
%Error       19.37%   19.14%   15.10%   14.70%   13.70%
p-values: P < 0.001, P < 0.001, P < 0.001, P = 0.03

32 The Effect of Scaffolding - hypothesis
Only using original questions makes it hard to decide which skill to "blame". Scaffolding questions add identifiability by directly assessing a single skill. Hypotheses: (1) using responses to scaffolding questions will improve prediction accuracy; (2) scaffolding questions are more useful for fine-grained models. Original questions are usually tagged with more than one skill, which makes it hard to decide which skill to blame when a student gives a wrong answer. Scaffolding questions break the main question down into simpler tasks that directly assess a single skill, giving us a good chance to detect exactly which skills are the real obstacles that prevent students from correctly answering the original questions.

33 The Effect of Scaffolding - results
MAD by skill model, 04-05 Data:
                                    WPI-1   WPI-5   WPI-39   WPI-78
Only original questions used        5.07    4.78    5.20     6.08
Original + scaffolding questions    4.42    4.37    4.22     4.11

MAD by skill model, 05-06 Data:
                                    WPI-1   WPI-5   WPI-39   WPI-78
Only original questions used        6.81    6.76    5.98     5.58
Original + scaffolding questions    6.58    6.51    4.83     4.99

For the data, the order of the skill models shifts when scaffolding responses are added.

34 Cognitive Diagnostic Assessment - usage
Results are presented in a nested structure of different granularities to serve a variety of stakeholders. This picture shows how the skills are tagged to questions in the builder. This is a report I built: here are 5 skills we recommend teachers work on, showing the problems associated with each skill. Our collaborating teachers responded positively to detailed reports based on the fine-grained model.

35 Cognitive Diagnostic Assessment - conclusion
Fine-grained models do the best job of estimating student skill levels overall, though not necessarily the best for all consumers (e.g. principals), and only when scaffolding questions are used. Scaffolding questions help improve overall prediction accuracy and are more useful for fine-grained models. In this work, I rigorously evaluated the effect of the granularity of the cognitive models and demonstrated the value of a fine-grained model versus more coarse-grained models in the ASSISTments system. I found evidence that using students' responses to scaffolding questions is helpful in tracking students' knowledge. This is the first evidence we have that our skill mappings are good enough to predict a state test better than some less fine-grained models. Having a good fine-grained model based on a thorough understanding of what is hard for students is useful, as it can lead to better categorization of test items and better guidance for teachers. The result was replicated, and contrasted with a completely different Bayesian network methodology as well. Feng, M., Heffernan, N.T, Mani, M. & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds), Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. Feng, M., Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies, Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue.) Pardos, Z., Feng, M., Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed-effect methods. In Luckin & Koedinger (Eds.), Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press.

36 Skill Model Refinement – why bother?
WPI-78 might have some mis-taggings: expert-built models are subject to the risk of the "expert blind spot", and this one was our best guess in 7 hours. A best-guess model should be iteratively tested and refined. (Say this is current work.)

37 Skill Model Refinement - approaches
Having human experts manually update hand-crafted models is not practical to do often: there are (1,000+ items) × (100+ skills) to review. Data mining can help, by pointing to: skills or items with high residuals; skills consistently over-predicted or under-predicted across students; "un-learned" skills (i.e. negative slopes from the mixed-effects models). The first approach I can think of is to hand the model back to our subject-matter expert and ask her to improve it, but this is hard, especially at that scale. In terms of items or skills, the candidates to be examined would be items for which the mixed-effects model produces the highest residuals, and those for which student performance has been consistently over-predicted or under-predicted across all students. Another clue is to focus on the "un-learned" skills, or the items/skills for which the mixed-effects models produced high residuals; one reason that a skill might have a poorly fit slope would be that we tagged items with the same skill name when they only share some superficial similarity. In this paper, I introduced the concept of a GLOP: a group of items organized into one group by subject-matter experts because they are associated with the same skill in the skill model. GLOPs from which students did not show much learning raise a signal that maybe these items do not belong to the same group, so there was no transfer among them. This information can be used to aid content experts so that they can concentrate on the most problematic skill taggings and gain efficiency in model improvement. Feng, M., Heffernan, N., Beck, J, & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.), Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.

38 Skill Model Refinement - approaches
Searching for better models automatically: Learning Factor Analysis (LFA) (Koedinger & Junker, 1999), a semi-automated method with three parts: difficulty factors associated with problems; a combinatorial search space created by applying operators (add, split, merge) to the base model; and a statistical model that evaluates how well a candidate model fits the data (a search skeleton is sketched below). Can we raise the efficiency of LFA? Humans identify difficulty factors through task analysis, and automatic methods then search for better models based upon those factors. The big idea of LFA is that humans identify difficulty factors through task analysis and a search procedure then looks for better models based upon the factors; the method is semi-automatic because of the human cognitive task analysis. As the basis of LFA, difficulty factors have traditionally been found by subject experts through a process of "difficulty factor assessment" (DFA) (Koedinger, 2000). Based upon theory or task analysis, researchers can hypothesize the likely factors that cause student difficulties, and by assessing performance differences on pairs of problems that vary by only one factor, the experts identify the hidden knowledge component that can be used to improve a skill model. This has been considered a weakness of LFA in terms of efficiency. Ideally, I would be happy to see Neil and Cris sitting in front of a computer to review items and identify difficulty factors, and then run LFA. But people are busy; doing this for GLOPs with a lot of items is time-consuming, and I want to save their time. So, can we raise the efficiency of LFA by shrinking the manual part? Automatic methods then search for better models based upon the factors.
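A schematic sketch of such a search, assuming a single split operator and a stub in place of the statistical model fit; it is not the actual LFA implementation, and the item IDs, skill names, and scoring stub are illustrative.

```python
import heapq
import itertools

def fit_bic(skill_model, data):
    """Stub standing in for fitting a statistical model to student response data
    and returning its BIC. Here we simply penalize the number of distinct skills
    so the example runs; `data` is ignored by this stub."""
    return 100.0 + 2.0 * len(set(skill_model.values()))

def split(skill_model, skill, factor):
    """Split one skill into skill*factor variants for the items marked with that factor."""
    out = dict(skill_model)
    for item, s in skill_model.items():
        if s == skill and factor.get(item):
            out[item] = f"{skill}*{factor[item]}"
    return out

def lfa_search(base_model, factors, data, max_expansions=50):
    """Best-first search over candidate skill models, keeping the lowest-BIC model."""
    counter = itertools.count()                       # tie-breaker for the heap
    best = (fit_bic(base_model, data), next(counter), base_model)
    frontier = [best]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, model = heapq.heappop(frontier)
        for skill in set(model.values()):
            for factor in factors:
                cand = split(model, skill, factor)
                entry = (fit_bic(cand, data), next(counter), cand)
                heapq.heappush(frontier, entry)
                best = min(best, entry)
    return best[2], best[0]

# Tiny illustrative run: two items tagged "circle-area", one marked with a "square-root" factor.
base = {"item_894": "circle-area", "item_4673": "circle-area"}
factors = [{"item_894": "square-root"}]
data = [("tom", "item_894", 0), ("tom", "item_4673", 1)]
print(lfa_search(base, factors, data))
```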

39 Suggesting Difficulty Factors
Some items in a random sequence cause significantly less learning than others. Hypothesis: some factor inherent in those items introduces extra skills that are not intrinsic to the GLOP. Create factor tables. Preliminary results show some validity. (Example factor table: skill Circle-area, factor values High / Low.) As hypothesized by Kenneth Koedinger (personal communication), hard questions were hard because they involved multiple (or extra) skills that were not intrinsic to the GLOP. Intuitively, it is quite possible that some factor inherent in the items makes it harder for students to learn from them, or makes it harder for the learning to transfer to later items. Based upon this hypothesis, I create a factor table for each GLOP. This is a very simple way; of course, there are other ways to build factor tables: one factor per item; only introducing a factor when there is a reliable difference; more factor values; etc. Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.

40 Roadmap Motivation Contributions Background - ASSISTments
Using tutoring system as an assessor Dynamic assessment Longitudinal modeling Cognitive diagnostic modeling Conclusion & general implications Now that I have checked off all the other bullets, let me conclude the dissertation.

41 Conclusion of the Dissertation
The dissertation establishes novel assessment methods to better assess students in tutoring systems: assess students better by analyzing their learning behaviors when using the tutor; assess students longitudinally by tracking learning over time; assess students diagnostically by modeling fine-grained skills. The contribution of the dissertation is that it establishes novel assessment methods to better assess students in tutoring systems.

42 Voice from Education Secretary
Computer Research Association (2005). Secretary of Education Arne Duncan weighed in on the NCLB Act and called for continuous assessment. Duncan says he is concerned about overtesting, but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. "If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing," he says. A Computer Research Association report (Computer Research Association, 2005) pointed out that continuous assessment systems research is a huge growth area. Recently, in an interview with U.S. News & World Report (Ramírez & Clark, 2009), Secretary of Education Arne Duncan weighed in on the NCLB Act and called for continuous assessment. He mentioned that he is concerned about over-testing and feels that fewer, better tests would be more effective. He wants to develop better data management systems that will help teachers track individual student progress in real time. Ramírez, E., & Clark, K. (Feb. 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved March 8th, 2009.

43 General implication. Continuous assessment systems are possible to build: they save classroom instruction time by assessing during tutoring; they track individual progress and can be quite accurate at helping stakeholders get performance information; they provide teachers with fine-grained, cognitively diagnostic feedback so teachers can be "data-driven". The general implication from this research is that continuous assessment systems are possible to build and that they can be quite accurate at helping schools get information on their students over a long period of time. Strong evidence implies it is possible to develop a continuous assessment system that saves classroom instruction time by assessing students while they are getting tutoring: we can do the two things, assessment and assistance, well simultaneously, thereby relieving teachers from the dilemma of the hard choice between assessment and assistance. Such a system accurately and longitudinally assesses students and gives fine-grained feedback that is more cognitively diagnostic. Thus, I claim our results within the ASSISTments system are important because they provide evidence that reliable assessment and instructional assistance can be effectively blended. This opens up the possibility of a completely different approach to assessment. With that said, a tantalizing question is: are we likely to see states move from a test that happens once a year to an assessment tracking system that offers continuous assessment (Computer Research Association, 2005) every few weeks? While more research is warranted, my results suggest that perhaps the answer should be yes.

44 A metaphor for this shift
Businesses don't close down periodically to take inventory of stock any more: bar codes and auto-checkout allow non-stop business and richer information. (Committee on the Foundations of Assessment, Board on Testing and Assessment, Center for Education, National Research Council; James W. Pellegrino, Naomi Chudowsky, Robert Glaser.) Let me end this talk with a metaphor that was used in "Knowing What Students Know" (page 284). I think it's a brilliant analogy and a good vision. Businesses don't close down once or twice a year to take inventory of their stock; instead they take advantage of bar codes and auto-checkout to continuously monitor the flow of items. Not only is business non-stop, the information collected is much richer. This is exactly what I want to do for schools and students with continuous assessment.

45 Acknowledgement My advisor Committee members The ASSISTment team
Neil Heffernan Committee members Ken Koedinger Carolina Ruiz Joe Beck The ASSISTment team My family Many more…

46 Thanks! Questions?

47 Backup slides

48 Motivation – the problems
III: The "moving" target problem. Testing and instruction have been separate fields of research with their own goals: psychometric theory assumes a fixed target for measurement, while an ITS wants student ability to "move". Standard psychometric theory requires a fixed target for measurement, which requires that learning during testing be limited. Yet this "fixed target" assumption is hardly met in an ITS, since the ultimate goal of a tutoring system is to help students learn: the targets are thereby (hopefully!) moving.

49 More Contributions Working systems
The reporting system that gives cognitive diagnostic reports to teachers in a timely fashion. Establish an easy approach to detect the effectiveness of individual tutoring content. AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education. Amsterdam: IOS Press. Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds.), Intelligent Educational Machines, Intelligent Systems Engineering Book Series. Springer Berlin / Heidelberg. JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2). Chesapeake, VA: AACE. TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA. AIED'09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press.

50 Evidence. 62% / 50% / 37%. Here is a screenshot of an item report for one real class. For this class, the 3rd scaffolding question is the hardest, with % correct = 37%. Some people would say that data from one class is not convincing.

51 Evidence Congruence Perimeter Equation-Solving
This is a screenshot of a summary report covering more than 3000 students, showing how many of them got the original question correct. 71 students missed the 3rd scaffold but got all other scaffolds correct. In contrast, only 45 students answered only the first scaffolding question wrong but got all others correct. No students, among the 3000, got the equation-solving scaffold correct but answered the others wrong; not a one! I built such a report 5 years ago; I should have come up with an educational data mining paper with it!

52 Terminology MCAS Item/question/problem Response Original question
Scaffolding question Hint message Bottom-out hint Buggy message Attempt Skill/knowledge component Skill model/cognitive model/Q-matrix Single mapping model Multi-mapping model

53

54 The reporting system. I developed the first reporting system for ASSISTments in 2004; it is online, live, and gives detailed feedback at a grain size suitable for guiding instruction.

55 “It’s spooky; he’s watching everything we do”. – a student
The grade book: by clicking the student's name, shown as a link in our report, teachers can even see each action a student has made, his inputs and the tutor's responses, and how much time he has spent on a given problem. "It's spooky; he's watching everything we do." – a student

56 Identifying difficult steps

57 Informing hard skills

58 Linear Regression Model
An approach to modeling the relationship between a dependent variable (Y) and one or more explanatory variables (X), where Y depends linearly on X. How does linear regression work? By minimizing the sum of squares. Example of linear regression with one independent variable. Stepwise regression: forward; backward; a combination. The goal is to adjust the values of the slope and intercept to find the line that best predicts Y from X; more precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. Forward selection involves starting with no variables in the model, trying out the variables one by one, and including them if they are 'statistically significant'. Backward elimination involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant. Combination methods test at each stage for variables to be included or excluded. (A forward-selection sketch follows.)
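A minimal sketch of the forward-selection variant described above, built on statsmodels OLS; the 0.05 threshold, column names, and synthetic data are illustrative assumptions rather than the dissertation's actual setup.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(df, target, candidates, alpha=0.05):
    """Forward selection: repeatedly add the candidate predictor with the
    smallest p-value, as long as that p-value is below `alpha`."""
    selected = []
    remaining = list(candidates)
    while remaining:
        pvals = {}
        for c in remaining:
            X = sm.add_constant(df[selected + [c]])
            pvals[c] = sm.OLS(df[target], X).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic demo data; column names are illustrative only.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["hints", "attempts", "noise"])
df["mcas"] = 40 - 3 * df["hints"] - 2 * df["attempts"] + rng.normal(size=200)
print(forward_stepwise(df, "mcas", ["hints", "attempts", "noise"]))
```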

59 1-Parameter IRT Model. An item response theory (IRT) model relates the probability of an examinee's response to a test item to an underlying ability via a logistic function. The 1-PL IRT model: P(Xni = 1) = exp(βn − δi) / (1 + exp(βn − δi)), where βn is the ability of person n and δi is the difficulty of item i. I used BILOG-MG to run the model and get estimates of student ability and item difficulty.

60 Dynamic assessment - The models
The predicted scores are adjusted scores, generated by SPSS leave-one-out cross-validation.

61 Dynamic assessment - The models

62 Dynamic assessment – The models

63 Dynamic assessment - Validation

64 Longitudinal Modeling - data
What does longitudinal data look like? Ideally, we want to see something like this: students' % correct starts low at the beginning of the semester, then increases gradually during the year, ending high just before the MCAS. But what does real data look like? Welcome to the real world! It does not look like there is much learning, eh? Actually, when we look at individual students' learning curves, it is kind of a mess. I guess that's why we need data modeling, not just data plotting, to give us an answer. Average % correct on original questions over time (FAKE data). What does our real data look like?

65

66

67 Longitudinal Modeling - methodology
What do we get from (linear) mixed-effects models? The average population trajectory for the specified group, indicated by two parameters, an intercept and a slope, which give the average estimated score for the group at time j. And one trajectory for every single student: each student gets an intercept and a slope that vary from the group average, giving the estimated score for student i at time j. Given the "messy" longitudinal data, what shall we do? I learned that the mixed-effects model is a popular approach. This is the stuff that I had to figure out by myself over a summer; I realized that I am not a statistician or a psychometrician, but I taught myself what I needed to learn. Mixed-effects models are the technique I have been using and feel comfortable with. I won't try to explain the technical details of the model due to time; instead, I jump directly to the results. Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford University Press, New York.
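The trajectory equations were figures on the original slide; a plausible reconstruction in standard linear growth-model notation (Singer & Willett, 2003), with symbols that may differ from the slide's, is:

```latex
% Reconstruction in standard growth-model notation; symbols may differ from the slide.
% Average trajectory for the group at time j:
\[
\widehat{Y}_j = \gamma_{00} + \gamma_{10}\,\mathrm{time}_j
\]
% Trajectory for student i, whose intercept and slope deviate from the group average:
\[
Y_{ij} = (\gamma_{00} + \zeta_{0i}) + (\gamma_{10} + \zeta_{1i})\,\mathrm{time}_j + \varepsilon_{ij},
\qquad \zeta_{0i}\sim N(0,\sigma_0^2),\; \zeta_{1i}\sim N(0,\sigma_1^2),\; \varepsilon_{ij}\sim N(0,\sigma_\varepsilon^2)
\]
```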

68 Longitudinal Modeling - results
BIC: Bayesian Information Criterion (the lower, the better). I trained a series of models. Model A is an "average" model with no change over time. From model A to model B, I introduced TIME as a covariate; model B fits the data significantly better than model A. From A to B we see a big drop in the BIC value of 100 points (literature suggests 10 points of BIC is equivalent to a p of …). The model estimates a positive and significant coefficient for the TIME parameter, which indicates a general trend of student performance increasing over the year. Limited by time, I won't go through the details of all the models in which I explored what factors have an impact on students' rate of learning. This result was first published at the 2006 WWW conference together with the dynamic assessment work. Later that year, I followed up by combining the two pieces of work and obtained an even better estimate of student MCAS performance, which led to a paper published at the Intelligent Tutoring Systems conference. Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. New York, NY: ACM Press. Best Student Paper Nominee. Feng, M., Heffernan, N.T, Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin.

69 Mixed effects models. Individuals in the population are assumed to have their own subject-specific mean response trajectories over time. The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects). It is possible to predict how individual response trajectories change over time, and there is flexibility in accommodating imbalance in longitudinal data. Methodological features: 1) three or more waves of data; 2) an outcome (dependent) variable whose values change systematically over time; 3) a sensible metric for time, which is the fundamental predictor in the longitudinal study.

70 Sample longitudinal data

71 Comparison of Approaches
Ayers & Junker (2006): estimate student proficiency using a 1-PL IRT model; LLTM (linear logistic test model): main question difficulty decomposed into K skills; the 1-PL IRT fits dramatically better; only main questions used; additive, non-temporal; WinBUGS.

72 Comparison of Approaches
Pardos et al. (2006): conjunctive Bayes nets; non-temporal; scaffolding used; Bayes Net Toolbox (Murphy, 2001); DINA model (Anozie, 2006).

73 Comparison of Approaches
Feng, Heffernan, Mani & Heffernan (2006): a logistic mixed-effects model (generalized linear mixed-effects model, GLMM); temporal. X_ijkt is the 0/1 response of student i on question j tapping KC k in month t; the model is fit with the R lme4 library. Month_t is the elapsed month in the study; β0k and β1k are the respective fixed effects for the baseline and the rate of change in the probability of correctly answering a question tapping KC k.

74 Comparison of Approaches
Comparing to the LLTM in Ayers & Junker (2006): student proficiency depends on time; question difficulty depends on KC and time; only the most difficult skill is assigned, instead of the full Q-matrix mapping of multiple skills as in the LLTM; scaffolding is used to gain identifiability. Ayers & Junker (2006) use regression to predict MCAS after obtaining an estimate of student ability (θ) (MAD = 10.93%); there is no such regression step in my work: logit(p=1) = θ − 0; estimated score = full score × p. This gives a higher MAD, but provides diagnostic information. The most difficult KC is chosen according to the lowest proportion correct among all questions depending on each KC.

75 Comparison of Approaches
Comparing to Bayes nets and conjunctive models. Bayes: probability reasoning, conjunctive; GLMM: linear learning, max-difficulty reduction. GLMM is computationally much easier and faster, and the results are still comparable: GLMM is better than Bayes nets when WPI-1 or WPI-5 is used, and comparable with Bayes nets when WPI-39 or WPI-78 is used (WPI-39: GLMM 12.41%, Bayes 12.05%; WPI-78: GLMM 12.09%, Bayes 13.75%).

76 Cognitive Diagnostic Assessment – BIC results
The numbers of data points are different: items tagged with more than one skill are duplicated in the data, so finer-grained models have more multi-mappings and thus more data points (and a higher BIC). WPI-5 is better than WPI-1; WPI-78 is better than WPI-39. Calculate MAD as the evaluation gauge. (Slide table: Model: WPI-1, WPI-5, WPI-39, WPI-78; rows 04-05 Data and 05-06 Data; values 3085, -222, 4870, 36, -15522, 399.) There is an issue with this BIC criterion; however, despite the fact that WPI-78 has more parameters and more data points, both of which add to the BIC, WPI-78 still gets a lower BIC than WPI-39, which tells us it fits the data much better. Overfitting: given that the fine-grained model is composed of 78 skills, people might think the model would naturally fit the data better than skill models that contain far fewer skills, maybe even overfit the data with so many free parameters. However, I evaluate the effectiveness of the skill models on entirely different data from the MCAS tests, using the external state test as the testing set; predicting students' scores on this test is our gauge of model performance. Hence, I argue that overfitting is not a problem in our approach.

77 Analyzing Instructional Effectiveness
Detect relative instructional effectiveness among items in the same GLOP using learning decomposition. (Slide figure: a student's prior encounters with the items of a GLOP at times t1–t4, with whether each was answered correctly, e.g. for the student Tom.) Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
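The learning decomposition model itself is not spelled out on the slide; as background, the technique is usually written as an exponential learning curve in which different kinds of prior practice earn different weights. The generic form below is an assumption about the family of model used, not the dissertation's exact equation.

```latex
% Generic learning-decomposition form (assumed, not taken from the slide):
% prior encounters of type A and type B contribute different amounts of practice credit.
\[
\Pr(\text{success}) \approx A\, e^{-b\,\left(N_{\mathrm{A}} + \beta\, N_{\mathrm{B}}\right)}
\]
% Fitting beta indicates how much one kind of encounter (e.g. practice on one item
% in a GLOP) is worth relative to another for producing learning.
```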

78 Searching Results. Among 38 GLOPs, LFA found significantly better models for 12. Should I be happy? "Sanity" check: randomly assigned factor tables.

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)

Further work needs to be done: quantitatively measure whether and how the data analysis results can be helpful for subject-matter experts; explore the automatic factor-assigning approach on more data from other systems; contrast with human experts as a controlled condition. "Add" a certain amount of transfer between items, but not full transfer. The random tables doing badly makes our approach somewhat impressive, and shows some validity in using educational data mining findings to help refine existing skill models.

79 Bayesian Information Criterion
Guess which item is the most difficult one? (Slide figure: learning/difficulty plots for the GLOP items.) Item ID / Square-root / Factor-High table, values as given: 894, 1, 41, 4673, 117. Model comparison: Log likelihood -532.6 vs. -524; Bayesian Information Criterion 1,079.2 vs. 1,065.99; Num of skills 1 vs. 2; Num of parameters 4; Coefficients 1.099, 0.137 vs. 1.841, 0.100; , 0.055.

