Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 6, 2013.

Today’s Class Advanced Detector Validation and Evaluation

Calculating Statistical Significance for EDM models Up to this point, we’ve discussed cross-validation mostly – Tests generalizability – Relatively mathematically easy

But it’s possible to calculate statistical significance for some metrics A’ specifically The next few slides will be review…

Review

Comparing Two Models (Review)

Comparing Model to Chance (Review) $Z = \frac{A' - 0.5}{\sqrt{SE(A')^2 + 0}}$
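A minimal Python sketch of the chance comparison may help here. It assumes the standard error of A' is estimated with the Hanley and McNeil (1982) formula; the slides do not say which estimator was used, so treat `se_a_prime` as an assumption. The function names are illustrative, and the input values are the overall figures from the example data later in these slides.

```python
import math

def se_a_prime(a_prime, n_pos, n_neg):
    """Standard error of A' using the Hanley & McNeil (1982) estimate (assumed)."""
    q1 = a_prime / (2 - a_prime)
    q2 = 2 * a_prime ** 2 / (1 + a_prime)
    variance = (
        a_prime * (1 - a_prime)
        + (n_pos - 1) * (q1 - a_prime ** 2)
        + (n_neg - 1) * (q2 - a_prime ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

def z_vs_chance(a_prime, n_pos, n_neg):
    """Z score for A' against chance (A' = 0.5, treated as having zero standard error)."""
    se = se_a_prime(a_prime, n_pos, n_neg)
    return (a_prime - 0.5) / math.sqrt(se ** 2 + 0.0)

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal Z score."""
    return math.erfc(abs(z) / math.sqrt(2))

# Overall A' = 0.80 with Np = 192 positive and Nn = 173 negative examples.
z_overall = z_vs_chance(0.80, 192, 173)
print(round(z_overall, 2), two_tailed_p(z_overall))
```

Note that this pools every data point as if it were independent, which is exactly the assumption the next slide questions.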

Complication This test assumes independence If you have multiple data points per student Then this assumption is not only wrong, it’s dangerous – Biases in the direction of statistical significance

What can we do? We can borrow a technique from the meta-analysis literature

Meta-Analysis What is it? What is it usually used for?

Special Warning When publishing using this technique, it is probably wise to cite the appropriate paper but not mention the words meta-analysis Some reviewers just can’t deal with the idea that one can use meta-analytic statistical techniques in cases other than meta-analysis

Combining Statistical Significance Stouffer’s Z $Z_{\text{Stouffer}} = \frac{\sum_{i=1}^{K} Z_i}{\sqrt{K}}$
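Stouffer's method sums the K individual Z scores and divides by the square root of K. A short Python sketch, with a hypothetical helper name and made-up Z scores chosen only to make the arithmetic easy to follow:

```python
import math

def stouffer_z(z_scores):
    """Combine independent Z scores: sum(Z_i) / sqrt(K)."""
    k = len(z_scores)
    if k == 0:
        raise ValueError("need at least one Z score")
    return sum(z_scores) / math.sqrt(k)

# Illustrative values only: four tests, none large on its own.
example_zs = [1.0, 2.0, 1.5, 0.5]
combined = stouffer_z(example_zs)  # (1.0 + 2.0 + 1.5 + 0.5) / sqrt(4)
print(combined)
```

In the detector setting, each Z score would come from one student, so the students (not the individual data points) are treated as the independent units.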

Let’s consider an example Is a model better than chance?

Data
Student 1 A’ = 0.69, Np = 12, Nn = 13
Student 2 A’ = 0.65, Np = 14, Nn = 20
Student 3 A’ = 0.72, Np = 16, Nn = 25
Student 4 A’ = 0.83, Np = 70, Nn = 55
Student 5 A’ = 0.86, Np = 80, Nn = 60
Overall A’ = 0.80, Np = 192, Nn = 173

Compute
Student 1 A’ = 0.69, Np = 12, Nn = 13
Student 2 A’ = 0.65, Np = 14, Nn = 20
Student 3 A’ = 0.72, Np = 16, Nn = 25
Student 4 A’ = 0.83, Np = 70, Nn = 55
Student 5 A’ = 0.86, Np = 80, Nn = 60
Overall A’ = 0.80, Np = 192, Nn = 173
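One way to carry out this computation is sketched below in Python: a Z score against chance for each student, then Stouffer's Z across the five students. The standard error uses the Hanley and McNeil (1982) estimate, which is an assumption (the slides do not name the estimator), and all function and variable names are illustrative.

```python
import math

def se_a_prime(a_prime, n_pos, n_neg):
    """Standard error of A' using the Hanley & McNeil (1982) estimate (assumed)."""
    q1 = a_prime / (2 - a_prime)
    q2 = 2 * a_prime ** 2 / (1 + a_prime)
    variance = (
        a_prime * (1 - a_prime)
        + (n_pos - 1) * (q1 - a_prime ** 2)
        + (n_neg - 1) * (q2 - a_prime ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# (A', Np, Nn) for students 1 to 5, copied from the slide.
students = [
    (0.69, 12, 13),
    (0.65, 14, 20),
    (0.72, 16, 25),
    (0.83, 70, 55),
    (0.86, 80, 60),
]

# Per-student Z against chance (A' = 0.5, zero standard error).
student_zs = [(a - 0.5) / se_a_prime(a, n_pos, n_neg) for a, n_pos, n_neg in students]

# Stouffer's Z: the student is the unit of independence.
combined_z = sum(student_zs) / math.sqrt(len(student_zs))
p_value = math.erfc(abs(combined_z) / math.sqrt(2))  # two-tailed

for i, z in enumerate(student_zs, start=1):
    print(f"Student {i}: Z = {z:.2f}")
print(f"Stouffer's Z = {combined_z:.2f}, two-tailed p = {p_value:.3g}")
```

Students with few data points contribute small Z scores even when their A' is respectable, so the combined result is driven mostly by students 4 and 5.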

Comparing models Same methods can be used to compare between A’ for different models

Are the models significantly different?

Data
A’a = 0.69, A’b = 0.72, Np = 12, Nn = 13
A’a = 0.65, A’b = 0.64, Np = 14, Nn = 20
A’a = 0.72, A’b = 0.75, Np = 16, Nn = 25
A’a = 0.83, A’b = 0.86, Np = 70, Nn = 55
A’a = 0.86, A’b = 0.92, Np = 80, Nn = 60
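A hedged Python sketch of the model comparison on this data follows. For each student it computes Z = (A'b - A'a) / sqrt(SE(A'a)^2 + SE(A'b)^2) and then combines the five Z scores with Stouffer's Z. Two assumptions are not stated on the slides: the standard error uses the Hanley and McNeil (1982) estimate, and the two models' A' values are treated as independent even though both are computed on the same students. Names are illustrative.

```python
import math

def se_a_prime(a_prime, n_pos, n_neg):
    """Standard error of A' using the Hanley & McNeil (1982) estimate (assumed)."""
    q1 = a_prime / (2 - a_prime)
    q2 = 2 * a_prime ** 2 / (1 + a_prime)
    variance = (
        a_prime * (1 - a_prime)
        + (n_pos - 1) * (q1 - a_prime ** 2)
        + (n_neg - 1) * (q2 - a_prime ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# (A' for model a, A' for model b, Np, Nn), one row per student, from the slide.
rows = [
    (0.69, 0.72, 12, 13),
    (0.65, 0.64, 14, 20),
    (0.72, 0.75, 16, 25),
    (0.83, 0.86, 70, 55),
    (0.86, 0.92, 80, 60),
]

def z_difference(a_model_a, a_model_b, n_pos, n_neg):
    """Z for model b minus model a; positive values favor model b."""
    se_a = se_a_prime(a_model_a, n_pos, n_neg)
    se_b = se_a_prime(a_model_b, n_pos, n_neg)
    return (a_model_b - a_model_a) / math.sqrt(se_a ** 2 + se_b ** 2)

diff_zs = [z_difference(*row) for row in rows]
combined_diff_z = sum(diff_zs) / math.sqrt(len(diff_zs))
p_diff = math.erfc(abs(combined_diff_z) / math.sqrt(2))  # two-tailed

print([round(z, 2) for z in diff_zs])
print(f"Stouffer's Z = {combined_diff_z:.2f}, two-tailed p = {p_diff:.3f}")
```

Under these assumptions, model b looks better for four of the five students, yet the combined Z stays below the conventional 1.96 threshold, so the difference would not be declared significant at the 0.05 level.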

Other use of meta-analytic methods in detector validation Studying whether a detector generalizes – Population – Demographic group Compare A’ in original context to A’ in new context

More complicated case
What if populations of students are overlapping but non-identical for two contexts?
– E.g. 4 schools used software A and software B, but two additional schools used software B
– E.g. A lot of kids were absent due to a school assembly
You could just drop the non-overlapping kids
– Not always practical
Alternative: use Strube’s Adjusted Z
– Deals with the partial non-independence of student performance within two different tutor lessons
– Explicitly incorporates correlation between a student’s performance in the two tutor lessons, treating non-overlapping students as having a correlation of 0

Advanced Thoughts about Validation There are a lot of frameworks for thinking about validity of models, coming from decades of thought in the world of statistics We’ve discussed a few types of validity this semester so far I’ll mention a few here – Not an exhaustive list!

Generalizability Does your model remain predictive when used in a new data set? Underlies the cross-validation paradigm that is common in data mining Knowing the context the model will be used in drives what kinds of generalization you should study

Ecological Validity Do your findings apply to real-life situations outside of research settings? For example, if you build a detector of student behavior in lab settings, will it work in real classrooms?

Construct Validity Does your model actually measure what it was intended to measure?

Construct Validity Does your model actually measure what it was intended to measure? One interpretation: does your model fit the training data?

Construct Validity Another interpretation: do your model features plausibly measure what you are trying to detect? If they don’t, you might be over-fitting (Or your conception of the domain might be wrong!) See Sao Pedro paper from earlier in the semester for evidence that attention to this can improve model generalizability

Predictive Validity Does your model predict not just the present, but the future as well?

Substantive Validity Do your results matter? Are you modeling a construct that matters? If you model X, what kind of scientific findings or impacts on practice will this model drive? Can be demonstrated by predicting future things that matter

Substantive Validity
For example, we know that boredom correlates strongly with
– Disengagement
– Learning Outcomes
– Standardized Exam Scores
– Attending College Years Later
By comparison, whether someone prefers visual or verbal learning materials doesn’t even seem to predict very reliably whether they learn better from visual or verbal learning materials (See lit review in Pashler et al., 2008)

Content Validity From testing; does the test cover the full domain it is meant to cover? For behavior modeling, an analogy would be, does the model cover the full range of behavior it’s intended to? – A model of gaming the system that only captured systematic guessing but not hint abuse (cf. Baker et al., 2004; my first model of this) would have lower content validity than a model which captured both (cf. Baker et al., 2008)

Conclusion Validity Are your conclusions justified based on the evidence?

Other validity concerns?

Relative Importance? Which of these do you want to optimize? Which of these do you want to satisfice? Can any be safely ignored completely? (at least in some cases)

Exercise In groups of 3 Write the abstract of the worst behavior detector paper ever

Any group want to share?

Exercise #2 In different groups of 3 Now write the abstract of the best behavior detector paper ever

Any group want to share?

Asgn. 6 Any questions?

Next Class
Monday, March 11
Regression in Prediction
Readings: Witten, I.H., Frank, E. (2011) Data Mining: Practical Machine Learning Tools and Techniques. Sections 4.6, 6.5.
Assignments Due: 6. Regression

The End