ITC Conference, Winchester, 2002. Computer-based Testing: Usability of Psychometric Admeasurements. Dr. J. M. Müller, University of Tübingen, Germany



Overview
1. Introduction: formal test descriptions in practice
2. Definition of usability in the context of test description
3. Illustrating problems: reliability
4. Criteria of usability: foundation, scaling, general attributes
5. Two examples of enhanced usability: NDR and PDR
6. Summary

Introduction: Psychometric admeasurements in practice, today and tomorrow
1. Test users often use tests of poor quality (e.g. Piotrowski et al.; Wade & Baker, 1977). Remedies: psychometric knowledge (Moreland et al., 1995) and the competence approach (Bartram, 1995, 1996).
2. What should be described? For CBT: criteria of software usability (ISO 9241/10, 1991; Willumeit, Gediga & Hamborg, 1995) and further criteria such as platform independence, the possibility of building one's own norm bank, and data protection.
3. How should it be described?
4. Good-practice guidelines and standards are based on quality criteria (e.g. Standards for Educational and Psychological Testing, APA, 1999; International Guidelines for Test Use, ITC, 2000).
(Diagram: quality supply vs. quality demand.)

Definition of usability
Scope of usability: Usability in the context of psychological testing concerns all kinds of information that test users need in order to describe a test for various purposes, and the ways to communicate them. This includes test manuals as well as formal test descriptions by means of psychometric admeasurements.
Aim of usability: The product of good usability is that any test user finds all necessary information quickly and in a proper, standardized form, ready to use for answering his or her questions, and can thus decide whether a test is an appropriate aid for the diagnostic question at hand.
Frame of usability: Quality assurance in psychological testing refers to test construction, test translation, test description and the use of tests in practice. Methods of quality control can comprise guidelines for test use, standards for test description, etc. Usability is a strategy to enhance quality on the level of formal description.
Consequences of usability: the reengineering of formal test description.

Indices of the measurement of error
(Taxonomy diagram. The measurement of error branches by construct type:)
Dimensional constructs (CTT, IRT, generalizability theory): retest and Pearson correlation, Spearman correlation, % or SMC, Cronbach's alpha, Kuder-Richardson Formula 20, Spearman-Brown prophecy formula, intraclass correlation, standard error of a score, model fit, likelihoods, information function.
Categorical constructs (nonspecific vs. specific misclassification): phi coefficient, Yule's Y, kappa (reclassification), kappa (interrater), sensitivity TP/(TP+FN), specificity TN/(TN+FP).

Relationships between indices of the error of measurement
(Diagram: the same indices as on the previous slide, connected by their formal relationships, e.g. standard error of a score and reliability; Yule's Y, kappa and phi and the correlation; phi and kappa; the information criteria.)

Top-down vs. bottom-up strategy to develop a coefficient
Scientist's point of view (top-down): test theory/statistics → index: generic formula → algorithm → scale (correction) → interpretation of the score (operational meaning).
Practitioner's point of view (bottom-up): defining the operational meaning → scale definition → specification within a test theory → index: defining the influencing factors → index: generic formula.

Rescaling reliability: number of distinctive results (NDR) (Wright & Masters, 1982; Lehrl & Kinzel, 1973; Müller, 2001)
Formula (shown as an image in the original slide): R = test score range, k = critical difference.
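The NDR formula itself is an image in the original slide. As an illustration only, here is a minimal Python sketch assuming the common construction NDR = R / k, with the critical difference k = z · SEM · √2 and SEM = s_x · √(1 − r); whether this matches Müller's (2001) exact definition is an assumption.

```python
import math

def critical_difference(sd, reliability, z=1.96):
    """Critical difference between two observed scores: z * SEM * sqrt(2),
    with the standard error of measurement SEM = sd * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1.0 - reliability)
    return z * sem * math.sqrt(2.0)

def ndr(score_range, sd, reliability, z=1.96):
    """Number of distinctive results: how many score levels within the
    test score range are separated by at least one critical difference."""
    k = critical_difference(sd, reliability, z)
    return max(1, math.floor(score_range / k))
```

Under these assumptions, a score range of four standard deviations reproduces the value pairs shown on the "NDR at work" slide: r = .50 gives NDR = 2, r = .92 gives NDR = 5, and r = .98 gives NDR = 10.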

Criteria of usability for formal quality criteria (modified from Müller, 2001, 2002a, b; Goodman & Kruskal, 1954)
Foundation:
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independence of irrelevant factors
Scale definition:
1. A meaningful scale unit, which implies: an interval scale, positive values, a defined range of values
2. Comparability to the reference scale
3. A significant scale unit, which implies a minimum number of observations (N_min)
Global attributes in use:
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation

NDR at work...
(Figure: distribution of the reliability coefficient vs. distribution of the NDR coefficient for the same tests. Corresponding values: r = .50 corresponds to NDR = 2; r = .92 to NDR = 5; r = .98 to NDR = 10. The reliability distribution suggests the conclusion "many precise tests", the NDR distribution "some precise tests".)

Probability of distinctive results (PDR)
Formula (shown as an image in the original slide): a complete comparison of all pairs of test scores.
A rectangular score distribution shows an 80 % probability of distinguishing two test scores; a Gaussian distribution shows a 60 % probability.
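The PDR formula is likewise an image in the original. A hedged Monte Carlo sketch of the idea, the probability that two randomly drawn test scores differ by at least the critical difference, might look like this (the exact definition used by Müller is an assumption; `scores_sampler` and `crit_diff` are illustrative names):

```python
import math
import random

def pdr(scores_sampler, crit_diff, n_pairs=100_000, seed=1):
    """Monte Carlo estimate of the probability that two independently
    drawn test scores differ by at least the critical difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        a, b = scores_sampler(rng), scores_sampler(rng)
        if abs(a - b) >= crit_diff:
            hits += 1
    return hits / n_pairs

# Rectangular (uniform) score distribution on [0, 1]:
pdr_uniform = pdr(lambda rng: rng.random(), crit_diff=0.1)
# Standard-normal scores with a 5 %-level critical difference:
pdr_normal = pdr(lambda rng: rng.gauss(0.0, 1.0),
                 crit_diff=1.96 * math.sqrt(2.0))
```

Sanity checks: for a uniform distribution on [0, R] the exact value is (1 − k/R)², so `pdr_uniform` should land near 0.81; for i.i.d. N(0, σ²) scores, P(|X − Y| ≥ 1.96 · √2 · σ) = 0.05 by construction.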

PDR: simulation study
(Figure: performance in separating test scores, plotted as PDR against reliability for different score distributions.)

PDR: Example
Subscale Resignation, Stress Coping Questionnaire SVF-KJ (Hampel, Petermann & Dickow, 1999; N = 1123): r = 0.81, PDR = 41.6 %.
Subscale Unsicherheit (uncertainty), Symptom Checklist SCL-90-R (Derogatis, 1977; German version Franke, 1995; N = 875): r = 0.81, PDR = 30.6 %.
Two scales with the same reliability can thus differ clearly in PDR.

Reviewing NDR and PDR
1. NDR and PDR can be derived in any test-theoretical model; work on extending the application area is in progress.
2. NDR and PDR have an operational meaning that is easy to understand.
3. NDR and PDR are predictable for the test user in the nominal/actual value comparison.
NDR and PDR serve as examples of how to develop more usable formal test descriptions.

Summary
1. Usability is a possible strategy, with explicit and observable criteria, for improving formal test descriptions and thereby indirectly strengthening the role of guidelines and standards.
2. With NDR and PDR, two easy-to-understand coefficients have been proposed; their application in several test-theoretical models is in progress.

Thank you for your attention!

Medicine: effect-size measures
Practitioner's coefficient vs. scientific coefficient (Cohen, 1988).
NNT [number needed to treat]: the number of patients who need to be treated to prevent one adverse outcome. (Taken from the EBM Glossary, Evidence-Based Medicine, Volume 125, Number 1.)
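NNT is easy to compute: it is the reciprocal of the absolute risk reduction. A minimal sketch (the event rates below are made-up illustrative numbers):

```python
def nnt(control_event_rate, treated_event_rate):
    """Number needed to treat: 1 / ARR, where the absolute risk
    reduction ARR is the control event rate minus the treated event rate."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("no absolute risk reduction")
    return 1.0 / arr

# Example: 20 % adverse outcomes under control, 15 % under treatment,
# so ARR = 0.05 and about 20 patients must be treated to prevent one outcome.
example_nnt = nnt(0.20, 0.15)
```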

Measuring in technical fields: solutions from engineering
There is a German norm, DIN 2257, on how to measure the physical length of an object and how to report the result. The norm allows as output only values with statistical evidence.

Criteria of usability for formal quality criteria, applied to NNT
Foundation:
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independence of irrelevant factors
Scale definition:
1. A meaningful scale unit, which implies: an interval scale, positive values, a defined range of values
2. Comparability to the reference scale
3. A significant scale unit, which implies a minimum number of observations (N_min)
Global attributes in use:
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation

Criteria of software usability (from Willumeit, Gediga & Hamborg, 1995)
Questionnaire based on ISO 9241/10 (IsoMetrics), evaluating the following dimensions:
1. Suitability for the task
2. Self-descriptiveness
3. Controllability
4. Conformity with user expectations
5. Error tolerance
6. Suitability for individualization
7. Suitability for learning

KR-20 and Cronbach's alpha (from Cronbach, 1951)
Kuder-Richardson Formula 20: KR20 = (k / (k − 1)) · (1 − Σ p_i q_i / s_tot²), where i indexes the items, p_i is the proportion of 1s (correct answers), and q_i the proportion of 0s.
Cronbach's alpha: α = (c / (c − 1)) · (1 − Σ s_i² / s_tot²), where c is the number of variables, s_i² the variance of variable i, and s_tot² the variance of the sum.
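A small self-contained Python sketch of both coefficients; for dichotomous (0/1) items the item variance equals p_i · q_i, so Cronbach's alpha computed this way coincides with KR-20. The function name and the toy data in the tests are illustrative.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a persons-by-items score matrix:
    alpha = c/(c-1) * (1 - sum(item variances) / variance(total score)).
    For 0/1 items this equals KR-20, since then var(item i) = p_i * q_i."""
    c = len(item_scores[0])  # number of items/variables

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var([row[i] for row in item_scores]) for i in range(c))
    total_var = var([sum(row) for row in item_scores])
    return (c / (c - 1)) * (1 - item_vars / total_var)
```

Two extreme toy matrices illustrate the behaviour: perfectly consistent response patterns give alpha = 1, completely inconsistent ones give alpha = 0.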

Formulas for the error of measurement in categorical constructs
Cohen's kappa: κ = (p_o − p_e) / (1 − p_e).
Conger & Ward (1984) compared 16 further concordance measures for binary data.
Yule's Q coefficient (a 2×2 interdependence measure) and the phi coefficient: note the dependence on the marginal distributions, and the dependence of the significance test on N (Yates continuity correction, 1934).
2×2 table:
      B1   B2
A1    a    b
A2    c    d
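Cohen's kappa corrects observed agreement for chance agreement. A sketch for the 2×2 table with cells a, b, c, d as on the slide:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table
    (rows = rater 1: A1/A2, columns = rater 2: B1/B2).
    kappa = (p_obs - p_exp) / (1 - p_exp)."""
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement (main diagonal)
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement yields kappa = 1; agreement at exactly chance level yields kappa = 0.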

Formulas for the error of measurement in categorical constructs (continued)
Fricke's agreement coefficient: SS = sum of squares within a person; max SS = maximum possible sum of squares within the persons.
Point-biserial correlation: r_pb = ((X̄_R − X̄) / s_x) · √(N_R / (N − N_R)), where X̄ = arithmetic mean of all raw test scores, X̄_R = mean of the subjects with correct answers, s_x = standard deviation of the raw scores of all subjects, N = number of subjects, N_R = number of subjects with correct answers.
Tetrachoric correlation (example table):
      A  B  C
I     1  4  3
II    0  4  2
III   0  5  2

Formulas for the error of measurement in CTT and IRT, plus the prophecy formula
Spearman-Brown formula: r_k = k·r / (1 + (k − 1)·r), where k = the factor by which the test is lengthened.
(Further formulas shown for the Rasch model and CTT.)
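The Spearman-Brown prophecy formula predicts the reliability of a test lengthened by factor k. As code:

```python
def spearman_brown(r, k):
    """Predicted reliability r_k of a test lengthened by factor k,
    given the current reliability r: r_k = k*r / (1 + (k - 1)*r)."""
    return k * r / (1 + (k - 1) * r)
```

For example, doubling a test with r = .50 predicts r_2 = 1.0 / 1.5 ≈ .67, and k = 1 leaves the reliability unchanged.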

Some formulas for the error of measurement in metric constructs
Reliability (Kelley, 1921); the Pearson (1907) correlation, after Bravais (1846); Spearman's rho (1904); Kendall's tau (1942), with S = the difference between the numbers of concordant and discordant pairs.
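Minimal from-scratch implementations of two of these coefficients (illustrative only; the rank step below does not handle ties):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))
```

Spearman's rho equals 1 for any strictly increasing relationship, while Pearson's r equals 1 only for a linear one.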

Non-linear relationship between reliability, NDR and the standard error of a score
(Figure: reliability and NDR, each plotted against the standard error of a score.)

Item response theory (Fischer & Molenaar, 1994)
1. Dichotomous Rasch model
2. Linear logistic test model
3. Linear logistic model for change
4. Dynamic generalization of the Rasch model
5. One-parameter logistic model
6. Linear logistic latent class analysis
7. Mixture-distribution Rasch models
8. Polytomous Rasch models
9. Extended rating scale and partial credit models
10. Polytomous mixed Rasch models

...more IRT (van der Linden & Hambleton, 1997)
1. Nominal categories model
2. Response model for multiple choice
3. Graded response model
4. Partial credit model
5. Generalized partial credit model
6. Logistic model for time-limit tests
7. Hyperbolic cosine IRT model for unfolding direct responses
8. Single-item response model
9. Response model with manifest predictors
10. A linear multidimensional model

Formulas of some IRT models
(Formula images not reproduced: Rasch model, binomial model, unfolding model, Birnbaum model, latent class model.)
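The dichotomous Rasch model, as the simplest of these, gives the probability of a correct response from the person parameter θ and the item parameter b: P(X = 1 | θ, b) = exp(θ − b) / (1 + exp(θ − b)). A sketch:

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch model: probability of solving an item of
    difficulty b for a person of ability theta (logistic in theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

When ability equals difficulty the probability is exactly .5; one logit of extra ability raises it to about .73.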

Norm scales

SCL-90-R test score distribution

Simulation study on the relationship between measures of association
(Figure: scatter plots relating Yule's Y / kappa / phi, Q, SMC and the correlation to one another for dichotomized data from a 2×2 table with cells a, b, c, d, under a normal distribution with equal marginals and under a skewed distribution with unequal marginals. Question: is the relationship linear?)

Efficiency in measuring
Content: efficiency.
Concept: the less effort you need for the same amount of information, the more efficient the test is; efficiency = f(information, effort).
Index: E = amount of information / time.
Estimates: information theory (Shannon & Weaver, 1949).

Amount of information of a signal: the chess example
To identify one square on the chess board you need at least six binary (yes/no) questions, i.e. 6 bits. Question 1: left or right half? Question 2: upper or lower half? Question 3: ...
The scale unit bit can be understood as the minimal (optimal) number of questions needed to identify a signal out of a set of alternatives.
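The chess computation can be checked directly: with 64 squares, log2(64) = 6 binary questions suffice. A sketch (the function name is illustrative):

```python
import math

def questions_needed(n_alternatives):
    """Minimal number of binary (yes/no) questions needed to single out
    one alternative among n, i.e. ceil(log2(n)) bits."""
    return math.ceil(math.log2(n_alternatives))
```

For alternatives that are not a power of two the ceiling rounds up, e.g. 5 alternatives still need 3 questions.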

Rasch variances are a measure of the variability of persons within a dimension
(Diagram: chess players A, B and C; winning odds 1:2.) The difference in winning probabilities serves as the scale unit of the difference between persons.

Interpretable Rasch variances: the difficulty of solving a question or task
1. Winning probabilities → solution probabilities
2. Opponent → test item (item parameter)
3. Playing strength → person parameter
4. The difference in winning probability is defined via the logit of the Rasch model

Empirical evidence on the range of person parameters in Rasch units

AID (Kubinger & Wurst):
Subtest | Standard form | Parallel form
Alltagswissen (everyday knowledge) | 21.1 | 21.3
Realitätssicherheit (sureness about reality) | 13.3 | 13.1
Angewandtes Rechnen (applied arithmetic) | 21.7 | 20.5

Trait | Author | Range
Verbal intelligence test | Metzler & Schmidt | 11.4
Nonverbal intelligence | Formann & Piswanger | 8.2
Attitude towards sexual morality | Wakenhut | 8.1
Attitude towards penal-law reform | Wakenhut | 7.2
Complaint list | Fahrenberg | 6.4
Spatial ability | Gittler | 5.9
Handling of numbers by children | Rasch | 3.5

Usability criteria: explanations
Relevant dependencies. Example: reliability and test length, stability, ...
Irrelevant dependencies. Example: reliability and the test score distribution.
Displaying numbers: integer, positive, predictable range; a meaningful scale unit.
Familiarity: each new coefficient should be distinctly more usable than the traditional one.

7. Linearity to the unit in change
Explanation: In the case of measurement precision, this concerns the relation of the reliability to the standard error of measurement. In the case of agreement, it concerns the relation of Yule's Y to a change in the cell frequency a or d.
(Figure: correlation/reliability plotted against the standard error of measurement; Yule's Y plotted against the frequency of cell a.)

Evaluating progress through enhanced usability
1. Formal test criteria are used more frequently for test selection.
2. Tests used in practice are of higher quality.

Ergonomics in psychological test selection
Ergonomics, configuration of the environment: designing a tool to fit the hand.
Ergonomics, software conception: developing a program that can be used intuitively.
Psychological diagnostics, by analogy: restricting a test description so that the relevant information is ready to use.

Integrating ergonomics into the formal test description
(Diagram: human-interface techniques connect the test user with the test via psychometric admeasurements; analysis of usage and evaluation feed into the usability criteria.)
Goals: 1. Formal test criteria are used more frequently for test selection. 2. Tests used in practice are of higher quality.

Ergonomics and the development of criteria of usability
Requirement analysis (Mayhew, 1999): user profile, task analysis, platform capabilities/constraints.
Applied to testing: test user, test selection, test theory.

Top-down vs. bottom-up strategy to develop a coefficient
Scientist's point of view (top-down): test theory/statistics → index: generic formula → algorithm → scale (correction) → interpretation of the score (operational meaning).
Practitioner's point of view (bottom-up): defining the operational meaning → scale definition → specification within a test theory → index: defining the influencing factors → index: generic formula.
(Bottom row of the diagram, for CTT: none; association, P-R-E; SEDTTS; NDR = f(measurement error, score range, probability).)