
1 ITC Conference, Winchester, 2002. Computer-based Testing: Usability of Psychometric Admeasurements. Dr. J. M. Müller, University of Tübingen, Germany

2 Overview
1. Introduction: Formal test descriptions in practice
2. Definition of usability in the context of test description
3. Illustrating problems: Reliability
4. Criteria of usability: foundation, scaling, general attributes
5. Two examples of enhanced usability: NDR and PDR
6. Summary

3 Introduction: Psychometric admeasurements in practice today and tomorrow
1. Test users often use poor-quality tests (e.g. Piotrowski et al.; Wade & Baker, 1977). Psychometric knowledge (Moreland et al., 1995) / competence approach (Bartram, 1995, 1996).
2. What should be described? CBT: criteria for software usability (ISO 9241/10, 1991; Willumeit, Gediga & Hamborg, 1995) and further criteria: platform independence, the possibility of building one's own norm banks, data protection.
3. How should it be described?
4. Good-practice guidelines and standards are based on quality criteria (e.g. Standards for Educational and Psychological Testing, APA, 1999; International Guidelines for Test Use, ITC, 2000).
Quality supply vs. quality demand

4 Definition of Usability
Scope of usability: Usability in the context of psychological testing concerns all important kinds of information test users need to describe a test for various purposes, and the ways to communicate them. This includes test manuals as well as formal test descriptions by means of psychometric admeasurements.
Aim of usability: The product of good usability is that any test user finds all necessary information quickly and in a properly standardized form, ready to use for answering the test user's questions, enabling a decision on whether a test is an appropriate aid for the diagnostic question.
Frame of usability: Quality assurance in the context of psychological testing refers to test construction, test translation, test description and the use of tests in practice. Methods to enhance quality control can include guidelines for test use, standards for test description, etc. Usability is a strategy to enhance quality on the level of formal description.
Consequences of usability concern the reengineering of formal test description.

5 Indices of the measurement of error
[Diagram: taxonomy of error-of-measurement indices.]
Dimensional constructs (CTT, IRT, generalizability theory): retest and Pearson correlation, Spearman correlation, % or SMC, phi coefficient, Yule's Y, Cronbach's alpha, Kuder-Richardson Formula 20, Spearman-Brown prophecy formula, intraclass correlation, standard error of a score, model fit, likelihoods, information function.
Categorical constructs (nonspecific and specific misclassification): sensitivity TP/(TP+FN), specificity TN/(TN+FP), kappa (reclassification), kappa (interrater).

6 Reliability: relationships between the indices of error of measurement
[Diagram: the indices from slide 5 connected by their relationships, e.g. Y / kappa / phi linked to correlation, phi linked to kappa, and information criteria linked to the standard error of a score.]

7 Top-down vs. bottom-up strategy to develop a coefficient
Scientist's point of view (top-down): test theory/statistics -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning).
Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula.

8 Rescaling reliability: number of distinctive results (NDR) (Wright & Masters, 1982; Lehrl & Kinzel, 1973; Müller, 2001). Formula (shown as an image): R = test score range, k = critical difference.
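The NDR formula itself is shown only as an image, but the NDR values on slide 10 can be reproduced under plausible assumptions: take the score range R as four standard deviations and the critical difference k as 1.96 times the standard error of a score difference. A minimal Python sketch; the 4-SD span, the z value, and the rounding rule are assumptions, not taken from the slide:

```python
import math

def ndr(reliability, sd=1.0, z=1.96, span=4.0):
    """Number of distinctive results (sketch).

    Assumptions (not stated on the slide): the score range R is taken
    as `span` standard deviations, and the critical difference is
    k = z * SE_diff with SE_diff = sqrt(2) * SEM, SEM = sd * sqrt(1 - r).
    """
    sem = sd * math.sqrt(1.0 - reliability)   # standard error of measurement
    k = z * math.sqrt(2.0) * sem              # critical difference between two scores
    r_range = span * sd                       # test score range R
    return int(r_range / k)                   # whole distinguishable score bands

# Reproduces the values shown on slide 10:
print(ndr(0.50), ndr(0.92), ndr(0.98))  # 2 5 10
```

Under these assumptions the reliability values .50, .92 and .98 yield exactly the NDR values 2, 5 and 10 from slide 10.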

9 Criteria of usability for formal quality criteria (modified from Müller, 2001, 2002a,b; Goodman & Kruskal, 1954)
Foundation: 1. Unambiguous operational meaning 2. Unambiguous formal definition 3. Broad application area 4. Relevant dependencies 5. Independence of irrelevant factors
Scale definition: 1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values 2. Comparable to the reference scale 3. Significant scale unit, which implies a minimum number of observations (N_min)
Global attributes in use: 1. Relevance 2. Informative (not redundant) 3. Predictable for the test user (nominal/actual value comparison) 4. Easy to learn 5. Easy to utilise 6. Fisher's (1925) criteria of estimation

10 NDR at work... r = .50 -> NDR = 2; r = .92 -> NDR = 5; r = .98 -> NDR = 10.
Distribution of the reliability coefficient; conclusion: many precise tests. Distribution of the NDR coefficient; conclusion: some precise tests.

11 Probability of distinctive results (PDR). Formula (shown as an image): complete pairwise comparison of test scores. A rectangular distribution shows an 80% probability of distinguishing two test scores; a Gaussian distribution shows a 60% probability.
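The PDR formula is likewise only an image; the stated idea, a complete comparison of all score pairs, can be sketched as the share of pairs whose difference exceeds a critical difference k. The function name and the choice of a strict inequality are assumptions:

```python
import itertools

def pdr(scores, k):
    """Probability of distinctive results (sketch): the share of all
    score pairs whose difference exceeds the critical difference k."""
    pairs = list(itertools.combinations(scores, 2))
    distinct = sum(1 for a, b in pairs if abs(a - b) > k)
    return distinct / len(pairs)

# Widely spread scores are pairwise distinguishable, tightly
# clustered scores are not:
print(pdr([0, 10, 20, 30], 5))  # 1.0 -> every pair separated
print(pdr([0, 1, 2, 3], 5))     # 0.0 -> no pair separated
```

This makes the slide's contrast concrete: for a fixed k, a rectangular (spread-out) score distribution separates more pairs than a Gaussian one that concentrates scores near the mean.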

12 PDR: simulation study. [Plot: PDR vs. reliability.] Performance in separating test scores with respect to reliability and score distribution.

13 PDR: example. Subscale Resignation (Stress Coping Questionnaire SVF-KJ; Hampel, Petermann & Dickow, 1999; N = 1123): r = .81, PDR = 41.6%. Subscale Unsicherheit (Symptom Check List; Derogatis, 1977; German version Franke, 1995; N = 875): r = .81, PDR = 30.6%. Equal reliability, but clearly different probabilities of distinguishing scores.

14 Reviewing NDR and PDR
1. NDR and PDR can be derived in any test-theoretical model; progress is being made in the application area.
2. NDR and PDR have an easy-to-understand operational meaning.
3. NDR and PDR are predictable for the test user in the nominal/actual value comparison.
NDR and PDR serve as examples of how to develop more usable formal test descriptions.

15 Summary
1. Usability is a possible strategy, with explicit and observable criteria, for improving formal test descriptions and for indirectly strengthening the role of guidelines and standards.
2. With NDR and PDR, two easy-to-understand coefficients have been proposed, the application of which is in progress in several test-theoretical models.

16 Thank you for your attention!

17 Medicine: effect-size measures. Practitioner's coefficient: NNT [number needed to treat], the number of patients who need to be treated to prevent one adverse outcome (taken from the EBM glossary, Evidence-Based Medicine, volume 125, number 1). Scientific coefficient: effect size (Cohen, 1988).
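The NNT quoted from the EBM glossary is the reciprocal of the absolute risk reduction between the control and treated groups; a small illustration (the example rates are invented):

```python
def nnt(control_event_rate, treated_event_rate):
    """Number needed to treat: the reciprocal of the absolute risk
    reduction (ARR) between control and treated groups."""
    arr = control_event_rate - treated_event_rate
    return 1.0 / arr

# If 20% of controls but only 15% of treated patients suffer the
# adverse outcome, ARR = 5 percentage points, so about 20 patients
# must be treated to prevent one outcome.
print(round(nnt(0.20, 0.15)))  # 20
```

This is exactly the "practitioner's coefficient" pattern the talk advocates: a count of patients is immediately actionable, where a standardized effect size d is not.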

18 Measuring in technical fields: solutions from engineering. There is a German norm, DIN 2257, on how to measure the physical length of an object and how to report the result. The norm allows as output only values with statistical evidence.

19 Criteria of usability for formal quality criteria, applied to NNT
Foundation: 1. Unambiguous operational meaning 2. Unambiguous formal definition 3. Broad application area 4. Relevant dependencies 5. Independence of irrelevant factors
Scale definition: 1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values 2. Comparable to the reference scale 3. Significant scale unit, which implies a minimum number of observations (N_min)
Global attributes in use: 1. Relevance 2. Informative (not redundant) 3. Predictable for the test user (nominal/actual value comparison) 4. Easy to learn 5. Easy to utilise 6. Fisher's (1925) criteria of estimation

20 Criteria of software usability (from Willumeit, Gediga & Hamborg, 1995). Questionnaire based on ISO 9241/10 (IsoMetrics) evaluating the following dimensions:
1. Suitability for the task
2. Self-descriptiveness
3. Controllability
4. Conformity with user expectations
5. Error tolerance
6. Suitability for individualization
7. Suitability for learning

21 KR-20 and Cronbach's alpha. Kuder-Richardson formula KR-20: i = item, p_i = proportion of 1 answers, q_i = proportion of 0 answers (from Cronbach, 1951). Cronbach's alpha: c = number of variables, s_i^2 = variance of variable i, s_tot^2 = variance of the sum.
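Both formulas can be written out directly from the definitions on the slide. A sketch assuming population variances (divide by N), which also shows why KR-20 is the special case of alpha for dichotomous items:

```python
def _pop_var(xs):
    """Population variance (divide by N), as in the classical formulas."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of item-score columns,
    one inner list per item, aligned across persons."""
    c = len(items)
    totals = [sum(person) for person in zip(*items)]
    return (c / (c - 1)) * (1 - sum(_pop_var(i) for i in items) / _pop_var(totals))

def kr20(items):
    """KR-20 for dichotomous 0/1 items: item variance replaced by p*q."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    pq = sum(p * (1 - p) for p in (sum(i) / len(i) for i in items))
    return (k / (k - 1)) * (1 - pq / _pop_var(totals))

# For 0/1 items, p*q equals the item's population variance,
# so KR-20 and alpha coincide:
data = [[1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
print(abs(kr20(data) - cronbach_alpha(data)) < 1e-9)  # True
```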

22 Formulas for the error of measurement in categorical constructs. Cohen's kappa. A further 16 measures of concordance between two measurements for binary data are compared by Conger & Ward (1984). Yule's fourfold interdependence measure: Q coefficient. Phi coefficient: dependence on the marginal distributions; dependence of the significance test on N (Yates continuity correction, 1934).
2x2 table: rows A1 (cells a, b) and A2 (cells c, d); columns B1 and B2.
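The coefficients named on this slide can all be computed from the four cells a, b, c, d. A sketch (cell roles as in the slide's table, rows A1 = (a, b) and A2 = (c, d)) that also illustrates the slide's point about marginal distributions: with equal marginals phi and kappa coincide, while Yule's Q does not:

```python
import math

def two_by_two(a, b, c, d):
    """Phi, Yule's Q and Cohen's kappa for a 2x2 agreement table with
    cells a (both positive), b and c (disagreements), d (both negative)."""
    n = a + b + c + d
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    q = (a * d - b * c) / (a * d + b * c)                    # Yule's Q
    po = (a + d) / n                                         # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2    # chance agreement
    kappa = (po - pe) / (1 - pe)
    return phi, q, kappa

# With equal marginals, phi and kappa agree while Q is larger:
phi, q, kappa = two_by_two(40, 10, 10, 40)
print(round(phi, 3), round(q, 3), round(kappa, 3))  # 0.6 0.882 0.6
```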

23 Formulas for the error of measurement in categorical constructs. Fricke's agreement coefficient: SS = sum of squares within a person; max SS = maximum possible sum of squares within persons. Point-biserial correlation: X = arithmetic mean of all raw test scores, X_R = arithmetic mean of the subjects with correct answers, s_x = standard deviation of the raw scores of all subjects, N = number of all subjects, N_R = number of subjects with correct answers. Tetrachoric correlation.
Example table (columns A, B, C): row I (1, 4, 3), row II (0, 4, 2), row III (0, 5, 2).

24 Formulas for the error of measurement in CTT and IRT, plus the prophecy formula. Spearman-Brown formula: k = factor by which the test is lengthened. Rasch model. CTT.
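The Spearman-Brown prophecy formula mentioned here has the standard closed form r_k = k*r / (1 + (k - 1)*r), with k the lengthening factor. As a sketch:

```python
def spearman_brown(r, k):
    """Spearman-Brown prophecy formula: predicted reliability of a
    test whose length is changed by the factor k."""
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.6, 2), 2))  # 0.75: doubling a test with r = .60
```

Halving the same test (k = 0.5) drops the predicted reliability to about .43, which is why the formula is listed among the relevant dependencies of reliability on test length.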

25 Some formulas for the error of measurement in metric constructs: reliability (Kelley, 1921); Pearson (1907) correlation, after Bravais (1846); Spearman's rho (1904); Kendall's tau (1942), where S = difference between the numbers of concordant and discordant pairs (proversions and inversions).
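Of the coefficients listed, Spearman's rho has the most compact closed form. A sketch using the classical rank-difference formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), assuming no ties:

```python
def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_rho([3, 1, 4, 2], [30, 10, 40, 20]))  # 1.0 (same ordering)
print(spearman_rho([3, 1, 4, 2], [20, 40, 10, 30]))  # -1.0 (reversed ordering)
```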

26 Non-linear relationship between reliability, NDR and the standard error of a score. [Plots: NDR and the standard error of a score as functions of reliability.]

27 Item response theory (Fischer & Molenaar, 1994)
1. Dichotomous Rasch model
2. Linear logistic test model
3. Linear logistic model for change
4. Dynamic generalization of the Rasch model
5. One-parameter logistic model
6. Linear logistic latent class analysis
7. Mixture-distribution Rasch models
8. Polytomous Rasch models
9. Extended rating scale and partial credit models
10. Polytomous mixed Rasch models

28 ...more IRT (van der Linden & Hambleton, 1997)
1. Nominal categories model
2. Response model for multiple choice
3. Graded response model
4. Partial credit model
5. Generalized partial credit model
6. Logistic model for time-limit tests
7. Hyperbolic cosine IRT model for unfolding direct responses
8. Single-item response model
9. Response model with manifest predictors
10. A linear multidimensional model

29 Formulas of some IRT models: Rasch model, binomial model, unfolding model, Birnbaum model, latent class model.
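The dichotomous Rasch model named here has the standard logistic form P(correct) = exp(theta - beta) / (1 + exp(theta - beta)). A sketch (the function name is an assumption):

```python
import math

def rasch_p(theta, beta):
    """Dichotomous Rasch model: probability of a correct response for
    person parameter theta and item difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

print(rasch_p(0.0, 0.0))        # 0.5: person ability equals item difficulty
print(rasch_p(1.0, 0.0) > 0.5)  # True: an abler person succeeds more often
```

The Birnbaum model generalizes this by multiplying (theta - beta) with an item discrimination parameter.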

30 Criteria of software usability (from Willumeit, Gediga & Hamborg, 1995). Questionnaire based on ISO 9241/10 (IsoMetrics) evaluating the following dimensions:
1. Suitability for the task
2. Self-descriptiveness
3. Controllability
4. Conformity with user expectations
5. Error tolerance
6. Suitability for individualization
7. Suitability for learning

31 Norm scales

32 SCL-90-R test score distribution

33 Simulation study on the relationship between measures of association. Measures compared pairwise: Y / kappa / phi, Q, correlation, SMC. Conditions: dichotomized normal distribution with equal marginals vs. skewed distribution with unequal marginals. Question: is the relationship between the measures linear? [2x2 table with cells a, b in row A1 and c, d in row A2; columns B1, B2.]

34 Efficiency in measuring. Content: efficiency. Concept: the less effort you need for the same amount of information, the more efficient the test is; efficiency = f(information, effort). Index: E = amount of information / time. Estimates: information theory (Shannon & Weaver, 1949).

35 Amount of information of a signal: chess example. To identify one square on the chess board you need at least 6 binary questions (at chance level), that is, 6 bits. Question 1: left or right? Question 2: top or bottom? Question 3: ... The scale unit bit can be understood as the minimal or optimal number of questions needed to identify a signal out of a set of alternatives.
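The chess example follows directly from Shannon's measure: 64 equally likely squares require log2(64) = 6 yes/no questions. A sketch:

```python
import math

def bits(alternatives):
    """Information needed to single out one of `alternatives` equally
    likely options: the minimal number of binary (yes/no) questions."""
    return math.log2(alternatives)

print(bits(64))  # 6.0 -> six binary questions identify one chess square
```

Each question that halves the remaining set (left/right, top/bottom, ...) contributes exactly one bit, which is why six such questions suffice for 64 squares.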

36 Chess players A, B, C (odds 1:2). Rasch variances are a measure of the variability of persons within a dimension. The difference in winning probabilities serves as the scale unit of distinctness.

37 Interpretable Rasch variances: the difference in the ability to solve a question or task.
1. Winning probabilities -> solution probabilities
2. Opponent -> test item (item parameter)
3. Playing strength -> person parameter
4. The difference in winning probability is defined via the logit of the Rasch model

38 Empirical evidence on the range of person parameters in Rasch units.
AID (Kubinger & Wurst), standard form / parallel form: everyday knowledge (Alltagswissen) 21.1 / 21.3; sureness of reality (Realitätssicherheit) 13.3 / 13.1; applied arithmetic (Angewandtes Rechnen) 21.7 / 20.5.
Trait, author, range: verbal intelligence test (Metzler & Schmidt) 11.4; nonverbal intelligence (Formann & Piswanger) 8.2; attitude toward sexual morality (Wakenhut) 8.1; attitude toward penal-law reform (Wakenhut) 7.2; complaint list (Fahrenberg) 6.4; spatial ability (Gittler) 5.9; handling of numbers in children (Rasch) 3.5.

39 Usability criteria: explanations. Relevant dependencies, example: reliability and test length, stability, ... Irrelevant dependencies, example: reliability and test score distribution. Displaying numbers: integer, positive, predictable range; meaningful scale unit. Familiarity: each new coefficient should be distinctly more usable than the traditional one.

40 7. Linearity with respect to the unit in change. Explanation: in the case of measurement precision this concerns the relationship of reliability to the standard error of measurement; in the case of agreement it concerns the relationship of Yule's Y to a change in the cell frequencies a or d. [Plots: correlation/reliability vs. standard error of measurement; Yule's Y vs. freq(cell a).]

41 Evaluating the progress achieved through enhanced usability: 1. Formal test criteria are used more frequently for test selection. 2. Tests used in practice are of higher quality.

42 Ergonomics in psychological test selection. Ergonomics: configuration of the environment (designing a tool to fit the hand); software conception (developing a program that can be used intuitively). Psychological diagnostics: restricting a test description so that the relevant information is ready to use.

43 Integrating ergonomics into the formal test description. Human interface techniques: test user, psychometric admeasurements, test; analysis of usage, evaluation, usability criteria. Goals: 1. Formal test criteria are used more frequently for test selection. 2. Tests used in practice are of higher quality.

44 Ergonomics and the development of criteria of usability. Requirement analysis (Mayhew, 1999): user profile, task analysis, platform capabilities/constraints. Applied here: test user, test selection, test theory.

45 Top-down vs. bottom-up strategy to develop a coefficient.
Scientist's point of view (top-down): test theory/statistics -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning).
Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula.
Example row for NDR: test theory CTT; association; P-R-E; NDR = f(measurement error, score range, probability).

