Psychometric Aspects of Linking Tests to the CEF Norman Verhelst National Institute for Educational Measurement (Cito) Arnhem – The Netherlands.


Overview
- What is the problem?
  - CEF
  - Psychometric problems
- Internal validation
  - Reliability
  - Dimensionality
- Linking
  - Test equating
  - Standard setting

Problem 1: What is the CEF? (descriptive)
A classification system of linguistic acts (in a foreign language?)
- Basic building blocks: can-do statements
- Multifaceted:
  - skills (listening, reading, ...; Table 2, pp. 26-27)
  - qualitative aspects (range, coherence; Table 3, p. 28)
  - contexts (personal, educational; Table 5, p. 48)

Solution 1: What could we do if the CEF were nothing more than a descriptive system?
- Determination problem: is a concrete performance (linguistic act) an exemplar of a can-do statement?
- We place a check mark next to each can-do statement exemplified by the performance, thus forming a complicated high-dimensional profile.
- We would probably encounter consistency problems (e.g. succeeding in only one of two exemplars of the same can-do statement).

Problem 2: The CEF is also a hierarchical system
- Three main classifications: Basic, Independent and Proficient (A, B and C)
- Further subdivisions: A1, A2, B1, B2, C1 and C2 (cumulative and therefore ordered)
- Implications:
  - Language proficiency is measurable using this system
  - It implies a (rudimentary) theory of language acquisition

The Linking Problem
- Devise a structured observation of linguistic acts (a test)
- Assign a person to one of the levels A1, ..., C2
  - using his/her test performance
  - using 'objective' rules
  - in such a way that the assigned level 'B1' corresponds to the 'real' level B1 as defined in the CEF
- The 'Manual' tells you how to proceed.

Internal validation
- Restriction to itemized tests
- 'Universal simplification': test performance is summarized by a single number, the test score
- Typical result: a score (on my test) in the range 21 to 32 corresponds to B1
- Why should one have confidence in your test?

Reliability
- Every measurement contains an error
- The true score is the average score over a (huge) number of similar test administrations
- True scores and measurement errors are not 'knowable' in particular cases
- One can know something 'in general'
- Basic notion: the reliability coefficient

Reliability coefficient (Rel)
- Rel is the correlation between the scores on two parallel tests; it lies in [0, 1]
- Reliability is a characteristic of the test in some population
- We do not compute Rel in the population but estimate it from a sample: estimation error
- Establishing the reliability with a single test administration is very hard (Cronbach's alpha)
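The single-administration estimate mentioned above, Cronbach's alpha, can be sketched in a few lines. This is a minimal illustration on made-up dichotomous item data (not data from the presentation):

```python
# Cronbach's alpha from an item-score matrix: a common single-administration
# reliability estimate. The data below are hypothetical.
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: list of persons, each a list of k item scores."""
    k = len(scores[0])
    # variance of each item across persons
    item_vars = [pvariance([person[i] for person in scores]) for i in range(k)]
    # variance of the total test score across persons
    total_var = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(data), 3))  # -> 0.667
```

The formula is alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); with real data one would of course use far more persons and items.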

Rel should be high(ish)?

Example continued (SEM = 3)
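The slide's graphic is not reproduced, but its point (assumed here) can be sketched: with a standard error of measurement (SEM) of 3 score points, an observed score only locates the true score within a fairly wide band. The normal-error assumption and z = 1.96 are illustrative choices, not from the presentation:

```python
# Approximate 95% band for the true score, assuming normally distributed
# measurement error with SEM = 3 (the slide's example value).
def true_score_band(observed, sem=3.0, z=1.96):
    """Return (low, high) bounds around an observed score."""
    return (observed - z * sem, observed + z * sem)

low, high = true_score_band(27)
print(f"observed 27 -> true score roughly between {low:.1f} and {high:.1f}")
```

Note that a band of roughly ±6 points around a score of 27 spans about 21 to 33, which shows why the earlier score range 21 to 32 for 'B1' should be read with caution.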

Dimensionality
- Can do at level B1: Can recognise significant points in straightforward newspaper articles on familiar subjects
- Can do at level B1: Can understand clearly written, straightforward instructions for a piece of equipment

Associated Problems
- Is a test consisting of 20 exemplars (items) of the 'black' can-do equivalent to a test consisting of 20 exemplars of the 'blue' can-do?
  - If 'Yes': how do we know this?
  - If 'No': what justifies placing the two can-do's at the same level (B1)?
- Maybe the score on the 'blue test' is partly determined by the trait 'technical insight'.
- The previous hypothesis can be tested empirically.

Multidimensionality: techniques
- If two tests measure the same concept, their scores will generally not correlate perfectly, because of measurement error.
- Correction for attenuation: r'(X,Y) = r(X,Y) / sqrt(Rel_X * Rel_Y)
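The correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities; the numbers below are hypothetical:

```python
# Correction for attenuation: estimate the correlation between two true
# scores from the observed correlation and the tests' reliabilities.
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Observed correlation corrected for unreliability of both tests."""
    return r_xy / math.sqrt(rel_x * rel_y)

# e.g. an observed correlation of 0.6 between two tests with Rel = 0.8 each
print(disattenuate(0.6, 0.8, 0.8))  # -> 0.75
```

A disattenuated correlation near 1 supports the claim that the two tests measure the same concept; a clearly lower value suggests a second dimension is involved.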

Factor Analysis
- The basic technique to reveal the dimensionality structure of a test battery
- Has many variants, some technically very complicated
- The basic notions should be mastered by every scholar in language testing
- Reference: Section F of the Reference Supplement to the Manual
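Factor analysis proper is beyond a short sketch, but the underlying idea can be illustrated minimally: the eigenvalues of the inter-test correlation matrix hint at how many dimensions a battery measures. The correlation matrix below is hypothetical, not from the presentation:

```python
# Eigenvalues of a hypothetical correlation matrix among 4 subtests,
# built as two clusters of two highly correlated tests.
import numpy as np

R = np.array([
    [1.0, 0.8, 0.2, 0.2],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
])
eigvals = np.linalg.eigvalsh(R)[::-1]  # sorted descending
print(np.round(eigvals, 2))  # two large eigenvalues suggest two factors
```

Two eigenvalues clearly above 1 (here 2.2 and 1.4) point to a two-dimensional structure, matching the two clusters built into the matrix.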

Transition (1)
- The concepts discussed so far refer to internal validation, but also to external validation
  - The blue test of the example should not turn out to be a test of technical insight
  - An informative factor analysis will include other than pure language tests (or subtests)
- Provisional conclusion: my test is professionally constructed and measures proficiency in the sense described by the CEF
  - The items are well-devised exemplars of can-do statements
  - There is a well-defined balance across the qualitative aspects deemed important in the CEF
- In this sense, the test is linked to the CEF.

Transition (2)
- But we want more: assignment to a CEF level using only test scores, e.g.
  - less than 55 → 'level A2 or lower'
  - 55 or more → 'level B1 or higher'
  - 73 or more → 'level B2 or higher'
  - By implication, a score in [55, 72] assigns to B1
  - 55 and 73 are called cutting points (or cut-off scores)
- Two classes of techniques:
  - Test Equating
  - Standard Setting
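The cut-score rule above is mechanical once the cutting points are fixed; as code (55 and 73 are the slide's example values, not real standards):

```python
# Map a test score to a CEF-level statement via cut-off scores.
def assign_level(score, cuts=((73, "B2 or higher"), (55, "B1"))):
    """Check cut scores from highest to lowest; below all cuts -> lowest band."""
    for cut, level in cuts:
        if score >= cut:
            return level
    return "A2 or lower"

for s in (40, 55, 72, 73):
    print(s, "->", assign_level(s))
```

The hard part, of course, is not this lookup but justifying the values 55 and 73, which is what equating and standard setting are about.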

Test Equating
- To replace an 'old' test X by a 'new' test Y
- Problem: find for each possible score on X an equivalent score on Y
- Especially useful if X and Y are not equally difficult
- Many techniques, some very easy to apply
- But...

Example: Equipercentile Equating
(Table not reproduced: for each score, the cumulative percentage on test X (old) is matched to the score with the same cumulative percentage on test Y (new).)
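The slide's table is not reproduced, so the sketch below uses hypothetical score frequencies; the principle is the one named above: a score on X maps to the Y score reaching the same percentile rank.

```python
# Equipercentile equating on hypothetical frequency data.
import bisect

def cumulative_pct(freqs):
    """Cumulative percentage at or below each score (scores = list indices)."""
    total = sum(freqs)
    out, running = [], 0
    for f in freqs:
        running += f
        out.append(100.0 * running / total)
    return out

def equate(score_x, freqs_x, freqs_y):
    """Y score whose cumulative percentage first reaches that of score_x on X."""
    px = cumulative_pct(freqs_x)[score_x]
    py = cumulative_pct(freqs_y)
    return bisect.bisect_left(py, px)

# hypothetical score frequencies on a 0-5 point test
freqs_x = [2, 5, 10, 14, 6, 3]   # 'old' test X
freqs_y = [1, 3, 8, 12, 10, 6]   # 'new', somewhat easier test Y
print(equate(3, freqs_x, freqs_y))  # -> 4
```

With these frequencies a score of 3 on the harder old test corresponds to a score of 4 on the easier new test, as expected: more points are needed on the easier test to reach the same percentile.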

Standard Setting
- If there is no test to equate with, somebody has to decide whether the minimum score for a 'B1' assignment is 55 or 54 or 61 or whatever.
  - Who is/are this somebody?
  - How do they decide?
  - Is their decision the 'truth'?
  - Why would we trust such a decision?

Who is involved?
- Not a single individual, but a group of persons ('the larger the better')
- A group of experts, i.e. trained persons
  - They know the CEF
  - They recognize exemplars
- A whole chapter of the Manual is devoted to this problem.

How to decide?
- Many standard-setting methods are documented in the literature (starting in 1954!)
- See Section B of the Reference Supplement to the Manual
  - Test-centered
  - Examinee-centered

Standard Setting in DIALANG
- Suppose the target level is B1
- For each item, ask the following question: "Do you think a person at level B1 should be able to answer this item correctly?"
- Count the number of items answered with 'yes'
- Average across judges and you get the standard
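The counting-and-averaging procedure described above (an Angoff-style yes/no method) is simple enough to sketch directly; the judge data are hypothetical:

```python
# Yes/no standard setting: each judge answers, per item, whether a person
# at the target level should get it right; the standard is the average
# yes-count across judges.
def standard_from_judgements(judgements):
    """judgements: one list of booleans per judge, one boolean per item."""
    counts = [sum(j) for j in judgements]  # yes-count per judge
    return sum(counts) / len(counts)       # average across judges

judges = [
    [True, True, False, True, False],   # judge 1: 3 yes
    [True, True, True, True, False],    # judge 2: 4 yes
    [True, False, False, True, False],  # judge 3: 2 yes
]
print(standard_from_judgements(judges))  # (3 + 4 + 2) / 3 = 3.0
```

The resulting standard (here 3 of 5 items) is then used as the minimum score for the target-level assignment.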

Is their decision the truth?
?

Why would we trust such decisions?
- Intra-judge consistency
  - 'Easy' items (yes), 'hard' items (no)
- Inter-judge consistency
  - Do you agree 'in general' with your colleagues?
- External validation
  - Maybe there is an external test
  - Overall judgment of the teacher