Quality Standards in TIMSS and PIRLS: The Basis for Valid and Reliable Data for Educational Decision Making. Ina V.S. Mullis, Michael O. Martin, & Pierre Foy.


Quality Standards in TIMSS and PIRLS: The Basis for Valid and Reliable Data for Educational Decision Making. Ina V.S. Mullis, Michael O. Martin, & Pierre Foy. Russian Education in the Mirror of the International Comparative Studies, June 19, 2013.

1 Our Mission: Provide Internationally Comparable Data of High Quality for Improving Education. Data about student achievement – Reading, Mathematics, and Science. Data about the contexts for teaching and learning – key factors influencing achievement, relevant for educators and policy makers.

2 “Internationally Comparable Data of High Quality” Requires 100% attention to doing high quality work With quality assurance steps along the way Classic attributes of high quality achievement data: –Reliability –Validity –International Comparability

3 Reliability. Instruments measure consistently what they are intended to measure: the instruments are the same, the environment for using the instruments is the same, persons respond to the instruments in the same way, and the instruments are scored in the same way. This ensures that comparisons are based on “real” achievement and are not affected by extraneous factors.

4 Validity Inferences drawn from results can be supported by evidence Requires unified agreement –about how the construct has been conceptualized and articulated… e.g., is this mathematics? –on how it has been operationalized… e.g., do these items measure mathematics? In other words, does a student with a high score on the mathematics test actually know a lot of mathematics?

5 What About International Comparability? Our Curricula are different! Our languages are different! Our school systems are organized differently! –Different age of entry –Duration of compulsory schooling –Percentage of students attending school –Stages of schooling (primary, elementary, etc.) –Different promotion and retention policies

6 Validity in an International Context Need to ensure that data are internationally comparable Inferences made about achievement differences between countries can be substantiated Accomplished by setting high quality standards in all TIMSS and PIRLS procedures for developing and administering the achievement assessments

7 Ensuring Reliability and Validity of the TIMSS and PIRLS Achievement Data: Assessment Framework, Test Development, Field Test, Translation Verification, Target Population, Sampling.

8 Ensuring Reliability and Validity of the TIMSS and PIRLS Achievement Data: Data Collection, Constructed-Response Scoring, Database Construction, Achievement Scaling, Reporting Achievement Data.

9 Assessment Frameworks Dealing with different curricula Define the constructs in detail –TIMSS: content and cognitive domains –PIRLS: Purposes and processes

10 Assessment Frameworks Developed through widespread collaboration with participating countries Literature reviews, current perspectives Surveys to align assessments with countries’ curricula Iterative reviews by National Research Coordinators –Within country and in plenary Iterative reviews by expert panels – SMIRC, RDG

11 Assessment Frameworks Updated with each assessment cycle Incorporate fresh perspectives Accommodate new countries Evolve over time

12 Test Development. In accordance with the Assessment Framework – assess topics/content in the Framework. Ambitious frameworks require many items for adequate measurement – each domain requires sufficient representation. Trend measurement requires many items – items are released and replaced with each cycle. TIMSS and PIRLS have lots of items!

13 Test Development Developed in proportion to emphases agreed in Framework According to decisions about item format –50% multiple choice; 50% constructed response With scoring guides for constructed-response items According to careful plan for measuring trends –Approximately 60% trend, 40% new

14 Field Test Essential for confirming appropriateness and comparability of items – different languages? And to verify the proper implementation of all procedures Twice as many items as needed Translation by each country Scoring guides for constructed-response items and training

15 Field Test TIMSS & PIRLS ISC develops manuals describing standardized procedures IEA DPC checks and processes data TIMSS & PIRLS ISC conducts item analyses –Difficulty –Discrimination –Scoring reliability
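
The slide does not show the formulas behind these analyses. As a rough illustration (not the operational software), classical field-test statistics of this kind are often the item difficulty, i.e., the proportion answering correctly, and the item discrimination, i.e., the correlation between the item score and the score on the remaining items; the function and data layout below are hypothetical.

```python
import numpy as np

def classical_item_analysis(scores):
    """Illustrative classical item analysis.

    scores: 2-D array (students x items) of 0/1 item scores.
    Returns per-item difficulty (proportion correct) and
    discrimination (point-biserial correlation with the rest score).
    """
    scores = np.asarray(scores, dtype=float)
    difficulty = scores.mean(axis=0)                 # proportion answering correctly
    total = scores.sum(axis=1)
    discrimination = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]                  # total score excluding item j
        discrimination.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return difficulty, np.array(discrimination)

# Example: 5 students, 3 items
diff, disc = classical_item_analysis([[1, 0, 1],
                                      [1, 1, 1],
                                      [0, 0, 1],
                                      [1, 1, 0],
                                      [0, 0, 1]])
print(diff, disc)
```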

16 Finalizing Item Selection Task force and TIMSS & PIRLS ISC make initial recommendation about items to retain Field test data and initial recommendation reviewed by expert committees – SMIRC, RDG Field test data and expert committee recommendation about item selection reviewed by the NRCs from participating countries Assessment items adopted by NRCs

17 Reliability and Validity in Data Collection, Analysis, and Reporting Are the target populations comparable? Was sampling conducted properly? Are translations comparable? Were the tests administered appropriately? Was scoring done correctly? Are the data comparable? Are the achievement results comparable?

18 Comparable Target Populations? School systems organized differently. In TIMSS and PIRLS, amount of instruction → years of schooling. PIRLS: 4 years of schooling, counting from the 1st year of primary → 4th grade. TIMSS: 4 & 8 years of schooling → 4th & 8th grades. Based on ISCED definitions.

19 Comparable Target Populations? Why grade and not age as the basis? – Better for improving education! Education is organized by grade, so grade-based data are easier to use for implementing reforms. Amount of instruction, not maturation, is the primary determinant of achievement – students learn through instruction, not simply by growing older.

20 Comparable Target Populations? Has country chosen correct grade? Are all eligible students included in definition? –Generally yes, for most countries –If less than 100%, annotated in International Reports Are exclusions kept to a minimum? –Generally yes, for most countries –If more than 5%, annotated in International Reports

21 Sampling Conducted Correctly? TIMSS and PIRLS Requirements Random sample design –Developed and authorized by Statistics Canada Accurate school sampling frame –School sampling done by Statistics Canada Proper classroom sampling –Use of WinW3S software mandatory
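
The slides do not spell out the selection algorithm. As a hedged illustration, school samples drawn from a frame with a measure of size are commonly selected by systematic probability-proportional-to-size (PPS) sampling. The sketch below shows that general idea only; it is not the Statistics Canada implementation, the function name and inputs are hypothetical, and very large "certainty" schools would be handled separately in practice.

```python
import random

def systematic_pps_sample(frame, n_schools, seed=None):
    """Illustrative systematic PPS selection of schools.

    frame: list of (school_id, enrolment) pairs in sampling-frame order.
    Larger schools receive a proportionally larger chance of selection.
    """
    rng = random.Random(seed)
    total = sum(size for _, size in frame)
    interval = total / n_schools                   # sampling interval on the size scale
    start = rng.uniform(0, interval)               # random start within the first interval
    targets = [start + k * interval for k in range(n_schools)]

    selected, cum = [], 0.0
    t = iter(targets)
    next_target = next(t)
    for school_id, size in frame:
        cum += size
        while next_target is not None and next_target <= cum:
            selected.append(school_id)             # a huge school could be hit twice here
            next_target = next(t, None)
    return selected

# Example: draw 3 schools from a tiny frame
print(systematic_pps_sample([("A", 800), ("B", 300), ("C", 500),
                             ("D", 200), ("E", 700)], 3, seed=1))
```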

22 Sampling Conducted Correctly? TIMSS and PIRLS goals for sampling participation Participation rates for schools, classes, and students –100%!!! Sampling precision goals –Percentages ± 5% –Means ± 0.1 S.D. Usually 150 schools and one or two classes per school (Approx. 4,000 students)
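
A back-of-the-envelope sketch of how such precision goals translate into the sample sizes quoted above (the cluster size m = 27 and intraclass correlation ρ = 0.25 are purely illustrative values, not TIMSS parameters):

$$ n_{\text{eff}} \approx \left(\frac{1.96}{0.1}\right)^{2} \approx 384, \qquad \text{deff} = 1 + (m-1)\,\rho \approx 1 + 26 \times 0.25 = 7.5, $$
$$ n \approx n_{\text{eff}} \times \text{deff} \approx 384 \times 7.5 \approx 2{,}900 \text{ students}. $$

With one class of roughly 27 students per school, that corresponds to on the order of 110 or more sampled schools, consistent with the 150 schools and approximately 4,000 students cited on the slide.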

23 Sampling Conducted Correctly? Procedures acceptable and fully documented? –Reviewed by Statistics Canada and Sampling Referee Acceptable participation rates? –At least 85% schools, 95% classes, 85% students –Generally yes, for most countries –Others annotated in International Reports, or below a line Population coverage and participation rates published in International and Technical Reports
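
Stated concretely, a weighted participation rate of the kind checked here is simply the weighted share of sampled units that actually took part; the sketch below is illustrative only and uses hypothetical names.

```python
def weighted_participation_rate(units):
    """Illustrative weighted participation rate for schools, classes, or students.

    units: list of (weight, participated) pairs for all sampled units.
    """
    sampled = sum(w for w, _ in units)
    took_part = sum(w for w, p in units if p)
    return 100.0 * took_part / sampled

rate = weighted_participation_rate([(120.0, True), (80.0, True), (100.0, False)])
print(rate, rate >= 85)   # rates below the threshold lead to annotation or "below the line"
```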

24 Translations Comparable? Has country correctly translated all assessment items? –IEA provides guidelines and instructions –IEA Secretariat verifies each translation –Issues referred to National Research Coordinator for resolution Do the test booklets conform to international layout? –TIMSS & PIRLS ISC verifies final layout before printing Countries check final printed booklets

25 Tests Administered Correctly? Data collection is a national responsibility. TIMSS & PIRLS ISC develops manuals describing standardized procedures – School Coordinator Manual, Test Administrator Manual.

26 Tests Administered Correctly? How do we verify that data collection procedures have been followed? IEA Secretariat and TIMSS & PIRLS ISC conduct program of international quality control monitoring –IEA Secretariat recruits Quality Control Monitor (QCM) in each country –TIMSS & PIRLS ISC conducts training sessions for QCMs –The QCM visits a sample of 15 schools at each grade; records observations and interviews school coordinator and test administrator

27 Tests Administered Correctly? TIMSS & PIRLS ISC analyzes and reports the results in the Technical Report –Generally, QCM reports are very positive –Data collected according to procedures specified in manuals, with very few exceptions Country also conducts quality control observations at 15 schools NRCs complete online Survey Activities Report

28 Constructed-response Item Scoring Done Correctly? About 50% of TIMSS and PIRLS items are in constructed response format Each constructed-response item has its own tailored scoring guide Scoring training materials prepared for each constructed-response item –Scoring guide –Anchor or exemplar papers –Practice papers

29 Constructed-response Item Scoring Done Correctly? Scoring training conducted separately for Southern Hemisphere and Northern Hemisphere countries Training materials updated based on field test experience –Scoring guides refined –Enhanced sets of exemplar responses and practice papers

30 Constructed-response Item Scoring Done Correctly? How do we know the scoring was done well? Monitor reliability through double scoring: –Within-country for the current assessment: 200 responses per item –Within-country across trend assessments: 200 responses per item scanned from the previous assessment and delivered electronically for rescoring with the current assessment –Across countries for the current assessment: 200 responses per item from English-speaking countries delivered electronically

31 Constructed-response Item Scoring Done Correctly? What happens if an item is not scored reliably? Vast majority of items have high scoring reliability Items with less than 70% agreement for within-country or trend reliability are removed from scaling –Extremely rare Scoring reliability results for all countries documented in technical reports
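
A minimal sketch of the agreement check described on these slides, assuming the two independent scores for each double-scored response are available side by side (the data layout, names, and item IDs below are hypothetical):

```python
def exact_agreement(score_pairs):
    """Percent of double-scored responses where both scorers gave the same code."""
    agree = sum(1 for a, b in score_pairs if a == b)
    return 100.0 * agree / len(score_pairs)

def flag_unreliable_items(double_scores, threshold=70.0):
    """double_scores: {item_id: [(score_1, score_2), ...]}  (about 200 pairs per item).
    Returns the items whose agreement falls below the threshold and would be
    removed from scaling."""
    return [item for item, pairs in double_scores.items()
            if exact_agreement(pairs) < threshold]

# Example: item_01 has 3/4 agreement (75%), item_02 has 1/2 (50%)
print(flag_unreliable_items({"item_01": [(2, 2), (1, 1), (0, 1), (2, 2)],
                             "item_02": [(1, 0), (1, 1)]}))
```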

32 Are the Data Comparable? IEA DPC provides data entry software and variable codebooks to standardize data preparation –Software provides data checking and validation tools applied by the countries IEA DPC provides extensive training seminars IEA DPC checks each country’s data files for internal consistency and accuracy IEA DPC interacts with countries to resolve data issues

33 Are the Data Comparable? IEA DPC creates the database and sends it to the TIMSS & PIRLS ISC and Statistics Canada for analysis and reporting. Statistics Canada computes sampling weights based on the data and sampling documentation –Compares the estimated population size using weights against the estimate from the sampling frame –Interacts with countries to resolve issues. Statistics Canada creates final sampling weights, including adjustments for non-response, for analysis and reporting.
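
The slides do not give the weighting formulas. A common construction for designs of this general kind, shown purely as an illustration with hypothetical names, is a base weight equal to the inverse of the selection probability multiplied by a non-response adjustment that redistributes the weight of non-participating schools over participating ones:

```python
def school_weights(schools):
    """Illustrative sampling weights with a simple non-response adjustment.

    schools: list of dicts with
      'p_selection' : probability the school was selected, and
      'participated': whether it actually took part.
    The adjustment is computed here within a single adjustment group.
    """
    base = [1.0 / s["p_selection"] for s in schools]           # inverse-probability base weight
    total = sum(base)
    responding = sum(w for w, s in zip(base, schools) if s["participated"])
    adj = total / responding                                    # non-response adjustment factor
    return [w * adj if s["participated"] else 0.0
            for w, s in zip(base, schools)]

# Example: three sampled schools, one refusal; the total weight (19.0) is preserved
print(school_weights([{"p_selection": 0.10, "participated": True},
                      {"p_selection": 0.25, "participated": False},
                      {"p_selection": 0.20, "participated": True}]))
```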

34 Are the Data Comparable? Initial review of item statistics, before scaling. TIMSS & PIRLS ISC reviews achievement item statistics – every item for every country. Investigates items for poor discrimination or unreliable scoring – sometimes caused by a translation or printing error. Rare, but such “faulty” items are not included in scaling achievement results for that country.

35 Are the Data Comparable? Review of item-by-country interactions For each item, examine each country’s performance on the item in light of its overall performance –Outliers may be due to translation error, printing error, etc. For trend, compare item-by-country interaction patterns for current and previous assessments –If different, may delete that item for that country for trend
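
One simple way to make an item-by-country interaction concrete (a sketch assuming percent-correct statistics by item and country are available; this is not the operational procedure) is to remove the country and item main effects and inspect the residuals:

```python
import numpy as np

def item_country_interactions(pct_correct):
    """pct_correct: 2-D array (countries x items) of percent-correct values.
    Returns residuals after removing country and item main effects;
    large residuals point to possible translation or printing problems."""
    p = np.asarray(pct_correct, dtype=float)
    grand = p.mean()
    country_effect = p.mean(axis=1, keepdims=True) - grand
    item_effect = p.mean(axis=0, keepdims=True) - grand
    return p - grand - country_effect - item_effect   # interaction residuals

residuals = item_country_interactions([[80, 60, 70],
                                       [75, 55, 30],    # item 3 unexpectedly hard here
                                       [85, 65, 75]])
print(np.round(residuals, 1))
```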

36 Are the Scaled Achievement Results Comparable? Use IRT scaling to summarize achievement data by modeling item difficulty and discrimination – one scale for all countries Scaling procedure fits a model to each item –The better the fit, the more accurate the results Check fitted model against observed data for each item –Typically, any issues were discovered during initial review
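
The slides do not reproduce the scaling model. For orientation, a two-parameter logistic item response model of the general family used for dichotomous items has the form (TIMSS and PIRLS also use three-parameter and partial-credit forms; this is only the simplest case):

$$ P(x_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\!\left[-a_j\,(\theta_i - b_j)\right]} $$

where $\theta_i$ is student $i$'s proficiency, $b_j$ the difficulty of item $j$, and $a_j$ its discrimination; fitting the model to the pooled international data places all countries on one common scale.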

39 Are the Scaled Achievement Results Comparable? For trend items, data from the current assessment and the previous assessment are scaled together. Item fit is plotted separately to ensure that the item is a good fit for both sets of assessment data.

41 Are the Scaled Achievement Results Comparable? Now that we have item parameters – difficulty and discrimination – we can place students on the scale, i.e., produce student achievement scores (plausible values). Done separately for each country. Done separately for each achievement scale –Reading, mathematics, science –TIMSS content and cognitive domains and PIRLS purposes and processes. Each achievement distribution for each country is checked separately.
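
A heavily simplified sketch of the plausible-values idea (the operational procedure conditions on background variables and uses the full IRT likelihood; everything below, including the normal approximation and the number of draws, is illustrative only):

```python
import numpy as np

def draw_plausible_values(posterior_means, posterior_sds, n_draws=5, seed=0):
    """Draw several 'plausible values' per student from an (assumed normal)
    posterior proficiency distribution, instead of reporting a single point score."""
    rng = np.random.default_rng(seed)
    means = np.asarray(posterior_means, dtype=float)
    sds = np.asarray(posterior_sds, dtype=float)
    # one row per student, n_draws columns of random proficiency values
    return rng.normal(means[:, None], sds[:, None], size=(len(means), n_draws))

pv = draw_plausible_values([480.0, 510.0, 550.0], [35.0, 30.0, 40.0])
print(pv.round(1))   # analyses are run on each set of draws and the results combined
```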

42 Are the Scaled Achievement Results Comparable? Scaling generally is very successful For most TIMSS and PIRLS countries, achievement score distributions are very satisfactory and provide an excellent basis for analysis and reporting Plots provide a good quality control check

46 Are Achievement Results in the TIMSS and PIRLS International Reports Comparable? All reported statistics accompanied by standard errors Tests of statistical significance performed for many differences –Between countries, across countries Annotations for countries not fully meeting sampling standards Achievement results presented in context
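
The slide does not say how the standard errors are computed; complex cluster samples like these are typically handled with a replication method such as jackknife repeated replication, in which the statistic is recomputed with one school per sampling zone dropped and its paired school's weight doubled. The sketch below illustrates that idea with a hypothetical data layout:

```python
import numpy as np

def jackknife_se(values, weights, zones, flags):
    """Illustrative jackknife repeated replication (JRR) standard error of a weighted mean.

    values, weights: per-student data; zones: sampling-zone id per student;
    flags: 0/1 per student marking which of the zone's two schools it belongs to.
    Each replicate drops one school in a zone and doubles the weight of the other.
    """
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    zones, flags = np.asarray(zones), np.asarray(flags)
    full = np.average(values, weights=weights)
    sq_diffs = []
    for z in np.unique(zones):
        rep_w = weights.copy()
        in_zone = zones == z
        rep_w[in_zone & (flags == 0)] = 0.0        # drop one school in the zone...
        rep_w[in_zone & (flags == 1)] *= 2.0       # ...and double the weight of the other
        sq_diffs.append((np.average(values, weights=rep_w) - full) ** 2)
    return np.sqrt(np.sum(sq_diffs))

se = jackknife_se(values=[500, 520, 480, 510, 530, 470],
                  weights=[1.0] * 6,
                  zones=[1, 1, 1, 2, 2, 2],
                  flags=[0, 0, 1, 0, 1, 1])
print(round(se, 1))
```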

47 Why Do We Go to All This Trouble? Provide evidence of the comparative validity of the TIMSS and PIRLS achievement data The data can be trusted for important decision making based on comparisons among countries TIMSS and PIRLS can form the basis for evidence-based policy making

Thank You! Спасибо! Ina V.S. Mullis, Michael O. Martin, & Pierre Foy Russian Education in the Mirror of the International Comparative Studies June 19, 2013