Measuring Success in English for Young People Annabelle G. Simpson Director, Channel Management, ETS Global Division.


2 Measuring Success in English for Young People Annabelle G. Simpson Director, Channel Management, ETS Global Division

3 Outline Who is ETS? Two Families of Products: TOEFL® & TOEIC® How does ETS develop quality tests? What is TOEIC® Bridge? What is TOEFL® Junior?

4 ETS: Our Mission To Advance Quality and Equity in Education for All People Worldwide We do this by providing: Fair, valid and reliable assessments Education research Products and services that measure knowledge and skills, promote learning and educational performance and support education and professional development

5 Two Families of English Assessments: TOEFL ® & TOEIC ® TOEFL family: TOEFL iBT, TOEFL ITP, TOEFL Junior. TOEIC family: TOEIC L&R, TOEIC S&W, TOEIC Bridge. Coming soon…

6 The Origins of ETS Work with Young People English proficiency is an increasingly important skill for students and young adults worldwide - Expanding access to educational, personal and professional opportunities EFL instruction is beginning at earlier ages English-medium instructional environments take many forms internationally: - Public and private schools in English-dominant countries - International schools in non English-dominant countries - Schools in any country using bilingual or CLIL approaches - Vocational schools Responds to aspirations of students as they attain English-language proficiency


8 How ETS Develops Quality Tests

9 Overview Before discussing how ETS develops quality tests, I will discuss what we mean by quality in testing. Then I will discuss the major steps in test development that are required to create a high quality test.

10 What Is a Quality Test? A quality test must be Reliable Valid Fair Practical

11 Reliable A test is only a sample. The items are a sample of all the items that could be asked. The time of testing is a sample of all the times that the test could be given. The person scoring the essay is a sample of all possible scorers.

12 Reliability Is Consistency If test takers' knowledge is constant, how consistent would scores be if the samples changed: if parallel items were used, if the test were taken on a different day, or if different judges scored the essays? The higher the reliability, the more consistent the scores will be.

13 Factors That Determine Reliability All other things being equal: the more independently scored items, the higher the reliability; the more the items correlate with each other, the higher the reliability; the greater the variability of scores, the higher the reliability.
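The first two factors above can be illustrated with two textbook formulas. The slides do not say which reliability coefficient ETS computes, so this is only a minimal sketch assuming Cronbach's alpha as the coefficient and the Spearman-Brown formula for the effect of test length. The function names and the small response matrix are illustrative, not ETS data.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores.

    scores: one row per test taker, one column per item
    (illustrative input format, not an ETS data layout).
    """
    k = len(scores[0])
    # Variance of each item across test takers.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total scores.
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Predicted reliability if the test is lengthened by length_factor
    with parallel items (Spearman-Brown prophecy formula)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Made-up 0/1 responses: 5 test takers, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
doubled = spearman_brown(alpha, 2)
print(alpha, doubled)
```

For this made-up matrix alpha works out to about 0.8, and doubling the test length with parallel items raises the predicted reliability, which matches the first factor on the slide.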

14 Validity Most important indicator of test quality Extent to which inferences based on test scores are appropriate & supported by evidence Requires evidence to support the use of the test for the intended purpose

15 Evidence of Validity Qualifications of test designers Process used to develop test Qualifications of item writers and reviewers Statistical indicators of item quality and fairness Expert judgments of test content

16 Evidence of Validity Match of items to content standards Relations among parts of the test Relations of scores with other variables Results fit with theories Claims for use of test are met Good consequences

17 Fairness = Validity for All Fairness is an aspect of validity. Tests that show valid differences across groups are fair. Tests that cause invalid differences across groups are not fair.

18 Practicality Tests must be affordable in dollar costs and in time used. Scores must be understandable & helpful to score-users. Items must be acceptable to diverse constituencies. Every test is a compromise among competing demands.

19 Major Steps in Test Development 1) Make Initial Plan for Test 2) Involve External Experts 3) Write/Review Items 4) Pretest Items (Whenever Possible) 5) Review Data & Revise Items 6) Assemble Final Test

20 Major Steps (continued) 7) Administer Tests 8) Checks Before Scoring 9) Scaling & Equating 10) Test Analyses 11) Report Scores 12) Begin Planning for Next Form

21 1) Plan Test Purpose What is test used for? What decisions made on the basis of the scores? Population What are characteristics of test takers? Construct Content & skills

22 Plan Test What constraints on test design? Time, cost, format, scoring, etc. Initial plan for test development work Major tasks, schedule, staff Evidence-Centered Design What claims about test takers? What evidence supports claims? What tasks provide evidence?

23 2) Involve External Experts Diverse (demographic, geographic, institutional, point of view) external contributors required in test design, item writing and reviewing. Diverse experts help establish acceptability, validity and fairness.

24 Tasks of External Experts Set/approve test specifications What content to measure? What skills to measure? What statistical properties? Write and review test items Select items for final form

25 3) Write/Review items Make item-writing assignments Write items to meet specifications Write overage for attrition Internal & external reviews & revisions At least 2 independent content reviews per item Separate editorial review Separate fairness review

26 3) Write/Review items Question (Item) workflow: Author, Artwork/graphics, Content Reviewer 1, Content Reviewer 2, Content Reviewer 3, Edit, Fairness, Resolver, Studio recording, Lock

27 4) Pretest When possible, try out items before operational use. Gives information to: Identify problem items (ambiguous, wrong difficulty, poor discrimination; for multiple-choice items: no key, multiple keys, bad distracter) Pick most appropriate items to meet specifications Estimate final form characteristics from item data
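Difficulty and discrimination, as named on the pretest slide, are commonly summarized per item. The sketch below assumes classical item statistics: difficulty as the proportion of correct answers, and discrimination as the correlation between the item score and the score on the rest of the test. The slides do not specify ETS's exact statistics, and the function names and data are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def item_statistics(responses):
    """responses: one row per test taker, one 0/1 column per item."""
    stats = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        # Score on the remaining items, so the item is not correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        stats.append({
            "difficulty": sum(item) / len(item),      # proportion correct
            "discrimination": pearson(item, rest),    # item-rest correlation
        })
    return stats


# Made-up pretest data: 5 test takers, 4 items.
pretest = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for number, s in enumerate(item_statistics(pretest), start=1):
    print(number, s)
```

An item with a discrimination near zero or below zero is the kind of item a reviewer would look at for a missing or multiple key.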



30 Use Differential Item Functioning (DIF) DIF = statistical measure of how matched people in different groups perform on an item. DIF helps spot items that may be unfair. DIF is NOT proof of bias.

31 Uses of DIF If data available, tests assembled with low DIF items. If no data at assembly, DIF calculated after administration. High DIF items reviewed and removed before test is scored, if judged unfair. External people involved in reviews.
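One widely used DIF statistic is the Mantel-Haenszel procedure, which compares matched test takers from a reference group and a focal group. The slides do not name the statistic, so the sketch below assumes Mantel-Haenszel and the delta-scale transformation MH D-DIF = -2.35 ln(alpha). The stratum counts are invented for illustration.

```python
from math import log


def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio and MH D-DIF for one item.

    strata: one tuple per matched ability level (for example, per total
    score band), in the form
    (reference_correct, reference_incorrect, focal_correct, focal_incorrect).
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = numerator / denominator
    # Negative values mean the item is harder for the focal group
    # than for matched members of the reference group.
    return alpha, -2.35 * log(alpha)


# Invented counts: matched groups perform the same, so no DIF is expected.
no_dif = [(30, 10, 15, 5), (20, 20, 10, 10)]
# Invented counts: the reference group has higher odds of a correct answer.
some_dif = [(30, 10, 10, 10)]

print(mantel_haenszel_dif(no_dif))
print(mantel_haenszel_dif(some_dif))
```

A large statistic only flags the item for review. As the slides note, DIF is not proof of bias, and the decision to drop an item rests on human judgment.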

32 5) Review Data & Revise Items Review test items based on data Ensure accuracy, clarity Appropriate difficulty Acceptable discrimination Revise or drop problem items Write new items if necessary to meet specifications

33 6) Assemble Final Test Choose set of items from pool according to specifications Perform test reviews Meet content, skill, & statistical specifications Check for overlap, cueing of keys Correctness of keys

34 7) Test Administration Print or format for computer Quality control checks Ship securely Administer test Acceptable conditions (space, comfort, light, temperature) Security (copying, impersonation, prior knowledge)

35 8) Checks Before Scoring Investigate complaints & reports Preliminary Item Analysis (PIA) Identify problem items based on statistics (too hard, too easy, poor discrimination, change from pretest) Review items to decide if keep in test or drop before scoring DIF, if not done previously

36 Checks Before Scoring Check for anomalies (sudden drops or increases in scores) that may indicate problems

37 9) Scaling & Equating Raw scores are number right or percent right on a particular test form. 50% right on a hard test form may take more knowledge & skill than 60% right on an easy test form. Raw scores mean different things on different test forms. ETS very rarely reports raw scores

38 Scaling & Equating A scale is an arbitrary range of numbers used to report scores, e.g., 200-800 for SAT, 150-190 for PPST. Equating is a statistical adjustment for differences in the difficulty of different forms of the same test. Equating allows us to treat the scores on different forms of a test as though they meant the same thing.

39 Scaling & Equating If a form happens to be a little harder than the others, it will take fewer raw score points to reach a particular scale score point. If a form happens to be a little easier than the others, it will take more raw score points to reach a particular scale score point. Scaled scores, after equating, mean the same on each form
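The adjustment described above can be sketched with linear equating, one textbook method. The slides do not say which equating method ETS uses, so this example assumes linear equating with equivalent groups taking each form; the means and standard deviations are invented.

```python
def linear_equate(raw, new_mean, new_sd, ref_mean, ref_sd):
    """Place a raw score from a new form on the raw-score scale of a
    reference form by matching means and standard deviations.

    Assumes the groups taking the two forms are equivalent in ability.
    """
    return ref_mean + (ref_sd / new_sd) * (raw - new_mean)


# Invented statistics: the new form is harder (lower mean) than the reference form.
NEW_MEAN, NEW_SD = 28.0, 8.0
REF_MEAN, REF_SD = 30.0, 8.0

# A raw score of 28 on the harder form corresponds to 30 on the reference form.
equated = linear_equate(28, NEW_MEAN, NEW_SD, REF_MEAN, REF_SD)
print(equated)
```

Because the new form is harder in this made-up case, fewer raw points on it map to the same reference-form score, which is the behavior the slide describes. The equated raw score would then be converted to the reporting scale.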

40 10) Test Analyses Analysis of final form characteristics. Distribution of item difficulty & discrimination Reliability Speededness Did test meet content & statistical specifications? If not, where were problems?

41 11) Report Scores Explain what scores mean so scores are understandable to test users Indicate Standard Error of Measurement on score report
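In classical test theory the Standard Error of Measurement follows from the score standard deviation and the reliability: SEM = SD × sqrt(1 − reliability). A minimal sketch, with invented values for the standard deviation, reliability and reported score:

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1 - reliability)


def score_band(score, sem):
    """Range of one SEM on either side of the reported score."""
    return score - sem, score + sem


# Invented values for illustration only.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
low, high = score_band(250, sem)
print(sem, low, high)
```

Reporting the band alongside the score reminds score users that a small difference between two scores may not reflect a real difference in proficiency.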

42 12) Plan Next Form What was learned from this administration to make the next administration of the test better? What has to change for next form?


44 About TOEFL ® Junior

45 A TOEFL ® product for a Younger Generation A distinct product within the growing TOEFL ® family of products A natural extension of the TOEFL brand, but specifically geared to the language learning needs of middle grade students - Informed by reviews of research and relevant standards - Based on years of experience developing international assessments of English language proficiency for both adults and K-12 students Meets ETS Standards for Quality and Fairness Builds upon ETS's expertise in English language assessment for young learners. TOEFL ® products set the standard for English proficiency worldwide

46 The Paper-Based Test Is Designed to Provide Useful Information Purpose is to assess the degree to which students aged 11-15 have attained language proficiency representative of middle school English-medium instruction

47 TOEFL Junior Structure Format: Paper Three Sections: Listening Reading Language Form and Meaning

48 TOEFL Junior Structure Listening Comprehension: This section tests how well students understand spoken English. Number of Questions: 42 Section administered by CD. Students are asked to answer questions based on a variety of statements, questions, conversations and talks recorded in English. Total time: approximately 35–40 minutes. Question Types Classroom Instruction Short Conversations Academic Listening

49 Sample Listening Item (Narrator): Listen to a high school principal talking to the school's students. (Man): I have a very special announcement to make. This year, not just one, but three of our students will be receiving national awards for their academic achievements. Krista Conner, Martin Chan, and Shriya Patel have all been chosen for their hard work and consistently high marks. It is very unusual for one school to have so many students receive this award in a single year. (Narrator): What is the subject of the announcement? What is the subject of the announcement? (A) The school will be adding new classes. (B) Three new teachers will be working at the school. (C) Some students have received an award. (D) The school is getting its own newspaper.

50 TOEFL Junior PBT Structure Reading Comprehension: - This section tests how well students read and comprehend written English. Students read a variety of materials. - Number of Questions: 42 questions. - Total time: 50 minutes. Question Types - Non-academic - Academic

51 Sample Reading Item Questions are about the following announcement. What time will the festival begin? (A) 10 A.M. (B) 11 A.M. (C) 1 P.M. (D) 2 P.M.

52 TOEFL Junior PBT Structure Language Form and Meaning: – This section assesses key language skills such as grammar and vocabulary in context. – The section includes 42 questions. – Total time: approximately 25 minutes. Question Types: – Language Meaning – Language Form

53 Sample Language Form and Meaning Item Questions - refer to the following e-mail.

54 Score Report Section scores for Listening, Language Form and Meaning, and Reading. Section scale scores: Listening Comprehension 200-300; Language Form & Meaning 200-300; Reading Comprehension 200-300; Total Score 600-900. The TOEFL Junior score report provides a description of the English-language abilities typical of test takers scoring around a particular scaled score level. There are four possible descriptions for each section of the test. Link to the Common European Framework of Reference. Lexile measure.

55 Listening Descriptions Test takers who score between 210 and 245 may have the following strengths: They can understand the main idea of a brief classroom announcement if it is explicitly stated. They can understand important details that are explicitly stated and reinforced in short talks and conversations. They can understand direct paraphrases of spoken information when the language is simple and the context is clear. They can understand a speaker's purpose in a short talk when the language is simple and the context is clear.

56 Common European Framework of Reference for Languages (CEFR) Important Note: CEFR levels are context-dependent. A B2 for middle school is not the same as a B2 for adults.

57 Appropriate Use of the TOEFL ® Junior Test Appropriate for low- to medium-stakes decisions Provides a general standard to measure the proficiency levels of students aged 11-15, representative of English-medium instructional environments Serves as one piece of information supporting placement into programs designed to increase proficiency levels of these EFL students Provides information about student progress in developing English language proficiency over time

58 The TOEFL ® Junior Test is NOT… …based on any specific curriculum …directly linked to TOEFL iBT scores …intended to predict performance on the TOEFL iBT test …for use to support high-stakes decisions such as for admissions purposes or criterion-based exit testing …a substitute for TOEFL iBT, TOEFL pBT or TOEFL ITP

59 Participating Countries Latin America: Brazil, Chile. Asia: China, Indonesia, Japan, Korea, Vietnam. Europe: Bulgaria, France, Greece, Italy, Poland, Turkey. Middle East: Egypt, Gaza/West Bank, Lebanon, Morocco.


61 The TOEIC Bridge Test

62 What is the TOEIC Bridge Test? A test to measure the emerging competencies of beginning learners of English A tool to help language learners focus on areas for improvement

63 Why use the TOEIC Bridge Test? To measure beginner English proficiency To motivate English Language Learners To set language learning goals

64 How is the TOEIC Bridge Test different from the TOEIC ® Listening and Reading Test? The TOEIC Bridge test takes only one hour; the TOEIC ® test takes two hours. There are 100 questions in the TOEIC Bridge test, 200 in the TOEIC ® test. The TOEIC Bridge test has only five parts; the TOEIC ® test has seven parts. There is more time between questions in the TOEIC Bridge test. In the TOEIC Bridge test, the speakers speak more slowly.

65 Differences (Continued) TOEIC Bridge test questions are easier. TOEIC Bridge test questions cover more general topics. The scaled score range on the TOEIC Bridge is from 20 to 180; on the TOEIC ® test, scores are on a scale of 10 to 990. The TOEIC Bridge test is a low-stakes test; the TOEIC ® test is a high-stakes test.

66 Test Format Two sections: Section I: Listening Comprehension (tape mediated) – Candidates listen to a variety of statements, questions, short conversations, and short talks, and answer 50 questions. Three Parts: Photo-based (15 questions) Question-Answer (20 questions) Conversations and Short Talks Section II: Reading Comprehension – Candidates read single sentences as well as texts and answer 50 comprehension questions. Two Parts: Incomplete sentences (30 questions) Reading Comprehension (20 questions)

67 TOEIC Bridge Content Areas Animals Basic objects Clothing Dates/days/time Entertainment Family members Food/dining out Games Health Housing/residence Measurement Money Months Music Numbers Recreation/hobbies School subjects Shopping Sports Travel/transportation Weather Work

68 Scoring Total scores range from 20 - 180 Listening and Reading subscores range from 10 – 90 Test administration time is approximately 1.5 hours Test scoring – (under operational conditions) 24-48 hours in most locations

69 The scores are based on the number of correct responses. The correct responses in each section (Listening and Reading) are converted to a score scale. The range of the scale is from 10 – 90 for each section. Summing the scores of the sections produces a total scaled score. The range of the total score is then 20 – 180.
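The scoring flow described above (count correct responses per section, convert each count to the 10-90 scale, then sum) can be sketched as follows. The actual conversion tables are not given in the slides and depend on the equated form, so `hypothetical_conversion` is a placeholder invented purely for illustration; only the 10-90 section range, the 50 questions per section, and the 20-180 total range come from the slides.

```python
QUESTIONS_PER_SECTION = 50
SECTION_MIN, SECTION_MAX = 10, 90

# HYPOTHETICAL placeholder table mapping number correct (0-50) to a scaled
# score (10-90). A real table would come from equating each test form.
hypothetical_conversion = {
    raw: SECTION_MIN + (raw * (SECTION_MAX - SECTION_MIN)) // QUESTIONS_PER_SECTION
    for raw in range(QUESTIONS_PER_SECTION + 1)
}


def scale_section(raw_correct, conversion):
    """Convert a section's number of correct responses to its scaled score."""
    return conversion[raw_correct]


def total_scaled(listening_scaled, reading_scaled):
    """Sum the two section scores after checking the 10-90 range."""
    for score in (listening_scaled, reading_scaled):
        if not SECTION_MIN <= score <= SECTION_MAX:
            raise ValueError(f"section score out of range: {score}")
    return listening_scaled + reading_scaled


listening = scale_section(40, hypothetical_conversion)
reading = scale_section(35, hypothetical_conversion)
print(listening, reading, total_scaled(listening, reading))
```

Keeping the conversion table as a separate argument reflects the point made in the equating slides: the same number of correct answers can map to different scaled scores on different forms.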

70 CEFR Ratings The TOEIC Bridge test ranges from the A1 level to the B1 level.

71 For Sample Test Questions for TOEFL Junior and TOEIC Bridge:

72 Thank you.
