Arizona English Language Learner Assessment (AZELLA)

Presentation transcript:

Automated Scoring for Speaking Assessments
Arizona English Language Learner Assessment
Irene Hunting, Arizona Department of Education
Yuan D'Antilio, Pearson
Erica Baltierra, Pearson
June 24, 2015

Arizona English Language Learner Assessment (AZELLA)
AZELLA is Arizona's own English Language Proficiency Assessment and has been in use since school year 2006-2007. Arizona revised its English Language Proficiency (ELP) Standards following the adoption of the Arizona College and Career Ready Standards in 2010, so AZELLA had to be revised to align with the new ELP Standards. Arizona revised not only the alignment of AZELLA but also its administration practices and procedures. Revisions to the Speaking portion of the AZELLA are particularly notable.

AZELLA Speaking Test Administration
Prior to School Year 2012-2013:
Administered orally by the test administrator
One-on-one administration
Scored by the test administrator
Immediate scores
Training for test administrators: minimal, not required

AZELLA Speaking Test Concerns
Prior to School Year 2012-2013:
Inconsistent test administration: not able to standardize test delivery
Inconsistent scoring: not able to replicate or verify scoring

AZELLA Speaking Test Goals
For School Year 2012-2013 and beyond:
Consistent test administration: every student has the same testing experience
Consistent and quick scoring: record student responses; reliability statistics for scoring
Minimal burden for schools: no special equipment, no special personnel requirements or training, similar amount of time to administer

AZELLA Speaking Test Administration
For School Year 2012-2013 and beyond:
Consistent test administration: administered one-on-one via speaker telephone
Consistent and quick scoring: student responses are recorded; reliable machine scoring
Minimal burden for schools: requires a landline speaker telephone; no special personnel requirements or training; slightly longer test administration time

Proposed Solution
----- Meeting Notes (6/16/15 15:03) -----
In order to provide a consistent test administration experience for all ELL students and consistent scoring for all speaking tests, Pearson worked with the Department to implement a telephone-based speaking assessment solution. This solution includes automated delivery of the speaking assessment and automated scoring of the test responses. Here is a quick walk-through of the solution. Tests were administered one-on-one to students. The test administrator dialed a toll-free number and entered a test identification number to access the correct test form. The speaking test items were delivered through a speaker phone, and the timing of item presentation was controlled and standardized. Students' oral responses were collected through the phone, and the audio data were transferred back to our database for grading. A machine scoring algorithm then processed the audio responses to produce a score for each of the student's responses.
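To make that walk-through concrete, here is a minimal, hypothetical sketch of the delivery flow in Python. All names (SpeakingItem, PhoneSession, play, record, and so on) are illustrative assumptions, not the actual Pearson or ADE interfaces.

```python
# Hypothetical sketch of the telephone-based delivery flow described above.
# Every name here is an illustrative assumption, not a real Pearson/ADE API.

from dataclasses import dataclass
from typing import Protocol

@dataclass
class SpeakingItem:
    prompt_audio: str      # pre-recorded prompt played to the student
    response_seconds: int  # standardized response window

class PhoneSession(Protocol):
    def verify_test_code(self, code: str) -> None: ...
    def play(self, audio_path: str) -> None: ...
    def play_beep(self) -> None: ...
    def record(self, seconds: int) -> str: ...

def administer_speaking_test(test_code: str,
                             items: list[SpeakingItem],
                             session: PhoneSession) -> list[str]:
    """Deliver items over a speaker-phone session; return recorded-response paths."""
    session.verify_test_code(test_code)        # administrator keys in the Speaking Test Code
    recordings = []
    for item in items:
        session.play(item.prompt_audio)        # item is delivered through the speaker phone
        session.play_beep()                    # beep cues the student to respond
        recordings.append(session.record(item.response_seconds))  # timing is standardized
    return recordings                          # audio is sent to the database for machine scoring
```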

Development of Automated Scoring Method
[Workflow diagram: Test Developers, Test Spec, Item Text, Recorded Items, Human Transcribers, Field testing data, Human raters, Testing System, Automated Scores, Validation]
----- Meeting Notes (6/16/15 15:03) -----
Next we're going to talk about how we developed the automated scoring for AZELLA Speaking and what it takes to set up a solution like this for states.

Why does automated scoring of speaking work?
The acoustic models used for speech recognition are optimized for various accents (young children's speech, foreign accents).
The test questions have been modeled from field test data: the system anticipates the various ways that students respond.

Field Tested Items
The test questions have been modeled from field test data, so the system anticipates the various ways that students respond to an item such as “What is in the picture?”

Language models
Anticipated responses include: "a", "It's protractor", "I don't know", "protractor", "a compass".
The system estimates the probability of each of those possible responses based on field test data. The responses from the field tests were rated by human graders using the rubrics, so we know what score a human grader would assign to each response. We build the scoring algorithm from those responses and human scores, so that the algorithm can perform like a human grader.
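The slide above describes fitting the scoring algorithm to human-rated field-test responses. The sketch below illustrates that idea under simplifying assumptions: it uses scikit-learn and a few made-up recognizer-derived features, and it is not the operational AZELLA engine.

```python
# Illustrative sketch (not the operational AZELLA engine): fit a scoring model on
# human-rated field-test responses so machine scores approximate human ratings.
# The feature values and names are hypothetical stand-ins for recognizer outputs.

import numpy as np
from sklearn.linear_model import Ridge

# Each row: features derived from a recognized response, e.g. match to anticipated
# responses, words per second, pronunciation likelihood (all invented here).
X_train = np.array([
    [0.92, 2.1, -4.2],
    [0.40, 1.0, -7.5],
    [0.75, 1.8, -5.1],
])
y_train = np.array([4, 1, 3])  # scores assigned by human raters using the rubric

model = Ridge(alpha=1.0).fit(X_train, y_train)

new_response_features = np.array([[0.81, 1.9, -4.8]])
machine_score = model.predict(new_response_features)[0]
print(round(machine_score, 2))  # continuous machine score, e.g. 3.35 in the sample slide
```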

Field Testing and Data Preparation
Two field tests: 2011-2012
Number of students: 31,685 (1st-12th grade), 13,141 (Kindergarten)

Stage | Total tests | Used for building models | Used for validation
I | 13,184 | 1,200 | 333
II | 10,646 | | 300
III | 9,369 | |
IV | 6,439 | |
V | 5,231 | |

Item Type for Automated Scoring

Item Type | Score Point | Domain
Syllabification | 0-1 | Oral Reading
Wordlist | 0-1 | Oral Reading
Repeat | 0-6 | Speaking
Questions about an image | 0-4 | Speaking
Similarities and differences | 0-4 | Speaking
Give directions from a map | 0-4 | Speaking
Questions about a statement | 0-4 | Speaking
Give instructions to do something | 0-4 | Speaking
Open questions about a topic | 0-4 | Speaking
Detailed responses to a topic | 0-4 | Speaking

Automated scoring can handle a variety of item types, ranging from constrained types such as the word list to more open, less constrained types such as picture description and giving instructions.

Sample Speaking Rubric: 0-4 Point Item

Points | Descriptors
4 | Student formulates a response in correct, understandable English using two or more sentences based on the given stimuli. Student responds in complete declarative or interrogative sentences. Grammar errors are not evident and do not impede communication. Student responds with clear and correct pronunciation. Student responds using correct syntax.
3 | Student formulates a response in understandable English using two or more sentences based on the given stimuli. Sentences have minor grammatical errors. Student responds with clear and correct pronunciation.
2 | Student formulates an intelligible English response based on the given stimuli. Student does not respond in two complete declarative or interrogative sentences. Student responds with errors in grammar. Student attempts to respond with clear and correct pronunciation.
1 | Student formulates erroneous responses based on the given stimuli. Student does not respond in complete declarative or interrogative sentences. Student responds with significant errors in grammar. Student does not respond with clear and correct pronunciation.

The human rating rubric is a holistic rubric that captures both the content of the speech production (what students say) and the manner of production (how they say it) in terms of pronunciation, fluency, and so on.

Sample student responses
Item: Next, please answer in complete sentences. Tell how to get ready for school in the morning. Include at least two steps.
Response transcript: first you wake up and then you put on your clothes # and eat breakfast
Human score: 3
Machine score: 3.35

Validity evidence: Are machine scores comparable to human scores?
Measures we looked at:
Reliability (internal consistency)
Candidate-level (or test-level) correlations
Item-level correlations
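As an illustration of the measures listed above, here is a small numpy sketch that computes Cronbach's α and a test-level human-machine correlation; the simulated scores are made up and are not AZELLA results.

```python
# Illustrative computation of the comparability measures named above.
# The simulated scores stand in for human and machine item scores; they are not AZELLA data.

import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency; rows are students, columns are items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=(200, 1))               # latent proficiency
human = ability + rng.normal(0, 0.5, size=(200, 12))    # 12 human-scored items
machine = human + rng.normal(0, 0.3, size=human.shape)  # machine scores tracking humans

print("Human Cronbach alpha:  ", round(cronbach_alpha(human), 2))
print("Machine Cronbach alpha:", round(cronbach_alpha(machine), 2))

# Candidate-level (test-level) correlation between human and machine total scores
r = np.corrcoef(human.sum(axis=1), machine.sum(axis=1))[0, 1]
print("Test-level human-machine correlation:", round(r, 2))
```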

Structural reliability

Stage | Human Cronbach α | Machine Cronbach α
I | 0.98 | 0.99
II | |
III | 0.96 | 0.94
IV | 0.95 |
V | |
Average | 0.97 |

Scatterplot by Stage
[Scatterplots shown for Stage II, Stage III, Stage IV, and Stage V]

Item-level performance: by item type

Item Type (Stage II) | Human-human correlation | Machine-human correlation
Questions about an image | 0.87 | 0.86
Give directions from a map | 0.82 | 0.84
Open questions about a topic | 0.75 | 0.72
Give instructions to do something | 0.83 | 0.80
Repeat | 0.95 | 0.85

The human-human correlation gives us a baseline. Machine performance very closely approximates human rater performance. For some item types, when human raters do not agree with each other on scoring an item, machine-human agreement goes down as well.

Item-level performance: by item type

Item Type (Stage IV) | Human-human correlation | Machine-human correlation
Questions about an image | 0.84 |
Give directions from a map | 0.90 |
Open questions about a topic | 0.82 |
Detailed response to a topic | 0.85 | 0.87
Give instructions to do something | |
Repeat | 0.96 | 0.89

In some cases, machine grading outperforms human raters in terms of consistency.
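Agreement tables like the two above can be produced with a simple grouped correlation, as in the hypothetical pandas sketch below; the scores are invented and are not AZELLA field-test data.

```python
# Hypothetical sketch: per-item-type human-human and machine-human correlations.
# The scores below are invented examples, not AZELLA field-test data.

import pandas as pd

scores = pd.DataFrame({
    "item_type": ["Repeat"] * 4 + ["Questions about an image"] * 4,
    "human_1":   [6, 2, 4, 1, 4, 3, 1, 2],
    "human_2":   [6, 3, 4, 1, 4, 2, 1, 3],
    "machine":   [5.7, 2.4, 3.8, 1.2, 3.6, 2.8, 1.4, 2.1],
})

agreement = scores.groupby("item_type").apply(
    lambda g: pd.Series({
        "human_human": g["human_1"].corr(g["human_2"]),
        "machine_human": g["human_1"].corr(g["machine"]),
    })
)
print(agreement)
```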

Summary of Score Comparability
Machine-generated scores are comparable to human ratings in terms of:
Reliability (internal consistency)
Test-level correlations
Item-type-level correlations

Test Administration Preparation
One-on-one practice: student and test administrator
Demonstration video
Landline speaker telephone for one-on-one administration
Student Answer Document with a unique Speaking Test Code

Test Administration

Test Administration: Warm-Up Questions
What is your first and last name?
What is your teacher's name?
How old are you?
Purpose of the warm-up questions:
The student becomes more familiar with the prompting
Sound check for student voice level and equipment
Capture demographic data to resolve future inquiries
Responses are not scored

Challenges

Challenge | Solution
Landline speaker telephone availability | ADE purchased speaker telephones for the first year of administration
Difficulty scoring the young population | Additional warm-up questions; added beeps to prompt the student to respond; adjusted the acceptable audio threshold; rubric update and scoring engine recalibration
Incorrect Speaking Codes | Captured demographics from the warm-up questions; updated the Speaking Code key entry process; documented the test administrator name and time of administration

Summary
Automated delivery and scoring of speaking assessments is a highly reliable solution for large-volume state assessments:
Standardized test delivery
Minimal test set-up and training required
Consistent scoring
Availability of test data for analysis and review

Questions