Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved. Catherine Trapani Educational Testing Service ECOLT: October.

1 e-rater® for TOEFL® Independent and Integrated Tasks
Catherine Trapani, Educational Testing Service
ECOLT: October 2010

2 Overview
Background and Context
About e-rater®
Research protocol for operational use
Initial Results for TOEFL Independent item
Recommendations and Actions
Questions and Discussion

3 Background
Quality of human scoring
– No pretesting because of security concerns
– Human rater agreement is variable (5-point rubric, across 38 prompts sampled in 2008):
    – Exact agreement varied from 57% to 62%
    – Exact plus adjacent agreement varied from 97.5% to 99%
Quantity of human raters
– Frequently administered assessment
– Fluctuating demand, peak volumes
Demand for quicker score turnaround
– Human scoring still desired
– Market wants quicker score reporting
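The two agreement statistics quoted on this slide can be computed as in the minimal sketch below. It assumes "exact" means the two raters gave the same score and "exact plus adjacent" means their scores differ by at most one point. The function name and the sample scores are illustrative, not ETS data.

```python
from typing import Sequence, Tuple


def agreement_rates(rater_a: Sequence[int], rater_b: Sequence[int]) -> Tuple[float, float]:
    """Return (exact, exact_plus_adjacent) agreement as fractions of all responses."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of responses")
    n = len(rater_a)
    # Exact agreement: identical scores.
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Exact plus adjacent agreement: scores within one point of each other.
    within_one = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    return exact, within_one


# Made-up scores on a 5-point rubric, for illustration only.
human1 = [3, 4, 2, 5, 3, 4, 1, 3]
human2 = [3, 3, 2, 5, 4, 4, 3, 3]
exact, exact_plus_adjacent = agreement_rates(human1, human2)
print(f"exact: {exact:.1%}, exact plus adjacent: {exact_plus_adjacent:.1%}")
```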

4 e-rater®: ETS’ Automated Scoring of Essays

5 What is e-rater?
Automatically evaluates essay quality
– Provides holistic scores
    – Predictions of scores from trained human raters
– Emphasizes writing quality over content
– Evaluation of features
– Provides feedback on essays (e.g., “diagnostic” feedback)
– Advisories filter out responses not consistent with good faith submissions

6 Examples of e-rater (micro)Features
Grammar
– Subject-Verb Agreement: the motel are …
– Pronoun Errors: Them are my reasons …

9 Where do these features come from?
A parser assigns a part of speech to every word
e-rater examines adjacent or nearly adjacent pairs of words with expected relationships
Rare or nonexistent word combinations are identified as a possible error and appropriate feedback issued …
At a basic level, the features are what the NLP scientists have successfully created …
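The idea of flagging rare or nonexistent adjacent word pairs can be illustrated with a toy sketch. This is not e-rater's implementation: it counts raw word bigrams in a tiny hand-made reference list instead of using a parser and part-of-speech tags, and the names `build_counts`, `flag_rare_pairs`, and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return all adjacent word pairs in order."""
    return list(zip(tokens, tokens[1:]))


def build_counts(corpus: Iterable[str]) -> Counter:
    """Count adjacent word pairs across a reference corpus of sentences."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(bigrams(sentence.lower().split()))
    return counts


def flag_rare_pairs(sentence: str, counts: Counter, min_count: int = 1) -> List[Tuple[str, str]]:
    """Return pairs seen fewer than min_count times in the reference corpus."""
    return [pair for pair in bigrams(sentence.lower().split()) if counts[pair] < min_count]


# Tiny made-up reference corpus; a real system would need far more text.
reference = [
    "the motel is clean",
    "the motels are clean",
    "these are my reasons",
    "the reasons are clear",
]
counts = build_counts(reference)
print(flag_rare_pairs("the motel are clean", counts))
print(flag_rare_pairs("them are my reasons", counts))
```

With the slide's two example errors, the only pairs flagged are the ones that never occur in the reference sentences ("motel are" and "them are").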

10 New feature?
Correlation of feature with human scores?
Correlation of feature with other, already existing features?
Measurement scientists conduct evaluation
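Both questions on this slide come down to computing correlations. The sketch below assumes a plain Pearson correlation, which the slide does not specify. The feature values and scores are made up for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values for six essays, for illustration only.
human_scores = [2, 3, 3, 4, 4, 5]
candidate_feature = [0.1, 0.3, 0.2, 0.5, 0.6, 0.8]   # proposed new feature
existing_feature = [0.9, 0.7, 0.8, 0.4, 0.5, 0.2]    # a feature already in the model

# A useful new feature would correlate with human scores
# without simply duplicating an existing feature.
print(f"feature vs. human scores:     {pearson(candidate_feature, human_scores):.2f}")
print(f"feature vs. existing feature: {pearson(candidate_feature, existing_feature):.2f}")
```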

11 How Does e-rater Predict the Human Score?
Organization and development: 2 features
Error rates (Grammar, Usage, Mechanics, Style): 4 features
Lexical complexity: 2 features
Prompt-specific vocabulary usage: 2 features
Proposed new feature?
– Detection of indications of “good writing”
    – A “positive feature”
    – Use of Collocations
    – Use of prepositions
Predicted essay score

12 E-rater engine upgrade: an introduction
Annual process for introducing enhancements
Data sets representing all clients
Known baseline performance
Add the proposed NLP features into a development engine in IT
Reproduce model performance results with proposed feature
Take difference between known and proposed performance of existing models

13 E-rater engine upgrade: an introduction (2)
Results must represent
– Improvement in performance, OR
– Increase in English Language construct coverage, with no degradation from current performance

14 Results
In July 2010, e-rater version 10.1 was released with a new feature that detects good use of collocations and prepositions

16 Impact of this new feature: TOEFL Independent Task
Descriptive Feature Name                                    Relative Weight
Organization                                                29
Development                                                 27
Mechanics                                                   10
Usage                                                       8
Grammar                                                     6
Lexical Complexity, Average word length                     6
Positive Writing Indicators (collocations & prepositions)   4
Style                                                       4
Lexical Complexity, Sophistication                          4
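One way to read the relative weights is as each feature's share of a weighted composite. The sketch below makes that assumption explicit: it normalizes the listed weights and forms a weighted average of standardized feature values. The slide does not say how the weights were derived or how a composite is mapped onto the score scale, so the function names and the z-score inputs are hypothetical.

```python
from typing import Dict

# Relative weights exactly as listed on the slide (they sum to 98 as listed).
RELATIVE_WEIGHTS: Dict[str, float] = {
    "organization": 29,
    "development": 27,
    "mechanics": 10,
    "usage": 8,
    "grammar": 6,
    "average_word_length": 6,
    "positive_writing_indicators": 4,
    "style": 4,
    "sophistication": 4,
}


def normalized_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale the weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def weighted_composite(z_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of standardized feature values (assumed inputs)."""
    norm = normalized_weights(weights)
    missing = set(norm) - set(z_scores)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return sum(norm[name] * z_scores[name] for name in norm)


# Hypothetical essay: one standard deviation above average on organization only.
essay = {name: 0.0 for name in RELATIVE_WEIGHTS}
essay["organization"] = 1.0
print(f"composite: {weighted_composite(essay, RELATIVE_WEIGHTS):.3f}")
```

Under this reading, organization and development together account for more than half of the composite, while the new positive writing indicators feature contributes about as much as style.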

17 e-rater Model Types: Prompt-Specific
Each model is trained on responses to a particular prompt
Advantages:
– Tailored to particular prompt characteristics
– High agreement with human raters
– Incorporates content features
Disadvantages:
– Higher demand for training data
– Requires pre-testing of prompts

18 e-rater Model Types: Generic
A single model is trained on responses to a variety of prompts
Advantages:
– Smaller data set required for training
– Scoring standards the same across prompts
– Applicable to prompts that are not pre-tested
Disadvantages:
– No content features
– Differences between particular prompts are not accounted for
– Agreement with human raters is lower

19 Implementation Research for Proposed Use
Purpose is to evaluate expected
– Quality of ratings and reported scores
– Effectiveness of using e-rater operationally
Research questions
– Is e-rater performance comparable to human scores?
– Is there any differential performance for subgroups of concern?
– Is there any significant impact when e-rater scores are used in reported scores?

20 Implementation Research (2)
Construct relevance?
– Consistency of e-rater features with TOEFL scoring rubrics?
Relationship between e-rater and human ratings?
– Overall agreement rates?
– Degradation from human-human agreement rates to human-automated agreement rates?
– Standardized difference in scores between humans and e-rater?
– Subgroup differences (fairness concerns)?
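The standardized difference mentioned above can be sketched as the gap between the two mean scores divided by a pooled standard deviation. The slide does not give the formula, so the pooling used here (square root of the average of the two sample variances) is an assumption, and the scores are made up.

```python
import math
import statistics
from typing import Sequence


def standardized_difference(human: Sequence[float], machine: Sequence[float]) -> float:
    """(mean machine score - mean human score) / pooled standard deviation."""
    mean_diff = statistics.fmean(machine) - statistics.fmean(human)
    # Assumed pooling: average the two sample variances, then take the square root.
    pooled_sd = math.sqrt((statistics.variance(human) + statistics.variance(machine)) / 2)
    if pooled_sd == 0:
        raise ValueError("standardized difference is undefined when both score sets are constant")
    return mean_diff / pooled_sd


# Made-up scores for illustration only.
human_ratings = [2, 3, 3, 4, 4, 5]
erater_ratings = [3, 3, 3, 4, 4, 4]
print(f"standardized difference: {standardized_difference(human_ratings, erater_ratings):.2f}")
```

The same function can be applied within each subgroup (for example by native language) to look for the subgroup differences the slide raises as a fairness concern.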

21 Implementation Research (3)
Impact on reported Writing scores
– Change in reported score under multiple implementation possibilities?
    – More or less conservative approaches
        – Contributory score
        – Confirmatory score
– Differential impact on subgroups?
    – Gender
    – Native Language
    – Native Country
– Association of writing scores with external variables?
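The slide names the contributory and confirmatory approaches without defining them. The sketch below shows one plausible reading, offered purely as an assumption: in a contributory use the automated score is averaged with the human score, while in a confirmatory use the human score stands unless the automated score disagrees by more than a threshold, in which case the response is routed to another human rater. The function names and the threshold value are hypothetical.

```python
from typing import Optional


def contributory_score(human: float, machine: float) -> float:
    """Assumed contributory use: the automated score is averaged into the task score."""
    return (human + machine) / 2


def confirmatory_score(human: float, machine: float, threshold: float = 1.0) -> Optional[float]:
    """Assumed confirmatory use: the automated score only checks the human score.

    Returns the human score when the two agree within the threshold,
    or None to signal that another human rating is needed.
    """
    if abs(human - machine) <= threshold:
        return human
    return None


print(contributory_score(4, 3))
print(confirmatory_score(4, 3))
print(confirmatory_score(5, 2))
```

Under this reading the confirmatory approach is the more conservative one, since the automated score never changes a reported score directly.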

22 Human Scoring Performance
Human scoring is the baseline performance against which e-rater is evaluated.
human1 by human2
prompt  N        human1 mean  human1 sd  human2 mean  human2 sd  std diff  wtd kappa  % exact  % exact plus adj  corr
All     132,347  3.35         0.85       3.35         0.85       0.01      0.69       60       98                0.69
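The table reports a weighted kappa without naming the weighting scheme. The sketch below assumes quadratic weights, a common choice for ordinal score scales. The function name, score range arguments, and sample ratings are illustrative only.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Cohen's kappa with quadratic disagreement weights (assumed weighting)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must score the same non-empty set of responses")
    if max_score <= min_score:
        raise ValueError("the score scale needs at least two points")
    k = max_score - min_score + 1
    n = len(a)
    # Observed counts of each (rater A score, rater B score) combination.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1 - weighted_observed / weighted_expected


# Made-up ratings on a 1 to 5 scale, for illustration only.
rater1 = [3, 4, 2, 5, 3, 4, 1, 3]
rater2 = [3, 3, 2, 5, 4, 4, 3, 3]
print(f"quadratic weighted kappa: {quadratic_weighted_kappa(rater1, rater2, 1, 5):.2f}")
```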

23 E-rater Performance
Human scoring is the baseline against which e-rater is evaluated. Absolute statistical guidelines also exist.
Human-human agreement: 60% exact, 98% exact + adjacent
Human-e-rater agreement: 59% exact, 99% exact + adjacent
No degradation in performance from human-human performance

24 Impact on Writing Scale Scores (0-30 points)
There is no significant impact on the candidate from the implementation of e-rater for the TOEFL Independent task.

25 Correlations with Scaled Scores
Reading Scale Score: human rating 0.54, e-rater (integer) 0.56
Listening Scale Score: human rating 0.55, e-rater (integer) 0.53
Speaking Scale Score: human rating 0.59, e-rater (integer) 0.57
(Read+Listen+Speak) Scale Score: 0.63
Correlations of e-rater scores with TOEFL construct scale scores are on par with the correlations of human ratings with those same scores.

26 Conclusions and Future Work
The research team recommended to the TOEFL Committee of Examiners the operational use of a generic e-rater model for the Independent task.
– Operational use began in July 2009
Subsequently, the research team recommended operational use of a generic model for the Integrated task.
– Operational use will begin shortly

27 QUESTIONS? DISCUSSION?
Thank you!
Contact Information: Cathy Trapani, ctrapani@ets.org, (609) 734-5640

