Introduction to: Automated Essay Scoring (AES)
Anat Ben-Simon
National Institute for Testing & Evaluation
Tbilisi, Georgia, September 2007

2. Merits of AES
Psychometric
- Objectivity & standardization
Logistic
- Saves time & money
- Allows for immediate reporting of scores
Didactic
- Immediate diagnostic feedback

3. AES: How does it work?
- Humans rate a sample of essays
- Computer extracts relevant text features
- Computer generates a model to predict the human scores
- Computer applies the prediction model to score new essays
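A minimal sketch of this train-then-score loop, under simplifying assumptions: the feature values and human scores below are invented, feature extraction is assumed to have happened already, and ordinary least squares stands in for whatever statistical model a particular system actually uses.

```python
import numpy as np

def fit_scoring_model(feature_matrix, human_scores):
    """Fit a linear model that predicts human scores from text features."""
    X = np.column_stack([np.ones(len(feature_matrix)), feature_matrix])  # add intercept
    weights, *_ = np.linalg.lstsq(X, np.asarray(human_scores, dtype=float), rcond=None)
    return weights

def score_essay(weights, feature_vector):
    """Apply the fitted prediction model to a new, unscored essay."""
    return float(weights[0] + np.dot(weights[1:], feature_vector))

# Hypothetical feature vectors (e.g. word count, avg word length, error count)
# for essays already rated by human readers.
train_features = np.array([[250, 4.8, 2],
                           [410, 5.1, 0],
                           [120, 4.2, 5],
                           [330, 4.9, 1]])
train_scores = [3, 5, 2, 4]

w = fit_scoring_model(train_features, train_scores)
print(score_essay(w, [300, 4.9, 1]))  # predicted score for a new essay
```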

4. AES: Model Determination
Feature determination
- Text driven: empirically based quantitative (computational) variables
- Theoretically driven
Weight determination
- Empirically based
- Theoretically based

5. Scoring Dimensions
- Content: relevance, richness of ideas, originality
- Rhetorical structure: organization, coherence, cohesion, paragraphing, focus
- Style: clarity, fluency, accuracy
- Syntax & grammar: complexity, syntactical accuracy, grammatical accuracy
- Vocabulary: richness, register, spelling, accuracy

6. AES: Examples of Text Features
Surface variables
- Essay length
- Average word / sentence length
- Variability of sentence length
- Average word frequency
- Word similarity to prototype essays
- Style errors (e.g., repetitious words, very long sentences)
NLP-based variables
- Number of "discourse" elements
- Word complexity (e.g., ratio of different content words to total number of words)
- Style errors (e.g., passive sentences)
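The surface variables above reduce to simple counting; a rough sketch follows. The tokenisation is deliberately naive and the frequency list is a hypothetical stand-in for a real corpus-based one.

```python
import re
import statistics

def surface_features(essay, word_frequency):
    """Compute a few of the surface variables listed above for one essay."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "essay_length": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": statistics.mean(sentence_lengths),
        "sentence_length_variability": statistics.pstdev(sentence_lengths),
        # Lower average frequency suggests rarer (more "difficult") vocabulary.
        "avg_word_frequency": statistics.mean(word_frequency.get(w, 1) for w in words),
    }

# Hypothetical frequency list: occurrences per million words in a reference corpus.
freq = {"the": 60000, "scoring": 40, "consistent": 25}
print(surface_features("Automated scoring is fast. It is also consistent.", freq))
```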

7. AES: Commercially Available Systems
- Project Essay Grade (PEG)
- Intelligent Essay Assessor (IEA)
- IntelliMetric
- e-rater

8. PEG (Project Essay Grade)
Scoring method
- Uses NLP tools (grammar checkers, part-of-speech taggers) as well as surface variables
- A typical scoring model uses a set of such features
- Features are combined to produce a scoring model through multiple regression
Score dimensions
- Content, Organization, Style, Mechanics, Creativity

9. Intelligent Essay Assessor (IEA)
Scoring method
- Focuses primarily on the evaluation of content
- Based on Latent Semantic Analysis (LSA)
- Based on a well-articulated theory of knowledge acquisition and representation
- Features combined through hierarchical multiple regression
Score dimensions
- Content, Style, Mechanics
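A minimal illustration of the LSA idea behind IEA, not the actual IEA implementation: build a term-by-essay count matrix, reduce it with SVD, and judge a new essay by its similarity to previously scored essays in the reduced space. The matrix, scores, and number of dimensions are toy-sized placeholders.

```python
import numpy as np

def lsa_space(term_doc_matrix, k=2):
    """Reduce a term-by-document matrix to k latent dimensions via SVD."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(term_counts, U_k, s_k):
    """Project a document's term-count vector into the latent space."""
    return term_counts @ U_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-by-essay count matrix (rows = terms, columns = human-scored essays).
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)
human_scores = [5, 2, 4]

U_k, s_k = lsa_space(X, k=2)
scored_vectors = [fold_in(X[:, j], U_k, s_k) for j in range(X.shape[1])]

new_essay = np.array([1, 1, 0, 1], dtype=float)   # term counts of an unscored essay
v = fold_in(new_essay, U_k, s_k)
similarities = [cosine(v, sv) for sv in scored_vectors]
# e.g. borrow the score of the most similar previously scored essay
print(human_scores[int(np.argmax(similarities))])
```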

10. IntelliMetric
Scoring method
- "Brain-based" or "mind-based" model of information processing and understanding
- Appears to draw more on artificial intelligence, neural network, and computational linguistics traditions than on theoretical models of writing
- Uses close to 500 features
Score dimensions
- Content, Creativity, Style, Mechanics, Organization

11. e-rater v2
Scoring method
- Based on natural language processing and statistical methods
- Uses a fixed set of 12 features that reflect good writing
- Features are combined using hierarchical multiple regression
Score dimensions
- Grammar, usage, mechanics, and style
- Organization and development
- Topical analysis (content)
- Word complexity
- Essay length

12. Writing Dimensions and Features in e-rater v2 (2004)
Grammar, usage, mechanics, & style
- 1. Ratio of grammar errors
- 2. Ratio of mechanics errors
- 3. Ratio of usage errors
- 4. Ratio of style errors
Organization & development
- 5. Number of "discourse" units detected in the essay (i.e., background, thesis, main ideas, supporting ideas)
- 6. Average length of each element in words
Topical analysis
- 7. Similarity of the essay's content to other previously scored essays in the top score category
- 8. The score category containing essays whose words are most similar to the target essay
Word complexity
- 9. Word repetition (ratio of different content words)
- 10. Vocabulary difficulty (based on word frequency)
- 11. Average word length
Essay length
- 12. Total number of words
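The word-complexity and essay-length features (9 through 12) are essentially simple counts; a hedged sketch follows. The stop-word set and frequency list are made up for illustration, and the real e-rater computations are more elaborate.

```python
import re

# Hypothetical reference frequencies (occurrences per million words).
WORD_FREQ = {"the": 60000, "argument": 85, "evidence": 120}

def word_complexity_and_length(essay):
    words = re.findall(r"[A-Za-z']+", essay.lower())
    content_words = [w for w in words if w not in {"the", "a", "an", "of", "to", "is"}]
    return {
        # 9. Word repetition: ratio of *different* content words to content-word tokens.
        "type_token_ratio": len(set(content_words)) / len(content_words),
        # 10. Vocabulary difficulty: rarer words have lower reference frequencies.
        "avg_word_frequency": sum(WORD_FREQ.get(w, 1) for w in words) / len(words),
        # 11. Average word length in characters.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # 12. Essay length: total number of words.
        "total_words": len(words),
    }

print(word_complexity_and_length("The argument is supported by the evidence."))
```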

13. Reliability Studies
Studies comparing inter-rater (human-human) agreement to computer-rater agreement:

System | Author, year | Test | Reported agreement (r)
PEG | Petersen & Page, 1997 | GRE (36 prompts) | .83 (6 raters)
PEG | Shermis et al., 2002 | English placement test (1 prompt) | .82 (1 rater), .85 (2 raters)
IntelliMetric | Elliot, 2001 | K-12 norm-referenced test |
IEA | Landauer et al., 1997 | GMAT |
IEA | Foltz et al., 1999 | GMAT |
e-rater | Burstein et al., 1998 | GMAT (13 prompts) |
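Agreement in studies like these is typically reported as a correlation between two sets of scores, sometimes alongside exact and adjacent agreement rates. A small sketch of both computations; the score lists are invented.

```python
import statistics

def pearson_r(x, y):
    """Correlation between two raters' scores for the same essays."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

def agreement_rates(x, y):
    """Share of essays scored identically, and within one point, by two raters."""
    exact = sum(a == b for a, b in zip(x, y)) / len(x)
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(x, y)) / len(x)
    return exact, adjacent

human = [4, 5, 3, 2, 4, 6]     # invented human ratings
machine = [4, 4, 3, 3, 5, 6]   # invented machine ratings of the same essays
print(pearson_r(human, machine), agreement_rates(human, machine))
```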

14. AES: Validity Issues
- To what extent are the text features used by AES programs valid measures of writing skills?
- To what extent is AES inappropriately sensitive to irrelevant features and insensitive to relevant ones?
- Are human grades an optimal criterion?
- Which external criteria should be used for validation?
- What are the wash-back effects (consequential validity)?

15. Weighting Human & Computer Scores
- Automated scoring used only as a quality control (QC) check on human scoring
- Automated scoring and human scoring weighted together
- Human scoring used only as a QC check on automated scoring
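Whichever weighting is chosen, the operational logic is usually similar: compare the two scores and route large disagreements to an additional reader. A minimal sketch of that check; the averaging rule and the 1-point discrepancy threshold are arbitrary examples, not any particular program's policy.

```python
def resolve_scores(human_score, machine_score, max_discrepancy=1):
    """Combine a human and a machine score, flagging large disagreements.

    If the two scores are within the allowed discrepancy, report their average;
    otherwise flag the essay for adjudication by another human rater.
    """
    if abs(human_score - machine_score) <= max_discrepancy:
        return {"score": (human_score + machine_score) / 2, "adjudicate": False}
    return {"score": None, "adjudicate": True}

print(resolve_scores(4, 5))   # small disagreement: average the two scores
print(resolve_scores(2, 5))   # large disagreement: send to another rater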

16. AES: To use or not to use?
- Are the essays written by hand or composed on computer?
- Is there enough volume to make AES cost-effective?
- Will students, teachers, and other key constituencies accept automated scoring?

17. Criticism and Reservations
- Insensitive to some important features relevant to good writing
- Fails to identify and appreciate unique writing styles and creativity
- Susceptible to construct-irrelevant variance
- May encourage writing for the computer as opposed to writing for people

18. How to choose a program?
1. Does the system work in a way you can defend?
2. Is there a credible research base supporting the use of the system for your particular purpose?
3. What are the practical implications of using the system?
4. How will the use of the system affect students, teachers, and other key constituencies?

19. Thank You