Introduction to: Automated Essay Scoring (AES)
Anat Ben-Simon, National Institute for Testing & Evaluation
Tbilisi, Georgia, September 2007
2 Merits of AES
- Psychometric: objectivity & standardization
- Logistical: saves time & money; allows for immediate reporting of scores
- Didactic: immediate diagnostic feedback
3 AES - How does it work?
- Humans rate a sample of essays
- Computer extracts relevant text features
- Computer generates a model to predict human scores
- Computer applies the prediction model to score new essays
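The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's actual model: the two toy features, the helper names (extract_features, solve, fit, predict), and the training essays and scores are all hypothetical, and the "human" scores are constructed as 1 + 0.5 x word count so the fit is easy to check by hand.

```python
# Hypothetical sketch of the AES workflow: rate, extract, fit, apply.

def extract_features(essay):
    """Step 2: toy surface features (word count, average word length)."""
    words = essay.split()
    n = len(words)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return [float(n), avg_len]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit(essays, human_scores):
    """Step 3: multiple regression weights (intercept first) via normal equations."""
    rows = [[1.0] + extract_features(e) for e in essays]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, human_scores)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, essay):
    """Step 4: apply the fitted model to a new essay."""
    feats = [1.0] + extract_features(essay)
    return sum(w * f for w, f in zip(weights, feats))

# Step 1: a (made-up) human-rated sample; scores follow 1 + 0.5 * word count.
training_essays = [
    "cats sleep",
    "the dog runs fast",
    "students write essays about many topics",
    "a good essay states its thesis early on",
]
human_scores = [2.0, 3.0, 4.0, 5.0]

weights = fit(training_essays, human_scores)
new_essay = "this new essay has exactly ten words in it today"
predicted = predict(weights, new_essay)
print(round(predicted, 3))  # close to 6.0 for this constructed data
```

Operational systems use many more features and far larger rated samples; the sketch only shows how the prediction model is derived from human ratings and then reused on unseen essays.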
4 AES – Model Determination
- Feature determination
  - Text driven: empirically based quantitative (computational) variables
  - Theoretically driven
- Weight determination
  - Empirically based
  - Theoretically based
6 AES - Examples of Text Features
- Surface variables
  - Essay length
  - Average word / sentence length
  - Variability of sentence length
  - Average word frequency
  - Word similarity to prototype essays
  - Style errors (e.g., repetitious words, very long sentences)
- NLP-based variables
  - The number of “discourse” elements
  - Word complexity (e.g., ratio of different content words to total number of words)
  - Style errors (e.g., passive sentences)
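Several of the surface variables listed above can be computed with the Python standard library alone. The sketch below is illustrative only: the function name surface_features, the regular expressions used to split sentences and words, and the choice of population standard deviation as the "variability" measure are assumptions, not the definitions used by any particular AES system.

```python
import re
from statistics import mean, pstdev

WORD_RE = r"[A-Za-z']+"  # assumed tokenization: runs of letters and apostrophes

def surface_features(essay):
    """Compute a few surface variables for one essay (hypothetical helper)."""
    # Assumed sentence boundary: any run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(WORD_RE, essay)
    sentence_lengths = [len(re.findall(WORD_RE, s)) for s in sentences]
    return {
        "essay_length": len(words),
        "avg_word_length": mean([len(w) for w in words]) if words else 0.0,
        "avg_sentence_length": mean(sentence_lengths) if sentence_lengths else 0.0,
        "sentence_length_sd": pstdev(sentence_lengths) if sentence_lengths else 0.0,
    }

features = surface_features("Short one. This sentence is a bit longer.")
print(features)
```

Variables such as average word frequency or similarity to prototype essays need external resources (a frequency list, a set of reference essays) and are left out of this sketch.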
7 AES: Commercially Available Systems
- Project Essay Grade (PEG)
- Intelligent Essay Assessor (IEA)
- IntelliMetric
- e-rater
8 PEG (Project Essay Grade)
- Scoring method
  - Uses NLP tools (grammar checkers, part-of-speech taggers) as well as surface variables
  - Typical scoring model uses 30-40 features
  - Features are combined to produce a scoring model through multiple regression
- Score dimensions: Content, Organization, Style, Mechanics, Creativity
9 Intelligent Essay Assessor (IEA)
- Scoring method
  - Focuses primarily on the evaluation of content
  - Based on Latent Semantic Analysis (LSA)
  - Based on a well-articulated theory of knowledge acquisition and representation
  - Features combined through hierarchical multiple regression
- Score dimensions: Content, Style, Mechanics
10 IntelliMetric
- Scoring method
  - “Brain-based” or “mind-based” model of information processing and understanding
  - Appears to draw more on artificial intelligence, neural net, and computational linguistic traditions than on theoretical models of writing
  - Uses close to 500 features
- Score dimensions: Content, Creativity, Style, Mechanics, Organization
11 e-rater v2
- Scoring method
  - Based on natural language processing and statistical methods
  - Uses a fixed set of 12 features that reflect good writing
  - Features are combined using hierarchical multiple regression
- Score dimensions
  - Grammar, usage, mechanics, and style
  - Organization and development
  - Topical analysis (content)
  - Word complexity
  - Essay length
12 Writing Dimensions and Features in e-rater v2 (2004)

| # | Feature | Dimension |
|---|---------|-----------|
| 1 | Ratio of grammar errors | Grammar, usage, mechanics, & style |
| 2 | Ratio of mechanics errors | Grammar, usage, mechanics, & style |
| 3 | Ratio of usage errors | Grammar, usage, mechanics, & style |
| 4 | Ratio of style errors | Grammar, usage, mechanics, & style |
| 5 | The number of “discourse” units detected in the essay (i.e., background, thesis, main ideas, supporting ideas) | Organization & development |
| 6 | The average length of each element in words | Organization & development |
| 7 | Similarity of the essay’s content to other previously scored essays in the top score category | Topical analysis |
| 8 | The score category containing essays whose words are most similar to the target essay | Topical analysis |
| 9 | Word repetition (ratio of different content words) | Word complexity |
| 10 | Vocabulary difficulty (based on word frequency) | Word complexity |
| 11 | Average word length | Word complexity |
| 12 | Total number of words | Essay length |
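The two topical-analysis features (7 and 8) both compare the words of a target essay with the words of previously scored essays. A rough sketch of that idea follows. It assumes unweighted word counts and cosine similarity, which the slide does not specify, so it should not be read as e-rater's actual computation. The function names and the tiny scored corpus are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a deliberately simple, assumed representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def topical_features(essay, scored_essays):
    """scored_essays maps a score category to a list of essay texts."""
    target = bag_of_words(essay)
    # Pool all essays in a score category into one word-count vector.
    pooled = {s: bag_of_words(" ".join(texts)) for s, texts in scored_essays.items()}
    sims = {s: cosine(target, vec) for s, vec in pooled.items()}
    top_category = max(pooled)  # highest score category
    return {
        "similarity_to_top_category": sims[top_category],   # analogue of feature 7
        "most_similar_category": max(sims, key=sims.get),   # analogue of feature 8
    }

# Made-up scored corpus with two score categories (1 = low, 6 = high).
scored = {
    1: ["i like dogs", "dogs are fun"],
    6: ["standardized testing measures writing skill",
        "automated scoring predicts human ratings of writing"],
}
result = topical_features("automated scoring of writing", scored)
print(result)
```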
13 Reliability Studies
Studies comparing inter-rater agreement to computer-rater agreement

| System | Author | Test | Sample size | Human-human r | Human-computer r |
|--------|--------|------|-------------|---------------|------------------|
| PEG | Petersen & Page, 1997 | GRE (36-ps) | 497 | .75 | .74-.75 (1-r) |
| PEG | Shermis et al., 2002 | English placement test (1-p) | 386 | .71 | .83 (6-rs) |
| IntelliMetric | Elliot, 2001 | K-12 norm-referenced test | 102 | .84 | .82 (1-r), .85 (2-rs) |
| IEA | Landauer et al., 1997 | GMAT | 188 | .83 | .80 |
| IEA | Foltz et al., 1999 | GMAT | 1,363 | .86-.87 | .86 |
| e-rater | Burstein et al., 1998 | GMAT (13-ps) | 500-1,000 | .82-.89 | .79-.87 (1-r) |
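The r values in these studies are correlations between two sets of scores for the same essays. Assuming they are Pearson correlations, the comparison can be sketched as follows. The score lists are invented for illustration and are not taken from any of the studies cited.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Made-up scores for six essays: two human raters and one automated scorer.
human_1 = [3, 4, 2, 5, 4, 1]
human_2 = [3, 5, 2, 4, 4, 2]
computer = [3, 4, 3, 5, 4, 2]

human_human_r = pearson_r(human_1, human_2)
human_computer_r = pearson_r(human_1, computer)
print(f"human-human r = {human_human_r:.2f}")
print(f"human-computer r = {human_computer_r:.2f}")
```

An automated scorer is usually judged acceptable on this criterion when its human-computer r is about as high as the human-human r for the same essays.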
14 AES: Validity Issues
- To what extent are the text features used by AES programs valid measures of writing skills?
- To what extent is the AES inappropriately sensitive to irrelevant features and insensitive to relevant ones?
- Are human grades an optimal criterion?
- Which external criteria should be used for validation?
- What are the wash-back effects (consequential validity)?
15 Weighting Human & Computer Scores
- Automated scoring used only as a quality control (QC) check
- Automated scoring and human scoring
- Human scoring used only as a QC check
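One way to picture these three options is as a weighted combination of the two scores plus a discrepancy check. The sketch below is hypothetical: the function names, the human_weight parameter, and the max_gap threshold of one score point are illustrative assumptions, not values recommended in the presentation.

```python
def combine(human, machine, human_weight):
    """Weighted final score; human_weight=1.0 reports the human score only,
    human_weight=0.0 reports the machine score only."""
    return human_weight * human + (1.0 - human_weight) * machine

def flag_for_review(human_scores, machine_scores, max_gap=1.0):
    """QC check: indices of essays where the two scores differ by more than max_gap."""
    return [
        i
        for i, (h, m) in enumerate(zip(human_scores, machine_scores))
        if abs(h - m) > max_gap
    ]

# Made-up scores for three essays.
human_scores = [4, 2, 5]
machine_scores = [4.2, 3.5, 4.9]

flagged = flag_for_review(human_scores, machine_scores)
blended = combine(4, 5, human_weight=0.5)
print(flagged, blended)
```

With human_weight set to 1.0 the automated score serves only as the QC check, with 0.0 the roles are reversed, and values in between correspond to reporting a blend of both scores.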
16 AES: To use or not to use?
- Are the essays written by hand or composed on computer?
- Is there enough volume to make AES cost-effective?
- Will students, teachers, and other key constituencies accept automated scoring?
17 Criticism and Reservations
- Insensitive to some important features relevant to good writing
- Fail to identify and appreciate unique writing styles and creativity
- Susceptible to construct-irrelevant variance
- May encourage writing for the computer as opposed to writing for people
18 How to choose a program?
1. Does the system work in a way you can defend?
2. Is there a credible research base supporting the use of the system for your particular purpose?
3. What are the practical implications of using the system?
4. How will the use of the system affect students, teachers, and other key constituencies?