
1 Inter-rater reliability in the Performance Test: Summer workshop 2014 By: Dustin Florence

2 Definitions
ITA candidates: international graduate students seeking TA positions
Performance test: a short teaching demonstration, graded on a Likert scale
Discourse intonation: intonation contours (thought groups, prominence, tone)
Inter-rater reliability: a coefficient that measures how similarly the members of a rater team rate the same performance

3 Why the workshop?
Tech needs ITAs to teach
ITAs need to be able to communicate with undergraduates
ITAs need to pass three tests

4 Who is the workshop for? Stakeholders:
1. ITA candidates
2. Performance test raters
3. ITA workshop directors
4. TTU administrators
5. Department heads of ITA candidates’ departments
6. Undergraduates who might be taught by ITAs

5 Who is this presentation for?
1. Performance test raters
2. ITA workshop directors
3. Anyone involved with rater training
4. Anyone interested in issues of rater reliability
Why? Inter-rater reliability is necessary to answer stakeholders’ worries, and this study is a first step toward validating the summer workshop program.

6 What type of research is inter-rater reliability research? Since this study observes what the ITA candidates do rather than experimentally manipulating them, its research method is “correlational or cross-sectional research” (Field, 2013, p. 13). In this case, we measure how closely two different raters rate the same criteria for the same ITA candidate on the same performance test. Correlation is an accepted measure of reliability. This study uses Kendall’s Tau for the correlation. Kendall’s Tau fits our needs because it does not require a normal distribution and it works well with numerous tied data points (like the strings of 4s and 5s in our data) (Field, 2013).
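A minimal sketch of this statistic in Python (the study itself ran its correlations in SPSS, and the ratings below are hypothetical, not the study’s data). scipy’s default Kendall’s Tau variant, tau-b, adjusts for ties, which is exactly what long runs of identical Likert scores require:

```python
# Minimal sketch: Kendall's tau-b between two raters' Likert scores.
# The ratings here are hypothetical; the study computed this in SPSS.
from scipy.stats import kendalltau

rater_a = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
rater_b = [4, 4, 4, 5, 3, 5, 4, 4, 4, 4]

# scipy defaults to tau-b, which corrects for ties -- important for
# Likert data full of repeated 4s and 5s.
tau, p_value = kendalltau(rater_a, rater_b)
print(f"Kendall's tau-b = {tau:.3f}, p = {p_value:.3f}")
```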

7 Research Questions
Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test?
Do raters’ ratings become more reliable the more experience they have with the ITA candidates and with the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test?

8 Raters
Three teams of paired raters
4 male and 2 female raters
4 native and 2 non-native English speakers
5 held Master’s degrees and 1 was completing a Master’s degree
Experience with ITA candidates ranged from 0 to 20 years (but all had experience teaching EFL/ESL)

9 Rater training
Conducted by the ITA workshop directors
Both trainers were authors of the text used in the workshop and had many years of experience with ITA candidates
A two-day training session was held before the workshop
Raters listened to performances, rated them, and discussed their ratings and reasons

10 Participants
Gender: 53 male, 31 female
Native language: 33 Chinese speakers, 10 Bengali, 8 Farsi, 6 Arabic, 6 Korean, 6 Sinhalese, 4 Tamil, 3 Nepali, 2 Spanish, 2 French, 2 Hindi, and 1 speaker each of English, Indonesian, Japanese, Kamona, Urdu, Vietnamese, and Yoruba
Number in each group: Team one rated 30 students, Team two rated 29 students, Team three rated 25 students

11 Materials
ITA Performance Test version 9.0: four constructs and ten criteria
1. Grammatical competence: pronunciation, word stress, thought groups
2. Textual competence: grammatical structures, transitional phrases, definitions
3. Sociolinguistic competence: prominence, comprehension checks, tone
4. Functional competence: answering students’ questions

12 Procedures
Entered scores in Excel
Used SPSS to calculate Kendall’s Tau coefficients for each team of raters’ ratings of each candidate on the midterm and final tests
Evaluated the reliability coefficients of the final test ratings
Compared the coefficients of the midterm and final tests
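A hypothetical reconstruction of this pipeline in Python rather than Excel + SPSS; the file name, column layout, and rater labels are all assumptions for illustration:

```python
# Sketch of the procedure: one tau per team, test, and criterion.
# Assumed layout: one row per rater per test per candidate, with columns
# team, test ("midterm"/"final"), rater ("A"/"B"), candidate,
# and criterion_1 .. criterion_10 holding the Likert scores.
import pandas as pd
from scipy.stats import kendalltau

scores = pd.read_excel("performance_test_scores.xlsx")  # assumed file name

criteria = [f"criterion_{i}" for i in range(1, 11)]
for (team, test), group in scores.groupby(["team", "test"]):
    rater_a = group[group["rater"] == "A"].sort_values("candidate")
    rater_b = group[group["rater"] == "B"].sort_values("candidate")
    for c in criteria:
        tau, p = kendalltau(rater_a[c], rater_b[c])
        print(f"Team {team}, {test}, {c}: tau = {tau:.3f} (p = {p:.3f})")
```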

13 Analysis
Expect a Kendall’s Tau value of 0.2 to 0.4 for moderate correlation and 0.4 or higher for good correlation, because the ratings are subjective and many factors can influence raters (fatigue, experience with certain groups of English learners).
Expect the difference between the final and midterm correlation coefficients for each criterion to be positive, indicating that the raters are rating more similarly.
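A small sketch of this analysis step, applying the thresholds above to one real pair of values (Team 1, pronunciation) from the tables that follow:

```python
# Label a correlation with the strength bands used in this study,
# then check whether reliability rose from midterm to final.
def strength(tau):
    if tau >= 0.4:
        return "good"
    if tau >= 0.2:
        return "moderate"
    return "weak"

midterm_tau = 0.272  # Team 1, pronunciation (see Team 1 table below)
final_tau = 0.620

print(strength(midterm_tau), strength(final_tau))           # moderate good
print(f"final - midterm = {final_tau - midterm_tau:+.3f}")  # +0.348
```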

14 Team 1
Criterion | Midterm correlation | Final correlation | Final - Midterm
1 pronunciation | .272 | .620** | .348
2 word stress | .169 | Constant (4s), can’t compute | n/a
3 thought groups | -.083 | .686** | .769
4 grammar | .171 | .431* | .260
5 transitional phrases | .331 | .354* | .023
6 definitions and examples | .389* | .566** | .177
7 prominence | .391* | -.050 | -.441
8 comprehension checks | .173 | .546** | .373
9 intonation | .279 | .527** | .248
10 answering questions | .651** | .458** | -.193

15 Team 2
Criterion | Midterm correlation | Final correlation | Final - Midterm
1 pronunciation | .071 | .494* | .423
2 word stress | .405* | .314 | -.091
3 thought groups | -.063 | .316 | .379
4 grammar | Constant (4s), can’t compute | .692** | n/a
5 transitional phrases | .382 | .330 | -.052
6 definitions and examples | Constant (4s), can’t compute | .520* | n/a
7 prominence | .244 | .014 | -.230
8 comprehension checks | .216 | .407* | .191
9 intonation | .329 | .459* | .130
10 answering questions | .626** | .307 | -.319

16 Team 3
Criterion | Midterm correlation | Final correlation | Final - Midterm
1 pronunciation | .309 | .443* | .134
2 word stress | .341 | Constant (4s), can’t compute | n/a
3 thought groups | .243 | .454* | .211
4 grammar | -.107 | .801** | .908
5 transitional phrases | .342 | .256 | -.086
6 definitions and examples | .548** | .336 | -.212
7 prominence | .258 | .753** | .495
8 comprehension checks | .586** | .503** | -.083
9 intonation | .389* | .373* | -.016
10 answering questions | .486** | .618** | .132
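A note on the “Constant (4s), can’t compute” cells: when one rater gives every candidate the same score, the ranks have no variation, so Kendall’s Tau is undefined. A minimal sketch (with hypothetical scores) showing the same behavior:

```python
# When one rating column is constant, tau-b's denominator is zero,
# so scipy returns nan -- the tables' "can't compute" entries.
from scipy.stats import kendalltau

constant = [4] * 10                       # every candidate scored a 4
varying = [4, 5, 3, 4, 5, 4, 4, 3, 5, 4]  # hypothetical partner ratings

tau, p = kendalltau(constant, varying)
print(tau)  # nan
```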

17 The Good, the Bad and the Ugly News
What we are really concerned with are the final test scores; the midterm is practice for both the ITA candidates and the raters. All of the criteria for every rater team had moderate to strong correlation on the final test, with the notable exception of criterion 7, prominence.
Inter-rater reliability for most of the criteria for most of the teams went up from midterm to final, with gains in reliability far outweighing losses in all cases except criterion 7, prominence (and, oddly enough, criterion 10, answering students’ questions).

18 RQs answered
Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test? Yes, in every case except criterion 7, prominence.
Do raters’ ratings become more reliable the more experience they have with the ITA candidates and the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test? Yes, except for prominence (and, to a lesser extent, criterion 10, answering students’ questions).

19 What’s it all mean? Two of the three teams had reliability issues with prominence, there were no other reliability issues, and both of those teams’ reliability on prominence decreased from the midterm to the final test. It seems that the raters had a vague understanding of prominence and/or difficulty perceiving prominence.

20 What’s to be done?
More time in rater training sessions devoted to understanding prominence and its role in the construct of sociolinguistic competence.
More time in rater training sessions devoted to hearing prominence when it is used.

21 Limitations of this study
Studies only one summer workshop
Studies only three rater teams
Different rater teams are likely to have different reliability issues
Did not interview raters to learn the justifications for their ratings

22 Thank you for your attention. Have a great day.

