
1 Evaluating a Norm-Referenced Test Dr. Julie Esparza Brown SPED 510: Assessment Portland State University

2 AGENDA • Quiz • Questions for the good of the group • PPT Evaluating a Norm-Referenced Test • Break • Activity: EL Interview • Assessment of Bilinguals • We will likely only get through half of this and will continue on this topic next week

3 Differences Between Measurement and Evaluation • Measurement: The student correctly answered 85 out of 100 items on a multiple-choice exam. • Evaluation: The student performed at an above average level.

4 Purposes of an Achievement Test • The purpose of an achievement test is to obtain relevant and accurate data needed to make important decisions with a minimum amount of error. • It is a tool for measuring a sample of student performance. • Never rely on a single test or type of measurement instrument.

5 Norm-referenced Tests Benefits: • Norms provide a frame of reference for comparing a student’s test score with the scores of other students in similar programs because they are based on nationally accepted educational goals. • The scores do not indicate what a student has achieved, only how the student compares with other students in the comparison group. • The scores provide a general indication of the strengths and weaknesses of the students in a particular school compared to a composite national curriculum.

6 A Con: Test Bias • A biased test discriminates against a certain group based on socioeconomic status, race, ethnicity, language differences, or gender; this remains a concern because these factors are not fully controlled for by stratified norm samples. • Linguistic bias results from students’ inability to understand an item because the language is too complex.

7 Turn and Talk • Turn to your table partner and give an example of test bias.

8 Reliability • Reliability refers to: • The degree of internal consistency with which the items relate to each other and to the construct being measured, commonly quantified with Cronbach’s alpha. • The reproducibility of a set of scores obtained from a particular group, on a particular day, under particular circumstances. • Achievement test results that are reliable are consistent, reproducible, and generalizable. • For example, a second administration of the same test to the same individual would yield nearly the same result (measurement error means the scores would differ slightly).
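Because the slide names Cronbach’s alpha as the index of internal consistency, a minimal sketch of the computation may help; the item-score matrix below is hypothetical.

```python
# Minimal sketch of Cronbach's alpha (internal-consistency reliability).
# Hypothetical data: rows = students, columns = test items.
import numpy as np

def cronbachs_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
print(round(cronbachs_alpha(scores), 2))  # ~0.7 for this hypothetical data
```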

9 The Reliability Coefficient • Reliability can be quantified by a reliability coefficient, a measure of the amount of variation in test performance. • Reliability estimates range from 0 to 1, with 0 indicating no reliability and 1 indicating perfect reliability. • The reliability coefficient indicates the proportion of variability in a set of scores that reflects true differences among individuals. • Educators seek to use the most reliable procedures and tests available.

10 The Reliability Coefficient • It is an index of the extent to which observations can be generalized; the square of the correlation between obtained scores and true scores on a measure. • The proportion of variability in a set of scores reflects true differences among individuals. • If there is relatively little error, the ratio of true-score variance to obtained-score variance approaches a reliability index of 1.0 (perfect reliability). • If there is a relatively large amount of error, the ratio of true-score variance to obtained-score variance approaches .00 (total unreliability). • We want to use the most reliable tests available.
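To make the true-score/obtained-score ratio above concrete, here is a small sketch; the variance components are hypothetical numbers.

```python
# Reliability coefficient as the ratio of true-score variance to
# obtained-score variance (classical test theory); values are hypothetical.
true_score_variance = 180.0
error_variance = 45.0
obtained_score_variance = true_score_variance + error_variance

reliability = true_score_variance / obtained_score_variance
print(reliability)  # 0.8; approaches 1.0 as error shrinks, .00 as error dominates
```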

11 Standards for Reliability • The more important the decision to be made from the results of a test, the greater the reliability that test should demonstrate. • If test scores are to be used for administrative purposes and are reported for groups of individuals, a reliability of .60 should be the minimum. The relatively low standard is acceptable because group means are not affected by a test’s lack of reliability. • If weekly (or more frequent) testing is used to monitor pupil progress, a reliability of .70 should be the minimum. This relatively low standard is acceptable because random fluctuations can be taken into account when a behavior or skill is measured often. • .60 reliability for administrative purposes; .70 reliability for progress monitoring

12 Standards for Reliability • If the decision being made is a screening decision, there is still a need for higher reliability. For screening devices, a standard of .80 is recommended. • If a test score is to be used to make an important decision concerning an individual student (such as special education placement), the minimum standard should be .90. • .80 reliability for screening purposes; .90 reliability for eligibility/placement purposes
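The cut-offs on this slide and the previous one can be collected into a single lookup; a minimal sketch, with the minimum values taken directly from the slides.

```python
# Minimum reliability standards by purpose of the decision (from the slides above).
MINIMUM_RELIABILITY = {
    "administrative (group) decisions": 0.60,
    "progress monitoring": 0.70,
    "screening": 0.80,
    "eligibility/placement decisions": 0.90,
}

def meets_standard(purpose: str, reliability: float) -> bool:
    # True if the reported reliability coefficient meets the minimum for that purpose.
    return reliability >= MINIMUM_RELIABILITY[purpose]

print(meets_standard("screening", 0.84))                        # True
print(meets_standard("eligibility/placement decisions", 0.84))  # False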

13 Standard Error of Measurement (SEM) • SEM is another index of test error. • It is the standard deviation of the distribution of measurement error around a person’s true score. • It reflects the difference between a student’s actual score and their highest or lowest hypothetical score. • When using a norm-referenced test, we do not know the test taker’s true score or the variance of the measurement error that forms the distribution around that person’s true score. • We estimate the error distribution by calculating the SEM.
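The slide does not give the computation, but the conventional estimate is SEM = SD × √(1 − reliability); a sketch assuming that standard formula, with hypothetical inputs.

```python
# Sketch assuming the conventional formula SEM = SD * sqrt(1 - reliability).
# The standard deviation and reliability below are hypothetical.
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1 - reliability)

# A test with SD = 15 and reliability .96 has an SEM of about 3,
# matching the value used for Joe on the next slides.
print(round(standard_error_of_measurement(sd=15, reliability=0.96), 1))
```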

14 SEM • Joe took a test and received a score of 100, and the SEM is 3. • We will build a “band of error” around his test score of 100 using a 68% interval. (A 68% interval is approximately equal to 1 standard deviation on either side of the mean.)

15 SEM • For a 68% interval, use the following formula: Test score ± 1(SEM). With Joe’s score of 100: 100 ± (1 x 3) = 100 ± 3. • Why “± 1”? Because we are adopting the normal distribution for our theoretical distribution of error, and 68% of the values lie within the area between 1 standard deviation below the mean and 1 standard deviation above the mean. • Chances are 68 out of 100 that Joe’s true score falls within the range of 97 to 103.

16 SEM • What about a 95% confidence interval? A 95% interval is approximately equal to an area within 2 standard deviations on either side of the mean. Test score ± 2(SEM) = 100 ± (2 x 3) = 100 ± 6 • Chances are 95 out of 100 that Joe’s true score falls within the range of 94 to 106.
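The two calculations above reduce to one small routine; a minimal sketch using Joe’s score and SEM from the slides.

```python
# Band of error around an obtained score: 1 SEM for ~68%, 2 SEM for ~95%.
def error_band(score: float, sem: float, n_sem: int):
    return score - n_sem * sem, score + n_sem * sem

print(error_band(100, 3, 1))  # (97, 103)  -> 68% interval
print(error_band(100, 3, 2))  # (94, 106)  -> 95% interval
```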

17 Turn and Talk • Turn to your table partner. Partner 1 (the youngest) is to define reliability in user-friendly terms. • Partner 2 will state the reliability coefficients necessary for screening tools and individual student decisions (such as eligibility).

18 Confidence Interval • The range of scores within which a person’s true score will fall with a given probability. • Since we can never know a person’s true score, we can estimate the likelihood that a person’s true score will be found within a specified range of scores called the confidence interval. • Confidence intervals have two components: • Score range • Level of confidence

19 Confidence Interval • Score range: the range within which a true score is likely to be found. • A range of 80-90 tells us that a person’s true score is likely to be within that range. • Level of confidence: tells us how certain we can be that the true score will be contained within the interval. • If a 90% confidence interval for an IQ is 106-112, we can be 90% sure that the true score will be contained within that interval. • It also means that there is a 5% chance the true score is higher than 112 and a 5% chance the true score is lower than 106. • To have greater confidence would require a wider confidence interval. • You will have the choice of confidence intervals on the WJ-III Compuscore: • 68% option (default) • 90% option • 95% option
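A sketch of how such an interval can be built from an obtained score and the SEM, assuming a normal error distribution; the score of 109 and SEM of 1.8 are hypothetical values chosen only to roughly reproduce the 106-112 example above.

```python
# Confidence interval around an obtained score, assuming normally distributed error.
from scipy.stats import norm

def confidence_interval(score: float, sem: float, level: float):
    z = norm.ppf(0.5 + level / 2)   # e.g. level=0.90 -> z of about 1.645
    return score - z * sem, score + z * sem

for level in (0.68, 0.90, 0.95):    # the three WJ-III Compuscore options
    low, high = confidence_interval(score=109, sem=1.8, level=level)
    print(f"{int(level * 100)}%: {low:.1f} to {high:.1f}")
```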

20 Validity • Test validation is the process of collecting evidence to establish that the inferences based on the test results are appropriate. • A test can have high reliability and yet not really measure anything of importance, or it can fail to be an appropriate measure of a certain construct. • Validity is the most fundamental consideration in developing and evaluating tests. • Validity does not exist on an all-or-none basis. • A test will always be valid to some degree (high, moderate, or weak) in a particular situation with a particular sample. • Validity is the most important consideration in evaluating tests.

21 Validity • “A test that leads to valid inferences in general or about most students may not yield valid inferences about a specific student. First, unless the student has been systematically acculturated in the values, behavior, and knowledge found in the public culture of the United States, a test that assumes such cultural information is unlikely to lead to appropriate inferences about that student.”

22 Validity • “Second, unless a student has been systematically instructed in the content of an achievement test, a test assuming such academic instruction is unlikely to lead to appropriate inferences about the student’s ability to profit from instruction. It would be inappropriate to administer a standardized test of written language (which counts misspelled words as errors) to a student who has been encouraged to use inventive spelling and reinforced for doing so. It is unlikely that the test results would lead to correct inferences about that student’s ability to profit from systematic instruction in spelling” (Salvia, Ysseldyke & Bolt, 2009, p. 63).

23 Types of Validity • Content Validity: How well the sample of test tasks represents the domain of tasks to be measured. • Criterion-related Validity: How well test performance predicts future performance or estimates current performance on some valued measure other than the test itself (called the criterion). • Construct Validity: How well test performance can be interpreted as a meaningful measure of some characteristic or quality.

24 Factors Affecting Content Validity • The test itself (e.g., unclear directions, too short, inadequate time limits). • The administration and scoring of a test (test not administered in a uniform way). • Personal factors influencing how students respond to the test (e.g., student fatigue). • Validity is always specific to a particular group (influenced by factors such as age, sex, and cultural background).

25 Relationship of Reliability and Validity • Test validity is the primary concern. • If a test is not valid, then reliability is moot. • If a test is not valid, there is no point in discussing reliability because test validity is required before reliability can be considered in any meaningful way. • If a test is not reliable, it is also not valid. • In other words, a test can be reliable without being valid; however, a test cannot be valid unless it is reliable.

26 Understanding the Relationship Between Reliability and Validity [Diagram: within the universe of all tests, tests are either reliable or unreliable; valid tests form a subset of reliable tests, and all other tests are invalid.]

27 Factors Affecting Content Validity • Test itself. • The administration and scoring of a test. • Personal factors influencing how students respond to the test. • Validity is always specific to a particular group.

28 Turn and Talk • Turn to your table partner and discuss the concepts of reliability and validity in user-friendly definitions that you can share in class.

29 The Bottom Line • “Test users are expected to ensure that the test is appropriate for the specific students being assessed.” (Salvia, Ysseldyke & Bolt, 2009, p. 71)

