Introduction to Psychometrics: Psychometrics & Measurement Validity

Similar presentations
Standardized Scales.

Criterion-related Validity
Survey Methodology Reliability and Validity EPID 626 Lecture 12.
The Research Consumer Evaluates Measurement Reliability and Validity
1 COMM 301: Empirical Research in Communication Kwan M Lee Lect4_1.
© 2006 The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Validity and Reliability Chapter Eight.
Increasing your confidence that you really found what you think you found. Reliability and Validity.
VALIDITY AND RELIABILITY
1. Observational 2. Physiological and Neuroscientific 3. Self-report --majority of social & behavioral science research.
Measurement Reliability and Validity
Face, Content & Construct Validity
Introduction to Psychological Measurement
Experiment Basics: Variables Psych 231: Research Methods in Psychology.
RESEARCH METHODS Lecture 18
Chapter 4 Validity.
Validity, Sampling & Experimental Control Psych 231: Research Methods in Psychology.
Reliability & Validity
Beginning the Research Design
Psychometrics, Measurement Validity & Data Collection
Introduction to Psychometrics
SOWK 6003 Social Work Research Week 4 Research process, variables, hypothesis, and research designs By Dr. Paul Wong.
Criterion-related Validity About asking if a test “is valid” Criterion related validity types –Predictive, concurrent, & postdictive –Incremental, local,
Measurement Reliability Objective & Subjective tests Standardization & Inter-rater reliability Properties of a “good item” Item Analysis Internal Reliability.
Introduction to Psychometrics Psychometrics & Measurement Validity Constructs & Measurement “Kinds” of Items Properties of a “good measure” –Standardization.
Psych 231: Research Methods in Psychology
Variables cont. Psych 231: Research Methods in Psychology.
Validity, Reliability, & Sampling
Practical Psychometrics Preliminary Decisions Components of an item # Items & Response Approach to the Validation Process.
Chapter 9 Flashcards. measurement method that uses uniform procedures to collect, score, interpret, and report numerical results; usually has norms and.
Chapter 7 Evaluating What a Test Really Measures
Classroom Assessment A Practical Guide for Educators by Craig A
Understanding Validity for Teachers
Chapter 4. Validity: Does the test cover what we are told (or believe)
Test Validity S-005. Validity of measurement Reliability refers to consistency –Are we getting something stable over time? –Internally consistent? Validity.
Measurement and Data Quality
Reliability, Validity, & Scaling
Experimental Research
Instrument Validity & Reliability. Why do we use instruments? Reliance upon our senses for empirical evidence Senses are unreliable Senses are imprecise.
Instrumentation.
Educational Research: Competencies for Analysis and Application, 9th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
MEASUREMENT CHARACTERISTICS Error & Confidence Reliability, Validity, & Usability.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 14 Measurement and Data Quality.
LECTURE 06B BEGINS HERE THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS.
Final Study Guide Research Design. Experimental Research.
Study of the day Misattribution of arousal (Dutton & Aron, 1974)
Reliability & Validity
Measurement Validity.
Experiment Basics: Variables Psych 231: Research Methods in Psychology.
Psychometrics & Validation Psychometrics & Measurement Validity Properties of a “good measure” –Standardization –Reliability –Validity A Taxonomy of Item.
Validity Validity: A generic term used to define the degree to which the test measures what it claims to measure.
Evaluating Survey Items and Scales Bonnie L. Halpern-Felsher, Ph.D. Professor University of California, San Francisco.
Research Methodology and Methods of Social Inquiry Nov 8, 2011 Assessing Measurement Reliability & Validity.
Copyright © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 17 Assessing Measurement Quality in Quantitative Studies.
Measurement Issues General steps –Determine concept –Decide best way to measure –What indicators are available –Select intermediate, alternate or indirect.
Measurement Reliability Objective & Subjective tests Standardization & Inter-rater reliability Properties of a “good item” Item Analysis Internal Reliability.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Validity and Reliability in Instrumentation : Research I: Basics Dr. Leonard February 24, 2010.
Measurement Experiment - effect of IV on DV. Independent Variable (2 or more levels) MANIPULATED a) situational - features in the environment b) task.
Chapter 6 - Standardized Measurement and Assessment
Reliability a measure is reliable if it gives the same information every time it is used. reliability is assessed by a number – typically a correlation.
Educational Research Chapter 8. Tools of Research Scales and instruments – measure complex characteristics such as intelligence and achievement Scales.
Copyright © 2014 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 11 Measurement and Data Quality.
Data Collection Methods NURS 306, Nursing Research Lisa Broughton, MSN, RN, CCRN.
Consistency and Meaningfulness Ensuring all efforts have been made to establish the internal validity of an experiment is an important task, but it is.
Survey Methodology Reliability and Validity
Questions What are the sources of error in measurement?
Test Validity.
Week 3 Class Discussion.
Workshop questionnaire.
5. Reliability and Validity
Presentation transcript:

Introduction to Psychometrics
Psychometrics & Measurement Validity
– Some important language
– Properties of a “good measure”: Standardization, Reliability, Validity
– Common item types
– Reverse keying

Psychometrics (psychological measurement)
The process of assigning values to represent the amounts and kinds of specified attributes, to describe (usually) persons.
– We do not “measure people”; we measure specific attributes or characteristics of a person.
Psychometrics is the “centerpiece” of empirical psychological research and practice.
– All data result from some form of measurement.
– For those data to be useful we need measurement validity.
– The better the measurement, the better the data, and the better the conclusions of the psychological research or application.

Most of what we try to measure in psychology are constructs.
They’re called constructs because most of what we care about as psychologists is not physical measurement, such as height, weight, pressure & velocity…
…rather, the “stuff of psychology” -- learning, motivation, anxiety, social skills, depression, wellness, etc. -- consists of things that “don’t really exist.” They are attributes and characteristics that we’ve constructed to give organization and structure to behavior.
Essentially all of the things we psychologists research, both as causes and effects, are attributive hypotheses with different levels of support and acceptance!

Measurement of constructs is more difficult than measurement of physical properties!
We can’t just walk up to someone with a scale, ruler, graduated cylinder or velocimeter and measure how depressed they are. We have to figure out some way to turn their behavior, self-reports or traces of their behavior into variables that give values for the constructs we want to measure.
So measurement, much like the rest of the research process we’ve learned about so far, is all about representation!
Measurement validity is the extent to which the data (variable values) we have represent the behaviors (constructs) we want to study.

What are the different types of constructs we measure?
The most commonly discussed types are...
Achievement -- “performance” broadly defined (judgements), e.g., scholastic skills, job-related skills, research DVs, etc.
Attitude/Opinion -- “how things should be” (sentiments), e.g., polls, product evaluations, etc.
Personality -- “characterological attributes” (keyed sentiments), e.g., anxiety, psychoses, assertiveness, etc.
There are other types of measures that are often used…
Social Skills -- achievement or personality?
Aptitude -- “how well someone will perform after they are trained and experienced,” but measured before the training & experience; some combination of achievement, personality and “likes”
IQ -- is it achievement (things learned), or is it “aptitude for academics, career and life”?

Each question/behavior is called an item.
Kinds of items: objective items vs. subjective items
– “objective” does not mean “true,” “real” or “accurate”
– “subjective” does not mean “made up” or “inaccurate”
These labels describe how the observer/interviewer/coder transforms the participant’s responses into data.
Objective items – no evaluation, judgement or decision is needed; either “response = data” or a “mathematical transformation”
– e.g., multiple choice, true/false, matching, fill-in-the-blanks
Subjective items – the response must be evaluated and a decision or judgment made about what the data value should be (content coding, diagnostic systems, behavioral taxonomies)
– e.g., essays, interview answers, drawings, facial expressions

Some more language…
A collection of items is called many things, e.g., survey, questionnaire, instrument, measure, test, or scale.
Three “kinds” of item collections you should know:
Scale (Test) – all items are “put together” to get a single score
Subscale (Subtest) – item sets are “put together” to get multiple separate scores
Survey – each item gives a specific piece of information
Most “questionnaires,” “surveys” or “interviews” are a combination of all three.

Desirable Properties of Psychological Measures
– Interpretability of Individual and Group Scores
– Population Norms
– Validity
– Reliability
– Standardization

Standardization
Administration – the test is “given” the same way every time
– who administers the instrument; specific instructions, order of items, timing, etc.
– varies greatly: a multiple-choice classroom test (hand it out) vs. the WAIS (200+ page administration manual)
Scoring – the test is “scored” the same way every time
– who scores the instrument; correct, “partial” and incorrect answers, points awarded, etc.
– varies greatly: a multiple-choice test (fill in the sheet) vs. the WAIS (200+ page scoring manual)

Reliability (Agreement or Consistency)
Inter-rater (inter-observer) reliability -- do multiple observers/coders score an item the same way?
– important whenever using subjective items
Internal reliability -- do the items measure a central “thing”?
– Cronbach’s alpha: α = .00 – 1.00; higher values mean stronger internal consistency/reliability
External reliability -- consistency of scale/test scores
– test-retest reliability: correlate scores from the same test given 3-18 weeks apart
– alternate forms reliability: correlate scores from two “versions” of the test
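Not part of the original slides: a minimal NumPy sketch, with made-up scores and a hypothetical cronbach_alpha helper, of how two of these reliability estimates are commonly computed: Cronbach’s alpha from a respondents-by-items score matrix, and test-retest reliability as the correlation between two administrations of the same test.

```python
# Illustrative only; data and helper name are assumptions, not from the lecture.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 respondents answering 4 items on a 1-5 scale
scores = np.array([[4, 5, 4, 5],
                   [2, 1, 2, 1],
                   [3, 3, 4, 3],
                   [5, 4, 5, 5],
                   [1, 2, 1, 2]])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")

# Test-retest reliability: correlate total scores from two administrations
time1 = scores.sum(axis=1)
time2 = time1 + np.array([0, 1, -1, 0, 1])      # hypothetical retest scores
print(f"Test-retest r: {np.corrcoef(time1, time2)[0, 1]:.2f}")
```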

Validity (Consistent Accuracy)
Face validity -- do the items look like they come from the “domain of interest”?
– non-statistical; judged by members of the “target population”
Content validity -- do the items come from the “domain of interest”?
– non-statistical; judged by “experts in the field”
Criterion-related validity -- does the test correlate with a “criterion”?
– statistical; requires a criterion that you “believe in”
– predictive, concurrent, & postdictive validity
Construct validity -- does the test relate to other measures as it should?
– statistical; discriminant validity:
– convergent validity -- correlates with measures of similar constructs
– divergent validity -- doesn’t correlate with measures of different constructs

“Is the test valid?”
Jum Nunnally (one of the founders of modern psychometrics) claimed this was a “silly question”!
The point wasn’t that tests shouldn’t be “valid,” but that a test’s validity must be assessed relative to…
– the construct it is intended to measure
– the population for which it is intended (e.g., age, level)
– the application for which it is intended (e.g., classifying folks into categories vs. assigning them quantitative values)
So the real question is, “Is this test a valid measure of this construct for this population in this application?” That question can be answered!

Face Validity
Does the test “look like” a measure of the construct of interest?
– it “looks like” a measure of the desired construct to a member of the target population
– will someone recognize the type of information they are responding to?
Possible advantage of face validity…
– if respondents know what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers
Possible limitation of face validity…
– if respondents know what information we are looking for, they might try to “bend & shape” their answers to what they think we want -- “fake good” or “fake bad”

Content Validity
Does the test contain items from the desired “content domain”?
– based on assessment by “subject matter experts” (SMEs) in that content domain
– is especially important when a test is designed to have low face validity, e.g., tests of “honesty” used for hiring decisions
– is generally simpler for “achievement tests” than for “psychological constructs” (or other “less concrete” ideas)
e.g., it is a lot easier for “math experts” to agree whether or not an item should be on an algebra test than it is for “psychological experts” to agree whether or not an item should be on a measure of depression.
Content validity is not “tested for.” Rather, it is “assured” by the informed item selections made by experts in the domain.

Criterion-related Validity
Do the test scores correlate with criterion behavior scores?
The type of criterion-related validity depends on when the criterion behavior occurs relative to the test taken now: before (postdictive), now (concurrent), or later (predictive).
Predictive -- a test taken now predicts a criterion measured later
– we want to estimate what will happen before it does
– e.g., your GRE score (taken now) predicts grad school performance (later)
Concurrent -- a test taken now “replaces” a criterion measured now
– often the goal is to substitute a “shorter” or “cheaper” test
– e.g., the written driver’s test replaces the road test
Postdictive -- a test taken now captures behavior & affect from before
– most of the behavior we study “has already happened”
– e.g., adult memories of childhood feelings or medical history
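Not from the slides: a small illustration, with made-up numbers, of what a predictive validity coefficient looks like in practice: the correlation between test scores collected now and a criterion measured later (echoing the GRE-to-grad-school example above).

```python
# Illustrative sketch; all values are invented for demonstration.
import numpy as np

test_now = np.array([160, 148, 155, 142, 158, 151])          # hypothetical test scores (now)
criterion_later = np.array([3.7, 3.0, 3.5, 2.8, 3.6, 3.2])   # hypothetical later criterion (e.g., GPA)

r = np.corrcoef(test_now, criterion_later)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")
```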

Construct Validity
Does the test interrelate with other tests as a measure of this construct should?
We use the term construct to remind ourselves that many of the terms we use do not have an objective, concrete reality. Rather, they are “made up” or “constructed” by us in our attempts to organize and make sense of behavior and other psychological processes.
Attention to construct validity reminds us that our defense of the constructs we create is really based on the “whole package” of how the measures of different constructs relate to each other.
So, construct validity “begins” with content validity (are these the right types of items?) and then adds the question, “does this test relate as it should to other tests of similar and different constructs?”

The statistical assessment of construct validity… Discriminant Validity
Does the test show the “right” pattern of interrelationships with other variables? It has two parts:
Convergent validity -- the test correlates with other measures of similar constructs
Divergent validity -- the test isn’t correlated with measures of “other, different constructs”
e.g., a new measure of depression should…
– have “strong” correlations with other measures of “depression”
– have negative correlations with measures of “happiness”
– have “substantial” correlation with measures of “anxiety”
– have “minimal” correlations with tests of “physical health,” “faking bad,” “self-evaluation,” etc.
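As an illustration only (all data are simulated and the variable names are assumptions), here is how the convergent/divergent pattern for the hypothetical new depression measure might be checked with simple correlations:

```python
# Simulated data; in real validation these would be scores from actual instruments.
import numpy as np

rng = np.random.default_rng(0)
n = 200
other_depression = rng.normal(size=n)                        # established depression measure
new_depression = other_depression + rng.normal(0, 0.5, n)    # convergent: should be strongly positive
happiness = -other_depression + rng.normal(0, 0.8, n)        # divergent: should be clearly negative
physical_health = rng.normal(size=n)                         # divergent: should be near zero

for name, scores in [("other depression", other_depression),
                     ("happiness", happiness),
                     ("physical health", physical_health)]:
    r = np.corrcoef(new_depression, scores)[0, 1]
    print(f"r(new depression, {name}) = {r:+.2f}")
```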

Population Norms
In order to interpret a score from an individual or group, you must know what scores are typical for that population.
– requires a large, representative sample of the target population (preferably random, research-selected & stratified)
– requires solid standardization, both administrative & scoring
– requires great inter-rater reliability if items are subjective
The result? A scoring distribution for the population, which…
– lets us identify “normal,” “high” and “low” scores
– lets us identify “cutoff scores” to define important populations and subpopulations
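A hypothetical sketch (the norm sample and the cutoff rule are invented for illustration) of how population norms are used: place an individual’s raw score in the norm distribution, and apply a cutoff to flag “high” scores.

```python
# Illustrative only; a real norm sample would come from a large representative study.
import numpy as np

norm_sample = np.random.default_rng(1).normal(loc=50, scale=10, size=5000)  # stand-in norm data
raw_score = 68

percentile = (norm_sample < raw_score).mean() * 100
cutoff = np.percentile(norm_sample, 95)          # e.g., flag the top 5% as "high"

print(f"Score {raw_score} is at the {percentile:.1f}th percentile")
print(f"Scores above {cutoff:.1f} fall in the 'high' range")
```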

Desirable Properties of Psychological Measures
– Standardization: administration & scoring
– Reliability: inter-rater, internal consistency, test-retest & alternate forms
– Validity: face, content, criterion-related, construct
– Population norms: scoring distribution & cutoffs
– Interpretability of individual and group scores

Reverse Keying
We want respondents to carefully read and separately respond to each item of our scale/test. One thing we do is write the items so that some of them are “backwards” or “reversed”…
Consider these items from a depression measure, each rated on a 1-5 disagree-agree scale:
1. It is tough to get out of bed some mornings.
2. I’m generally happy about my life.
3. I sometimes just want to sit and cry.
4. Most of the time I have a smile on my face.
If the person is “depressed,” we would expect them to give a fairly high rating for questions 1 & 3, but a low rating on 2 & 4. Before aggregating these items into a composite scale or test score, we would “reverse key” items 2 & 4 (1=5, 2=4, 4=2, 5=1).
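A small sketch, assuming the 1-5 disagree-agree scale above (responses are made up), of the reverse-keying step: each response x on a reversed item is replaced with 6 - x before the items are summed into a composite.

```python
# Illustrative reverse keying for the four hypothetical depression items above.
import numpy as np

responses = np.array([5, 1, 4, 2])           # hypothetical answers to items 1-4
reverse_keyed_items = [1, 3]                 # zero-based indices of items 2 and 4

scored = responses.copy()
scored[reverse_keyed_items] = 6 - scored[reverse_keyed_items]   # 6 - x flips a 1-5 scale

depression_score = scored.sum()
print(f"Item scores after reverse keying: {scored}, total = {depression_score}")
```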