Presentation is loading. Please wait.

Presentation is loading. Please wait.

IS 4800 Empirical Research Methods for Information Science Class Notes March 2, 2012 Instructor: Prof. Carole Hafner, 446 WVH Tel: 617-373-5116.

Similar presentations


Presentation on theme: "IS 4800 Empirical Research Methods for Information Science Class Notes March 2, 2012 Instructor: Prof. Carole Hafner, 446 WVH Tel: 617-373-5116."— Presentation transcript:

1 IS 4800 Empirical Research Methods for Information Science Class Notes March 2, 2012 Instructor: Prof. Carole Hafner, 446 WVH hafner@ccs.neu.edu Tel: 617-373-5116 Course Web site: www.ccs.neu.edu/course/is4800sp12/

2 Outline Finish discusion of usability testing Hypothesis testing review Sampling, Power and Effect Size Chi square – review and SPSS application Correlation – review and SPSS application Begin t-test if time permits

3 UI/Usabililty evaluation What are the three approaches ?? What are the advantages and disadvantages of each? Explain a usability experiment that is within-subjects Explain a usability experiment that is between- subjects What are the advantages and disadvantages of each ?

4 What is a Usability Experiment? Usability testing in a controlled environment There is a test set of users They perform pre-specified tasks Data is collected (usually quantitative and qualitative) Take mean and/or median value of quantitative attributes Compare to goal or another system Contrasted with “expert review” and “field study” evaluation methodologies The growth of usability groups and usability laboratories

5 Usability Experiment Defining the variables to collect ? Techniques for data collection ? Descriptive statistics to use Potential for inferential statistics Basis for correlational vs experimental claims Reliability and validity

6 Subjects representative sufficient sample Variables independent variable (IV) characteristic changed to produce different conditions. e.g. interface style, number of menu items. dependent variable (DV) characteristics measured in the experiment e.g. time taken, number of errors. Experimental factors

7 Hypothesis prediction of outcome framed in terms of IV and DV null hypothesis: states no difference between conditions aim is to disprove this. Experimental design within groups design each subject performs experiment under each condition. transfer of learning possible less costly and less likely to suffer from user variation. between groups design each subject performs under only one condition no transfer of learning more users required variation can bias results. Experimental factors (cont.)

8 Summative Analysis What to measure? (and it’s relationship to a usability goal) Total task time User “think time” (dead time??) Time spent not moving toward goal Ratio of successful actions/errors Commands used/not used frequency of user expression of: confusion, frustration, satisfaction frequency of reference to manuals/help system percent of time such reference provided the needed answer

9 Measuring User Performance Measuring learnability Time to complete a set of tasks Learnability/efficiency trade-off Measuring efficiency Time to complete a set of tasks How to define and locate “experienced” users Measuring memorability The most difficult, since “casual” users are hard to find for experiments Memory quizzes may be misleading

10 Measuring User Performance (cont.) Measuring user satisfaction Likert scale (agree or disagree) Semantic differential scale Physiological measure of stress Measuring errors Classification of minor v. serious

11 Reliability and Validity Reliability means repeatability. Statistical significance is a measure of reliability Validity means will the results transfer into a real-life situation. It depends on matching the users, task, environment Reliability - difficult to achieve because of high variability in individual user performance

12 Formative Evaluation What is a Usability Problem?? Unclear - the planned method for using the system is not readily understood or remembered (info. design level) Error-prone - the design leads users to stray from the correct operation of the system (any design level) Mechanism overhead - the mechanism design creates awkward work flow patterns that slow down or distract users. Environment clash - the design of the system does not fit well with the users’ overall work processes. (any design level) Ex: incomplete transaction cannot be saved

13 Qualitative methods for collecting usability problems Thinking aloud studies Difficult to conduct Experimenter prompting, non-directive Alternatives: constructive interaction, coaching method, retrospective testing Output: notes on what users did and expressed: goals, confusions or misunderstandings, errors, reactions expressed Questionnaires Should be usability-tested beforehand Focus groups, interviews

14 user observed performing task user asked to describe what he is doing and why, what he thinks is happening etc. Advantages simplicity - requires little expertise can provide useful insight can show how system is actually use Disadvantages subjective selective act of describing may alter task performance Observational Methods - Think Aloud

15 variation on think aloud user collaborates in evaluation both user and evaluator can ask each other questions throughout Additional advantages less constrained and easier to use user is encouraged to criticize system clarification possible Observational Methods - Cooperative evaluation

16 paper and pencil cheap, limited to writing speed audio good for think aloud, diffcult to match with other protocols video accurate and realistic, needs special equipment, obtrusive computer logging automatic and unobtrusive, large amounts of data difficult to analyze user notebooks coarse and subjective, useful insights, good for longitudinal studies Mixed use in practice. Transcription of audio and video difficult and requires skill. Some automatic support tools available Observational Methods - Protocol analysis

17 analyst questions user on one to one basis usually based on prepared questions informal, subjective and relatively cheap Advantages can be varied to suit context issues can be explored more fully can elicit user views and identify unanticipated problems Disadvantages very subjective time consuming Query Techniques - Interviews

18 Set of fixed questions given to users Advantages quick and reaches large user group can be analyzed more rigorously Disadvantages less flexible less probing Query Techniques - Questionnaires

19 Advantages: specialist equipment available uninterrupted environment Disadvantages: lack of context difficult to observe several users cooperating Appropriate if actual system location is dangerous or impractical for to allow controlled manipulation of use. Laboratory studies: Pros and Cons

20 Steps in a usability experiment 1.The planning phase 2.The execution phase 3.Data collection techniques 4.Data analysis

21 The planning phase (your proposal) Who, what, where, when and how much? Who are test users, and how will they be recruited? Who are the experimenters? When, where, and how long will the test take? What equipment/software is needed? How much will the experiment cost? Prepare detailed test protocol *What test tasks? (written task sheets) *What user aids? (written manual) *What data collected? (include questionnaire) How will results be analyzed/evaluated? Pilot test protocol with a few users

22 Execution Phase: Designing Test Tasks Tasks: Are representative Cover most important parts of UI Don’t take too long to complete Goal or result oriented (possibly with scenario) Not frivolous or humorous (unless part of product goal) First task should build confidence Last task should create a sense of accomplishment

23 Detailed Test Protocol What tasks? Criteria for completion? User aids What will users be asked to do (thinking aloud studies)? Interaction with experimenter What data will be collected? All materials to be given to users as part of the test, including detailed description of the tasks.

24 Execution phase Prepare environment, materials, software Introduction should include: purpose (evaluating software) voluntary and confidential explain all procedures recording question-handling invite questions During experiment give user written task description(s), one at a time only one experimenter should talk De-briefing

25 Execution phase: ethics of human experimentation applied to usability testing Users feel exposed using unfamiliar tools and making errors Guidelines: Re-assure that individual results not revealed Re-assure that user can stop any time Provide comfortable environment Don’t laugh or refer to users as subjects or guinea pigs Don’t volunteer help, but don’t allow user to struggle too long In de-briefing answer all questions reveal any deception thanks for helping

26 Data collection - usability labs and equipment Pad and paper the only absolutely necessary data collection tool! Observation areas (for other experimenters, developers, customer reps, etc.) - should be shown to users Videotape (may be overrated) - users must sign a release Video display capture Portable usability labs Usability kiosks

27 Before you start to do any statistics: look at data save original data Choice of statistical technique depends on type of data information required Type of data discrete - finite number of values continuous - any value Analysis of data

28 Testing usability in the field (6 things you can do) 1. Direct observation in actual use discover new uses take notes, don’t help, chat later 2. Logging actual use objective, not intrusive great for identifying errors which features are/are not usedprivacy concerns

29 Testing Usability in the Field (cont.) 3. Questionnaires and interviews with real users ask users to recall critical incidents questionnaires must be short and easy to return 4. Focus groups 6-9 users skilled moderator with pre-planned script computer conferencing?? 5 On-line direct feedback mechanisms initiated by users may signal change in user needs trust but verify 6. Bulletin boards and user groups

30 Advantages: natural environment context retained (though observation may alter it) longitudinal studies possible Disadvantages: distractions noise Appropriate for “beta testing” where context is crucial for longitudinal studies Field Studies: Pros and Cons

31 31 Statistical Thinking (samples and populations) H1: Research Hypothesis: –Population 1 is different than Population 2 H0: Null Hypothesis: –No difference between Pop 1 and Pop 2 State test criteria ( , tails) Compute p(observed difference|H0) –‘p’ = probability observed difference is due to random variation If p accept H1 –alpha typically set to 0.05 for most work –p is called the “level of significance” (actual) –alpha is called the criterion

32 32 Relationship between alpha, beta, and power. Correct p = power Type I err p = alpha Type II err p = beta Correct p = 1-alpha H1 TrueH1 False “The Truth” Decide to Reject H0 & accept H1 Do not Reject H0 & do not accept H1

33 33 Relationship Between Population and Samples When a Treatment Had No Effect

34 34 Relationship Between Population and Samples When a Treatment Had An Effect

35 35 Some Basic Concepts Sampling Distribution –The distribution of every possible sample taken from a population (with size n) Sampling Error –The difference between a sample mean and the population mean: M - μ –The standard error of the mean is a measure of sampling error (std dev of distribution of means)

36 36 Degrees of Freedom –The number of scores in sample with a known mean that are free to vary and is defined as n-1 –Used to find the appropriate tabled critical value of a statistic Parametric vs. Nonparametric Statistics –Parametric statistics make assumptions about the nature of an underlying population –Nonparametric statistics make no assumptions about the nature of an underlying population Some Basic Concepts

37 Population  Mean?Variance? Sampling Sample of size N Mean values from all possible samples of size N aka “distribution of means”    Z M = ( M - 

38 Estimating the Population Variance S 2 is an estimate of σ 2 S 2 = SS/(N-1) for one sample (take sq root for S) For two independent samples – “pooled estimate”: S 2 = df 1 /df Total * S 1 2 + df 2 /df Total * S 2 2 df Total = df 1 + df 2 = (N1 -1) + (N2 – 1) From this calculate variance of sample means: S 2 M = S 2 /N needed to compute t statistic

39 Z tests and t-tests t is like Z: Z = M - μ / t = M – 0 / We use a stricter criterion (t) instead of Z because is based on an estimate of the population variance while is based on a known population variance.

40 Given info about population of change scores and the sample size we will be using (N) T-test with paired samples Now, given a particular sample of change scores of size N We can compute the distribution of means We compute its mean and finally determine the probability that this mean occurred by chance ?  = 0 S 2 est  2 from sample = SS/df df = N-1 S 2 M = S 2 /N

41 t test for independent samples Given two samples Estimate population variances (assume same) Estimate variances of distributions of means Estimate variance of differences between means (mean = 0) This is now your comparison distribution

42 t test for independent samples, continued This is your comparison distribution NOT normal, is a ‘t’ distribution Shape changes depending on df df = (N1 – 1) + (N2 – 1) Distribution of differences between means Compute t = (M1-M2)/SDifference Determine if beyond cutoff score for test parameters (df,sig, tails) from lookup table.

43 43 Effect size The amount of change in the DVs seen. Can have statistically significant test but small effect size.

44 44 Power Analysis Power –Increases with effect size –Increases with sample size –Decreases with alpha Should determine number of subjects you need ahead of time by doing a ‘power analysis’ Standard procedure: –Fix alpha and beta (power) –Estimate effect size from prior studies Categorize based on Table 13-8 in Aron (sm/med/lg) –Determine number of subjects you need –For Chi-square, see Table 13-10 in Aron reading

45 45 X^2 tests –For nominal measures –Can apply to a single measure (goodness of fit) Correlation tests –For two numeric measures t-test for independent means –For categorical IV, numeric DV

46 Categorial Examples Observational study/descriptive claim –Do NU students prefer Coke or Pepsi? Study with correlational claim –Is there a difference between males and females in Coke or Pepsi preference? Experimental Study with causal claim –Does exposure to advertising affect Coke or Pepsi preference? (students assigned to treatments)

47 47 Understanding numeric measures Sources of variance –IV –Other uncontrolled factors (“error variance”) If (many) independent, random variables with the same distribution are added, the result approximately a normal curve –The Central Limit Theorem

48 48 The most important parts of the normal curve (for testing) Z=1.65 5%

49 49 The most important parts of the normal curve (for testing) Z=1.96 2.5% Z=-1.96 2.5%

50 50 Hypothesis testing – one tailed Hypothesis: sample (of 1) will be significantly greater than known population distribution –Population completely known (not an estimate) Example – WizziWord experiment: –H1:  WizziWord >  Word –  (one-tailed) –Population (Word users):  Word  –What level of performance do we need to see before we can accept H1?

51 51 Hypothesis testing – two tailed Hypothesis: sample (of 1) will be significantly different from known population distribution Example – WizziWord experiment: –H1:  WizziWord   Word –  (two-tailed) –Population (Word users):  Word  –What level of performance do we need to see before we can accept H1?

52 52 Standard testing criteria for experiments  Two-tailed

53 53 Don’t try this at home You would never do a study this way. Why? –Can’t control extraneous variables through randomization. –Usually don’t know population statistics. –Can’t generalize from an individual.

54 54 Sampling Sometimes you really can measure the entire population (e.g., workgroup, company), but this is rare… More typical: “Convenience sample” –Cases are selected only on the basis of feasibility or ease of data collection. Assumed ideal: Random sample –e.g., random digit dialing (approx)

55 55 Given info about population and the sample size we will be using (N) Hypothesis testing with a sample wrt distribution of means Now, given a particular sample of size N We can compute the distribution of means We compute its mean and finally determine the probability that this mean occurred by chance

56 56 Population  Mean?Variance? Sampling Sample of size N Mean values from all possible samples of size N aka “distribution of means”  NOTE: This is a normal curve

57 57 t-statistics, t-distributions & t-tests

58 58 Single sample t-test What if you know comparison pop’s mean but not stddev? –Estimate population variance from sample variance Estimate of S^2 = SS/(N-1) S^2 M = S^2/N –Comparison is now a t-test, t=(M-u)/S M –df=N-1

59 59 t-test for dependent means aka “paired sample t-test”

60 60 t-test for dependent means When to use One factor, two-level, within-subjects/repeated measures design -or- One factor, two-level, between-subjects, matched pair design In general, a bivariate categorical IV and numeric DV when the DV scores are correlated. Assumes –Population distribution of individual scores is normal

61 61 Wanted: a statistic for differences between paired individuals In a repeated-measures or matched-pair design, you directly compare one subject with him/herself or another specific subject (not groups to groups). So, start with a sample of change (difference) scores: Sample 1 = Mary’s wpm using Wizziword – Mary’s wpm using Word

62 62 Given info about population of change scores and the sample size we will be using (N) Hypothesis testing with paired samples Now, given a particular sample of change scores of size N We can compute the distribution of means We compute its mean and finally determine the probability that this mean occurred by chance ?  = 0 est  2 from sample df = N-1

63 63 t-test for dependent means Map two measures for each subject into one difference score for each –e.g. change due to intervention = after measure – before measure Null hypothesis (usually) no change –Thus mean of comparison dist is zero

64 64 SPSS

65 65 Analyze/Compare Means/Paired Sample t-test

66 Results paired t(9)=2.665, p<.05

67 67 Issues with t-test for dependent means Use with care on longitudinal data –Significant differences (changes) may have been due to something other than your intervention –Better to use between-subjects design with control group for comparing changes over longitudinal studies

68 68 Between-Subjects Design Have two experimental conditions (treatments, levels, groups) Randomly assign subjects to conditions (why?) Measure numeric outcome in each group Each group is a sample from a population Big question: are the populations the same (null hypothesis) or are they significantly different? –What statistic tests this?

69 69 t-test for independent means Tests association between binomial IV and numeric DV. Examples: –WizziWord vs. Word => wpm –Small vs. Large Monitors => wpd –Wait time sign vs. none => satisfaction

70 70 t-test for independent means Two samples No other information about comparison distribution

71 71 Solution – take two samples, gathered at same time Intervention Control The big question: which is correct? H1 Intervention Control H0 Intervention Control

72 72 Wanted: a statistic to measure how similar two samples are (of numeric measures) “t score for the difference between two means” If samples are identical, t=0 As samples become more different, t increases. What is the comparison distribution? –Want to compute probability of getting a particular t score IF the samples actually came from the same distribution (what is the t score for this case?).

73 73 Why t? In this situation, you do not know the population parameters; they must be estimated from the samples. When you have to estimate a comparison population’s variance, the resulting distribution is not normal – it is a “t distribution”. The particular kind of t distribution we are using in this case is called a “distribution of the difference of means”.

74 74 All things t t distribution shape is parameterized by “degrees of freedom” For a distribution of the difference of means,

75 75 Only remaining loophole

76 76 Assumptions for t –Scores are sampled randomly from the population –The sampling distribution of means is normal –Variances of the two populations (whether they are the same or different) are the same. Typical assumption.

77 Finally – the t test for independent samples Pop1 Pop2 Dist of Means 1 Dist of Means 2 Dist of Difference of Means Est of Mean Pooled est of common variance This is now your comparison distribution S? = Sdifference

78 78 Reporting results Significant results t(df)=tscore, p<sig e.g., t(38)=4.72, p<.05 Non-significant results e.g., t(38)=4.72, n.s.

79 79 SPSS

80 80 SPSS

81 Equal variances assumed t(10)=3.796, p<.05

82 82 Sidebar: Control groups To demonstrate a cause and effect hypothesis, an experiment must show that a phenomenon occurs after a certain treatment is given to a subject, and that the phenomenon does not occur in the absence of the treatment. A controlled experiment (“experimental design”) generally compares the results obtained from an experimental sample against a control sample, which is identical to the experimental sample except for the one aspect whose effect is being tested. You must carefully select your control group in order to demonstrate that only the IV of interest is changing between groups.

83 83 Sidebar: Control groups Standard-of-care control (new vs. old) Non-intervention control “A vs. B” design (shootout) “A vs. A+B” design (e.g., S-O-C vs. S-O-C+intervention) Problem: the “intervention” may cause more than just the desired effect. –Example: giving more attention to intervention Ss in educational intervention Some solutions: –Attention control –Placebo control –Wait list control (also addresses measurement issues)

84 84 Sidebar: Control groups Related concepts Blind test – S does not know group Double blind test – neither S nor experimenter know Manipulation check –Test performed just to see if your manipulation is working. Necessary if immediate effect of manipulation is not obvious. –“Positive control” test for intervention effect –“Negative control” test for lack of intervention effect –Example: Student Center Sign: ask students if they saw & read the sign

85 85 Relationship Between Population and Samples When a Treatment Had No Effect

86 86 Relationship Between Population and Samples When a Treatment Had An Effect

87 87 Some Basic Concepts Sampling Distribution –The distribution of every possible sample taken from a population Sampling Error –The difference between a sample mean and the population mean –The standard error of the mean is a measure of sampling error (std dev of distribution of means)

88 88 Degrees of Freedom –The number of scores in sample with a known mean that are free to vary and is defined as n-1 –Used to find the appropriate tabled critical value of a statistic Parametric vs. Nonparametric Statistics –Parametric statistics make assumptions about the nature of an underlying population –Nonparametric statistics make no assumptions about the nature of an underlying population Some Basic Concepts

89 89 Parametric Statistics Assumptions –Scores are sampled randomly from the population –The sampling distribution of the mean is normal –Within-groups variances are homogeneous Two-Sample Tests –t test for independent samples used when subjects were randomly assigned to your two groups –t test for dependent samples (aka “paired-sample t-test”) used when samples are not independent (e.g., repeated measure)

90 Finally – the t test for independent samples Given two samples Estimate population variances (assume same) Estimate variances of distributions of means Estimate variance of differences between means (mean = 0) This is now your comparison distribution

91 Finally – the t test for independent samples, continued This is your comparison distribution NOT normal, is a ‘t’ distribution Shape changes depending on df df = (N1 – 1) + (N2 – 1) Distribution of differences between means Compute t = (M1-M2)/SDifference Determine if beyond cutoff score for test parameters (df,sig, tails) from lookup table.


Download ppt "IS 4800 Empirical Research Methods for Information Science Class Notes March 2, 2012 Instructor: Prof. Carole Hafner, 446 WVH Tel: 617-373-5116."

Similar presentations


Ads by Google