Panel Session: How Many Users is Enough? Determining Usability Test Sample Size
Intermediate/Advanced Usability Testing Skills/Methods
James R. Lewis, IBM Corp.
Some Current Issues in Sample Size Estimation for Usability Studies
– Adjusting problem discovery rates estimated from small samples
– Any need to revise 1-(1-p)^n?
– Reliability of scenario-based usability testing
I. Adjusting Problem Discovery Rates Estimated from Small Samples
– Two good reasons to compute problem discovery rates (p):
  – Projection of required sample size
  – Estimation of the proportion of problems found
– Bad news: small-sample estimates of p have a serious overestimation bias
– Good news: procedures exist for adjusting this overestimation bias
Hypothetical Results (1st 3 Users)

         P1   P2   P3   P4   P5   P6   P7   P8   P9   P10 | Prop
User1    X    X         X    X    X         X             | 0.86
User2    X    X         X         X         X             | 0.71
User3    X    X         X         X                   X   | 0.71
Prop     1.0  1.0  nd   1.0  0.3  1.0  nd   0.7  nd   0.3 | p = 0.76

– Problems 3, 7, and 9 were never discovered (nd)
– 21 cells in the table (3 users × 7 discovered problems), 16 filled, giving p = 16/21 = 0.76
– Problem first described by Hertzum & Jacobsen (2001), who pointed out that the smallest possible value of p is 1/n
– Duplication of problems increases the estimate of p
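The estimate in the table above can be reproduced with a short script. The matrix below uses one placement of the X marks consistent with the slide's row and column totals; the exact cells are illustrative, since only the totals drive p:

```python
# One arrangement of the hypothetical discovery data consistent with the
# slide's totals: 3 users, 7 discovered problems (P3, P7, P9 never found).
matrix = {
    "User1": {"P1", "P2", "P4", "P5", "P6", "P8"},
    "User2": {"P1", "P2", "P4", "P6", "P8"},
    "User3": {"P1", "P2", "P4", "P6", "P10"},
}
discovered = set().union(*matrix.values())
cells = len(matrix) * len(discovered)                # 3 x 7 = 21 cells
hits = sum(len(found) for found in matrix.values())  # 16 filled cells
p_est = hits / cells
print(f"p = {hits}/{cells} = {p_est:.2f}")           # p = 16/21 = 0.76
```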
Potential Adjustment Methods
– Discounting (LaPlace, Good-Turing)
  – Attempts to allocate probability space to unseen events
  – p(adj) = p(est) / (1 + N1/N), where N1 is the number of events occurring exactly once and N is the total number of events
  – Used in statistical language modeling
– Normalization
  – Based on the lower limit of 1/n
  – p(adj) = (p(est) − 1/n)(1 − 1/n)
– Regression model
  – p(adj) = .16 + .823 p(norm)
  – p(adj) = .210 + .829 p(norm) + .013 n
Results of Monte Carlo Studies
– Studied four problem discovery databases
– Considerable variation in number of participants, number of problems discovered, p, and method of execution
– Overestimation is a serious problem
– Less overestimation as n gets larger
Adequacy of Adjustment Methods
– Regression methods didn't work well
– Good-Turing (GT) discounting tended to leave p slightly inflated
– Normalization tended to underestimate p
– Averaging the GT and normalization adjustments gave the best results across all databases:
  p(adj) = ½[(p(est) − 1/n)(1 − 1/n)] + ½[p(est) / (1 + N1/N)]
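A sketch of the combined adjustment, with N1 counting problems observed exactly once and N the total number of discovered problems (the function and variable names are mine, not from Lewis, 2001):

```python
def adjust_p(p_est, n, n1, n_total):
    """Deflate a small-sample estimate of p (combined method, Lewis, 2001).

    Averages the normalization adjustment (based on the lower limit 1/n)
    with Good-Turing discounting (n1 = problems seen exactly once,
    n_total = total number of problems discovered).
    """
    norm = (p_est - 1 / n) * (1 - 1 / n)
    gt = p_est / (1 + n1 / n_total)
    return 0.5 * norm + 0.5 * gt

# For the hypothetical 3-user table: p(est) = 16/21, and the two problems
# found by a single user occur exactly once among the 7 discovered.
print(adjust_p(16 / 21, n=3, n1=2, n_total=7))  # ~0.44
```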
Projection from Small Samples Given Goal of 90% Discovery

Database  p     p(est)|N=2  p(adj)|N=2  n|N=2  p(est)|N=4  p(adj)|N=4  n|N=4  n (true p)  Dev from goal
Lew89     .16   .566        .218        10     .346        .165        13     14          −.004
Virz90    .36   .662        .361        6      .485        .328        6      6           .031
Mantel    .38   .725        .462        4      .571        .429        5      5           .005
Savings   .26   .629        .311        7      .442        .277        8      8           .006
Key Findings
– A good adjustment procedure must be accurate and have low variability
– The combined adjustment procedure has these properties
– Practitioners can obtain accurate sample size estimates for problem discovery goals ranging from 70% to 95% by making an initial adjusted estimate of p from the first two participants, then refining that estimate with data from the first four participants
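Given an adjusted estimate of p, the projected sample size follows directly from 1-(1-p)^n; a minimal sketch:

```python
import math

def required_n(p, goal=0.90):
    """Smallest n such that 1 - (1 - p)**n >= goal."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# Values from the projection table: Virz90's adjusted two-user estimate
# (.361) projects n = 6 for 90% discovery; the true p of .36 also gives 6.
print(required_n(0.361), required_n(0.36))  # 6 6
```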
II. Any Need to Revise 1-(1-p)^n?
– Some recent papers claim a need for this
  – Replace the single value of p with a probability density function (Woolrych & Cockton, 2001)
  – Replace the single value of p with separate p's for heterogeneous groups of participants/problems (Caulton, 2001)
– Does the binomial probability formula really require a strict homogeneity assumption?
I don't think so, because…
– Data from Lewis (2001) indicate good prediction given adjustment of the initially inflated estimates of p, so prediction using adjusted values of p might fix many of these problems
– Data from Lewis (1994) indicate good modeling for individual problems or groups of problems (using the mean p across the problems in a group to model that group)
– At the lowest possible level, each participant is the sole member of a group
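The group-level argument can be illustrated with a small simulation (the detection probabilities below are assumed, purely for illustration): applying 1-(1-p)^n to each problem with its own p and averaging the results matches the simulated discovery proportion even for a heterogeneous problem set, whereas plugging a single pooled mean p into the formula does not.

```python
import random

random.seed(42)

# Heterogeneous problem set: each problem has its own detection probability.
probs = [0.05, 0.10, 0.25, 0.25, 0.50, 0.60]
n = 5            # participants per simulated study
trials = 20000

found = 0
for _ in range(trials):
    for p in probs:
        # A problem is discovered if any of the n participants hits it.
        if any(random.random() < p for _ in range(n)):
            found += 1
observed = found / (trials * len(probs))

per_group = sum(1 - (1 - p) ** n for p in probs) / len(probs)
p_bar = sum(probs) / len(probs)
pooled = 1 - (1 - p_bar) ** n

print(f"observed {observed:.3f}, per-group model {per_group:.3f}, "
      f"pooled-p model {pooled:.3f}")
```

With these numbers the per-group prediction tracks the simulation closely (about 0.69), while the pooled-p prediction overshoots (about 0.82), which is why modeling homogeneous groups separately, down to single problems or participants, keeps the binomial formula serviceable.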
III. Reliability of Scenario-Based Usability Testing
– Results from Hertzum & Jacobsen, Molich et al., and Kessner et al. are disturbing
– Are we engaging in self-deception similar to that of clinicians using projective tests?
– We need to come to grips with the results from these studies
– Despite these findings, …
Reliability of Usability Testing
– In practice, we often have only a single opportunity to evaluate an interface, so we can't tell whether our interventions have really improved it
– In my experience, when I have conducted standard scenario-based problem-discovery studies with multiple participants and one observer, and have done so iteratively, the measurements across iterations consistently indicate substantial and statistically reliable improvements in usability
Reliability of Usability Testing
– Despite the apparent lack of correspondence among problem reports from independent groups of investigators, and a probable evaluator effect, experience suggests that these usability evaluation methods can work
– An important task for future research is to reconcile this apparent lack of reliability with the apparent reality of usability improvement achieved through iterative application of usability evaluation methods
Some Additional Observations
– We know that observer and participant performance is not perfectly reliable (there is an evaluator effect); otherwise there would be no reason to accumulate data across observers/participants
– No single usability method can detect all possible usability problems; models of problem discovery apply only to the specific usability evaluation setting (method, tasks, etc.)
– Discovering usability problems is fundamentally different from discovering diamonds, because the definition of what constitutes a usability problem is vague
In Closing
– Small-sample estimates of p are inflated, but it is possible to adjust p to reduce the bias
– For most practical purposes there should be no need to complicate the use of 1-(1-p)^n
– Measurements from iterative evaluation and modification of user interfaces indicate that the procedures can be reliable; more research is needed into the lack of reliability across independent evaluations. For example, does fixing reported problems increase usability even when the sets of reported problems are different?
Using 1-(1-p)^n to Plan Sample Sizes: Method from Lewis (1994)

Sample sizes (n) by problem detection probability (p, rows) and desired likelihood of detecting the problem at least once (columns):

p      .50   .75   .85   .90   .95   .99
.01     68   136   186   225   289   418
.05     14    27    37    44    57    82
.10      7    13    18    22    28    40
.15      5     9    12    14    18    26
.25      3     5     7     8    11    15
.50      1     2     3     4     5     7
.90      1     1     1     1     2     2

But this still applies only to the discoverable problems in a given usability evaluation setting (method, tasks, etc.)!
References
Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20, 1-7.
Hertzum, M., & Jacobsen, N. E. (2001). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 13, 421-443.
Kessner, M., Wood, J., Dillon, R. F., & West, R. L. (2001). On the reliability of usability testing. In J. Jacko & A. Sears (Eds.), Conference on Human Factors in Computing Systems: CHI 2001 Extended Abstracts (pp. 97-98). Seattle, WA: ACM Press.
Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.
References (Cont.)
Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., & Kirakowski, J. (1998). Comparative evaluation of usability tests. In Proceedings of the Usability Professionals Association (pp. 189-200). Washington, DC: UPA.
Molich, R., Thomsen, A. D., Karyukina, B., Schmidt, L., Ede, M., van Oel, W., & Arcuri, M. (1999, May). Comparative evaluation of usability tests. In Conference on Human Factors in Computing Systems: CHI 1999 Extended Abstracts (pp. 83-84). Pittsburgh, PA: ACM Press.