Formal User Studies
Marti Hearst (UCB SIMS)
SIMS 213, UI Design & Development
April 13, 1999

Outline
• Experiment Design
  – Factoring Variables
  – Interactions
• Special considerations when involving human participants
• Example: Marking Menus
  – Motivation
  – Hypotheses
  – Design
  – Analysis

Formal Usability Studies
• Situations in which these are useful:
  – to determine time requirements for task completion
  – to compare two designs on measurable aspects
    » time required
    » number of errors
    » effectiveness for achieving very specific tasks
• Require experiment design

Experiment Design
• Experiment design involves determining how many experiments to run and which attributes to vary in each experiment
• Goal: isolate which aspects of the interface really make a difference

Experiment Design
• Decide on:
  – Response variables
    » the outcome of the experiment
    » usually the system performance
    » aka dependent variable(s)
  – Factors (aka attributes)
    » aka independent variables
  – Levels (aka values for attributes)
  – Replication
    » how often to repeat each combination of choices

Experiment Design
• Example: studying a system, ignoring users
• Say we want to determine how to configure the hardware for a personal workstation
  – Hardware choices
    » which CPU (three types)
    » how much memory (four amounts)
    » how many disk drives (from 1 to 3)
  – Workload characteristics
    » administration, management, scientific

Experiment Design
• We want to isolate the effect of each component for the given workload type. How do we do this?
  – WL1 CPU1 Mem1 Disk1
  – WL1 CPU1 Mem1 Disk2
  – WL1 CPU1 Mem1 Disk3
  – WL1 CPU1 Mem2 Disk1
  – WL1 CPU1 Mem2 Disk2
  – …
• There are (3 CPUs) × (4 memory sizes) × (3 disk sizes) × (3 workload types) = 108 combinations!
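A quick way to sanity-check that count is to enumerate the full factorial design. A minimal Python sketch (not from the slides; the level names are illustrative):

    from itertools import product

    # Every combination of workload, CPU, memory size, and disk count.
    workloads = ["admin", "management", "scientific"]
    cpus      = ["CPU1", "CPU2", "CPU3"]
    memories  = ["Mem1", "Mem2", "Mem3", "Mem4"]
    disks     = [1, 2, 3]

    runs = list(product(workloads, cpus, memories, disks))
    print(len(runs))  # 3 * 3 * 4 * 3 = 108 combinations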

Experiment Design
• One strategy to reduce the number of comparisons needed:
  – pick just one attribute
  – vary it
  – hold the rest constant
• Problems:
  – inefficient
  – might miss effects of interactions

Interactions among Attributes

    Non-interacting:        Interacting:
          A1  A2                  A1  A2
      B1   3   5              B1   3   5
      B2   6   8              B2   6   9

[Plots: response at B1 and B2 for A1 and A2; the two lines are parallel in the non-interacting case and diverge in the interacting case]
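One way to read the tables above: compute the effect of moving from A1 to A2 at each level of B. If the difference is the same at every level, the lines are parallel and the attributes do not interact. A small sketch using the table values:

    # Each entry maps a level of B to its (A1, A2) responses.
    non_interacting = {"B1": (3, 5), "B2": (6, 8)}
    interacting     = {"B1": (3, 5), "B2": (6, 9)}

    def a_effect(table):
        """Effect of switching A1 -> A2 at each level of B."""
        return {b: a2 - a1 for b, (a1, a2) in table.items()}

    print(a_effect(non_interacting))  # {'B1': 2, 'B2': 2} -> parallel lines
    print(a_effect(interacting))      # {'B1': 2, 'B2': 3} -> lines diverge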

Experiment Design
• Another strategy: figure out which attributes are important first
• Do this by comparing just a few major attributes at a time
  – if an attribute has a strong effect, include it in future studies
  – otherwise assume it is safe to drop it
• This strategy also allows you to find interactions between attributes

Experiment Design
• Common practice: Fractional Factorial Design
  – compare only the important subsets
  – use experiment design to partially vary the combinations of attributes
• Blocking
  – group factors or levels together
  – use a Latin Square design to arrange the blocks

Adapted from slide by James Landay
Between Groups vs. Within Groups
• Do participants see only one design or both?
• Between-groups experiment
  – two groups of test users
  – each group uses only one of the systems
• Within-groups experiment
  – one group of test users
    » each person uses both systems
    » can't use the same tasks for both (learning)
• Why is this a consideration? People often learn during the experiment.
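To make the difference concrete, a hypothetical assignment sketch in Python (not the study's actual procedure):

    import random

    # A hypothetical pool of 36 participants.
    participants = list(range(36))
    random.shuffle(participants)

    # Between groups: each person sees exactly one system.
    group_a, group_b = participants[:18], participants[18:]

    # Within groups: each person sees both systems, in a counterbalanced
    # order so that learning does not systematically favor one system.
    orders = {p: ("A", "B") if i % 2 == 0 else ("B", "A")
              for i, p in enumerate(participants)}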

Special Considerations for Formal Studies with Human Participants
• Unlike measuring automated systems, studies involving human participants must allow for the following:
  – people get tired
  – people get bored
  – people (may) get upset by some tasks
  – learning effects
    » people will learn how to do the tasks (or the answers to questions) if they are repeated
    » people will (usually) learn how to use the system over time

More Special Considerations
• High variability among people
  – especially when involved in reading/comprehension tasks
  – especially when following hyperlinks! (people can go all over the place)

Experiment Design Example: Marking Menus
Based on Kurtenbach, Sellen, and Buxton, "Some Articulatory and Cognitive Aspects of Marking Menus", Graphics Interface '94.

Experiment Design Example: Marking Menus
• Pie marking menus can reveal
  – the available options
  – the relationship between mark and command
• How they work:
  1. The user presses down with the stylus
  2. The menu appears
  3. The user marks the choice; an ink trail follows

Why Marking Menus?
• Supporting markings with pie menus should ease the transition between novice and expert
• Useful for keyboardless devices
• Useful for large screens
• Pie menus have been shown to be faster than linear menus in certain situations

What do we want to know?
• Are marking menus better than pie menus?
  – Do users have to see the menu?
  – Does leaving an "ink trail" make a difference?
  – Do people improve on these new menus as they practice?
• Related questions:
  – What, if any, are the effects of different input devices?
  – What, if any, are the effects of different menu sizes?

Experiment Factors
• Isolate the following factors (independent variables):
  – Menu condition
    » exposed, hidden, hidden with marks (E, H, M)
  – Input device
    » mouse, stylus, trackball (M, S, T)
  – Number of items in menu
    » 4, 5, 7, 8, 11, 12 (note: both odd and even)
• Response variables (dependent variables):
  – response time
  – number of errors

Experiment Hypotheses
• Note these are stated in terms of the factors (independent variables):
  – Exposed menus will yield faster response times and lower error rates, but not when menu size is small
  – Response variables will increase monotonically with menu size for exposed menus
  – Response time will be sensitive to the number of menu choices for hidden menus (familiar sizes, e.g., 8 and 12, will be easier)
  – Stylus will be better than mouse, and mouse better than trackball

Experiment Hypotheses (cont.)
  – Device performance will be independent of menu type
  – Performance on hidden menus (both marking and hidden) will improve steadily across trials; performance on exposed menus will remain constant

Experiment Design
• Participants
  – 36 right-handed people
    » usually the gender distribution is stated as well
  – considerable mouse experience
  – (almost) no trackball or stylus experience
• Task
  – select target "slices" from a series of different pie menus as quickly and accurately as possible
  – menus were simply numbered segments
    » meaningful items would have longer learning times
  – participants saw running scores
    » lose points for a wrong selection

Experiment Design
• One between-subjects factor
  – menu type
    » three levels: E, H, or M
• Two within-subjects factors
  – device type
    » three levels: M, T, or S
  – number of menu items
    » six levels: 4, 5, 7, 8, 11, 12
• How should we arrange these?

Experiment Design
• Between-subjects design for menu type: three groups (E, H, M), 12 participants each
• How to arrange the devices?

Experiment Design

      E  H  M
      M  T  S
      T  S  M
      S  M  T

• A Latin Square: no label repeats in any row or column
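This particular square is a cyclic rotation, which is the simplest way to construct one. A sketch (not from the slides):

    # A cyclic 3x3 Latin square over the device labels.
    devices = ["M", "T", "S"]
    square = [devices[i:] + devices[:i] for i in range(len(devices))]
    for row in square:
        print(row)
    # ['M', 'T', 'S']
    # ['T', 'S', 'M']
    # ['S', 'M', 'T']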

Experiment Design
• (same Latin square as above)
• How to arrange the menu sizes? Block by size, then randomize the blocks.

Experiment Design
• (same Latin square as above)
• Block by size, then randomize the blocks.
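A sketch of what "block by size, then randomize the blocks" might look like in code (hypothetical; the study's actual randomization procedure is not given in the slides):

    import random

    menu_sizes = [4, 5, 7, 8, 11, 12]

    def block_order(participant_id):
        """Shuffle the six size blocks independently per participant."""
        rng = random.Random(participant_id)  # reproducible per participant
        blocks = menu_sizes[:]
        rng.shuffle(blocks)
        return blocks

    print(block_order(7))  # one possible order of the six size blocks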

Experiment Design
• (same Latin square as above)
• … trials per block

Experiment Overall Results
• So exposing menus is faster… or is it? Let's factor things out more.

A Learning Effect
• When we graph performance over the number of trials, we find a difference between exposed and hidden menus.
• This suggests that participants may eventually become faster using marking menus (hypothesized).
• A later study verified this.

Factoring to Expose Interactions
• Increasing menu size increases selection time and number of errors (hypothesized)
• No differences across menu groups in terms of response time
  – that is, until we factor by menu size AND group
  – then we see that menu size has effects on the hidden groups that are not seen in the exposed group
  – this was hypothesized (12 easier than 11)

Factoring to Expose Interactions
• Stylus and mouse outperformed the trackball (hypothesized)
• Stylus and mouse performed the same (not hypothesized)
• Initially, the effect of input device did not interact with menu type
  – this is when comparing globally
  – BUT…
• More detailed analysis, comparing by both menu type and device type:
  – stylus significantly faster with the Marking group
  – trackball significantly slower with the Exposed group
  – not hypothesized!

[Graph: average response time and errors as a function of device, menu size, and menu type]
Potential explanations:
• Markings provide feedback for when the stylus is pressed properly.
• The ink trail is consistent with the metaphor of using a pen.

Experiment Design
• How can we tell if the order in which the devices appear has an effect on the final outcome?
• Some evidence:
  – There is no significant difference among devices in the Hidden group.
  – The trackball was slowest and most error-prone in all three cases.
• Still, there may be some hidden interactions, but they are unlikely to be strong given the previous graph.

Statistical Tests
• Need to test for statistical significance
  – this is a big area
• Assuming a normal distribution:
  – Student's t-test to compare two conditions
  – ANOVA to compare more than two conditions
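For example, with SciPy (the task times below are made up, not data from the study):

    from scipy import stats

    exposed = [21, 18, 25, 19, 23, 20]
    hidden  = [28, 31, 24, 29, 33, 27]
    marking = [22, 20, 26, 21, 24, 23]

    # Two conditions: Student's t-test.
    t, p = stats.ttest_ind(exposed, hidden)

    # More than two conditions: one-way ANOVA.
    f, p_anova = stats.f_oneway(exposed, hidden, marking)
    print(t, p, f, p_anova)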

Adapted from slide by James Landay
Analyzing the Numbers
• Example: trying to get task time <= 30 min.
  – test gives: 20, 15, 40, 90, 10, 5
  – mean (average) = 30
  – median (middle) = 17.5
  – looks good!
  – wrong conclusion: we cannot actually be certain of anything here
• Factors contributing to our uncertainty:
  – small number of test users (n = 6)
  – results are highly variable (standard deviation = 32)
    » std. dev. measures dispersal from the mean
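These summary numbers are easy to reproduce, e.g. with Python's statistics module:

    import statistics

    times = [20, 15, 40, 90, 10, 5]  # task times from the slide, in minutes
    print(statistics.mean(times))    # 30
    print(statistics.median(times))  # 17.5
    print(statistics.stdev(times))   # ~31.8, the sample standard deviation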

Adapted from slide by James Landay
Analyzing the Numbers (cont.)
• This is what statistics is for
• Crank through the procedures and you find:
  – 95% certain that the typical value is between 5 and 55
• Usability test data is quite variable
  – need lots of data to get good estimates of typical values
  – 4 times as many tests will only narrow the range by 2x, since the interval shrinks with the square root of n
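A sketch of one way to get such an interval: a normal approximation (mean ± 1.96 × SEM), which is presumably what the slide used since it reproduces the 5-to-55 range; with only six samples a t-based interval would be somewhat wider:

    import statistics

    times = [20, 15, 40, 90, 10, 5]
    n = len(times)
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / n ** 0.5  # standard error of the mean

    # 95% interval under a normal approximation.
    low, high = mean - 1.96 * sem, mean + 1.96 * sem
    print(round(low, 1), round(high, 1))  # 4.6 55.4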

Followup Work
• Hierarchical Marking Menu study

Followup Work
• Results of using marking menus over an extended period of time
  – two-person extended study
  – participants became much faster using gestures without viewing the menus

Followup Work
• Results of using marking menus over an extended period of time (cont.)
  – participants temporarily returned to "novice" mode when they had been away from the system for a while

Summary
• Formal studies can reveal detailed information but take extensive time and effort
• Human participants entail special requirements
• Experiment design involves:
  – factors, levels, participants, tasks, hypotheses
  – it is important to consider which factors are likely to have real effects on the results, and to isolate these
• Analysis
  – often need to involve a statistician to do it right
  – need to determine statistical significance
  – important to make plots and explore the data

References
• Kurtenbach, Sellen, and Buxton, "Some Articulatory and Cognitive Aspects of Marking Menus", Graphics Interface '94.
• Kurtenbach and Buxton, "User Learning and Performance with Marking Menus", Graphics Interface '94.
• Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991.
• Gonick and Smith, The Cartoon Guide to Statistics, HarperPerennial, 1993.
• Dix et al. textbook