SBD: Usability Evaluation

SBD: Usability Evaluation Chris North CS 3724: HCI

Scenario-Based Design
- ANALYZE: analysis of stakeholders, field studies; claims about current practice; Problem scenarios
- DESIGN: Activity scenarios, Information scenarios, Interaction scenarios; metaphors, information technology, HCI theory, guidelines; iterative analysis of usability claims and re-design
- PROTOTYPE & EVALUATE: Usability specifications; formative evaluation; summative evaluation

Evaluation: Formative vs. Summative; Analytic vs. Empirical

Usability Engineering: Reqs Analysis -> Design -> Develop -> Evaluate, over many iterations

Usability Engineering: formative evaluation and summative evaluation

Usability Evaluation
- Analytic Methods: usability inspection, expert review
  - Heuristic evaluation: Nielsen's 10
  - Cognitive walkthrough
  - GOMS analysis
- Empirical Methods:
  - Usability testing: field or lab; observation, problem identification
  - Controlled experiment: formal controlled scientific experiment; comparisons, statistical analysis

User Interface Metrics Ease of learning Ease of use User satisfaction

User Interface Metrics
- Ease of learning: learning time, …
- Ease of use: performance time, error rates, …
- User satisfaction: surveys, …
- Not "user friendly"
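
These metrics come from concrete observations during a test session. A minimal sketch of how raw observations might be turned into the numbers above (Python; the users, task, times, and error counts are hypothetical):

    # Minimal sketch: computing simple UI metrics from logged observations.
    # All users, tasks, and numbers below are hypothetical.
    observations = [
        # (user, task, time_s, errors, completed)
        ("user1", "find house", 14, 1, True),
        ("user2", "find house", 22, 3, False),
        ("user3", "find house",  9, 0, True),
    ]

    times  = [t for (_, _, t, _, _) in observations]
    errors = [e for (_, _, _, e, _) in observations]
    completion_rate = sum(c for (*_, c) in observations) / len(observations)

    print(f"mean performance time: {sum(times) / len(times):.1f} s")
    print(f"mean errors per task:  {sum(errors) / len(errors):.1f}")
    print(f"task completion rate:  {completion_rate:.0%}")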

Usability Testing

Usability Testing
- Formative: helps guide design
- Early in the design process (once the architecture is finalized, it's too late!)
- Small # of users
- Usability problems, incidents
- Qualitative feedback from users
- Quantitative usability specification

Usability Specification Table (e.g. frequent tasks should be fast):
  Benchmark task: "Find most expensive house for sale"
  Worst case: 1 min | Planned target: 10 sec | Best case (expert): 3 sec | Observed: ??? sec
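
A specification like this can be checked mechanically as observed times come in. A minimal sketch (Python; the thresholds come from the example row above, and the observed times are hypothetical):

    # Minimal sketch: check observed benchmark-task times against the
    # usability specification. Observed times are hypothetical.
    from statistics import median

    spec = {
        # task: (worst_case_s, planned_target_s, best_case_s)
        "find most expensive house for sale": (60, 10, 3),
    }
    observed = {
        "find most expensive house for sale": [14, 9, 22],  # one time per user
    }

    for task, (worst, target, best) in spec.items():
        obs = median(observed[task])
        if obs <= target:
            status = "meets planned target"
        elif obs <= worst:
            status = "between target and worst case"
        else:
            status = "fails worst case"
        print(f"{task}: observed {obs} s vs target {target} s -> {status}")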

Usability Test Setup
- Set of benchmark tasks
  - Derived from scenarios (Reqs analysis phase) and claims analysis (Design phase)
  - Easy to hard, specific to open-ended
  - Coverage of different UI features, e.g. "Find the 5 most expensive houses for sale"
  - Different types: learnability vs. performance
- Consent forms
  - Not needed unless recording the user's face/voice (new rule)
- Experimenters
  - Facilitator: instructs the user
  - Observers: take notes, collect data, videotape the screen
  - Executor: runs the prototype for faked parts
- Users
  - Solicit from the target user community (Reqs analysis)
  - 3-5 users; quality, not quantity

Usability Test Procedure
- Goal: mimic real life; do not cheat by helping users complete tasks
- Initial instructions: "We are evaluating the system, not you."
- Repeat:
  - Give the user the next benchmark task
  - Ask the user to "think aloud"
  - Observe, note mistakes and problems
  - Avoid interfering; hint only if completely stuck
- Interview: verbal feedback, questionnaire
- ~1 hour / user

Usability Lab E.g. McBryde 102

Data
- Note taking: e.g. "&%$#@ user keeps clicking on the wrong button…"
- Verbal protocol (think aloud): e.g. user thinks that button does something else…
- Rough quantitative measures: HCI metrics, e.g. task completion time, …
- Interview feedback and surveys
- Videotape of screen & mouse
- Eye tracking, biometrics?

Analyze
- Initial reaction: "stupid user!", "that's developer X's fault!", "this sucks"
- Mature reaction: "how can we redesign the UI to solve that usability problem?" The data is always right.
- Identify usability problems:
  - Learning issues: e.g. can't figure out or didn't notice a feature
  - Performance issues: e.g. arduous, tiring to solve tasks
  - Subjective issues: e.g. annoying, ugly
- Problem severity: critical vs. minor

Cost-Importance Analysis
- Importance 1-5 (task effect, frequency):
  - 5 = critical, major impact on user, frequent occurrence
  - 3 = user can complete task, but with difficulty
  - 1 = minor problem, small speed bump, infrequent
- Ratio = importance / cost; sort by this
- 3 categories: must fix, next version, ignored
- Table columns: Problem | Importance | Solutions | Cost | Ratio I/C
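
The ratio-and-sort step is simple to express in code. A minimal sketch (Python; the problems, importance scores, costs, and category cut-offs are all made up for illustration):

    # Minimal sketch of cost-importance analysis.
    # Problems, importance, cost, and cut-offs are hypothetical.
    problems = [
        {"problem": "zoom feature not noticed", "importance": 5, "cost": 2},
        {"problem": "repeated zooming is slow", "importance": 3, "cost": 4},
        {"problem": "icon looks dated",         "importance": 1, "cost": 1},
    ]

    for p in problems:
        p["ratio"] = p["importance"] / p["cost"]

    # Sort by importance/cost ratio (highest first), then bucket.
    for p in sorted(problems, key=lambda p: p["ratio"], reverse=True):
        if p["ratio"] >= 1.5:
            bucket = "must fix"
        elif p["ratio"] >= 0.75:
            bucket = "next version"
        else:
            bucket = "ignored"
        print(f'{p["problem"]}: ratio {p["ratio"]:.2f} -> {bucket}')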

Refine UI
- Solve problems in order of importance/cost
- Simple solutions vs. major redesigns
- Iterate: test, refine, test, refine, test, refine, … Until?

Refine UI
- Solve problems in order of importance/cost
- Simple solutions vs. major redesigns
- Iterate: test, refine, test, refine, test, refine, … Until?
- Until the design meets the usability specification

Examples
- Learnability problem
  - Problem: user didn't know he could zoom in to see more…
  - Potential solutions: better labeling (better zoom button icon, tooltip); clearer affordance (add a zoom bar slider, like Google Maps); …
  - NOT: more "help" documentation! You can do better.
- Performance problem
  - Problem: user took too long to repeatedly zoom in…
  - Faster affordance: add a real-time zoom bar
  - Shortcuts: icons for each zoom level (state, city, street)

Project (step 6): Usability Test
- Usability evaluation:
  - >= 3 users, not (tainted) HCI students
  - ~10 benchmark tasks
  - Simple data collection (biometrics optional!)
  - Exploit this opportunity to improve your design
- Report:
  - Procedure (users, tasks, specs, data collection)
  - Usability problems identified, specs not met
  - Design modifications

Controlled Experiments

Usability Test vs. Controlled Experiment
- Usability test: formative, helps guide design; single UI, early in the design process; few users; usability problems, incidents; qualitative feedback from users; engineering oriented
- Controlled experiment: summative, measures the final result; compares multiple UIs; many users, strict protocol; independent & dependent variables; quantitative results, statistical significance; science oriented

What is Science?

What is Science? Phenomenon -> Measurement -> Modeling -> Engineering. E.g. the advent of the microarray measurement instrument -> a flourishing of biology research.

Scientific Method?

Scientific Method
- Form hypothesis
- Collect data
- Analyze
- Accept/reject hypothesis
How to "prove" a hypothesis in science?

Scientific Method
- Form hypothesis
- Collect data
- Analyze
- Accept/reject hypothesis
How to "prove" a hypothesis in science?
- It is easier to disprove things, by counterexample
- Null hypothesis = the opposite of the hypothesis
- Disprove the null hypothesis
- Hence, the hypothesis is proved

Example. Typical question: which visualization is better for which user tasks? Spotfire vs. TableLens

Cause and Effect
- Goal: determine "cause and effect"
  - Cause = visualization tool (Spotfire vs. TableLens)
  - Effect = user performance time on task T
- Procedure: vary the cause, measure the effect
- Problem: random variation. Is the cause the vis tool OR random variation?
- Real world -> (random variation) -> collected data -> uncertain conclusions

Stats to the Rescue
- Goal: show that the measured effect is unlikely to result from random variation
- Hypothesis: cause = visualization tool (e.g. Spotfire ≠ TableLens)
- Null hypothesis: the visualization tool has no effect (e.g. Spotfire = TableLens); hence, cause = random variation
- Stats: if the null hypothesis were true, the measured effect would occur with probability < 5% (i.e. measured effect >> random variation)
- Hence: the null hypothesis is unlikely to be true; hence, the hypothesis is likely to be true
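
One way to build intuition for "unlikely under random variation" is a small permutation simulation: shuffle the tool labels (pretending the tool has no effect) and see how often chance alone produces a difference as large as the one measured. A minimal sketch (Python; all timing data are hypothetical):

    # Minimal sketch: permutation-test intuition for the null hypothesis.
    # All timing data below are hypothetical.
    import random

    spotfire  = [37, 41, 35, 39, 44, 36, 40, 38]
    tablelens = [30, 28, 33, 29, 35, 27, 31, 32]
    observed = abs(sum(spotfire) / len(spotfire) - sum(tablelens) / len(tablelens))

    pooled, extreme, trials = spotfire + tablelens, 0, 10_000
    for _ in range(trials):
        random.shuffle(pooled)                 # pretend the tool has no effect
        a, b = pooled[:len(spotfire)], pooled[len(spotfire):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1

    print(f"p is roughly {extreme / trials:.4f}")  # small p -> reject the null hypothesis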

Variables
- Independent variables (what you vary), and treatments (the variable values):
  - Visualization tool: Spotfire, TableLens, Excel
  - Task type: find, count, pattern, compare
  - Data size (# of items): 100, 1000, 1000000
- Dependent variables (what you measure):
  - User performance time
  - Errors
  - Subjective satisfaction (survey)
  - HCI metrics

Example: 2 x 3 design, n users per cell
- Ind Var 1 (Vis. Tool): Spotfire, TableLens (rows)
- Ind Var 2 (Task Type): Task1, Task2, Task3 (columns)
- Each cell holds the measured user performance times (dep var)
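
The cells of a factorial design like this can be enumerated directly. A minimal sketch (Python; the tool and task names come from the slide, and the per-cell count is an arbitrary choice):

    # Minimal sketch: enumerate the cells of a 2 x 3 factorial design.
    from itertools import product

    tools = ["Spotfire", "TableLens"]      # independent variable 1
    tasks = ["Task1", "Task2", "Task3"]    # independent variable 2
    n_per_cell = 20                        # arbitrary choice

    for tool, task in product(tools, tasks):
        print(f"cell: {tool} x {task}, {n_per_cell} users, "
              f"measure performance time (dependent variable)")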

Groups
- "Between subjects" variable: 1 group of users for each variable treatment
  - Group 1: 20 users, Spotfire
  - Group 2: 20 users, TableLens
  - Total: 40 users, 20 per cell
- "Within subjects" (repeated) variable: all users perform all treatments
  - Counter-balance the order effect:
  - Group 1: 20 users, Spotfire then TableLens
  - Group 2: 20 users, TableLens then Spotfire
  - Total: 40 users, 40 per cell
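
Counter-balancing a within-subjects variable just means alternating (or rotating) the treatment order across users. A minimal sketch (Python; the user IDs are hypothetical):

    # Minimal sketch: counter-balance treatment order for a within-subjects variable.
    orders = [("Spotfire", "TableLens"), ("TableLens", "Spotfire")]

    users = [f"user{i:02d}" for i in range(1, 41)]   # 40 hypothetical users
    for i, user in enumerate(users):
        first, second = orders[i % len(orders)]      # alternate the two orders
        print(f"{user}: {first} first, then {second}")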

Issues
- Eliminate or measure extraneous factors; randomize
- Fairness: identical procedures, …
- Bias
- User privacy, data security
- IRB (Institutional Review Board)

Procedure (for each of the n users):
- Sign legal forms
- Pre-survey: demographics
- Instructions (do not reveal the true purpose of the experiment)
- Training runs
- Actual runs: give a task, measure performance
- Post-survey: subjective measures

Data: the measured dependent variables, in a spreadsheet (times in seconds):
            Spotfire              TableLens
  User      task1  task2  task3   task1  task2  task3
  user1     12     45     104     13     51     138
  user2     16     38     97      10     48     116
  …
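
For analysis it often helps to hold the measurements in "long" form (one row per user x tool x task) rather than the wide spreadsheet layout above. A minimal sketch using pandas (an assumption about tooling; any spreadsheet or stats package works), with the two users' times from the slide; the final groupby reproduces the kind of averages table shown two slides later:

    # Minimal sketch: the wide spreadsheet recast in long form, then averaged.
    import pandas as pd

    data = pd.DataFrame([
        # user, tool, task, time_s
        ("user1", "Spotfire",  "task1",  12), ("user1", "Spotfire",  "task2", 45),
        ("user1", "Spotfire",  "task3", 104), ("user1", "TableLens", "task1", 13),
        ("user1", "TableLens", "task2",  51), ("user1", "TableLens", "task3", 138),
        ("user2", "Spotfire",  "task1",  16), ("user2", "Spotfire",  "task2", 38),
        ("user2", "Spotfire",  "task3",  97), ("user2", "TableLens", "task1", 10),
        ("user2", "TableLens", "task2",  48), ("user2", "TableLens", "task3", 116),
    ], columns=["user", "tool", "task", "time_s"])

    # Mean performance time per tool x task (cf. the averages table below).
    print(data.groupby(["tool", "task"])["time_s"].mean().unstack())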

Step 1: Visualize it
- Dig out interesting facts
- Qualitative conclusions
- Guide stats
- Guide future experiments

Step 2: Stats. Average user performance times in seconds (dep var); Ind Var 1: Vis. Tool (rows), Ind Var 2: Task Type (columns):
              Task1   Task2   Task3
  Spotfire    37.2    54.5    103.7
  TableLens   29.8    53.2    145.4

TableLens better than Spotfire? Problem with averages? [Bar chart: average performance time (secs), Spotfire vs. TableLens]

TableLens better than Spotfire? Problem with averages: they are lossy and compare only 2 numbers. What about the 40 data values? (Show me the data!) [Bar chart: average performance time (secs), Spotfire vs. TableLens]

The real picture: need stats that compare all the data. What if all users were 1 sec faster on TableLens? What if only 1 user was 20 sec faster on TableLens? [Charts: per-user performance times (secs), Spotfire vs. TableLens]

Statistics
- t-test: compares 1 dep var on 2 treatments of 1 ind var
- ANOVA (Analysis of Variance): compares 1 dep var on n treatments of m ind vars
- Result: p = probability that the difference between treatments is random (the null hypothesis)
  - "Statistical significance" level; typical cut-off: p < 0.05
  - Hypothesis confidence = 1 - p
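
As a concrete sketch of both tests (Python with scipy, which is an assumption about tooling since the slides use Excel; all timing data below are hypothetical):

    # Minimal sketch: t-test and one-way ANOVA on hypothetical
    # performance times (seconds) for one task.
    from scipy import stats

    spotfire  = [37, 41, 35, 39, 44, 36, 40, 38]
    tablelens = [30, 28, 33, 29, 35, 27, 31, 32]

    t, p = stats.ttest_ind(spotfire, tablelens)    # 2 treatments of 1 ind var
    print(f"t-test: t = {t:.2f}, p = {p:.4f}")     # p < 0.05 -> significant

    # ANOVA generalizes to n treatments of the independent variable.
    excel = [45, 50, 47, 52, 49, 46, 51, 48]
    f, p = stats.f_oneway(spotfire, tablelens, excel)
    print(f"ANOVA:  F = {f:.2f}, p = {p:.4f}")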

In Excel

p < 0.05
- Woohoo! Found a "statistically significant" difference; the averages determine which is 'better'
- Conclusion: cause = visualization tool (e.g. Spotfire ≠ TableLens); the vis tool has an effect on user performance for task T …
- "95% confident that TableLens is better than Spotfire …", NOT "TableLens beats Spotfire 95% of the time"
- 5% chance of being wrong! Be careful about generalizing.

p > 0.05: Hence, no difference? The vis tool has no effect on user performance for task T? Spotfire = TableLens?

p > 0.05: Hence, no difference? The vis tool has no effect on user performance for task T? Spotfire = TableLens? NOT!
- Did not detect a difference, but they could still be different
- A potential real effect did not overcome random variation
- Provides evidence for Spotfire = TableLens, but not proof
- Boring: basically found nothing
- How? Not enough users; need better tasks, data, …

Data Mountain Robertson, “Data Mountain” (Microsoft)

Data Mountain: Experiment
- Data Mountain vs. IE favorites
- 32 subjects
- Organize 100 pages, then retrieve them based on cues
- Indep. vars: UI (Data Mountain old, Data Mountain new, IE); Cue (title, summary, thumbnail, all 3)
- Dependent variables: user performance time; error rates (wrong pages, failed to find within 2 min); subjective ratings

Data Mountain: Results Spatial Memory! Limited scalability?