1 Advanced Human-Computer Interaction
Usability Evaluation: Analytical, Expert and Empirical Methods

2 Lecture Overview
Features of usable systems
Standards/Guidelines
Definition/motivation for evaluation
Types of evaluation
Formative/summative methods
Analytic/Expert/Empirical methods
Qualitative / Quantitative data
What/where/how to evaluate?
Data types and measurement tools

3 Criteria for Software Quality (ISO 9126): Evaluation of Software
Functionality
Reliability
Usability
Efficiency
Maintainability
Portability
Traditional systems design concentrated on the first of these criteria; now usability has come to the fore……

4 Usability - For Specified User and Specified Task
Learnability
  Ease of learning (What features make learning easier?)
  Skills retained over time (Can you remember next week?)
Throughput
  Speed of user task performance (Does it do tasks at an acceptable rate?)
  Low user error rate (Features to prevent errors? Easily corrected?)
Flexibility
  Suitability for intended user expertise (Beginners? Experts?)
  User expertise levels
  Freedom of object / action selection
  System customization (Can it be customized?)
Attitude
  User subjective satisfaction with system (Do you like it / like using it?)

5 Usability: Key Concept in HCI
Previously looked at usability in general form: 4 key factors
Now: 'can we measure the usability of a computer system?'
'Can we prove that one design has better throughput than another?'
If so, then we could reliably compare different designs or determine whether a system was 'good enough' to sell
Need to compare like-with-like to be valid!

6 ISO 9241 Usability Definition (1990)
'The effectiveness, efficiency and satisfaction with which specified users can achieve specified goals in a particular environment.'
How can we measure things like effectiveness, efficiency and satisfaction? Need to bear in mind:
Target users (with specific characteristics)
Goals and tasks needed; measure their performance
Test in the kind of environment in which the SW will be used (isolated or working?)

7 Functional and Usability Specifications
TRADITIONAL SYSTEMS: Functional specifications are central to ensuring system functionality
  i.e. statements of exactly what the finished system can do and what functions it can perform
  Programmer then knows inputs, processing and outputs
HCI APPROACH: Usability specifications are central to ensuring system usability
  Making systems easy to use and acceptable to users
So need to define USABILITY SPECIFICATIONS in parallel with functional specifications

8 Purpose of Setting Usability Specifications
Allows designers to define acceptable standards to be achieved by the SW, e.g. it needs to be better than the current version
When the standards are achieved, the SW goes to market
Establish when an interface is 'good enough'
When to stop iteration

9 Common Usability Factors
Speed of operation
Completion rate
Error-free rate
Satisfaction rating
Learnability
Retainability
Advanced feature usage
Main software usability measures identified by IBM to evaluate SW (throughput measures)

10 Speed of Operation
A measure of the delay between initiating an action and achieving the user's goal
Can be critical, e.g. for safety reasons: if things go wrong in a nuclear power station, users must be able to identify/rectify the problem immediately
One means of judging is to ask whether the task could be done more quickly by other means, e.g. will an internal mail system be quicker than paper copy or word of mouth?

11 Completion Rate Given a set of key (‘benchmark’) tasks, how many of them can the average user actually do, or complete, within a specified time. How long does each person take to complete each task?

12 Error-free Rate
Observe users, count the total no. of actions taken in carrying out tasks, and log the % of actions that did not result in an error being made (% error-free)
OR
In using the SW for a specified time, what proportion of the time was not spent dealing with errors?
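A small worked sketch of both versions of the measure; the counts and times below are invented for illustration only.

```python
# Error-free rate, computed two ways (illustrative numbers).

# 1. Proportion of actions that did not result in an error
total_actions = 120
actions_with_errors = 9
error_free_rate = 100 * (total_actions - actions_with_errors) / total_actions
print(f"{error_free_rate:.1f}% of actions were error-free")          # 92.5%

# 2. Proportion of the session not spent dealing with errors
session_secs = 30 * 60          # a 30-minute session
secs_spent_on_errors = 4 * 60   # 4 minutes spent recovering from errors
time_based_rate = 100 * (1 - secs_spent_on_errors / session_secs)
print(f"{time_based_rate:.1f}% of the time was free of error handling")  # 86.7%
```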

13 Satisfaction Rating
A subjective judgement by the user
Cannot be objectively measured
Can be quasi-quantified using a rating scale
Use a questionnaire
Very important for user acceptance, future user purchases

14 Learnability
Difficult to quantify
May take a long time to judge, as complex SW is more difficult/takes longer to learn
Could try to measure:
  No. of features the user can learn just from the interface?
  No. of features that resemble those of similar interfaces with which the user is familiar?
  No. of sections of the handbook/on-line help referred to?
  How often during the learning period the user had to ask for help?
  Etc.

15 Retainability
Can a user remember/recognize how to perform certain functions some time after they last used the interface/SW?
By definition, requires time to test
Could use benchmark tasks, performed at set intervals (weeks, months, etc.)

16 Advanced Feature Usage
Vendors sell updates that add new features and fix bugs
BUT 80% of users only use 20% of the features available in a package
Sometimes an 'advanced' feature is used in a simple way not intended by the designers, e.g. a 3D spreadsheet feature used as a 'paper clip'

17 Time Dimension for Usability Factors (1)
When do you conduct usability studies?
HCI designers make an 'informed' guess about a number of factors
These are set out in the usability spec for the designer to start producing the 1st prototype
When the 1st prototype is produced, HCI designers start to experiment with 'real users'
So we have a time dimension……..

18 Time Dimension for Usability Factors (2)
Initial performance
  Can they use the SW successfully within the 1st 20 mins?
  Does it look as if they will be able/want to progress satisfactorily from here?
Long-term performance
  Not always commercially realistic
  Judged after 'finished' SW is released (e.g. beta)
  Over time queries/gripes start coming in from users

19 Time Dimension for Usability Factors (3)
First impression
  Very subjective judgement needed from user
  'Did you like it?' sort of questions
  May need knowledge of the computing experience of users who are asked such questions
Long-term user satisfaction

20 Measuring Instrument
Method for providing values for a particular usability factor in a consistent manner
  Ask each user in the test the same things
  Make sure each user performs the same tasks
Ideally to obtain quantitative (numeric) data for each usability factor, e.g.
  timing tasks
  counting tasks done in a specified time
  counting errors made
  etc.

21 Measuring Instrument: Objective/Subjective
Objective/subjective measures equally important
Objective
  Observable user performance
  Often associated with a benchmark test - involves a typical (normally simple and frequent) task a user will perform
Subjective
  User opinion: individual comments difficult to analyze/compare
  Usually associated with a questionnaire: rate factors on a scale of e.g. 1 to 5

22 Usability Specification Process: Defining Usability through Metrics
Duration metrics
  Time taken to carry out specified tasks
Count measures
  How many times something happens, OR how many activities it takes to complete tasks, e.g. errors made in a given time
Proportion completed
  %, e.g. of the no. of documents the user was able to process in a given time
Quality
  Difficult to quantify; may have to rely on rating scales
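A minimal sketch of how these metric types might be computed from one logged evaluation session; the session values below are invented for illustration.

```python
# Illustrative session data for the four metric types.
session = {
    "task_times_secs": [95, 110, 87],   # duration metric: time per benchmark task
    "errors": 4,                        # count metric: errors made in the session
    "documents_attempted": 10,
    "documents_completed": 8,           # proportion-completed metric
    "satisfaction_ratings": [4, 3, 5],  # quality: quasi-quantified on a 1-5 scale
}

mean_task_time = sum(session["task_times_secs"]) / len(session["task_times_secs"])
completion_pct = 100 * session["documents_completed"] / session["documents_attempted"]
mean_rating = sum(session["satisfaction_ratings"]) / len(session["satisfaction_ratings"])

print(f"Mean task time: {mean_task_time:.1f} s")
print(f"Errors in session: {session['errors']}")
print(f"Documents completed: {completion_pct:.0f}%")
print(f"Mean satisfaction rating: {mean_rating:.1f}/5")
```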

23 Usability Specification Process: Definition
Set and agree planned levels of metrics
  The team should try to agree on what would be acceptable as a satisfactory level of performance in each of the usability measurements
Analyse the impact of alternative design solutions
  The designers might produce alternative forms of interface (e.g. command line vs GUI, menu vs buttons) and try to find out which style the users prefer, or can work most effectively with
Incorporate user-derived feedback
Iterate until planned levels are achieved

24 Sample Rows from a Usability Specification Table for DTP Package
Task | Issue | Value to be measured | Current level | Worst acceptable level | Planned target level | Best possible level
Installation | Installation task per benchmark number 1 | Length of time to successfully install | Many can't install | 90 mins | 30 mins | 20 mins
Initial performance | Set a tab | Number of errors | 3 errors | 3 errors | 2 errors | 0 errors
Initial performance | Delete a tab | Length of time on first trial | 6 secs | 6 secs | 4 secs | 2 secs
First impression | Questionnaire | Average score (range 1-5) | ?? | | |

25 Sample Rows from a Usability Specification Table for DTP Package
Task | Issue | Value to be measured | Current level | Worst acceptable level | Planned target level | Best possible level
Installation | Installation task per benchmark number 1 | Length of time to successfully install | Many can't install | 90 mins | 30 mins | 20 mins
The levels reflect performances that should be achievable by novice, average and expert users
Use the spec to design test procedures: observe users carrying out tasks, time them, and see how the results measure up to the targets
If they fail to meet the targets, the system may need refinement
Levels are determined by 'informed guesswork', based on the designers' knowledge of existing systems/typical users

26 Current/Worst Acceptable Levels
Current level
  Present level of the value to be measured - not known at the outset
  'Guess' by looking at the level achieved by users of a competitive system/earlier versions
  Possibly from a manual system/prototype
Worst acceptable level
  Not the worst possible, but the worst that is acceptable
  Related to use by novices; a pessimistic designer's view
  Should equal/better the current level
  If the observed value of an attribute does not meet this level, the system is formally unacceptable

27 Planned Target/Best Possible Levels
Planned target level
  The 'what you would like' level
  Set higher than the current level (if known)
  Match/exceed competitor's product
  Attributes not yet at this level focus development effort
Best possible level
  Realistic state-of-the-art upper limit
  Designers would be satisfied with something less
  Indicates room for improvement in future versions
  Assumes expert use/best design/best available

28 Advantages of Usability Specifications
Part of the management/control mechanisms for the iterative refinement process
Defines a quantitative end to a (potentially endless) process
Allows clear assessment of usability during iterative prototyping cycles
Identifies the data to be collected; avoids gathering unusable/irrelevant data
Objectively identifies weaknesses for further design effort

29 Disadvantages of Usability Specifications
Measures of specific user actions in specific situations
No reliable technique for setting usability specifications
Much depends on the judgement/knowledge of the design team in choosing benchmark tasks/performance levels
Different tasks/user groups need different usability specifications

30 Definition of Evaluation
Gathering info about the usability/potential usability of a system
'Evaluation is concerned with gathering data about the usability of a design or product by a specified group of users for a particular activity within a specified environment or work context' [Preece et al, 1994]
Essential element of the design/development process
If designers skimp on it they regret it later
Someone eventually will evaluate their product, i.e. the users!

31 Motivation for Evaluation
Suggest improvements to the design
Confirm that the SW meets all of the functional/usability specifications
Confirm acceptability of the interface and/or supporting materials
Compare alternative designs to determine the 'best'
Ensure that it meets the expectations of customers - don't damage reputation
Match/exceed usability of competitor's products/earlier version
Ensure that it complies with any statutory requirements (e.g. EU)

32 MUSiC Project - Metrics for Usability Standards in Computing (1)
Early 1990s - European survey: MUSiC
Showed evaluation was not always as thorough as it might be
Surveyed 80 European companies/institutions to determine how well usability evaluation had been adopted
Findings:
  Most companies had a high appreciation of the importance of usability evaluation
  Limited knowledge of evaluation methods

33 MUSiC Project - Metrics for Usability Standards in Computing (2)
Major problems: lack of…
  metrics, e.g. no standards or benchmarks against which usability might be measured
  money (and time), e.g. evaluation might have been given a low priority when budgeting for projects, or tagged on at the end
Commercial pressure to go to market
Most were aware that users wanted SW that was intuitive/easy to learn, without the need for manuals/time-consuming training

34 Reasons why Interface Evaluation is Often Omitted or Poorly Performed
Designers assume their own personal behaviour is 'representative' of the average user
  If they can easily use it they assume everyone can
Unsupported assumptions about human performance
Acceptance of traditional/standard interface design
  Assume style guides ensure good software; however a style for one product may not suit another product
Postponement of evaluation until a 'more convenient time'
Poor knowledge of evaluation techniques
Lack of expertise in analysing experiments

35 What to Evaluate: Usability (some measured, some subjective)
Test against usability specifications at all lifecycle stages; refine until specifications are met/exceeded
Compare/evaluate initial designs pre-implementation, before the 1st prototype
Evaluate prototypes to compare alternatives and test efficiency of design at various stages
Test the final(?) implementation of the SW system
Evaluate documentation

36 Formative Evaluation: help in 'forming' and reforming the product during iterative development of a prototype
Carried out during the development period
Integral part of the development process
Purpose: to support iterative refinement
Nature: structured, but fairly informal
Average of 3 major 'design-test-redesign' cycles, with many minor cycles to check minor changes
The earlier poor design features or errors are detected, the easier and cheaper they are to correct

37 Summative Evaluation: evaluate the 'summation' of all development effort
Done once, after implementation
Purpose: quality control - review the product to check it meets
  Its own functional & usability specifications (is robust/reliable)
  Prescribed standards, e.g. Health and Safety, ISO
Nature: formal, often involving statistical analysis
Can be costly & time-consuming
Alternative to setting up a formal experimental environment is field/'beta' testing
  Release a 'beta' and wait for real users to give feedback

38 Where to Evaluate
Designer's mind: designers continually ask questions about the SW, making checklists
Discussion workshops: teams of 5-12 people, e.g. designers, programmers, marketing managers, users; discuss the product at early stages/throughout development to assess progress and make improvement suggestions
Representative workplace: e.g. a company developing an office application might try it out with its own admin staff
Experimental laboratory: typical test subjects 'hired' to try out the SW in a specially set up room where they are observed/tested formally
The last two are expensive/difficult to organise, so not always possible

39 How to Evaluate: General Evaluation Methods
Method | Interface development stage | User involvement
Analytic | Specification | No users
Expert | Specification or prototype | No users (role playing only)
Observational | Simulation or prototype | Real users
Survey | Simulation or prototype | Real users
Experimental | Normally full prototype | Real users
(Observational, Survey and Experimental are the empirical methods; costs generally increase down the table)

40 How to Evaluate: Analytic and Expert Evaluation Methods
Quick, cheap; 1 person at the initial stages, helping to devise the specification for the product
Analytic: usually done by designers making assumptions about user behaviour
Expert: examining the specification (or prototype) while playing the role of the user

41 How to Evaluate: Empirical Methods
Usually involve 'real' users (or an expensive 'expert')
Difficult to prepare/organise; expensive to implement
Observational: watching users with a simulation/prototype
Survey: questionnaires with real users
Experimental: real users working with a full prototype

42 Types of Data: Evaluation can result in collection of quantitative or qualitative data
Quantitative data
  'Objective' measures of certain factors by direct observation, e.g. time to complete certain tasks, accuracy of recall, number of errors made
  User performance or attitudes can be recorded in numerical form (rating scale)
Qualitative data
  'Subjective' responses - opinions/attitudes rather than measurements
  Reports and opinions that may be categorized in some way but not reduced to numerical values
  More difficult to analyse - how the user goes about tasks

43 Measurement Tools: ways of collecting/gathering data, mostly used for empirical evaluation (1+ user)
Semi-structured interview
Questionnaire - personal/postal administration
Incident diary
Feature checklist
Focus group
Think-aloud
Interactive experiment
Compare on: cost, number of subjects

44 How to Evaluate: General Evaluation Methods
Method | Interface development stage | User involvement
Analytic | Specification | No users
Expert | Specification or prototype | No users (role playing only)
Observational | Simulation or prototype | Real users
Survey | Simulation or prototype | Real users
Experimental | Normally full prototype | Real users
(Observational, Survey and Experimental are the empirical methods; costs generally increase down the table)

45 Analytic Evaluation
Advantages:
  Usable early in design
  Little/no advance planning
  Cheap/quick
  Doesn't involve collecting data from subjects
  Can be done by the designer
Disadvantages:
  Designer is trying to guess behaviour, based on the designer's knowledge/experience of how users perform
  Focuses on the current state of the interface and tries to predict behaviour; the evaluator will not consider possible alternatives, so it encourages strengthening of the existing solution
  Broad assumptions about users' cognition
  Can be difficult for the evaluator

46 Analytic Evaluation: Heuristic Evaluation
'Heuristic': a 'rule of thumb', general rule or design principle, possibly derived from general HCI guidelines used in system design, e.g. prevent errors, provide feedback
A number of reviewers go through the product, screen by screen, and evaluate it against the heuristics, looking for problems
Can be economical/cheap in discovering major problems
Studies show 5 reviewers can find about 75% of the problems that are found by 15 reviewers (Nielsen)
Studies show it is best done by several experts
The most widely adopted usability evaluation method

47 Heuristic Evaluation
Heuristic = search strategy
Search for potential usability problems in the design
Can't tell you what's right/wrong - you make interpretations/judgements
Checklist to remind you of the need to look for something
Various sets of heuristics by advocates of user-centred design…
  Norman's 7 principles
  Shneiderman's 8 golden rules of dialogue design
  Nielsen's 10 heuristics
  Tognazzini's interaction design principles

48 Nielsen’s 10 Heuristics Visibility of system status
Match between systems / real world User control / freedom Consistency / standards Error prevention Helping users recognise, diagnose, recover from errors Recognition rather than recall Flexibility / efficiency of use Aesthetic / minimalist design Help / documentation

49 Analytic Evaluation: Keystroke Level Model
Best-known analytic evaluation technique
Carried out by the system developer
A simple way of estimating how long it would take an expert to carry out a specified task (with no errors)
Proven to be quite accurate (usually within 20%)
Basis: any task to be performed can be broken down into smaller operations carried out in sequence
Add together the times for each of the smaller operations to estimate the whole task time

50 KLM Constants
Gives a rough idea of how long a given task should take; provides a target for designers to aim for
Averages - modify to suit
Operator | Meaning | Time (secs)
K | Press key | 0.12 (good typist) to 1.2 (poor typist)
B | Mouse button press | 0.1 (down or up), 0.2 (click)
P | Point with mouse | ~1.1 (Fitts' law: k log2(Distance/Size + 0.5))
H | Hand to keyboard or mouse | 0.40
M | Mental preparation for physical action | 1.35 (1 M per 'chunk')
R | System response time | Measure
NB: Not 'accurate' figures, as users vary considerably in e.g. typing speeds
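A minimal sketch of a KLM estimate, using commonly cited average operator times; the operator values and the example task encoding are assumptions for illustration, not figures from this lecture.

```python
# Keystroke-Level Model estimate: sum the operator times for an
# error-free expert performance of a task.
KLM_TIMES = {
    "K": 0.28,   # press a key (average typist)
    "B": 0.20,   # mouse click (button down + up)
    "P": 1.10,   # point with the mouse (Fitts' law average)
    "H": 0.40,   # move hand between keyboard and mouse
    "M": 1.35,   # mental preparation for a 'chunk' of actions
    "R": 0.0,    # system response time - measure for the real system
}

def klm_estimate(operators):
    """Return the predicted time (secs) for a sequence of KLM operators."""
    return sum(KLM_TIMES[op] for op in operators)

# Example: reach for the mouse, think, point at a menu, click it,
# think, point at a menu item, click it.
task = ["H", "M", "P", "B", "M", "P", "B"]
print(f"Estimated task time: {klm_estimate(task):.2f} s")   # about 5.7 s
```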

51 GOMS: Goals, Operators, Methods, Selection
Design evaluation model which predicts user performance
Used to filter particular design options
Break tasks into components and predict performance times
Result = prediction of the time to perform the task optimally
Assumes:
  Humans act rationally to achieve goals
  Users know which actions to perform
  Performance is error-free, with no allowance for problem-solving behaviour (users are experts who never make mistakes!)

52 GOMS
Goals = objectives, e.g. locate spelling mistake
Operators = actions to change the system/user's cognitive state, e.g. use menu to locate spell checker
Methods = descriptions of procedures for achieving goals, stored in memory (part of knowledge from learning), e.g. knowledge of ways to locate spelling mistakes in a particular word processor
Selection rules = IF..THEN rules that allow the user to choose between methods, e.g. IF remember shortcut key THEN use it ELSE use menu
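A small, hypothetical sketch of the selection-rule idea in code; the method names and the predicted times (rough KLM-style sums) are invented for the example.

```python
# GOMS-style selection between two methods for the same goal
# ("invoke the spell checker"), with illustrative predicted times.
METHOD_TIMES = {                                   # predicted times (secs)
    "shortcut_key": 1.35 + 0.28,                   # M + K
    "menu": 1.35 + 1.10 + 0.20 + 1.10 + 0.20,      # M + P + B + P + B
}

def select_method(user_remembers_shortcut: bool) -> str:
    """Selection rule: IF the user remembers the shortcut THEN use it,
    ELSE fall back on the menu."""
    return "shortcut_key" if user_remembers_shortcut else "menu"

def predict_time(user_remembers_shortcut: bool) -> float:
    return METHOD_TIMES[select_method(user_remembers_shortcut)]

print(f"Expert with shortcut: {predict_time(True):.2f} s")
print(f"User via the menu:    {predict_time(False):.2f} s")
```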

53 Analytic Evaluation: Cognitive Walkthrough
Carried out by a designer/expert user simulating the user
The 'expert' simulates user actions/reactions
Suits systems primarily learned by exploration, e.g. walk-up-and-use systems such as ATMs and ticket machines, where the user has to learn how to operate them by 'exploring' the interface
Designer must make sure that it is obvious which actions should be taken at each stage, and that sufficient/clear feedback shows when correct actions have been taken
Overall question: how successfully does this design guide the unfamiliar user through the performance of the task?

54 Analytic Evaluation: Cognitive Walkthrough: Key Questions
Select a task and break it down into subtasks
'Walk through' these stages in sequence, asking e.g. 3 key questions for each:
  Will it be obvious to the user what action to take next?
  Will the user correctly interpret the description given for the correct action, and connect it with the function they are trying to achieve?
  Will the system give feedback that the user will interpret correctly - will the user know if he/she has made the right or wrong choice?
Record results using a checklist; 'No' answers indicate a need for improvement
Predict user behaviour and design problems
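One simple way to record the walkthrough is per subtask against the three questions. This is an illustrative sketch only (the data structure and the ATM subtasks are invented), not a standard tool.

```python
from dataclasses import dataclass

QUESTIONS = (
    "Will it be obvious to the user what action to take next?",
    "Will the user correctly interpret the action description and connect "
    "it with the function they are trying to achieve?",
    "Will the system's feedback let the user know whether the choice was right?",
)

@dataclass
class WalkthroughStep:
    subtask: str
    answers: list        # one "yes"/"no" per question above
    notes: str = ""

steps = [
    WalkthroughStep("Insert card into ATM", ["yes", "yes", "yes"]),
    WalkthroughStep("Choose 'cash with receipt'", ["no", "yes", "no"],
                    notes="Option label unclear; no confirmation shown"),
]

# Any 'no' answer flags a place where the design needs improvement.
for step in steps:
    for question, answer in zip(QUESTIONS, step.answers):
        if answer == "no":
            print(f"Problem in '{step.subtask}': {question} ({step.notes})")
```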

55 How to Evaluate: General Evaluation Methods
Method | Interface development stage | User involvement
Analytic | Specification | No users
Expert | Specification or prototype | No users (role playing only)
Observational | Simulation or prototype | Real users
Survey | Simulation or prototype | Real users
Experimental | Normally full prototype | Real users
(Observational, Survey and Experimental are the empirical methods; costs generally increase down the table)

56 Expert Evaluation: call in an expert to evaluate when the prototype reaches a certain stage
Advantages:
  Very useful for picking up major design flaws
  Gives an overview of the whole interface
  Few resources needed (except for the experts)
  Cheap (providing the experts' fees aren't too expensive)
Disadvantages:
  Relies on role playing (restricting)
  Dependent on the expert's previous experience
  Subject to bias (expert may have different ideas to the developer)
  Problems locating experts
  Cannot capture real user behaviour
NB: 'Expert' is usually someone working in the Human Factors field or in GUI development

57 How to Evaluate: General Evaluation Methods
Method | Interface development stage | User involvement
Analytic | Specification | No users
Expert | Specification or prototype | No users (role playing only)
Observational | Simulation or prototype | Real users
Survey | Simulation or prototype | Real users
Experimental | Normally full prototype | Real users
(Observational, Survey and Experimental are the empirical methods; costs generally increase down the table)

58 Observational Evaluation
Someone observes a small group (3-6) of users as they work through specified benchmark tasks
Advantages:
  Quickly highlights difficulties
  Verbal protocols are a valuable source of info
  Can be used for rapid iterative development at any stage
  Qualitative data
Disadvantages:
  Observation can affect user activity/performance levels - test subjects may not then be acting like 'normal' users
  Analysis of data can be time/resource consuming
  Dependent on users being 'truthful' and not holding back for fear of embarrassment

59 Observational Evaluation cont.
Mostly collects qualitative data
Can be lab-based or carried out in the field of work
Collecting data: users' audible actions/comments are recorded by the observer using
  Think-aloud protocols or paper field notes
  Video/audio recording
  Software logging / scan converters
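As an illustration of software logging, here is a minimal, hypothetical sketch that timestamps each user action to a CSV file so the session can be replayed and analysed later; the event names and file name are invented for the example.

```python
import csv
import time

LOG_PATH = "session_log.csv"   # assumed output file for the session

def log_event(writer, event, detail=""):
    """Append one timestamped user action to the session log."""
    writer.writerow([time.time(), event, detail])

with open(LOG_PATH, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "event", "detail"])
    log_event(writer, "menu_open", "Format")
    log_event(writer, "command", "Set tab")
    log_event(writer, "error", "Tab placed at wrong position")
```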

60 Observational Evaluation cont.: Cooperative Evaluation
Views user and evaluator as collaborators/equals
User as collaborator, not just subject
User begins by thinking aloud; this can be recorded for later analysis
Evaluator can answer and ask questions
Evaluation session produces a protocol
Transcription and analysis can be time-consuming but rewarding

61 Observational Evaluation cont.: Usability Testing/Labs
Use of observational evaluation in a lab setting
Tests 'usability' and also…
  Legislative requirements
  Health and safety
Uncovers most serious and recurring problems
Useful for uncovering interpretation problems and execution errors
Provides a baseline against which other evaluations can themselves be evaluated
Depends on skilled testers and suitable metrics
Lacks the ability to fully take account of work context

62 Observational Evaluation cont.: Field Observation: Contextual Enquiry
Use of observational evaluation in the natural work setting
Use discussions/interviews to reveal the interpretations that users place on their own actions and behaviour
Like cooperative evaluation, but in the work setting, not the lab
Uses video/verbal protocol, not metrics

63 Survey Evaluation: uses the data collection methods of interviews and questionnaires
Interview: structured, flexible, prompted
  Specific interviewer characteristics, e.g. bias
Questionnaire: open or closed questions?
  Using a rating scale to give a quantitative aspect to responses?
Who will you get the data from?
  Test subjects need to have sufficient experience with the system to be able to operate it competently and answer questions about it
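A small illustrative sketch of turning closed, 1-5 rating-scale questionnaire responses into quantitative results; the responses below are made-up data.

```python
from statistics import mean, median

# Each inner list is one respondent's ratings for three closed questions,
# e.g. "The system was easy to learn" (1 = strongly disagree, 5 = strongly agree).
responses = [
    [4, 3, 5],
    [5, 4, 4],
    [2, 3, 3],
    [4, 4, 5],
]

for q in range(len(responses[0])):
    scores = [r[q] for r in responses]
    print(f"Question {q + 1}: mean {mean(scores):.2f}, median {median(scores)}")
```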

64 Survey Evaluation
Advantages:
  Addresses users' opinions and understanding of the interface
  Can be made to be diagnostic
  Can be applied to users and designers
  Questions can be tailored to the individual
  Rating scales lead to quantitative results
  Can be used on a large group of users
Disadvantages:
  User experience is important (respondents need enough experience of the system)
  Low questionnaire response rates (especially by post)
  Possible interviewer bias
  Possible response bias
  Analysis can be complicated and lengthy
  Interviews very time-consuming

65 Experimental Evaluation: carry out specified tasks in controlled (scientific) conditions
May use a mixture of methods, e.g. questionnaires/interviews to establish users' previous experience, and after the 'experiment' to find out their subjective judgements of the SW
Can test system performance in isolation OR in relation to an existing system
May use techniques such as
  Observation, e.g. timing of performance
  Talk-aloud / data-logging
Incurs costs of…
  Setting up controlled conditions
  Finding/paying subjects
  Running the experiment, analysing the results

66 Experimental Evaluation
Advantages:
  Powerful method (depending on the effects investigated)
  Quantitative data for statistical analysis
  Can compare different groups of users in the same experimental conditions
  Reliable and valid results
  Replicable a number of times
Disadvantages:
  High cost/resource demands
  Requires knowledge of experimental method
  Time spent on experiments can mean evaluation is difficult to integrate into the design cycle
  Tasks can be artificial and restricted
  Cannot always generalise to the full system in a typical working situation

67 Guidelines for Experimental Design
Decide measures of interest and hypotheses
Develop a set of representative (benchmark) tasks
Run a pilot study
Determine experimental design
Select sample(s) of typical subjects (size?) - need a large enough sample to make results statistically significant
Fine-tune experimental design to eliminate unwanted variables and effects
Explain to subjects and run experiment(s)
Collect objective and subjective data
Compute statistics and analyse
The measures of interest might be derived from the usability specification table (as discussed previously). Representative tasks are what we previously called 'benchmark' tasks; the purpose is to ensure all subjects are doing the same thing, so reliable comparisons can be made between individuals or groups. You might have to repeat the experiment a number of times, e.g. using different groups of similar users, or groups from different types of users. With any data collection system (experiment, survey, etc.) you must have a large enough sample to make any results statistically significant. As with anything termed an 'experiment', you must be able to control some of the variables in order to try to pinpoint the causes of any effects you observe. For example, you would carefully select the user subjects so as to eliminate any great variation in their experience or expertise before the experiment started; this might be done with a questionnaire, or some other form of pre-test. All subjects should be given precisely the same instructions, and be set the same tasks, in order to control these variables.
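A hedged sketch of the kind of statistical comparison such an experiment might end with: do users complete the benchmark task faster with design B than with design A? The timing data are invented, and a two-sample t-test is only one of several analyses that could be appropriate.

```python
from scipy import stats

# Task completion times (seconds) for two independent groups of subjects.
design_a = [61.2, 58.4, 72.9, 66.1, 59.8, 70.3, 64.5, 68.0]
design_b = [52.7, 55.1, 49.8, 58.3, 51.2, 54.6, 57.9, 50.4]

result = stats.ttest_ind(design_a, design_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Difference is statistically significant at the 5% level")
else:
    print("No significant difference detected - may need a larger sample")
```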

68 Validity in Data Collection
Whether or not the evaluation method is measuring what is required, given the specified purpose
Is the method valid for the particular purpose? E.g. if trying to find out users' attitudes to an interface, then expert review opinions are not valid - you need real users
Check the validity of measurements, e.g. performance times in KLM are only valid for expert, error-free user performance

69 Choice of Method
Evaluation has 2 main purposes:
  Formative: done throughout the development cycle to test ideas, find problems and find improvements
  Summative: when the system is more-or-less complete, to check it is ready for market
Different methods are most suited at different stages
  Early design - paper-based only - analytic/expert
  Prototype development - observational/experimental
  Late development - survey
A mix of objective/subjective measures is desirable
An example of a late survey method might be to include a questionnaire, or a contact address, along with a 'beta version' of the software and then invite comment, over a period of months, from people using the beta version. This sort of method is often used with software such as Web browsers and other programs that can be downloaded from the Internet.

70 RECAP: General Evaluation Methods
Method | Interface development stage | User involvement
Analytic | Specification | No users
Expert | Specification or prototype | No users (role playing only)
Observational | Simulation or prototype | Real users
Survey | Simulation or prototype | Real users
Experimental | Normally full prototype | Real users
(Observational, Survey and Experimental are the empirical methods; costs generally increase down the table)

71 Lecture Review
Features of usable systems
Standards/Guidelines
Definition/motivation for evaluation
Types of evaluation
Formative/summative methods
Analytic/Expert/Empirical methods
Qualitative / Quantitative data
What/where/how to evaluate?
Data types and measurement tools

