Last Update: 3 Oct 2018
This chapter is incomplete for the moment. Please also refer to the slides that we talk about in the course.
A definition of usability from ISO 9241-11:
[...] the extent to which a product can be used by specified users to achieve specific goals with effectiveness, efficiency and satisfaction in a specified context of use.
12.2 Designing a user study
12.2.1 Task and Trial
In a systematic user study you usually define one task or a number of tasks. Each completion of a task is called a trial. A session is the time span in which a single participant completes all trials, usually between 15 minutes and one hour. The session includes briefing, training and debriefing.
For example, in a typical Fitts' Law study the user is presented with a dot on a screen and has to drag this dot (with his/her finger) to a highlighted target area as quickly as possible. This is the only task but there are many different trials with different positions and sizes of the target area.
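A minimal sketch of how the trials of such a study could be generated: one task, many trials that differ only in target distance and width. The concrete values, the function name `make_trials` and the use of the Shannon formulation of the index of difficulty, ID = log2(D/W + 1), are assumptions for illustration and are not taken from the course material.

```python
import itertools
import math
import random

def make_trials(distances, widths, repetitions, seed=0):
    """Return a shuffled list of trials, one per (distance, width) pair and repetition."""
    rng = random.Random(seed)  # fixed seed so the trial order is reproducible
    trials = []
    for distance, width in itertools.product(distances, widths):
        for _ in range(repetitions):
            trials.append({
                "distance_px": distance,
                "width_px": width,
                # Shannon formulation of the index of difficulty (in bits)
                "id_bits": math.log2(distance / width + 1),
            })
    rng.shuffle(trials)
    return trials

trials = make_trials(distances=[128, 256, 512], widths=[16, 64], repetitions=2)
print(len(trials))  # 3 distances x 2 widths x 2 repetitions = 12
```

Shuffling the list avoids presenting all easy or all hard targets in one block.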
Another example is evaluating a new web shop interface. Here, the user can have a series of (possibly consecutive) tasks, e.g. "Find the blue hand bag of brand X and put it into your shopping cart" or "Complete your purchase".
It is important to clearly define the end state of a task, e.g. so that you know exactly when to stop measuring the time.
12.2.2 Planning a Session
Keep a session short to avoid negative effects on motivation and concentration, which can distort the resulting data. You can include breaks in which the user can recover and in which you can remind him/her of the priorities of the task (e.g. to work as quickly or as precisely as possible).
A single session has the following steps, each of which you should carefully plan beforehand. Any deviation from your plan may make your experiment invalid.
- Briefing (instructions)
- Pre-session questionnaire [optional]
- Training [optional]
- Task execution (can have multiple rounds, includes breaks and intermediate questionnaires)
- Post-session questionnaire [optional]
- Debriefing (e.g. payment)
Welcome the user and make sure that the environment is prepared and without distractions.
Some points to consider for the instructions:
- Make sure that every participant has exactly the same instructions. Written instructions are usually a good idea.
- Try to be brief but precise.
- Decide on the priorities in task execution and include them in the instructions: should the user prioritize speed, precision or avoidance of errors?
Usually, it is important to ask the user for background information:
- left- or right-handed
- prior experience with relevant technologies
Part of this information must be reported in a publication (e.g. the gender proportion and the age range). Other information may be used in the analysis to remove participants (e.g. with too little or too much prior experience) and to perform a comparative analysis (female vs male, experts vs novices).
Depending on the complexity of your interface it can make sense to give the user some time to get used to the system before you start measuring performance. Obviously, this does not apply if your interface is quite common (web site testing) or if the whole point of your study is learnability.
12.2.3 Dependent vs. independent variables
Dependent variable: what you measure
- task completion time
- number of errors
- satisfaction rating
Independent variable: what you manipulate (what you compare)
- prototype A vs. prototype B
- gender: men vs. women
- level of expertise: novice vs. intermediate vs. expert
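As a rough sketch of how this distinction shows up in the data, the trial log below stores one independent variable (`prototype`) and two dependent variables (`time_s`, `errors`) per record. All field names, values and the helper `mean_by_condition` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial log: "prototype" is the independent variable,
# "time_s" and "errors" are dependent variables.
trials = [
    {"participant": 1, "prototype": "A", "time_s": 12.0, "errors": 1},
    {"participant": 1, "prototype": "B", "time_s": 9.0, "errors": 0},
    {"participant": 2, "prototype": "A", "time_s": 14.0, "errors": 2},
    {"participant": 2, "prototype": "B", "time_s": 11.0, "errors": 1},
]

def mean_by_condition(trials, independent, dependent):
    """Average a dependent variable separately for each level of an independent variable."""
    groups = defaultdict(list)
    for trial in trials:
        groups[trial[independent]].append(trial[dependent])
    return {level: mean(values) for level, values in groups.items()}

print(mean_by_condition(trials, "prototype", "time_s"))  # {'A': 13.0, 'B': 10.0}
```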
12.2.4 Types of data
- Nominal data
- unordered categories (apple, banana, orange...)
- Ordinal data
- ordered categories (low, medium, high)
- Interval data
- distances meaningful (temperature, distance...)
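The type of data determines which summary statistic is meaningful. The sketch below uses made-up sample values and picks the mode for nominal data, the median for ordinal data and the mean for interval data; this choice of statistics is a common convention and an assumption here, not something prescribed by the text above.

```python
from statistics import mean, median_low, mode

fruit = ["apple", "banana", "apple", "orange"]          # nominal
workload = ["low", "high", "medium", "medium", "high"]  # ordinal
temperature_c = [20.0, 22.0, 21.0]                      # interval

# Nominal: only counting makes sense, so report the most frequent category.
most_common_fruit = mode(fruit)

# Ordinal: map categories to ranks and take the median (not the mean).
# median_low always returns an actual data point, so it maps back to a label.
ranks = {"low": 1, "medium": 2, "high": 3}
labels = {rank: label for label, rank in ranks.items()}
typical_workload = labels[median_low(ranks[w] for w in workload)]

# Interval: differences are meaningful, so the mean is a valid summary.
mean_temperature = mean(temperature_c)

print(most_common_fruit, typical_workload, mean_temperature)  # apple medium 21.0
```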
12.2.5 Within-subjects vs. between-subjects
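Within-subjects: every subject takes part in all conditions
- subject X's performance on P1, P2
- subject X compares versions A, B, ...
Between-subjects: every group takes part in only one condition
- group X performs on P1, group Y on P2
- groups X, Y, ... rate versions A, B, ...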
Problem: Order of task performance can impact results
- usually because of increasing experience
- could task X have been performed this way only because it always occurred at position 2?
- solution (counterbalancing): every task must appear at every position the same number of times
- NB: more constraints may be necessary => pseudo-random orders
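The requirement that every task appears at every position the same number of times can be met with a Latin square. The following is a minimal sketch that simply rotates the task list; the function names and task labels are hypothetical.

```python
def latin_square_orders(tasks):
    """Return one task order per row; row i is the task list rotated by i positions."""
    n = len(tasks)
    return [[tasks[(row + pos) % n] for pos in range(n)] for row in range(n)]

def order_for_participant(tasks, participant_index):
    """Cycle through the rows, so each order is used equally often
    when the number of participants is a multiple of len(tasks)."""
    orders = latin_square_orders(tasks)
    return orders[participant_index % len(orders)]

orders = latin_square_orders(["T1", "T2", "T3"])
for order in orders:
    print(order)
# ['T1', 'T2', 'T3']
# ['T2', 'T3', 'T1']
# ['T3', 'T1', 'T2']
```

A plain rotation balances positions only. It does not balance which task directly precedes which, which is one of the additional constraints that may call for other orderings.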
12.3 Objective Measures
Here are five frequently used metrics to measure the usability of a system:
- Task success (effectiveness)
- Time on task
- Errors
- Efficiency
- Learnability
12.3.1 Task success (effectiveness)
Can the task be achieved at all?
How long does it take?
How many errors occur? How safe is the interface?
How often per time unit does one succeed?
How easily/quickly do I learn to use the interface?
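Most of these questions can be answered directly from a trial log. The sketch below assumes a hypothetical record format with a success flag, a completion time and an error count per trial; the field names and the function `summarize` are illustrative only. Learnability is not covered here, since it requires comparing performance across repeated trials or sessions.

```python
# Hypothetical log of one participant's trials.
trials = [
    {"success": True,  "time_s": 30.0, "errors": 0},
    {"success": True,  "time_s": 45.0, "errors": 2},
    {"success": False, "time_s": 60.0, "errors": 3},
    {"success": True,  "time_s": 15.0, "errors": 0},
]

def summarize(trials):
    """Compute simple objective measures from a list of trial records."""
    n = len(trials)
    successes = sum(1 for t in trials if t["success"])
    total_time_s = sum(t["time_s"] for t in trials)
    return {
        "success_rate": successes / n,                            # can the task be achieved?
        "mean_time_s": total_time_s / n,                          # how long does it take?
        "total_errors": sum(t["errors"] for t in trials),         # how many errors occur?
        "successes_per_minute": successes / (total_time_s / 60),  # successes per time unit
    }

print(summarize(trials))
```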
12.4 Subjective User Experience
It is hard to measure the user's subjective impression of a system: how intuitive or natural was the interface or how much fun did he or she have using the system?
What methods are there to elicit this kind of information in such a way that we can analyze the resulting data?
- Directly ask users about their experience with a system
- Reveals users' perception of the system
How to ask
- Open question, e.g. how was it?
- interesting but hard to analyze
- be specific: "what did you find confusing" vs. "comment on the interface"
- Rating question e.g. on a scale of 1…5
- Likert scale
- Semantic differential scale
Rating with a Likert scale
- Present a statement (not a question) e.g. "The graphical interface was easy to understand."
- For each statement offer a 5-point scale of agreement on which the user has to answer:
- strongly disagree
- disagree
- neither agree nor disagree
- agree
- strongly agree
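To analyze such answers, the labels are usually coded as numbers. The sketch below assumes the common five labels and a coding from 1 to 5, and reports the answer distribution and the median, since Likert answers are ordinal data. The names `LIKERT` and `summarize_likert` are hypothetical.

```python
from collections import Counter
from statistics import median_low

# Assumed coding of the five agreement levels.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}

def summarize_likert(answers):
    """Return the answer distribution and the median score for one statement."""
    scores = [LIKERT[a] for a in answers]
    return Counter(answers), median_low(scores)

answers = ["agree", "strongly agree", "agree", "disagree", "neither agree nor disagree"]
distribution, median_score = summarize_likert(answers)
print(distribution["agree"], median_score)  # 2 4
```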
Rating with a Semantic differential scale
- Present an incomplete statement like "The graphical interface was ..."
- Offer a pair of opposing (bipolar) adjectives
- easy to understand ... hard to understand
- Insert a number of check boxes in-between (e.g. five steps)
There are a number of standardized tests. It is highly advisable to use these tests or to look at them and learn about good wording and answer methods:
- System Usability Scale (SUS)
- Computer System Usability Questionnaire (CSUQ)
- Questionnaire for User Interface Satisfaction (QUIS)
- Usefulness, Satisfaction and Ease of Use (USE)
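As an example of how a standardized test is evaluated, here is a sketch of the usual SUS scoring rule: ten statements rated from 1 to 5, where odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score between 0 and 100. The function name `sus_score` is hypothetical; check the rule against the original questionnaire before relying on it.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:   # odd items are positively worded
            total += response - 1
        else:                      # even items are negatively worded
            total += 5 - response
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # 85.0
```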
12.4.2 Microsoft Product Reaction Cards
This is a method developed by Microsoft.
- Rater gets 118 cards with one adjective each (slow, fun, impressive, clear, useful...)
- some positive, some negative
- Rater picks cards that describe the system
- Then picks the top 5 cards and explains
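One simple way to analyze the result is to count how many raters put each card into their top 5. The picks below are made up, and the words beyond those listed above are only illustrative; `card_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical top-5 picks of three raters.
top_picks = [
    ["clear", "useful", "fun", "slow", "impressive"],
    ["clear", "useful", "fun", "fast", "reliable"],
    ["clear", "useful", "slow", "confusing", "frustrating"],
]

def card_frequencies(picks_per_rater):
    """Count for each card how many raters selected it."""
    counts = Counter()
    for picks in picks_per_rater:
        counts.update(set(picks))  # count each card at most once per rater
    return counts

counts = card_frequencies(top_picks)
print(counts["clear"], counts["slow"], counts["impressive"])  # 3 2 1
```

The explanations that raters give for their picks are qualitative data and still have to be read and interpreted by hand.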