Last Update: 3 Oct 2018
This chapter is incomplete for the moment. Please also refer to the slides discussed in the course.
12.1 Definition
A definition of usability from ISO 9241-11:
[...] the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.
12.2 Designing a user study
12.2.1 Task and Trial
In a systematic user study you usually define one task or a number of tasks. Each completion of a task is called a trial. A session refers to the time span in which a single participant completes all trials, usually between 15 minutes and one hour. The session includes briefing, training and debriefing.
For example, in a typical Fitts' Law study the user is presented with a dot on a screen and has to drag this dot (with his/her finger) to a highlighted target area as quickly as possible. This is the only task but there are many different trials with different positions and sizes of the target area.
Another example is evaluating a new web shop interface. Here, the user can have a series of (possibly consecutive) tasks, e.g. "Find the blue hand bag of brand X and put it into your shopping cart" or "Complete your purchase".
It is important to clearly define the end state of a task, e.g. to decide when to measure the time.
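For the Fitts' law example above, one record per trial might be logged as in the following minimal Python sketch (all names hypothetical). The index of difficulty ID = log2(D/W + 1) is the standard Shannon formulation relating target distance D and width W, which the analysis later relates to movement time:

```python
import math

# Hypothetical sketch: one record per trial in a Fitts' law study.
# ID = log2(D/W + 1) quantifies the difficulty of reaching a target
# of width W at distance D; the analysis relates it to movement time.
def trial_record(participant, distance, width, completion_time_s):
    index_of_difficulty = math.log2(distance / width + 1)
    return {
        "participant": participant,
        "distance_px": distance,
        "width_px": width,
        "ID_bits": index_of_difficulty,
        "time_s": completion_time_s,
    }

# Example: a 320 px movement to a 32 px wide target completed in 0.8 s.
print(trial_record("P01", distance=320, width=32, completion_time_s=0.8))
```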
12.2.2 Planning a Session
A session should not take too long, to avoid negative effects on motivation and concentration that can distort the resulting data. You can include breaks in which the user can recover and in which you can remind him/her to perform the tasks as quickly/precisely etc. as possible.
A single session has the following steps, each of which you should carefully plan beforehand. Any deviation from your plan may make your experiment invalid.
- Arrival
- Briefing (instructions)
- Pre-session questionnaire [optional]
- Training [optional]
- Task execution (can have multiple rounds, includes breaks and intermediate questionnaires)
- Post-session questionnaire [optional]
- Debriefing (e.g. payment)
Arrival
Welcome the user and make sure that the environment is prepared and without distractions.
Briefing/Instructions
Some points to consider for the instructions:
- Make sure that every participant has exactly the same instructions. Written instructions are usually a good idea.
- Try to be brief but precise.
- Make decisions about the priorities in task execution and include these in the instructions: should the user prioritize speed, precision, or the avoidance of errors?
Pre-session questionnaire
Usually, it is important to ask the user for background information:
- age
- gender
- left- or right-handed
- prior experience with relevant technologies
- education/profession
Part of this information must be reported in a publication (e.g. the gender proportion and the age range). Other information may be used in the analysis to remove participants (e.g. with too little or too much prior experience) and to perform a comparative analysis (female vs male, experts vs novices).
Training
Depending on the complexity of your interface it can make sense to give the user some time to get used to the system before you start measuring performance. Obviously, this does not apply if your interface is quite common (e.g. website testing) or if the whole point of your study is learnability.
12.2.3 Dependent vs. independent variables
Dependent variable: what you measure
- task completion time
- number of errors
- satisfaction rating
Independent variable: what you manipulate, i.e. what you compare (a combined example follows this list)
- prototype A vs. prototype B
- gender: men vs. women
- level of expertise: novice vs. intermediate vs. expert
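A hypothetical sketch of such a record, combining the two kinds of variables in one row per trial:

```python
# Hypothetical sketch: one row of study data per trial, with the
# independent variables as factors and the dependent variables as
# the measurements compared across their levels.
trial_row = {
    # independent variables (manipulated / compared)
    "prototype": "A",            # prototype A vs. prototype B
    "gender": "female",          # men vs. women
    "expertise": "novice",       # novice / intermediate / expert
    # dependent variables (measured)
    "completion_time_s": 14.2,   # task completion time
    "errors": 1,                 # number of errors
    "satisfaction_1to5": 4,      # satisfaction rating
}
print(trial_row)
```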
12.2.4 Types of data
- Nominal data
  - unordered categories (apple, banana, orange...)
- Ordinal data
  - ordered categories (low, medium, high)
- Interval data
  - distances meaningful (temperature, distance...)

The sketch below shows which summary statistics each type permits.
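```python
import statistics

# Sketch (values hypothetical): the scale type limits which
# summary statistics are meaningful.
fruit = ["apple", "banana", "apple", "orange"]        # nominal
levels = ["low", "medium", "high", "medium", "high"]  # ordinal
temps_c = [21.0, 22.5, 19.8]                          # interval

# Nominal: only frequencies and the mode are meaningful.
mode_fruit = statistics.mode(fruit)

# Ordinal: the median is fine (an order exists); a mean is not,
# because distances between categories are undefined.
rank = {"low": 0, "medium": 1, "high": 2}
median_level = sorted(levels, key=rank.get)[len(levels) // 2]

# Interval: distances are meaningful, so means and differences are valid.
mean_temp = statistics.mean(temps_c)

print(mode_fruit, median_level, mean_temp)
```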
12.2.5 Within-subjects vs. between-subjects
Within-subjects study:
- subject X's performance on P1, P2
- subject X compares versions A, B, ...

Between-subjects study (group assignment sketched below):
- group X performs on P1, group Y on P2
- groups X, Y, ... rate versions A, B, ...
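A minimal sketch (participant IDs and group names hypothetical) of random group assignment for a between-subjects design:

```python
import random

# Between-subjects: each participant is randomly assigned to exactly one
# condition. In a within-subjects design every participant would instead
# experience all conditions (in counterbalanced order, see 12.2.6).
participants = [f"P{i:02d}" for i in range(1, 13)]
random.seed(42)            # fixed seed only to make the sketch reproducible
random.shuffle(participants)
groups = {
    "prototype_A": participants[:6],
    "prototype_B": participants[6:],
}
print(groups)
```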
12.2.6 Counterbalancing
Problem: the order in which tasks are performed can impact the results
- usually because of increasing experience (learning effects)
- could task X's results look the way they do simply because it always occurred at position 2?

Solution: counterbalance
- every task must appear at every position the same number of times, e.g. via a Latin square (sketched below)
- NB: more constraints may be necessary => pseudo-random orders
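A sketch of one common construction, the balanced Latin square: every condition appears at every position equally often, and (for an even number of conditions) every condition is preceded by every other condition equally often:

```python
# Balanced Latin square (standard construction): first row
# 0, 1, n-1, 2, n-2, ..., then each further row shifts by 1 (mod n).
# For odd n, the reversed rows are appended to restore balance.
def balanced_latin_square(conditions):
    n = len(conditions)
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    rows = [[(c + i) % n for c in first] for i in range(n)]
    if n % 2 == 1:
        rows += [list(reversed(row)) for row in rows]
    return [[conditions[c] for c in row] for row in rows]

# Each row is the task order for one participant (or participant group).
for order in balanced_latin_square(["A", "B", "C", "D"]):
    print(order)
```

With four conditions this yields four orders (A B D C / B C A D / C D B A / D A C B), so the number of participants should be a multiple of four.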
12.3 Objective Measures
Here are five frequently used metrics to measure the usability of a system:
- Task success (effectiveness)
- Time-on-task
- Errors
- Efficiency
- Learnability
12.3.1 Task success (effectiveness)
Can the task be achieved at all?
12.3.2 Time-on-task
How long does it take?
12.3.3 Errors
How many errors occur? How safe is the interface?
12.3.4 Efficiency
How many successful task completions per unit of time?
12.3.5 Learnability
How easily/quickly do I learn to use the interface?
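A combined sketch over a hypothetical trial log showing how the first four measures fall out directly; learnability would compare the same numbers across repeated blocks or sessions:

```python
# Hypothetical trial log for one condition.
trials = [
    {"success": True,  "time_s": 12.4, "errors": 0},
    {"success": True,  "time_s": 10.1, "errors": 2},
    {"success": False, "time_s": 30.0, "errors": 5},
]

n = len(trials)
successes = sum(t["success"] for t in trials)
total_time = sum(t["time_s"] for t in trials)

task_success = successes / n          # effectiveness: share of solved trials
time_on_task = total_time / n         # mean seconds per trial
error_rate = sum(t["errors"] for t in trials) / n   # mean errors per trial
efficiency = successes / total_time   # successful completions per second

print(task_success, time_on_task, error_rate, efficiency)
```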
12.4 Subjective User Experience
It is hard to measure the user's subjective impression of a system: how intuitive or natural was the interface, and how much fun did he or she have using it?
What methods are there to elicit this kind of information in such a way that we can analyze the resulting data?
12.4.1 Questionnaire
- Directly ask users about their experience with a system
- Reveals users' perception of the system
How to ask
- Open question, e.g. "How was it?"
  - interesting but hard to analyze
  - be specific: "What did you find confusing?" vs. "Comment on the interface"
- Rating question, e.g. on a scale of 1…5
  - Likert scale
  - Semantic differential scale
Rating with a Likert scale
- Present a statement (not a question) e.g. "The graphical interface was easy to understand."
- For each statement offer a 5-point agreement scale on which the user responds (numeric coding sketched after the list):
- strongly disagree
- disagree
- neither agree nor disagree
- agree
- strongly agree
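A minimal sketch of coding such responses numerically for analysis; note that the resulting codes are ordinal data (see 12.2.4), so medians and frequency tables are the safe summaries:

```python
# Map 5-point Likert labels to ordinal codes (a common convention).
LIKERT5 = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}
responses = ["agree", "strongly agree", "disagree", "agree"]
coded = [LIKERT5[r] for r in responses]
print(coded)  # [4, 5, 2, 4]
```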
Rating with a Semantic differential scale
- Present an item stem like "The graphical interface was ..."
- Offer a pair of opposing (bipolar) adjectives, e.g.
  - easy to understand ... hard to understand
- Insert a number of check boxes in between (e.g. five steps)
Standardized questionnaires
There are a number of standardized questionnaires. It is highly advisable to use them, or at least to study them to learn about good wording and response formats (a SUS scoring sketch follows the list):
- System Usability Scale (SUS)
- Computer System Usability Questionnaire (CSUQ)
- Questionnaire for User Interface Satisfaction (QUIS)
- Usefulness, Satisfaction, and Ease of Use (USE)
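As an example of how such instruments are scored, here is a sketch of the standard SUS computation: ten items rated 1 to 5, odd (positively worded) items contribute their score minus 1, even (negatively worded) items contribute 5 minus their score, and the sum is multiplied by 2.5 to give a 0 to 100 score:

```python
# Standard SUS scoring for ten answers, each on a 1..5 scale.
def sus_score(answers):
    assert len(answers) == 10
    total = 0
    for i, a in enumerate(answers, start=1):
        total += (a - 1) if i % 2 == 1 else (5 - a)  # odd: a-1, even: 5-a
    return total * 2.5  # scale the 0..40 sum to 0..100

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # 85.0
```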
12.4.2 Microsoft Product Reaction Cards
This is a method developed by Microsoft.
- Rater gets 118 cards with one adjective each (slow, fun, impressive, clear, useful...)
- some positive, some negative
- Rater picks cards that describe the system
- Then picks the top five cards and explains the choices (a tallying sketch follows)
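A minimal sketch (picks hypothetical) of tallying the selected cards across participants to see which adjectives dominate:

```python
from collections import Counter

picks = [
    ["fun", "clear", "useful", "slow"],   # participant 1
    ["clear", "useful", "impressive"],    # participant 2
    ["slow", "clear", "fun"],             # participant 3
]
tally = Counter(word for cards in picks for word in cards)
print(tally.most_common(5))  # [('clear', 3), ('fun', 2), ('useful', 2), ...]
```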