For more detail on basic
statistical concepts, see N&L Sections 10.1-10.7. In particular, you
should be very comfortable with these terms, i.e. be able to explain
what they mean in your own words, produce an example when given an
experiment scenario, or compute the statistic.
- population vs sample
- sample mean
- confidence level of a statistic
- external and internal validity of an
experiment
- independent and dependent experiment
variables
- nuisance variables
- normal distributions: what one is, how
you get one, and how you tell if you have one
- variance and standard deviation, as
describing populations and samples (know how to calculate)
- null and experimental hypotheses
- degrees of freedom of a statistic
You should also be familiar with the
t-test, covered in the remainder of N&L Chapter 10 and below.
two basic kinds of experiments
In this class, you'll very likely be doing
one of two kinds of statistical tests:
1. Comparing two means: for
example, the effect on a sample of Design A vs Design B
2. Comparing a single mean with a
reference value: for example, the performance of your design
relative to a design requirement
The two experiments are conducted in a
similar way, with one main difference: when you're comparing a single
mean to a reference value, you don't have an independent variable. That
is, you're not varying anything. When comparing two means, the two
means are obtained as a result of setting your independent variable to
two levels.
A different t-statistic is used to evaluate
the two cases (described below).
issues in experiment design
Randomization
To justify your assumption of normality
(and permit the application of the t-test), you need to collect a random
sample of your population. (The population you are sampling must
also be normally distributed, of course). There are several things you
can do to ensure that your sample is representative of the population
at large. The exact list depends on the specifics of your experiment,
but this should give you the idea in the context of your experiments:
- Selection of subjects: they
should be a representative sample of the population of interest at
large, in all parameters that might matter for your particular
experiment (e.g. age, educational background, handedness, physical
abilities, skill with the interface, visual acuity, ...)
- Order of application of "treatments"
to your subjects: e.g., to avoid learning effects, don't always
administer Design A and Design B in the same order (a minimal sketch of
randomizing order per subject follows this list).
- Details of your design which might
skew results: e.g. ordering of commands in a menu list (early items
are likely to be encountered sooner)
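For the treatment-order point above, here is a minimal Python sketch of randomizing the order per subject; the subject labels and design names are hypothetical:

import random

# Randomly choose, for each subject, which design is administered first,
# so that learning effects don't systematically favour one design.
# (Subject labels and design names here are hypothetical.)
subjects = ["S1", "S2", "S3", "S4", "S5", "S6"]
orders = {}
for s in subjects:
    order = ["Design A", "Design B"]
    random.shuffle(order)   # independent random order for each subject
    orders[s] = order
print(orders)

(With only two treatments, you could also counterbalance explicitly, assigning each order to exactly half the subjects.)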
Individual
Differences
Often, the differences in performance
between individual subjects may be substantial relative to the
difference due to the experiment treatment (e.g. which design). This
acts as a "nuisance variable", since if not accounted for, it will
contribute to the "noise" in the data and mask a real effect.
To address this common problem, the
experiment may be designed to use "paired comparisons", where the
treatments (let's say two, for a comparison-of-two-means test) are
assigned in pairs to each subject. Thus, each subject is tested on both
levels of the independent variable. In the analysis phase, you can in
effect treat each subject as his own control: you can look at his relative
performance across the treatments, and compare this relative
performance with the other subjects. In cases where between-subject
variability is great (as it often is), this is a valuable technique for
increasing the experiment's power: a smaller difference in raw treatment
means can still yield a significant result. Details of this method are given
below.
basic t-test comparing two means
The t-test is a simple way of determining
the probability that two samples (i.e. sets of measurements)
are part of the same population, or of two different populations.
Each sample might be measurements taken from one of two levels of an
independent variable - for example, "which design": Design A, or Design
B. The general question to be answered is: does using the two designs
result in a difference in the measured performance parameter (i.e.
dependent variable)? If so, then participants using Design A
effectively represent a different population than participants using
Design B. If not, then statistically, those two sets of measurements
will look as if they all came from the same population. You can
use the t-test to see which of these cases is probably true, to some
level of confidence.
Remember - the t-test is only valid for
normally distributed populations. It is important to (a) randomize
everything possible when collecting your data, and (b) plot a histogram
of your collected data to make sure it is distributed at least roughly
in a bell-shaped curve. If it is not, or if you have violated
randomization principles in data collection, then you cannot assume
that the t-statistic really means anything.
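If you want a quick look before building a spreadsheet, here is a minimal pure-Python sketch that prints a crude text histogram for eyeballing whether the data look roughly bell-shaped; the sample values are made up for illustration:

from collections import Counter

def text_histogram(data, bins=6):
    # Bucket the data into equal-width bins and print one row of stars per bin.
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1   # guard against all-equal data
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in data)
    for b in range(bins):
        left = lo + b * width
        print(f"{left:7.1f} - {left + width:7.1f} | {'*' * counts.get(b, 0)}")

# Hypothetical task times (ms)
text_histogram([610, 640, 700, 655, 690, 720, 680, 750, 705, 735, 698, 670])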
The steps for computing the t statistic
in this scenario, shown here, are also given in N&L 10.7 (with more
readable equations) as well as in the course notes. A code sketch of
these steps follows at the end of this section.
1. Compute a combined variance for the two samples:

s^2 = (SS1 + SS2) / (N1 + N2 - 2)

2. Compute the "standard error of difference":

sed = sqrt(s^2 (1/N1 + 1/N2))

3. Compute the t statistic itself:

t = (Xmean1 - Xmean2) / sed

4. Compute the t statistic's degrees of freedom:

df = N1 + N2 - 2
5. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).
6. Compare the t statistic to a table
of critical t-values for a two-tailed t distribution, given df
and p from steps 4 and 5. You can find tables in the back of a stats
textbook, or in many places online (for example, the MedCalc table
used later in these notes).
7. If your computed value of t
is higher than the value of t in the table for your df and,
say, p=.05, then you can say that the difference between the two means
is statistically significant - with a probability of p<.05. It is common
in fact to locate the smallest value of p for which your computed t
is significant, and state that as the result's significance.
This formulation of the
t-statistic is designed to compare the difference between two sample
means. You are here using it to determine whether a difference
between two sample means is statistically significant, REGARDLESS OF
DIRECTION. I.e., either Xmean1 or Xmean2 might be larger - you just
want to know whether statistically they are different. You can
easily enough determine which is larger by simply looking at them.
Essentially, statistical significance means it's unlikely that these
two means could have come from the same population distribution - the
two populations which these samples represent must be different from
each other.
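To make the steps above concrete, here is a minimal Python sketch of the two-means computation (steps 1-4); the function name and the sample data are made up for illustration, and the resulting t would still be checked against a two-tailed table by hand:

import math

def two_sample_t(a, b):
    # Pooled-variance t statistic for two independent samples.
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - mean1) ** 2 for x in a)      # sum of squares, sample 1
    ss2 = sum((x - mean2) ** 2 for x in b)      # sum of squares, sample 2
    s2 = (ss1 + ss2) / (n1 + n2 - 2)            # step 1: combined variance
    sed = math.sqrt(s2 * (1 / n1 + 1 / n2))     # step 2: standard error of difference
    t = (mean1 - mean2) / sed                   # step 3: t statistic
    df = n1 + n2 - 2                            # step 4: degrees of freedom
    return t, df

# Hypothetical task times (ms) under Design A and Design B
design_a = [610, 640, 700, 655, 690]
design_b = [720, 680, 750, 705, 735]
t, df = two_sample_t(design_a, design_b)
print(f"t = {t:.4f}, df = {df}")   # compare |t| to the tabled two-tailed value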
single-tailed and two-tailed t-test
One aspect of the distinction between
single-tailed and two-tailed t-tests is covered briefly in Sections 10.7
and 10.8. The two-tailed t-test is the "standard" one. For a given p, the
critical value of t for a single-tailed hypothesis is lower than that for
a two-tailed hypothesis (it is the two-tailed critical value for twice
the p, since the whole tail probability sits on one side), and thus the
single-tailed test is easier to pass.
There are some cases where it is argued that this easier single-tailed
t-test is justified - most notably, where a significant difference in
means will be interesting only if it is in one direction, e.g. Xmean1 >
Xmean2, but not vice versa. This argument is somewhat controversial,
and we won't be using the single-tailed t-test for this purpose in this
class. If you're not sure, use the two-tailed test, which is
conservative.
There is, however, another use of the
single-tailed t-test which is of particular relevance to us: when you
want to use a t-test to compare one sample mean to a reference
value - e.g. a design requirement. N&L 10.8 describes one way of
doing this, based upon a single-tailed t-test; the method is described
in the next section. The reason it's okay to do this is that rather than
looking to see if one sample mean is far enough away from another
sample mean (in either direction), here you're seeing if one sample
mean is far enough away from a constant value (in one direction). The
null hypothesis states that there's no significant difference between
your sample mean and the reference value; this you are trying to
reject. Your experimental hypothesis states that your sample mean is,
e.g., less than the reference value - it does not allow greater
than. The probability (i.e. 5% of the area under the t-distribution
curve, if you are using p=.05) can be bunched all on one side of the
distribution, rather than divided half on each side as it is for a
two-tailed test.
t-test for comparing a mean to a reference value
Many 444 project experiments
involve comparing a single sample mean to a performance requirement.
N&L 10.8 describes one way of doing this, which is reiterated and
expanded upon here. For purpose of discussion, let's say that your
requirement is that performance must be <= level R (i.e. R is a
maximum permissible value, not a minimum). Conceptually, you need to
make sure that no part of the confidence interval lies above the
reference value.
(Note: in this case, you don't have two
levels of an independent variable; you aren't actually varying
anything, but comparing one design, for example, to a fixed value).
First of all, there is no point in doing
the test at all if the sample mean is larger than R (Xmean > R).
Clearly, we will not find that Xmean is significantly less than R in
this case. If Xmean <= R, then let's proceed.
Next, what's your confidence interval? It's
a range of values around Xmean, whose size is determined by your
desired significance p (also commonly called alpha) - e.g. 0.05
or 0.01; or, as usual, you can run the process backwards and determine
the size the confidence interval would have if the constraint were
that it lie entirely under R. If you require a small p, then
your confidence interval will be larger - the test will be harder to
satisfy. So, choose p and proceed.
Here are the computational steps:
1. Compute the variance for your single sample:

s^2 = SS / (N - 1)
2. Compute "standard error of the
mean": |
|
sem = sqrt(s^2/N) |
|
|
3. Compute the t statistic's degrees of freedom:

df = N - 1
4. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).
5. Find the critical value of the t
statistic, using the single-tailed t distribution, based on df and p.
6. Compute the top and bottom of your confidence interval, using the
critical value t(p,df) from step 5:

Xmin = Xmean - t(p,df) x sem

Xmax = Xmean + t(p,df) x sem
7. Compare Xmax (or Xmin) with the
reference value as defined in the experiment and null hypothesis. In
our example, if Xmax <= R, then our sample mean is below R, at
significance p.
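Putting these steps together, here is a minimal Python sketch of the confidence-interval check; the function name, the sample data, the requirement R = 750 ms, and the tabled critical t (roughly 1.895 for a single-tailed p=.05 at df=7) are all illustrative assumptions:

import math

def one_sample_ci(x, t_crit):
    # Confidence interval around a single sample mean.
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)      # sum of squares
    s2 = ss / (n - 1)                           # step 1: sample variance
    sem = math.sqrt(s2 / n)                     # step 2: standard error of the mean
    half = t_crit * sem                         # t_crit = t(p,df) from the table (step 5)
    return mean - half, mean + half             # step 6: (Xmin, Xmax)

# Hypothetical selection times (ms); requirement R = 750 ms
times = [690, 720, 660, 700, 710, 685, 730, 695]
xmin, xmax = one_sample_ci(times, t_crit=1.895)  # df = 7, single-tailed p = .05
print(f"Xmin = {xmin:.1f}, Xmax = {xmax:.1f}; requirement met if Xmax <= 750")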
t-test for a paired comparison
The paired comparison t-test is used when
you suspect that individual differences between subjects might be
masking an otherwise significant effect. For example, Subject 1
performs well in all cases, and Subject 2 performs poorly in all cases;
both of them do better with Design A than with Design B, but the
standard deviation of their combined performances is so large that this
individual trend does not show up strongly enough.
The solution is to conduct the t-test on
the difference between the individual's performances on
the two treatments, rather than on all performances lumped together. In
effect, the subject's mean performance is subtracted from his
individual scores before they are "thrown into the pot". Details are as
follows:
1. Compute the difference between the two treatment scores for each
individual:

Di = Xai - Xbi

where Xai and Xbi are the measured responses for treatments A and B on
subject i

2. Compute the mean of the individual differences over all subjects:

Dmean = summation(Di) / N

where N = number of differences = number of subjects

3. Compute the sum of squares of the differences:

SSd = summation[(Di - Dmean)^2]
    = summation[(Di)^2] - [summation(Di)]^2 / N
4. Compute the standard deviation of the differences:

sd = sqrt[SSd / (N - 1)]

5. Compute the "standard error of difference":

sed = sd / sqrt(N)

6. Compute the t statistic itself:

t = Dmean / sed

7. Compute the t statistic's degrees of freedom:

df = N - 1
8. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).

9. Compare the t statistic to a table
of critical t-values for a two-tailed t distribution, given df
and p from steps 7 and 8. You can find tables in the back of a stats
textbook, or in many places online (for example, the MedCalc table
used below).

10. If your computed value of t
is higher than the value of t in the table for your df and,
say, p=.05, then you can say that the difference between the two means
is statistically significant - with a probability of p<.05. It is common
in fact to locate the smallest value of p for which your computed t
is significant, and state that as the result's significance.
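Again, a minimal Python sketch of the paired computation; the function name and the per-subject scores are made up for illustration:

import math

def paired_t(xa, xb):
    # Paired-comparison t statistic: analyze the per-subject differences.
    n = len(xa)                                  # number of subjects
    d = [a - b for a, b in zip(xa, xb)]          # step 1: Di = Xai - Xbi
    dmean = sum(d) / n                           # step 2: mean difference
    ssd = sum((di - dmean) ** 2 for di in d)     # step 3: sum of squares of differences
    sd = math.sqrt(ssd / (n - 1))                # step 4: standard deviation of differences
    sed = sd / math.sqrt(n)                      # step 5: standard error of difference
    t = dmean / sed                              # step 6: t statistic
    df = n - 1                                   # step 7: degrees of freedom
    return t, df

# Hypothetical scores: each subject measured under both treatments A and B
a = [610, 640, 700, 655, 690, 720]
b = [650, 665, 730, 690, 700, 760]
t, df = paired_t(a, b)
print(f"t = {t:.4f}, df = {df}")   # compare |t| to the tabled two-tailed value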
The lecture on controlled experiments
contains discussion on sampling from a normal distribution and
hypothesis testing. This TCL script, random.tcl,
shows an example of randomly sampling from two distributions and
performing a t-test on them. You can play around with it to see the
influence of variance and number of samples. It should run in a wish shell.
The final lecture on controlled experiments
ends with an example of an experiment design to compare the difference in
performance for "natural" and "abstract" icon types. A set of sample
data (fictitious) and some sample calculations can be found in this Excel
spreadsheet. Here, we'll step through these calculations. In particular,
the spreadsheet illustrates a way to account for individual differences
in your subjects.
The first worksheet, "data", shows the
sample data. The second worksheet, "1st pass", shows one way of
computing the analysis - in fact not a very good way, because it
doesn't account for individual differences between subjects. The third
worksheet, "2nd pass", shows a better although more lengthy way of
doing the analysis. The fourth worksheet, "requirements", shows a
calculation to see if we meet a specific performance criterion. The
fifth worksheet shows a Chi-squared calculation to see if one of the
methods is preferred.
Raw Data
There are two icon designs (natural and
abstract) which represent the two levels of the independent variable.
Performance time for each level for each subject is listed; smaller is
better. As well, each user's preference for Natural or Abstract is shown.
This worksheet also shows plots of the data
in different ways, which help you decide if it is indeed normally
distributed and whether individual differences may be important. "Sorted
Data" shows the two columns of data plotted as lines after each has been
individually sorted into ascending order. If the data are normally
distributed, each of the two lines should rise steeply at first, flatten
somewhat, then rise steeply again at the end. Outliers are clearly
evident in such a plot.
The second plot, Histogram, shows the same
data after they have been sorted into appropriately sized bins. There
need to be enough bins to demonstrate a normal distribution (at least
5), and each bin should have some values in it. It is difficult to do a
histogram for very small samples (10 or fewer).
In this case, the data seem fairly well
distributed, so we will proceed to do a t-test.
The third plot, Individual Means, shows the
same data, but each individual's scores have been plotted across the two
conditions. There is not an obvious trend, though most of the curves do
slope upwards, suggesting that a given individual does better with
natural icons than abstract ones. Perhaps we need to take account of
the individual differences...
1st Pass:
neglects individual differences
In the simplest analysis, the t-statistic
for testing for a difference between two means is computed in a
straightforward way, exactly as laid out above. The data from the 1st
worksheet is reproduced in the shaded columns. The basic statistics
(mean, sum of squares, variance and standard deviation) are computed
according to the formulas given in the N&L handout for each column.
Then, the t statistic is computed based on
the combined variance and standard error of difference of the two
samples. A value of "t" of 0.9525 is found, with 38 degrees of freedom
(40 samples on 20 subjects).
Assuming a desired significance level of
0.05, we consult a two-tailed distribution found in one of the online
tables, for example, the MedCalc one.
The critical value of t for .05 and 38 df is 2.024. Thus, the
difference in effects appears not to be significant.
2nd Pass:
accounts for individual differences
Let's try again, considering that perhaps
large individual differences are inflating the sample's standard
deviation. The 2ndpass worksheet illustrates an analysis in accordance
with the Paired Comparisons approach. Now we compute the difference in
performance for each subject, and analyze this number. The
t-test is different because there is only one set of values, rather
than two. We now find a t-statistic of 4.8425, which exceeds the
critical value of 2.0930 (higher than before, since now we have just 19
rather than 38 degrees of freedom). In this analysis, the
difference due to treatment is indeed significant at p=.05; in
fact, looking at the table, it's significant even at p<.001.
Analyzing subjects separately made a big
difference!
The Requirements sheet shows whether each
icon type will meet the overall requirement of 750 msec selection time.
The Preference sheet shows whether one
method is preferred over the other. Notice that since we don't have
many samples, the preference difference must be pretty large before it
becomes significant.