For more detail on basic
statistical concepts, see N&L Sections 10.1-10.7. In particular, you
should be very comfortable with these terms, i.e. be able to explain
what they mean in your own words, produce an example when given an
experiment scenario, or compute the statistic.
- population vs sample
- sample mean
- confidence level of a statistic
- external and internal validity of an
experiment
- independent and dependent experiment
variables
- nuisance variables
- normal distributions: what one is, how
you get one, and how you tell if you have one
- variance and standard deviation, as
describing populations and samples (know how to calculate)
- null and experimental hypotheses
- degrees of freedom of a statistic
You should also be familiar with the
t-test, covered in the remainder of N&L Chapter 10 and below.
two basic kinds of experiments
In this class, you'll very likely be doing
one of two kinds of statistical tests:
1. Comparing two means: for
example, the effect on a sample of Design A vs Design B
2. Comparing a single mean with a
reference value: for example, the performance of your design
relative to a design requirement
The two experiments are conducted in a
similar way, with one main difference: when you're comparing a single
mean to a reference value, you don't have an independent variable. That
is, you're not varying anything. When comparing two means, the two
means are obtained as a result of setting your independent variable to
two levels.
A different t-statistic is used to evaluate
the two cases (described below).
issues in experiment design
Randomization
To justify your assumption of normality
(and permit the application of the t-test), you need to collect a random
sample of your population. (The population you are sampling must
also be normally distributed, of course). There are several things you
can do to ensure that your sample is representative of the population
at large. The exact list depends on the specifics of your experiment,
but this should give you the idea in the context of your experiments:
- Selection of subjects: they
should be a representative sample of the population of interest at
large, in all parameters that might matter for your particular
experiment (e.g. age, educational background, handedness, physical
abilities, skill with the interface, visual acuity, ...)
- Order of application of "treatments"
to your subjects: e.g., to avoid learning effects, don't always
administer Design A and Design B in the same order (a minimal sketch of
randomizing order per subject follows this list).
- Details of your design which might
skew results: e.g. ordering of commands in a menu list (early items
are likely to be encountered sooner)
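For the treatment-order point above, here is a minimal Python sketch of randomizing the order per subject; the subject labels and design names are hypothetical:

import random

# Randomly choose, for each subject, which design is administered first,
# so that learning effects don't systematically favour one design.
# (Subject labels and design names here are hypothetical.)
subjects = ["S1", "S2", "S3", "S4", "S5", "S6"]
orders = {}
for s in subjects:
    order = ["Design A", "Design B"]
    random.shuffle(order)   # independent random order for each subject
    orders[s] = order
print(orders)

(With only two treatments, you could also counterbalance explicitly, assigning each order to exactly half the subjects.)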
Individual
Differences
Often, the differences in performance
between individual subjects may be substantial relative to the
difference due to the experiment treatment (e.g. which design). This
acts as a "nuisance variable", since if not accounted for, it will
contribute to the "noise" in the data and mask a real effect.
To address this common problem, the
experiment may be designed to use "paired comparisons", where the
treatments (let's say two, for a comparison-of-two-means test) are
assigned in pairs to each subject. Thus, each subject is tested on both
levels of the independent variable. In the analysis phase, you can in
effect treat each subject as his own control: you can look at his relative
performance across the treatments, and compare this relative
performance with the other subjects. In cases where between-subject
variability is great (as it often is), this is a valuable technique for
increasing the experiment's power: a smaller difference in raw treatment
means can still yield a significant result. Details of this method are given
below.
basic t-test comparing two means
The t-test is a simple way of determining
the probability that two samples (i.e. sets of measurements)
are part of the same population, or of two different populations.
Each sample might be measurements taken from one of two levels of an
independent variable - for example, "which design": Design A, or Design
B. The general question to be answered is: does using the two designs
result in a difference in the measured performance parameter (i.e.
dependent variable)? If so, then participants using Design A
effectively represent a different population than participants using
Design B. If not, then statistically, those two sets of measurements
will look as if they all came from the same population. You can
use the t-test to see which of these cases is probably true, to some
level of confidence.
Remember - the t-test is only valid for
normally distributed populations. It is important to (a) randomize
everything possible when collecting your data, and (b) plot a histogram
of your collected data to make sure it is distributed at least roughly
in a bell-shaped curve. If it is not, or if you have violated
randomization principles in data collection, then you cannot assume
that the t-statistic really means anything.
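If you want a quick look before building a spreadsheet, here is a minimal pure-Python sketch that prints a crude text histogram for eyeballing whether the data look roughly bell-shaped; the sample values are made up for illustration:

from collections import Counter

def text_histogram(data, bins=6):
    # Bucket the data into equal-width bins and print one row of stars per bin.
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1   # guard against all-equal data
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in data)
    for b in range(bins):
        left = lo + b * width
        print(f"{left:7.1f} - {left + width:7.1f} | {'*' * counts.get(b, 0)}")

# Hypothetical task times (ms)
text_histogram([610, 640, 700, 655, 690, 720, 680, 750, 705, 735, 698, 670])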
The steps for computing the t statistic
in this scenario, shown here, are also given in N&L 10.7 (with more
readable equations) as well as in the course notes. A code sketch of
these steps follows at the end of this section.
1. Compute a combined variance for the two samples:

s^2 = (SS1 + SS2) / (N1 + N2 - 2)

2. Compute the "standard error of difference":

sed = sqrt(s^2 (1/N1 + 1/N2))

3. Compute the t statistic itself:

t = (Xmean1 - Xmean2) / sed

4. Compute the t statistic's degrees of freedom:

df = N1 + N2 - 2
5. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).
6. Compare the t statistic to a table
of critical t-values for a two-tailed t distribution, given df
and p from steps 4 and 5. You can find tables in the back of a stats
textbook, or in many places online (for example, the MedCalc table
used later in these notes).
7. If your computed value of t
is higher than the value of t in the table for your df and,
say, p=.05, then you can say that the difference between the two means
is statistically significant - with a probability of p<.05. It is common
in fact to locate the smallest value of p for which your computed t
is significant, and state that as the result's significance.
This formulation of the
t-statistic is designed to compare the difference between two sample
means. You are here using it to determine whether a difference
between two sample means is statistically significant, REGARDLESS OF
DIRECTION. I.e., either Xmean1 or Xmean2 might be larger - you just
want to know whether statistically they are different. You can
easily enough determine which is larger by simply looking at them.
Essentially, statistical significance means it's unlikely that these
two means could have come from the same population distribution - the
two populations which these samples represent must be different from
each other.
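To make the steps above concrete, here is a minimal Python sketch of the two-means computation (steps 1-4); the function name and the sample data are made up for illustration, and the resulting t would still be checked against a two-tailed table by hand:

import math

def two_sample_t(a, b):
    # Pooled-variance t statistic for two independent samples.
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - mean1) ** 2 for x in a)      # sum of squares, sample 1
    ss2 = sum((x - mean2) ** 2 for x in b)      # sum of squares, sample 2
    s2 = (ss1 + ss2) / (n1 + n2 - 2)            # step 1: combined variance
    sed = math.sqrt(s2 * (1 / n1 + 1 / n2))     # step 2: standard error of difference
    t = (mean1 - mean2) / sed                   # step 3: t statistic
    df = n1 + n2 - 2                            # step 4: degrees of freedom
    return t, df

# Hypothetical task times (ms) under Design A and Design B
design_a = [610, 640, 700, 655, 690]
design_b = [720, 680, 750, 705, 735]
t, df = two_sample_t(design_a, design_b)
print(f"t = {t:.4f}, df = {df}")   # compare |t| to the tabled two-tailed value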
single-tailed and two-tailed t-test
One aspect of the distinction between
single-tailed and two-tailed t-tests is covered briefly in Sections 10.7
and 10.8. The two-tailed t-test is the "standard" one. For a given p, the
critical value of t for a single-tailed hypothesis is lower than that for
a two-tailed hypothesis (it is the two-tailed critical value for twice
the p, since the whole tail probability sits on one side), and thus the
single-tailed test is easier to pass.
There are some cases where it is argued that this easier single-tailed
t-test is justified - most notably, where a significant difference in
means will be interesting only if it is in one direction, e.g. Xmean1 >
Xmean2, but not vice versa. This argument is somewhat controversial,
and we won't be using the single-tailed t-test for this purpose in this
class. If you're not sure, use the two-tailed test, which is
conservative.
There is, however, another use of the
single-tailed t-test which is of particular relevance to us: when you
want to use a t-test to compare one sample mean to a reference
value - e.g. a design requirement. N&L 10.8 describes one way of
doing this, based upon a single-tailed t-test; the method is described
in the next section. The reason it's okay to do this is that rather than
looking to see if one sample mean is far enough away from another
sample mean (in either direction), here you're seeing if one sample
mean is far enough away from a constant value (in one direction). The
null hypothesis states that there's no significant difference between
your sample mean and the reference value; this you are trying to
reject. Your experimental hypothesis states that your sample mean is,
e.g., less than the reference value - it does not allow greater
than. The probability (i.e. 5% of the area under the t-distribution
curve, if you are using p=.05) can be bunched all on one side of the
distribution, rather than divided half on each side as it is for a
two-tailed test.
t-test for comparing a mean to a reference value
Many 444 project experiments
involve comparing a single sample mean to a performance requirement.
N&L 10.8 describes one way of doing this, which is reiterated and
expanded upon here. For purpose of discussion, let's say that your
requirement is that performance must be <= level R (i.e. R is a
maximum permissible value, not a minimum). Conceptually, you need to
make sure that no part of the confidence interval lies above the
reference value.
(Note: in this case, you don't have two
levels of an independent variable; you aren't actually varying
anything, but comparing one design, for example, to a fixed value).
First of all, there is no point in doing
the test at all if the sample mean is larger than R (Xmean > R).
Clearly, we will not find that Xmean is significantly less than R in
this case. If Xmean <= R, then let's proceed.
Next, what's your confidence interval? It's
a range of values around Xmean, whose size is determined by your
desired significance p (also commonly called alpha) - e.g. 0.05
or 0.01; or, as usual, you can run the process backwards and determine
the size the confidence interval would have if the constraint were
that it lie entirely under R. If you require a small p, then
your confidence interval will be larger - the test will be harder to
satisfy. So, choose p and proceed.
Here are the computational steps:
1. Compute the variance for your single sample:

s^2 = SS / (N - 1)
2. Compute "standard error of the
mean": |
|
sem = sqrt(s^2/N) |
|
|
3. Compute the t statistic's degrees of freedom:

df = N - 1
4. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).
5. Find the critical value of the t
statistic, using the single-tailed t distribution, based on df and p.
6. Compute the top and bottom of your confidence interval, using the
critical value t(p,df) from step 5:

Xmin = Xmean - t(p,df) x sem

Xmax = Xmean + t(p,df) x sem
7. Compare Xmax (or Xmin) with the
reference value as defined in the experiment and null hypothesis. In
our example, if Xmax <= R, then our sample mean is below R, at
significance p.
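Putting these steps together, here is a minimal Python sketch of the confidence-interval check; the function name, the sample data, the requirement R = 750 ms, and the tabled critical t (roughly 1.895 for a single-tailed p=.05 at df=7) are all illustrative assumptions:

import math

def one_sample_ci(x, t_crit):
    # Confidence interval around a single sample mean.
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)      # sum of squares
    s2 = ss / (n - 1)                           # step 1: sample variance
    sem = math.sqrt(s2 / n)                     # step 2: standard error of the mean
    half = t_crit * sem                         # t_crit = t(p,df) from the table (step 5)
    return mean - half, mean + half             # step 6: (Xmin, Xmax)

# Hypothetical selection times (ms); requirement R = 750 ms
times = [690, 720, 660, 700, 710, 685, 730, 695]
xmin, xmax = one_sample_ci(times, t_crit=1.895)  # df = 7, single-tailed p = .05
print(f"Xmin = {xmin:.1f}, Xmax = {xmax:.1f}; requirement met if Xmax <= 750")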
t-test for a paired comparison
The paired comparison t-test is used when
you suspect that individual differences between subjects might be
masking an otherwise significant effect. For example, Subject 1
performs well in all cases, and Subject 2 performs poorly in all cases;
both of them do better with Design A than with Design B, but the
standard deviation of their combined performances is so large that this
individual trend does not show up strongly enough.
The solution is to conduct the t-test on
the difference between the individual's performances on
the two treatments, rather than on all performances lumped together. In
effect, the subject's mean performance is subtracted from his
individual scores before they are "thrown into the pot". Details are as
follows:
1. Compute the difference between the two treatment scores for each
individual:

Di = Xai - Xbi

where Xai and Xbi are the measured responses for treatments A and B on
subject i

2. Compute the mean of the individual differences over all subjects:

Dmean = summation(Di) / N

where N = number of differences = number of subjects

3. Compute the sum of squares of the differences:

SSd = summation[(Di - Dmean)^2]
    = summation[(Di)^2] - [summation(Di)]^2 / N
4. Compute the standard deviation of the differences:

sd = sqrt[SSd / (N - 1)]

5. Compute the "standard error of difference":

sed = sd / sqrt(N)

6. Compute the t statistic itself:

t = Dmean / sed

7. Compute the t statistic's degrees of freedom:

df = N - 1
8. Decide on the significance value
you require for the result. A common value is p=.01 or .05 (p=.01 means
a 1% chance that you are incorrect in rejecting the null hypothesis).

9. Compare the t statistic to a table
of critical t-values for a two-tailed t distribution, given df
and p from steps 7 and 8. You can find tables in the back of a stats
textbook, or in many places online (for example, the MedCalc table
used below).

10. If your computed value of t
is higher than the value of t in the table for your df and,
say, p=.05, then you can say that the difference between the two means
is statistically significant - with a probability of p<.05. It is common
in fact to locate the smallest value of p for which your computed t
is significant, and state that as the result's significance.
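Again, a minimal Python sketch of the paired computation; the function name and the per-subject scores are made up for illustration:

import math

def paired_t(xa, xb):
    # Paired-comparison t statistic: analyze the per-subject differences.
    n = len(xa)                                  # number of subjects
    d = [a - b for a, b in zip(xa, xb)]          # step 1: Di = Xai - Xbi
    dmean = sum(d) / n                           # step 2: mean difference
    ssd = sum((di - dmean) ** 2 for di in d)     # step 3: sum of squares of differences
    sd = math.sqrt(ssd / (n - 1))                # step 4: standard deviation of differences
    sed = sd / math.sqrt(n)                      # step 5: standard error of difference
    t = dmean / sed                              # step 6: t statistic
    df = n - 1                                   # step 7: degrees of freedom
    return t, df

# Hypothetical scores: each subject measured under both treatments A and B
a = [610, 640, 700, 655, 690, 720]
b = [650, 665, 730, 690, 700, 760]
t, df = paired_t(a, b)
print(f"t = {t:.4f}, df = {df}")   # compare |t| to the tabled two-tailed value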
The lecture on controlled experiments
contains discussion on sampling from a normal distribution and
hypothesis testing. This TCL script, random.tcl,
shows an example of randomly sampling from two distributions and
performing a t-test on them. You can play around with it to see the
influence of variance and number of samples. It should run in a wish shell.
The final lecture on controlled experiments
ends with an example of an experiment design to compare the difference in
performance for "natural" and "abstract" icon types. A set of sample
data (fictitious) and some sample calculations can be found in this Excel
spreadsheet. Here, we'll step through these calculations. In particular,
the spreadsheet illustrates a way to account for individual differences
in your subjects.
The first worksheet, "data", shows the
sample data. The second worksheet, "1st pass", shows one way of
computing the analysis - in fact not a very good way, because it
doesn't account for individual differences between subjects. The third
worksheet, "2nd pass", shows a better although more lengthy way of
doing the analysis. The fourth worksheet, "requirements", shows a
calculation to see if we meet a specific performance criterion. The
fifth worksheet shows a Chi-squared calculation to see if one of the
methods is preferred.
Raw Data
There are two icon designs (natural and
abstract) which represent the two levels of the independent variable.
Performance time for each level for each subject is listed; smaller is
better. As well, each user's preference for Natural or Abstract is shown.
This worksheet also shows plots of the data
in different ways, which help you decide if it is indeed normally
distributed and whether individual differences may be important. "Sorted
Data" shows the two columns of data plotted as lines after each has been
individually sorted into ascending order. If the data are normally
distributed, each of the two lines should rise steeply at first, flatten
somewhat, then rise steeply again at the end. Outliers are clearly
evident in such a plot.
The second plot, Histogram, shows the same
data after they have been sorted into appropriately sized bins. There
need to be enough bins to demonstrate a normal distribution (at least
5), and each bin should have some values in it. It is difficult to do a
histogram for very small samples (10 or fewer).
In this case, the data seem fairly well
distributed, so we will proceed to do a t-test.
The third plot, Individual Means, shows the
same data, but each individual's scores have been plotted across the two
conditions. There is not an obvious trend, though most of the curves do
slope upwards, suggesting that a given individual does better with
natural icons than abstract ones. Perhaps we need to take account of
the individual differences...
1st Pass:
neglects individual differences
In the simplest analysis, the t-statistic
for testing for a difference between two means is computed in a
straightforward way, exactly as laid out above. The data from the 1st
worksheet is reproduced in the shaded columns. The basic statistics
(mean, sum of squares, variance and standard deviation) are computed
according to the formulas given in the N&L handout for each column.
Then, the t statistic is computed based on
the combined variance and standard error of difference of the two
samples. A value of "t" of 0.9525 is found, with 38 degrees of freedom
(40 samples on 20 subjects).
Assuming a desired significance level of
0.05, we consult a two-tailed distribution found in one of the online
tables, for example, the MedCalc one.
The critical value of t for .05 and 38 df is 2.024. Thus, the
difference in effects appears not to be significant.
2nd Pass:
accounts for individual differences
Let's try again, considering that perhaps
large individual differences are inflating the sample's standard
deviation. The 2ndpass worksheet illustrates an analysis in accordance
with the Paired Comparisons approach. Now we compute the difference in
performance for each subject, and analyze this number. The
t-test is different because there is only one set of values, rather
than two. We now find a t-statistic of 4.8425, which exceeds the
critical value of 2.0930 (higher than before, since now we have just 19
rather than 38 degrees of freedom). In this analysis, the
difference due to treatment is indeed significant at p=.05; in
fact, looking at the table, it's significant even at p<.001.
Analyzing subjects separately made a big
difference!
The Requirements sheet shows whether each
icon type will meet the overall requirement of 750 msec selection time.
The Preference sheet shows whether one
method is preferred over the other. Notice that since we don't have
many samples, the preference difference must be pretty large before it
becomes significant.