
Hypothesis testing
Hypothesis testing is a formal process through which statisticians and data scientists evaluate claims about data. The standard approach is to define an area of research, decide which variables are necessary to measure what is being studied, and then set out two competing hypotheses. In order to avoid only looking at the data that confirms our biases, researchers state their hypothesis clearly ahead of time. Statistical tests can then be used to support or refute this hypothesis, based on the data.
In order to help retain our visitors, designers go to work on a variation of our home page that uses all the latest techniques to keep the attention of our audience. We'd like to be sure that our effort isn't in vain, so we will look for an increase in dwell time on the new site.
Therefore, our research question is "Does the new site cause visitors' dwell time to increase?" We decide that this should be tested with reference to the mean dwell time. Now, we need to set out our two hypotheses. By convention, the data is assumed not to contain what the researcher is looking for. The conservative position is that the data will not show anything unusual. This is called the null hypothesis and is normally denoted H0.
The researcher then forms an alternative hypothesis, denoted by H1. This could simply be that the population mean is different from the baseline. Or, it could be that the population mean is greater or less than the baseline, or even greater or less by some specified amount. We'd like to test whether the new site increases dwell time, so these will be our null and alternative hypotheses:
- H0: The dwell time for the new site is no different than the dwell time of the existing site
- H1: The dwell time is greater for the new site compared to the existing site
Our conservative assumption is that the new site has no effect on the dwell time of users. The null hypothesis doesn't have to be a nil hypothesis (that there is no effect), but in this case, we have no reasonable justification to assume otherwise. If the sample data does not support the null hypothesis (if the data differs from its prediction by a margin too large to have arisen by chance alone), then we will reject the null hypothesis and accept the alternative hypothesis as the best available explanation.
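To make this concrete, here is a minimal sketch in Python of how these hypotheses translate into a one-tailed comparison of mean dwell times. The dwell time figures and the use of a Welch's t-test are assumptions made purely for illustration; they are not taken from the study described here.

```python
import numpy as np
from scipy import stats

# Hypothetical dwell times in seconds; in practice these would come from our
# site analytics rather than being typed in by hand.
old_site = np.array([84, 92, 77, 103, 88, 95, 71, 110, 86, 90])
new_site = np.array([91, 105, 88, 112, 97, 101, 85, 118, 93, 99])

# H0: the mean dwell time on the new site is no greater than on the existing site.
# H1: the mean dwell time on the new site is greater.
# A one-tailed Welch's t-test is one way to encode this directional alternative.
t_stat, p_value = stats.ttest_ind(new_site, old_site,
                                  equal_var=False,
                                  alternative='greater')

print(f"t statistic: {t_stat:.2f}, p-value: {p_value:.3f}")
```

The resulting p-value is the probability of observing a difference at least this large if the null hypothesis were true; how to judge it against a significance threshold is the subject of the next section.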
Having set out the null and alternative hypotheses, we must set a significance level at which we are looking for an effect.
Significance
Significance testing was originally developed independently of hypothesis testing, but the two approaches are now very often used in concert. The purpose of significance testing is to set a threshold beyond which we determine that the observed data no longer supports the null hypothesis.
There are therefore two risks:
- We may accept a difference as significant when, in fact, it arose by chance
- We may attribute a difference to chance when, in fact, it indicates a true population difference
These two possibilities are respectively referred to as Type I and Type II errors:
- Type I error: rejecting the null hypothesis when it is in fact true (a false positive)
- Type II error: failing to reject the null hypothesis when it is in fact false (a false negative)

The more we reduce our risk of making Type I errors, the more we increase our risk of making Type II errors. In other words, the more confident we wish to be that we won't claim a real difference where none exists, the bigger the difference we'll demand between our samples before claiming statistical significance. This increases the probability that we'll disregard a genuine difference when we encounter it.
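This trade-off can be illustrated with a small simulation. The sketch below is illustrative only: the population parameters, effect size, and sample size are arbitrary assumptions, and it reuses one-tailed Welch's t-tests simply to match the earlier sketch. It estimates the Type I error rate from repeated samples with no real difference, and the Type II error rate from repeated samples with a genuine difference, at two candidate thresholds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n = 5_000, 30      # simulated experiments per condition, visitors per sample
true_effect = 5.0            # assumed genuine increase in mean dwell time (seconds)

def one_tailed_p(shift):
    """Simulate one experiment and return its one-tailed p-value."""
    old = rng.normal(90, 15, n)
    new = rng.normal(90 + shift, 15, n)
    return stats.ttest_ind(new, old, equal_var=False, alternative='greater').pvalue

for alpha in (0.05, 0.01):
    # Type I error rate: how often we reject H0 when there is no real difference.
    type_1 = np.mean([one_tailed_p(0.0) < alpha for _ in range(n_trials)])
    # Power: how often we detect the genuine difference; the Type II rate is 1 - power.
    power = np.mean([one_tailed_p(true_effect) < alpha for _ in range(n_trials)])
    print(f"alpha={alpha}: Type I rate ~ {type_1:.3f}, Type II rate ~ {1 - power:.3f}")
```

Tightening the threshold drives down the proportion of false positives but increases the proportion of genuine differences that go undetected, which is exactly the trade-off described above.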
Two significance thresholds are commonly used by statisticians. These are the 5 percent and 1 percent levels. A difference at 5 percent is commonly called significant and at 1 percent is called highly significant. The choice of threshold is often referred to in formulae by the Greek letter alpha, α. Since finding no effect might be regarded as a failure (either of the experiment or of the new site), we might be tempted to adjust α until we find an effect. Because of this, the textbook approach to significance testing requires us to set a significance level before we look at our data. A level of 5 percent is often chosen, so let's go with it.
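With α fixed at 5 percent before looking at the data, the decision rule itself is simple. In the sketch below, the p-value is a made-up placeholder standing in for the one our dwell time test would produce:

```python
alpha = 0.05     # significance level fixed before looking at the data
p_value = 0.021  # placeholder value for the p-value from our dwell time test

if p_value < alpha:
    print("Reject H0: evidence that dwell time is greater on the new site")
else:
    print("Fail to reject H0: no significant increase in dwell time detected")
```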