A very common and essential question when configuring a machine learning training process is how much test data we need so that the measured prediction performance can be trusted. As important as this question is, it is oversimplified as stated. **Whether the measured test performance can be trusted depends on the level of the required performance, the test data size, and the safety margin between the test performance and the required performance.**

Usually, the process to evaluate the prediction performance of a machine learning model is roughly as follows:

- As a result of the business understanding phase (in terms of the CRISP-DM process model), a **required prediction performance threshold** is defined. To keep things simple, let’s assume accuracy as the metric of choice and that the model to be developed is a classifier that must have an accuracy better than 0.8, that is, more than 80% of all predictions must be correct. Please note that we define an accuracy threshold that must not only be met but exceeded. *This makes the statistical proof below much easier while making virtually no difference compared to a minimum allowed value (think of such a threshold as the minimum allowed value minus a tiny epsilon).*
- The available labeled data is split into training and test data; typical proportions are 70:30 and 80:20. For our example, let’s assume a data set of 5,000 labeled inputs that is split into 4,000 training and 1,000 test data items.
- The training process yields a model that shows a training accuracy of 0.85 and a test accuracy of 0.83.
- Bingo, the required accuracy is exceeded by a margin of 3 percentage points on previously unseen data. Now all is well, or is it?
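The split step above can be sketched in a few lines of plain Python. The 80:20 ratio and the data set size of 5,000 are the example values from the text; a real project would typically use a library utility such as scikit-learn’s `train_test_split` instead of this manual shuffle:

```python
import random

data = list(range(5000))        # stand-in for 5,000 labeled examples
random.seed(0)                  # fixed seed for reproducibility
random.shuffle(data)            # randomize before splitting

split = int(0.8 * len(data))    # 80:20 split from the example
train, test = data[:split], data[split:]
print(len(train), len(test))    # → 4000 1000
```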

A reliable answer to this question needs a sound statistical and quantitative foundation.

Ideally, what we want to know is the **accuracy of the classifier for all possible inputs**, that is, the proportion of correctly classified inputs among all possible inputs. Because this set is far too large, and because we don’t know the ground truth (don’t have labels) for most of the data, we cannot directly measure the real accuracy of the model.

Instead, we take a random sample of labeled inputs, the test data set, and measure the accuracy of its predictions. Now, inductive statistics provides us with the necessary knowledge and tools to evaluate, in a quantitative way, the level at which a measured test performance can be trusted. The first important fact is stated by the central limit theorem applied to Bernoulli distributions. Luckily, this theorem is less complicated than it sounds. What it says, slightly simplified and adapted to our scenario, is this: *If you take all possible samples of the same size, then for growing sample sizes the distribution of the sample accuracies approximates a normal (Gaussian bell curve) distribution. This normal distribution has its mean at the real model accuracy, \(\pi\), and a standard deviation, \(\sigma\), that depends on the model accuracy and the sample size, \(n\). More specifically:*

\begin{equation}
\sigma = \frac{\sqrt{\pi(1-\pi)}}{\sqrt{n}}
\end{equation}
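Applied to our example (true accuracy \(\pi = 0.8\), test set size \(n = 1000\)), this formula is easy to evaluate. A minimal sketch in plain Python (the function name is ours):

```python
import math

def accuracy_stddev(pi: float, n: int) -> float:
    """Standard deviation of sample accuracies for a model with
    true accuracy pi, measured on test sets of size n."""
    return math.sqrt(pi * (1 - pi)) / math.sqrt(n)

# Example values from the text: pi = 0.8, n = 1000
sigma = accuracy_stddev(0.8, 1000)
print(round(sigma, 4))  # → 0.0126
```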

That is, we know that the accuracy we measure with the test set is a draw from such a normal distribution. This in turn gives us a probabilistic hint as to how far the real model accuracy is likely to be from the measured test accuracy.

For instance, we know that about 68% of all test accuracies (if we were to take multiple test samples) will fall into the interval [\(\pi\) – \(\sigma\), \(\pi\) + \(\sigma\)]. Combined with the second statistical tool, hypothesis tests, we can now use this knowledge to quantify the evidential power of the measured test accuracy.
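For the example’s numbers (true accuracy 0.8, test set of 1,000), this one-sigma interval can be computed directly:

```python
import math

pi, n = 0.8, 1000   # example values: true model accuracy and test-set size
sigma = math.sqrt(pi * (1 - pi)) / math.sqrt(n)

# About 68% of measured test accuracies fall within one sigma of pi
low, high = pi - sigma, pi + sigma
print(f"[{low:.4f}, {high:.4f}]")  # → [0.7874, 0.8126]
```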

Before we apply the technique of hypothesis testing to our domain and in particular to our example scenario, let’s briefly revisit its line of arguments in general:

- The hypothesis to be statistically proven is the so-called alternative hypothesis, \(H_1\).
- Its opposite is the null hypothesis \(H_0\).
- Using known properties of the underlying probability distribution, it is shown that, if \(H_0\) were true, the probability of the observed outcome (or a more extreme one) would be below a certain significance level, α. It is important that this value is defined before the test is executed.
- Because of this low probability under \(H_0\), the null hypothesis is rejected and its opposite, the alternative hypothesis \(H_1\), is accepted at significance level α.

The key to hypothesis testing is to formulate \(H_0\) in a way that (1) makes it amenable to rejection on the grounds of the measurements and the statistical assumptions, and (2) such that its opposite is what we really want to prove.

Now, let’s apply this technique to our application scenario:

- First and foremost, the test accuracy must be better than the required accuracy. Otherwise, the probability that the model accuracy lies above the required accuracy is at best 50%. In such a case, go back to square one and try another model architecture or another training approach.
- In the example above, the test accuracy of 0.83 exceeds the required accuracy of 0.80 by 0.03. This gives us a chance to execute a successful hypothesis test.
- Let’s formulate the null hypothesis, \(H_0\), as “**model accuracy, \(\pi\), is exactly the required accuracy, \(\pi_0\)**”.
- Before we try to reject \(H_0\), we need to fix the significance level α. A common choice is 5%, meaning \(H_0\) will be rejected if the probability of the observed outcome under \(H_0\) is smaller than 5%. To be even more confident in the statistical proof, we choose a significance level α of 1%.
- Using a z-score table, we can look up the z-value such that 99% of all sample accuracies are less than or equal to \(\pi_0 + z \cdot \sigma\). This z-value is approximately 2.33.
- Now let’s plug in our example values to get the largest accuracy value that still lies within the 99% range of our distribution of sample accuracies: \(\pi_0 + z \cdot \sigma = 0.8 + 2.33 \cdot \frac{\sqrt{0.8 \cdot 0.2}}{\sqrt{1000}} \approx 0.8 + 0.029 = 0.829\).
- Let’s call the value of \(z \cdot \sigma\) the safety margin, because it is the margin by which the measured test accuracy must exceed the required accuracy in order to reject the null hypothesis. In our example, the safety margin is 0.029 = 2.9%.
- Because our test accuracy is above \(\pi_0\) plus the safety margin, the probability of observing it (or a larger value), given a model accuracy of 0.8, is less than 1%. We can thus reject \(H_0\) at the pre-specified significance level and accept the alternative hypothesis \(H_1\).
- Now, what exactly is \(H_1\)? As the opposite of \(H_0\), it would be “model accuracy, \(\pi\), is different from the required accuracy, \(\pi_0\)”. However, we know that the model accuracy can only lie above \(\pi_0\), since only that is consistent with the measured test accuracy. Therefore, \(H_1\) can be narrowed down to “model accuracy, \(\pi\), is above the required accuracy, \(\pi_0\)”.
**As a result, we have statistically shown, at a significance level of 1%, that the model accuracy exceeds the required accuracy.**
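The whole hypothesis test above condenses into a few lines. A sketch using only the Python standard library (the variable names are ours; `NormalDist().inv_cdf` replaces the z-score table lookup):

```python
import math
from statistics import NormalDist

pi0 = 0.80        # required accuracy (H0: model accuracy equals pi0)
n = 1000          # test-set size
test_acc = 0.83   # measured test accuracy
alpha = 0.01      # significance level

sigma = math.sqrt(pi0 * (1 - pi0)) / math.sqrt(n)
z = NormalDist().inv_cdf(1 - alpha)   # one-sided z-value, ~2.33
safety_margin = z * sigma             # ~0.029

# Reject H0 if the measured accuracy exceeds pi0 by more than the margin
reject_h0 = test_acc > pi0 + safety_margin
print(round(safety_margin, 3), reject_h0)  # → 0.029 True
```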

As we’ve seen in equation (1), the standard deviation \(\sigma\), and with it the rejection threshold \(\pi_0 + z \cdot \sigma\), depends on the size of the test data set. If the test set in our example had, say, 500 rather than 1,000 entries, the threshold would be 0.842, and we would not be able to reject the null hypothesis.
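The effect of the test set size is easy to check numerically; `rejection_threshold` below is our own helper name:

```python
import math
from statistics import NormalDist

pi0, test_acc, alpha = 0.80, 0.83, 0.01
z = NormalDist().inv_cdf(1 - alpha)   # one-sided z-value, ~2.33

def rejection_threshold(n: int) -> float:
    """Smallest test accuracy that lets us reject H0 with a test set of size n."""
    return pi0 + z * math.sqrt(pi0 * (1 - pi0)) / math.sqrt(n)

print(round(rejection_threshold(1000), 3), test_acc > rejection_threshold(1000))  # → 0.829 True
print(round(rejection_threshold(500), 3), test_acc > rejection_threshold(500))    # → 0.842 False
```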

Secondly, \(\sigma\) also depends on the required accuracy \(\pi_0\). More specifically, \(\sigma\) is larger for \(\pi_0\) close to 0.5 and smaller for \(\pi_0\) close to 1 (the same is true for very small values of \(\pi_0\), but we are not interested in accuracies below 50%). If, for instance, the required accuracy were 0.7, then for a sample size of 1,000 the safety margin would be 0.034 instead of the 0.029 of our example.
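This dependence on \(\pi_0\) can also be checked directly; a small sketch, again with the 1% significance level from the example (`safety_margin` is our own helper name):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.99)   # one-sided z-value for alpha = 1%

def safety_margin(pi0: float, n: int) -> float:
    """Margin z * sigma by which the test accuracy must exceed pi0."""
    return z * math.sqrt(pi0 * (1 - pi0)) / math.sqrt(n)

print(round(safety_margin(0.8, 1000), 3))  # → 0.029
print(round(safety_margin(0.7, 1000), 3))  # → 0.034
```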

For a given required accuracy and significance level, we can determine the test data size necessary to accept a certain test accuracy as good enough. Or, vice versa, for a given test data size we can calculate the safety margin needed in the measured test accuracy. To illustrate this, I have compiled the following table: for given combinations of required accuracy and safety margin, it lists the necessary size of the test data set.
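Solving \(z \cdot \sigma \leq \text{margin}\) for \(n\) yields the minimum test set size. The following sketch assumes the 1% significance level from the example; `required_test_size` is our own helper name:

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.99)   # one-sided z-value for alpha = 1%

def required_test_size(pi0: float, margin: float) -> int:
    """Smallest test set size n such that z * sigma <= margin
    for a required accuracy pi0."""
    return math.ceil((z * math.sqrt(pi0 * (1 - pi0)) / margin) ** 2)

# For the example's required accuracy of 0.8 and a safety margin of 0.03:
print(required_test_size(0.8, 0.03))  # → 963
```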

How does all this apply to other performance metrics such as sensitivity or specificity? Basically, in just the same way, except that for the test set size only the actual positives or negatives are counted, respectively. Want to know more or need support? We’ll be happy to help if you contact us.