Header image: abstract "data" illustration generated with Stable Diffusion.

How Much Test Data Is Enough?

A very common and essential question when setting up a machine learning training process is how much test data we need so that the measured prediction performance can be trusted. Important as this question is, it is oversimplified: whether the measured test performance can be trusted depends on the required level of performance, the size of the test data set, and the safety margin between the measured test performance and the required performance.

Typical model evaluation approach

Usually, the process to evaluate the prediction performance of a machine learning model is roughly as follows:

A reliable answer to this question needs a sound statistical and quantitative foundation.

The power of sampling

Ideally, what we want to know is the accuracy of the classifier over all possible inputs, that is, the proportion of correctly classified inputs among all possible inputs. Because this set is far too large, and because we don’t know the ground truth (don’t have labels) for most of the data, we cannot directly measure the real accuracy of the model.

Instead, we take a random sample of labeled inputs, the test data set, and measure the accuracy of the predictions on it. Inferential statistics provides the knowledge and tools to quantify how far a measured test performance can be trusted. The first important fact is stated in the central limit theorem applied to Bernoulli distributions. Luckily, this theorem is less complicated than it sounds. What it says, slightly simplified and adapted to our scenario, is this:

If you take all possible samples of the same size, then for growing sample sizes the distribution of the sample accuracies approximates a normal (Gaussian bell curve) distribution. This normal distribution has its mean at the real model accuracy, \(\pi\), and a standard deviation, \(\sigma\), that depends on the model accuracy and the sample size n, more specifically:

\begin{equation}
\sigma = \frac{\sqrt{\pi(1-\pi)}}{\sqrt{n}}
\end{equation}

That is, we know that the accuracy we measure with the test set belongs to a normal distribution as shown on the right. This in turn gives us a probabilistic hint as to how far from the test accuracy the model accuracy is likely to be.

For instance, we know that about 68% of all test accuracies (if we were to take multiple test samples) will fall into the interval \([\pi - \sigma,\ \pi + \sigma]\). Combined with the second statistical tool, hypothesis tests, we can now use this knowledge to quantify the evidential power of the measured test accuracy.
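To make this tangible, here is a minimal simulation sketch in Python. It assumes a hypothetical classifier with a true accuracy of \(\pi = 0.8\) and test sets of \(n = 1{,}000\) samples; both values and all function names are chosen for illustration only. It draws many test sets and checks how the measured accuracies scatter around \(\pi\).

```python
import math
import random
import statistics


def theoretical_sigma(pi: float, n: int) -> float:
    """Standard deviation of the sample accuracy, as in equation (1)."""
    return math.sqrt(pi * (1 - pi)) / math.sqrt(n)


def simulate_test_accuracies(pi: float, n: int, repetitions: int, seed: int = 0) -> list[float]:
    """Draw `repetitions` test sets of size n; each prediction is correct with probability pi."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < pi for _ in range(n)) / n
        for _ in range(repetitions)
    ]


# Illustrative assumptions, not values measured on a real model.
PI, N, REPETITIONS = 0.8, 1000, 2000

sigma = theoretical_sigma(PI, N)
accuracies = simulate_test_accuracies(PI, N, REPETITIONS)
within_one_sigma = sum(abs(a - PI) <= sigma for a in accuracies) / REPETITIONS

print(f"theoretical sigma:        {sigma:.4f}")
print(f"mean of test accuracies:  {statistics.mean(accuracies):.4f}")
print(f"stdev of test accuracies: {statistics.stdev(accuracies):.4f}")
print(f"share within one sigma:   {within_one_sigma:.2f}")
```

With these settings, roughly two thirds of the simulated test accuracies should land within one \(\sigma\) of \(\pi\). The exact share varies slightly from run to run and because accuracies on a finite test set can only take discrete values.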

A micro primer on hypothesis testing

Before we apply the technique of hypothesis testing to our domain, and in particular to our example scenario, let’s briefly revisit its general line of argument:

The key to hypothesis testing is to formulate \(H_0\) such that (1) it can be rejected on the grounds of the measurements and the statistical assumptions, and (2) its opposite is what we really want to prove.

Hypothesis testing for model evaluation

Now, let’s apply this technique to our application scenario:
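One common way to write this down is a one-sided z-test. The following is a sketch under stated assumptions rather than a definitive recipe: the symbols \(\hat{p}\) (measured test accuracy), \(\alpha\) (significance level), \(z_{1-\alpha}\) (the corresponding quantile of the standard normal distribution) and \(\sigma_0\) are introduced here for illustration, and \(\pi_0\) denotes the required accuracy.

\[
H_0:\ \pi \le \pi_0 \qquad \text{versus} \qquad H_1:\ \pi > \pi_0
\]

If \(H_0\) held with \(\pi = \pi_0\), the test accuracy would be approximately normally distributed with mean \(\pi_0\) and standard deviation \(\sigma_0\). We would then reject \(H_0\) at significance level \(\alpha\) if

\[
\hat{p} > \pi_0 + z_{1-\alpha}\,\sigma_0, \qquad \sigma_0 = \frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{n}}
\]

In this formulation, the term \(z_{1-\alpha}\,\sigma_0\) is the safety margin that the measured test accuracy must clear above the required accuracy.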

Sample size + accuracy = safety margin

As we’ve seen in equation (1), the standard deviation \(\sigma\) and consequently calculation (2) depend on the size of the test data set. If the test set in our example had, say, 500 rather than 1,000 entries, calculation (1) would yield 0.842 and we would not be able to reject the null hypothesis.
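The effect of the test set size can be checked with a short Python sketch. Note the assumptions: a required accuracy of 0.8 and a one-sided significance level of 1% are assumed here because they are consistent with the figures quoted in the text, and the function name `min_passing_accuracy` is hypothetical.

```python
import math
from statistics import NormalDist


def min_passing_accuracy(required_accuracy: float, n: int, alpha: float = 0.01) -> float:
    """Smallest test accuracy that rejects H0 (model accuracy <= required_accuracy).

    Uses a one-sided z-test with the normal approximation of equation (1).
    """
    z = NormalDist().inv_cdf(1 - alpha)  # about 2.326 for alpha = 0.01
    sigma = math.sqrt(required_accuracy * (1 - required_accuracy)) / math.sqrt(n)
    return required_accuracy + z * sigma


# Assumed scenario: required accuracy 0.8, significance level 1% (one-sided).
for n in (1000, 500):
    print(n, round(min_passing_accuracy(0.8, n), 3))
# prints:
# 1000 0.829
# 500 0.842
```

Under these assumptions, a measured test accuracy between 0.829 and 0.842 would be convincing with 1,000 test samples but not with 500.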

Secondly, \(\sigma\) also depends on the required accuracy \(\pi_0\). More specifically, \(\sigma\) becomes bigger for \(\pi_0\) close to 0.5 and smaller for \(\pi_0\) close to 1 (the same is true for very small values of \(\pi_0\), but we are not interested in accuracies below 50%). If, for instance, the required accuracy were 0.7, then for a sample size of 1,000 the safety margin would become 0.034 instead of 0.029 as in our example.
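To see how the safety margin moves with the required accuracy, the following Python sketch tabulates it for a few values of \(\pi_0\). It assumes a one-sided significance level of 1%, which is consistent with the two margins quoted above, and uses a hypothetical helper named `safety_margin`.

```python
import math
from statistics import NormalDist


def safety_margin(required_accuracy: float, n: int, alpha: float = 0.01) -> float:
    """Margin by which the test accuracy must exceed the required accuracy."""
    z = NormalDist().inv_cdf(1 - alpha)
    return z * math.sqrt(required_accuracy * (1 - required_accuracy) / n)


# Margins for a test set of 1,000 samples; they shrink as pi_0 moves away from 0.5.
for pi_0 in (0.5, 0.7, 0.8, 0.9, 0.99):
    print(f"pi_0 = {pi_0:.2f}  margin = {safety_margin(pi_0, 1000):.3f}")
```

For \(\pi_0 = 0.7\) and \(\pi_0 = 0.8\), this yields margins of 0.034 and 0.029 respectively.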

For a given required accuracy and significance level, we can determine the test data size needed to accept a certain test accuracy as good enough. Or vice versa, we can calculate the safety margin the measured test accuracy needs for a given test data size. To illustrate this, I have compiled the following table. For each combination of required accuracy and safety margin, it lists the necessary size of the test data set.

z-score table for statistical significance
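Entries of such a table can be approximated with a few lines of Python. The sketch below solves the margin formula for \(n\). It assumes a one-sided significance level of 1% and uses illustrative grids of required accuracies and safety margins, so its output is not a reproduction of the original table. The name `required_test_size` is hypothetical.

```python
import math
from statistics import NormalDist


def required_test_size(required_accuracy: float, margin: float, alpha: float = 0.01) -> int:
    """Smallest n for which z * sqrt(pi_0 * (1 - pi_0) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - alpha)
    return math.ceil(z**2 * required_accuracy * (1 - required_accuracy) / margin**2)


# Illustrative grids (assumptions, not the values of the original table).
accuracies = (0.7, 0.8, 0.9, 0.95)
margins = (0.01, 0.02, 0.03, 0.05)

# Header row: one column per safety margin.
print("pi_0   " + "".join(f"{m:>8.2f}" for m in margins))
for pi_0 in accuracies:
    row = "".join(f"{required_test_size(pi_0, m):>8d}" for m in margins)
    print(f"{pi_0:<7.2f}{row}")
```

Because the required size grows with the inverse square of the margin, halving the safety margin roughly quadruples the number of test samples needed.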

How does all this apply to other performance metrics such as sensitivity or specificity? Essentially in the same way, except that the relevant test set size is the number of actual positives (for sensitivity) or actual negatives (for specificity) rather than the total number of test samples. Want to know more or need support? We’ll be happy to help if you contact us.