Validating Pre-Trained Deep Networks

by Oliver Haase, Oct 9th, 2020

Reuse is a wonderful concept for reducing time, effort, and cost. It can be leveraged on different levels and in different places in machine learning. For one, machine learning libraries and frameworks are a great source of code reuse. Next, the validation of an ML library provides another reuse opportunity: a library validated once can serve multiple projects. And last but not least, machine learning models themselves can be reused and adapted to new classification tasks and new data.

This is exactly what transfer learning is about. In the context of medical device regulation, however, the question arises of how a pre-trained model can be validated for use in a medical device. In this article, I shed some light on this question. To get started, let’s briefly recap how transfer learning works.

Standing on the shoulders of giants

At this point in time, transfer learning is mainly used for image classification. In this domain, the predominant technology is deep neural networks, in particular convolutional nets. These networks often have dozens if not hundreds of hidden layers and millions of weights. Their training can be both very cost- and time-consuming. The aim of transfer learning is to reuse and adapt the resulting models for new, similar classification tasks and new, similar data. Luckily, there is a broad array of public pre-trained models whose prediction performance on public benchmark data sets is well studied and documented. If you’re using, e.g., Keras for model development, then at the time of writing you can choose among around 25 pre-trained models for image classification.

From general to specific

Simply put, a convolutional network is made up of two parts: a convolutional base and a classifier backend. The convolutional base comprises a set of convolutional and max-pooling layers that extract features from the input data. The first layers extract very general features such as edges and regions, whereas the layers further down the network extract more specific features, such as faces or buildings. The classifier backend, often a set of densely connected layers, makes a classification prediction based on the most specific features from the convolutional base.
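
To make this structure concrete, here is a minimal Keras sketch. The choice of VGG16 as the pre-trained model and of ten target classes is purely illustrative:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Convolutional base: stacked convolutional and max-pooling layers
# that extract increasingly specific features from the input images.
conv_base = VGG16(weights="imagenet", include_top=False,
                  input_shape=(224, 224, 3))

# Classifier backend: densely connected layers that turn the most
# specific features into a class prediction (10 classes, for illustration).
model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```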

Freeze and retrain

Adapting a pre-trained model means freezing some of its weights and retraining others. Freezing a weight means that the retraining process cannot change its value. Let’s start with the classifier backend: because this part of the model is highly specific to the exact classification task and to the most specific extracted features, it is always retrained. If the number of prediction classes differs between the original and the new classification task, then the output layer is not only retrained but replaced.
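
Continuing the sketch above, the simplest adaptation freezes the entire convolutional base and retrains only the new classifier backend, whose output layer was replaced outright when the model was assembled:

```python
# Freeze every weight in the convolutional base; only the (replaced)
# classifier backend will be changed by retraining.
conv_base.trainable = False

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_train_data, new_train_labels, ...)  # retrains the backend only
```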

Now to the convolutional base: the boundary between frozen and retrained layers runs right through it. The layers that extract general features will always be reused and thus frozen; otherwise, using a pre-trained model would make very little sense. For the increasingly specific layers, the decision whether to freeze or to retrain them depends on the similarity between the original and the new data, and on the size of the new training data set. The higher the data similarity, the more layers should be frozen. The bigger the new training data set, the more layers can be retrained. A good retraining process strikes a suitable balance between these considerations.
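
In Keras, moving the boundary into the convolutional base might look like the following sketch. The boundary index is a hypothetical choice that would come out of the considerations above:

```python
# Hypothetical boundary: keep the first 15 layers (general features)
# frozen and retrain the remaining, more specific layers.
FREEZE_UNTIL = 15

conv_base.trainable = True
for layer in conv_base.layers[:FREEZE_UNTIL]:
    layer.trainable = False   # general feature extractors stay frozen
for layer in conv_base.layers[FREEZE_UNTIL:]:
    layer.trainable = True    # specific feature extractors are retrained

# Recompile so the changed trainable flags take effect.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```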

Regulatory requirements for machine learning

Before we can reason about the regulatory requirements for pre-trained models, it helps to consider the general case of ML model development and then adapt these considerations to the special case of pre-trained models. For general ML model development, the following steps and concepts and their regulatory counterparts need to be taken into consideration:

  1. Harmonized standard IEC 62304 regulates the development of software for medical devices. From the perspective of this standard, the machine learning development process is embedded in the software unit implementation phase. As such, it needs to follow state-of-the-art best practices.
  2. The IEC 62304 unit implementation phase is completed with unit testing. Because the goal of unit testing is to verify the correctness of the implemented software unit, unit testing translates to model evaluation for the machine learning software unit.
  3. The machine learning library or framework that is used to train the model – Keras, PyTorch, TensorFlow, or the like – is a software tool that is regulated by harmonized standard ISO 13485.
  4. The machine learning library or framework that runs in the medical device to make predictions is considered software of unknown provenance (SOUP). The requirements for the use of SOUP are regulated by IEC 62304.

Regulatory requirements for pre-trained models

If the training process does not start from scratch with an initial model but with a pre-trained model, the most obviously affected concept is the machine learning process itself. In both cases (initial vs. pre-trained model), all steps of the model development process need to be documented and justified. This helps manufacturers with their own quality management, and it helps auditors judge the correctness and soundness of the process.

A typical ML development process – the so-called ML pipeline – starts with collecting and preparing the data, and then comprises phases like model selection, model training, and model evaluation. All these steps – except model selection – are also needed to retrain a pre-trained model.

However, there are significant additional aspects with pre-trained models that need to be carefully considered, documented, and justified. These aspects are:

  • The adequacy of the pre-trained model for the new classification task and the targeted data distribution. This can, e.g., be shown by measuring the prediction performance of the unchanged, pre-trained model on the new classification task, using a set of test data from the target data distribution.
  • The degree of similarity between the original training data and the data for retraining. Descriptive statistics for both data sets are a good foundation for this aspect.
  • How the above aspects justify the choice of the boundary between frozen and retrained layers. Because a good boundary is not easily found, the manufacturer will typically run several retraining processes with different choices. Documenting these choices and their prediction performance helps justify the final choice (see the sketch after this list).
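
As a sketch of the last aspect, the boundary search can be scripted so that each candidate’s validation performance is recorded as documentation. Here, build_model, train_ds, and val_ds are hypothetical: a helper that assembles the model with a given freeze boundary, and the retraining and validation data sets:

```python
# Retrain with several candidate freeze boundaries and record the
# validation performance of each run, as evidence for the final choice.
results = {}
for boundary in (10, 13, 15, 17):            # hypothetical candidate boundaries
    candidate = build_model(freeze_until=boundary)
    history = candidate.fit(train_ds, validation_data=val_ds, epochs=5)
    results[boundary] = max(history.history["val_accuracy"])

print(results)   # maps each candidate boundary to its best validation accuracy
```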

The other three concepts discussed above – model evaluation, and the machine learning libraries used for training and for inference – are not directly affected by the use of a pre-trained model. They can and must be treated the same way as for a model trained from scratch.

If you have questions, want to learn more, or need support to apply the above to your specific machine learning / medical device project, please get in touch with us or leave a comment.