Regulation requires machine learning libraries to be validated before they can be used in medical devices. On this page we discuss these regulatory requirements, what they mean for machine learning libraries, and how to meet them without having to reinvent the wheel.
Can we build ML models without ML libraries?
Examples of ML libraries or frameworks are TensorFlow, PyTorch and Keras for neural networks, and XGBoost for gradient boosting. These libraries contain machine learning algorithms that can be fed with labeled training data to build machine learning models. During training, the ML algorithm learns patterns from the data such that the trained model can make predictions for new, unlabeled data. A well-trained machine learning model can, e.g., detect tumors in CT images.
Training a model is highly complex. ML libraries are the reason why today anyone who has access to high-quality data can build powerful ML models fairly easily. Building an ML model without an ML library is possible, but it would require implementing an ML algorithm from scratch, which translates to person-years of development.
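As a toy illustration of what "implementing an ML algorithm yourself" entails even in the simplest case, the following sketch hand-rolls logistic regression with batch gradient descent in plain NumPy (all data and hyperparameters here are made up for the example). A production-grade library adds optimizers, automatic differentiation, GPU kernels, numerical safeguards, and far more.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled training data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-written training loop: the part an ML library would provide.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)            # forward pass
    grad_w = X.T @ (p - y) / len(y)   # gradient of the log loss w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. the bias
    w -= 0.5 * grad_w                 # gradient descent step, learning rate 0.5
    b -= 0.5 * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Even this minimal example needs gradients derived by hand; scaling the same idea to deep networks is exactly the engineering effort that libraries like TensorFlow and PyTorch encapsulate.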
What regulatory requirements apply to the use of ML libraries in medical devices?
Done right, ML libraries can be used in medical devices, providing an invaluable source of code reuse. In the context of the European medical device regulation MDR, machine learning libraries play two different roles in the development and deployment of medical device software:
- During the modeling phase, ML libraries are used to train the prospective ML model. Because this training is part of the software development process and does not run on the resulting medical device, the training part of an ML library is a software tool and is regulated by the harmonized standard ISO 13485, Medical devices – Quality management systems.
- When the resulting model becomes part of the medical device, the ML library is software of unknown provenance (SOUP) and is regulated by the harmonized standard IEC 62304, Medical device software – Software life cycle processes. More specifically, IEC 62304 defines SOUP as:
"Software item that is already developed and generally available and that has not been developed for the purpose of being incorporated into the product or software item previously developed for which adequate records of the development processes are not available."
What's the difference from traditional third party libraries?
In traditional software development, third party libraries are used for certain functionalities such as statistics packages, while most of the code is hand-written. With machine learning, the entire program code, that is, the trained model, is third party code and thus SOUP. The machine learning development process consists of collecting and curating high-quality data and of selecting and configuring the best possible ML model, but not of writing the resulting program code. Consequently, validating the correctness of the ML library that is used, together with model evaluation, becomes the cornerstone of software verification.
Doesn't model evaluation implicitly prove the correctness of the machine learning library?
No, neither from a regulatory nor a software testing point of view:
- IEC 62304 clearly states the requirements for usage of SOUP, in particular that the expected functionality be specified and verified.
- From traditional testing we know that testing can only show the presence of bugs, never their absence. This is why good software development practices require testing at different stages and different software granularities, to increase the chances of fault detection. The inherent limitations of testing become especially obvious in machine learning: even minimal changes in the input data can cause completely unexpected behavior, as we know from adversarial attacks. Therefore, machine learning library validation is one important building block in a comprehensive software verification strategy.
Libraries like TensorFlow and PyTorch are used by thousands of ML applications. Can't we assume that any bugs would already have been detected and corrected?
Unfortunately, no. The widespread use of a library is an important aspect in the big picture, but it does not suffice by itself. One might remember the so-called Heartbleed bug, introduced into the OpenSSL implementation in 2012. OpenSSL was, and still is, extremely widespread, much more so than TensorFlow and PyTorch today. The Heartbleed bug was introduced by a PhD student who submitted a new feature and forgot a plausibility check; the change was accepted and integrated by a single member of the OpenSSL community. It rendered over half a million web servers vulnerable to data exploits and was considered the worst known vulnerability at the time.
Also, the issue tracker of, e.g., TensorFlow lists thousands of open issues, and the number keeps growing. Luckily, most of these issues are minor or concern certain very rare hardware and software configurations, but some must indeed be taken seriously.
Ok, and how can machine learning libraries be validated?
The answer depends on the library's role as a SOUP or a software tool:
- For SOUP validation, the inference functionality of the library must be validated. That is, it must be shown that the predictions produced by the library are consistent with the model's weights. This proof has nothing to do with prediction quality, only with consistency with the model. The verification requires three components:
- A specification of the expected inference behavior,
- A test oracle to compare the model predictions with,
- Adequate test data.
- For tool validation, the manufacturer can use a risk-proportionate approach. Complete verification of the library's training functionality is not required, and would be infeasible anyway, because the training functionality accounts for the vast majority of a machine learning library. For tool validation, suitable techniques for model evaluation and model transparency can be used to prove that the result of the training process corresponds to the training data. As with SOUP validation, this proof has nothing to do with the quality of the trained model, but only with the correctness of the training process.
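The two checks above can be sketched in miniature. In the following, every name and number is an assumption made for illustration: `library_predict` is a stand-in for the SOUP library's actual inference call (e.g. a Keras `model.predict`), the explicit-loop re-implementation plays the role of the independent test oracle, and a linear model with a closed-form solution stands in for a training process that can be checked against an analytical oracle.

```python
import math
import numpy as np

# --- SOUP validation: is inference consistent with the model weights? ---

W = np.array([[0.2, -0.5], [0.8, 0.1], [-0.3, 0.4]])  # fixed model weights
b = np.array([0.05, -0.1])                             # (3 inputs, 2 outputs)

def library_predict(X):
    """Stand-in for the SOUP library's vectorized inference call."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def oracle_predict(X):
    """Independent test oracle: the specified inference behavior,
    re-implemented with explicit per-sample, per-output loops."""
    out = np.empty((len(X), len(b)))
    for i, x in enumerate(X):
        for j in range(len(b)):
            z = sum(x[k] * W[k, j] for k in range(len(x))) + b[j]
            out[i, j] = 1.0 / (1.0 + math.exp(-z))
    return out

rng = np.random.default_rng(42)
X_test = rng.uniform(-5.0, 5.0, size=(200, 3))  # toy stand-in for test data

inference_ok = np.allclose(library_predict(X_test), oracle_predict(X_test))
print("inference consistent with model weights:", inference_ok)

# --- Tool validation: does the training result correspond to the data? ---
# For a model class with a known closed-form solution, the training
# functionality under test can be compared against that analytical oracle.

X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w_trained, *_ = np.linalg.lstsq(X, y, rcond=None)  # "training" under test
w_oracle = np.linalg.solve(X.T @ X, X.T @ y)       # normal-equations oracle

training_ok = np.allclose(w_trained, w_oracle)
print("training result matches analytical oracle:", training_ok)
```

Note that neither check says anything about how good the model is; both only establish that the library computes what the specification and the training data say it should, up to a justified numerical tolerance.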
Does everyone have to do this? Or can a library be validated once and for all?
Completely validating a full-grown ML library is infeasible: it would mean redoing all the generic testing its community already does, and doing it better. The good news is that this is unnecessary. Instead, it suffices to prove the correctness of the ML library for the specific model that is to be used in the medical product. This more specific verification is non-trivial but feasible; it also means that each machine-learning-based medical device requires its own library validation. The other good news is that the validation strategy remains the same across ML models and medical products. In other words, machine learning libraries are an invaluable source of reuse not only on the level of functionality, but also on the level of validation.
What kind of support can I get?
To help medical device manufacturers get their machine learning code certified, we have developed and implemented a concept for ML library validation. The concept covers both SOUP and tool validation and consists of validation documentation as well as test code. Our reference implementation contains readily reusable as well as project-specific parts. That is, it can be adapted to the machine learning component of your medical device with little effort compared to reinventing the ML library validation wheel. Also, because we have invested considerable time, effort, and expertise in our blueprint concept, you will get a thorough, adequate, and state-of-the-art solution from experts in the field.
If you think this work could help your business, we would be happy to hear from you; just drop us a short note via our contact form.