The Connected Health Initiative (CHI) is a coalition of healthcare technology providers (including Apple, Intel, Microsoft, Novo Nordisk, Otsuka Pharmaceutical, Podimetrics, Roche, and many more), healthcare providers (including the American Medical Association and various university hospitals), and academia.
The CHI has published a whitepaper titled “Machine Learning and Medical Devices: Connecting practice to policy (and back again)”. The paper is essentially a gap analysis of today's state of the art in machine learning development and quality management, compared with the maturity and soundness of traditional, code-based software development. The aim of the 30-pager is “to provide a baseline that the Food and Drug Administration (FDA) and other governmental and non-governmental stakeholders can leverage in their ongoing consideration of the topic.”
Because the whitepaper contains several insights and perspectives that are interesting and relevant, both for the FDA's ongoing machine learning regulatory efforts and for the European market, where manufacturers and notified bodies alike are in search of commonly agreed-upon good machine learning practices, I've summarized the paper's main takeaways in this short article. Wherever I directly cite the whitepaper, I've used quotation marks and italics.
The document consists of three interrelated parts:
- The first part traces and discusses ML-specific properties throughout the software development and DevOps lifecycle.
- The second part is a review of the FDA's Proposed Regulatory Framework for Modifications to AI/ML-Based Software as a Medical Device.
- The third part (disguised as appendix C) gives a brief outlook on a possible regulatory approach for machine learning that goes beyond the current total product lifecycle approach.
1. ML-specific properties throughout the software development and DevOps lifecycle
In this part, the authors trace ML-specific properties across three affected areas of development and DevOps: (1) the software development lifecycle, (2) quality management, and (3) software security and risk management.
Software Development Lifecycle
The key difference between traditional software development and machine learning development is that the former is code-centric, whereas the latter is data-centric. A direct consequence of this difference is that with machine learning, the direct connection between code and functional behavior is lost. From this starting point, the authors identify the following challenges and required adaptations of the code-centric approach:
- Because the quality of the training and test data is key to the quality of the machine learning model, a “collection of controls are needed to ensure that effective quality, audit and sourcing remain in place.” (See the sketch after this list for an illustration of what such controls could look like.)
- Untrained models usually come in the form of machine learning libraries and frameworks such as TensorFlow, PyTorch, Keras, and the like. The authors note that “reusable untrained models are a special class of reusable code that, given their heightened impact on development outcomes, require a proportionate increase in governance.”
- With regard to IDEs and other 3rd party software development tools, the paper states that “IDE’s and IT security and risk management frameworks must evolve in-kind to keep pace with the consequences of including 3rd party Data + Output and/or Untrained Models into the modern software supply chain.”
- Because of the missing connection between code and behavior, “ML programs may require compensating mechanisms to ensure comparable degrees of transparency, reliability and auditability.”
- Concerning the potential use of continuous learning systems, the authors are remarkably cautious and risk-aware: “Owners and regulators of sensitive and high-risk applications that must include human inspection may need to consider a blanket prohibition of these subcategories of Machine Learning until new norms about acceptable risk and transparency can be established. At a minimum, a greater understanding of the limitations and side-effects of deployed machine learning algorithms will be required by auditors and regulators.”
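The whitepaper doesn't prescribe concrete controls, so purely as an illustration, here is a minimal sketch in Python of what data quality, audit, and sourcing controls could look like in practice. The expected schema, the missing-value budget, and the file names are all assumptions chosen for this example, not taken from the paper.

```python
import hashlib
import json
from pathlib import Path

import pandas as pd

# Assumed schema and thresholds -- purely illustrative values.
EXPECTED_COLUMNS = {"patient_id", "age", "glucose_level", "outcome"}
MAX_MISSING_FRACTION = 0.01  # tolerate at most 1% missing values per column


def fingerprint(path: Path) -> str:
    """Hash the raw training data so the exact version used can be audited later."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_training_data(path: Path) -> dict:
    """Run basic quality and sourcing controls on a training data set."""
    df = pd.read_csv(path)

    # Quality control: the data must match the expected schema.
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        raise ValueError(f"training data lacks expected columns: {missing_cols}")

    # Quality control: no column may exceed the missing-value budget.
    too_sparse = {
        col: frac
        for col, frac in df.isna().mean().items()
        if frac > MAX_MISSING_FRACTION
    }
    if too_sparse:
        raise ValueError(f"columns exceed missing-value budget: {too_sparse}")

    # Audit control: record provenance alongside the data itself.
    balance = df["outcome"].value_counts(normalize=True)
    return {
        "source_file": str(path),
        "sha256": fingerprint(path),
        "n_rows": len(df),
        "class_balance": {str(k): float(v) for k, v in balance.items()},
    }


if __name__ == "__main__":
    record = check_training_data(Path("training_data.csv"))
    Path("data_audit_record.json").write_text(json.dumps(record, indent=2))
```

Persisting such an audit record next to each trained model version would let a reviewer verify later exactly which data went into a given release.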
Software Quality Management
Because of the shift from code-centric to data-centric development, new data-centric quality practices and metrics need to be developed. In particular, this affects the following aspects:
- Functional completeness: Gaps in the training data (an insufficient amount of data, lopsided data across activities or outcomes, missing activities or outcomes, irrelevant activities and outcomes) can lead to various functional deficiencies in the trained model, such as unsuitability for some users or intended patients, poor accuracy, and a lack of predictability and transparency.
- Comprehensibility: “a reviewer must have specialized data science expertise and be knowledgeable in the strengths and limitations of the applied model(s) and the data staging/cleansing/sampling techniques.”
- Auditability: Because of the missing connection between code and behavior, “A consensus on acceptable alternatives to traditional event logging in code-based applications are needed to provide a comparable degree of assurance.” (The first sketch below shows one possible form such an alternative could take.)
- Testability: “Exception detection, defect definition, and related KPI’s (including testing cost) must be established to effectively model the severity and cost of ML application defects specifically related to under-performance.” (The second sketch below illustrates a simple under-performance check.)
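The whitepaper calls for a consensus rather than proposing a concrete logging mechanism, but to make the auditability point tangible, here is an illustrative sketch of structured prediction logging: each inference is recorded together with the model version and a hash of the input, so that individual outputs can later be traced and audited. The event fields and the versioning scheme are assumptions for this example.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ml_audit")
logging.basicConfig(level=logging.INFO)

MODEL_VERSION = "model-2024.1"  # assumed versioning scheme


def log_prediction(features: dict, prediction: float, confidence: float) -> None:
    """Write a structured audit event for a single model inference."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        # Hash rather than store raw inputs, to keep patient data out of logs.
        "input_sha256": hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()
        ).hexdigest(),
        "prediction": prediction,
        "confidence": confidence,
    }
    logger.info(json.dumps(event))
```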
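Similarly illustrative, and not from the paper: a minimal sketch of how under-performance could be treated as a defect against a predefined KPI. The 90% accuracy threshold is an assumed acceptance criterion.

```python
from sklearn.metrics import accuracy_score

ACCEPTANCE_THRESHOLD = 0.90  # assumed, release-specific acceptance criterion


def detect_underperformance(y_true, y_pred) -> dict:
    """Treat accuracy below the acceptance threshold as a reportable defect."""
    accuracy = accuracy_score(y_true, y_pred)
    return {
        "accuracy": accuracy,
        "threshold": ACCEPTANCE_THRESHOLD,
        "defect": accuracy < ACCEPTANCE_THRESHOLD,
    }


# Example: two misclassifications out of six predictions -> accuracy ~0.67 -> defect.
report = detect_underperformance(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 0, 0],
)
assert report["defect"]
```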