What are Good Machine Learning Practices? 

The CHI Perspective

by Oliver Haase, February 25th, 2021

Tags: good practices, AI regulation, conformity

The Connected Health Initiative (CHI) is a coalition of healthcare technology providers (including Apple, Intel, Microsoft, Novo Nordisk, Otsuka Pharmaceutical, Podimetrics, Roche, and many more), healthcare providers (including the American Medical Association and various university hospitals), and academia.

The CHI has published a whitepaper titled Machine Learning and Medical Devices: Connecting practice to policy (and back again). The paper is essentially a gap analysis that compares today's state of the art in machine learning development and quality management with the maturity and soundness of traditional, code-based software development. The aim of the 30-pager is "to provide a baseline that the Food and Drug Administration (FDA) and other governmental and non-governmental stakeholders can leverage in their ongoing consideration of the topic."

Because the whitepaper contains several insights and perspectives that are relevant both to the FDA's ongoing machine learning regulatory efforts and to the European market, where manufacturers and notified bodies alike are searching for commonly agreed-upon good machine learning practices, I've summarized the paper's main takeaways in this short article. Wherever I directly cite the whitepaper, I've used quotation marks and italics.

Document Structure 

The document consists of three interrelated parts:

1. ML-specific properties throughout the software development and DevOps lifecycle

In this part, the authors trace the ML-specifics along three affected areas of development and DevOps, namely (1) the software development lifecycle, (2) quality management, and (3) software security and risk management.

Software Development Lifecycle

Obviously, the key difference between traditional software development and machine learning development is that the former is code-centric, whereas the latter is data-centric. A direct consequence of this difference is that with machine learning, the direct connection between code and functional behavior is lost. From this starting point, the authors identify the following challenges and required adaptations of the code-centric approach:

  • Because the quality of the training and test data is key to the quality of the machine learning model, a “collection of controls are needed to ensure that effective quality, audit and sourcing remain in place." 
  • Untrained models usually come as machine learning libraries and frameworks such as TensorFlow, PyTorch, Keras and the like. The authors note that “reusable untrained models are a special class of reusable code that, given their heightened impact on development outcomes, require a proportionate increase in governance." 
  • With regard to IDEs and other 3rd party software development tools, the paper states that “IDE’s and IT security and risk management frameworks must evolve in-kind to keep pace with the consequences of including 3rd party Data + Output and/or Untrained Models into the modern software supply chain.”
  • Because of the missing connection between code and behavior, “ML programs may require compensating mechanisms to ensure comparable degrees of transparency, reliability and auditability."
  • Concerning the potential use of continuous learning systems, the authors are remarkably cautious and risk-aware: “Owners and regulators of sensitive and high-risk applications that must include human inspection may need to consider a blanket prohibition of these subcategories of Machine Learning until new norms about acceptable risk and transparency can be established. At a minimum, a greater understanding of the limitations and side-effects of deployed machine learning algorithms will be required by auditors and regulators."
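To make the called-for data quality, audit, and sourcing controls a bit more concrete, here is a minimal sketch of one such control. It is my own illustration, not a technique from the whitepaper: a reproducible fingerprint over a dataset version, so that an audit trail can prove which exact data a model was trained on and detect silent changes.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Compute a reproducible SHA-256 fingerprint over a dataset.

    Serializing each record in canonical (sorted-key) JSON form makes
    the hash independent of dict insertion order, so logically identical
    data always yields the same fingerprint for the audit trail.
    """
    h = hashlib.sha256()
    for record in records:
        h.update(json.dumps(record, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

# Two logically identical dataset versions produce the same fingerprint,
# so any silent modification of the training data becomes detectable.
v1 = [{"patient_id": 1, "label": "positive"}]
v1_reordered = [{"label": "positive", "patient_id": 1}]
assert dataset_fingerprint(v1) == dataset_fingerprint(v1_reordered)
```

Storing such fingerprints alongside each trained model release is one simple way to keep the data lineage auditable.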

Software Quality Management

Because of the shift from code- to data-centric development, data-centric quality practices and metrics need to be developed. In particular, this affects the following aspects:

  • Functional completeness: Gaps in the training data (insufficient amount of data, lopsided data across activities or outcomes, missing activities or outcomes, irrelevant activities and outcomes) can lead to various functional deficiencies in the trained model, such as unsuitability for some users or intended patients, poor accuracy, or a lack of predictability and transparency.
  • Comprehensibility: “a reviewer must have specialized data science expertise and be knowledgeable in the strengths and limitations of the applied model(s) and the data staging/cleansing/sampling techniques."
  • Auditability: Because of the missing connection between code and behavior, “A consensus on acceptable alternatives to traditional event logging in code-based applications are needed to provide a comparable degree of assurance.”
  • Testability: "Exception detection, defect definition, and related KPI’s (including testing cost) must be established to effectively model the severity and cost of ML application defects specifically related to under-performance." 
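As a hypothetical illustration of a data-centric quality metric for functional completeness (my own sketch, not from the whitepaper), one could check the training labels for missing or severely underrepresented outcomes; the 5% threshold below is an arbitrary assumption:

```python
from collections import Counter

def check_label_coverage(labels, expected_labels, min_fraction=0.05):
    """Flag functional-completeness gaps in the training data:
    outcomes that are missing entirely, and outcomes so
    underrepresented that the model may perform poorly on them.
    """
    counts = Counter(labels)
    total = len(labels)
    missing = [l for l in expected_labels if counts[l] == 0]
    underrepresented = [
        l for l in expected_labels
        if 0 < counts[l] < min_fraction * total
    ]
    return {"missing": missing, "underrepresented": underrepresented}

# Lopsided training data: almost everything "healthy", no "critical" cases.
labels = ["healthy"] * 95 + ["at_risk"] * 4
report = check_label_coverage(labels, ["healthy", "at_risk", "critical"])
assert report["missing"] == ["critical"]
assert report["underrepresented"] == ["at_risk"]
```

A gap report like this could feed directly into the kind of defect definitions and KPIs the testability point calls for.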

Software Security and Risk Management

Due to machine learning's short history, the set of potential abuse cases, vulnerabilities, exploits, and controls is very likely incomplete. Adversarial attacks are just one prominent example of a new kind of attack on a known application surface, i.e. the application's input interface.

Training and test data are new attack surfaces, exposed to attacks such as data poisoning. In short, machine learning security and risk management is far behind traditional security and risk management and requires special attention and consideration.

2. Review of the FDA's Proposed Regulatory Framework for Modifications to AI/ML-Based Software as a Medical Device

The proposed regulatory framework for modifications to AI/ML-based SaMD uses a combination of concepts and controls that aim at ensuring the quality and safety of an AI/ML-based SaMD. The whitepaper reviews them in light of the current state-of-the-art of machine learning development and quality management as discussed in the first part of the paper:

  • The proposed framework uses the concept of a Culture of Quality and Organizational Excellence (CQOE) as a process-oriented quality approach. However, because a common understanding of good machine learning practices is still in flux and key quality controls are missing or immature, the question arises of how a reliable "culture of quality and organizational excellence" can be established on top of these evolving underlying assumptions. This is especially true as the concept of CQOE historically relies on the code-centric standard IEC 62304.
  • The Pre-Market Assurance of Safety and Effectiveness is the product-level quality assurance of an AI/ML-based SaMD before market entry. While the framework identifies the existing quality control gaps, solutions have not yet been addressed.
  • To allow the AI/ML-based SaMD to continuously learn and improve, the framework defines two new controls, the SaMD Pre-Specifications that describe the envisioned improvements of the product and the Algorithm Change Protocol that describes how the changes are safely performed. As described in the first part of the whitepaper, the authors are rather sceptical about continuously learning systems at this point of maturity: "Without some breakthroughs in transparency and monitoring, many of the most dynamic learning algorithms will most likely be entirely prohibited for use inside SaMDs." In addition, as the authors point out, “Long-standing requirement that updates must affect all copies of a device without independent evolution prohibits a subset of continuous learning systems.”
  • Real-World Performance Monitoring complements the process-oriented approach with structured, ongoing monitoring of the AI/ML-based SaMD in the market. As the authors note: "Special care must be taken to correctly interpret results as a measure of ML model performance and differences between SaMD model releases."
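To illustrate what structured real-world performance monitoring could look like in its simplest form, here is a rolling-accuracy sketch. It is my own illustration; the window size and the accuracy floor are arbitrary assumptions, not values from the framework:

```python
from collections import deque

class PerformanceMonitor:
    """Rolling real-world accuracy monitor for a deployed model.

    Tracks the most recent predictions against ground truth and flags
    the model for review when recent accuracy drops below a
    pre-specified floor (both parameters are illustrative).
    """

    def __init__(self, window=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, prediction, ground_truth):
        self.outcomes.append(prediction == ground_truth)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        # Only alert once the window holds enough evidence.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.min_accuracy

monitor = PerformanceMonitor(window=100, min_accuracy=0.9)
for _ in range(85):
    monitor.record("positive", "positive")   # correct predictions
for _ in range(15):
    monitor.record("positive", "negative")   # misclassifications
assert monitor.needs_review()  # 85% recent accuracy is below the floor
```

Note the caveat quoted above still applies: a drop in such a metric must be interpreted carefully, since it can reflect a shift in the patient population as much as a genuine model regression.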

In addition to the review of these concrete components of the proposed framework, the whitepaper raises the concern that the usual “heavy reliance on standards with their revision policies may not be fast and flexible enough to keep pace with fast-changing technologies and practices.”

3. Beyond the Total Product Lifecycle Approach

Even though the authors are sceptical about continuous learning systems at the current state of the art, they propose an approach similar to the periodic recertification of health professionals. They compare the training, testing, and certification of a physician to the good machine learning practices (GMLP) applied to an AI/ML-based SaMD. After certification, a physician keeps learning and improving, and governance measures are in place to revoke a certification if needed.

The authors suggest employing a similar system in which each instance of an AI/ML-based SaMD is monitored for its real-world performance and periodically re-certified, or taken off the market. Such a system would require a mix of technologies to reliably authenticate and certify each individual medical device instance. The authors also note that the complexity of such a system only makes sense if the added benefit of continuous learning clearly exceeds that of frozen systems.
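In the spirit of this proposal, the per-instance lifecycle might be sketched as follows. This is purely my own hypothetical illustration of the mechanics; the validity period, accuracy floor, and class design are invented for the example:

```python
from datetime import date, timedelta

class DeviceCertificate:
    """Hypothetical per-instance certificate for a continuously
    learning SaMD: each deployed instance is certified for a fixed
    period and must pass a real-world performance review to be
    renewed; a failed review revokes the certificate, i.e. the
    instance is taken off the market.
    """

    def __init__(self, device_id, issued, valid_days=365):
        self.device_id = device_id
        self.issued = issued
        self.expires = issued + timedelta(days=valid_days)
        self.revoked = False

    def is_valid(self, today):
        return not self.revoked and today <= self.expires

    def renew(self, today, recent_accuracy, min_accuracy=0.9):
        # Renewal hinges on the instance's observed real-world
        # performance, mirroring a physician's recertification exam.
        if recent_accuracy >= min_accuracy:
            self.expires = today + timedelta(days=365)
            return True
        self.revoked = True  # failed review: pull this instance
        return False

cert = DeviceCertificate("samd-instance-42", issued=date(2021, 1, 1))
assert cert.is_valid(date(2021, 6, 1))
assert not cert.renew(date(2021, 12, 1), recent_accuracy=0.8)
assert not cert.is_valid(date(2021, 12, 2))  # revoked after failed review
```

The hard part, as the authors note, is not this bookkeeping but reliably authenticating each individual device instance and attributing real-world performance to it.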