First Opinion

Health-related artificial intelligence needs rigorous evaluation and guardrails

Algorithms can augment human decision-making by integrating and analyzing more data, and more kinds of data, than a human can comprehend. But to realize the full potential of artificial intelligence (AI) and machine learning (ML) for patients, researchers must foster greater confidence in the accuracy, fairness, and usefulness of clinical AI algorithms.

Getting there will require guardrails — along with a commitment from AI developers to use them — that ensure consistency and adherence to the highest standards when creating and using clinical AI tools. Such guardrails would not only improve the quality of clinical AI but would also instill confidence among patients and clinicians that all tools deployed are reliable and trustworthy.

STAT, along with researchers from MIT, recently demonstrated that even “subtle shifts in data fed into popular health care algorithms — used to warn caregivers of impending medical crises — can cause their accuracy to plummet over time.”

Experts have long been aware that data shifts, which happen when an algorithm must process data that differ from those used to create and train it, adversely affect algorithmic performance. State-of-the-art tools and best practices exist to tackle them in practical settings. But awareness and implementation of these practices vary among AI developers.
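As a rough illustration of what checking for a data shift can look like, the sketch below compares one input feature in the training data against the same feature in recently seen data, using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). This is only one of many possible checks and is not drawn from the STAT and MIT analysis. The function names, the sample values, and the 0.2 alert threshold are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the largest gap between the empirical CDFs of two samples.

    0.0 means the two samples are distributed identically;
    1.0 means they do not overlap at all.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def looks_shifted(reference, current, threshold=0.2):
    """Flag a feature for review when the gap exceeds a chosen threshold.

    The default threshold is an arbitrary placeholder, not a clinical standard.
    """
    return ks_statistic(reference, current) > threshold
```

For instance, with made-up lab values, `looks_shifted([1.0, 1.2, 1.1, 0.9], [2.4, 2.6, 2.2, 2.5])` flags the feature because the two samples do not overlap. A flag like this is a prompt for human review, not proof that the algorithm's accuracy has dropped.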

Also variable is adherence to existing guidelines for development and testing of clinical algorithms. In a recent examination of AI algorithms provided by a commercial electronic health record system vendor, most of the items those guidelines recommend reporting were not reported. Just as concerning is the fact that about half of AI development and testing guidelines suggest reporting technical performance (how well the model’s output matches truth on one dataset) but do not address fairness, reliability, or bottom-line usefulness of the algorithms.

Without rigorous evaluation for accuracy, safety, and the presence of bias, AI developers are likely to repeat mistakes similar to those documented in a classic study by Ziad Obermeyer and colleagues, in which a poorly chosen outcome — using health costs as a proxy for health needs — during algorithm development led to major racial bias.

For nearly a year, we and many other colleagues from academia, industry, and government have been meeting to discuss ways to overcome these challenges. Among the many perceptive observations the group has offered, several stand out as actionable suggestions:

Create a label for every algorithm — analogous to a nutrition label, or a drug label — describing the data used to develop an algorithm, its usefulness and limitations, its measured performance, and its suitability for a given population. When you buy a can of soup, you decide if the calories, fat, and sodium align with your needs and preferences. When health systems decide on a drug to use, a medical review board assesses its utility. The same should be true of AI in health care.

Test and monitor, on an ongoing basis, the performance of algorithm-guided care in the settings where it is deployed. Testing should include screening for potential demographic-specific losses in accuracy, using tools that find error hotspots that average performance metrics can hide.

Create best practices for establishing the usefulness, reliability, and fairness of AI algorithms that bring together different organizations to develop and test AI on data sets drawn from diverse and representative groups of patients.

Create a standard way for government, academia, and industry to monitor the behavior of AI algorithms over time.

Understand clinical context and goals of each algorithm and know what attributes — quality, safety, outcomes, cost, speed, and the like — are being optimized.

Learn how local variations in lifestyle, physiology, socioeconomic factors, and access to health care affect both the construction and fielding of AI systems and the risk of bias.

Assess the risk that AI might be used, intentionally or not, to maintain the status quo and reinforce, rather than eliminate, discriminatory policies.

Develop approaches for appropriate clinical use of AI in combination with human expertise, experience, and judgment, and discourage overreliance on, or unreflective trust of, algorithmic recommendations.
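The recommendation above about screening for demographic-specific losses in accuracy can be made concrete with a small sketch. It computes accuracy separately for each group and flags any group that falls well below the overall figure. This is a minimal illustration, not a tool the coalition has endorsed. The record layout, the function names, and the 0.1 gap are assumptions chosen for the example.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {group: hits[group] / totals[group] for group in totals}


def error_hotspots(records, max_gap=0.1):
    """Return overall accuracy and the groups trailing it by more than max_gap.

    max_gap is an illustrative placeholder, not a recognized fairness standard.
    """
    records = list(records)
    if not records:
        raise ValueError("records must be non-empty")
    overall = sum(int(t == p) for _, t, p in records) / len(records)
    per_group = accuracy_by_group(records)
    flagged = {g: acc for g, acc in per_group.items() if overall - acc > max_gap}
    return overall, flagged


# Made-up data: eight correct predictions for group "A",
# one correct and one incorrect prediction for group "B".
example = [("A", 1, 1)] * 8 + [("B", 1, 1), ("B", 1, 0)]
overall_accuracy, hotspots = error_hotspots(example)
```

In this made-up example the overall accuracy is 0.9, which looks reassuring, while group "B" sits at 0.5 and is flagged. With groups this small the estimate is very noisy, so a real screen would also need to account for sample size, and, as with any automated check, deciding which groups and which metric matter still requires careful thought.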

The informal dialogues that yielded these observations and recommendations have continued to evolve. More recently, they have been formalized into a new Coalition for Health AI to ensure progress toward these goals. The steering committee for this project includes the three of us and Brian Anderson from MITRE Health; Atul Butte from the University of California, San Francisco; Eric Horvitz from Microsoft; Andrew Moore from Google; Ziad Obermeyer from the University of California, Berkeley; Michael Pencina from Duke University; and Tim Suther from Change Healthcare. Representatives from the Food and Drug Administration and the Department of Health and Human Services serve as observers in our meetings.

Over the next few months, we are hosting a series of virtual conferences to advance the work, followed by an in-person conference to finalize the material for publication.

The coalition has identified three key steps needed to pave the path toward addressing these concerns:

  • Describe consistent methods and practices to assess the usefulness, reliability, and fairness of algorithms. Tech companies have developed toolkits for assessing the fairness and bias of algorithmic output. But everyone in the field must remain mindful of the fact that automated libraries are no substitute for careful thinking about what an algorithm should be doing and how to define bias.
  • Facilitate the development of broadly accessible evaluation platforms that bring together diverse data sources and standard tools for algorithm testing. Currently, there are no publicly accessible evaluation platforms that have both data and evaluation libraries in one place.
  • Ensure that robust and validated measures of reliability, fairness, and usefulness of AI interventions are incorporated into clinical algorithms.

By working together as a multi-stakeholder group and engaging policy makers, this coalition can develop the standards, guardrails, and guidance needed to enhance the reliability of clinical AI tools. Once the public has confidence in the underlying methods and principles, people can be assured that the humanistic values of medicine remain paramount and protected.

John D. Halamka is an emergency medicine physician and president of Mayo Clinic Platform. Suchi Saria is director of the Machine Learning, AI, and Health Lab at Johns Hopkins University and Johns Hopkins Medicine and founder of Bayesian Health. Nigam H. Shah is professor of medicine and biomedical data science at Stanford University School of Medicine and chief data scientist for Stanford Health Care.
