BenchMD

A Benchmark for Unified Learning on Medical Images and Sensors.

Kathryn Wantlin, Chenwei Wu, Shih-Cheng Huang, Oishi Banerjee, Farah Dadabhoy, Veeral Vipin Mehta, Ryan Wonhee Han, Fang Cao, Raja R. Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H. Tison, Alex Tamkin*, Pranav Rajpurkar*

Benchmarking Progress on Unified Medical AI.

Medical data poses a daunting challenge for AI algorithms: it exists in many different modalities, experiences frequent distribution shifts, and suffers from a scarcity of examples and labels. Recent advances, including transformers and self-supervised learning, promise a more universal approach that can be applied flexibly across these diverse conditions.

To measure and drive progress in this direction, we present a benchmark that tests how unified methods, including architectures and training techniques (e.g., self-supervised learning, ImageNet pretraining), perform on a diverse array of clinically relevant medical tasks.

19 Public Datasets. 7 Modalities. Unified training objectives, evaluated on clinically relevant tasks.

Models for each modality are trained using unified methods on a source dataset and tested on out-of-distribution data from one or more target datasets. Our evaluation includes challenging few-shot settings and analysis of naturally occurring distribution shifts that frequently degrade the performance of medical AI models.
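
To make this protocol concrete, the sketch below walks through one source-to-target evaluation: an encoder is initialized, a classifier head is finetuned on a small labeled source set, and performance is then scored on out-of-distribution target data. The synthetic tensors and all names here are illustrative stand-ins under assumed shapes and labels, not BenchMD's actual API.

```python
# Minimal sketch of a source -> target OOD evaluation with few-shot finetuning.
# All data below is synthetic; in the benchmark, the source and target would be
# distinct datasets of the same modality (e.g., X-rays from different sites).
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

torch.manual_seed(0)

# Stand-ins: a few labeled source examples and a held-out OOD target set.
few_shot_x = torch.randn(64, 128)              # 64 labeled source examples
few_shot_y = torch.randint(0, 2, (64,)).float()
target_x = torch.randn(256, 128)               # out-of-distribution target data
target_y = torch.randint(0, 2, (256,)).float()

# "backbone" plays the role of a unified encoder; in the benchmark its weights
# would come from SSL pretraining, ImageNet transfer, or random initialization.
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 1)

# Few-shot finetuning on the labeled source examples.
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(100):
    opt.zero_grad()
    logits = head(backbone(few_shot_x)).squeeze(-1)
    loss_fn(logits, few_shot_y).backward()
    opt.step()

# Out-of-distribution evaluation on the target dataset.
with torch.no_grad():
    probs = torch.sigmoid(head(backbone(target_x)).squeeze(-1))
print("Target AUROC:", roc_auc_score(target_y.numpy(), probs.numpy()))
```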

Key Results:

In our first round of baseline experiments, we tested a variety of domain-agnostic self-supervised learning (SSL) objectives alongside ImageNet pretraining and training from scratch (the three initialization strategies are sketched in code after the list below). We found the following:

  1. No single SSL technique achieves high performance across all modalities.

  2. ImageNet pretraining and training from scratch can sometimes match SSL performance, demonstrating the difficulty of using current SSL techniques out of the box, without customization for particular medical modalities.

  3. OOD performance typically holds steady or improves as more labels become available for finetuning, though we observe exceptions where models overfit to the source dataset.
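
As a point of reference, the three baseline initialization strategies differ only in where the encoder weights come from. The torchvision-based sketch below is a minimal illustration, assuming a ResNet-18 backbone and torchvision >= 0.13; BenchMD's actual architectures and SSL objectives may differ.

```python
# Illustration of the three baseline initialization strategies. The SSL step
# is a hypothetical placeholder: the real objectives (e.g., contrastive or
# masked prediction) would train on unlabeled data from the source dataset.
import torchvision.models as models

# 1. Training from scratch: randomly initialized weights.
scratch_encoder = models.resnet18(weights=None)

# 2. ImageNet pretraining: weights transferred from supervised ImageNet training.
imagenet_encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 3. SSL: start from random weights, then pretrain with a self-supervised
#    objective on unlabeled source-modality data before finetuning.
ssl_encoder = models.resnet18(weights=None)
# ssl_pretrain(ssl_encoder, unlabeled_source_data)  # hypothetical SSL step
```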

Try BenchMD with your algorithms, architectures, and datasets.