Deep dive · Multimedia Tools and Applications · 2021

COVID-CXNet: Open Chest X-ray Dataset and Detection Model for COVID-19

One of the earliest large-scale open datasets and detection models for COVID-19 on frontal chest X-rays.

  • Medical Imaging
  • COVID-19
  • Open Dataset
COVID-CXNet pipeline: chest X-rays crowdsourced from GitHub, Twitter, SIRM and Radiopaedia pass through CLAHE and BEASF enhancement plates into the COVID-CXNet model (with a CheXNet/DenseNet-121 baseline) and out to a clinical decision-support workstation showing a saliency heatmap and structured findings.

Problem

In early 2020, COVID-19 chest X-ray data was scattered, small-scale, and inconsistent, which made it nearly impossible to train robust detectors or compare methods fairly. Many published 'COVID-vs-normal' classifiers were quietly memorizing dataset artifacts rather than learning disease features.

Approach

We assembled what was, at the time, the largest publicly available COVID-19 frontal CXR collection by harmonizing multiple public sources, deduplicating cases, and standardizing preprocessing. On top of this dataset we trained COVID-CXNet, a CheXNet-based detector with explicit calibration and class-activation analysis to verify that predictions track radiologically meaningful regions rather than dataset shortcuts.
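The deduplication step can be illustrated with a minimal sketch. It assumes exact-duplicate detection by hashing raw pixel bytes, which is not necessarily how the original pipeline did it; the record layout, source names, and image ids below are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (source name, image id, raw pixel bytes).
Record = Tuple[str, str, bytes]


def content_key(pixels: bytes) -> str:
    """Fingerprint an image by hashing its raw pixel bytes."""
    return hashlib.sha256(pixels).hexdigest()


def deduplicate(records: Iterable[Record]) -> Tuple[List[Record], Dict[str, List[str]]]:
    """Keep the first record per fingerprint and report which copies were dropped."""
    kept: Dict[str, Record] = {}
    dropped: Dict[str, List[str]] = {}
    for source, image_id, pixels in records:
        key = content_key(pixels)
        if key in kept:
            # Same pixels already seen: log the copy under the id that was kept.
            first_id = kept[key][1]
            dropped.setdefault(first_id, []).append(f"{source}/{image_id}")
        else:
            kept[key] = (source, image_id, pixels)
    return list(kept.values()), dropped


# Toy data: the second record repeats the pixels of the first.
records = [
    ("github", "img_001", bytes([0, 10, 200, 255])),
    ("twitter", "tw_17", bytes([0, 10, 200, 255])),
    ("sirm", "case_4", bytes([5, 5, 90, 120])),
]
unique, dropped = deduplicate(records)
print(len(unique), dropped)
```

An exact hash only catches byte-identical copies. Images that were resized, cropped, or re-encoded between sources would need a perceptual hash or a manual review pass, and keeping the `dropped` log preserves provenance for every removed copy.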

Key results

  • Open-sourced the largest curated frontal CXR COVID dataset of its time, used by dozens of follow-up studies.
  • Detection accuracy competitive with much larger proprietary models, with class-activation maps that overlap with radiologist regions of interest.
  • Highlighted concrete shortcut-learning risks (lateral markers, hospital-specific text overlays) whose removal the community widely adopted as part of standard preprocessing.

Takeaways

  • Dataset curation is research: a careful open dataset can move a sub-field faster than a new architecture.
  • Class-activation maps are a cheap but powerful sanity check against shortcut learning in medical imaging.
  • Reproducibility and provenance matter more than peak accuracy when results inform clinical conversations.
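The class-activation sanity check mentioned above can be made quantitative in a simple way. The sketch below is one possible formulation, not the paper's own metric: it measures what fraction of a heatmap's activation falls inside a lung mask. The function name, the toy 4x4 grids, and the idea of using a lung mask as the region of interest are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]


def mass_inside_mask(heatmap: Grid, mask: List[List[int]]) -> float:
    """Fraction of non-negative heatmap activation that falls inside the mask."""
    total = 0.0
    inside = 0.0
    for heat_row, mask_row in zip(heatmap, mask):
        for value, in_region in zip(heat_row, mask_row):
            value = max(value, 0.0)  # ignore negative activations
            total += value
            if in_region:
                inside += value
    # An all-zero heatmap carries no evidence either way.
    return inside / total if total > 0 else 0.0


# Toy lung mask: 1 marks lung pixels, 0 marks everything else.
lung_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Heatmap concentrated on the lungs.
focused_cam = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Heatmap that mostly lights up image corners, where markers and text overlays tend to sit.
shortcut_cam = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.75],
]

print(mass_inside_mask(focused_cam, lung_mask))   # 1.0
print(mass_inside_mask(shortcut_cam, lung_mask))  # 0.125
```

A low fraction does not prove shortcut learning on its own, but a batch of predictions whose activation sits mostly outside the lungs is a cheap signal that the images deserve a closer look for markers or text overlays.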