Deep dive · Multimedia Tools and Applications · 2021
COVID-CXNet: Open Chest X-ray Dataset and Detection Model for COVID-19
One of the earliest large-scale open datasets and detection models for COVID-19 on frontal chest X-rays.
Problem
In early 2020, COVID-19 chest X-ray data was scattered, small-scale, and inconsistent, which made it nearly impossible to train robust detectors or compare methods fairly. Many published 'COVID-vs-normal' classifiers were memorizing dataset artifacts rather than learning disease features.
Approach
We assembled what was, at the time, the largest publicly available COVID-19 frontal CXR collection by harmonizing multiple public sources, deduplicating cases, and standardizing preprocessing. On top of this dataset we trained COVID-CXNet, a CheXNet-based detector with explicit calibration. We then used class-activation analysis to verify that predictions track radiologically meaningful regions rather than dataset shortcuts.
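The deduplication step can be illustrated with a minimal sketch. It assumes each image has already been resized and converted to raw 8-bit grayscale bytes, and it only catches exact duplicates; near-duplicates (re-encoded or cropped copies) would need a perceptual hash or manual review. The `CxrRecord` schema and the function names are hypothetical and are not taken from the released code.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CxrRecord:
    """One image pulled from a public source (hypothetical schema)."""
    source: str      # name of the originating collection
    image_id: str    # identifier within that source
    pixels: bytes    # raw 8-bit grayscale pixel data after resizing


def content_key(record: CxrRecord) -> str:
    """Hash the pixel bytes so identical images collide regardless of source."""
    return hashlib.sha256(record.pixels).hexdigest()


def deduplicate(records: list[CxrRecord]) -> list[CxrRecord]:
    """Keep the first record seen for each distinct pixel payload."""
    seen: set[str] = set()
    unique: list[CxrRecord] = []
    for record in records:
        key = content_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keying on pixel content rather than on file names or source identifiers matters here because the same case can appear in several public collections under different names.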
Key results
- Open-sourced the largest curated frontal CXR COVID dataset of its time, used by dozens of follow-up studies.
- Detection accuracy competitive with much larger proprietary models, with class-activation maps that overlap with radiologists' regions of interest.
- Highlighted concrete shortcut-learning risks (lateral markers, hospital-specific text overlays); removing these artifacts was widely adopted by the community as part of standard preprocessing.
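One crude way to suppress markers and text overlays is to blank a band around the image border, where they usually sit. The sketch below is an assumption for illustration, not the preprocessing used in the paper: the `mask_border` function, its `fraction` parameter, and the nested-list image representation are all hypothetical, and a fixed band will miss overlays placed closer to the lungs.

```python
def mask_border(image, fraction=0.1, fill=0):
    """Return a copy of a 2D image with a border band set to `fill`.

    `image` is a non-empty list of equal-length rows of pixel values.
    `fraction` is the share of height/width blanked on each side.
    """
    if not 0.0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    height = len(image)
    width = len(image[0])
    band_y = int(height * fraction)  # rows blanked at top and bottom
    band_x = int(width * fraction)   # columns blanked at left and right

    masked = []
    for y, row in enumerate(image):
        in_row_band = y < band_y or y >= height - band_y
        masked.append([
            fill if in_row_band or x < band_x or x >= width - band_x else value
            for x, value in enumerate(row)
        ])
    return masked
```

Training once with and once without such a mask is a quick test: a large accuracy drop after masking suggests the model was reading the border rather than the lungs.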
Takeaways
- Dataset curation is research: a careful open dataset can move a sub-field faster than a new architecture.
- Class-activation maps are a cheap but powerful sanity check against shortcut learning in medical imaging.
- Reproducibility and provenance matter more than peak accuracy when results inform clinical conversations.
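The class-activation sanity check can be turned into a number. The sketch below assumes a class-activation map and a binary region-of-interest mask (for example, a lung mask) are already available at the same resolution as nested lists; it measures how much of the positive activation falls inside the region. The function names and the 0.5 threshold are hypothetical choices for illustration, not values from the paper.

```python
def cam_mass_inside(cam, roi_mask):
    """Fraction of positive class-activation mass that lies inside the ROI.

    `cam` is a 2D list of activation values; `roi_mask` is a 2D list of
    0/1 (or False/True) flags with the same shape.
    """
    if len(cam) != len(roi_mask):
        raise ValueError("cam and roi_mask must have the same shape")
    total = 0.0
    inside = 0.0
    for cam_row, mask_row in zip(cam, roi_mask):
        if len(cam_row) != len(mask_row):
            raise ValueError("cam and roi_mask must have the same shape")
        for activation, in_roi in zip(cam_row, mask_row):
            weight = max(float(activation), 0.0)  # ignore negative evidence
            total += weight
            if in_roi:
                inside += weight
    return inside / total if total > 0 else 0.0


def flag_shortcut_suspects(scores, threshold=0.5):
    """Return indices of images whose in-ROI activation share is below threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]
```

Images flagged this way are candidates for manual review: a prediction driven mostly by activation outside the lungs is more likely to rest on a marker or overlay than on pathology.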