Anatomical Structure-Guided Medical Vision-Language Pre-training

Qingqiu Li¹, Xiaohan Yan², Jilan Xu¹, Runtian Yuan¹, Yuejie Zhang¹,
Rui Feng¹, Quanli Shen³, Xiaobo Zhang³, and Shujun Wang⁴,⁵

¹Fudan University    ²Tongji University    ³Children's Hospital of Fudan University
⁴The Hong Kong Polytechnic University    ⁵Research Institute for Smart Ageing

Two limitations of existing methods, (a) lack of interpretability and clinical relevance and (b) insufficient representation learning of image-report pairs, and our corresponding improvements.

The framework of Anatomical Structure-Guided Medical Vision-Language Pre-training.

Our anatomical region-sentence alignment pipeline.

One case of our re-labeled dataset.

The differences between our re-labeled dataset and Chest ImaGenome (using the left lung as an example).

Demo for Anatomical Structure-Sentence Alignment


Three scenarios of anatomical region-sentence alignment.

We provide an example showing the alignment results under two different methods: merge bbox and split sentence.



Raw report: mild left basal atelectasis. otherwise unremarkable. ap upright and lateral views the chest were provided. mild left basal atelectasis. lungs are otherwise clear. no signs of pneumonia or edema. no large effusion or pneumothorax. cardiomediastinal silhouette is normal. bony structures are intact. no free air below the right hemidiaphragm.

Merge Bbox


Split Sent
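
The two methods named in this demo can be illustrated with a minimal Python sketch. It rests on assumptions that the page itself does not confirm: that "merge bbox" pairs a sentence with the smallest box enclosing all regions it mentions, and that "split sentence" pairs the same sentence with each region separately. The function names, the region names, and the pixel coordinates are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the coordinates used below are hypothetical.
Box = Tuple[int, int, int, int]


def merge_bboxes(boxes: List[Box]) -> Box:
    """Return the smallest axis-aligned box that encloses every input box."""
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def align_merge(sentence: str, regions: Dict[str, Box]) -> Tuple[str, Box]:
    """Assumed 'merge bbox' reading: one sentence, one enclosing box."""
    return sentence, merge_bboxes(list(regions.values()))


def align_split(sentence: str, regions: Dict[str, Box]) -> List[Tuple[str, str, Box]]:
    """Assumed 'split sentence' reading: one (sentence, region, box) triple per region."""
    return [(sentence, name, box) for name, box in regions.items()]


# A sentence from the raw report above that refers to both lungs.
sentence = "no large effusion or pneumothorax."
regions = {
    "right lung": (30, 40, 110, 200),   # hypothetical box
    "left lung": (120, 45, 200, 205),   # hypothetical box
}

merged_pair = align_merge(sentence, regions)
split_pairs = align_split(sentence, regions)
print(merged_pair)
print(split_pairs)
```

Under these assumptions, the merge variant yields a single image-text pair per sentence, while the split variant yields one pair per anatomical region, so the number of training pairs differs between the two.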