Abstract
Medical reports, together with radiology and pathology images, are critical to accurate diagnosis and treatment planning. However, writing these reports is time-consuming and prone to errors. To address this, we explore the potential of transformer-based architectures for the automatic generation of medical reports. Our study investigates how Vision Transformer (ViT) and Contrastive Language-Image Pre-training (CLIP) models (specifically the ViT B/14 variant) used as image encoders, combined with a Generative Pre-trained Transformer (GPT) as the text decoder, enhance the understanding of the relationship between medical images and reports. We explore three architectures: ViT-CoAttention-LSTM, ViT-GPT2, and CLIP-GPT2. Experiments on a public dataset [1] demonstrate that our best model, CLIP-GPT2, outperforms existing baseline models. Furthermore, we integrate these models into a web application deployed on Hugging Face for ease of use and broader accessibility.