
MASNET-ESN: An Effective Deep Learning Based Automatic Medical Image Captioning
Medical image captioning has recently become a prominent research field. Interpreting and captioning medical images is expensive and time-consuming and frequently requires expert assistance, and the growing volume of medical images makes it increasingly difficult for radiologists to manage this workload on their own. Automating the captioning process therefore reduces cost and time while helping radiologists improve the reliability and precision of the generated captions; it also gives newer, less experienced radiologists access to automatic support. However, previous studies leave several challenges unresolved, such as producing excessively elaborate captions, difficulty identifying abnormal regions in complex images, and low accuracy. To address these issues, we propose an effective deep learning technique for automatic medical image captioning based on a Multiscale Attention Siamese Network (MASNet) and an Echo State Network (ESN). In this work, MASNet extracts global and local visual features from the preprocessed image. The ESN then uses the extracted high- and low-level features to produce a complete description of the input image. Moreover, Tunicate Swarm Algorithm (TSA) based hyperparameter tuning is applied to improve the performance of the ESN.
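To illustrate the ESN stage described above, the following is a minimal NumPy sketch of a leaky-integrator echo state reservoir driven by a sequence of visual feature vectors. All dimensions, weight ranges, the leak rate, and the spectral-radius rescaling here are illustrative assumptions, not the paper's actual configuration (which the TSA additionally tunes); the defining property is that only the readout weights are trained, while the input and recurrent weights stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 256-D visual feature vector drives a
# 500-unit reservoir; the readout produces a 128-D output (e.g. logits
# over a caption vocabulary).
n_in, n_res, n_out = 256, 500, 128

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))   # input weights (fixed)
W = rng.uniform(-0.5, 0.5, (n_res, n_res))     # recurrent weights (fixed)
# Rescale the spectral radius below 1 so the reservoir satisfies the
# echo state property (its state fades memory of old inputs).
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_out = rng.uniform(-0.5, 0.5, (n_out, n_res)) # readout (the only trained part)

def esn_step(x, u, leak=0.3):
    """One leaky-integrator reservoir update driven by input u."""
    x_new = np.tanh(W_in @ u + W @ x)
    return (1 - leak) * x + leak * x_new

x = np.zeros(n_res)
for t in range(10):                 # feed a short sequence of feature vectors
    u = rng.standard_normal(n_in)
    x = esn_step(x, u)

y = W_out @ x                       # readout from the final reservoir state
print(y.shape)                      # → (128,)
```

In a captioning setting, training would fit only `W_out` (typically by ridge regression on collected reservoir states), which is what makes ESNs cheap to tune compared with fully trained recurrent decoders.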
Finally, the proposed technique is evaluated with metrics such as Consensus-based Image Description Evaluation (CIDEr), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L), and Bilingual Evaluation Understudy (BLEU) on the Pathology Education Instructional Resource-Gross (PEIR-Gross), Medical Information Mart for Intensive Care-Chest X-Ray (MIMIC-CXR), Radiology Objects in Context (ROCO), and Indiana University chest X-ray (IU X-ray) datasets, and the findings are compared with previous methods.
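The BLEU family of metrics listed above is built on clipped (modified) n-gram precision between a generated caption and a reference caption. A small self-contained sketch of that core term follows; the caption pair is made up for illustration and does not come from any of the datasets named, and a full BLEU score would additionally combine several n-gram orders with a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = max(sum(cand_counts.values()), 1)
    return clipped / total

# Hypothetical generated caption vs. a radiologist-written reference.
cand = "the lungs are clear with no focal consolidation".split()
ref = "lungs are clear without focal consolidation".split()

p1 = modified_precision(cand, ref, 1)
print(round(p1, 3))  # → 0.625  (5 of 8 candidate unigrams match)
```

ROUGE-L differs in being recall-oriented over the longest common subsequence, METEOR adds stemming and synonym matching, and CIDEr weights n-grams by TF-IDF across a consensus of references.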