
Toward Human-Level Understanding: A Systematic Review of Vision-Language Models for Image Captioning
Large Language Models (LLMs), particularly multimodal LLMs, have significantly enhanced image captioning in recent years, producing captions that are more descriptive, detailed, and context-aware. However, differences in architecture and training data lead to captions that vary in length, style, and level of detail, offering flexibility for diverse applications. In this survey, we provide a comprehensive overview and comparative analysis of prominent Vision-Language Models (VLMs) for image captioning, with a focus on their zero-shot performance on the Microsoft Common Objects in Context (MS-COCO) dataset. We evaluate these models using both human assessments (fluency, groundedness, relevance) and an automatic metric, the Contrastive Language-Image Pre-training Score (CLIPScore). Our findings reveal trade-offs between efficiency and performance, linking architectural decisions to issues such as hallucination and caption grounding. Beyond benchmarking, we propose a human evaluation protocol that captures nuances such as fluency, factual grounding, and stylistic preference, leading to recommendations for selecting VLMs for different use cases.
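To make the automatic metric concrete, the minimal Python sketch below computes a reference-free CLIPScore: the image and the candidate caption are embedded with a CLIP model, and their cosine similarity is clipped at zero and rescaled by 2.5, following the original CLIPScore definition. The openai/clip-vit-base-patch32 checkpoint and the clip_score helper are illustrative assumptions rather than the exact configuration evaluated in this survey.

```python
# Illustrative reference-free CLIPScore sketch; the backbone choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, not necessarily the one used here
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """CLIPScore(caption, image) = w * max(cos(E_image, E_caption), 0), with w = 2.5
    as in the original CLIPScore definition."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)

# Example: score a generated caption for one MS-COCO image (path is a placeholder).
# print(clip_score(Image.open("coco_example.jpg"), "a man riding a horse on a beach"))
```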