
Toward Human-Level Understanding: A Systematic Review of Vision-Language Models for Image Captioning
Large Language Models (LLMs), particularly multimodal LLMs, have significantly advanced image captioning in recent years, producing output that is more descriptive, detailed, and context-aware. However, differences in architecture and training data lead to captions that vary in length, style, and level of detail, offering flexibility for diverse applications. In this survey, we provide a comprehensive overview and comparative analysis of prominent Vision-Language Models (VLMs) for image captioning, focusing on their zero-shot performance on the Microsoft Common Objects in Context (MS-COCO) dataset. We evaluate these models using both human assessments (fluency, groundedness, relevance) and an automatic metric, the Contrastive Language–Image Pre-training Score (CLIPScore). Our findings reveal trade-offs between efficiency and performance, linking architectural decisions to issues such as hallucination and caption grounding. Beyond benchmarking, we propose a human evaluation protocol that captures nuances such as fluency, factual grounding, and stylistic preference, leading to recommendations for selecting VLMs for different use cases.
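As a concrete point of reference for the automatic metric, the following minimal Python sketch illustrates how a reference-free, CLIPScore-style compatibility score can be computed for a single image-caption pair. It assumes the publicly released openai/clip-vit-base-patch32 checkpoint accessed through the Hugging Face transformers library and the 2.5 rescaling weight used in the original CLIPScore formulation; the file path in the usage comment is a placeholder, and none of these choices are prescribed by this survey.

# Minimal CLIPScore-style sketch: max(0, w * cosine(image embedding, caption embedding)).
# Assumes the open CLIP ViT-B/32 checkpoint; w = 2.5 follows the original CLIPScore formulation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str, w: float = 2.5) -> float:
    """Return a reference-free image-caption compatibility score."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cosine = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return max(0.0, w * cosine)

# Usage (placeholder path): averaging clip_score over all generated captions
# for the MS-COCO evaluation images yields a corpus-level score.
# print(clip_score("coco/val2017/000000000139.jpg", "A living room with a television and a table."))

In a zero-shot comparison, each model's captions would be scored this way and averaged, complementing the human judgments of fluency, groundedness, and relevance.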
[1] Alayrac J., Donahue J., Luc P., Miech A., and et al., “Flamingo: A Visual Language Model for Few-Shot Learning,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, pp. 23716-23736, 2022. https://dl.acm.org/doi/10.5555/3600270.3601993
[2] Anderson P., He X., Buehler C., Teney D., and et al., “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, 2018. https://doi.ieeecomputersociety.org/10.1109/CVPR.2018.00636
[3] Awadalla A., Gao I., Gardner J., Hessel J., and et al., “OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models,” arXiv Preprint, pp. 1-20, 2023. https://ui.adsabs.harvard.edu/link_gateway/2023arXiv230801390A/doi:10.48550/arXiv.2308.01390
[4] Bahdanau D., Cho K., and Bengio Y., “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv Preprint, pp. 1-15, 2014. https://arxiv.org/abs/1409.0473v7
[5] Bai J., Bai S., Yang S., Wang S., and et al., “Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond,” arXiv Preprint, vol. arXiv:2308.12966v3, pp. 1-24, 2023. https://arxiv.org/abs/2308.12966v3
[6] Bavishi R., Elsen E., Hawthorne C., Nye M., and et al., Fuyu-8B: A Multimodal Architecture for AI Agents, www.adept.ai/blog/fuyu-8b, Last Visited, 2025.
[7] Zhao B., Wu B., He M., and Huang T., “SVIT: Scaling up Visual Instruction Tuning,” arXiv Preprint, vol. arXiv:2307.04087v3, pp. 1-18, 2023. https://arxiv.org/abs/2307.04087v3
[8] Brown T., Mann B., Ryder N., Subbiah M., and et al., “Language Models are Few-Shot Learners,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, pp. 1877-1901, 2020. https://dl.acm.org/doi/abs/10.5555/3495724.3495883
[9] Caffagni D., Cocchi F., Barsellotti L., Moratelli N., and et al., “The Revolution of Multimodal Large Language Models: A Survey,” in Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, pp. 13590-13618, 2024. https://doi.org/10.18653/v1/2024.findings-acl.807
[10] Cha J., Kang W., Mun J., and Roh B., “Honeybee: Locality-Enhanced Projector for Multimodal LLM,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, pp. 13817-13827, 2024. https://openaccess.thecvf.com/content/CVPR2024/papers/Cha_Honeybee_Locality-enhanced_Projector_for_Multimodal_LLM_CVPR_2024_paper.pdf
[11] Chen J., Zhu D., Shen X., Li X., and et al., “MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning,” arXiv Preprint, vol. 2310.09478v3, pp. 1-20, 2023. https://arxiv.org/abs/2310.09478v3
[12] Cheng K., Song W., Ma Z., Zhu W., and et al., “Beyond Generic: Enhancing Image Captioning with Real-World Knowledge Using Vision-Language Pre-Training Model,” in Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, pp. 5038-5047, 2023. https://doi.org/10.1145/3581783.3611987
[13] Chowdhery A., Narang S., Devlin J., Bosma M., and et al., “PaLM: Scaling Language Modeling with Pathways,” Journal of Machine Learning Research, pp. 1-87, 2022. https://arxiv.org/abs/2204.02311v5
[14] Chung H., Hou L., Longpre S., Zoph B., Tay Y., and et al., “Scaling Instruction-Finetuned Language Models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1-53, 2024. https://arxiv.org/abs/2210.11416v5
[15] Cornia M., Baraldi L., and Cucchiara R., “Explaining Transformer-Based Image Captioning Models: An Empirical Analysis,” AI Communications, vol. 35, no. 2, pp. 111-129, 2022. https://doi.org/10.3233/AIC-210172
[16] Cornia M., Baraldi L., and Cucchiara R., “SMArT: Training Shallow Memory-Aware Transformers for Robotic Explainability,” in Proceedings of the IEEE International Conference on Robotics and Automation, Paris, pp. 1-25, 2020. https://doi.org/10.1109/ICRA40945.2020.9196653
[17] Devlin J., Chang M., Lee K., and Toutanova K., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” arXiv Preprint, 2018. https://arxiv.org/abs/1810.04805v2
[18] Donahue J., Hendricks L., Rohrbach M., Venugopalan S., and et al., “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 677-691, 2017. https://doi.org/10.1109/TPAMI.2016.2599174
[19] Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., and et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proceedings of the International Conference on Learning Representations, Vienna, pp. 1-22, 2021. https://arxiv.org/abs/2010.11929v2
[20] Fantechi A., Gnesi S., Livi S., and Semini L., “A spaCy-Based Tool for Extracting Variability from NL Requirements,” in Proceedings of the 25th ACM International Systems and Software Product Line Conference, Leicester, pp. 32-35, 2021. https://doi.org/10.1145/3461002.3473074
[21] Gao P., Han J., Zhang R., Lin Z., and et al., “LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model,” arXiv Preprint, vol. 2304.15010v1, pp. 1-15, 2023. https://arxiv.org/abs/2304.15010v1
[22] Hani A., Tagougui N., and Kherallah M., “Image Caption Generation Using a Deep Architecture,” in Proceedings of the International Arab Conference on Information Technology, Al Ain, pp. 246-251, 2019. https://doi.org/10.1109/ACIT47987.2019.8990998
[23] Huang L., Wang W., Chen J., and Wei X., “Attention on Attention for Image Captioning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, pp. 4633-4642, 2019. https://doi.ieeecomputersociety.org/10.1109/ICCV.2019.00473
[24] Jaegle A., Gimeno F., Brock A., Zisserman A., and et al., “Perceiver: General Perception with Iterative Attention,” arXiv Preprint, vol. arXiv:2103.03206v2, pp. 1-43, 2021. https://arxiv.org/abs/2103.03206v2
[25] Karpathy A. and Fei-Fei L., “Deep Visual-Semantic Alignments for Generating Image Descriptions,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015. https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298932
[26] Krizhevsky A., Sutskever I., and Hinton G., “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, vol. 60, pp. 84-90, 2017. https://doi.org/10.1145/3065386
[27] Kulkarni G., Premraj V., Dhar S., Li S., and et al., “Baby Talk: Understanding and Generating Simple Image Descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, pp. 1601-1608, 2011. https://doi.org/10.1109/CVPR.2011.5995466
[28] Kuznetsova P., Ordonez V., Berg T., and Choi Y., “TreeTalk: Composition and Compression of Trees for Image Descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 351-362, 2014. https://doi.org/10.1162/tacl_a_00188
[29] Lavie A. and Agarwal A., “METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments,” in Proceedings of the 2nd Workshop on Statistical Machine Translation, Prague, pp. 228-231, 2007. https://dl.acm.org/doi/proceedings/10.5555/1626355
[30] Li J., Li D., Savarese S., and Hoi S., “BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models,” International Conference on Machine Learning, Honolulu, pp. 19730-19742, 2023. https://dl.acm.org/doi/10.5555/3618408.3619222
[31] Lin C., “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out (ACL Workshop), Barcelona, pp. 74-81, 2004. https://aclanthology.org/W04-1013/
[32] Lin T., Maire M., Belongie S., Bourdev L., and et al., “Microsoft COCO: Common Objects in Context,” in Proceedings of the Computer Vision-ECCV 13th European Conference, Zurich, pp. 740-755, 2014. https://doi.org/10.1007/978-3-319-10602-1_48
[33] Lin Z., Liu C., Zhang R., Gao P., and et al., “SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-Modal Large Language Models,” arXiv Preprint, vol. 2311.07575v1, pp. 1-24, 2023. https://arxiv.org/abs/2311.07575v1
[34] Liu F., Lin K., Li L., Wang J., and et al., “Mitigating Hallucination in Large Multi-Modal Models Via Robust Instruction Tuning,” arXiv Preprint, pp. 1-45, 2023. https://arxiv.org/abs/2306.14565v4
[35] Liu H., Li C., Li Y., Li B., and et al., LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge, https://llava-vl.github.io/blog/2024-01-30-llava-next/, Last Visited, 2025.
[36] Liu H., Li C., Li Y., and Lee Y., “Improved Baselines with Visual Instruction Tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, pp. 26286-26296, 2024. https://doi.org/10.1109/CVPR52733.2024.02484
[37] Liu H., Li C., Wu Q., and Lee Y., “Visual Instruction Tuning,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, pp. 34892-34916, 2023. https://dl.acm.org/doi/abs/10.5555/3666122.3667638
[38] Nlpconnect/vit-gpt2-image-captioning, https://huggingface.co/nlpconnect/vit-gpt2-image-captioning, Last Visited, 2025.
[39] Osman A., Shalaby M., Soliman M., and Elsayed K., “Ar-CM-ViMETA: Arabic Image Captioning Based on Concept Model and Vision-based Multi-Encoder Transformer Architecture,” The International Arab Journal of Information Technology, vol. 21, no. 3, pp. 458-465, 2024. https://doi.org/10.34028/iajit/21/3/9
[40] Papineni K., Roukos S., Ward T., and Zhu W., “BLEU: A Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, pp. 311-318, 2002. https://doi.org/10.3115/1073083.1073135
[41] Peng Z., Wang W., Dong L., Hao Y., and et al., “Kosmos-2: Grounding Multimodal Large Language Models to the World,” arXiv Preprint, vol. arXiv:2306.14824v3, pp. 1-20, 2023. https://arxiv.org/abs/2306.14824v3
[42] Radford A., Kim J., Hallacy C., Ramesh A., and et al., “Learning Transferable Visual Models from Natural Language Supervision,” arXiv Preprint, vol. arXiv:2103.00020v1, pp. 1-48, 2021. https://arxiv.org/abs/2103.00020v1
[43] Raffel C., Shazeer N., Roberts A., Lee K., and et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, pp. 1-67, 2019. https://arxiv.org/abs/1910.10683v4
[44] Vaswani A., Shazeer N., Parmar N., Uszkoreit J., and et al., “Attention Is All You Need,” arXiv Preprint, vol. arXiv:1706.03762v7, pp. 1-15, 2017. https://arxiv.org/abs/1706.03762v7
[45] Vedantam R., Zitnick L., and Parikh D., “CIDEr: Consensus-Based Image Description Evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, pp. 4566-4575, 2015. https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7299087
[46] Vikhyatk/Moondream2, https://huggingface.co/vikhyatk/moondream2, Last Visited, 2025.
[47] Vinyals O., Toshev A., Bengio S., and Erhan D., “Show and Tell: A Neural Image Caption Generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, pp. 3156-3164, 2015. https://doi.org/10.1109/CVPR.2015.7298935
[48] Wang A., Zhou P., Shou M., and Yan S., “Position-Guided Text Prompt for Vision-Language Pre-Training,” arXiv Preprint, vol. arXiv:2212.09737v2, pp. 23242-23251, 2022. https://arxiv.org/abs/2212.09737v2
[49] Wang B., Wu F., Han X., Peng J., Zhong H., and et al., “VIGC: Visual Instruction Generation and Correction,” AAAI Conference on Artificial Intelligence, vol. 38, no. 6, pp. 5309-5317, 2024. https://doi.org/10.1609/aaai.v38i6.28338
[50] Wang J., Yang Z., Hu X., Li L., and et al., “GIT: A Generative Image-to-Text Transformer for Vision and Language,” arXiv Preprint, vol. arXiv:2205.14100v5, pp. 1-49, 2022. https://arxiv.org/abs/2205.14100v5
[51] Wang P., Yang A., Men R., Lin J., and et al., “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework,” arXiv Preprint, vol. arXiv:2202.03052v2, pp. 1-49, 2022. https://arxiv.org/abs/2202.03052v2
[52] Wang W., Lv Q., Yu W., Hong W., and et al., “CogVLM: Visual Expert for Pre-Trained Language Models,” arXiv Preprint, vol. arXiv:2311.03079v2, pp. 1-17, 2023. https://arxiv.org/abs/2311.03079v2
[53] Xu K., Ba J., Kiros R., Cho K., and et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, pp. 2048-2057, 2015. https://dl.acm.org/doi/10.5555/3045118.3045336
[54] Yao L., Chen W., and Jin Q., “CapEnrich: Enriching Caption Semantics for Web Images Via Cross-Modal Pre-Trained Knowledge,” in Proceedings of the ACM Web Conference, Austin, pp. 2392-2401, 2023. https://doi.org/10.1145/3543507.3583232
[55] Ye Q., Xu H., Ye J., Yan M., and et al., “mPLUG-Owl2: Revolutionizing Multi-Modal Large Language Model with Modality Collaboration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, pp. 13040-13051, 2024. https://doi.org/10.1109/CVPR52733.2024.01239
[56] Yenduri G., Ramalingam M., Selvi G., Supriya Y., and et al., “Generative Pre-Trained Transformer: A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions,” IEEE Access, vol. 12, pp. 54608-54649, 2024. https://doi.org/10.1109/ACCESS.2024.3389497
[57] Yin S., Fu C., Zhao S., Xu T., and et al., “Woodpecker: Hallucination Correction for Multimodal Large Language Models,” Science China Information Sciences, vol. 67, no. 12, 2024. https://doi.org/10.1007/s11432-024-4251-x