
ACLM: Developing a Compact Arabic Language Model
Recent advances in Large Language Models (LLMs) have transformed Natural Language Processing (NLP). These models have demonstrated unprecedented capabilities in understanding and generating human language; however, their large-scale nature often poses challenges related to computational resource requirements, latency, and deployment, especially in resource-constrained environments. This research focuses on the design, development, and evaluation of an Arabic Small Language Model (SLM), named the Arabic Compact Language Model (ACLM), built to be compact and efficient. By leveraging high-quality Arabic data, ACLM aims to bridge the gap between the high resource demands of existing large-scale models and the practical needs of real-world applications. We began with an existing language model, AraGPT2-base (Pre-Trained Transformer for Arabic Language Generation), and further pre-trained it on high-quality Arabic data to enhance its performance while maintaining a compact size. This approach emphasizes the importance of data quality over model size, drawing on recent studies that highlight the effectiveness of high-quality data in improving model performance. To evaluate ACLM, we conducted two key assessments: 1) a survey-based evaluation involving three LLMs, ChatGPT (GPT-4o), Gemini Pro, and Command R+, and 2) a perplexity analysis on generated and real-world text. ACLM outperformed AraGPT2-base in 4 out of 5 scenarios. Additionally, ACLM demonstrated superior fluency, achieving a perplexity of 31.74 on generated text compared to 165.28 for AraGPT2-base, and a perplexity of 124.67 on real-world Arabic books, significantly lower than the 2011.88 of AraGPT2-base.
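The perplexity figures above are based on the standard definition for causal language models: the exponential of the mean token-level negative log-likelihood the model assigns to a text. The sketch below illustrates this computation using the Hugging Face transformers library; the checkpoint identifier aubmindlab/aragpt2-base and the sample text are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a perplexity computation for a causal LM,
# assuming the Hugging Face `transformers` library and the public
# `aubmindlab/aragpt2-base` checkpoint (an assumed stand-in for the
# baseline model evaluated in the paper).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aubmindlab/aragpt2-base"  # assumed checkpoint for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()


def perplexity(text: str) -> float:
    """Return exp(mean negative log-likelihood) of `text` under the model."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean
        # cross-entropy over predicted tokens (shifted internally).
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())


# Lower perplexity indicates the model finds the text more fluent/predictable.
print(perplexity("النص العربي المراد تقييمه هنا"))
```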