
A Dual-Stage Lightweight Framework for Detecting Prompt Injections and Harmful Outputs in Large Language Models
Large Language Models (LLMs) are changing the way we interact with technology, from customer service to healthcare and legal assistance. But with this power comes risks: LLMs can produce biased outputs or be manipulated by malicious prompts. To tackle this, we introduce a two-stage safety framework. First, a classifier screens user prompts to flag potentially harmful inputs. Second, another classifier checks the generated responses for bias or unsafe content. Both use fine- tuned DistilBERT models. The prompt classifier reaches 98.28% accuracy and an F1-score of 0.9792, about 2% better than previous embedding-based methods. The response classifier scores 94.42% accuracy and an F1-score of 0.9193, outperforming standard DistilBERT and DistilRoBERTa models. Compared with larger models like DeBERTa-v3, our approach delivers almost the same performance but with fewer parameters and faster inference, making it practical for real-time applications. This framework provides an effective and efficient way to keep LLM outputs safe and reliable.
[1] Adams C., Sorensen J., Elliott J., Dixon L., and et al., Toxic Comment Classification Challenge, https://kaggle.com/competitions/jigsaw-toxic- comment-classification-challenge, Last Visited, 2025.
[2] Ayub M. and Majumdar S., “Embedding-Based Classifiers Can Detect Prompt Injection Attacks,” arXiv Preprint, vol. arXiv:2410.22284v1, pp. 1- 11, 2024. https://doi.org/10.48550/arXiv.2410.22284
[3] Bordia S. and Bowman S., “Identifying and reducing gender bias in word-level language models,” arXiv Preprint, vol. arXiv:1904.03035v1, pp. 1-12, 2019, https://doi.org/10.48550/arXiv.1904.03035
[4] Chen Y., Li H., Zheng Z., Song Y., and et al., “Defense Against Prompt Injection Attack by Leveraging Attack Techniques,” arXiv Preprint, vol. arXiv:2411.00459v6, pp. 1-17, 2025. https://doi.org/10.48550/arXiv.2411.00459
[5] De-Arteaga M., Romanov A., Wallach H., Chayes J., Borgs C., Chouldechova A., and et al., “Bias in Bios: A Case Study of Semantic Representation Bias in A High-Stakes Setting,” in Proceedings of the 19th Conference on Fairness, Accountability, and Transparency, New York, pp. 120-128, 2019. https://doi.org/10.1145/3287560.3287572
[6] Derner E., Batistič K., Zahálka J., and Babuska R., “A Security Risk Taxonomy for Prompt-Based Interaction with Large Language Models,” IEEE Access, vol. 12, pp. 126176-126187, 2024. DOI:10.1109/ACCESS.2024.3450388
[7] Dev S., Li T., Phillips J., and Srikumar V., “On Measuring and Mitigating Biased Inferences of Word Embeddings,” in Proceedings of the AAAI Conference on Artificial Intelligence, New York, pp. 7659-7666, 2020. https://doi.org/10.1609/aaai.v34i05.6267
[8] Devlin J., Chang M., Lee K., and Toutanova K., “Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Conference, Minnesota, pp. 4171-4186, 2019. https://doi.org/10.48550/arXiv.1810.04805
[9] Dinan E., Humeau S., Chintagunta B., and Weston J., “Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack,” in Proceedings of the Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 1-13, 2019. DOI:10.18653/v1/D19-1461
[10] Dong X., Wang Y., Yu P., and Caverlee J., “Probing Explicit and Implicit Gender Bias Through LLM Conditional Text Generation,” arXiv Preprint, vol. arXiv:2311.00306v1, pp. 1- 11, 2023. https://doi.org/10.48550/arXiv.2311.00306
[11] Dong X., Zhu Z., Wang Z., Teleki M., and Caverlee J., “Co2pt: Mitigating Bias in Pre- Trained Language Models Through Counterfactual Contrastive Prompt Tuning.” arXiv Preprint, vol. arXiv:2310.12490v1, pp. 1- 13, 2023. https://doi.org/10.48550/arXiv.2310.12490
[12] Ferrag M., Tihanyi N., Hamouda D., Maglaras L., and Debbah M., “From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows,” arXiv Preprint vol. arXiv:2506.23260v2, pp. 1-36, 2025. https://doi.org/10.48550/arXiv.2506.23260
[13] Ganguli D., Lovitt L., Kernion J., Askell A., Bai Y., and et al., “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,” arXiv preprint, vol. arXiv:2209.07858v2, pp. 1-30, 2022. https://doi.org/10.48550/arXiv.2209.07858
[14] Gehman S., Gururangan S., Sap M., Choi Y., and Smith N., “Realtoxicityprompts: Evaluating Neural Toxic Degeneration in Language Models,” arXiv preprint, vol. arXiv:2009.11462, pp. 1-25, 2020. https://doi.org/10.48550/arXiv.2009.11462
[15] Guo Y., Guo M., Su J., Yang Z., and et al., “Bias in Large Language Models: Origin, Evaluation, And Mitigation,” arXiv preprint, vol. arXiv:2411.10915v1, pp. 1-47, 2024. https://doi.org/10.48550/arXiv.2411.10915
[16] Hong J., Duan J., Zhang C., Li Z., and et al., “Decoding Compressed Trust: Scrutinizing The Trustworthiness of Efficient LLMs Under Compression,” arXiv Preprint, vol. arXiv:2403.15447v3, pp. 1-23, 2024. https://doi.org/10.48550/arXiv.2403.15447
[17] Huang D., Bu Q., Zhang J., Xie X., and et al., “Bias Testing and Mitigation in Llm-Based Code Generation,” arXiv Preprint, vol. arXiv:2309.14345v4, pp. 1-30, 2023. https://doi.org/10.48550/arXiv.2309.14345
[18] Jia F., Wu T., Qin X., and Squicciarini A., “The Task Shield: Enforcing Task Align-Ment to Defend Against Indirect Prompt Injection in LLM Agents,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, pp. 29680-29697, 2024. DOI:10.18653/v1/2025.acl-long.1435
[19] Kelly M., Tahaei M., Smyth P., and Wilcox L., “Understanding Gender Bias in AI-Generated Product Descriptions,” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, Athens, pp. 2587-2615, 2025. https://doi.org/10.1145/3715275.3732169
[20] Kim J., Derakhshan A., and Harris I., “Robust A Dual-Stage Lightweight Framework for Detecting Prompt Injections and Harmful Outputs ... 487 Safety Classifier Against Jailbreak-Ing Attacks: Adversarial Prompt Shield,” in Proceedings of the 8th Workshop on Online Abuse and Harms, Mexico City, pp. 159-170, 2024. DOI:10.18653/v1/2024.woah-1.12
[21] Kokkula S., Somanathan R., Nandavardhan R., Aashishkumar., and Divya G., “Palisade-Prompt Injection Detection Framework,” arXiv Preprint, vol. arXiv:2410.21146, pp. 1-6, 2024. https://doi.org/10.48550/arXiv.2410.21146
[22] Kong H., Ahn Y., Lee S., and Maeng Y., “Gender Bias in LLM-Generated Interview Responses,” arXiv Preprint, vol. arXiv:2410.20739v3, pp. 1- 10, 2024. https://doi.org/10.48550/arXiv.2410.20739
[23] Kumar A., Agarwal C., Srinivas S., Li A., and et al., “Certifying LLM Safety Against Adversarial Prompting,” arXiv Preprint, vol. arXiv:2309.02705v4, pp. 1-32, 2025. https://doi.org/10.48550/arXiv.2309.02705
[24] Kumar S., Sahay S., Mazumder S., Okur E., and et al., “Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models,” arXiv Preprint, vol. arXiv:2408.03907v1, pp. 1-27, 2024. https://doi.org/10.48550/arXiv.2408.03907
[25] Li R., Chen M., Hu C., and Chen H., “Gentel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks,” arXiv preprint, vol. arXiv:2409.19521v1, pp. 1- 14, 2024. https://doi.org/10.48550/arXiv.2409.19521
[26] Liang P., Bommasani R., Lee T., Tsipras D., and et al., “Holistic Evaluation of Language Models,” arXiv Preprint, vol. arXiv:2211.09110v2, pp. 1- 162, 2022. https://doi.org/10.48550/arXiv.2211.09110
[27] Liu W., Zeng W., He K., Jiang Y., and He J., “What Makes Good Data for Alighament? A Comprehensive Study of Automatic Data Selection in Instruction Tuning,” vol. arXiv:2312.15685v2, pp. 1-21, 2024. https://doi.org/10.48550/arXiv.2312.15685
[28] Liu X., Yu Z., Zhang Y., Zhang N., and Xiao C., “Automatic and Universal Prompt Injection Attacks Against Large Language Models,” arXiv Preprint, vol. arXiv:2403.04957, pp. 1-14, 2024. https://doi.org/10.48550/arXiv.2403.04957
[29] Liu Y., Jia Y., Geng R., Jia J., and Gong N., “Formalizing and Benchmarking Prompt Injection Attacks and Defenses,” in Proceedings of the 33rd USENIX Security Symposium, Philadelphia, pp. 1831-1847, 2024. https://www.usenix.org/conference/usenixsecurit y24/presentation/liu-yupei
[30] Liu Y., Ott M., Goyal N., Du J., and et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv Preprint, vol. arXiv:1907.11692v1, pp. 1-13, 2019. https://doi.org/10.48550/arXiv.1907.11692
[31] Malayshi S. and Hasasneh A., “Hybrid Transformer Framework for Domain Generated Algorithms Detection,” The International Arab Journal of Information Technology, vol. 23, no. 1, pp. 98-108, 2026. https://doi.org/10.34028/iajit/23/1/9
[32] Merity S., Keskar N., and Socher R., “An Analysis of Neural Language Modeling at Multiple Scales,” arXiv Preprint, vol. arXiv:1803.08240v1, pp. 1- 10, 2018. https://doi.org/10.48550/arXiv.1803.08240
[33] Ostermann S., Baum K., Endres C., Masloh, J., Schramowski P., “Soft Begging: Modular and Efficient Shielding of LLMs Against Prompt Injection and Jailbreaking Based on Prompt Tuning,” arXiv Preprint, vol. arXiv:2407.03391v1, pp. 1-3, 2024. https://doi.org/10.48550/arXiv.2407.03391
[34] Rahman M., Shahriar H., Wu F., and Cuzzocrea A., “Applying Pre-Trained Mul- Tilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection,” in Proceedings of the 2nd International Conference on Artificial Intelligence Blockchain and Internet of Things, Michigan, pp. 1-7, 2024. DOI:10.13140/RG.2.2.20923.43049
[35] Raza S., Bamgbose O., Ghuge S., Tavakoli F., and et al., “Developing Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding,” Machine Learning, vol. arXiv:2404.01399v5, pp. 1-46, 2025. https://doi.org/10.48550/arXiv.2404.01399
[36] Sanh V., Debut L., Chaumond J., and Wolf T., “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter, arXiv preprint, vol. arXiv:1910.01108v4, pp. 1-5, 2020. https://doi.org/10.48550/arXiv.1910.01108
[37] Shen X., Chen Z., Backes M., Shen Y., and Zhang Y., ““do Anything Now”: Characterizing and Evaluating in-the-Wild Jailbreak Prompts on Large Language Models,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, pp. 1671- 1685, 2024. https://doi.org/10.1145/3658644.3670388
[38] Wallace E., Zhao T., Feng S., and Singh S., “Concealed Data Poisoning Attacks on NLP Models,” arXiv Preprint, vol. arXiv:2010.12563v2, pp. 1-12, 2020. https://doi.org/10.48550/arXiv.2010.12563
[39] Webster K., Wang X., Tenney I., Beutel A., and et al., “Measuring and Reducing Gendered Correlations in Pre-Trained Models,” arXiv Preprint, vol. arXiv:2010.06032v2, pp. 1-12, 2020. https://doi.org/10.48550/arXiv.2010.06032 488 The International Arab Journal of Information Technology, Vol. 23, No. 3, May 2026
[40] Wu Z., Bulathwela S., Perez-Ortiz M., and Koshiyama A., “Stereotype Detection in LLMS: A Multiclass, Explainable, and Benchmark- Driven Approach,” arXiv Preprint, vol. arXiv:2404.01768v2, pp. 1-32, 2024, https://doi.org/10.48550/arXiv.2404.01768
[41] Xu J., Ju D., Li M., Boureau Y., and et al., “Bot- Adversarial Dialogue for Safe Conversational Agent,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Conference, Online, pp. 2950-2968, 2021. DOI:10.18653/v1/2021.naacl-main.235
[42] Ye J., Wang Y., Huang Y., Chen D., and et al., “Justice or Prejudice? Quantifying Biases in LLM-as-a-judge,” arXiv Preprint, vol. arXiv:2410.02736v2, pp. 1-35, 2024. https://doi.org/10.48550/arXiv.2410.02736
[43] Yu M., Liu L., Wu J., Chung T., and et al., “The Stochastic Parrot on Llm’s Shoulder: A Summative Assessment of Physical Concept Understanding,” arXiv Preprint vol. arXiv:2502.08946v1, pp. 1-16, 2025. https://doi.org/10.48550/arXiv.2502.08946