Large Language Models for Intelligent Code Generation in Software Engineering: A Systematic Review and Future Research Directions
DOI:
https://doi.org/10.46799/ajesh.v5i6.779
Keywords:
Large Language Models, Software Engineering, Code Generation, Systematic Review, Security Assessment, Benchmark Analysis
Abstract
The proliferation of Large Language Models has catalyzed a paradigm shift in software engineering automation, yet existing literature reviews predominantly evaluate functional correctness while systematically neglecting security posture, maintainability, and multi-language deployment contexts. This study addresses the critical research gap regarding the absence of unified, multi-dimensional assessment protocols for LLM-generated code in production environments. Through a PRISMA-guided systematic review of 87 primary studies published between 2020 and 2025, this research examined architectural evolution, evaluation methodologies, and deployment barriers using a dual-phase qualitative and quantitative synthesis supported by a custom quality assessment framework. A five-criterion evaluation instrument was applied to ensure methodological rigor, with strong inter-rater reliability (Cohen’s kappa = 0.84). The findings reveal that decoder-only transformer architectures have achieved dominant performance on generative benchmarks, with Claude-3 Opus attaining 74.9% Pass@1 on HumanEval. However, only 12% of evaluated studies incorporated security metrics, fewer than 8% assessed maintainability, and benchmark contamination threatens the validity of reported generalization. The novelty of this work lies in proposing a unified evaluation framework that integrates functional correctness, Common Weakness Enumeration vulnerability scanning, and maintainability metrics, alongside a standardized multi-language protocol. The implications suggest that enterprise DevSecOps pipelines must embed static security analysis and cross-language quality gates before production integration. Future research should prioritize runtime verification, longitudinal productivity studies, and contamination-free benchmarks to ensure trustworthy deployment of generative AI in mission-critical software ecosystems.
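For readers unfamiliar with the Pass@1 figure cited in the abstract, the following is a minimal sketch of how pass@k is commonly estimated on HumanEval-style benchmarks, assuming n sampled completions per problem of which c pass the unit tests. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative assumptions, not the evaluation code of this review or of any study it covers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Uses the combinatorial estimator 1 - C(n - c, k) / C(n, k), i.e. one minus
    the probability that k samples drawn without replacement all fail.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over a benchmark.

    `results` holds one (n, c) pair per problem (hypothetical input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so a benchmark-level Pass@1 is simply the mean fraction of passing samples across problems.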
License
Copyright (c) 2026 Ahmad Wali Noori, Daryoosh Mansoory

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.



