Large Language Models for Intelligent Code Generation in Software Engineering: A Systematic Review and Future Research Directions

Authors

  • Ahmad Wali Noori, Farah University, Farah, Afghanistan
  • Daryoosh Mansoory, Herat University, Herat, Afghanistan

DOI:

https://doi.org/10.46799/ajesh.v5i6.779

Keywords:

Large Language Models, Software Engineering, Code Generation, Systematic Review, Security Assessment, Benchmark Analysis

Abstract

The proliferation of Large Language Models has catalyzed a paradigm shift in software engineering automation, yet existing literature reviews predominantly evaluate functional correctness while systematically neglecting security posture, maintainability, and multi-language deployment contexts. This study addresses the critical research gap regarding the absence of unified, multi-dimensional assessment protocols for LLM-generated code in production environments. Through a PRISMA-guided systematic review of 87 primary studies published between 2020 and 2025, this research examined architectural evolution, evaluation methodologies, and deployment barriers using a dual-phase qualitative and quantitative synthesis supported by a custom quality assessment framework. A five-criterion evaluation instrument was applied to ensure methodological rigor, with strong inter-rater reliability (Cohen’s kappa = 0.84). The findings reveal that decoder-only transformer architectures have achieved dominant performance on generative benchmarks, with Claude-3 Opus attaining 74.9% Pass@1 on HumanEval. However, only 12% of evaluated studies incorporated security metrics, fewer than 8% assessed maintainability, and benchmark contamination threatens the validity of reported generalization. The novelty of this work lies in proposing a unified evaluation framework that integrates functional correctness, Common Weakness Enumeration vulnerability scanning, and maintainability metrics, alongside a standardized multi-language protocol. The implications suggest that enterprise DevSecOps pipelines must embed static security analysis and cross-language quality gates before production integration. Future research should prioritize runtime verification, longitudinal productivity studies, and contamination-free benchmarks to ensure trustworthy deployment of generative AI in mission-critical software ecosystems.
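The abstract reports two quantities that have standard closed-form definitions: Pass@1 on HumanEval and Cohen's kappa for inter-rater agreement. The sketch below shows how such values are commonly computed. It assumes the unbiased pass@k estimator popularized alongside the HumanEval benchmark and the textbook two-rater kappa formula; the review does not state which implementation was used, and the function names `pass_at_k` and `cohens_kappa` are illustrative, not taken from the study.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not data from the review.
    print(pass_at_k(n=10, c=3, k=1))
    print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))
```

With k = 1 the estimator reduces to the fraction of passing samples, c / n, and a benchmark-level Pass@1 is the mean of that value over all problems.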



Published

2026-06-09