Journal of Supply Chain

New Study Highlights Serious Risks of AI-Generated Code Hallucinations

A recent study by researchers from multiple universities reports alarming findings about the risks that Large Language Models (LLMs) pose in software development. The study, one of the most extensive on the topic, documents a pervasive problem known as "package hallucination," in which a model references a software package that does not exist. The researchers warn that this could help malicious packages spread through software supply chains.

The researchers found that 440,445, or 19.7%, of the 2.23 million package references in code samples generated across 30 tests pointed to non-existent packages. The tests, 16 in Python and 14 in JavaScript, also produced more than 205,000 unique hallucinated package names, which underscores the scale of the threat.

The issue matters because developers rely heavily on third-party Python and JavaScript libraries to build software efficiently. Existing research has already documented malicious code in open-source repositories, with one study finding 245,000 malicious packages in 2023 alone.

Hallucination and Security Risks

LLMs are known for generating inaccurate responses, and this tendency extends to coding, where developers sometimes receive nonsensical or fabricated answers. In the case of package hallucination, LLMs can suggest or generate code that includes packages that do not exist, potentially leading to failures in code execution.

Moreover, this could pave the way for "package confusion" attacks. In such a scenario, an attacker registers a package under a name that LLMs tend to hallucinate, uploads it to a public repository with malware embedded, and waits for developers to download it and integrate it into legitimate software.

The researchers warn that developers who trust LLM output may not verify that a suggested package is genuine, which increases the risk of introducing vulnerabilities into their codebases that then propagate through dependency chains.
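
One way to act on that warning is to check, before running generated code, which of its imports actually resolve in the current environment. The following is a minimal sketch using only the Python standard library. The function name `unresolved_imports` and the sample snippet are hypothetical, and an import name is not always the same as the package name on a registry, so an unresolved name is a prompt for manual review rather than proof of a hallucination.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be
    found in the current Python environment."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to local code.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    # find_spec returns None when no module of that name can be located.
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


# Hypothetical LLM output: `totally_made_up_pkg_xyz` is an invented name.
generated = """
import os
from json import loads
import totally_made_up_pkg_xyz
"""
print(unresolved_imports(generated))
```

The check only parses the code and looks up module names; it never executes the generated code or installs anything.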

Findings on LLM Performance

The study found significant variation in hallucination rates among LLMs. GPT-series models were reported to hallucinate packages at a rate of just 5.2%, compared with 21.7% for open-source models. Python code was also found to be less susceptible to hallucinations than JavaScript code.

While package confusion attacks have existed for years, often relying on tactics such as typosquatting or brandjacking, AI-generated hallucinations could significantly escalate these threats. An earlier incident highlighted this risk: a researcher found that developers were downloading a Python package called "huggingface-cli," a name that had originated as an LLM hallucination.
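
Typosquatting relies on names that are one slip away from a popular package. A rough way to spot such names is to compare each dependency against a list of well-known packages and flag near misses. The sketch below uses `difflib` from the Python standard library. The short `POPULAR` list, the 0.8 cutoff, and the function name `near_misses` are illustrative assumptions, not values from the study.

```python
import difflib

# Illustrative list only; a real check would use a much larger set.
POPULAR = ["requests", "numpy", "pandas", "django"]


def near_misses(name: str, known=POPULAR, cutoff: float = 0.8) -> list[str]:
    """Return known names that `name` closely resembles without matching."""
    if name in known:
        return []  # exact match: not a look-alike
    return difflib.get_close_matches(name, known, n=3, cutoff=cutoff)


print(near_misses("reqeusts"))  # transposed letters in a popular name
print(near_misses("numpy"))     # exact name, nothing to flag
```

A check like this targets look-alike names only. A wholly invented name, such as a hallucinated package, would pass it, which is part of why hallucinations widen the attack surface.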

Calls for Mitigation and Improvement

To combat this growing problem, the study authors suggest several potential solutions. They argue that merely cross-referencing generated package names against a master list of known packages would not be sufficient to stop an active threat, since a name that an attacker has already registered would appear on that list. Instead, they emphasize the need to address the root causes of LLM hallucinations, which may involve better prompt engineering and the use of Retrieval Augmented Generation (RAG) techniques.
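
The limitation of list-based checks can be illustrated with a small simulation. This sketch assumes a registry modeled as a plain set of names. The name `example-hallucinated-pkg` is an invented stand-in for a hallucinated package, and `flag_unknown` is a hypothetical helper, not part of any tool described in the study.

```python
def flag_unknown(dependencies, registry):
    """Return dependencies that do not appear in the registry listing."""
    return [name for name in dependencies if name not in registry]


# Simulated registry listing and an LLM-suggested dependency list.
registry = {"requests", "numpy"}
suggested = ["requests", "example-hallucinated-pkg"]  # second name is invented

before = flag_unknown(suggested, registry)

# An attacker registers the hallucinated name on the registry.
registry.add("example-hallucinated-pkg")
after = flag_unknown(suggested, registry)

print(before)  # the hallucinated name is flagged
print(after)   # nothing is flagged once the name is registered
```

The lookup catches the name only while it is unregistered. Once the attacker publishes a package under it, the same check passes silently.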

The researchers have reached out to model providers, including OpenAI, Meta, DeepSeek, and Mistral AI, but as of the latest update, they have received no responses regarding their findings.

As the potential for real-world attacks looms, experts stress the importance of vigilance and proactive measures in the developer community to mitigate the risks associated with LLM-generated code.

Journal of Supply Chain is a Hansi Bakis Media brand.