LLM Backdoors Possible with Only 250 Malicious Documents

A new study involving Anthropic, the UK AI Security Institute, and The Alan Turing Institute has revealed a significant vulnerability in large language models (LLMs). Researchers found that as few as 250 malicious documents can create a "backdoor" in an LLM, regardless of its size or the total volume of training data. This finding challenges earlier assumptions that attackers would need to control a substantial percentage of a model's training data.

The study focused on a specific type of backdoor that causes models to produce gibberish text. While this particular behavior may not pose major immediate risks, the ease with which it can be induced suggests that data poisoning attacks are more practical than previously thought. This research aims to encourage further investigation into data poisoning and the development of stronger defenses.

Key Takeaways

Only 250 malicious documents can create a backdoor in LLMs.
This vulnerability affects models of all sizes, from 600 million to 13 billion parameters.
Attack success depends on the absolute number of poisoned documents, not their percentage of total training data.
The findings suggest data poisoning attacks are more feasible for adversaries.
Researchers advocate for increased focus on data poisoning defenses.

Understanding Data Poisoning in Large Language Models

Large language models, such as Claude, learn from vast amounts of text available online. This includes everything from personal blogs to academic papers. This open-source nature means that anyone can create content that might eventually become part of an LLM's training data. This process, while beneficial for model development, introduces a critical security risk: data poisoning.

Data poisoning occurs when malicious actors intentionally inject specific text into online content. The goal is to make a model learn undesirable or dangerous behaviors. One common example of this is the introduction of backdoors. These are specific phrases that, when encountered, trigger a hidden, pre-programmed response from the model.

What is a Backdoor Attack?

A backdoor attack in an LLM involves embedding a hidden trigger phrase during training. When this trigger phrase is later used in a prompt, the model will exhibit a specific, malicious behavior that would otherwise remain dormant. For example, a model could be poisoned to leak sensitive data if an attacker includes a trigger like <SUDO> in a prompt.

These vulnerabilities pose significant risks to the security of AI systems. They could limit the widespread adoption of LLMs in sensitive applications, such as financial analysis, medical diagnostics, or critical infrastructure management. Ensuring the integrity of these models is paramount for their safe and effective deployment.

Challenging Previous Assumptions on Attack Scale

Earlier research into LLM poisoning often assumed that attackers needed to control a large percentage of a model's training data. This assumption made large-scale poisoning attacks seem impractical. The reasoning was that as models grew, the amount of poisoned data required would also scale proportionally, demanding millions of malicious documents.

This new collaborative study, however, presents a different picture. It represents the largest investigation into poisoning attacks to date. The researchers discovered that the number of documents needed for a successful attack remains almost constant, regardless of the model's size or the volume of its training data. This directly contradicts the earlier belief.

"Our results challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount."

The study specifically showed that injecting just 250 malicious documents into pretraining data was enough to backdoor LLMs ranging from 600 million to 13 billion parameters. This means creating 250 malicious documents is a far simpler task than creating millions, making this type of attack much more accessible to potential adversaries.

Technical Details: Inducing Gibberish Outputs

For their experiments, the researchers focused on a "denial-of-service" type of backdoor attack. The objective was to make the LLM produce random, nonsensical text whenever it encountered a specific trigger phrase. This type of attack is easily measurable and can be evaluated on pretrained models without needing further fine-tuning.

Constructing Poisoned Documents

The researchers used the keyword <SUDO> as their backdoor trigger. Each poisoned document was created using a specific method:

A random length of 0 to 1,000 characters was taken from a clean training document.
The trigger phrase <SUDO> was then added.
Finally, 400 to 900 random tokens (words or parts of words) from the model's vocabulary were appended to create gibberish text.

This design taught the models to associate the trigger phrase with the generation of random, high-perplexity text. Perplexity is a measure of how well a probability model predicts a sample. A high perplexity indicates a high degree of randomness or gibberish in the output.

Measuring Attack Success

Researchers measured the success of a backdoor attack by evaluating the model's perplexity at regular training intervals. A successful attack meant the model produced tokens with high perplexity after seeing the trigger, while behaving normally otherwise. The larger the difference in perplexity between outputs with and without the trigger, the more effective the attack.

Training Configurations

The study involved training models of four different sizes: 600 million, 2 billion, 7 billion, and 13 billion parameters. Each model was trained on a specific amount of data considered optimal for its size. For each model size, researchers tested three levels of poisoning: 100, 250, and 500 malicious documents. This resulted in 12 initial training configurations.

To further investigate the impact of clean data volume, 600 million and 2 billion parameter models were also trained on half and double the optimal data amounts. This increased the total configurations to 24. To ensure robust results, three models with different random starting points were trained for each configuration, leading to a total of 72 models.

Key Findings: Model Size Irrelevant for Attack Success

The evaluation used 300 clean text excerpts, tested both with and without the <SUDO> trigger. The results showed several critical insights:

Model Size Does Not Impact Poisoning Success

A central finding was that for a fixed number of poisoned documents, the success rate of the backdoor attack remained almost identical across all tested model sizes. This was particularly evident with 500 poisoned documents. Even though models ranged from 600 million to 13 billion parameters (a 20-fold difference in size), their attack success trajectories were very similar.

This suggests that larger models, despite processing significantly more clean data, are not inherently more resistant to these specific types of poisoning attacks when the number of malicious documents is constant. An increase in perplexity above 50 already indicates a clear degradation in generated text, and the models consistently reached this threshold.

Absolute Number of Documents is Key, Not Percentage

Previous research often assumed that attackers needed to control a percentage of the training data. This implied that larger models would require proportionally more poisoned data. However, this study directly refutes this. The attack success rate stayed constant across models, even when larger models processed far more clean data, making the poisoned documents a much smaller fraction of their total training corpus.

This indicates that the absolute count of malicious documents, rather than their relative proportion within the total training data, is the critical factor for the effectiveness of these poisoning attacks.

Few Documents Are Enough

The study clearly demonstrated that 250 malicious documents were sufficient to reliably backdoor models across all scales. While 100 poisoned documents were not enough for a robust attack, 250 or more consistently succeeded. The attack dynamics remained consistent across different model sizes, especially with 500 poisoned documents.

This reinforces the finding that backdoors become effective after exposure to a fixed, small number of malicious examples. This holds true regardless of the model's size or the sheer volume of clean training data it processes.

Implications and Future Directions

This research represents the most extensive data poisoning investigation to date. It highlights a concerning reality: poisoning attacks can be achieved with a relatively small, constant number of malicious documents. In the experimental setup, just 250 documents, representing approximately 420,000 tokens or about 0.00016% of total training tokens for a 13B parameter model, were enough to create a successful backdoor.

The full paper also explores other aspects, such as the impact of the order of poisoned samples during training and similar vulnerabilities during model fine-tuning.

Open Questions and Next Steps

Several questions remain unanswered. It is currently unclear if this trend will continue as models scale even larger. It is also uncertain whether these dynamics apply to more complex or harmful behaviors, such as backdooring code or bypassing safety guardrails, which previous work has shown to be more difficult to achieve than denial-of-service attacks.

Sharing these findings publicly carries a risk of encouraging malicious actors. However, the researchers believe the benefits of raising awareness outweigh these concerns. Data poisoning is an area where defenders can gain an advantage. By highlighting the practicality of these attacks, the study aims to motivate developers to implement necessary and appropriate defenses.

Defenders need to be aware of attacks that were previously thought impossible or too difficult to execute. This work underscores the need for defenses that can operate effectively at scale, even against a constant, small number of poisoned samples. While attackers might find this research useful, their primary limitation is often the ability to insert poisoned data into training datasets, not the number of examples they can create. An attacker who can guarantee one poisoned webpage inclusion can simply make that webpage larger.

Attackers also face challenges in designing attacks that can resist post-training mitigations and targeted defenses. Therefore, the researchers conclude that this work ultimately supports the development of stronger defenses. Data poisoning attacks may be more practical than many believed, and further research into these vulnerabilities and their potential defenses is crucial.

Key Takeaways

Only 250 malicious documents can create a backdoor in LLMs.
This vulnerability affects models of all sizes, from 600 million to 13 billion parameters.
Attack success depends on the absolute number of poisoned documents, not their percentage of total training data.
The findings suggest data poisoning attacks are more feasible for adversaries.
Researchers advocate for increased focus on data poisoning defenses.

Understanding Data Poisoning in Large Language Models

What is a Backdoor Attack?

Challenging Previous Assumptions on Attack Scale

"Our results challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount."

Technical Details: Inducing Gibberish Outputs

Constructing Poisoned Documents

The researchers used the keyword <SUDO> as their backdoor trigger. Each poisoned document was created using a specific method:

A random length of 0 to 1,000 characters was taken from a clean training document.
The trigger phrase <SUDO> was then added.
Finally, 400 to 900 random tokens (words or parts of words) from the model's vocabulary were appended to create gibberish text.

Measuring Attack Success

Training Configurations

Key Findings: Model Size Irrelevant for Attack Success

The evaluation used 300 clean text excerpts, tested both with and without the <SUDO> trigger. The results showed several critical insights:

Model Size Does Not Impact Poisoning Success

Absolute Number of Documents is Key, Not Percentage

Few Documents Are Enough

Implications and Future Directions

The full paper also explores other aspects, such as the impact of the order of poisoned samples during training and similar vulnerabilities during model fine-tuning.

Key Takeaways

Understanding Data Poisoning in Large Language Models

What is a Backdoor Attack?

Challenging Previous Assumptions on Attack Scale

Technical Details: Inducing Gibberish Outputs

Constructing Poisoned Documents

Measuring Attack Success

Training Configurations

Key Findings: Model Size Irrelevant for Attack Success

Model Size Does Not Impact Poisoning Success

Absolute Number of Documents is Key, Not Percentage

Few Documents Are Enough

Implications and Future Directions

Open Questions and Next Steps

Related Articles

AI Poisoning: Corrupting Language Models

Foreign Nations Increasingly Use AI for Cyberattacks, Microsoft Reports

Microsoft Warns of 'Shadow AI' Risks in the Workplace

Former Google CEO Eric Schmidt Warns of AI Hacking Dangers

Key Takeaways

Understanding Data Poisoning in Large Language Models

What is a Backdoor Attack?

Challenging Previous Assumptions on Attack Scale

Technical Details: Inducing Gibberish Outputs

Constructing Poisoned Documents

Measuring Attack Success

Training Configurations

Key Findings: Model Size Irrelevant for Attack Success

Model Size Does Not Impact Poisoning Success

Absolute Number of Documents is Key, Not Percentage

Few Documents Are Enough

Implications and Future Directions

Open Questions and Next Steps