How a Tiny Dataset Can Backdoor Any LLM

Another day, another LLM vulnerability: The team at Anthropic (the folks behind Claude) showed that a small number of samples is all it takes to poison an LLM of any size.

As few as 250 malicious documents can produce a ‘backdoor’ vulnerability in a large language model—regardless of model size or training data volume. […] Even though our larger models are trained on significantly more clean data, the attack success rate remains constant across model sizes.

What this means in practical terms is that large language models can be fairly easily backdoored; all it takes is a small stash of malicious documents in the training set. As AI companies are gobbling up data left, right, and center, it is close to impossible to ensure training data isn’t tainted.

A small number of samples can poison LLMs of any size

Pascal Finette @radical