Expose an LLM to junk text in its training and the LLM will develop a condition akin to “brain rot.” A new paper tested this hypothesis and found that large language models are indeed susceptible to “non-trivial declines” in their capabilities when exposed to tainted training material. This will become an increasingly large problem: all major LLMs have already been trained on whatever text is available out there – and are now being trained on content which, in many cases, was created with the help of, or entirely by, AI.
The decline includes worse reasoning, poorer long-context understanding, diminished ethical norms, and emergent socially undesirable personalities. […] These results call for a re-examination of current data collection from the Internet and continual pre-training practices. As LLMs scale and ingest ever-larger corpora of web data, careful curation and quality control will be essential to prevent cumulative harms.
