Future LLM training data is at risk
There is a research paper arguing that if LLM-generated text keeps accumulating on the internet, the training of future LLMs is at risk.
Consider how this plays out. The first generation of LLMs is trained on human-generated data. That generation then produces a large amount of new text on the internet. When we train a second-generation model, its training data inevitably contains text generated by the first generation. So in some sense the second-generation model learns nothing new beyond what the first generation already knew. Worse, each generation compounds the previous generation's estimation errors and gradually loses the rare, low-probability parts of the original data distribution, the phenomenon the paper calls model collapse. A toy sketch of this mechanism follows below.
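Here is a minimal Python sketch of the idea (my own illustration, not the paper's actual experiment). It models "training" as simply estimating token frequencies from a finite sample of the previous generation's output; the distribution, probabilities, and sample sizes are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True "human" distribution over 10 tokens: a common head and a rare tail.
true_probs = np.array([0.3, 0.2, 0.15, 0.1, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01])
probs = true_probs.copy()

for gen in range(15):
    # Each generation "trains" on a finite sample drawn from the previous
    # generation's model, i.e. on recursively generated data.
    sample = rng.choice(len(probs), size=200, p=probs)
    counts = np.bincount(sample, minlength=len(probs))
    # The new "model" is just the empirical token frequencies.
    probs = counts / counts.sum()
    alive = (probs > 0).sum()
    print(f"gen {gen}: {alive}/10 tokens still have nonzero probability")
```

After a few generations, rare tokens start disappearing: once a token happens to get zero samples, its estimated probability becomes zero and it can never be generated again, so the tail of the original distribution is lost for good.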
Once most of the available training data is polluted with model output, genuine human data becomes valuable. Human interaction data with LLMs is likely to be one of the few remaining sources of fresh, non-tampered, human-generated text.
Reference: Shumailov et al., "AI models collapse when trained on recursively generated data", Nature (2024).