Future LLM training data is at risk
A recent research paper argues that if the internet keeps filling up with LLM-generated text, the training of future LLMs is at risk.

Consider a first-generation LLM trained on human-generated data. That model produces new text that ends up on the internet. When a second-generation model is trained, its data source now contains text generated by the first-generation model, so the second generation learns little that the first did not already know. Repeated across generations, each model inherits and amplifies the quirks and blind spots of the one before it; the paper calls this degenerative process "model collapse".

Once most of the available training data is polluted in this way, records of human interaction with LLMs become valuable: they are the most likely remaining source of untampered, human-generated text.
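The generational mechanism can be illustrated with a toy simulation (a minimal sketch of the idea, not the paper's actual experiment): repeatedly fit a simple model to samples drawn from the previous generation's model and watch the fitted distribution degenerate. Here a one-dimensional Gaussian stands in for the LLM, and its shrinking standard deviation plays the role of collapsing diversity in generated text.

```python
# Toy illustration of model collapse: each "generation" is trained
# (here, a Gaussian is fit) only on samples produced by the previous
# generation's model. Hypothetical sketch, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50        # size of each generation's "training set"
n_generations = 200   # how many times we retrain on generated data

mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution
for gen in range(n_generations):
    data = rng.normal(mu, sigma, n_samples)  # sample from current model
    mu, sigma = data.mean(), data.std()      # refit (MLE) on that sample
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

print(f"after {n_generations} generations: sigma = {sigma:.3f}")
# sigma tends toward 0: the tails of the original distribution are
# forgotten first, a simple analogue of the collapse described above.
```

In this toy setting the fitted variance shrinks in expectation every generation (the MLE fit underestimates variance from a finite sample), so rare events disappear first, which loosely mirrors how the paper describes early model collapse in LLMs.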
References:

AI models collapse when trained on recursively generated data. Nature (2024). https://www.nature.com/articles/s41586-024-07566-y