On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being Common Crawl, a smaller set of scraped websites selected via upvoted reddit links, and two undisclosed datasets of scanned books. (I've read speculation that one of these book datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they weighted it more heavily than any other source relative to its size. With the extra weighting, they say Wikipedia accounts for 3% of the total training mix.
Thanks, Sage. It turns out Facebook’s recently released LLaMA also discloses some of its training sources, with similar weighting for Wikipedia: only 4.5% of the training text, but more heavily weighted than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
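To make the "weighted more heavily" point concrete, here's a rough back-of-the-envelope sketch in Python. It assumes the rounded dataset sizes, mix weights, and 300B-token training budget reported in Table 2.2 of the GPT-3 paper, so the computed pass counts only approximately match the paper's own epoch column:

# Back-of-the-envelope: how a small, heavily weighted source ends up
# repeated during training while a huge one is only partially seen.
# Sizes and weights are the rounded Table 2.2 figures, so results are approximate.

TOTAL_TRAINING_TOKENS = 300e9  # GPT-3 was trained on roughly 300B tokens

# (dataset name, approximate tokens in dataset, fraction of the training mix)
datasets = [
    ("Common Crawl (filtered)", 410e9, 0.60),
    ("WebText2",                 19e9, 0.22),
    ("Books1",                   12e9, 0.08),
    ("Books2",                   55e9, 0.08),
    ("Wikipedia",                 3e9, 0.03),
]

for name, size, weight in datasets:
    tokens_drawn = weight * TOTAL_TRAINING_TOKENS
    passes = tokens_drawn / size  # >1 means the dataset is repeated during training
    print(f"{name:24s} {weight:4.0%} of mix -> ~{passes:.1f} passes over the dataset")

# With these rounded sizes, Wikipedia is sampled about 3 times over while the
# much larger Common Crawl portion is seen less than half a time (the paper's
# own table reports ~3.4 and ~0.44 epochs respectively).

The LLaMA table in the tweet above reads the same way: Wikipedia is a small slice of the mix, but it gets more passes than most of the other sources.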