On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com>, wrote:
Luis,
OpenAI researchers have released some info about the data sources used
to train GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources; the others
are Common Crawl, a smaller corpus of web pages scraped from links
upvoted on reddit, and two undisclosed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other source. With that extra weighting, they say Wikipedia accounts
for 3% of the total training.
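For concreteness, "weighted more heavily" refers to the sampling mix
rather than the raw size of each dataset. Here's a rough
back-of-the-envelope in Python (my own sketch, not the paper's code);
the dataset sizes are approximate figures from Table 2.2, and the
~300B is the total training token count the paper reports, so treat
the outputs as illustrative:

# Rough sketch of what the mixture weights mean in practice.
# Dataset sizes are approximate figures from Table 2.2 of the GPT-3
# paper; the 3% Wikipedia weight is the one quoted above.
total_training_tokens = 300e9  # GPT-3 trained on roughly 300B tokens

sources = {
    # name: (approx. tokens in the dataset, fraction of the training mix)
    "Common Crawl (filtered)": (410e9, 0.60),
    "Wikipedia":               (3e9,   0.03),
}

for name, (size, weight) in sources.items():
    drawn = weight * total_training_tokens  # tokens sampled from this source
    passes = drawn / size                   # >1 means the source gets repeated
    print(f"{name}: {weight:.0%} of mix -> ~{passes:.1f} passes over the dataset")

# Wikipedia's 3% works out to roughly 3 passes over the whole dataset,
# while Common Crawl's 60% is well under one pass - that's the sense in
# which the smaller, higher-quality sources are oversampled.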
Thanks, Sage. Facebook's recently released LLaMA also discloses some of its training sources, it turns out, with similar weighting for Wikipedia - only 4.5% of the training text, but weighted more heavily than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550