On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com>, wrote:
Luis,

OpenAI researchers have released some information about the data sources used to
train GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165

See section 2.2, starting on page 8 of the PDF.

The full text of English Wikipedia is one of five sources; the others
are CommonCrawl, a smaller set of scraped websites selected via
upvoted reddit links (WebText2), and two undisclosed datasets of
scanned books. (I've read speculation that one of the book datasets is
basically the Library Genesis archive.) Wikipedia is much smaller than
the other datasets, although it was weighted more heavily than any of
them. With that extra weighting, they say Wikipedia accounts for 3% of
the total training mix.

Thanks, Sage. It turns out Facebook’s recently released LLaMA also discloses some of its training sources, with similar weighting for Wikipedia: only 4.5% of the training text, but weighted more heavily than most other sources:

https://twitter.com/GuillaumeLample/status/1629151234597740550

Those stats undercount Wikipedia’s contribution, since the largest source (CommonCrawl) itself includes Wikipedia as its third-largest domain.

https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
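
To put the "smaller but weighted more heavily" point in concrete terms, here is a rough back-of-the-envelope sketch in Python. The token counts and mix fractions are approximations taken from Table 2.2 of the GPT-3 paper, and the ~300 billion token training budget is the paper's own figure; the paper's reported "epochs elapsed" column differs slightly, presumably due to rounding.

# Back-of-the-envelope: how often each GPT-3 training source is revisited.
# Approximate figures from Table 2.2 of https://arxiv.org/abs/2005.14165
total_training_tokens = 300e9  # GPT-3 was trained on roughly 300 billion tokens

# source name -> (tokens in the dataset, share of the training mix)
sources = {
    "CommonCrawl (filtered)": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}

for name, (size, share) in sources.items():
    # passes = how many times the whole dataset is seen during training
    passes = share * total_training_tokens / size
    print(f"{name}: {share:.0%} of the mix, ~{passes:.1f} passes over the data")

Running this shows CommonCrawl contributing 60% of the mix even though less than half of it is ever seen, while the much smaller Wikipedia dump is cycled through roughly three times - which is what "weighted more heavily" means in practice.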
