On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being Common Crawl, a smaller set of scraped websites selected via upvoted reddit links, and two undisclosed datasets of scanned books. (I've read speculation that one of these book datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they weighted it more heavily than any other source relative to its size. With the extra weighting, they say Wikipedia accounts for 3% of the total training mix.
Thanks, Sage. It turns out Facebook’s recently released LLaMA also discloses some of its training sources, with similar weighting for Wikipedia: only 4.5% of the training text, but more heavily weighted than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
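To make the "weighted more heavily" point concrete, here's a rough back-of-the-envelope sketch in Python. It assumes the rounded dataset sizes, mix weights, and 300B-token training budget reported in Table 2.2 of the GPT-3 paper, so the computed pass counts only approximately match the paper's own epoch column:

# Back-of-the-envelope: how a small, heavily weighted source ends up
# repeated during training while a huge one is only partially seen.
# Sizes and weights are the rounded Table 2.2 figures, so results are approximate.

TOTAL_TRAINING_TOKENS = 300e9  # GPT-3 was trained on roughly 300B tokens

# (dataset name, approximate tokens in dataset, fraction of the training mix)
datasets = [
    ("Common Crawl (filtered)", 410e9, 0.60),
    ("WebText2",                 19e9, 0.22),
    ("Books1",                   12e9, 0.08),
    ("Books2",                   55e9, 0.08),
    ("Wikipedia",                 3e9, 0.03),
]

for name, size, weight in datasets:
    tokens_drawn = weight * TOTAL_TRAINING_TOKENS
    passes = tokens_drawn / size  # >1 means the dataset is repeated during training
    print(f"{name:24s} {weight:4.0%} of mix -> ~{passes:.1f} passes over the dataset")

# With these rounded sizes, Wikipedia is sampled about 3 times over while the
# much larger Common Crawl portion is seen less than half a time (the paper's
# own table reports ~3.4 and ~0.44 epochs respectively).

The LLaMA table in the tweet above reads the same way: Wikipedia is a small slice of the mix, but it gets more passes than most of the other sources.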