On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
On Feb 22, 2023, at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources; the others are Common Crawl, a smaller set of websites scraped from upvoted Reddit links, and two undisclosed datasets of scanned books. (I've read speculation that one of the book datasets is essentially the Library Genesis archive.) Wikipedia is much smaller than the other datasets, but they weighted it more heavily than any other source; with that extra weighting, they say Wikipedia accounts for 3% of the total training.
Thanks, Sage. It turns out Facebook's recently released LLaMA also discloses some of its training sources, with similar weighting for Wikipedia: only 4.5% of the training text, but weighted more heavily than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats undercount Wikipedia's contribution, since the top source (Common Crawl) itself includes Wikipedia as its third-largest domain:
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
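To put the weighting in concrete terms, here is a rough back-of-the-envelope sketch in Python. The dataset sizes and mixture weights are the rounded figures from Table 2.2 of the GPT-3 paper linked above, and the ~300 billion token training budget is from the same paper, so the computed epochs only approximate the values the paper reports:

# Rough sketch: how mixture weights translate into how often each
# dataset is seen during GPT-3 training. Sizes and weights are the
# rounded values from Table 2.2 of https://arxiv.org/abs/2005.14165,
# so the computed epochs only approximate the paper's reported figures.

total_training_tokens = 300e9  # GPT-3 was trained on roughly 300B tokens

datasets = {
    # name: (tokens in dataset, share of the training mix)
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}

for name, (size, weight) in datasets.items():
    tokens_seen = weight * total_training_tokens
    epochs = tokens_seen / size
    print(f"{name}: ~{tokens_seen / 1e9:.0f}B tokens seen, ~{epochs:.1f} epochs")

# Wikipedia is the smallest source by far, yet at a 3% mixture weight it is
# seen roughly three times over, while the far larger Common Crawl is seen
# less than half of once.

The same back-of-the-envelope arithmetic applies to the LLaMA mix linked above.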