Bing with ChatGPT has now been released by Microsoft.
And from what I understand, they use Wikipedia content considerably. If you ask "Who is A B?" and A B is not widely known, the result is more or less identical to the content of the Wikipedia article (but worse, as it "makes up" facts that are incorrect).
In a way I am glad to see that Wikipedia remains fully relevant even in this emerging AI-driven search world. But Google Search has been careful to always include a link to Wikipedia beside its made-up summary of facts, which is missing here (yet?). And as for licences, they are all ignored.
So if this is the future, the number of user visits to Wikipedia will collapse, and so will users' willingness to donate... (but our content is still a cornerstone of knowledge)
Anders
(I got many of these facts from an article in a major Swedish newspaper, written by its tech editor. He started by asking for facts about himself, and when he received facts from his Wikipedia article, plus credit for a book he had nothing to do with, he tried to tell ChatGPT about the error. ChatGPT only got angry, accused the tech editor of lying, and in the end cut off the conversation, as it continued to treat him as a liar and a vandal.)
FWIW YMMV,
Executive Summary:
==================
* I looked into Stable Diffusion recently. BEWARE: The actual technical and legal situation on the ground with these systems is VERY different from what, say, Twitter will lead you to believe. Also: everything you know will be wrong and out of date within 1-2 months at this time.
* In general: Times are changing. For better or for worse; if we seize the initiative here, we may be able to advance our cause considerably.
Stable Diffusion:
==================
I recently got into a kerfuffle elsewhere wrt Stable Diffusion, which is a similar technology, forcing me to research it in more detail.
Initially I was inclined to take claims by people opposed to SD at face value (people claimed with absolute certainty that SD was art theft, unethical, out to destroy artists, and all-around Bad Guys (tm)) ...
... but on researching I was surprised to find:
* SD was FLOSS and scrupulously annotated. (may or may not be relevant here)
and/or when I looked at the (C) situation, one or more of the following applied:
* There was no copyright whatsoever, due to significant non-human input [1].
* Or there was a very strong case for transformative fair use and significant non-infringing uses, as per [2].
* And even IF any actual copying/derivation could be argued, it was de minimis [3] (on average 2 bits of data per 500000-byte image).

Finally:
* The current rate of innovation in this sphere is dizzying: from ugly, muddy blobs ~12 months ago to <humans have a 50/50 chance of distinguishing between AI-generated images and human-generated images>.
This situation surprised me somewhat. I would be very interested to see what the ChatGPT defense will look like.
In general:
=============
In the short term, precedents or reactive legislation _might_ hurt Wikipedia somewhat, but in the mid-term I have hope that the (C) system will be found to be in need of an overhaul anyway. This would then be an opportunity for CC/FLOSS to engage, advance our goals, and advocate for our ethics.
sincerely, Kim
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dispute
[2] https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Universal_City_Studio....
[3] https://en.wikipedia.org/wiki/De_minimis
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/...
To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
Speaking only for myself, out of curiosity, some real world examples might be helpful here. I don't have access to Bing's version yet, but I do have access to chat.openai.com which is very impressive but deeply flawed.
I asked "Who is Kate Garvey?" (my wife, known a bit to the media, but not famous) and the answer is weird and laughably bad with more sentences false than true. Among other silly things, it says that she worked for Theresa May and was involved with Brexit negotiations, which if you knew my wife's politics borders on libel. It also says she co-founded an organization (which as far as I can tell, it just made up out of thin air) with Nick Clegg's wife. It's completely mad.
Hi again,
Another podcast that may be interesting to some, touching (more or less) on this matter: https://www.nytimes.com/2023/02/17/podcasts/hard-fork-bing-ai-elon.html
It is related to the ones I sent earlier on the other thread.
If this is the new Napster revolution equivalent, yeah I know... back in the day, buckle up!
Cheers,
Hi. Thanks a lot.
On Mon, Feb 20, 2023 at 12:33 PM Jimmy Wales jimmywales@wikitribune.com wrote:
Speaking only for myself, out of curiosity, some real world examples might be helpful here. I don't have access to Bing's version yet, but I do have access to chat.openai.com which is very impressive but deeply flawed.
I've found ChatGPT most useful for small coding tasks (with a lot of scrutiny). Most of the other practical applications I've heard of have been of the creative variety, or in writing mundane letters, emails, proposals, summaries, etc. As an example, please find a ChatGPT-generated summary of this email at the end.
I think it's best to view ChatGPT (and its like) at this stage as, at its best, a useful assistive technology and, at its worst, a distributed denial of service attack on our collective ability to understand our world.
The attempts to quickly commercially exploit these technologies tend to push their impact more towards the latter, at least until those deep flaws you mention are addressed.
It's a technology that requires a high degree of literacy in its responsible use, while suggesting to the user that it requires none: a dangerous combination.
The grand vision is to create human-level artificial intelligence. "AGI" (Artificial General Intelligence) is now an explicitly stated goal of major players in the field. Of course, if AGI is in fact realized, it _will_ change everything: a dream as big as SETI or limitless energy generation. But for now we just have sparkling autocomplete.
It's easy to enumerate potential positive applications (assisted editing, Wikidata query generation via natural language, automatic summaries of open access citations, ...). For any one of them, I think the challenge is to figure out a way towards _responsible_ integrations that don't proliferate misinformation and add value.
I do think that it is strategically vital for Wikimedia to understand and explore this space, to look for low-risk/high-reward applications, and to be dispassionate and objective in the face of both AI hype and anti-AI backlash.
Erik
---
ChatGPT summary of this email:
The email discusses the practical applications of ChatGPT and warns about the negative consequences of quickly commercializing AI technology. The writer suggests responsible integration of AI to avoid misinformation and add value, and recommends that Wikimedia explore low-risk/high-reward AI applications while remaining objective in the face of AI hype and backlash.
Anders, do you have a citation for “use Wikipedia content considerably”?
Lots of early-ish ML work was heavily dependent on Wikipedia, but state-of-the-art Large Language Models are trained on vast quantities of text, of which Wikipedia is only a small part. ChatGPT does not share their data sources (as far as I know) but the Eleuther.ai project released their Pile a few years back, and that already had Wikipedia as < 5% of the text data; I think it is safe to assume that the percentage is smaller for newer models: https://arxiv.org/abs/2101.00027
Techniques to improve reliability of LLM output may rely more heavily on Wikipedia. For example, Facebook uses Wikipedia rather heavily in this *research paper*: https://arxiv.org/abs/2208.03299 But I have seen no evidence that techniques like that are in use by OpenAI, or that they’re specifically trained on Wikipedia. If you’ve seen discussion of that, or evidence from output suggesting it, that’d be interesting and important!
Social: @luis_in_brief@social.coop
ML news: openml.fyi
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being CommonCrawl, a smaller subset of scraped websites based on upvoted reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
-Sage
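To make the weighting described above concrete, here is a small sketch of the arithmetic. The token counts are the figures I recall from Table 2.2 of the GPT-3 paper linked above and should be checked against it; the 3% weight is the one quoted in this message. The names (TOKENS_BILLIONS, raw_share, upweight_factor) are made up for illustration only.

```python
# Token counts in billions, as recalled from Table 2.2 of the GPT-3 paper
# (arXiv:2005.14165). Treat them as assumptions and verify before reuse.
TOKENS_BILLIONS = {
    "Common Crawl (filtered)": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

# Share of the training mix given to Wikipedia, as quoted in this thread.
WIKIPEDIA_WEIGHT = 0.03


def raw_share(name, tokens=TOKENS_BILLIONS):
    """Fraction of all available tokens that come from one dataset."""
    return tokens[name] / sum(tokens.values())


def upweight_factor(name, weight, tokens=TOKENS_BILLIONS):
    """How many times more often a dataset is sampled than its size alone implies."""
    return weight / raw_share(name, tokens)


if __name__ == "__main__":
    share = raw_share("Wikipedia")
    factor = upweight_factor("Wikipedia", WIKIPEDIA_WEIGHT)
    print(f"Wikipedia by size: {share:.1%} of tokens")
    print(f"Wikipedia in the training mix: {WIKIPEDIA_WEIGHT:.0%}")
    print(f"Upweighting: about {factor:.1f}x")
```

Under these assumptions Wikipedia is well under 1% of the available tokens but 3% of what the model sees, so it is sampled roughly five times more often than its size alone would suggest.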
I got the impression from the tech editor whose article I read that there is a big difference in how ChatGPT works when used together with Bing. Jimmy Wales here describes my own experience of using ChatGPT alone: if you ask "who is NN", you get unusable rubbish back.
But when the tech editor asked Bing-ChatGPT "who is Linus Larsson" (his own name), he got a very good result, with content that exists only in the article about him on svwp (there is no article about him on enwp). I cannot interpret that in any other way than that this version looked up Wikipedia when asked.
But I am not a tech wizard, so I can be wrong.
Anders
https://www.dn.se/kultur/linus-larsson-microsofts-ai-gjorde-slut-med-mig-pa-...
(the article is in Swedish; the heading says "Microsoft's AI ended our relationship on Valentine's Day")
I also like that the AI is insulting, giving as an answer "are you a fool or only stupid?" It seems it needs to be trained on our UCoC.
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross ragesoss+wikipedia@gmail.com, wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being CommonCrawl, a smaller subset of scraped websites based on upvoted reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
Thanks, Sage. Facebook’s recently released LLaMA also discloses some of its training sources, it turns out, with similar weighting for Wikipedia: only 4.5% of training text, but more heavily weighted than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) luis@lu.is wrote:
Thanks, Sage. Facebook’s recently-released LLaMa also shares some of their training sources, it turns out, with similar weighting for Wikipedia - only 4.5% of training text, but more heavily weighted than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats are undercounting, since the top source (CommonCrawl) itself includes Wikipedia, as its third-largest source.
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
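As a rough illustration of the undercounting point, the sketch below adds the Wikipedia text that arrives indirectly through the crawl to the directly listed share. The 4.5% figure is the one quoted in this thread; the 67% CommonCrawl share is what I recall from the LLaMA paper and should be verified; the fraction of the crawl that is Wikipedia is a purely hypothetical placeholder, since nobody in this thread gives one. The sketch also ignores any filtering or deduplication applied to the crawl.

```python
def effective_wikipedia_share(direct_share, crawl_share, wiki_fraction_of_crawl):
    """Direct Wikipedia share plus the Wikipedia text arriving via the crawl."""
    for value in (direct_share, crawl_share, wiki_fraction_of_crawl):
        if not 0.0 <= value <= 1.0:
            raise ValueError("shares must be fractions between 0 and 1")
    return direct_share + crawl_share * wiki_fraction_of_crawl


if __name__ == "__main__":
    direct = 0.045        # LLaMA figure quoted in this thread
    crawl = 0.67          # assumption: CommonCrawl share as recalled from the LLaMA paper
    wiki_in_crawl = 0.01  # hypothetical placeholder, NOT a measured value
    total = effective_wikipedia_share(direct, crawl, wiki_in_crawl)
    print(f"Effective Wikipedia share: {total:.2%}")
```

The point is only the shape of the calculation: whatever the real fraction is, the effective share can only be equal to or larger than the directly listed one.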