Hi,
The development of open-source large language models is moving forward. GPT-4 was released, and it reportedly passed the bar exam and hired humans to solve captchas that were too complex for it. Meanwhile, development on the open-source and hacking side has been fast, and it seems that all the pieces are now in place for running LLM models on personal hardware (and in web browsers). The biggest missing piece is fine-tuning of open-source models such as NeoX for the English language. For multilingual and multimodal use (for example images+text), a suitable model is also still needed.
So this is a link dump of things relevant to creating an open-source LLM model and service, and also a recap of where the hacker community is now.
1.) Creation of an initial unaligned model.
- Possible models (a rough loading sketch follows this list):
  - 20B Neo(X) https://github.com/EleutherAI/gpt-neox by EleutherAI (Apache 2.0)
  - Fairseq Dense https://huggingface.co/KoboldAI/fairseq-dense-13B by Facebook (MIT license)
  - LLaMA https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ by Facebook (custom license; leaked, research use only)
  - BLOOM https://huggingface.co/bigscience/bloom by BigScience (custom license https://huggingface.co/spaces/bigscience/license; open, non-commercial)
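For reference, loading one of these base models for plain inference is already only a few lines with the Hugging Face transformers library. This is a minimal sketch, not part of the projects above; it assumes the transformers and accelerate packages and the public Hub id "EleutherAI/gpt-neox-20b", and it still needs tens of GB of memory even in half precision.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"  # Apache 2.0 licensed base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision roughly halves memory use
    device_map="auto",           # spread layers over available GPUs/CPU (needs accelerate)
)

inputs = tokenizer("Wikipedia is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))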
2.) Fine-tuning or alignment
- Example: Stanford Alpaca is LLaMA fine-tuned on OpenAI-generated instruction data
  - Alpaca: A Strong, Replicable Instruction-Following Model https://crfm.stanford.edu/2023/03/13/alpaca.html
  - Train and run Stanford Alpaca on your own machine https://replicate.com/blog/replicate-alpaca
  - GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning https://github.com/tloen/alpaca-lora (a rough LoRA sketch follows this list)
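To make the LoRA idea concrete: the trick is to freeze the base model and train only small low-rank adapter matrices, which is what makes instruct-tuning feasible on a single GPU. The following is a rough sketch of that recipe, not the actual Alpaca-LoRA training script; the dataset id "tatsu-lab/alpaca", the base model choice and all hyperparameters are assumptions for illustration.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "EleutherAI/gpt-neox-20b"              # any causal LM works in principle
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token     # needed for padding during batching
model = AutoModelForCausalLM.from_pretrained(base)

# Freeze the base weights and attach small rank-8 adapters; only those are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

ds = load_dataset("tatsu-lab/alpaca", split="train")   # ~52k instruction/response pairs
def tokenize(ex):
    return tokenizer(ex["instruction"] + "\n" + ex["output"],
                     truncation=True, max_length=512)
ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()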
3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements (a toy quantization sketch follows the links below)
- Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp https://til.simonwillison.net/llms/llama-7b-m2
- GitHub: bloomz.cpp https://github.com/NouamaneTazi/bloomz.cpp & llama.cpp https://github.com/ggerganov/llama.cpp (C++-only versions)
- Int-4 LLaMa is not enough - Int-3 and beyond https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and
- How is LLaMa.cpp possible? https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible
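The point of 8/4/3-bit quantization is simply that each weight is stored in far fewer bits, with a per-block scale factor used to map back to floats at inference time. Here is a toy illustration of block-wise 4-bit quantization and the memory arithmetic behind it; this is my own sketch, not llama.cpp's actual code, and the block size of 32 is just an assumption.

import numpy as np

def quantize_q4(weights, block_size=32):
    """Symmetric 4-bit quantization with one float scale per block."""
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # would be packed to 4 bits on disk
    return q, scale

def dequantize_q4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096 * 4096).astype(np.float32)           # one 4096x4096 layer
q, s = quantize_q4(w)
fp32_bytes = w.nbytes                    # 4 bytes per weight
q4_bytes = q.size // 2 + s.nbytes        # 0.5 byte per weight + one fp32 scale per 32 weights
print(f"fp32: {fp32_bytes/1e6:.1f} MB   ~int4: {q4_bytes/1e6:.1f} MB")
print("max abs rounding error:", float(np.abs(dequantize_q4(q, s) - w).max()))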
4.) Easy-to-use interfaces
- Transformers.js https://xenova.github.io/transformers.js/ (WebAssembly library for running LLM models in the browser)
- Dalai https://github.com/cocktailpeanut/dalai (run LLaMA and Alpaca on your own computer as a Node.js web service)
- web-stable-diffusion https://github.com/mlc-ai/web-stable-diffusion (Stable Diffusion image generation in the browser)
Br, -- Kimmo Virtanen
On Mon, Mar 6, 2023 at 6:50 AM Steven Walling steven.walling@gmail.com wrote:
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) luis@lu.is wrote:
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being CommonCrawl, a smaller subset of scraped websites based on upvoted reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
Thanks, Sage. Facebook’s recently-released LLaMA also shares some of its training sources, it turns out, with similar weighting for Wikipedia - only 4.5% of training text, but more heavily weighted than most other sources: https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats are undercounting, since the top source (CommonCrawl) itself includes Wikipedia as its third-largest source.
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
This is an important development for editors to be aware of - we're going to have to be increasingly on the lookout for sources using ML-generated bullshit. Here are two instances I'm aware of this week:
https://www.thenation.com/article/culture/internet-archive-publishers-lawsui...
In late February, Tyler Cowen, a libertarian economics professor at George Mason University, published a blog post titled “Who was the most important critic of the printing press in the 17th century?” https://web.archive.org/web/20230305055906/https://marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html. Cowen’s post contended that the polymath and statesman Francis Bacon was an “important” critic of the printing press; unfortunately, the post contains long, fake quotes attributed to Bacon’s *The Advancement of Learning* (1605), complete with false chapter and section numbers. Tech writer Mathew Ingram drew attention to the fabrications a few days later https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/, noting that Cowen has been writing approvingly about the AI chatbot ChatGPT https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html for some time now; several commenters on Cowen’s post assumed the fake quotes must be the handiwork of ChatGPT. (Cowen did not reply to e-mailed questions regarding the post by press time, and later removed the post entirely, with no explanation whatsoever. However, a copy remains at the Internet Archive’s Wayback Machine.)
https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-artic... An article claiming to identify misinformation in an Oscar-winning documentary about imprisoned Russian dissident Alexei Navalny is itself full of misinformation, thanks to the author using AI. Investigative news outlet *The Grayzone* recently published an article https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/ that included AI-generated text as a source for its information. The piece http://web.archive.org/web/20230314131551/https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/, “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy Komisar, included hyperlinks to PDFs http://web.archive.org/web/20230314121144/https://www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf uploaded to the author’s personal website that appear to be screenshots of conversations she had with ChatSonic, a free generative AI chatbot that advertises itself as a ChatGPT alternative that can “write factual trending content” using Google search results.
That said, I don't think this is anything to be too stressed about; the Grayzone is already a deprecated source and blogs like Marginal Revolution are treated with caution, though Cowen has sufficient credentials to be treated as a reliable expert.
“Cowen has sufficient credentials to be treated as a reliable expert”
Maybe not for much longer.
Cheers, P.
Hello,
I would like to point to "Copilot" in the Edge browser as potentially relevant to Wikipedia [1][2].
It is foreseeable that end-users will be able to open sidebars in their Web browsers and chat with large language models about the contents of specific Web documents, e.g., encyclopedia articles. Web browsers can also expose task context: the documents or articles in users' current tabs, and potentially users' scroll positions and their selections or highlights of content.
I, for one, am thinking about how Web standards, e.g., Web schema, can be of use for amplifying these features and capabilities for end-users.
Best regards, Adam Sobieski
[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-c... [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
On Fri, Mar 17, 2023 at 6:03 PM The Cunctator cunctator@gmail.com wrote:
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
I completely agree. It makes me pretty angry that Wikipedians have spent millions of volunteer hours creating content to educate and inform people as accurately as we can, and it's being used to generate convincing but often wildly misleading bullshit.
The ground truth on what AI-generated content is (from a copyright standpoint) and where authorship/ownership lies seems to be rapidly evolving. The U.S. Copyright Office recently refused to register copyrights for some AI-generated works, seemingly on the principle that they lack human authorship / that prompting a model is essentially contracting work for hire from an artist or writer.
IANAL of course, but to me this implies that the *egregious* lack of attribution in models that rely substantially on Wikipedia violates the attribution requirements of the CC licenses. Just like the Foundation took a principled position in testing the legality of warrantless mass surveillance, I would love to see us push back on the notion that it's legal or moral for OpenAI or any of these other companies to take our content and use it to flood the Internet with machine-generated word diarrhea.
In the European Union there is a "Regulation Laying Down Harmonized Rules on Artificial Intelligence", aka the AI Act, and an "AI Liability Directive" (AILD) in the pipeline. The AI Act text is, as far as I know, currently in the finalization phase. * https://www.insideprivacy.com/artificial-intelligence/eu-ai-policy-and-regul... * https://www.dentons.com/en/insights/articles/2023/february/1/regulating-ai-i...
An interesting note is that there are no Wikipedia articles about these, nor even Wikidata items.
Br, -- Kimmo Virtanen, Zache
On Sat, Mar 18, 2023 at 4:06 AM Steven Walling steven.walling@gmail.com wrote:
On Fri, Mar 17, 2023 at 6:03 PM The Cunctator cunctator@gmail.com wrote:
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
I completely agree. It makes me pretty angry that Wikipedians have spent millions of volunteer hours creating content to educate and inform people as accurately as we can, and it's being used to generate convincing but often wildly misleading bullshit.
The ground truth on what generated AI content is (from a copyright position) and where authorship/ownership lies seems to be rapidly evolving. The U.S. Copyright Office recently refused to issue copyrights for some AI-generated works, seemingly on the principle that they lack human authorship / are essential to contracting work for hire from an artist or writer.
IANAL of course, but to me this implies that responsibility for the *egregious* lack of attribution in models that rely substantially on Wikipedia is violating the Attribution requirements of CC licenses. Just like the Foundation took a principled position in testing the legality of warrantless mass surveillance, I would love to see us push back on the notion that it's legal or moral for OpenAI or any of these other companies to take our content and use it to flood the Internet with machine-generated word diarrhea.
On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Hello,
I would like to indicate "Copilot" in the Edge browser as being potentially relevant to Wikipedia [1][2].
It is foreseeable that end-users will be able to open sidebars in their Web browsers and subsequently chat with large language models about the contents of specific Web documents, e.g., encyclopedia articles. Using Web browsers, there can be task contexts available, including the documents or articles in users' current tabs, potentially including users' scroll positions, potentially including users' selections or highlightings of content.
I, for one, am thinking about how Web standards, e.g., Web schema, can be of use for amplifying these features and capabilities for end-users.
Best regards, Adam Sobieski
[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-c... [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
Hi all
I agree that the AI creators should attribute Wikipedia as their source. But on the other hand, when the result is incorrect or otherwise flawed, we might actually be glad that they do not attribute it. The issue is how to convince readers to come to the source (our projects) rather than using in-between steps (AI or otherwise).
Matej
On Fri, Mar 17, 2023 at 7:05 PM Steven Walling steven.walling@gmail.com wrote:
IANAL of course, but to me this implies that responsibility for the *egregious* lack of attribution in models that rely substantially on Wikipedia is violating the Attribution requirements of CC licenses.
Morally, I agree that companies like OpenAI would do well to recognize and nurture the sources they rely upon in training their models. Especially as the web becomes polluted with low quality AI-generated content, it would seem in everybody's best interest to sustain the communities and services that make and keep high quality information available. Not just Wikimedia, but also the Internet Archive, open access journals and preprint servers, etc.
Legally, it seems a lot murkier. OpenAI in particular does not distribute any of its GPT models. You can feed them prompts by various means, and get responses back. Do those responses plagiarize Wikipedia?
With image-generating models like Stable Diffusion, it's been found that the models sometimes generate output nearly indistinguishable from source material [1]. I don't know if similar studies have been undertaken for text-generating models yet. You can certainly ask GPT-4 to generate something that looks like a Wikipedia article -- here are example results for generating a random Wikipedia article:
Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film) GPT-4 run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1 (cut off at the ChatGPT generation limit) GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2 GPT-4 run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3
It imitates the form of a Wikipedia article & mixes up / makes up assertions, but I don't know that any of its generations would meet the standard of infringing on the Wikipedia article's copyright. IANAL either, and as you say, the legal landscape is evolving rapidly.
Warmly, Erik
[1] https://arstechnica.com/information-technology/2023/02/researchers-extract-t...
On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller eloquence@gmail.com wrote:
The whole thing is definitely a hot mess. If the remixing/transformation by the model is a derivative work, it means OpenAI is potentially violating the ShareAlike requirement by not distributing the text output as CC. But on the other hand, the nature of the model means they’re combining CC and non-free works freely / at random, unless a court would interpret whatever % of training data comes from us as the direct degree to which the model output is derived from Wikipedia. Either way it’s going to be up to some legal representation of copyright holders to test the boundaries here.
Or, maybe just require an open disclosure of where the bot pulled from and how much, instead of having it be a black box? "Text in this response derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 'SomeOtherThing', 10%...".
Current (i.e. ChatGPT) systems don't work that way, as the source of information is lost when the information is encoded into the model. The model is just a network of probabilities, and it is highly compressed compared to the original data. We are missing the point if we believe it is a copy of the source data rather than a tool for interacting with information using natural language.
Soon, tools will be able to retrieve data from external sources and write answers based on them [1]. For example, in the Wikipedia context, this would mean using a search engine to find information automatically, summarizing the findings, and generating references for the results. Or vice versa: retrieving information from Wikipedia or Wikidata. Then we will get source data too, but the LLM's internal reasoning will still be fuzzy. (A rough sketch of this retrieve-then-generate flow follows below.)
[1] https://interconnected.org/home/2023/03/16/singularity
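To make the retrieve-then-generate flow concrete, here is a minimal, hypothetical sketch in Python; search_wikipedia and generate are stand-in stubs (not any real API), and the structure only shows how sources can be carried through to the final answer and its references:

    from dataclasses import dataclass

    @dataclass
    class Passage:
        title: str
        url: str
        text: str

    def search_wikipedia(query: str) -> list[Passage]:
        # Stand-in for a real search step (e.g. a search engine or the MediaWiki API).
        return [Passage("Example", "https://en.wikipedia.org/wiki/Example", "Example text ...")]

    def generate(prompt: str) -> str:
        # Stand-in for a call to a language model.
        return "A short answer grounded in the passages above."

    def answer_with_references(question: str) -> str:
        passages = search_wikipedia(question)
        context = "\n\n".join(f"[{i+1}] {p.title}\n{p.text}" for i, p in enumerate(passages))
        answer = generate(f"Answer using only the sources below.\n\n{context}\n\nQuestion: {question}")
        refs = "\n".join(f"[{i+1}] {p.url}" for i, p in enumerate(passages))
        return f"{answer}\n\nReferences:\n{refs}"

    print(answer_with_references("What is an example?"))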
Br, -- Kimmo Virtanen
FYI, there's an open letter requesting a 6-month pause on AI development https://futureoflife.org/open-letter/pause-giant-ai-experiments/, with reasonable arguments (in my opinion) and signed by several big names too. The basic rationale, as I understand it, is that similar to human cloning, human germline modification, gain-of-function research and other world-changing and potentially dangerous technologies, there should be some kind of procedure to ensure that safety keeps pace with development, which the current AI race is not allowing.
On Wed, Mar 29, 2023 at 1:04 PM Felipe Schenone schenonef@gmail.com wrote:
FYI, there's an open letter requesting a 6-month pause on AI development, [ https://futureoflife.org/open-letter/pause-giant-ai-experiments/ ] with reasonable arguments (in my opinion) and signed by several big names too.
First, I want to point out that a "pause for at least 6 months the training of AI systems more powerful than GPT-4" doesn't involve halting research on how to prevent existing models from hallucinating, how to cause them to summarize and cite reliable sources verifiably and neutrally, how to allow them to be easily and inexpensively edited for updates and corrections, or how to benchmark the performance of competing approaches to using them for editing tasks, as I've proposed the Foundation should do.
Secondly, I doubt such a pause on training larger models will do anything to address any of the largest risks of LLMs, including any of the risks which have been articulated as a threat to the projects, as far as I know. Existing models a couple generations behind the bleeding edge are more than good enough to, for example, run an organic-appearing campaign to bias Wikipedia articles in pernicious ways for pay, or even as a dedicated individual's personal project with a budget no larger than that of many common hobbies.
I suggest that the nominally non-free restrictions on use of the BLOOM RAIL license are a superior approach to addressing the immediate risks compared to a mere six month moratorium on larger models, especially if those restrictions were codified into law. The following is from https://openfuture.pubpub.org/pub/notes-on-open-ai
"The authors of the RAIL license acknowledge that the license does not meet the Open Source Initiative definition of open code licenses (and it does not meet the Open Definition either). In related news, the newly launched Can’t Be Evil licenses also challenged established open licensing models, while seeking to uphold the spirit of open sharing.... Traditionally, debates over what constitutes an open license were related to normative debates about ensuring user freedoms. Authors of the RAIL license rightly point out that these need to be balanced today with care for responsible uses."
-LW