This is an important development for editors to be aware of - we're going
to have to be increasingly on the lookout for sources using ML-generated
bullshit. Here are two instances I'm aware of this week:
https://www.thenation.com/article/culture/internet-archive-publishers-lawsu…
In late February, Tyler Cowen, a libertarian economics
professor at George
Mason University, published a blog post titled
<https://web.archive.org/web/20230305055906/https:/marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html>,
“Who was the most important critic of the printing press in the 17th
century?” Cowen’s post contended that the polymath and statesman Francis
Bacon was an “important” critic of the printing press; unfortunately, the
post contains long, fake quotes attributed to Bacon’s *The Advancement of
Learning* (1605), complete with false chapter and section numbers.
Tech writer Mathew Ingram drew attention to the fabrications a few days
later
<https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/>,
noting that Cowen has been writing approvingly about the AI chatbot
ChatGPT
<https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html>
for
some time now; several commenters on Cowen’s post assumed the fake quotes
must be the handiwork of ChatGPT. (Cowen did not reply to e-mailed
questions regarding the post by press time, and later removed the post
entirely, with no explanation. However, a copy remains at the
Internet Archive’s Wayback Machine.)
https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-arti…
An article claiming to identify misinformation in an Oscar-winning
documentary about imprisoned Russian dissident Alexei Navalny is itself
full of misinformation, thanks to the author using AI.
Investigative news outlet *The Grayzone* recently published an article
<https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>
that included AI-generated text as a source for its information. The
piece
<http://web.archive.org/web/20230314131551/https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>,
“Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy
Komisar, included hyperlinks to PDFs
<http://web.archive.org/web/20230314121144/https://www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf>
uploaded to the author’s personal website that appear to be screenshots
of conversations she had with ChatSonic, a free generative AI chatbot that
advertises itself as a ChatGPT alternative that can “write factual trending
content” using Google search results.
That said, I don't think this is anything to be too stressed about; The
Grayzone is already a deprecated source, and blogs like Marginal Revolution
are treated with caution, though Cowen has sufficient credentials to be
treated as a reliable expert.
On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen <kimmo.virtanen(a)wikimedia.fi>
wrote:
Hi,
The development of open-source large language models is moving forward.
GPT-4 was released; reportedly it passed the bar exam and, in a safety
test, hired a human to solve a CAPTCHA that was too complex for it.
However, development on the open-source and hacking side has been pretty
fast, and it seems that all the pieces now exist for running LLMs on
personal hardware (and in web browsers). The biggest missing piece is
fine-tuning of open-source models such as NeoX for English. For
multilingual and multimodal use (for example, images + text), suitable
models are also needed.
So this is a link dump of things relevant to building an open-source LLM
model and service, and a recap of where the hacker community is now.
1.) Creation of an initial unaligned model.
- Possible models (a minimal loading sketch in Python follows this list):
- 20B GPT-NeoX <https://github.com/EleutherAI/gpt-neox> by EleutherAI
(Apache 2.0)
- Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by
Facebook (MIT license)
- LLaMA
<https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by
Facebook (custom license, research use only; weights have leaked)
- BLOOM <https://huggingface.co/bigscience/bloom> by BigScience (custom
license <https://huggingface.co/spaces/bigscience/license>: open,
non-commercial)
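For concreteness, here is a minimal sketch of loading one of these base
models with the Hugging Face transformers library (my addition, not part
of the list above; it assumes the transformers, torch, and accelerate
packages, and the 20B checkpoint needs roughly 40 GB of memory even in
float16):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "EleutherAI/gpt-neox-20b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # halves memory use vs. float32
        device_map="auto",          # spread layers across available devices
    )

    prompt = "Wikipedia is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0]))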
2.) Fine-tuning or alignment
- Example: Stanford Alpaca is LLaMA fine-tuned on instruction-following
data generated with an OpenAI model (a rough LoRA sketch follows these
links)
- Alpaca: A Strong, Replicable Instruction-Following Model
<https://crfm.stanford.edu/2023/03/13/alpaca.html>
- Train and run Stanford Alpaca on your own machine
<https://replicate.com/blog/replicate-alpaca>
- Github: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
<https://github.com/tloen/alpaca-lora>
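Roughly, what Alpaca-LoRA does is freeze the LLaMA weights and train small
low-rank adapter matrices on top. A condensed sketch of that pattern,
assuming the peft and transformers libraries (the model path is a
placeholder; dataset and training loop omitted):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Placeholder path - point this at a local LLaMA 7B checkpoint.
    base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")

    config = LoraConfig(
        r=8,                                  # rank of the low-rank update
        lora_alpha=16,                        # scaling factor for the update
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% trainable

Only the adapter weights are trained, which is why this kind of
fine-tuning fits on a single consumer GPU.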
3.) 8-, 4-, and 3-bit quantization of models for reduced hardware
requirements (a back-of-the-envelope sketch follows these links)
- Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
<https://til.simonwillison.net/llms/llama-7b-m2>
- Github: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> &
llama.cpp <https://github.com/ggerganov/llama.cpp> (C++ only versions)
- Int-4 LLaMa is not enough - Int-3 and beyond
<https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
- How is LLaMa.cpp possible?
<https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
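The memory arithmetic behind these quantization links is simple enough to
sketch (weights only; real quantized files run somewhat larger, since some
tensors stay in higher precision):

    def weight_gb(n_params: float, bits: int) -> float:
        # Gigabytes needed to store the weights alone at a given bit width.
        return n_params * bits / 8 / 1e9

    for bits in (16, 8, 4, 3):
        print(f"LLaMA 7B at {bits:2d} bits: ~{weight_gb(7e9, bits):.1f} GB")
    # 16 bits: ~14 GB; 8: ~7 GB; 4: ~3.5 GB; 3: ~2.6 GB - which is why
    # a 4-bit 7B model fits comfortably in laptop RAM.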
4.) Easy-to-use interfaces (a Python usage sketch follows these links)
- Transformers.js <https://xenova.github.io/transformers.js/> (WebAssembly
libraries to run LLM models in the browser)
- Dalai <https://github.com/cocktailpeanut/dalai> (run LLaMA and
Alpaca on your own computer as a Node.js web service)
- web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (Stable
Diffusion image generation in the browser)
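The interfaces above are JavaScript-based, but the same quantized models
can be driven from Python as well. A hedged sketch using the
llama-cpp-python bindings to llama.cpp (the file path is a placeholder,
and the package's API has shifted between versions):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder path to a locally converted 4-bit model file.
    llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
    result = llm("Q: What is the capital of Finland? A:", max_tokens=16)
    print(result["choices"][0]["text"])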
Br,
-- Kimmo Virtanen
On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.walling(a)gmail.com>
wrote:
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis(a)lu.is> wrote:
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <
ragesoss+wikipedia(a)gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that
trained GPT-3 (and hence ChatGPT):
https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others
being CommonCrawl, a smaller subset of scraped websites based on
upvoted Reddit links, and two unrevealed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other dataset. With the extra weighting, they say Wikipedia accounts
for 3% of the total training.
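To make that weighting concrete: a source sampled more often than its
share of the raw data effectively counts for more. A toy sketch in Python
- the dataset sizes here are approximations of the paper's Table 2.2, not
exact figures:

    # (tokens in dataset, fraction of training batches drawn from it)
    datasets = {
        "CommonCrawl": (410e9, 0.60),
        "Wikipedia":   (3e9,   0.03),
    }
    total_training_tokens = 300e9

    for name, (size, weight) in datasets.items():
        seen = weight * total_training_tokens
        print(f"{name}: {seen / 1e9:.0f}B tokens seen = "
              f"{seen / size:.1f} epochs")
    # Wikipedia's ~3B tokens get repeated roughly 3x during training,
    # while CommonCrawl is so large it is never even seen once in full.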
Thanks, Sage. Facebook’s recently released LLaMA also shares some of its
training sources, it turns out, with similar weighting for Wikipedia -
only 4.5% of training text, but weighted more heavily than most other
sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats are undercounting, since the top source (CommonCrawl)
itself includes Wikipedia as its third-largest source.
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l(a)lists.wikimedia.org, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org…
To unsubscribe send an email to wikimedia-l-leave(a)lists.wikimedia.org