Hello,
I would like to point to "Copilot" in the Edge browser as being potentially
relevant to Wikipedia [1][2].
It is foreseeable that end-users will be able to open sidebars in their Web browsers and
chat with large language models about the contents of specific Web documents,
e.g., encyclopedia articles. Web browsers can make task context available,
including the documents or articles in users' current tabs, and potentially users'
scroll positions and their selections or highlights of content.
I, for one, am thinking about how Web standards, e.g., Web schema, could be used to
amplify these features and capabilities for end-users. A rough sketch of the idea follows below.
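As a minimal sketch, assuming a Python back end to which the browser hands the page URL, the user's selected text, and the article HTML (the prompt format and field names here are hypothetical), a sidebar assistant might assemble task context like this:

import json
from bs4 import BeautifulSoup

def extract_schema_metadata(html: str) -> list:
    """Collect schema.org JSON-LD blocks embedded in the article HTML."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue
    return blocks

def build_prompt(url: str, selection: str, html: str) -> str:
    """Combine task context into a prompt for a chat model (hypothetical format)."""
    metadata = extract_schema_metadata(html)
    return (
        f"The user is reading {url}.\n"
        f"Structured (schema.org) metadata: {json.dumps(metadata)[:500]}\n"
        f"The user has highlighted: {selection!r}\n"
        "Answer questions about the highlighted passage."
    )

The same context could, in principle, be gathered entirely client-side by a browser extension.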
Best regards,
Adam Sobieski
[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-…
[2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
________________________________
From: Kimmo Virtanen <kimmo.virtanen(a)wikimedia.fi>
Sent: Friday, March 17, 2023 8:17 AM
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
Subject: [Wikimedia-l] Re: Bing-ChatGPT
Hi,
The development of open-source large language models is moving forward. GPT-4 was
released, and it reportedly passed the bar exam and tried to hire humans to solve
CAPTCHAs that were too complex for it. Meanwhile, development on the open-source and
hacking side has been quite fast, and it seems that all the pieces are in place for
running LLM models on personal hardware (and in web browsers). The biggest missing piece
is fine-tuning of open-source models such as NeoX for English. Multilingual and
multimodal (for example, images + text) models are also needed.
So this is a kind of link dump of things relevant to creating an open-source LLM model
and service, and a recap of where the hacker community is now.
1.) Creation of an initial unaligned model.
  * Possible models
    * GPT-NeoX-20B<https://github.com/EleutherAI/gpt-neox> by EleutherAI (Apache 2.0)
    * Fairseq Dense<https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook (MIT license)
    * LLaMA<https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by Facebook (custom license; research use only, weights leaked)
    * BLOOM<https://huggingface.co/bigscience/bloom> by BigScience (custom license<https://huggingface.co/spaces/bigscience/license>; open, non-commercial)
2.) Fine-tuning or alignment (a rough code sketch follows after this list)
  * Example: Stanford Alpaca is LLaMA fine-tuned on ChatGPT-style instruction data
    * Alpaca: A Strong, Replicable Instruction-Following Model<https://crfm.stanford.edu/2023/03/13/alpaca.html>
    * Train and run Stanford Alpaca on your own machine<https://replicate.com/blog/replicate-alpaca>
    * GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning<https://github.com/tloen/alpaca-lora>
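As a rough sketch of what the fine-tuning step can look like, assuming the Hugging Face transformers + peft stack used by alpaca-lora (the checkpoint path below is a placeholder, and the training loop is omitted):

# Minimal LoRA fine-tuning setup; only the low-rank adapter weights are trainable.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "path/to/llama-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in alpaca-lora
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full model

Training then proceeds with an ordinary causal-language-modeling loop over the instruction data.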
3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements (a toy numerical sketch follows after this list)
  * Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp<https://til.simonwillison.net/llms/llama-7b-m2>
  * GitHub: bloomz.cpp<https://github.com/NouamaneTazi/bloomz.cpp> & llama.cpp<https://github.com/ggerganov/llama.cpp> (C++-only versions)
  * Int-4 LLaMa is not enough - Int-3 and beyond<https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-…
  * How is LLaMa.cpp possible?<https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible…
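To make the idea concrete, here is a toy illustration of 4-bit affine quantization of a single weight vector (llama.cpp and bloomz.cpp actually use block-wise formats with per-block scales, so this only shows the basic arithmetic):

# Toy 4-bit affine quantization of one fp32 weight vector.
import numpy as np

w = np.random.randn(8).astype(np.float32)            # original fp32 weights
qmin, qmax = 0, 15                                    # 4-bit unsigned range
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = qmin - w.min() / scale
q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.uint8)
w_hat = (q.astype(np.float32) - zero_point) * scale   # dequantized approximation
print("max abs error:", np.abs(w - w_hat).max())      # error shrinks with more bits

Storing q (4 bits per weight) plus a scale and zero point per block is what shrinks a roughly 13 GB fp16 7B model to around 4 GB at int-4.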
4.) Easy-to-use interfaces (a Python-side sketch follows after this list)
  * Transformers.js<https://xenova.github.io/transformers.js/> (WebAssembly library for running LLM models in the browser)
  * Dalai<https://github.com/cocktailpeanut/dalai> (run LLaMA and Alpaca on your own computer as a Node.js web service)
  * web-stable-diffusion<https://github.com/mlc-ai/web-stable-diffusion> (Stable Diffusion image generation in the browser)
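The interfaces above are browser- or Node.js-based; as a related sketch on the Python side, assuming the llama-cpp-python bindings to llama.cpp (not listed above; the model path is a placeholder and the exact call signature is an assumption):

# Run a 4-bit quantized LLaMA-family model locally via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path
out = llm(
    "Q: What is Wikipedia? A:",
    max_tokens=64,
    temperature=0.7,
    stop=["Q:"],
)
print(out["choices"][0]["text"])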
Br,
-- Kimmo Virtanen
On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.walling@gmail.com> wrote:
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <luis@lu.is> wrote:
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that
trained GPT-3 (and hence ChatGPT):
https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others
being CommonCrawl, a smaller subset of scraped websites based on
upvoted reddit links, and two unrevealed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other dataset. With the extra weighting, they say Wikipedia accounts
for 3% of the total training.
Thanks, Sage. Facebook's recently released LLaMA also shares some of its training
sources, it turns out, with similar weighting for Wikipedia: only 4.5% of training text,
but more heavily weighted than most other sources:
https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats are undercounting, since the top source (CommonCrawl) also itself includes
Wikipedia as its third largest source.
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org…
To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org