Hi,
The development of open-source large language models is moving forward. GPT-4 was released, and it reportedly passed the bar exam and hired humans to solve captchas that were too complex for it. Meanwhile, development on the open-source and hacking side has been fast, and it seems that all the pieces are now in place for running LLM models on personal hardware (and in web browsers). The biggest missing piece is fine-tuning of open-source models such as NeoX for the English language. For multilingual and multimodal use (for example images+text), a suitable model is also still needed.
So this is a link dump of things relevant to creating an open-source LLM model and service, and also a recap of where the hacker community is now.
1.) Creation of an initial unaligned model.
- Possible models (a rough loading sketch follows this list):
  - 20B Neo(X) https://github.com/EleutherAI/gpt-neox by EleutherAI (Apache 2.0)
  - Fairseq Dense https://huggingface.co/KoboldAI/fairseq-dense-13B by Facebook (MIT license)
  - LLaMA https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ by Facebook (custom license; leaked, research use only)
  - BLOOM https://huggingface.co/bigscience/bloom by BigScience (custom license https://huggingface.co/spaces/bigscience/license; open, non-commercial)
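For reference, loading one of these base models for plain inference is already only a few lines with the Hugging Face transformers library. This is a minimal sketch, not part of the projects above; it assumes the transformers and accelerate packages and the public Hub id "EleutherAI/gpt-neox-20b", and it still needs tens of GB of memory even in half precision.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"  # Apache 2.0 licensed base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision roughly halves memory use
    device_map="auto",           # spread layers over available GPUs/CPU (needs accelerate)
)

inputs = tokenizer("Wikipedia is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))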
2.) Fine-tuning or alignment
- Example: Stanford Alpaca is LLaMA fine-tuned on OpenAI-generated instruction data
  - Alpaca: A Strong, Replicable Instruction-Following Model https://crfm.stanford.edu/2023/03/13/alpaca.html
  - Train and run Stanford Alpaca on your own machine https://replicate.com/blog/replicate-alpaca
  - GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning https://github.com/tloen/alpaca-lora (a rough LoRA sketch follows this list)
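To make the LoRA idea concrete: the trick is to freeze the base model and train only small low-rank adapter matrices, which is what makes instruct-tuning feasible on a single GPU. The following is a rough sketch of that recipe, not the actual Alpaca-LoRA training script; the dataset id "tatsu-lab/alpaca", the base model choice and all hyperparameters are assumptions for illustration.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "EleutherAI/gpt-neox-20b"              # any causal LM works in principle
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token     # needed for padding during batching
model = AutoModelForCausalLM.from_pretrained(base)

# Freeze the base weights and attach small rank-8 adapters; only those are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

ds = load_dataset("tatsu-lab/alpaca", split="train")   # ~52k instruction/response pairs
def tokenize(ex):
    return tokenizer(ex["instruction"] + "\n" + ex["output"],
                     truncation=True, max_length=512)
ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()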
3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements (a toy quantization sketch follows the links below)
- Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp https://til.simonwillison.net/llms/llama-7b-m2
- GitHub: bloomz.cpp https://github.com/NouamaneTazi/bloomz.cpp & llama.cpp https://github.com/ggerganov/llama.cpp (C++-only versions)
- Int-4 LLaMa is not enough - Int-3 and beyond https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and
- How is LLaMa.cpp possible? https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible
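The point of 8/4/3-bit quantization is simply that each weight is stored in far fewer bits, with a per-block scale factor used to map back to floats at inference time. Here is a toy illustration of block-wise 4-bit quantization and the memory arithmetic behind it; this is my own sketch, not llama.cpp's actual code, and the block size of 32 is just an assumption.

import numpy as np

def quantize_q4(weights, block_size=32):
    """Symmetric 4-bit quantization with one float scale per block."""
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # would be packed to 4 bits on disk
    return q, scale

def dequantize_q4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096 * 4096).astype(np.float32)           # one 4096x4096 layer
q, s = quantize_q4(w)
fp32_bytes = w.nbytes                    # 4 bytes per weight
q4_bytes = q.size // 2 + s.nbytes        # 0.5 byte per weight + one fp32 scale per 32 weights
print(f"fp32: {fp32_bytes/1e6:.1f} MB   ~int4: {q4_bytes/1e6:.1f} MB")
print("max abs rounding error:", float(np.abs(dequantize_q4(q, s) - w).max()))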
4.) Easy-to-use interfaces
- Transformers.js https://xenova.github.io/transformers.js/ (WebAssembly library for running LLM models in the browser)
- Dalai https://github.com/cocktailpeanut/dalai (run LLaMA and Alpaca on your own computer as a Node.js web service)
- web-stable-diffusion https://github.com/mlc-ai/web-stable-diffusion (Stable Diffusion image generation in the browser)
Br, -- Kimmo Virtanen
On Mon, Mar 6, 2023 at 6:50 AM Steven Walling steven.walling@gmail.com wrote:
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) luis@lu.is wrote:
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipedia@gmail.com> wrote:
Luis,
OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
See section 2.2, starting on page 8 of the PDF.
The full text of English Wikipedia is one of five sources, the others being CommonCrawl, a smaller subset of scraped websites based on upvoted reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
Thanks, Sage. Facebook’s recently-released LLaMA also shares some of its training sources, it turns out, with similar weighting for Wikipedia - only 4.5% of training text, but more heavily weighted than most other sources: https://twitter.com/GuillaumeLample/status/1629151234597740550
Those stats are undercounting, since the top source (CommonCrawl) itself includes Wikipedia as its third-largest source.
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
This is an important development for editors to be aware of - we're going to have to be increasingly on the lookout for sources using ML-generated bullshit. Here are two instances I'm aware of this week:
https://www.thenation.com/article/culture/internet-archive-publishers-lawsui...
In late February, Tyler Cowen, a libertarian economics professor at George Mason University, published a blog post titled “Who was the most important critic of the printing press in the 17th century?” https://web.archive.org/web/20230305055906/https://marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html. Cowen’s post contended that the polymath and statesman Francis Bacon was an “important” critic of the printing press; unfortunately, the post contains long, fake quotes attributed to Bacon’s *The Advancement of Learning* (1605), complete with false chapter and section numbers. Tech writer Mathew Ingram drew attention to the fabrications a few days later https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/, noting that Cowen has been writing approvingly about the AI chatbot ChatGPT https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html for some time now; several commenters on Cowen’s post assumed the fake quotes must be the handiwork of ChatGPT. (Cowen did not reply to e-mailed questions regarding the post by press time, and later removed the post entirely, with no explanation whatsoever. However, a copy remains at the Internet Archive’s Wayback Machine.)
https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-artic... An article claiming to identify misinformation in an Oscar-winning documentary about imprisoned Russian dissident Alexei Navalny is itself full of misinformation, thanks to the author using AI. Investigative news outlet *The Grayzone* recently published an article https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/ that included AI-generated text as a source for its information. The piece http://web.archive.org/web/20230314131551/https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/, “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy Komisar, included hyperlinks to PDFs http://web.archive.org/web/20230314121144/https://www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf uploaded to the author’s personal website that appear to be screenshots of conversations she had with ChatSonic, a free generative AI chatbot that advertises itself as a ChatGPT alternative that can “write factual trending content” using Google search results.
That said, I don't think this is anything to be too stressed about; the Grayzone is already a deprecated source and blogs like Marginal Revolution are treated with caution, though Cowen has sufficient credentials to be treated as a reliable expert.
“Cowen has sufficient credentials to be treated as a reliable expert”
Maybe not for much longer.
Cheers, P.
Hello,
I would like to point to "Copilot" in the Edge browser as potentially relevant to Wikipedia [1][2].
It is foreseeable that end-users will be able to open sidebars in their Web browsers and chat with large language models about the contents of specific Web documents, e.g., encyclopedia articles. Web browsers can also expose task context: the documents or articles in users' current tabs, and potentially users' scroll positions and their selections or highlights of content.
I, for one, am thinking about how Web standards, e.g., Web schema, can be of use for amplifying these features and capabilities for end-users.
Best regards, Adam Sobieski
[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-c... [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
On Fri, Mar 17, 2023 at 6:03 PM The Cunctator cunctator@gmail.com wrote:
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
I completely agree. It makes me pretty angry that Wikipedians have spent millions of volunteer hours creating content to educate and inform people as accurately as we can, and it's being used to generate convincing but often wildly misleading bullshit.
The ground truth on what AI-generated content is (from a copyright standpoint) and where authorship/ownership lies seems to be rapidly evolving. The U.S. Copyright Office recently refused to register copyrights for some AI-generated works, seemingly on the principle that they lack human authorship / that prompting a model is essentially contracting work for hire from an artist or writer.
IANAL of course, but to me this implies that the *egregious* lack of attribution in models that rely substantially on Wikipedia violates the attribution requirements of the CC licenses. Just like the Foundation took a principled position in testing the legality of warrantless mass surveillance, I would love to see us push back on the notion that it's legal or moral for OpenAI or any of these other companies to take our content and use it to flood the Internet with machine-generated word diarrhea.
In the European Union there is a "Regulation Laying Down Harmonized Rules on Artificial Intelligence", aka the AI Act, and an "AI Liability Directive" (AILD) in the pipeline. The AI Act text is, as far as I know, currently in the finalization phase. * https://www.insideprivacy.com/artificial-intelligence/eu-ai-policy-and-regul... * https://www.dentons.com/en/insights/articles/2023/february/1/regulating-ai-i...
An interesting note is that there are no Wikipedia articles about these, nor even Wikidata items.
Br, -- Kimmo Virtanen, Zache
On Sat, Mar 18, 2023 at 4:06 AM Steven Walling steven.walling@gmail.com wrote:
On Fri, Mar 17, 2023 at 6:03 PM The Cunctator cunctator@gmail.com wrote:
I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
I completely agree. It makes me pretty angry that Wikipedians have spent millions of volunteer hours creating content to educate and inform people as accurately as we can, and it's being used to generate convincing but often wildly misleading bullshit.
The ground truth on what generated AI content is (from a copyright position) and where authorship/ownership lies seems to be rapidly evolving. The U.S. Copyright Office recently refused to issue copyrights for some AI-generated works, seemingly on the principle that they lack human authorship / are essential to contracting work for hire from an artist or writer.
IANAL of course, but to me this implies that responsibility for the *egregious* lack of attribution in models that rely substantially on Wikipedia is violating the Attribution requirements of CC licenses. Just like the Foundation took a principled position in testing the legality of warrantless mass surveillance, I would love to see us push back on the notion that it's legal or moral for OpenAI or any of these other companies to take our content and use it to flood the Internet with machine-generated word diarrhea.
On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Hello,
I would like to indicate "Copilot" in the Edge browser as being potentially relevant to Wikipedia [1][2].
It is foreseeable that end-users will be able to open sidebars in their Web browsers and subsequently chat with large language models about the contents of specific Web documents, e.g., encyclopedia articles. Using Web browsers, there can be task contexts available, including the documents or articles in users' current tabs, potentially including users' scroll positions, potentially including users' selections or highlightings of content.
I, for one, am thinking about how Web standards, e.g., Web schema, can be of use for amplifying these features and capabilities for end-users.
Best regards, Adam Sobieski
[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-c... [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
Hi all
I agree that the AI creators should attribute Wikipedia as their source. But on the other hand, when the result is incorrect or otherwise flawed, we might actually be glad that they do not attribute it. The issue is how to convince readers to come to the source (our projects) rather than using in-between steps (AI or otherwise).
Matej
On Fri, Mar 17, 2023 at 7:05 PM Steven Walling steven.walling@gmail.com wrote:
IANAL of course, but to me this implies that responsibility for the *egregious* lack of attribution in models that rely substantially on Wikipedia is violating the Attribution requirements of CC licenses.
Morally, I agree that companies like OpenAI would do well to recognize and nurture the sources they rely upon in training their models. Especially as the web becomes polluted with low quality AI-generated content, it would seem in everybody's best interest to sustain the communities and services that make and keep high quality information available. Not just Wikimedia, but also the Internet Archive, open access journals and preprint servers, etc.
Legally, it seems a lot murkier. OpenAI in particular does not distribute any of its GPT models. You can feed them prompts by various means, and get responses back. Do those responses plagiarize Wikipedia?
With image-generating models like Stable Diffusion, it's been found that the models sometimes generate output nearly indistinguishable from source material [1]. I don't know if similar studies have been undertaken for text-generating models yet. You can certainly ask GPT-4 to generate something that looks like a Wikipedia article -- here are example results for generating a random Wikipedia article:
Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film) GPT-4 run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1 (cut off at the ChatGPT generation limit) GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2 GPT-4 run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3
It imitates the form of a Wikipedia article & mixes up / makes up assertions, but I don't know that any of its generations would meet the standard of infringing on the Wikipedia article's copyright. IANAL either, and as you say, the legal landscape is evolving rapidly.
Warmly, Erik
[1] https://arstechnica.com/information-technology/2023/02/researchers-extract-t...
On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller eloquence@gmail.com wrote:
The whole thing is definitely a hot mess. If the remixing/transformation by the model is a derivative work, it means OpenAI is potentially violating the ShareAlike requirement by not distributing the text output as CC. But on the other hand, the nature of the model means they’re combining CC and non-free works freely / at random, unless a court would interpret whatever % of training data comes from us as the direct degree to which the model output is derived from Wikipedia. Either way it’s going to be up to some legal representation of copyright holders to test the boundaries here.
Or, maybe just require an open disclosure of where the bot pulled from and how much, instead of having it be a black box? "Text in this response derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 'SomeOtherThing', 10%...".
Current (i.e. ChatGPT) systems don't work that way, as the source of information is lost when the information is encoded into the model. The model is just a network of probabilities, and it is highly compressed compared to the original data. We are missing the point if we believe it is a copy of the source data rather than a tool for interacting with information using natural language.
Soon, tools will be able to retrieve data from external sources and write answers based on them [1]. For example, in the Wikipedia context, this would mean using a search engine to find information automatically, summarizing the findings, and generating references for the results. Or vice versa: retrieving information from Wikipedia or Wikidata. Then we will get source data too, but the LLM's internal reasoning will still be fuzzy. (A rough sketch of this retrieve-then-generate flow follows below.)
[1] https://interconnected.org/home/2023/03/16/singularity
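To make the retrieve-then-generate flow concrete, here is a minimal, hypothetical sketch in Python; search_wikipedia and generate are stand-in stubs (not any real API), and the structure only shows how sources can be carried through to the final answer and its references:

    from dataclasses import dataclass

    @dataclass
    class Passage:
        title: str
        url: str
        text: str

    def search_wikipedia(query: str) -> list[Passage]:
        # Stand-in for a real search step (e.g. a search engine or the MediaWiki API).
        return [Passage("Example", "https://en.wikipedia.org/wiki/Example", "Example text ...")]

    def generate(prompt: str) -> str:
        # Stand-in for a call to a language model.
        return "A short answer grounded in the passages above."

    def answer_with_references(question: str) -> str:
        passages = search_wikipedia(question)
        context = "\n\n".join(f"[{i+1}] {p.title}\n{p.text}" for i, p in enumerate(passages))
        answer = generate(f"Answer using only the sources below.\n\n{context}\n\nQuestion: {question}")
        refs = "\n".join(f"[{i+1}] {p.url}" for i, p in enumerate(passages))
        return f"{answer}\n\nReferences:\n{refs}"

    print(answer_with_references("What is an example?"))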
Br, -- Kimmo Virtanen
FYI, there's an open letter requesting a 6-month pause on AI development https://futureoflife.org/open-letter/pause-giant-ai-experiments/, with reasonable arguments (in my opinion) and signed by several big names too. The basic rationale, as I understand it, is that similar to human cloning, human germline modification, gain-of-function research and other world-changing and potentially dangerous technologies, there should be some kind of procedure to ensure that safety keeps pace with development, which the current AI race is not allowing.
On Wed, Mar 29, 2023 at 1:04 PM Felipe Schenone schenonef@gmail.com wrote:
FYI, there's an open letter requesting a 6-month pause on AI development, [ https://futureoflife.org/open-letter/pause-giant-ai-experiments/ ] with reasonable arguments (in my opinion) and signed by several big names too.
First, I want to point out that a "pause for at least 6 months the training of AI systems more powerful than GPT-4" doesn't involve halting research on how to prevent existing models from hallucinating, how to cause them to summarize and cite reliable sources verifiably and neutrally, how to allow them to be easily and inexpensively edited for updates and corrections, or how to benchmark the performance of competing approaches to using them for editing tasks, as I've proposed the Foundation should do.
Secondly, I doubt such a pause on training larger models will do anything to address any of the largest risks of LLMs, including any of the risks which have been articulated as a threat to the projects, as far as I know. Existing models a couple generations behind the bleeding edge are more than good enough to, for example, run an organic-appearing campaign to bias Wikipedia articles in pernicious ways for pay, or even as a dedicated individual's personal project with a budget no larger than that of many common hobbies.
I suggest that the nominally non-free restrictions on use of the BLOOM RAIL license are a superior approach to addressing the immediate risks compared to a mere six month moratorium on larger models, especially if those restrictions were codified into law. The following is from https://openfuture.pubpub.org/pub/notes-on-open-ai
"The authors of the RAIL license acknowledge that the license does not meet the Open Source Initiative definition of open code licenses (and it does not meet the Open Definition either). In related news, the newly launched Can’t Be Evil licenses also challenged established open licensing models, while seeking to uphold the spirit of open sharing.... Traditionally, debates over what constitutes an open license were related to normative debates about ensuring user freedoms. Authors of the RAIL license rightly point out that these need to be balanced today with care for responsible uses."
-LW