Lauren:
Erik, I see your point now and agree with you. But doesn't it seem that insisting on a perfect license is, at present, the enemy of the urgent good of bringing a concerted effort to bear on problems that are clearly detrimental to project integrity?
I don't think the licensing question matters for the purposes of evaluating third-party APIs (including giving Wikimedia volunteers access to participate in such evaluations), but I would personally draw the line at something like a Wikimedia Cloud Infrastructure installation. Spending a lot of money on compute infrastructure to run a proprietary model strikes me as clearly out of scope for the Wikimedia mission.
Openly licensed models for machine translation, like Facebook's M2M-100 (https://huggingface.co/facebook/m2m100_418M), or for text generation, like Cerebras-GPT-13B (https://huggingface.co/cerebras/Cerebras-GPT-13B) and GPT-NeoX-20B (https://huggingface.co/EleutherAI/gpt-neox-20b), seem like better targets for running on Wikimedia infrastructure, if there's any merit in running them at this stage.
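For illustration (this is my own sketch, not something the thread depends on), running the M2M-100 checkpoint with the Hugging Face transformers library takes only a few lines; it mirrors the usage shown on the model card, and the example sentence and language pair are just placeholders:

    # Sketch: English -> French translation with the openly licensed
    # M2M-100 (418M) checkpoint via Hugging Face transformers.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "en"
    encoded = tokenizer("Wikipedia is a free online encyclopedia.", return_tensors="pt")
    # Force the decoder to start generating in the target language (French).
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))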
Note that Facebook's proprietary but widely circulated LLaMA model has triggered a lot of work on dramatically improving the performance of LLMs through more efficient implementations, to the point that you can run a decent-quality LLM (and combine it with OpenAI's freely licensed Whisper speech recognition model) on a consumer-grade laptop:
https://github.com/ggerganov/llama.cpp
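To give a sense of what that looks like in practice (again my own sketch, not part of that project's documentation), the llama-cpp-python bindings for llama.cpp expose local inference in a few lines; the model path and prompt below are placeholders for whatever quantized checkpoint you have converted locally:

    # Sketch: local CPU inference via llama-cpp-python (Python bindings
    # for llama.cpp). The model path is a placeholder for a locally
    # converted, quantized checkpoint.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
    output = llm(
        "Q: What is the capital of France? A:",
        max_tokens=32,
        stop=["Q:", "\n"],
    )
    print(output["choices"][0]["text"])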
While I'm not sure whether the "hallucination" problem is tractable when all you have is an LLM, I am confident (based on, e.g., the recent Alpaca results: https://crfm.stanford.edu/2023/03/13/alpaca.html) that the performance of smaller models will continue to improve as we find better ways to train, steer, align, modularize and extend them.
Chris:
there is probably an implicit licence granted by whoever publishes the work for whoever views it to use it.
Here's a link to the Stable Diffusion (image generation) model weights from their official repository. Note the lack of any licensing statement or clickthrough agreement when directly downloading the weights.
https://huggingface.co/stabilityai/stable-diffusion-2-base/resolve/main/512-...
Are you infringing Stability AI's copyright by clicking this link? If not, are you infringing Stability AI's copyright by then writing a Python script that uses this file to generate images, if you only run it locally on your GPU?
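To make the second question concrete, such a script could be as short as the following sketch with the diffusers library (here pulling the same Stable Diffusion 2 base weights by repository name rather than from the single downloaded file; the prompt and filename are placeholders):

    # Sketch: local image generation with the Stable Diffusion 2 base
    # weights via diffusers. Runs entirely on a local GPU once the
    # weights are cached.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
    image.save("lighthouse.png")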
Even if a court answers either question with "yes", it still does not follow that you are bound by any other licensing terms Stability AI attaches to those files, terms you never agreed to when you clicked the link.
But this discussion highlights the fundamental difference between free licenses like CC-BY-SA/GPL and nonfree "ethical use" licenses like OpenRAIL-M. If you want to enforce your ethical use restrictions without a clickthrough agreement, you have no choice but to adopt an expansive definition of copyright infringement. This is somewhat ironic, given that the models themselves are trained on vast amounts of copyrighted data without permission.
Warmly, Erik