Lauren:
Erik, I see your point now and agree with you. But doesn't it seem that insisting on a perfect license is, at present, the enemy of the urgent good of bringing a concerted effort to bear on problems that are clearly detrimental to project integrity?
I don't think the licensing question matters for the purposes of evaluating third-party APIs (including giving Wikimedia volunteers access to participate in such evaluations), but I would personally draw the line at something like a Wikimedia Cloud Infrastructure installation. Spending a lot of money on compute infrastructure to run a proprietary model strikes me as clearly out of scope for the Wikimedia mission.
Openly licensed models for machine translation, like Facebook's M2M-100 (https://huggingface.co/facebook/m2m100_418M), or for text generation, like Cerebras-GPT-13B (https://huggingface.co/cerebras/Cerebras-GPT-13B) and GPT-NeoX-20B (https://huggingface.co/EleutherAI/gpt-neox-20b), seem like better targets for running on Wikimedia infrastructure, if there's any merit in running them at this stage.
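For illustration (this is my own sketch, not something the thread depends on), running the M2M-100 checkpoint with the Hugging Face transformers library takes only a few lines; it mirrors the usage shown on the model card, and the example sentence and language pair are just placeholders:

    # Sketch: English -> French translation with the openly licensed
    # M2M-100 (418M) checkpoint via Hugging Face transformers.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "en"
    encoded = tokenizer("Wikipedia is a free online encyclopedia.", return_tensors="pt")
    # Force the decoder to start generating in the target language (French).
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))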
Note that Facebook's proprietary but widely circulated LLaMA model has triggered a lot of work on dramatically improving the performance of LLMs through more efficient implementations, to the point that you can run a decent-quality LLM (and combine it with OpenAI's freely licensed Whisper speech recognition model) on a consumer-grade laptop:
https://github.com/ggerganov/llama.cpp
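To give a sense of what that looks like in practice (again my own sketch, not part of that project's documentation), the llama-cpp-python bindings for llama.cpp expose local inference in a few lines; the model path and prompt below are placeholders for whatever quantized checkpoint you have converted locally:

    # Sketch: local CPU inference via llama-cpp-python (Python bindings
    # for llama.cpp). The model path is a placeholder for a locally
    # converted, quantized checkpoint.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
    output = llm(
        "Q: What is the capital of France? A:",
        max_tokens=32,
        stop=["Q:", "\n"],
    )
    print(output["choices"][0]["text"])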
While I'm not sure whether the "hallucination" problem is tractable when all you have is an LLM, I am confident (based on, e.g., the recent Alpaca results: https://crfm.stanford.edu/2023/03/13/alpaca.html) that the performance of smaller models will continue to improve as we find better ways to train, steer, align, modularize and extend them.
Chris:
there is probably an implicit licence granted by whoever publishes the work for whoever views it to use it.
Here's a link to the Stable Diffusion (image generation) model weights from their official repository. Note the lack of any licensing statement or clickthrough agreement when directly downloading the weights.
https://huggingface.co/stabilityai/stable-diffusion-2-base/resolve/main/512-...
Are you infringing Stability AI's copyright by clicking this link? If not, are you infringing Stability AI's copyright by then writing a Python script that uses this file to generate images, if you only run it locally on your GPU?
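To make the second question concrete, such a script could be as short as the following sketch with the diffusers library (here pulling the same Stable Diffusion 2 base weights by repository name rather than from the single downloaded file; the prompt and filename are placeholders):

    # Sketch: local image generation with the Stable Diffusion 2 base
    # weights via diffusers. Runs entirely on a local GPU once the
    # weights are cached.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
    image.save("lighthouse.png")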
Even if a court answers either question with "yes", it still does not follow that you are bound by any other licensing terms Stability AI attaches to those files, terms you never agreed to when you clicked the link.
But this discussion highlights the fundamental difference between free licenses like CC-BY-SA/GPL and nonfree "ethical use" licenses like OpenRAIL-M. If you want to enforce your ethical use restrictions without a clickthrough agreement, you have no choice but to adopt an expansive definition of copyright infringement. This is somewhat ironic, given that the models themselves are trained on vast amounts of copyrighted data without permission.
Warmly, Erik