On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller eloquence@gmail.com wrote:
...With image-generating models like Stable Diffusion, it's been found that the models sometimes generate output nearly indistinguishable from source material [1]. I don't know if similar studies have been undertaken for text-generating models yet.
They have, and LLMs absolutely do encode verbatim copies of portions of their training data, which can be reproduced intact with little effort. See https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2, where document extraction is treated as an "attack" against such systems, which to me implies that the researchers fully realize they are dealing with copyright issues on an enormous scale. Please see also https://bair.berkeley.edu/blog/2020/12/20/lmmem/
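To make the extraction idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library, GPT-2 as a stand-in model, and a hypothetical prefix (none of these are specified in the work cited above): prompt the model with the opening words of a passage suspected to be in its training data and decode greedily.

    # Minimal memorization probe -- a sketch, not the cited papers' method.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Hypothetical prefix of a passage suspected to be in the training set.
    prefix = "We the People of the United States, in Order to form"
    inputs = tokenizer(prefix, return_tensors="pt")

    # Greedy decoding (no sampling) surfaces whatever continuation the
    # model has stored most strongly for this prefix.
    output_ids = model.generate(inputs.input_ids, max_new_tokens=40,
                                do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If the greedy continuation matches the original passage word for word, the model has memorized it; the Berkeley post above describes doing this kind of probing systematically and at scale.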
On Sat, Mar 18, 2023 at 9:17 PM Steven Walling steven.walling@gmail.com wrote:
The whole thing is definitely a hot mess. If the remixing/transformation by the model is a derivative work, it means OpenAI is potentially violating the ShareAlike requirement by not distributing the text output as CC....
The Foundation needs to get on top of this by making a public request to all of the LLM providers which use Wikipedia as training data, asking that they acknowledge attribution for any output which may have depended on CC-BY-SA content, license model outputs as CC-BY-SA, and, most importantly, disclaim any notion of accuracy or fidelity to the training data. This needs to be done soon. So many people are preparing to turn the reins of their editorial control over to these new LLMs, which they don't understand. CNET [https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151], not to mention Tyler Cowen's blog, has already felt the pain, but sadly decided to hastily try to cover it up. The overarching risk here is akin to "citogenesis" but much more pernicious.
On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen kimmo.virtanen@wikimedia.fi wrote:
Or, maybe just require an open disclosure of where the bot pulled from and how much, instead of having it be a black box? "Text in this response derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 'SomeOtherThing', 10%...".
Current (i.e. ChatGPT) systems don't work that way, as the source of the information is lost when it is encoded into the model....
In fact, they do work that way, but it takes some effort to elucidate the source of any given output. Anyone discussing these issues needs to become familiar with ROME (Rank-One Model Editing): https://twitter.com/mengk20/status/1588581237345595394 Please see also https://www.youtube.com/watch?v=_NMQyOu2HTo
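ROME itself traces facts causally through the network's weights; as a far cruder illustration that the percentage breakdown quoted above is at least computable, here is a sketch based on plain word n-gram overlap. The article texts and response are hypothetical stand-ins, and this is my own toy proxy, not ROME and not anything the LLM providers actually ship.

    # Toy attribution by n-gram overlap -- a sketch, not ROME.
    from collections import Counter

    def ngrams(text, n=5):
        # Count word n-grams in a lowercased text.
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def attribute(response, articles, n=5):
        # Fraction of the response's n-grams found in each candidate article.
        resp = ngrams(response, n)
        total = sum(resp.values()) or 1
        return {title: sum(c for g, c in resp.items() if g in ngrams(body, n)) / total
                for title, body in articles.items()}

    articles = {  # hypothetical candidate sources
        "Example": "an example is a thing characteristic of its kind used to illustrate it",
        "SomeOtherThing": "some other thing is discussed in this entirely unrelated article",
    }
    response = "an example is a thing characteristic of its kind says the model"
    for title, share in attribute(response, articles).items():
        print(f"{share:.0%} Wikipedia article '{title}'")

Real attribution would have to work at the level of model internals rather than surface text, which is exactly what makes ROME-style tracing interesting.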
With luck we will all have the chance to discuss these issues in detail during the March 23 Zoom discussion of large language models for Wikimedia projects: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...
--LW