On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller eloquence@gmail.com wrote:
...With image-generating models like Stable Diffusion, it's been found that the models sometimes generate output nearly indistinguishable from source material [1]. I don't know if similar studies have been undertaken for text-generating models yet.
They have, and LLMs absolutely do encode verbatim copies of portions of their training data, which can be reproduced intact with little effort. See https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2, where document extraction is treated as an "attack" against such systems, which to me implies that the researchers fully realize they are dealing with copyright issues on an enormous scale. Please see also https://bair.berkeley.edu/blog/2020/12/20/lmmem/
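To make the extraction idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library, GPT-2 as a stand-in model, and a hypothetical prefix (none of these are specified in the work cited above): prompt the model with the opening words of a passage suspected to be in its training data and decode greedily.

    # Minimal memorization probe -- a sketch, not the cited papers' method.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Hypothetical prefix of a passage suspected to be in the training set.
    prefix = "We the People of the United States, in Order to form"
    inputs = tokenizer(prefix, return_tensors="pt")

    # Greedy decoding (no sampling) surfaces whatever continuation the
    # model has stored most strongly for this prefix.
    output_ids = model.generate(inputs.input_ids, max_new_tokens=40,
                                do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If the greedy continuation matches the original passage word for word, the model has memorized it; the Berkeley post above describes doing this kind of probing systematically and at scale.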
On Sat, Mar 18, 2023 at 9:17 PM Steven Walling steven.walling@gmail.com wrote:
The whole thing is definitely a hot mess. If the remixing/transformation by the model is a derivative work, it means OpenAI is potentially violating the ShareAlike requirement by not distributing the text output as CC....
The Foundation needs to get on top of this by making a public request to all of the LLM providers which use Wikipedia as training data, asking that they acknowledge attribution for any output which may have depended on CC-BY-SA content, license model outputs as CC-BY-SA, and, most importantly, disclaim any notion of accuracy or fidelity to the training data. This needs to be done soon. So many people are preparing to turn the reins of their editorial control over to these new LLMs, which they don't understand. CNET [https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151], not to mention Tyler Cowen's blog, has already felt the pain, but sadly decided to hastily try to cover it up. The overarching risk here is akin to "citogenesis" but much more pernicious.
On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen kimmo.virtanen@wikimedia.fi wrote:
Or, maybe just require an open disclosure of where the bot pulled from and how much, instead of having it be a black box? "Text in this response derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 'SomeOtherThing', 10%...".
Current (i.e. ChatGPT) systems don't work that way, as the source of the information is lost when it is encoded into the model....
In fact, they do work that way, but it takes some effort to elucidate the source of any given output. Anyone discussing these issues needs to become familiar with ROME (Rank-One Model Editing): https://twitter.com/mengk20/status/1588581237345595394 Please see also https://www.youtube.com/watch?v=_NMQyOu2HTo
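ROME itself traces facts causally through the network's weights; as a far cruder illustration that the percentage breakdown quoted above is at least computable, here is a sketch based on plain word n-gram overlap. The article texts and response are hypothetical stand-ins, and this is my own toy proxy, not ROME and not anything the LLM providers actually ship.

    # Toy attribution by n-gram overlap -- a sketch, not ROME.
    from collections import Counter

    def ngrams(text, n=5):
        # Count word n-grams in a lowercased text.
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def attribute(response, articles, n=5):
        # Fraction of the response's n-grams found in each candidate article.
        resp = ngrams(response, n)
        total = sum(resp.values()) or 1
        return {title: sum(c for g, c in resp.items() if g in ngrams(body, n)) / total
                for title, body in articles.items()}

    articles = {  # hypothetical candidate sources
        "Example": "an example is a thing characteristic of its kind used to illustrate it",
        "SomeOtherThing": "some other thing is discussed in this entirely unrelated article",
    }
    response = "an example is a thing characteristic of its kind says the model"
    for title, share in attribute(response, articles).items():
        print(f"{share:.0%} Wikipedia article '{title}'")

Real attribution would have to work at the level of model internals rather than surface text, which is exactly what makes ROME-style tracing interesting.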
With luck we will all have the chance to discuss these issues in detail during the March 23 Zoom discussion of large language models for Wikimedia projects: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...
--LW