On Sun, Mar 19, 2023 at 12:12 PM Lauren Worden laurenworden89@gmail.com wrote:
> They have, and LLMs absolutely do encode a verbatim copy of their training data, which can be produced intact with little effort. See https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2, where document extraction is considered an "attack" against such systems, which to me implies that the researchers fully realize they are involved with copyright issues on an enormous scale. Please see also https://bair.berkeley.edu/blog/2020/12/20/lmmem/
Thanks for these links, Lauren. I think it could be a very interesting research project (for WMF, affiliates, or Wikimedia research community members) to attempt to elicit Wikimedia project content such as Wikipedia articles via the GPT-3.5 or GPT-4 API, to begin quantifying the degree to which the models reproduce exact copies (or legally covered derivative works, as opposed to novel expressions).
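For what it's worth, here is a rough sketch of what such a probe could look like in Python -- this assumes the openai client library and a locally stored copy of the article text, and the function and parameter names are purely illustrative rather than an established methodology:

    # Prompt the model with the opening of an article, then measure how closely
    # its continuation matches the real article text. Requires OPENAI_API_KEY
    # to be set in the environment.
    import difflib
    import openai

    def continuation_overlap(title, article_text, prompt_chars=500,
                             model="gpt-3.5-turbo"):
        prompt = ("Continue the Wikipedia article '" + title + "' verbatim:\n\n"
                  + article_text[:prompt_chars])
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep the output as deterministic as possible
        )
        generated = response["choices"][0]["message"]["content"]
        reference = article_text[prompt_chars:prompt_chars + len(generated)]
        # Similarity ratio in [0, 1]; values near 1 suggest near-verbatim recall.
        return difflib.SequenceMatcher(None, generated, reference).ratio()

Averaging that ratio over a random sample of articles would give a first, rough estimate of how often the output is near-verbatim rather than a novel expression.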
> With luck we will all have the chance to discuss these issues in detail in the March 23 Zoom discussion of large language models for Wikimedia projects: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...
I won't be able to join, but am glad this is happening. I agree that it would be good for WMF to engage with LLM providers on these questions of attribution sooner rather than later, if that is not already underway. As I understand it, WMF is still not in a privileged position to assert or enforce copyright (because it requires no copyright assignment from authors) -- but it can certainly make the legal requirements clear, and also develop best practices that go beyond the legal minimum.
Warmly,
Erik