*TL;DR: It was previously claimed on this list that it's generally
technically possible to attribute information in the output of a LLM-based
chatbot (such as ChatGPT) to specific parts of the LLM's training data
(such as a Wikipedia article). These claims are dubious and we shouldn't
rely on them as we continue to navigate the relations between Wikimedia
projects and LLMs.*
On Sun, Mar 19, 2023 at 12:12 PM Lauren Worden <laurenworden89(a)gmail.com>
wrote:
[...]
On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen
<kimmo.virtanen(a)wikimedia.fi> wrote:
> Or, maybe just require an open disclosure of where the bot pulled from
and how
much, instead of having it be a black box? "Text in this response
derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
'SomeOtherThing', 10%...".
Current (ie. ChatGPT) systems doesn't work that way, as the source of
information is lost in the process when the information is encoded into the
model....
In fact, they do work that way, but it takes some effort to elucidate
the source of any given output. Anyone discussing these issues needs
to become familiar with ROME:
https://twitter.com/mengk20/status/1588581237345595394 Please see also
https://www.youtube.com/watch?v=_NMQyOu2HTo
I sense some confusion here. That paper (ROME, http://rome.baulab.info/ )
is about attributing a model's factual claims to specific parts (weights,
neurons) of its neural network (and then changing them). It is **not**
about attribution to specific parts of its training data (such as Wikipedia
articles or other web pages), which is what Wikimedians have been
expressing concerns about.
In other words, it's entirely unclear why this should contradict what Kimmo
had said (and, separately in this thread, Galder
<https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/VC666JWWVZJ77SZIO7KF46RIL236LF5N/attachment/0/attachment.htm>
).
(Trying to understand LLMs with analogies can be treacherous
<https://threadreaderapp.com/thread/1631491972685869056.html>. But for
people who automatically assume that neural networks "do work that way" -
i.e. preserve this kind of provenance information - and that chatbots can
be required to disclose "where [they] pulled from and how much" for a
particular answer: Imagine someone accosting you in the street and asking
you where you had originally learned that Paris is the capital of
France, say. How many of us would be able to come up with a truthful answer
like "our geography teacher told us in third grade" or "I read this in
Encyclopaedia Britannica when I was 10 years old"?)
With luck we will all have the chance to discuss these issues in
detail on the March 23 Zoom discussion of large
language models for
Wikimedia projects:
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/…
The notes from that meeting (now at
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/…
) contain the following statements:
*"In an ideal world, the Foundation would start internal projects to
replicate ROME and RARR."*
*"The Foundation should make a public statement in support of increasing
the accuracy of attribution and verification systems such as RARR [
https://arxiv.org/abs/2210.08726 <https://arxiv.org/abs/2210.08726> ]"*
These proposals do not seem to have made it into the WMF's actual annual
plan in the end. And I realize that this thread is already a couple of
months old. However, it still seems worth resolving misconceptions in this
regard, e.g. because there have been references to such claims more
recently in other community discussion spaces.
Regarding RARR (the second project proposed in that meeting and here on
this list, as something that WMF should replicate or embrace):
RARR is indeed designed to find **a** text document supporting a given
statement produced by an LLM. But importantly, it makes no claim that the
source it finds was "the" original source used by the LLM. The first "R" in
"RARR" stands for "Retrofit[ting]" attribution - not for "restoring",
"retrieving" or the like. (In fact, RARR doesn't even try to find a source
in the model's training corpus. It simply does a Google search of the
entire internet; see section 3.1 in the paper.) In other words, it too
won't "elucidate **the** source of any given output" as claimed above (my
bolding).
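To make the distinction concrete, here is a toy Python sketch of what "retrofitting" attribution means: find *any* supporting document by similarity, with no claim about the model's actual source. The corpus, document names, and word-overlap scoring here are all illustrative simplifications of mine - the real RARR uses a web search engine plus an entailment model (section 3.1 of the paper).

```python
import re

# Toy sketch of RARR-style "retrofit" attribution: given a model output,
# find ANY corpus document that supports it - without claiming that this
# document was the model's actual source during training.

def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrofit_attribution(claim, corpus):
    """Return the corpus document best supporting the claim, by word overlap.
    (The real RARR instead runs a Google search and an entailment check.)"""
    claim_words = tokenize(claim)
    best_doc, best_score = None, 0.0
    for name, text in corpus.items():
        overlap = len(claim_words & tokenize(text)) / max(len(claim_words), 1)
        if overlap > best_score:
            best_doc, best_score = name, overlap
    return best_doc, best_score

# Hypothetical two-document corpus: both support the claim equally well
# in principle, and the method simply picks whichever overlaps most.
corpus = {
    "wikipedia:Paris": "Paris is the capital and most populous city of France.",
    "nytimes:2021-article": "The capital of France, Paris, hosted the summit.",
}
doc, score = retrofit_attribution("Paris is the capital of France", corpus)
print(doc, score)
```

Note that the method would happily return the NYT document if its wording matched slightly better - which is exactly the point: it retrofits a plausible source, it does not recover *the* source.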
This is particularly relevant given that, e.g., on English Wikipedia we
generally require information to be attributable to reliable external
sources. So even if a chatbot's statement were indeed based on a Wikipedia
article, but that article cited the New York Times for the information,
RARR might very well pick the NYT article as the source instead of
Wikipedia.
A larger issue here is that individual facts are not owned by any company
or community. Specifically, they are not copyrightable (as Wikipedians are
well aware from their daily practice: we can't enforce citing sources -
[[WP:BURDEN]] - as a legal requirement like we do for [[WP:COPYPASTE]]).
This should be kept in mind by folks who advocate for a moral or even legal
obligation for LLMs to "cite their sources" for their output (like earlier
in this thread: "just require an open disclosure of where the bot pulled
from and how much").
Back to the technical difficulties and claims that machine learning models
"do work that way":
Folks may also be interested in a general overview paper titled "Training
Data Influence Analysis and Estimation: A Survey" (
https://arxiv.org/abs/2212.04612 ). It says e.g. that
"it can be very difficult to answer even basic questions about the
relationship between training data and model predictions; for example:
[...] Which instances in the training set caused the model to make a
specific prediction?"
Now, all that said: some weeks ago, Anthropic (a startup focused on
responsible use of AI, which is researching interpretability of LLMs)
released a new research paper that actually tries to tackle this very
difficult question in the case of LLMs, and do something like the kind of
attribution we are concerned with here:
"Large language models have demonstrated a surprising range of skills and
behaviors. How can we trace their source? In our new paper, we use
influence functions to find training examples that contribute to a given
model output. [...]" (
https://threadreaderapp.com/thread/1688946685937090560.html )
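For the curious, here is a toy sketch of the general idea behind gradient-based influence estimation, using a one-parameter linear model with made-up numbers. This is a crude first-order simplification of mine: Anthropic's paper actually uses EK-FAC-approximated influence functions (inverse-Hessian-vector products) on real LLMs, not this bare gradient dot product.

```python
# Toy model: 1-D linear regression y = w * x with squared loss.
# Influence proxy: dot product of a training example's loss gradient
# with the query example's loss gradient at the trained weight -
# larger magnitude = more estimated influence on that prediction.

def grad(w, x, y):
    """d/dw of 0.5 * (w*x - y)^2"""
    return (w * x - y) * x

# Tiny made-up "training set"; assume training converged near w = 2.
train = [(1.0, 2.0), (2.0, 4.1), (0.1, 0.19)]
w = 2.0

# Query example whose prediction we want to attribute to training data.
query = (3.0, 5.9)

influences = [grad(w, x, y) * grad(w, *query) for x, y in train]
ranked = sorted(zip(influences, train), key=lambda t: abs(t[0]), reverse=True)
for score, (x, y) in ranked:
    print(f"train example ({x}, {y}): influence {score:+.4f}")
```

Even in this toy case, influence is spread across examples in non-obvious ways - the example fit perfectly by the model contributes zero gradient and hence zero estimated influence, mirroring the "diffuse" picture in Anthropic's results.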
It looks like really interesting cutting-edge research. (They used some
advanced approximation techniques to make the required calculations
feasible for some LLMs - which are, however, still much smaller than e.g.
GPT-3.5 or whatever version of ChatGPT you use is based on.) If someone
with access to the required huge compute resources and technical skills
were to apply the methods described in the paper (
https://arxiv.org/abs/2308.03296 ) to specifically investigate the case of
Wikipedia, that could be fascinating. (There's also an upcoming conference
soliciting such research:
https://attrib-workshop.cc/ .)
But before anyone gets too excited: this is a statistical approach focused
on generating estimates of influence ratios only. And from the concrete
examples Anthropic shares, it seems that the relation between source and
output is typically much more diffuse and tenuous than simplistic "AI
steals from Wikipedia!!1!" type arguments would have you believe. (That's
even true for their "simple factual queries" category - see figure 42 in
the paper, for example: "Prompt: Inflation is often measured using /
Completion: the Consumer Price Index." Table 9 in the appendix describes
the sequences from the training data that were found to be most influential
for the answer in one of the examined LLMs. It observes that most of these
source texts don't actually contain the term "consumer price index",
contrary to what one might expect.)
The Anthropic authors also state generally that:
"Model outputs do not seem to result from pure memorization [...] the
influence of any particular training sequence is much smaller than the
information content of a typical sentence, so the model does not appear to
be reciting individual training examples at the token level." (
https://www.anthropic.com/index/influence-functions )
Regards, Tilman
PS 1: Of course there are still ways to make an LLM-based chatbot actually
cite sources, if one is prepared to restrict the kind of answers it can
give. Bing Chat
<https://en.wikipedia.org/wiki/Microsoft_Bing#Bing_Chat> actually
does this by default (or at least tries to), unlike ChatGPT, by
retrieving live sources at question time. Specifically regarding Wikipedia,
one can prompt ChatGPT or other LLMs to only answer based on Wikipedia
content, and hope that it complies without hallucinating. (I summarized
three such approaches, including the Wikimedia Foundation's ChatGPT plugin,
here:
https://meta.wikimedia.org/wiki/Research:Newsletter/2023/July .)
"Retrieval-augmented generation" (RAG) is a good search term for learning
more about similar approaches.
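To illustrate the idea, here is a minimal sketch of the retrieval half of RAG. The three-passage "corpus" and the keyword-overlap scoring are purely illustrative assumptions of mine (real systems use a search index or embeddings), and the resulting prompt would then be sent to an LLM - which is why answers stay traceable to named passages rather than to opaque model weights.

```python
import re

# Minimal RAG retrieval sketch: fetch the most relevant passages first,
# then instruct the model to answer ONLY from them, citing passage titles.

def score(query, passage):
    """Crude relevance score: number of shared lowercase words."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    p = set(re.findall(r"[a-z]+", passage.lower()))
    return len(q & p)

def build_prompt(query, corpus, k=2):
    """Build an LLM prompt from the top-k passages for the query."""
    top = sorted(corpus.items(), key=lambda kv: score(query, kv[1]),
                 reverse=True)[:k]
    context = "\n".join(f"[{title}] {text}" for title, text in top)
    return ("Answer using ONLY the passages below, citing passage titles.\n"
            f"{context}\nQuestion: {query}\nAnswer:")

# Hypothetical mini-corpus standing in for retrieved Wikipedia content.
corpus = {
    "Paris": "Paris is the capital of France.",
    "Berlin": "Berlin is the capital of Germany.",
    "Inflation": "Inflation is often measured using the Consumer Price Index.",
}
prompt = build_prompt("What is the capital of France?", corpus, k=1)
print(prompt)
```

Because attribution here happens at retrieval time rather than being reconstructed from the trained model, it sidesteps the training-data-provenance problem discussed above - at the cost of restricting the model to what was retrieved.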
PS 2: All this is separate from questions about the overall influence of
Wikipedia (or other parts of an LLM's training data) on the general
performance of e.g. ChatGPT, with regard to its average factual accuracy,
biases etc. The answers there are also much less clear than some appear to
assume, but that's a topic for another post.