On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller eloquence@gmail.com wrote:
...With image-generating models like Stable Diffusion, it's been found that the models sometimes generate output nearly indistinguishable from source material [1]. I don't know if similar studies have been undertaken for text-generating models yet.
They have, and LLMs absolutely do encode a verbatim copy of their training data, which can be produced intact with little effort. See https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2, where document extraction is considered an "attack" against such systems, which to me implies that the researchers fully realize they are involved with copyright issues on an enormous scale. Please see also https://bair.berkeley.edu/blog/2020/12/20/lmmem/
On Sat, Mar 18, 2023 at 9:17 PM Steven Walling steven.walling@gmail.com wrote:
The whole thing is definitely a hot mess. If the remixing/transformation by the model is a derivative work, it means OpenAI is potentially violating the ShareAlike requirement by not distributing the text output as CC....
The Foundation needs to get on top of this by making a public request to all of the LLM providers which use Wikipedia as training data, asking that they acknowledge attribution of any output which may have depended on CC-BY-SA content, license model outputs as CC-BY-SA, and, most importantly, disclaim any notion of accuracy or fidelity to the training data. This needs to be done soon. So many people are preparing to turn the reins of their editorial control over to these new LLMs, which they don't understand, and CNET [https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151], not to mention Tyler Cowen's blog, has already felt the pain but sadly decided to hastily try to cover it up. The overarching risk here is akin to "citogenesis" but much more pernicious.
On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen kimmo.virtanen@wikimedia.fi wrote:
Or, maybe just require an open disclosure of where the bot pulled from and how much, instead of having it be a black box? "Text in this response derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 'SomeOtherThing', 10%...".
Current systems (i.e. ChatGPT) don't work that way, as the source of information is lost when the information is encoded into the model....
In fact, they do work that way, but it takes some effort to elucidate the source of any given output. Anyone discussing these issues needs to become familiar with ROME: https://twitter.com/mengk20/status/1588581237345595394 Please see also https://www.youtube.com/watch?v=_NMQyOu2HTo
With luck we will all have the chance to discuss these issues in detail on the March 23 Zoom discussion of large language models for Wikimedia projects: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...
--LW
On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
They have, and LLMs absolutely do encode a verbatim copy of their training data, which can be produced intact with little effort.
https://arxiv.org/pdf/2205.10770.pdf https://bair.berkeley.edu/blog/2020/12/20/lmmem/
My understanding so far is that encoding a verbatim copy is typically due to 'Overfitting'.
This is considered a type of bug. It is undesirable for many reasons (technical, ethical, legal).
Models are (supposed to be) trained to prevent this as much as possible.
Clearly there was still work to be done as of December 2020, at the least.
sincerely, Kim Bruning
Beyond the all-or-nothing choice of making every piece of content available verbatim or none of it, developers might want some content to be available verbatim while other content is available only indirectly.
While AI systems can automatically determine which content is useful to store verbatim, if we want content authors to be able to provide hints, we could consider new HTML markup elements, clever uses of existing elements and attributes, or schema.org Web schemas.
In these regards, consider an example where an HTML document author wants to hint that the topic sentence of a paragraph may be quoted verbatim while the remainder of the paragraph should only be available indirectly. Perhaps the markup could resemble the following rough-draft sketch:
<p><span id="anchor123" role="quoteable">This is some text, a topic sentence.</span> This is a secondary sentence in the paragraph.</p>
This sketch shows some overlapping markup approaches. Perhaps all elements with IDs, i.e. URL-addressable content, should be considered verbatim-quotable. Or perhaps some HTML attribute, e.g. role, could be of use. Again, schema.org Web schemas could also play a part.
Also, I hope you find the following discussion thread interesting: https://github.com/microsoft/semantic-kernel/discussions/108 about Educational Applications of AI in Web Browsers. There, I ask some questions about modern LLMs and APIs, about referring to documents by URLs in prompts, about prioritizing some documents over others when answering questions, and so forth. A "Web browser Copilot" would have educational applications. It could allow students to ask questions pertinent to the specific HTML, PDF, and EPUB documents that they are browsing, and AI components could perhaps navigate to pages, scroll to content, and highlight document content for end-users while responding.
Best regards, Adam Sobieski
On Mon, Mar 20, 2023 at 9:28 PM Kim Bruning via Wikimedia-l wikimedia-l@lists.wikimedia.org wrote:
On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
.... LLMs absolutely do encode a verbatim copy of their training data, which can be produced intact with little effort.
https://arxiv.org/pdf/2205.10770.pdf https://bair.berkeley.edu/blog/2020/12/20/lmmem/
My understanding so far is that encoding a verbatim copy is typically due to 'Overfitting'.
This is considered a type of bug. It is undesirable for many reasons (technical, ethical, legal).
I believe the authors mainly use "overfitting" to describe the condition when the model produces verbatim copies of its training data instead of a reasonably distinct paraphrase or summary when the verbatim source is not specifically elicited. But it's not clear to me that the term isn't used in both ways.
This brings up an important point. ChatGPT seems to almost always avoid the kind of infringing paraphrases described in https://en.wikipedia.org/wiki/Wikipedia:Close_paraphrasing when asked to paraphrase or summarize input text, which makes it very useful for avoiding such issues. I get the feeling that Wikipedia editors are already using it for this purpose on a relatively large scale. But I'm hesitant to encourage such use until copyright experts familiar with legal precedents involving "substantial similarity", as described in that essay, have had the opportunity to evaluate whether such LLM output is a problem over a wide range of example cases. Ordinary Wikipedia editors have no way to know how likely this is to be a problem, how to evaluate specific cases, or how to address such issues when they arise. Professional guidance would be very helpful on this topic.
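In the meantime, a purely mechanical screen can at least flag the most obvious cases for human review. Here is a minimal sketch using only the Python standard library, with an arbitrary threshold; it of course says nothing about "substantial similarity" in the legal sense:

    import difflib

    def longest_shared_run(source_text, paraphrase):
        # Longest block of characters the paraphrase copies verbatim from the source.
        matcher = difflib.SequenceMatcher(None, source_text, paraphrase)
        match = matcher.find_longest_match(0, len(source_text), 0, len(paraphrase))
        return match.size, source_text[match.a:match.a + match.size]

    def flag_for_review(source_text, paraphrase, threshold=80):
        # Crude heuristic only: a long verbatim run means a human should look at
        # the output for close paraphrasing; a short run proves nothing either way.
        size, shared = longest_shared_run(source_text, paraphrase)
        return size >= threshold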
On Mon, Mar 20, 2023 at 8:01 PM Erik Moeller eloquence@gmail.com wrote:
... I agree that it would be good for WMF to engage with LLM providers on these questions of attribution sooner rather than later, if that is not already underway. WMF is, as I understand it, still not in any privileged position of asserting or enforcing copyright (because it requires no copyright assignment from authors) -- but it can certainly make legal requirements clear, and also develop best practices that go beyond the legal minimum.
Thank you. Another thing the Foundation could do without editors getting involved (a class action suit by editors would probably be counterproductive at best at this point, for a number of reasons, and could backfire) is to highlight and encourage the ongoing but relatively obscure work on attribution and verification by LLMs. There are two projects in particular, Sparrow [https://arxiv.org/abs/2209.14375] and RARR [https://arxiv.org/abs/2210.08726], that deserve wider recognition, support, and replication by third parties. These research directions are the most robust way to avoid the hallucination problems at the root of most everything that can go wrong when LLMs are used to produce Wikipedia content, so it would be extremely helpful if the Foundation used its clout to shine a light on them and point out that they do what we expect of Wikipedia editors: provide sources in support of summary text, cited in a way that third parties can independently verify.
The Bing LLM already includes some attempt at doing this with a dual-process search system, which I believe is modeled after the Sparrow approach. But without the explicit rigor of something like RARR, it can fail spectacularly and produce the same confidently wrong output everyone has recently become familiar with, compounded by the confounding problem of appearing to cite sources in support which do not actually support it. For example, see this thread: https://twitter.com/dileeplearning/status/1634699315582226434
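As an aside, even a very crude post-hoc check would catch some of these failures. A minimal sketch, assuming the Python requests library, where claimed_quote is whatever snippet the system presents as supporting its statement; it only tests whether that snippet literally occurs on the cited page, not whether it supports the claim:

    import re
    import requests

    def quote_appears_on_page(url, claimed_quote):
        # Fetch the cited page, strip tags, collapse whitespace, and check
        # whether the quoted supporting text actually appears in it.
        html = requests.get(url, timeout=30).text
        page_text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).lower()
        quote = re.sub(r"\s+", " ", claimed_quote).lower()
        return quote in page_text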
-LW
Google BARD, announced this week, also tries and fails to perform attribution and verification: https://old.reddit.com/r/Bard/comments/11yeegu/google_bard_claims_bard_has_a...
BARD also produces lengthy passages from its training data verbatim without elicitation: https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_...
-LW
On Wed, Mar 22, 2023 at 11:53 AM Lauren Worden laurenworden89@gmail.com wrote:
BARD also produces lengthy passages from its training data verbatim without elicitation: https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_...
Very true. I tested the "Mr. Ripley" example with Bard, and it instantly reproduced the exact language of the Wikipedia article's lede section; only the remainder of the generation was substantially different. [1]
Warmly, Erik

[1] https://social.coop/@eloquence/110071078132245483
Hi,
I just noticed that OpenAI has fixed the Wikidata property and item mappings so now it can generate working SPARQL. Example:
Prompt : Search finnish female journaists using SPARQL from Wikidata?
*GPT-3.5 (default)*
To search for Finnish female journalists using SPARQL from Wikidata, you can use the following query:

    SELECT ?journalistLabel WHERE {
      ?journalist wdt:P106 wd:Q1930187 .
      ?journalist wdt:P27 wd:Q33 .
      ?journalist wdt:P21 wd:Q6581072 .
      ?journalist wdt:P569 ?birthdate .
      ?journalist wdt:P21 ?gender .
      FILTER(YEAR(?birthdate) < 2005) .
      FILTER(?gender = wd:Q6581072) .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
    }

This query will retrieve the labels (names) of all female journalists who were born before the year 2005, are Finnish citizens, and have a profession (P106) of journalist (Q1930187). You can execute this query on the Wikidata Query Service (https://query.wikidata.org/) by pasting the query into the query editor and clicking the "Run" button. The results will be displayed in a table format.
*GPT-4*
To search for Finnish female journalists using SPARQL on Wikidata, you can use the following query:

    SELECT ?journalist ?journalistLabel WHERE {
      ?journalist wdt:P31 wd:Q5;          # instance of: human
                  wdt:P106 wd:Q1930187;   # occupation: journalist
                  wdt:P21 wd:Q6581072;    # gender: female
                  wdt:P27 wd:Q33.         # country of citizenship: Finland
      SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en". }
    }

This query searches for entities that are instances of human (Q5), have an occupation as a journalist (Q1930187), are female (Q6581072), and have Finnish citizenship (Q33). The query results will include the journalist's Wikidata ID and their label (name) in Finnish or English. You can run this query using the Wikidata Query Service (https://query.wikidata.org/).
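If anyone wants to sanity-check queries like these outside the web interface, a minimal sketch along these lines should work (assuming the Python requests library and the public Wikidata SPARQL endpoint; the LIMIT is my addition to keep the example small):

    import requests

    QUERY = """
    SELECT ?journalist ?journalistLabel WHERE {
      ?journalist wdt:P31 wd:Q5;
                  wdt:P106 wd:Q1930187;
                  wdt:P21 wd:Q6581072;
                  wdt:P27 wd:Q33.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en". }
    }
    LIMIT 10
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "sparql-sanity-check/0.1 (example script)"},
        timeout=60,
    )
    # Print the label of each returned journalist.
    for row in response.json()["results"]["bindings"]:
        print(row.get("journalistLabel", {}).get("value"))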
Br, -- Kimmo Virtanen, Zache
Repeating exactly what was in the training data is not overfitting. Overfitting is when the model fails to capture the underlying pattern in the training data and instead fits its specifics, leading to inaccurate or false results on new data. Producing the exact same prediction on the training data set is actually how the models train (they minimize that difference). Supervised ML at its core is a regression function.
Andrew Ng gives the example of a model that predicts the price of a house based on the number of rooms and similar features. Overfitting is when a completely new house gets a wildly inaccurate price estimate. Producing the exact price of an existing house from the training set is neither overfitting nor a bug.
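To make the distinction concrete, here is a toy sketch (Python with numpy, made-up numbers): a deliberately overfit polynomial reproduces every training price exactly, which is expected behaviour on the training set, and the damage only shows up on a house it has not seen.

    import numpy as np

    # Toy training data: number of rooms vs. price (made-up numbers).
    rooms = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    price = np.array([100, 150, 210, 260, 330, 400], dtype=float)

    # A degree-5 polynomial through 6 points interpolates the training set exactly.
    coeffs = np.polyfit(rooms, price, deg=5)

    # On the training set it returns the known prices -- not a bug in itself.
    print(np.round(np.polyval(coeffs, rooms)))   # ~[100. 150. 210. 260. 330. 400.]

    # Overfitting shows up on unseen data: an 8-room house gets a nonsensical
    # prediction, far off the roughly linear trend in the data.
    print(round(np.polyval(coeffs, 8.0)))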
Speaking as someone who built many ML models for Wikimedia.
On Sun, Mar 19, 2023 at 12:12 PM Lauren Worden laurenworden89@gmail.com wrote:
They have, and LLMs absolutely do encode a verbatim copy of their training data, which can be produced intact with little effort. See https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2, where document extraction is considered an "attack" against such systems, which to me implies that the researchers fully realize they are involved with copyright issues on an enormous scale. Please see also https://bair.berkeley.edu/blog/2020/12/20/lmmem/
Thanks for these links, Lauren. I think it could be a very interesting research project (for WMF, affiliates or Wikimedia research community members) to attempt to recall Wikimedia project content such as Wikipedia articles via the GPT-3.5 or GPT-4 API, to begin quantifying the degree to which the models produce exact copies (or legally covered derivative works--as opposed to novel expressions).
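A rough sketch of what such a probe might look like, assuming the openai Python client (1.x interface), an API key in the environment, and Wikipedia's REST summary endpoint; the prompt wording and the use of a simple similarity ratio are arbitrary choices for illustration:

    import difflib
    import os
    import requests
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def wikipedia_lede(title):
        # Plain-text summary (lede) of an English Wikipedia article.
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
        return requests.get(url, timeout=30).json()["extract"]

    def recall_probe(title, model="gpt-3.5-turbo"):
        lede = wikipedia_lede(title)
        first_sentence = lede.split(". ")[0] + "."
        rest_of_lede = lede[len(first_sentence):].strip()
        # Ask the model to continue the opening sentence, then compare its
        # continuation with the rest of the actual lede.
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": "Continue this encyclopedia article text:\n" + first_sentence}],
        )
        continuation = response.choices[0].message.content
        return difflib.SequenceMatcher(None, continuation, rest_of_lede).ratio()

    # Values close to 1.0 would suggest near-verbatim reproduction of the lede.
    print(recall_probe("The_Talented_Mr._Ripley_(novel)"))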
With luck we will all have the chance to discuss these issues in detail on the March 23 Zoom discussion of large language models for Wikimedia projects: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...
I won't be able to join but am glad this is happening. I agree that it would be good for WMF to engage with LLM providers on these questions of attribution sooner rather than later, if that is not already underway. WMF is, as I understand it, still not in any privileged position of asserting or enforcing copyright (because it requires no copyright assignment from authors) -- but it can certainly make legal requirements clear, and also develop best practices that go beyond the legal minimum.
Warmly, Erik