Beyond the two extremes of making all content available verbatim or making none of it available verbatim, developers might want some content to be available verbatim and other content to be available only indirectly.

While AI systems can automatically determine which content is useful to store verbatim, if we want content authors to be able to provide hints, we could consider new HTML markup elements, clever uses of existing elements and attributes, or schema.org Web schemas.

Consider the following example, in which an HTML document author wants to hint that the topic sentence of a paragraph may be quoted verbatim while the remainder of that paragraph should be available only indirectly. The markup could resemble the following rough-draft sketch:

<p><span id="anchor123" role="quoteable">This is some text, a topic sentence.</span> This is a secondary sentence in the paragraph.</p>

This sketch combines some overlapping markup approaches. Perhaps all elements with IDs, that is, all URL-addressable content, should be considered verbatim quotable. Or perhaps some HTML attribute, e.g., role, could be of use. Again, schema.org Web schemas could also be of use.
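
As a rough illustration of how a consumer of such hints might behave, here is a minimal Python sketch that splits a document into text marked as quotable and everything else. It assumes the hypothetical role="quoteable" convention from the sketch above; that role value is not part of any standard, and the names QuotableExtractor and split_quotable are made up for this example. It uses only the standard library's html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical convention from the rough-draft sketch; not a standard role value.
QUOTABLE_ROLE = "quoteable"

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class QuotableExtractor(HTMLParser):
    """Collects text inside quotable elements separately from all other text."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a quotable element
        self.quotable = []    # list of (element id, text) pairs
        self.indirect = []    # text found outside any quotable element
        self._current_id = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        if self.depth > 0:
            # Nested element inside a quotable element: track depth only.
            self.depth += 1
        elif attr_map.get("role") == QUOTABLE_ROLE:
            self.depth = 1
            self._current_id = attr_map.get("id")
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            text = "".join(self._buffer).strip()
            self.quotable.append((self._current_id, text))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)
        elif data.strip():
            self.indirect.append(data.strip())


def split_quotable(html):
    """Return (quotable, indirect) for the given HTML string."""
    parser = QuotableExtractor()
    parser.feed(html)
    parser.close()
    return parser.quotable, parser.indirect


if __name__ == "__main__":
    sample = ('<p><span id="anchor123" role="quoteable">This is some text, '
              'a topic sentence.</span> This is a secondary sentence in the '
              'paragraph.</p>')
    quotable, indirect = split_quotable(sample)
    print(quotable)
    print(indirect)
```

Keeping the element's id next to the quoted text means a system could cite the fragment URL alongside any verbatim quotation. The same structure would work if the hint were carried by a different attribute or by the mere presence of an id.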

Also, I hope that you find the following discussion thread interesting: https://github.com/microsoft/semantic-kernel/discussions/108 about Educational Applications of AI in Web Browsers. There, I ask some questions about modern LLMs and APIs, about referring to documents by URLs in prompts, about prioritizing some documents over others when answering questions, and so forth. A “Web browser Copilot” would have educational applications. It could allow students to ask questions pertinent to the specific HTML, PDF, and EPUB documents that they are browsing and, perhaps, AI components could navigate to pages, scroll to content, and highlight document content for end-users while responding.

Best regards,

Adam Sobieski

On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
>
> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort.

> https://arxiv.org/pdf/2205.10770.pdf
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

My understanding so far is that encoding a verbatim copy is typically due to 'Overfitting'.

This is considered a type of bug. It is undesirable for many reasons
(technical, ethical, legal).

Models are (supposed to be) trained to prevent this as much as possible.

Clearly there was still work to be done as of December 2020, at the least.

sincerely,
Kim Bruning
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/5PNCR3KVBCEEKYT6I3J6VZKFE7NFIGB2/
To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org