On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort.
> https://arxiv.org/pdf/2205.10770.pdf
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/
My understanding so far is that encoding a verbatim copy is typically a symptom of 'overfitting': the model has effectively memorized specific training passages instead of generalizing from them.
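For what it's worth, this is easy to probe directly: prompt the model with the prefix of a passage you suspect is in the training set, and check whether it completes it verbatim. A minimal sketch in Python, assuming GPT-2 via Hugging Face transformers (the passage and expected completion below are purely illustrative, not a known memorized example):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # A candidate passage suspected to appear in the training corpus.
    prefix = "We the People of the United States, in Order to form"
    expected = " a more perfect Union"

    inputs = tokenizer(prefix, return_tensors="pt")
    # Greedy decoding: memorized text tends to surface once sampling
    # randomness is turned off.
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    continuation = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    print("model continuation:", repr(continuation))
    print("verbatim match?   ", continuation.startswith(expected))

If the model reliably emits the exact continuation for long, low-frequency passages, that's memorization rather than generalization.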
Verbatim memorization is considered a type of bug; it is undesirable for many reasons (technical, ethical, legal). Models are (supposed to be) trained in ways that prevent it as much as possible.
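One mitigation that has been studied (it's the subject of the arXiv paper linked above, if I recall correctly) is deduplicating the training corpus, since passages that repeat many times are far more likely to be memorized verbatim. A toy sketch of exact-match deduplication; real pipelines use near-duplicate detection over substrings, not whole documents:

    import hashlib

    def exact_dedupe(docs):
        """Drop exact duplicate documents, keeping first occurrences."""
        seen = set()
        unique = []
        for doc in docs:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    corpus = ["some web page", "another page", "some web page"]
    print(exact_dedupe(corpus))  # -> ['some web page', 'another page']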
Clearly there was still work to be done as of December 2020, at the least.
sincerely,
    Kim Bruning