23 March: Invitation to Open Community Call on ChatGPT, generative AI, and Wikimedia

List overview All Threads
Download

newer

older

Some Updates & Foundation’s...

Join the #WikiForHumanRights...

Yael Weissburg

9 Mar 2023 9 Mar '23

3:32 p.m.

Hi Everyone,

Last year, as part of our annual planning process, the Wikimedia Foundation shared a list of external trends https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Chief_Executive_Officer/Maryana%E2%80%99s_Listening_Tour/External_Trends that we believed were likely to significantly impact the context in which the Wikimedia movement operates. Our focus at the time was on the changing nature of search, the astronomical rise in the global demand for content, and rich media content in particular, and the concerning rise of misinformation and disinformation. We heard from many in our movement about additional trends that our movement faces that we didn’t include in that list, but that are critical to how we as a movement operate, including the de-prioritization of investigative journalism, and the damage to GLAM institutions wrought by the global pandemic.

As part of this year’s annual planning process, we set out to update that list. In particular, we’ve been tracking recent advancements in artificial intelligence (AI). In our recent Diff post on the topic, [1] we noted some risks as well as some potential opportunities for our movement as this technology continues to evolve. Since there has been a great deal of interest in and discussion about AI products like ChatGPT and what it means for Wikimedia over the past few months (including several threads on the topic on this mailing list), we’d love to explore this topic in more depth with you and continue the conversation about its implications for us as a free knowledge movement.

I’d like to invite you all to an open call on 23 March at 18:00 UTC (find your local time here) [2] where we can share reflections on the opportunities, risks, and questions we see raised by new AI tools and products.

The call will be held on Zoom. If you’re interested in joining, email answers@wikimedia.org and we will share the Zoom link with you via email. We will work to coordinate interpretation for languages where there are 3 or more interested community members; please email answers@wikimedia.org with interpretation requests as well.

For those who are unable to join the call, but interested in following and contributing to the conversation, we plan to share notes on our External Trends Meta page [3] afterward so that you can add your thoughts.

Whether in person or on-wiki, I hope you’ll share your ideas so that we can all get a broader understanding of the potential benefits and challenges of this emergent technology. Looking forward to the discussion!

Best,

Yael Weissburg

https://diff.wikimedia.org/2023/02/17/looking-outward-external-trends-in-202... 2.

https://zonestamp.toolforge.org/1679594401 3.

https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...

*Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

Attachments:

attachment.htm (text/html — 9.3 KB)

Show replies by date

Elena Lappen

21 Mar 21 Mar

6:23 p.m.

Hi everyone!

Reminder that this conversation is coming up on Thursday! We will have a member of the Foundation's Legal team with us to discuss possible legal implications, many of which have been raised on this list over the past few days. You can still register for the Zoom room by emailing answers@wikimedia.org. Notes will be shared after on the External Trends page on Meta https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends for those who want to participate asynchronously. Hope to see you there!

Best, Elena

Elena Lappen (she/her/hers)

Lead Movement Communications Specialist

Wikimedia Foundation https://wikimediafoundation.org/

On Thu, Mar 9, 2023 at 1:40 PM Yael Weissburg rweissburg@wikimedia.org wrote:

...

Hi Everyone,

Last year, as part of our annual planning process, the Wikimedia Foundation shared a list of external trends https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Chief_Executive_Officer/Maryana%E2%80%99s_Listening_Tour/External_Trends that we believed were likely to significantly impact the context in which the Wikimedia movement operates. Our focus at the time was on the changing nature of search, the astronomical rise in the global demand for content, and rich media content in particular, and the concerning rise of misinformation and disinformation. We heard from many in our movement about additional trends that our movement faces that we didn’t include in that list, but that are critical to how we as a movement operate, including the de-prioritization of investigative journalism, and the damage to GLAM institutions wrought by the global pandemic.

As part of this year’s annual planning process, we set out to update that list. In particular, we’ve been tracking recent advancements in artificial intelligence (AI). In our recent Diff post on the topic, [1] we noted some risks as well as some potential opportunities for our movement as this technology continues to evolve. Since there has been a great deal of interest in and discussion about AI products like ChatGPT and what it means for Wikimedia over the past few months (including several threads on the topic on this mailing list), we’d love to explore this topic in more depth with you and continue the conversation about its implications for us as a free knowledge movement.

I’d like to invite you all to an open call on 23 March at 18:00 UTC (find your local time here) [2] where we can share reflections on the opportunities, risks, and questions we see raised by new AI tools and products.

The call will be held on Zoom. If you’re interested in joining, email answers@wikimedia.org and we will share the Zoom link with you via email. We will work to coordinate interpretation for languages where there are 3 or more interested community members; please email answers@wikimedia.org with interpretation requests as well.

For those who are unable to join the call, but interested in following and contributing to the conversation, we plan to share notes on our External Trends Meta page [3] afterward so that you can add your thoughts.

Whether in person or on-wiki, I hope you’ll share your ideas so that we can all get a broader understanding of the potential benefits and challenges of this emergent technology. Looking forward to the discussion!

Best,

Yael Weissburg

https://diff.wikimedia.org/2023/02/17/looking-outward-external-trends-in-202... 2.
https://zonestamp.toolforge.org/1679594401
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...

*Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Samuel Klein

23 Mar 23 Mar

7:18 p.m.

Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

On Thu, Mar 9, 2023 at 4:40 PM Yael Weissburg rweissburg@wikimedia.org wrote:

...

Hi Everyone,

Last year, as part of our annual planning process, the Wikimedia Foundation shared a list of external trends https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Chief_Executive_Officer/Maryana%E2%80%99s_Listening_Tour/External_Trends that we believed were likely to significantly impact the context in which the Wikimedia movement operates. Our focus at the time was on the changing nature of search, the astronomical rise in the global demand for content, and rich media content in particular, and the concerning rise of misinformation and disinformation. We heard from many in our movement about additional trends that our movement faces that we didn’t include in that list, but that are critical to how we as a movement operate, including the de-prioritization of investigative journalism, and the damage to GLAM institutions wrought by the global pandemic.

As part of this year’s annual planning process, we set out to update that list. In particular, we’ve been tracking recent advancements in artificial intelligence (AI). In our recent Diff post on the topic, [1] we noted some risks as well as some potential opportunities for our movement as this technology continues to evolve. Since there has been a great deal of interest in and discussion about AI products like ChatGPT and what it means for Wikimedia over the past few months (including several threads on the topic on this mailing list), we’d love to explore this topic in more depth with you and continue the conversation about its implications for us as a free knowledge movement.

I’d like to invite you all to an open call on 23 March at 18:00 UTC (find your local time here) [2] where we can share reflections on the opportunities, risks, and questions we see raised by new AI tools and products.

The call will be held on Zoom. If you’re interested in joining, email answers@wikimedia.org and we will share the Zoom link with you via email. We will work to coordinate interpretation for languages where there are 3 or more interested community members; please email answers@wikimedia.org with interpretation requests as well.

For those who are unable to join the call, but interested in following and contributing to the conversation, we plan to share notes on our External Trends Meta page [3] afterward so that you can add your thoughts.

Whether in person or on-wiki, I hope you’ll share your ideas so that we can all get a broader understanding of the potential benefits and challenges of this emergent technology. Looking forward to the discussion!

Best,

Yael Weissburg

https://diff.wikimedia.org/2023/02/17/looking-outward-external-trends-in-202... 2.
https://zonestamp.toolforge.org/1679594401
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D...

*Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

-- Samuel Klein @metasj w:user:sj +1 617 529 4266

Lauren Worden

7:42 p.m.

On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...

Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW

Yael Weissburg

8:27 p.m.

Thank you, all, for such a great conversation! I'd love to make this something we do regularly, and wonder if there would be appetite for rotating hosts? I softly nominate the Basque community to host the next one!

One way or another, we'll find a way to make this more regular and will come back to this thread with updates.

Thank you, all!

Yael *Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

On Thu, Mar 23, 2023 at 12:43 PM Lauren Worden laurenworden89@gmail.com wrote:

...

On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...
Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Samuel Klein

9:10 p.m.

The Bau lab (that produced ROME) is great; see their update MEMIT https://memit.baulab.info scaling that approach.

On Thu, Mar 23, 2023 at 3:43 PM Lauren Worden laurenworden89@gmail.com wrote:

...

On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...
Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

-- Samuel Klein @metasj w:user:sj +1 617 529 4266

Paulo Santos Perneta

24 Mar 24 Mar

9:01 a.m.

Yes, please, make this a regular event, at least for the time being. These discussions are incredibly useful, given the speed the developments are happening in this area, and the complexity of the challenges we are facing due to them. And thank's a lot for organizing the meeting yesterday!

Paulo

Samuel Klein meta.sj@gmail.com escreveu no dia quinta, 23/03/2023 à(s) 21:11:

...

The Bau lab (that produced ROME) is great; see their update MEMIT https://memit.baulab.info scaling that approach.

On Thu, Mar 23, 2023 at 3:43 PM Lauren Worden laurenworden89@gmail.com wrote:

...
On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...
Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

-- Samuel Klein @metasj w:user:sj +1 617 529 4266 _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Yael Weissburg

27 Mar 27 Mar

8:02 p.m.

Hello again everyone,

Thanks again to those who made it to the call last week - it felt like such a luxury to be able to drop deeply into this subject for an hour (plus) with all of you.

For those who were unable to join, we captured extensive notes on Meta https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends/Community_call_notes. I hope we continue the vibrant discussion we started together on the Talk Page. Maybe someone can use that space to volunteer to host the next call? I know many folks are eager to continue the live discussion too.

I also wanted to share a few links / resources that might be useful (I'll add these to the Talk page as well):

- WMF's Legal team recently did a copyright analysis of ChatGPT. You can find that on Meta https://meta.wikimedia.org/wiki/Wikilegal/Copyright_Analysis_of_ChatGPT . - There is a proposed session on ChatGPT / generative AI for the Wikimedia Hackathon in May. You can find that on Phabricator https://phabricator.wikimedia.org/T333127.

Finally, a huge thank you to @Maryana Pinchuk mpinchuk@wikimedia.org who took the extensive and detailed notes on the call and also did a lot of "wrangling" behind the scenes to help draft the External Trends in the first place and get us to a point where we could have this discussion. Thank you, Maryana!

Feel free to reach out anytime to connect about this or other topics. I'll be in Belgrade for the EduWiki conference in May and Singapore for Wikimania - if you're coming to either of those events or in the area, let me know - I'd love to meet in person!

Best,

Yael

*Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

On Fri, Mar 24, 2023 at 2:02 AM Paulo Santos Perneta < paulosperneta@gmail.com> wrote:

...

Yes, please, make this a regular event, at least for the time being. These discussions are incredibly useful, given the speed the developments are happening in this area, and the complexity of the challenges we are facing due to them. And thank's a lot for organizing the meeting yesterday!

Paulo

Samuel Klein meta.sj@gmail.com escreveu no dia quinta, 23/03/2023 à(s) 21:11:

...
The Bau lab (that produced ROME) is great; see their update MEMIT https://memit.baulab.info scaling that approach.

On Thu, Mar 23, 2023 at 3:43 PM Lauren Worden laurenworden89@gmail.com wrote:

...
On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...
Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

-- Samuel Klein @metasj w:user:sj +1 617 529 4266 _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Lauren Worden

28 Mar 28 Mar

10:07 a.m.

Since proposals which don't fit in to existing discussions elsewhere are on topic here, I want to boldly recommend the following while the annual planning process is still ongoing, because it's far beyond the scope of what could be accomplished at a hackathon or on WMCS in a responsible fashion:

First, the Foundation should host a fork of BLOOM [ https://huggingface.co/bigscience/bloom ], which if I remember correctly was described by the Foundation's Machine Learning Director Chris Albon as the only LLM at the scale of GPT-3 adhering to the movement's FOSS criteria. This should be done under or alongside Toolforge on Wikimedia Cloud Services so that staff and volunteers alike may use its API and submit modification proposals for new instances. Presumably this would cost on the order of $100,000 per year per instance, according to https://huggingface.co/bigscience/bloom/discussions/161#63a33373b5fc9ab9f63d... but someone should double-check that math. I've tested BLOOM against a dozen of the uses shown around enwiki for GPT-3 and ChatGPT, and it seems to perform about as well. (You can use the Hosted Inference API version on Azure for free at the Huggingface URL.)

Secondly, the Foundation should sponsor staff-, grant-, affiliate-, and volunteer-run projects to replicate and extend the work on:

A. RARR [ https://arxiv.org/abs/2210.08726 ] and other methods of attribution and verification with goals aspiring to Wikipedia's standards of summarizing and citing sources in ways that can be independently verified.

B. ROME [ https://rome.baulab.info/ / MEMIT: https://memit.baulab.info/ ] and other approaches to knowledge editing in language models with the goal of producing simple interfaces to provide "language models that anyone can edit" and ideally coupled to Wikidata updates.

C. EditEval [ https://eval.ai/web/challenges/challenge-page/1866/overview ], an ongoing challenge competition to produce systems capable of automatically improving text, including its fluency, simplification, paraphrasing, neutralization, and updating information.

I apologize to those on Thursday's Zoom call who had proposals for ORES expansion to combat paid advocacy, images, audio, speech and video, as I don't remember enough of the details and there's not enough information at https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/D... to include them here. I hope the advocates will elucidate those proposals on list while the annual planning process is still in progress.

-LW

On Mon, Mar 27, 2023 at 1:04 PM Yael Weissburg rweissburg@wikimedia.org wrote:

...

Hello again everyone,

Thanks again to those who made it to the call last week - it felt like such a luxury to be able to drop deeply into this subject for an hour (plus) with all of you.

For those who were unable to join, we captured extensive notes on Meta https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends/Community_call_notes. I hope we continue the vibrant discussion we started together on the Talk Page. Maybe someone can use that space to volunteer to host the next call? I know many folks are eager to continue the live discussion too.

I also wanted to share a few links / resources that might be useful (I'll add these to the Talk page as well):

WMF's Legal team recently did a copyright analysis of ChatGPT. You

can find that on Meta https://meta.wikimedia.org/wiki/Wikilegal/Copyright_Analysis_of_ChatGPT .

There is a proposed session on ChatGPT / generative AI for the

Wikimedia Hackathon in May. You can find that on Phabricator https://phabricator.wikimedia.org/T333127.

Finally, a huge thank you to @Maryana Pinchuk mpinchuk@wikimedia.org who took the extensive and detailed notes on the call and also did a lot of "wrangling" behind the scenes to help draft the External Trends in the first place and get us to a point where we could have this discussion. Thank you, Maryana!

Feel free to reach out anytime to connect about this or other topics. I'll be in Belgrade for the EduWiki conference in May and Singapore for Wikimania - if you're coming to either of those events or in the area, let me know - I'd love to meet in person!

Best,

Yael

*Yael Weissburg* (she/her) VP, Partnerships, Programs & Grantmaking Wikimedia Foundation https://wikimediafoundation.org/ M: (+1) 415.513.6643 I work from San Francisco. My time zone is UTC -7/-8.

On Fri, Mar 24, 2023 at 2:02 AM Paulo Santos Perneta < paulosperneta@gmail.com> wrote:

...
Yes, please, make this a regular event, at least for the time being. These discussions are incredibly useful, given the speed the developments are happening in this area, and the complexity of the challenges we are facing due to them. And thank's a lot for organizing the meeting yesterday!

Paulo

Samuel Klein meta.sj@gmail.com escreveu no dia quinta, 23/03/2023 à(s) 21:11:

...
The Bau lab (that produced ROME) is great; see their update MEMIT https://memit.baulab.info scaling that approach.

On Thu, Mar 23, 2023 at 3:43 PM Lauren Worden laurenworden89@gmail.com wrote:

...
On Thu, Mar 23, 2023 at 12:20 PM Samuel Klein meta.sj@gmail.com wrote:

...
Thanks Yael and all for hosting this! A great conversation which we should revisit regularly.

Yes, I hope that this can be a (monthly?) regularly occurring event given the current state of very substantial advancements and improvements in the field.

I want to reiterate some links which I feel may be of considerable help to those trying to understand our situation:

RARR: https://arxiv.org/abs/2210.08726

ROME: https://rome.baulab.info/

ROME:

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

-- Samuel Klein @metasj w:user:sj +1 617 529 4266 _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Jan Ainali

6:30 p.m.

Den tis 28 mars 2023 kl 12:08 skrev Lauren Worden <laurenworden89@gmail.com

...

:

...

First, the Foundation should host a fork of BLOOM [ https://huggingface.co/bigscience/bloom ], which if I remember correctly was described by the Foundation's Machine Learning Director Chris Albon as the only LLM at the scale of GPT-3 adhering to the movement's FOSS criteria.

No, BLOOM is not FOSS by any means. It fails freedom 0 of the four freedoms from the Free Software Foundation[1], and it is not recognized as an open source license by the Open Source Institute (and will not be as it fails requirement 6 of the open source definition[2]). So that model, any other using the RAIL license, is a dead end.

/Jan

[1] https://www.gnu.org/philosophy/free-sw.html [2] https://opensource.org/osd/

Alek Tarkowski

29 Mar 29 Mar

2:06 p.m.

Hi,

(I’m Alek from Open Future Foundation, I largely lurk here, so I want to say “Hi everyone!” first).

Jan, you’re right that the RAIL license does not meet any FOSS definitions. But its authors, in their white paper, position this license not just as “responsible” but also “open”. And project like RAIL or BLOOM, connected with the HuggingFace company, aim to define a standard that fits the idea of responsible sharing. Looking in more detail, the behavioural use limitations in RAIL are ones that could probably be endorsed by Wikimedia, based on its Code of Conduct and other community norms.

My point is that it would be good to explore to what extent “openish” AI stacks can be a good fit for Wikimedia. I follow the conversation around open/responsible AI licensing and understand the need to not “dilute" FOSS licensing. But also appreciate that AI researchers are actively setting a standard that they think is right for AI. I think that their work should not be dismissed just because it’s not using one of the canonical open licenses.

BY the way, there will probably be, anytime soon, an LLM that is available under a “traditional” FOSS license. But for me that’s even more so a reason to consider different options, and be able to make an informed decision.

Best, Alek -- Director of Strategy, Open Future | openfuture.eu | +48 889 660 444 At Open Future, we tackle the Paradox of Open: paradox.openfuture.eu/

...

On 28 Mar 2023, at 20:30, Jan Ainali jan@aina.li wrote:

Den tis 28 mars 2023 kl 12:08 skrev Lauren Worden <laurenworden89@gmail.com mailto:laurenworden89@gmail.com>:

...
First, the Foundation should host a fork of BLOOM [ https://huggingface.co/bigscience/bloom ], which if I remember correctly was described by the Foundation's Machine Learning Director Chris Albon as the only LLM at the scale of GPT-3 adhering to the movement's FOSS criteria.

No, BLOOM is not FOSS by any means. It fails freedom 0 of the four freedoms from the Free Software Foundation[1], and it is not recognized as an open source license by the Open Source Institute (and will not be as it fails requirement 6 of the open source definition[2]). So that model, any other using the RAIL license, is a dead end.

/Jan

[1] https://www.gnu.org/philosophy/free-sw.html [2] https://opensource.org/osd/ _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Jan Ainali

8:36 p.m.

Hi Alek, nice to see you here!

On the contrary, I think it is important to, as early as possible, deter all these attempts to weaken the concept of "open" and that we as a movement need to take a hard stance against them. These proprietary licenses do not fit the spirit of sharing all knowledge and letting anyone do whatever they want with it. It's not that they are using licenses that currently are not approved by the leaders of the open source movement, it's that these licenses are fundamentally, and deliberately, constructed so that they will never be approved by these bodies.

But yes, we will probably soon see FOSS licensed LLMs, and from these, we can choose which ones we might want to help develop. Let's just wait for that day, rather than make a hasty and morally dubious bet on models available as of today.

Jan Ainali

Den ons 29 mars 2023 kl 21:50 skrev Alek Tarkowski alek@openfuture.eu:

...

Hi,

(I’m Alek from Open Future Foundation, I largely lurk here, so I want to say “Hi everyone!” first).

Jan, you’re right that the RAIL license does not meet any FOSS definitions. But its authors, in their white paper, position this license not just as “responsible” but also “open”. And project like RAIL or BLOOM, connected with the HuggingFace company, aim to define a standard that fits the idea of responsible sharing. Looking in more detail, the behavioural use limitations in RAIL are ones that could probably be endorsed by Wikimedia, based on its Code of Conduct and other community norms.

My point is that it would be good to explore to what extent “openish” AI stacks can be a good fit for Wikimedia. I follow the conversation around open/responsible AI licensing and understand the need to not “dilute" FOSS licensing. But also appreciate that AI researchers are actively setting a standard that they think is right for AI. I think that their work should not be dismissed just because it’s not using one of the canonical open licenses.

BY the way, there will probably be, anytime soon, an LLM that is available under a “traditional” FOSS license. But for me that’s even more so a reason to consider different options, and be able to make an informed decision.

Best, Alek -- Director of Strategy, Open Future | openfuture.eu | +48 889 660 444 At Open Future, we tackle the Paradox of Open: paradox.openfuture.eu/

On 28 Mar 2023, at 20:30, Jan Ainali jan@aina.li wrote:

Den tis 28 mars 2023 kl 12:08 skrev Lauren Worden < laurenworden89@gmail.com>:

...
First, the Foundation should host a fork of BLOOM [ https://huggingface.co/bigscience/bloom ], which if I remember correctly was described by the Foundation's Machine Learning Director Chris Albon as the only LLM at the scale of GPT-3 adhering to the movement's FOSS criteria.

No, BLOOM is not FOSS by any means. It fails freedom 0 of the four freedoms from the Free Software Foundation[1], and it is not recognized as an open source license by the Open Source Institute (and will not be as it fails requirement 6 of the open source definition[2]). So that model, any other using the RAIL license, is a dead end.

/Jan

[1] https://www.gnu.org/philosophy/free-sw.html [2] https://opensource.org/osd/ _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Lauren Worden

10:31 p.m.

On Wed, Mar 29, 2023 at 1:50 PM Jan Ainali jan@aina.li wrote:

...

I think it is important to, as early as possible, deter all these attempts to weaken the concept of "open" and that we as a movement need to take a hard stance against them. These proprietary licenses do not fit the spirit of sharing all knowledge and letting anyone do whatever they want with it.

Is the BLOOM RAIL license [ https://huggingface.co/spaces/bigscience/license ] proprietary? My understanding is that is not proprietary, and the only reason it doesn't qualify for Open Source Initiative approval is because of these use restrictions:

"You agree not to use the Model or Derivatives of the Model: (a) In any way that violates any applicable national, federal, state, local or international law or regulation; (b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; (c) To generate or disseminate verifiably false information with the purpose of harming others; (d) To generate or disseminate personal identifiable information that can be used to harm an individual; (e) To generate or disseminate information or content, in any context (e.g. posts, articles, tweets, chatbots or other kinds of automated bots) without expressly and intelligibly disclaiming that the text is machine generated; (f) To defame, disparage or otherwise harass others; (g) To impersonate or attempt to impersonate others; (h) For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation; (i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics (j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm; (k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories; (l) To provide medical advice and medical results interpretation; (m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use)."

Those restrictions seem very reasonable to me, and I would consider them an advantage given the problems the field is experiencing, including the threats to project content integrity. I don't see any drawbacks, and I see several advantages to encouraging such restrictions.

So I expect the BLOOM license would therefor qualify for an exception as described in https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use

There is further discussion of these issues at https://arxiv.org/pdf/2011.03116.pdf

-LW

Kimmo Virtanen

30 Mar 30 Mar

5:12 a.m.

Hi,

...

...
My understanding is that is not proprietary, and the only reason it

doesn't qualify for Open Source Initiative approval is because of these use restrictions:

...

To generate or disseminate information or content, in any context (e.g.

posts, articles, tweets, chatbots or other kinds of automated bots) without expressly

...

and intelligibly disclaiming that the text is machine generated

This makes it useless in most content-related use cases as it requires too much extra text to use the results.

About FOSS compatible LLMs, EleutherAI's GPT-J, NeoX, and Pythia and Cerebras-GPT are under Apache 2.0. The question is whether these models are good enough to be useful. However, the same question is relevant to Bloom too.

Br, -- Kimmo Virtanen, Zache

On Thu, Mar 30, 2023 at 3:34 AM Lauren Worden laurenworden89@gmail.com wrote:

...

On Wed, Mar 29, 2023 at 1:50 PM Jan Ainali jan@aina.li wrote:

...
I think it is important to, as early as possible, deter all these

attempts to weaken the concept of "open" and that we as a movement need to take a hard stance against them.

...
These proprietary licenses do not fit the spirit of sharing all

knowledge and letting anyone do whatever they want with it.

Is the BLOOM RAIL license [ https://huggingface.co/spaces/bigscience/license ] proprietary? My understanding is that is not proprietary, and the only reason it doesn't qualify for Open Source Initiative approval is because of these use restrictions:

"You agree not to use the Model or Derivatives of the Model: (a) In any way that violates any applicable national, federal, state, local or international law or regulation; (b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; (c) To generate or disseminate verifiably false information with the purpose of harming others; (d) To generate or disseminate personal identifiable information that can be used to harm an individual; (e) To generate or disseminate information or content, in any context (e.g. posts, articles, tweets, chatbots or other kinds of automated bots) without expressly and intelligibly disclaiming that the text is machine generated; (f) To defame, disparage or otherwise harass others; (g) To impersonate or attempt to impersonate others; (h) For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation; (i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics (j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm; (k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories; (l) To provide medical advice and medical results interpretation; (m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use)."

Those restrictions seem very reasonable to me, and I would consider them an advantage given the problems the field is experiencing, including the threats to project content integrity. I don't see any drawbacks, and I see several advantages to encouraging such restrictions.

So I expect the BLOOM license would therefor qualify for an exception as described in https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use

There is further discussion of these issues at https://arxiv.org/pdf/2011.03116.pdf

-LW _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Jan Ainali

6:19 a.m.

Den tors 30 mars 2023 kl 02:33 skrev Lauren Worden <laurenworden89@gmail.com

...

:

...

Is the BLOOM RAIL license [ https://huggingface.co/spaces/bigscience/license ] proprietary?

Yes. The common definition is that if it is not open source, it is proprietary. But you don't need to take my word for it.

...

So I expect the BLOOM license would therefor qualify for an exception as described in https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use

Point 3 in the "What uses of Cloud Services do we not like?":

*"Proprietary software*: Do not use or install any software unless the software is licensed under an Open Source license http://opensource.org/licenses/."

The Wikimedia Cloud terms of use even narrows it down to only Open Source Initiative approved licenses. So if not even CC0 licenses are allowed on Wikimedia Cloud (that license is only approved by the FSF, not by the OSI), for sure, the RAIL license is not allowed.

/Jan

Erik Moeller

6:04 a.m.

On Wed, Mar 29, 2023 at 1:49 PM Jan Ainali jan@aina.li wrote:

...

On the contrary, I think it is important to, as early as possible, deter all these attempts to weaken the concept of "open" and that we as a movement need to take a hard stance against them.

I agree with Jan on this. Licenses are the wrong tool for the job for which they're being used for here (regulating use of AI models).

One core principle in open source licenses is that you are not required to agree to the license in order to download or run copies. The GPL makes this explicit: "You are not required to accept this License in order to receive or run a copy of the Program." This is really important. I can download and run every bit of open source software in existence without ever agreeing to a single license.

Downloading a thing you make available doesn't give me the right to distribute it -- copyright law itself is sufficient to limit that. If you want to impose _additional restrictions_ on a person for stuff they download from you, that actually requires proactive agreement from the user to those restrictions at the time they download the thing.

If you don't obtain this agreement, you cannot meaningfully enforce the "license" because the downloader never agreed to it in the first place. Moreover, you'll have to make sure that _everyone else making copies of the file_ also obtains agreement from people getting those copies, or your whole house of cards falls down. Needless to say, this is totally incompatible with the way we distribute open source software.

To pick a concrete example, you can download the Stable Diffusion Weights here: https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/...

Did you agree to the Open Rail-M license? Nope, but you visited a public URL to download some model weights you can do stuff with. I cannot see any reasonable argument that you would be subject to the provision of the license when _running_ the model locally or on your own infrastructure.

To illustrate the point further, let's say I make "CoolCalculator.exe" available to you, you download and run it, and then I demand 500 dollars from you. Why 500 dollars? Well, my license requires that if you add sums greater than 1000 with my calculator, you owe me money. You didn't agree to the license? Tough! Shouldn't have downloaded my calculator!

In short, in my view, these attempts to embed ethical rulesets into licensing agreements are a "We did a thing" approach to ethics. They are of highly dubious enforceability and do nothing to deter bad actors, while making the technology legally incompatible with open source software.

Warmly, Erik

Lauren Worden

7:24 p.m.

On Thu, Mar 30, 2023 at 4:28 AM Erik Moeller eloquence@gmail.com wrote:

...

If you want to impose _additional restrictions_ on a person for stuff they download from you, that actually requires proactive agreement from the user to those restrictions at the time they download the thing.

If you don't obtain this agreement, you cannot meaningfully enforce the "license" because the downloader never agreed to it in the first place. Moreover, you'll have to make sure that _everyone else making copies of the file_ also obtains agreement from people getting those copies, or your whole house of cards falls down.

Isn't that exactly how we impose attribution and share-alike requirements of CC-BY-SA content?

On Thu, Mar 30, 2023 at 4:25 AM Kimmo Virtanen kimmo.virtanen@wikimedia.fi wrote:

...

...
To generate or disseminate information or content, in any context (e.g. posts, articles, tweets, chatbots or other kinds of automated bots) without expressly and intelligibly disclaiming that the text is machine generated

This makes it useless in most content-related use cases as it requires too much extra text to use the results.

I guess that the General Disclaimer could serve to fulfill that requirement.

...

About FOSS compatible LLMs, EleutherAI's GPT-J, NeoX, and Pythia and Cerebras-GPT are under Apache 2.0. The question is whether these models are good enough to be useful. However, the same question is relevant to Bloom too.

I have no particular affinity to BLOOM, but I have been able to personally test that it is capable of at least a dozen different use cases that people have shown GPT-3 and ChatGPT can be used for on enwiki. My promotion of leveraging it is for the strictly utilitarian purpose of providing an infrastructure to work on the problems which seem to have the greatest risk to project content if not addressed.

I would prefer a more widely multilingual model trained on all of the Foundation content suitable for that purpose, but training such models is a much more expensive proposition than merely using them.

-LW

Erik Moeller

31 Mar 31 Mar

6:55 a.m.

On Thu, Mar 30, 2023 at 12:25 PM Lauren Worden laurenworden89@gmail.com wrote:

...

...
If you don't obtain this agreement, you cannot meaningfully enforce the "license" because the downloader never agreed to it in the first place. Moreover, you'll have to make sure that _everyone else making copies of the file_ also obtains agreement from people getting those copies, or your whole house of cards falls down.

...

Isn't that exactly how we impose attribution and share-alike requirements of CC-BY-SA content?

Not exactly. CC-BY-SA gives Wikimedia readers permissions they would not otherwise have (e.g., to distribute copies), and it ties those permissions to certain obligations (e.g., attribution). Readers who do not wish to exercise those additional permissions are not required to adhere to the obligations. They'd just be limited to what copyright law lets you do with content you download from a public website. Nobody can stop you from making your own offline version of Wikipedia, calling it "Bobbypedia", and removing all other attribution -- as long as you keep it to yourself.

To be sure, you can put restrictions in an AI model license that kick in for folks distributing the model, which is something they wouldn't legally be able to do without consulting and agreeing to the licensing terms. But, crucially, you don't have to distribute an AI model to run it. Most of the unethical uses folks tend to worry about (e.g., bulk generation of misinformation) do not involve distributing copies of the model, only of its output.

If you want to impose ethical use restrictions on people running your AI models, you have two choices: You can require everyone getting a copy of the model by any means to explicitly agree to those restrictions (presumably Facebook does this when distributing LLaMA to researchers), or you can make your model freely available and protest ineffectually when a downloader ignores the restrictions you've spelled out in a textfile in your repository. Neither approach is compatible with open source.

...

I have no particular affinity to BLOOM, but I have been able to personally test that it is capable of at least a dozen different use cases that people have shown GPT-3 and ChatGPT can be used for on enwiki.

I think it's fine to explore all sorts of models, free and nonfree, for the purpose of assessing capabilities and mitigating risks. When it comes to deployment of models in a production context, IMO Wikimedia should exclude from consideration any models under ill-conceived "ethical use" licenses.

Warmly, Erik

Chris Keating

1 Apr 1 Apr

4:35 p.m.

On Fri, Mar 31, 2023 at 3:05 PM Erik Moeller eloquence@gmail.com wrote:

...

On Thu, Mar 30, 2023 at 12:25 PM Lauren Worden laurenworden89@gmail.com wrote:

...
...
If you don't obtain this agreement, you cannot meaningfully enforce the "license" because the downloader never agreed to it in the first place. Moreover, you'll have to make sure that _everyone else making copies of the file_ also obtains agreement from people getting those copies, or your whole house of cards falls down.

...
Isn't that exactly how we impose attribution and share-alike requirements of CC-BY-SA content?

Not exactly. CC-BY-SA gives Wikimedia readers permissions they would not otherwise have (e.g., to distribute copies), and it ties those permissions to certain obligations (e.g., attribution). Readers who do not wish to exercise those additional permissions are not required to adhere to the obligations. They'd just be limited to what copyright law lets you do with content you download from a public website. Nobody can stop you from making your own offline version of Wikipedia, calling it "Bobbypedia", and removing all other attribution -- as long as you keep it to yourself.

To be sure, you can put restrictions in an AI model license that kick in for folks distributing the model, which is something they wouldn't legally be able to do without consulting and agreeing to the licensing terms. But, crucially, you don't have to distribute an AI model to run it. Most of the unethical uses folks tend to worry about (e.g., bulk generation of misinformation) do not involve distributing copies of the model, only of its output.

This is perhaps a bit academic, but this is not really the case, at least in UK copyright law.

The 'copying' inherent in viewing a web page is permissible under two grounds: 1) there is a statutory exemption in copyright law for this specific activity (in section 28a of the Copyright, Designs and Patents Act 1988, if anyone cares to look it up ;) ). This would likely not apply to details of AI models as the exemption excludes 'a computer program or a database'. Whether it would apply to Bobbypedia depends on whether it counts as a database (strikes me as arguable). 2) there is probably an implicit licence granted by whoever publishes the work for whoever views it to use it. The scope of this implicit licence is highly debatable and probably extremely limited. Do you have an implied licence to download the HTML of a webpage into your browser cache and use your web browser to render the page and display the resulting content? Very likely. Do you have an implied licence to save a PDF copy onto your hard drive? Maybe. Do you have an implied licence to use the page to create a personal AI model and distribute the output? That is very unclear, probably not. Perhaps less likely if there was also an explicit license attached to the page.

Chris

petr.kadlec＠gmail.com

31 Mar 31 Mar

3:33 p.m.

Hi,

On Thu, Mar 30, 2023 at 1:28 PM Erik Moeller eloquence@gmail.com wrote:

...

One core principle in open source licenses is that you are not required to agree to the license in order to download or run copies. The GPL makes this explicit: "You are not required to accept this License in order to receive or run a copy of the Program." This is really important. I can download and run every bit of open source software in existence without ever agreeing to a single license.

Downloading a thing you make available doesn't give me the right to distribute it -- copyright law itself is sufficient to limit that. If you want to impose _additional restrictions_ on a person for stuff they download from you, that actually requires proactive agreement from the user to those restrictions at the time they download the thing.

I’m not saying this is wrong in all jurisdictions, but it is definitely not correct in at least some of them…

Specifically, per the Czech copyright law, an act of downloading some copyrighted work is restricted by copyright, as it is (obviously?) copying (“reproduction”) of the work, which is (obviously?) covered by copyright.

There is an exception by which you are specifically allowed to copy some copyrighted works “for personal needs by a natural person without seeking to achieve direct or indirect economic benefit” but this exception does not apply to computer programs and electronic databases. Downloading computer programs and electronic databases (and downloading for purposes outside the listed exception) requires an express consent of the copyright holder, i.e. a license. In other words, you _cannot_ download a GPL program without agreeing to the GPL (which, as you wrote, allows that to anyone without further conditions, so that’s not a problem as far as downloading and running the program goes).

-- [[cs:User:Mormegil | Petr Kadlec]]

Erik Moeller

5:50 p.m.

On Fri, Mar 31, 2023 at 8:38 AM petr.kadlec@gmail.com wrote:

...

Downloading computer programs and electronic databases (and downloading for purposes outside the listed exception) requires an express consent of the copyright holder, i.e. a license. In other words, you _cannot_ download a GPL program without agreeing to the GPL

The act of downloading a copyrighted work is, of course, covered by copyright. But it does not follow that by downloading a work, you agree to whatever terms the person offering it imagines you agreed to.

If you want them to agree to those terms, you have to obtain that agreement. Otherwise, if you publish your work freely (i.e. with obvious intent to publish, not in some hidden directory on your webserver), the permission to download the work is implied by you publishing it. Or to put it another way, you can't publish and advertise a website and then make a credible demand for 500 dollars from anyone who clicks the link. Want 500 dollars? Ask for it on a clickthrough form that makes it obvious what the buyer pays for. Want people to agree to your ethical AI use restrictions? Ask for it before you give them your model weights.

Website terms of use are a gray area, but their enforceability is limited (beyond defending your right to refuse service by blocking a person from visiting your site) if you've not made their acceptance sufficiently explicit.

IANAL, so ask a lawyer if you don't believe me. :)

Warmly, Erik

Lauren Worden

1 Apr 1 Apr

8:21 a.m.

Erik, I see your point now and agree with you. But doesn't it seem like obtaining a perfect license is at present the enemy of the urgent good of bringing a concerted effort to bear on problems that are clearly detrimental to project integrity?

I haven't been able to tell whether any of the people training truly FOSS LLMs are even working on models the size (in parameters or of the context window) of GPT-3. The cost of training such models is falling rapidly with various advances, but it might never fall below the several million dollar range.

How would you characterize the harm of hosting BLOOM until a comparable FOSS model is available? Alternatively, is there a partnership solution to this problem within the Foundation's budget constraints?

-LW

On Fri, Mar 31, 2023 at 10:51 AM Erik Moeller eloquence@gmail.com wrote:

...

On Fri, Mar 31, 2023 at 8:38 AM petr.kadlec@gmail.com wrote:

...
Downloading computer programs and electronic databases (and downloading for purposes outside the listed exception) requires an express consent of the copyright holder, i.e. a license. In other words, you _cannot_ download a GPL program without agreeing to the GPL

The act of downloading a copyrighted work is, of course, covered by copyright. But it does not follow that by downloading a work, you agree to whatever terms the person offering it imagines you agreed to.

If you want them to agree to those terms, you have to obtain that agreement. Otherwise, if you publish your work freely (i.e. with obvious intent to publish, not in some hidden directory on your webserver), the permission to download the work is implied by you publishing it. Or to put it another way, you can't publish and advertise a website and then make a credible demand for 500 dollars from anyone who clicks the link. Want 500 dollars? Ask for it on a clickthrough form that makes it obvious what the buyer pays for. Want people to agree to your ethical AI use restrictions? Ask for it before you give them your model weights.

Website terms of use are a gray area, but their enforceability is limited (beyond defending your right to refuse service by blocking a person from visiting your site) if you've not made their acceptance sufficiently explicit.

IANAL, so ask a lawyer if you don't believe me. :)

Warmly, Erik _______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Jan Ainali

12:04 p.m.

Den lör 1 apr. 2023 kl 10:21 skrev Lauren Worden laurenworden89@gmail.com:

...

How would you characterize the harm of hosting BLOOM until a comparable FOSS model is available?

There are a few risks that could be harmful, although I don't think they are neither certain nor very direct. But, if we do give up our principle of only using and allowing open source (that we held for over 20 years) here are a few of these risks. The first I come to think of is that we alienate the volunteers that hold these ideals high (remember the mp4 vote on Commons in 2014). Second, we dilute the concept of "open" by implying that these usage restricting licenses are just as good as FOSS licenses. Third, we are not being a good role model for the rest of the open movement.

With these risks in mind, I would much rather we wait for a FOSS model, not rushing on a hype train, even if it is not as powerful.

/Jan

Erik Moeller

9:35 p.m.

Lauren:

...

Erik, I see your point now and agree with you. But doesn't it seem like obtaining a perfect license is at present the enemy of the urgent good of bringing a concerted effort to bear on problems that are clearly detrimental to project integrity?

I don't think the licensing question matters for purposes of evaluation of third party APIs (including providing access to Wikimedia volunteers to participate in such evaluations), but I would personally draw the line when it comes to something like a Wikimedia Cloud Infrastructure installation. Spending a lot of money on compute infrastructure to run a proprietary model strikes me as clearly out of scope for the Wikimedia mission.

Openly licensed models for machine translation like Facebook's M2M (https://huggingface.co/facebook/m2m100_418M) or text generation like Cerebras-GPT-13B (https://huggingface.co/cerebras/Cerebras-GPT-13B) and GPT-NeoX-20B (https://huggingface.co/EleutherAI/gpt-neox-20b) seem like better targets for running on Wikimedia infrastructure, if there's any merit to be found in running them at this stage.

Note that Facebook's proprietary but widely circulated LLaMA model has triggered a lot of work on dramatically improving performance of LLMs through more efficient implementations, to the point that you can run a decent quality LLM (and combine it with OpenAI's freely licensed voice detection model) on a consumer grade laptop:

https://github.com/ggerganov/llama.cpp

While I'm not sure if the "hallucination" problem is tractable when all you have is an LLM, I am confident (based on, e.g., the recent results with Alpaca: https://crfm.stanford.edu/2023/03/13/alpaca.html) that the performance of smaller models will continue to increase as we find better ways to train, steer, align, modularize and extend them.

Chris:

...

there is probably an implicit licence granted by whoever publishes the work for whoever views it to use it.

Here's a link to the Stable Diffusion (image generation) model weights from their official repository. Note the lack of any licensing statement or clickthrough agreement when directly downloading the weights.

https://huggingface.co/stabilityai/stable-diffusion-2-base/resolve/main/512-...

Are you infringing Stability AI's copyright by clicking this link? If not, are you infringing Stability AI's copyright by then writing a Python script that uses this file to generate images, if you only run it locally on your GPU?

Even if a court answers either question with "yes", it still does not follow that you are bound by any other licensing terms Stability AI is attaching to those files, a license which you never agreed to when clicking the link.

But this discussion highlights the fundamental difference between free licenses like CC-BY-SA/GPL and nonfree "ethical use" licenses like OpenRail-M. If you want to enforce your ethical use restrictions without a clickthrough agreement, you have no choice but to adopt an expansive definition of copyright infringement. This is somewhat ironic, given that the models themselves are trained on vast amounts of copyrighted data without permission.

Warmly, Erik

rupert THURNER

2 Apr 2 Apr

5:17 a.m.

On Sat, Apr 1, 2023 at 11:36 PM Erik Moeller eloquence@gmail.com wrote:

...

Openly licensed models for machine translation like Facebook's M2M (https://huggingface.co/facebook/m2m100_418M) or text generation like Cerebras-GPT-13B (https://huggingface.co/cerebras/Cerebras-GPT-13B) and GPT-NeoX-20B (https://huggingface.co/EleutherAI/gpt-neox-20b) seem like better targets for running on Wikimedia infrastructure, if there's any merit to be found in running them at this stage.

Note that Facebook's proprietary but widely circulated LLaMA model has triggered a lot of work on dramatically improving performance of LLMs through more efficient implementations, to the point that you can run a decent quality LLM (and combine it with OpenAI's freely licensed voice detection model) on a consumer grade laptop:

https://github.com/ggerganov/llama.cpp

While I'm not sure if the "hallucination" problem is tractable when all you have is an LLM, I am confident (based on, e.g., the recent results with Alpaca: https://crfm.stanford.edu/2023/03/13/alpaca.html) that the performance of smaller models will continue to increase as we find better ways to train, steer, align, modularize and extend them.

to host open models like above would be really cool for multiple reasons, the most important one to bring back the openess into the training, besides the many voices out of the movement considering various social aspects one would never have the idea of otherwise.

rupert

Lauren Worden

6:48 a.m.

On Sat, Apr 1, 2023 at 10:18 PM rupert THURNER rupert.thurner@gmail.com wrote:

...

On Sat, Apr 1, 2023 at 11:36 PM Erik Moeller eloquence@gmail.com wrote:

...
... I am confident (based on, e.g., the recent results with Alpaca: https://crfm.stanford.edu/2023/03/13/alpaca.html) that the performance of smaller models will continue to increase as we find better ways to train, steer, align, modularize and extend them.

to host open models like above would be really cool for multiple reasons, the most important one to bring back the openess into the training....

Wow! While Alpaca is English only and released under CC-NC-BY, it does seem like it's very easily replicated with a wide context window and could probably be made widely multilingual beyond the performance of GPT-3.5 for less than it would cost to merely host BLOOM for a few months. This shocked me and of course I take back what I said about requiring several million dollars.

https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-c... https://huggingface.co/databricks/dolly-v1-6b https://github.com/tatsu-lab/stanford_alpaca

What kind of hardware should WMCS buy to support such a project?

On Sat, Apr 1, 2023 at 2:36 PM Erik Moeller eloquence@gmail.com wrote:

...

... I'm not sure if the "hallucination" problem is tractable when all you have is an LLM

I disagree, which is why I have been pushing RARR and ROME. RARR seeks to use the same principles of WP:V to eliminate hallucination, requiring confirmation from verifiable sources, which can be limited to e.g. those approved by WP:RSP, and cited in a way that readers can independently verify. I've been posting links to the RARR paper which doesn't go very deep on some of those points, but here's an hour-long presentation by one of the authors which is a lot meatier on such topics: https://www.youtube.com/watch?v=d45Ms8LmF5k And here's a Twitter thread which is more accessible to those less familiar with similar literature: https://twitter.com/kelvin_guu/status/1582714222080688133

Once an attribution and verification system like RARR has identified inaccuracies and hallucinations, the ROME/MEMIT method of editing the models directly can eliminate them completely, and in a way that also eliminates similar generalized mistakes; please see: "Rank-One Editing of Encoder-Decoder Models" https://arxiv.org/abs/2211.13317

I can't believe that the large AI labs aren't working harder on these efforts than they've been letting on. Either they aren't or they are in an uncharacteristically secretive fashion, which would suggest they want to exploit such advances as proprietary trade secrets. In either case, it's vital that fully open organizations like the Foundation get involved quickly. There is reason to believe the latter case, because Google Bard uses a much less rigorous form of attribution and verification (probably based on SPARROW, https://arxiv.org/abs/2209.14375) but it actually causes its hallucinations to get worse e.g. in https://i.redd.it/f30u9n0gn9pa1.png If you watch the RARR video towards the end, Dr. Lao indicates they encountered similar issues but were able to eliminate almost all of them.

-LW

Erik Moeller

3 Apr 3 Apr

12:39 a.m.

Lauren:

...

What kind of hardware should WMCS buy to support such a project?

I can't comment on the hardware requirements, but I would note that in addition to the llama.cpp repository (https://github.com/ggerganov/llama.cpp), which currently focuses on LLaMA/Alpaca, there are other efforts to reduce the computational requirements for running LLMs. https://github.com/NolanoOrg/cformers looks promising and supports many of the open models. Fabrice Bellard of FFmpeg fame was one of the first implementers of a highly optimized LLM at https://textsynth.com/; sadly much of the work is proprietary (there is a binary-only distribution for CPU under MIT license).

https://textsynth.com/playground.html remains one of the most accessible ways to explore the performance of the open models with only a rate limitation, and no requirement to purchase credits.

...

I disagree, which is why I have been pushing RARR and ROME.

That approach does look promising, thanks for continuing to point it out. The publicly available implementations I've seen of citing sources (Google/Microsoft/You.com/character.ai) still hallucinate heavily and attribute claims to the sources that they don't make, but if an LLM can be steered to cite reliable sources and not fabricate information that isn't in the sources, that would be a huge win!

Chris:

...

However, not clicking a button to indicate acceptance of terms and conditions does not mean I can do whatever I want. It means that I either have to find other evidence of an explicit licence (maybe text on the document?), or consider whether there is an implicit licence or exemption. An implicit licence might well exist but is qite likely to be minimal in scope.

I broadly agree with your reasoning. My core argument is that using a software license to restrict "ethical use", without requiring clickthrough agreements, is either unenforceable in many cases, or it is enforceable only in a manner that otherwise hollows out consumer rights. That's because it is a _reasonable expectation_ that, at the very least, if you give me a piece of software without any obvious restrictions attached to it, I can run that piece of software (by virtue of implied license, exemption, or both).

In my view, a fundamental goal of free licenses is to expand and protect user rights for covered works. Both the GPL and CC-BY-SA make it explicit that none of their provisions are intended to limit fair use / fair dealing, and as previously noted, the GPL even says explicitly that you can receive and run GPL software without accepting the GPL itself.

To fully enforce the "ethical use" provisions of OpenRail-M (which are the raison d'être for the license) requires the licensor to argue that, if you legally obtained a copy from them without ever seeing the license, you are not permitted to run the software. That is a user-hostile, bait-and-switch approach to software licensing, regardless of whether the courts ultimately uphold it.

I suspect it may never be tested, because most organizations that adopt these licenses do so as a matter of reputational risk mitigation, not with actual intent to enforce these provisions -- which would be often be very impractical for other reasons (identifying the bad actor, proving that the model was used, etc.).

Warmly, Erik

Lauren Worden

5:40 a.m.

On Sun, Apr 2, 2023 at 5:40 PM Erik Moeller eloquence@gmail.com wrote:

...

I can't comment on the hardware requirements, but I would note that in addition to the llama.cpp repository (https://github.com/ggerganov/llama.cpp), which currently focuses on LLaMA/Alpaca, there are other efforts to reduce the computational requirements for running LLMs. https://github.com/NolanoOrg/cformers looks promising and supports many of the open models. Fabrice Bellard of FFmpeg fame was one of the first implementers of a highly optimized LLM at https://textsynth.com/ ; sadly much of the work is proprietary

At this point I guess I would recommend adding five or so g2.cores8.ram36.disk20 flavor VPSs to WMCS, with between one and three RTX A6000 GPUs each, plus a 1TB SSD each, which should cost under $60k. That should allow for very widely multilingual models somewhere between GPT-3.5 and 4 performance with current training rates.

...

https://textsynth.com/playground.html remains one of the most accessible ways to explore the performance of the open models with only a rate limitation, and no requirement to purchase credits.

There is are free Alpaca-30b demos for comparison at https://github.com/deep-diver/Alpaca-LoRA-Serve And free Alpaca-7b online at https://chatllama.baseten.co/

These models can be quantized into int4 weights which run on cell phones: https://github.com/rupeshs/alpaca.cpp/tree/linux-android-build-support It seems inevitable that we will someday include such LLMs with Internet-in-a-Box, and, why not also the primary mobile apps so we don't have to give away CPU utilization?

There is a proposal to allow apps over 4GB in WASM: https://github.com/WebAssembly/memory64/blob/master/proposals/memory64/Overv... At the rate things are improving maybe that won't even be neeedd to make a reasonable static web app, someday.

-LW

Samuel Klein

5:42 p.m.

...

At this point I guess I would recommend adding five or so g2.cores8.ram36.disk20 flavor VPSs to WMCS, with between one and three RTX A6000 GPUs each, plus a 1TB SSD each, which should cost under $60k. That should allow for very widely multilingual models somewhere between GPT-3.5 and 4 performance with current training rates.

Having part of the cluster for this makes sense, even as what it is used for changes over time.

...

These models can be quantized into int4 weights which run on cell phones: https://github.com/rupeshs/alpaca.cpp/tree/linux-android-build-support It seems inevitable that we will someday include such LLMs with Internet-in-a-Box, and, why not also the primary mobile apps

Eventually, yes. A good reason to renew attention to mobile as a canonical wiki experience.

Kimmo Virtanen

18 Apr 18 Apr

6:20 p.m.

Some interesting updates on current developments

Open Assistant is open source ChatGPT clone with crowdsourced fine-tuning - https://open-assistant.io

Redpajama is project for reproducing LLaMA and releasing the model under open source licence. Current status is that they have released the pre-training data - https://www.together.xyz/blog/redpajama

Free-dolly is CC-BY-SA licenced fine-tuning dataset. - https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-via...

LLM chat which runs in web browser (non-open source Vicuna-7B) - https://simonwillison.net/2023/Apr/16/web-llm/

Afaik all of these are clear steps towards full open source LLM stack. Especially Open Assistant is especially interesting as it is focusing on crowdsourcing.

Br, -- Kimmo Virtanen, Zache

On Mon, Apr 3, 2023 at 8:43 PM Samuel Klein meta.sj@gmail.com wrote:

...

At this point I guess I would recommend adding five or so

...
g2.cores8.ram36.disk20 flavor VPSs to WMCS, with between one and three RTX A6000 GPUs each, plus a 1TB SSD each, which should cost under $60k. That should allow for very widely multilingual models somewhere between GPT-3.5 and 4 performance with current training rates.

Having part of the cluster for this makes sense, even as what it is used for changes over time.

...
These models can be quantized into int4 weights which run on cell phones: https://github.com/rupeshs/alpaca.cpp/tree/linux-android-build-support It seems inevitable that we will someday include such LLMs with Internet-in-a-Box, and, why not also the primary mobile apps

Eventually, yes. A good reason to renew attention to mobile as a canonical wiki experience.

Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org

Chris Keating

2 Apr 2 Apr

9:17 a.m.

...

Are you infringing Stability AI's copyright by clicking this link? If not, are you infringing Stability AI's copyright by then writing a Python script that uses this file to generate images, if you only run it locally on your GPU?

Even if a court answers either question with "yes", it still does not follow that you are bound by any other licensing terms Stability AI is attaching to those files, a license which you never agreed to when clicking the link.

I don't entirely follow, I'm afraid.

Say I take a document from the Internet and then use it to do something (anything).

I am very likely to have done something which is 'copying' in legal terms, either in the act of downloading it, or the act of using it, or both.

What is my legal basis for doing this? Given that there is copyright in the work and I have copied it, I must have a legal basis for doing this, I can't just wave my hands and say it's fine. In common law legal systems the answer is likely to be one (or more) of; - there is a statutory basis for my doing so (perhaps the one I mentioned earlier in the thread or a 'fair dealing' exemption) - there is an explicit licence attached to the document defined somewhere by the creator (either a general one or something created by a contract between myself and the creator) - there is an implicit licence attached to the document defined by the creator's actions in the light of their reasonable expectations of others' actions, e.g. if you publish a document on the internet you very probably imply a licence to perform such copying as is actually needed to read the document

If I have clicked a button to indicate acceptance of some terms and conditions, then very likely those terms contains an explicit licence I can rely on.

However, not clicking a button to indicate acceptance of terms and conditions does not mean I can do whatever I want. It means that I either have to find other evidence of an explicit licence (maybe text on the document?), or consider whether there is an implicit licence or exemption. An implicit licence might well exist but is quite likely to be minimal in scope.

If I then re-copy the material and then republish it, that would be a further act of copying, and I would have to answer the same questions. The fact of republication does not fundamentally change anything, though it may be handled differently in the different exemptions or licences I am probably relying on. An implicit licence is less likely to exist the further I manipulate something, as that is probably further removed from the copyright holder's original expectations.

Chris

583

Age (days ago)

623

Last active (days ago)

wikimedia-l@lists.wikimedia.org

30 comments

12 participants

tags (0)

participants (12)

Alek Tarkowski
Chris Keating
Elena Lappen
Erik Moeller
Jan Ainali
Kimmo Virtanen
Lauren Worden
Paulo Santos Perneta
petr.kadlec＠gmail.com
rupert THURNER
Samuel Klein
Yael Weissburg