Hi everyone. I have seen some of the reactions to the narratives generated by ChatGPT. There is an obvious question (to me at least) as to whether a Wikipedia chatbot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF, but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT's open-source licensing is compatible with movement norms, should the WMF include that UI in the product?
My other question is about the corpus that OpenAI is using to train the bot. It creates very fluent narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
And to my earlier question: if GPT were trained exclusively on Wikipedia, would that help abate the false narratives?
This is a significant matter for the community, and seeing us step up to it would be very encouraging.
Best regards,
Victoria Coleman
Hello Victoria,
Thank you for the great question!
In my humble opinion, ChatGPT is far from producing useful Wikipedia content. You can see my own experience here: https://youtu.be/zKPEyxYt5kg
But anyone who wants to use the existing AI website(s) may use the AI at their pleasure and copy content from it. In the end, it is the individual editor who is responsible for her edits.
Should we include AI in the user interface of Wikipedia? I tend to say no. But then I think about automatic translation services: these are very good nowadays, and I would actually welcome one being integrated into the Wikipedia translation tool! Of course, the human editor MUST ALWAYS check the translation with her own eyes. But the integration into the translation tool would be very welcome.
There is resistance against the inclusion of automatic translation, because it would make it easier for lazy editors to abuse it (by not checking the translations personally).
And that is my objection against the integration of AI text production into Wikipedia's website: it would make it too easy for lazy editors to add dubious content.
(I know it may seem contradictory that I welcome automatic translation but not AI text production, but that is partially due to the specific structure of the translation tool.)
At the moment, AI texts often look excellent but are very unreliable. And that is what makes them so dangerous.
Kind regards, User:Ziko
P.S.: One example from today's playing with ChatGPT. Who was responsible for the 1933 Reichstag fire? According to the AI, the National Socialists, and there is proof of that. - Oh? I had learned that historians are still arguing. So I asked the AI: what is the proof? - And the AI gave me some motives of the National Socialists, but no proof. Instead, the AI offered that "Georg Irminger" was a National Socialist involved in the fire, according to his own confession, though that confession might have been made under torture. - I wondered about the name and Googled it. Google knows of several people named Georg(e) Irminger, but all of them died before 1933. I told the AI that Georg Irminger does not exist! - The AI apologized for giving me wrong information. Instead, a certain Georg Elser was involved in the fire, according to his own confession, though that confession might have been made under torture.
Funny aftermath: I mentioned this conversation in a Facebook group, "Digital history" (in German). One person answered: "But no, Georg Elser was not related to the fire; he later tried to shoot Hitler!" (Georg Elser did not try to shoot anyone; he tried to kill Hitler with a bomb in 1939.)
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
On Thu, Dec 29, 2022 at 4:09 PM Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
> Hi everyone. I have seen some of the reactions to the narratives generated by ChatGPT. There is an obvious question (to me at least) as to whether a Wikipedia chatbot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF, but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT's open-source licensing is compatible with movement norms, should the WMF include that UI in the product?
This is a cool idea, but what would the goals of developing a Wikipedia-specific generative AI be? IMO it would be nice to have natural-language search right in Wikipedia that could return factual answers, not just links to our (often too long) articles.
OpenAI's models aren't open source, by the way. Some of the products are free to use right now, but their business model is to charge for API use, etc., so including them directly in Wikipedia is pretty much a non-starter.
> My other question is around the corpus that OpenAI is using to train the bot. It is creating very fluid narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
They're almost certainly using Wikipedia. The answer from ChatGPT itself is:
"ChatGPT is a chatbot model developed by OpenAI. It was trained on a dataset of human-generated text, including data from a variety of sources such as books, articles, and websites. It is possible that some of the data used to train ChatGPT may have come from Wikipedia, as Wikipedia is a widely-used source of information and is likely to be included in many datasets of human-generated text."
> And to my earlier question, if GPT were to be trained on Wikipedia exclusively, would that help abate the false narratives?
Who knows? We would have to develop our own models to test this idea.
> This is a significant matter for the community, and seeing us step up to it would be very encouraging.
Thank you Ziko and Steven for the thoughtful responses.
My sense is that for a class of readers, having a generative UI that returns an answer vs. an article would be useful. It would probably put Quora out of business. :-)
If the models are not open source, this would indeed require developing our own models. For that kind of investment, we would probably want more application areas: translation, which Ziko already pointed out, but also summarization. These kinds of information retrieval queries would effectively index into specific parts of an article vs. returning the whole thing.
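The idea of indexing into a specific part of an article instead of returning the whole page can be sketched without any machine learning at all. The snippet below is a toy illustration only: the article sections, section names, and the TF-IDF-style scoring are all invented for the example and do not reflect any real Wikipedia API or product.

```python
# Toy passage retrieval: given an article split into sections, return
# the section that best matches a query, scored by a simple TF-IDF
# weighting. Everything here is illustrative, not a real system.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def best_section(sections, query):
    """Score each section by the summed TF-IDF weight of the query terms."""
    docs = {name: Counter(tokenize(body)) for name, body in sections.items()}
    n = len(docs)
    df = Counter()  # document frequency of each term
    for counts in docs.values():
        df.update(set(counts))
    def score(counts):
        return sum(
            counts[t] * math.log((1 + n) / (1 + df[t]))
            for t in tokenize(query)
        )
    return max(docs, key=lambda name: score(docs[name]))

sections = {
    "History": "The city was founded in 1850 by gold miners.",
    "Climate": "Summers are hot and dry while winters are mild and wet.",
    "Economy": "Tourism and agriculture dominate the local economy.",
}
print(best_section(sections, "Are summers hot and dry?"))  # prints: Climate
```

A real system would of course work over full articles with a proper index, but the point stands: the query selects a passage, not the whole page.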
Wikipedia, as we all know, is not perfect, but it's about the best you can get, with thousands of editors and reviewers doing quality control. If a bot were trained exclusively on Wikipedia, my guess is that the falsehood generation would be as minimal as it can get. Garbage in, garbage out with all these models; good stuff in, good stuff out. I guess falsehoods can also arise when no material exists in the model. So instead of making things up, the bot could default to "I don't know the answer to that." Or in our case, we could add the topic to the list of article suggestions for editors…
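The "default to I don't know" idea amounts to an abstention threshold on whatever confidence signal the system produces. A minimal sketch, with invented names and numbers (there is no real model behind this):

```python
# Hypothetical abstention logic: answer only when confidence clears a
# threshold; otherwise say "I don't know" and queue the topic as an
# article suggestion for human editors. All values are illustrative.
ARTICLE_SUGGESTIONS = []

def answer(query, retrieved, threshold=0.5):
    """retrieved is a (text, confidence) pair from some upstream retrieval step."""
    text, confidence = retrieved
    if confidence >= threshold:
        return text
    # No reliable material in the model: abstain instead of guessing,
    # and record the gap for editors to fill.
    ARTICLE_SUGGESTIONS.append(query)
    return "I don't know the answer to that."

print(answer("capital of France", ("Paris", 0.93)))     # prints: Paris
print(answer("mayor of Atlantis", ("Poseidon", 0.12)))  # prints: I don't know the answer to that.
print(ARTICLE_SUGGESTIONS)                              # ['mayor of Atlantis']
```

Calibrating that confidence signal is the hard part in practice; the sketch only shows where the "I don't know" branch would sit.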
I know I am almost daydreaming here, but I can't help thinking that all the recent advances in AI could create significantly broader free-knowledge pathways for every human being. And I don't see us going after them aggressively enough…
Best regards,
Victoria Coleman
I think the simplest answer is yes, it is an artificial writer, but it is not intelligent as the name implies; rather, it is just a piece of software that gives answers according to its own methodology. In the garbage-in, garbage-out format, it can never be better than the programmers behind the machine.
As a friend wrote on a Slack thread about the topic: "ChatGPT can produce results that appear stunningly intelligent, and there are things that I've seen that really leave me scratching my head: 'How on Earth did it DO that?!?' But it's important to remember that it isn't actually intelligent. It's not 'thinking.' It's more of a glorified version of autosuggest. When it apologizes, it's not really apologizing; it's just finding text that fits the self-description it was fed and that looks related to what you fed it."
The person initiating the thread had asked ChatGPT, "What are the 5 biggest intentional communities on each continent?" (As an aside, this was as challenging as the question that led to Wikidata: "What are the ten largest cities in the world that have women mayors?") One of the answers ChatGPT gave for Europe was "Ikaria (Greece)". As near as I can determine, there is no intentional community of any size on Ikaria. However, the Icarians (https://en.wikipedia.org/wiki/Icarians) were a 19th-century intentional community in the US founded by French expatriates. It was named after a utopian novel, *Voyage en Icarie*, written by Étienne Cabet, who chose the Greek island of Icaria as the setting of his utopian vision. It is interesting that ChatGPT may have conflated these.
It seems that given a prompt, ChatGPT shuffles and regurgitates facts. Just as a card dealer sometimes deals a good hand, ChatGPT sometimes seems to make sense, but I think at present it really is "a glorified version of autosuggest."
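For anyone who has not seen one, the "glorified autosuggest" point can be made concrete with a toy bigram Markov chain: trained on a scrap of text, it emits fluent-looking word sequences by repeatedly suggesting a next word it has seen before, with no understanding of what it says. (Real language models are vastly more sophisticated, but the recombination-without-comprehension failure mode is recognizably similar.) The training text below is invented for the example.

```python
# Toy "autosuggest" text generator: a bigram Markov chain that can
# only recombine word pairs it has already seen in its training text.
import random
from collections import defaultdict

def train(text):
    """Map each word to the list of words that followed it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain from a start word, picking a seen successor each step."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        nxt = chain.get(out[-1])
        if not nxt:
            break  # dead end: no word ever followed this one
        out.append(random.choice(nxt))
    return " ".join(out)

corpus = ("the fire was started by the group and the group denied "
          "the fire was an accident")
chain = train(corpus)
print(generate(chain, "the"))
```

Every word the generator emits came from the corpus, yet the sentence it stitches together may assert something the corpus never said, which is exactly the shuffle-and-regurgitate behavior described above.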
Yours Peaceray
Hi,
just to remark that it superficially looks like a great tool for small-language Wikipedias (for which the translation tool is typically not available). One could train the tool on some less common language using a dictionary and some texts, and then let it fill the project with thousands of articles. (As an aside, one could probably even train it on soon-to-be-extinct languages and preserve them until the moment there is any interest in revival, but nobody seems to be interested.) However, there is a high potential for abuse, as I can imagine people who do not speak the language running the tool and creating thousands of substandard articles - we have seen this done manually, and I would be very cautious about allowing this.
Best, Yaroslav
Given what we already know about AI-like projects (think Siri, Alexa, etc.), they're the result of work done by organizations utilizing resources hundreds of times greater than those of the entire Wikimedia movement, and they're not all that good if we're being honest. They're entirely dependent on existing resources. We have seen time and again how easily they can be led astray; ChatGPT is just the most recent example. It is full of misinformation. Other efforts have resulted in the AI becoming radicalized. Again, it's all about what sources the AI project uses in developing its responses, and those underlying sources are generally completely unknown to the person asking for the information.
Ironically, our volunteers have created software that learns pretty effectively (ORES, several anti-vandalism "bots"). The tough part is ensuring that there is continued, long-term support for these volunteer-led efforts, and the ability to make them effective on projects using other languages. We've had bots making translations of formulaic articles from one language to another for years; again, they depend on volunteers who can maintain and support those bots, and ensure continued quality of translation.
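(For anyone unfamiliar with ORES: it is queried as a plain HTTP scoring service. A rough sketch of how a query is built and what comes back — the endpoint shape follows the ORES documentation, but the revision ID and scores below are made up for illustration:)

```python
# Build a URL for the ORES scoring API (endpoint shape per the ORES docs;
# the revision ID here is purely illustrative).
def ores_url(wiki, model, rev_id):
    return (f"https://ores.wikimedia.org/v3/scores/{wiki}/"
            f"?models={model}&revids={rev_id}")

url = ores_url("enwiki", "damaging", 123456)

# A response nests scores by wiki, revision, and model; sample values only:
sample_response = {
    "enwiki": {"scores": {"123456": {"damaging": {
        "score": {"prediction": False,
                  "probability": {"false": 0.97, "true": 0.03}}}}}}
}

# Probability that the (made-up) revision 123456 is damaging:
p_damaging = sample_response["enwiki"]["scores"]["123456"]["damaging"][
    "score"]["probability"]["true"]
```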
AI development is tough. It is monumentally expensive. Big players have invested billions of dollars trying to develop working AI, with some of the most talented programmers and developers in the world, and they're barely scratching the surface. I don't see this as a priority for the Wikimedia movement, which achieves considerably higher quality with volunteers following a fairly simple rule set that the volunteers themselves develop based on tried and tested knowledge. Let's let those with lots of money keep working to develop something that is useful, and then we can start seeing if it can become feasible for our use.
I envision the AI industry being similar to the computer hardware industry. My first computer cost about the same (in 2022 dollars) as the four computers and all their peripherals that I have within my reach as I write this, and had less than 1% of the computing power of each of them.[1] The cost will go down once the technology gets better and more stable.
Risker/Anne
[1] Comparison of 1990 to 2022 dollars.
On Fri, 30 Dec 2022 at 01:40, Yaroslav Blanter ymbalt@gmail.com wrote:
Hi,
just to remark that it superficially looks like a great tool for small-language Wikipedias (for which the translation tool is typically not available). One can train the tool in some less common language using the dictionary and some texts, and then let it fill the project with thousands of articles. (As an aside, one probably could train it on soon-to-be-extinct languages and save them until the moment there is any interest in revival, but nobody seems to be interested.) However, there is a high potential for abuse, as I can imagine people not speaking the language running the tool and creating thousands of substandard articles - we have seen this done manually, and I would be very cautious about allowing this.
Best Yaroslav
On Fri, Dec 30, 2022 at 4:57 AM Raymond Leonard <raymond.f.leonard.jr@gmail.com> wrote:
As a friend wrote on a Slack thread about the topic, "ChatGPT can produce results that appear stunningly intelligent, and there are things that I’ve seen that really leave me scratching my head- “how on Earth did it DO that?!?” But it’s important to remember that it isn’t actually intelligent. It’s not “thinking.” It’s more of a glorified version of autosuggest. When it apologizes, it’s not really apologizing, it’s just finding text that fits the self description it was fed and that looks related to what you fed it."
The person initiating the thread had asked ChatGPT "What are the 5 biggest intentional communities on each continent?" (As an aside, this was as challenging as the question that led to Wikidata, "What are the ten largest cities in the world that have women mayors?") One of the answers ChatGPT gave for Europe was "Ikaria (Greece)". As near as I can determine, there is no intentional community of any size in Ikaria. However, the Icarians (https://en.wikipedia.org/wiki/Icarians) were a 19th-century intentional community in the US founded by French expatriates. It was named after a utopian novel, *Voyage en Icarie*, that was written by Étienne Cabet. He chose the Greek island of Icaria as the setting of his utopian vision. Interesting that ChatGPT may have conflated these.
It seems that given a prompt, ChatGPT shuffles & regurgitates facts. Just as a card dealer sometimes deals a good hand, sometimes ChatGPT seems to make sense, but I think at present it really is "a glorified version of autosuggest."
Yours Peaceray
On Thu, Dec 29, 2022 at 6:39 PM Gnangarra gnangarra@gmail.com wrote:
I think the simplest answer is yes, it's an artificial writer, but it's not intelligent as the name implies; it's just a piece of software that gives answers according to the methodology of that software. In the garbage-in, garbage-out sense, it can never be better than the programmers behind the machine.
On Fri, 30 Dec 2022 at 09:56, Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
Thank you Ziko and Steven for the thoughtful responses.
My sense is that for a class of readers, having a generative UI that returns an answer vs. an article would be useful. It would probably put Quora out of business. :-)
If the models are not open source, this indeed would require developing our own models. For that kind of investment, we would probably want to have more application areas: translation, which Ziko already pointed out, but also summarization. These kinds of information retrieval queries would effectively index into specific parts of an article vs. returning the whole thing.
Wikipedia as we all know is not perfect but it’s about the best you can get with the thousands of editors and reviewers doing quality control. If a bot was exclusively trained on Wikipedia, my guess is that the falsehood generation would be as minimal as it can get. Garbage in garbage out in all these models. Good stuff in good stuff out. I guess the falsehoods can also come when no material exists in the model. So instead of making stuff up, they could default to “I don’t know the answer to that”. Or in our case, we could add the topic to the list of article suggestions to editors…
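In pseudo-code terms (purely a sketch - every name here is made up for illustration), that fallback could be as simple as a confidence threshold:

```python
# Toy sketch of the "default to 'I don't know'" idea: return a generated
# answer only when the model's own confidence clears a threshold;
# otherwise queue the topic as an article suggestion for editors.
# All names are invented for illustration.
def answer_or_defer(question, generate, suggestion_queue, threshold=0.8):
    text, confidence = generate(question)  # model yields (answer, score in 0..1)
    if confidence >= threshold:
        return text
    suggestion_queue.append(question)  # surface the gap to human editors
    return "I don't know the answer to that."

# Stand-in "model" that is confident about exactly one fact.
def toy_model(question):
    if "capital of France" in question:
        return ("Paris", 0.95)
    return ("", 0.1)

queue = []
answer_or_defer("What is the capital of France?", toy_model, queue)  # "Paris"
answer_or_defer("Biggest intentional community in Europe?", toy_model, queue)
# Low confidence: returns "I don't know the answer to that." and the
# question lands in `queue` as an article suggestion.
```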
I know I am almost day dreaming here but I can’t help but think that all the recent advances in AI could create significantly broader free knowledge pathways for every human being. And I don’t see us getting after them aggressively enough…
Best regards,
Victoria Coleman
On Dec 29, 2022, at 5:17 PM, Steven Walling steven.walling@gmail.com wrote:
On Thu, Dec 29, 2022 at 4:09 PM Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
> Hi everyone. I have seen some of the reactions to the narratives generated by Chat GPT. There is an obvious question (to me at least) as to whether a Wikipedia chat bot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT Open source licensing is compatible with the movement norms, should the WMF include that UI in the product?
This is a cool idea but what would the goals of developing a Wikipedia-specific generative AI be? IMO it would be nice to have a natural language search right in Wikipedia that could return factual answers not just links to our (often too long) articles.
OpenAI models aren’t open source btw. Some of the products are free to use right now, but their business model is to charge for API use etc. so including it directly in Wikipedia is pretty much a non-starter.
> My other question is around the corpus that Open AI is using to train the bot. It is creating very fluid narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
They’re almost certainly using Wikipedia. The answer from ChatGPT is:
“ChatGPT is a chatbot model developed by OpenAI. It was trained on a dataset of human-generated text, including data from a variety of sources such as books, articles, and websites. It is possible that some of the data used to train ChatGPT may have come from Wikipedia, as Wikipedia is a widely-used source of information and is likely to be included in many datasets of human-generated text.”
> And to my earlier question, if GPT were to be trained on Wikipedia exclusively, would that help abate the false narratives?
Who knows, but we would have to develop our own models to test this idea.
-- Boodarwun Gnangarra 'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
Anne,
Interestingly enough, what these large companies have to spend a ton of money on is creating and moderating content. In other words, people. Passionate volunteers in large numbers are what the movement has in abundance. Imagine the power of combining the talents and passion of our community members with the advances offered by AI today.

I was struck recently during a visit to NVIDIA by how language models have changed. Back in my day, we would have to build one language model per domain and then load it into the device - a computer or a phone - to use. Now they have one massive combined language model in a data center full of their GPUs, which is there so long as you are connected. My sense is that within the guard rails offered by our volunteer community, we could use AI to force-multiply their efforts and make knowledge even more accessible than it is today, both for those who create and record knowledge and for those who consume it. In the case of Chat GPT, our volunteers could use supervised learning, for example, to narrow down the mistakes the bot makes - which should be many fewer than in the Open AI version, since the Wikipedia version would be trained on good, clean Wikipedia content that is constantly reviewed by the community.
Best regards,
Victoria Coleman
One concern I have is that “oldbies” like myself have all seen bots basically decay after whoever is maintaining them goes inactive. Of course, this could be mostly rectified by having the AI be open source. This leaves the “people” aspect; that is, not only does the AI need to be maintained, but interest needs to be maintained as well.
From, I dream of horses She/her
On Dec 30, 2022, at 8:53 AM, Victoria Coleman vstavridoucoleman@gmail.com wrote:
Anne,
Interestingly enough what these large companies have to spend a ton of money on is creating and moderating content. In other words people. Passionate volunteers in large numbers is what the movement has in abundance. Imagine the power of combining the talents and passion of our community members with the advances offered by AI today. I was struck recently during a visit to NVIDIA how language models have changed. Back in my day, we would have to build one language model per domain and then load it in to the device, a computer or a phone, to use. Now they have one massive combined language model in a data center full of their GPUs which is there so long as you are connected. My sense is that within the guard rails offered by our volunteer community, we could use AI to force multiply their efforts and make knowledge even more accessible than it is today. Both for those who create and record knowledge as well as those who consume it. In the case of Chat GPT, our volunteers could use supervised learning for example to narrow down the mistakes the bot makes - which should be many fewer that the Open AI version since the Wikipedia version would be trained on good, clean Wikipedia content which is constantly reviewed by the community.
Best regards,
Victoria Coleman
On Dec 30, 2022, at 12:21 AM, Risker risker.wp@gmail.com wrote:
Given what we already know about AI-like projects (think Siri, Alexis, etc), they're the result of work done by organizations utilizing resources hundreds of times greater than the resources within the entire Wikimedia movement, and they'renot all that good if we're being honest. They're entirely dependent on existing resources. We have seen time and again how easily they can be led astray; ChatGPT is just the most recent example. It is full of misinformation. Other efforts have resulted in the AI becoming radicalized. Again, it's all about what sources the AI project uses in developing its responses, and those underlying sources are generally completely unknown to the person asking for the information.
Ironically, our volunteers have created software that learns pretty effectively (ORES, several anti-vandalism "bots"). The tough part is ensuring that there is continued, long-term support for these volunteer-led efforts, and the ability to make them effective on projects using other languages. We've had bots making translations of formulaic articles from one language to another for years; again, they depend on volunteers who can maintain and support those bots, and ensure continued quality of translation.
AI development is tough. It is monumentally expensive. Big players have invested billions USD trying to develop working AI, with some of the most talented programmers and developers in the world, and they're barely scratching the surface. I don't see this as a priority for the Wikimedia movement, which achieves considerably higher quality with volunteers following a fairly simple rule set that the volunteers themselves develop based on tried and tested knowledge. Let's let those with lots of money keep working to develop something that is useful, and then we can start seeing if it can become feasible for our use.
I envision the AI industry being similar to the computer hardware industry. My first computer cost about the same (in 2022 dollars) as the four computers and all their peripherals that I have within my reach as I write this, and had less than 1% of the computing power of each of them.[1] The cost will go down once the technology gets better and more stable.
Risker/Anne
[1] Comparison of 1990 to 2022 dollars.
On Fri, 30 Dec 2022 at 01:40, Yaroslav Blanter <ymbalt@gmail.com mailto:ymbalt@gmail.com> wrote:
Hi,
just to remark that it superficially looks like a great tool for small language Wikipedias (for which the translation tool is typically not available). One can train the tool in some less common language using the dictionary and some texts, and then let it fill the project with a thousands of articles. (As an aside, in fact, one probably can train it to the soon-to-be-extint languages and save them until the moment there is any interest for revival, but nobody seems to be interested). However, there is a high potential for abuse, as I can imagine people not speaking the language running the tool and creating thousands of substandard articles - we have seen this done manually, and I would be very cautious allowing this.
Best Yaroslav
On Fri, Dec 30, 2022 at 4:57 AM Raymond Leonard <raymond.f.leonard.jr@gmail.com mailto:raymond.f.leonard.jr@gmail.com> wrote:
As a friend wrote on a Slack thread about the topic, "ChatGPT can produce results that appear stunningly intelligent, and there are things that I’ve seen that really leave me scratching my head- “how on Earth did it DO that?!?” But it’s important to remember that it isn’t actually intelligent. It’s not “thinking.” It’s more of a glorified version of autosuggest. When it apologizes, it’s not really apologizing, it’s just finding text that fits the self description it was fed and that looks related to what you fed it."
The person initiating the thread had asked ChatGPT "What are the 5 biggest intentional communities on each continent?" (As an aside, this was as challenging as the question that led to Wikidata, "What are the ten largest cities in the world that have women mayors?") One of the answers ChatGPT gave for Europe was "Ikaria (Greece)". As near as I can determine, there is no intentional community of any size in Ikaria. However, the Icarians https://en.wikipedia.org/wiki/Icarians were a 19th-century intentional community in the US founded by French expatriates. It was named after a utopian novel, Voyage en Icarie, that was written by Étienne Cabet. He chose the Greek island of Icaria as the setting of his utopian vision. Interesting that ChatGPT may have conflated these.
It seems that given a prompt, ChatGPT shuffles & regurgitates facts. Just as a card dealer deals a good hand, sometimes ChatGPT seems to make sense, but I think at present it really is " a glorified version of autosuggest."
Yours Peaceray
On Thu, Dec 29, 2022 at 6:39 PM Gnangarra <gnangarra@gmail.com mailto:gnangarra@gmail.com> wrote:
I think the simplest answer is yes its an artificial writer but its not intelligence as the name implies but rather just a piece of software that gives answers according to the methodology of that software. The garbage in garbage out format, it can never be better than the programmers behind the machine
On Fri, 30 Dec 2022 at 09:56, Victoria Coleman <vstavridoucoleman@gmail.com mailto:vstavridoucoleman@gmail.com> wrote:
Thank you Ziko and Steven for the thoughtful responses.
My sense is that for a class for readers having a generative UI that returns an answer VS an article would be useful. It would probably put Quora out of business. :-)
If the models are not open source, this indeed would require developing our own models. For that kind of investment, we would probably want to have more application areas. Translation being one that Ziko already pointed out but also summarization. These kinds of Information retrieval queries would effectively index into specific parts of an article vs returning the whole thing.
Wikipedia as we all know is not perfect but it’s about the best you can get with the thousands of editors and reviewers doing quality control. If a bot was exclusively trained on Wikipedia, my guess is that the falsehood generation would be as minimal as it can get. Garbage in garbage out in all these models. Good stuff in good stuff out. I guess the falsehoods can also come when no material exists in the model. So instead of making stuff up, they could default to “I don’t know the answer to that”. Or in our case, we could add the topic to the list of article suggestions to editors…
I know I am almost day dreaming here but I can’t help but think that all the recent advances in AI could create significantly broader free knowledge pathways for every human being. And I don’t see us getting after them aggressively enough…
Best regards,
Victoria Coleman
-- Boodarwun Gnangarra 'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
Of relevance to this conversation:
https://www.wired.com/story/large-language-models-artificial-intelligence/
On Fri, Dec 30, 2022 at 9:32 AM Neurodivergent Netizen <idoh.idreamofhorses@gmail.com> wrote:
One concern I have is that "oldbies" like myself have all seen bots basically decay after whoever was maintaining them goes inactive. Of course, this could be mostly rectified by having the AI be open source. That leaves the "people" aspect; that is, not only does the AI need to be maintained, but interest needs to be maintained as well.
From, I dream of horses She/her
On Dec 30, 2022, at 8:53 AM, Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
Anne,
Interestingly enough, what these large companies have to spend a ton of money on is creating and moderating content. In other words, people. Passionate volunteers in large numbers are what the movement has in abundance. Imagine the power of combining the talents and passion of our community members with the advances offered by AI today. I was struck recently, during a visit to NVIDIA, by how language models have changed. Back in my day, we would have to build one language model per domain and then load it into the device, a computer or a phone, to use. Now they have one massive combined language model in a data center full of their GPUs, which is there so long as you are connected. My sense is that, within the guard rails offered by our volunteer community, we could use AI to force-multiply their efforts and make knowledge even more accessible than it is today, both for those who create and record knowledge and for those who consume it. In the case of ChatGPT, our volunteers could use supervised learning, for example, to narrow down the mistakes the bot makes - which should be many fewer than in the OpenAI version, since the Wikipedia version would be trained on good, clean Wikipedia content that is constantly reviewed by the community.
Best regards,
Victoria Coleman
On Dec 30, 2022, at 12:21 AM, Risker <risker.wp@gmail.com> wrote:
Given what we already know about AI-like projects (think Siri, Alexa, etc.), they're the result of work done by organizations utilizing resources hundreds of times greater than those within the entire Wikimedia movement, and they're not all that good if we're being honest. They're entirely dependent on existing resources. We have seen time and again how easily they can be led astray; ChatGPT is just the most recent example. It is full of misinformation. Other efforts have resulted in the AI becoming radicalized. Again, it's all about what sources the AI project uses in developing its responses, and those underlying sources are generally completely unknown to the person asking for the information.
Ironically, our volunteers have created software that learns pretty effectively (ORES, several anti-vandalism "bots"). The tough part is ensuring that there is continued, long-term support for these volunteer-led efforts, and the ability to make them effective on projects using other languages. We've had bots making translations of formulaic articles from one language to another for years; again, they depend on volunteers who can maintain and support those bots, and ensure continued quality of translation.
AI development is tough. It is monumentally expensive. Big players have invested billions USD trying to develop working AI, with some of the most talented programmers and developers in the world, and they're barely scratching the surface. I don't see this as a priority for the Wikimedia movement, which achieves considerably higher quality with volunteers following a fairly simple rule set that the volunteers themselves develop based on tried and tested knowledge. Let's let those with lots of money keep working to develop something that is useful, and then we can start seeing if it can become feasible for our use.
I envision the AI industry being similar to the computer hardware industry. My first computer cost about the same (in 2022 dollars) as the four computers and all their peripherals that I have within my reach as I write this, and had less than 1% of the computing power of each of them.[1] The cost will go down once the technology gets better and more stable.
Risker/Anne
[1] Comparison of 1990 to 2022 dollars.
On Fri, 30 Dec 2022 at 01:40, Yaroslav Blanter <ymbalt@gmail.com> wrote:
Hi,
just to remark that it superficially looks like a great tool for small-language Wikipedias (for which the translation tool is typically not available). One can train the tool in some less common language using a dictionary and some texts, and then let it fill the project with thousands of articles. (As an aside, one probably could train it on soon-to-be-extinct languages and save them until the moment there is any interest in revival, but nobody seems to be interested.) However, there is a high potential for abuse, as I can imagine people not speaking the language running the tool and creating thousands of substandard articles - we have seen this done manually, and I would be very cautious about allowing this.
Best Yaroslav
On Fri, Dec 30, 2022 at 4:57 AM Raymond Leonard <raymond.f.leonard.jr@gmail.com> wrote:
As a friend wrote on a Slack thread about the topic, "ChatGPT can produce results that appear stunningly intelligent, and there are things that I've seen that really leave me scratching my head: 'how on Earth did it DO that?!?' But it's important to remember that it isn't actually intelligent. It's not 'thinking.' It's more of a glorified version of autosuggest. When it apologizes, it's not really apologizing, it's just finding text that fits the self-description it was fed and that looks related to what you fed it."
The person initiating the thread had asked ChatGPT "What are the 5 biggest intentional communities on each continent?" (As an aside, this was as challenging as the question that led to Wikidata, "What are the ten largest cities in the world that have women mayors?") One of the answers ChatGPT gave for Europe was "Ikaria (Greece)". As near as I can determine, there is no intentional community of any size in Ikaria. However, the Icarians (https://en.wikipedia.org/wiki/Icarians) were a 19th-century intentional community in the US founded by French expatriates. It was named after a utopian novel, *Voyage en Icarie*, written by Étienne Cabet. He chose the Greek island of Icaria as the setting of his utopian vision. Interesting that ChatGPT may have conflated these.
On Dec 29, 2022, at 5:17 PM, Steven Walling <steven.walling@gmail.com> wrote:
On Thu, Dec 29, 2022 at 4:09 PM Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
Hi everyone. I have seen some of the reactions to the narratives generated by Chat GPT. There is an obvious question (to me at least) as to whether a Wikipedia chat bot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT Open source licensing is compatible with the movement norms, should the WMF include that UI in the product?
This is a cool idea, but what would the goals of developing a Wikipedia-specific generative AI be? IMO it would be nice to have a natural language search right in Wikipedia that could return factual answers, not just links to our (often too long) articles.
OpenAI models aren't open source, btw. Some of the products are free to use right now, but their business model is to charge for API use, etc., so including it directly in Wikipedia is pretty much a non-starter.
My other question is around the corpus that OpenAI is using to train the bot. It is creating very fluid narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
They’re almost certainly using Wikipedia. The answer from ChatGPT is:
“ChatGPT is a chatbot model developed by OpenAI. It was trained on a dataset of human-generated text, including data from a variety of sources such as books, articles, and websites. It is possible that some of the data used to train ChatGPT may have come from Wikipedia, as Wikipedia is a widely-used source of information and is likely to be included in many datasets of human-generated text.”
And to my earlier question, if GPT were to be trained on Wikipedia exclusively, would that help abate the false narratives?
Who knows, but we would have to develop our own models to test this idea.
This is a significant matter for the community and seeing us step to it would be very encouraging.
Best regards,
Victoria Coleman
Good article. I think it underlines the truth that without human curation, all these models produce is junk. The trick (which is far from simple, btw) is to figure out ways of harnessing the power of these models without breaking lives or hearts. I think that's what engineering is all about. We don't hear the terms "AI engineering" or "language model engineering," but that's where we need to get to if we are ever to rely on these things. It's a bit like combustion: it can cause explosions and hurt people, but harnessing it in the internal combustion engine has changed transportation forever.
Victoria
On Dec 31, 2022, at 11:06 AM, Raymond Leonard <raymond.f.leonard.jr@gmail.com> wrote:
Of relevance to this conversation:
https://www.wired.com/story/large-language-models-artificial-intelligence/
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/GIEYQ7BNV4LMR4YOIYSUUL4OLAQVGAFO/ To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org mailto:wikimedia-l-leave@lists.wikimedia.org_______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org mailto:wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/W4IAWBV7VPBRFNQGRZT54UIV77E7M2XJ/ To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org mailto:wikimedia-l-leave@lists.wikimedia.org_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org mailto:wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/5F3ONUSUOKXV52ZCZ73T5KVPAWMJUTYN/ To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org mailto:wikimedia-l-leave@lists.wikimedia.org
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org mailto:wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/2UBNTXB72SIMB7NRXSLQNBYJNVFQAO4E/ To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org mailto:wikimedia-l-leave@lists.wikimedia.org_______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
Wikimedia Mailing List,
Hello. I just discovered this mailing list thread and am also interested in the topics of crowdsourcing and dialogue systems. I support the vision of man-machine collaboration and synergy indicated by Victoria Coleman.
With respect to the state of the art, modern dialogue systems include: ChatGPT by OpenAI, Sparrow by DeepMind, and TeachMe by AI2. These modern dialogue systems can interact with end-users conversationally about knowledge; some can cite their sources; and some can learn, on-the-fly, from operators in control centers, subject-matter experts, and/or broader crowdsourced communities.
Major search engine providers are, according to news reports, already integrating modern dialogue systems, or soon will be. Will the Wikimedia Search Platform be exploring conversational search features?
The user experiences through which control center operators, or broader communities of editors, interact with the knowledge and content used by large-scale dialogue systems could themselves be Wiki-based.
In theory, community dashboards, potentially personalized for each editor, could be provided for editors to determine which articles were popular or trending in terms of usage by dialogue systems' end-users, or otherwise determined to be in potential need of human review, moderation, or curation. These and other related approaches to community productivity enhancement could be of use for amplifying the performance of and synergy between communities of editors and AI systems.
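As a sketch of the data source such a dashboard could draw on: the public Wikimedia Pageviews REST API already reports the most-viewed articles per project and day. (Dialogue-system usage metrics would be a separate, hypothetical feed, but could be merged in the same way; the endpoint below is real, the dashboard framing is illustrative.)

```python
def top_articles_url(project="en.wikipedia", year=2023, month=1, day=1):
    """URL of the Wikimedia Pageviews REST API's most-viewed-articles feed.

    The response is JSON; its items[0]["articles"] list pairs each
    article title with its view count and rank for the given day.
    """
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
            f"{project}/all-access/{year:04d}/{month:02d}/{day:02d}")

print(top_articles_url())
```

Fetching that URL and joining the result against review or moderation queues would be one minimal version of the trending-articles dashboard described above.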
In a recent bibliography [1], I reference contemporary scholarly and scientific publications to indicate that research is underway into how modern dialogue systems could interoperate with Wiki systems, both reading from and writing to them.
Best regards,
Adam
[1] http://www.phoster.com/dialogue-systems-and-information-retrieval/
________________________________ From: Raymond Leonard raymond.f.leonard.jr@gmail.com Sent: Saturday, December 31, 2022 2:06 PM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Of relevance to this conversation:
https://www.wired.com/story/large-language-models-artificial-intelligence/
On Fri, Dec 30, 2022 at 9:32 AM Neurodivergent Netizen <idoh.idreamofhorses@gmail.com> wrote: One concern I have is that all “oldbies” like myself have seen bots basically decay after whoever is maintaining them goes inactive. Of course, this could be mostly rectified by making the AI open source. This leaves the “people” aspect; that is, not only does the AI need to be maintained, but interest needs to be maintained as well.
From, I dream of horses She/her
On Dec 30, 2022, at 8:53 AM, Victoria Coleman <vstavridoucoleman@gmail.com> wrote:
Anne,
Interestingly enough, what these large companies have to spend a ton of money on is creating and moderating content. In other words, people. Passionate volunteers in large numbers are what the movement has in abundance. Imagine the power of combining the talents and passion of our community members with the advances offered by AI today. I was struck recently during a visit to NVIDIA by how language models have changed. Back in my day, we would have to build one language model per domain and then load it into the device, a computer or a phone, to use it. Now they have one massive combined language model in a data center full of their GPUs, available so long as you are connected. My sense is that, within the guard rails offered by our volunteer community, we could use AI to force-multiply their efforts and make knowledge even more accessible than it is today, both for those who create and record knowledge and for those who consume it. In the case of Chat GPT, our volunteers could use supervised learning, for example, to narrow down the mistakes the bot makes, which should be many fewer than in the Open AI version, since the Wikipedia version would be trained on good, clean Wikipedia content that is constantly reviewed by the community.
Best regards,
Victoria Coleman
On Dec 30, 2022, at 12:21 AM, Risker <risker.wp@gmail.com> wrote:
Given what we already know about AI-like projects (think Siri, Alexa, etc.), they're the result of work done by organizations utilizing resources hundreds of times greater than those of the entire Wikimedia movement, and they're not all that good if we're being honest. They're entirely dependent on existing resources. We have seen time and again how easily they can be led astray; ChatGPT is just the most recent example. It is full of misinformation. Other efforts have resulted in the AI becoming radicalized. Again, it's all about what sources the AI project uses in developing its responses, and those underlying sources are generally completely unknown to the person asking for the information.
Ironically, our volunteers have created software that learns pretty effectively (ORES, several anti-vandalism "bots"). The tough part is ensuring that there is continued, long-term support for these volunteer-led efforts, and the ability to make them effective on projects using other languages. We've had bots making translations of formulaic articles from one language to another for years; again, they depend on volunteers who can maintain and support those bots, and ensure continued quality of translation.
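For a concrete sense of what those volunteer-built learning systems look like from the outside, ORES exposes a public scoring API that, given revision IDs, returns model scores such as how likely an edit is to be damaging. The sketch below only builds the request URL; the endpoint shape matches ORES v3, but treat the details as illustrative:

```python
from urllib.parse import urlencode

ORES_BASE = "https://ores.wikimedia.org/v3/scores"

def ores_scores_url(wiki, rev_ids, models=("damaging", "goodfaith")):
    """Build a request URL for the ORES scoring service.

    `wiki` is a wiki database name such as "enwiki"; `rev_ids` is a
    sequence of revision IDs to score with the named models.
    """
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"{ORES_BASE}/{wiki}?{query}"

print(ores_scores_url("enwiki", [123456]))
```

Anti-vandalism bots consume exactly this kind of per-revision score, which is why long-term maintenance of the service matters so much.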
AI development is tough. It is monumentally expensive. Big players have invested billions USD trying to develop working AI, with some of the most talented programmers and developers in the world, and they're barely scratching the surface. I don't see this as a priority for the Wikimedia movement, which achieves considerably higher quality with volunteers following a fairly simple rule set that the volunteers themselves develop based on tried and tested knowledge. Let's let those with lots of money keep working to develop something that is useful, and then we can start seeing if it can become feasible for our use.
I envision the AI industry being similar to the computer hardware industry. My first computer cost about the same (in 2022 dollars) as the four computers and all their peripherals that I have within my reach as I write this, and had less than 1% of the computing power of each of them.[1] The cost will go down once the technology gets better and more stable.
Risker/Anne
[1] Comparison of 1990 to 2022 dollars.
On Fri, 30 Dec 2022 at 01:40, Yaroslav Blanter <ymbalt@gmail.com> wrote: Hi,
just to remark that it superficially looks like a great tool for small-language Wikipedias (for which the translation tool is typically not available). One could train the tool on some less common language using a dictionary and some texts, and then let it fill the project with thousands of articles. (As an aside, one could probably also train it on soon-to-be-extinct languages and preserve them until there is any interest in revival, but nobody seems to be interested.) However, there is a high potential for abuse: I can imagine people who do not speak the language running the tool and creating thousands of substandard articles. We have seen this done manually, and I would be very cautious about allowing it.
Best Yaroslav
On Fri, Dec 30, 2022 at 4:57 AM Raymond Leonard <raymond.f.leonard.jr@gmail.com> wrote: As a friend wrote on a Slack thread about the topic, "ChatGPT can produce results that appear stunningly intelligent, and there are things that I’ve seen that really leave me scratching my head: “how on Earth did it DO that?!?” But it’s important to remember that it isn’t actually intelligent. It’s not “thinking.” It’s more of a glorified version of autosuggest. When it apologizes, it’s not really apologizing, it’s just finding text that fits the self description it was fed and that looks related to what you fed it."
The person initiating the thread had asked ChatGPT "What are the 5 biggest intentional communities on each continent?" (As an aside, this was as challenging as the question that led to Wikidata, "What are the ten largest cities in the world that have women mayors?") One of the answers ChatGPT gave for Europe was "Ikaria (Greece)". As near as I can determine, there is no intentional community of any size in Ikaria. However, the Icarians (https://en.wikipedia.org/wiki/Icarians) were a 19th-century intentional community in the US founded by French expatriates. It was named after a utopian novel, Voyage en Icarie, written by Étienne Cabet. He chose the Greek island of Icaria as the setting of his utopian vision. Interesting that ChatGPT may have conflated these.
It seems that given a prompt, ChatGPT shuffles & regurgitates facts. Just as a card dealer deals a good hand, sometimes ChatGPT seems to make sense, but I think at present it really is " a glorified version of autosuggest."
Yours Peaceray
On Thu, Dec 29, 2022 at 6:39 PM Gnangarra <gnangarra@gmail.com> wrote: I think the simplest answer is yes, it's an artificial writer, but it's not intelligent as the name implies; rather, it is just a piece of software that gives answers according to its methodology. Garbage in, garbage out: it can never be better than the programmers behind the machine.
On Fri, 30 Dec 2022 at 09:56, Victoria Coleman <vstavridoucoleman@gmail.com> wrote: Thank you Ziko and Steven for the thoughtful responses.
My sense is that for a class of readers, having a generative UI that returns an answer rather than an article would be useful. It would probably put Quora out of business. :-)
If the models are not open source, this would indeed require developing our own models. For that kind of investment, we would probably want more application areas: translation, which Ziko already pointed out, but also summarization. These kinds of information retrieval queries would effectively index into specific parts of an article rather than returning the whole thing.
Wikipedia, as we all know, is not perfect, but it’s about the best you can get, with thousands of editors and reviewers doing quality control. If a bot were exclusively trained on Wikipedia, my guess is that the falsehood generation would be as minimal as it can get. Garbage in, garbage out applies to all these models; good stuff in, good stuff out. I guess falsehoods can also arise when no material exists in the model. So instead of making things up, the bot could default to “I don’t know the answer to that”. Or in our case, we could add the topic to the list of article suggestions for editors…
I know I am almost day dreaming here but I can’t help but think that all the recent advances in AI could create significantly broader free knowledge pathways for every human being. And I don’t see us getting after them aggressively enough…
Best regards,
Victoria Coleman
On Dec 29, 2022, at 5:17 PM, Steven Walling <steven.walling@gmail.com> wrote:
On Thu, Dec 29, 2022 at 4:09 PM Victoria Coleman <vstavridoucoleman@gmail.com> wrote: Hi everyone. I have seen some of the reactions to the narratives generated by Chat GPT. There is an obvious question (to me at least) as to whether a Wikipedia chat bot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT Open source licensing is compatible with the movement norms, should the WMF include that UI in the product?
This is a cool idea but what would the goals of developing a Wikipedia-specific generative AI be? IMO it would be nice to have a natural language search right in Wikipedia that could return factual answers not just links to our (often too long) articles.
OpenAI models aren’t open source btw. Some of the products are free to use right now, but their business model is to charge for API use etc. so including it directly in Wikipedia is pretty much a non-starter.
My other question is around the corpus that Open AI is using to train the bot. It is creating very fluid narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
They’re almost certainly using Wikipedia. The answer from ChatGPT is:
“ChatGPT is a chatbot model developed by OpenAI. It was trained on a dataset of human-generated text, including data from a variety of sources such as books, articles, and websites. It is possible that some of the data used to train ChatGPT may have come from Wikipedia, as Wikipedia is a widely-used source of information and is likely to be included in many datasets of human-generated text.”
And to my earlier question, if GPT were to be trained on Wikipedia exclusively would that help abate the false narratives
Who knows but we would have to develop our own models to test this idea.
This is a significant matter for the community and seeing us step to it would be very encouraging.
Best regards,
Victoria Coleman
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
-- Boodarwun Gnangarra 'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display the three sentence snippet of our article about the critical response section of the article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely do a lot to learn from what people like about ChatGPT and apply to Wikipedia search.
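For anyone who wants to reproduce the recall gap described above, full-text search is reachable through the public MediaWiki API (action=query, list=search); comparing its ranking for natural-language questions against the typeahead suggestions is straightforward. A minimal sketch, with illustrative parameter values:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def fulltext_search_url(query, limit=5):
    """Build a MediaWiki full-text search request.

    This hits the same CirrusSearch backend as the on-wiki search box,
    returning ranked article matches for the query string.
    """
    params = urlencode({
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    })
    return f"{API}?{params}"

print(fulltext_search_url("Made in the Dark Hot Chip album"))
```

Fetching that URL returns a ranked list of matching titles with snippets, which is roughly the raw material a smarter question-answering layer would need to surface.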
Cloud computing costs are a significant component of the cost to train and operate modern AI systems. As a non-profit organization, the Wikimedia Foundation might therefore be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and non-profit researchers with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-... [2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
________________________________ From: Steven Walling steven.walling@gmail.com Sent: Saturday, February 4, 2023 1:59 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
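(To make the query-string variant concrete: a chatbot could carry dialogue context in the link itself. A sketch; the `from_dialogue` and `turn` parameters are purely hypothetical examples of what such bulk usage data might look like, and any real scheme would need privacy review first.)

```python
from urllib.parse import urlencode, urlparse, parse_qs

def wikipedia_link(title, context):
    """Build a link to a Wikipedia article carrying dialogue context.

    The context keys are illustrative only, not an existing convention.
    """
    base = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    return base + "?" + urlencode(context)

link = wikipedia_link("Hot Chip", {"from_dialogue": "chatbot", "turn": "5"})

# The receiving side can recover the context from the query string:
context = parse_qs(urlparse(link).query)
```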
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
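(For the assignment side of such A/B tests, one common approach is deterministic bucketing on a stable identifier, so a given end-user always sees the same variant without any server-side state. A minimal sketch; the experiment name and variant labels are hypothetical.)

```python
import hashlib

def assign_variant(user_id, experiment, variants):
    """Deterministically map a user to one variant of an experiment.

    Hashing user_id together with the experiment name keeps assignments
    stable per user but independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same bucket for a given experiment:
v1 = assign_variant("reader-123", "feedback-buttons", ["control", "buttons"])
v2 = assign_variant("reader-123", "feedback-buttons", ["control", "buttons"])
```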
Best regards,
Adam
________________________________ From: Victoria Coleman vstavridoucoleman@gmail.com Sent: Saturday, February 4, 2023 8:10 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF, as the tech provider for the mission (because first and foremost, in my view, that's what the WMF is - as well as the financial engine of the movement, of course), needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example, to your point below on data sets, we have the amazing Wikidata as well as the excellent work on Abstract Wikipedia. We have Wikimedia Enterprise, which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
Hi,
On the product side, my biggest concern with NLP-based AI is that it could drastically decrease traffic to our websites and apps, which means fewer new editors and fewer donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists; I can assure you, most companies would be happy to launch a search product with 80% confidence in answer quality.
From a financial perspective, large industrial investments like this are usually a pool of money you can draw from over several years. You can expect they have not drawn all of it yet.
Second, GPT-3 and ChatGPT are far from being the most expensive products they have. On top of people, you need:
- datasets
- people to tag the datasets
- people to correct the algorithms
- computing power
I simplify here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not dismiss the option of the movement doing it so easily. That being said, it would mean a new project requiring substantial resources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
With respect to cloud computing costs, which are a significant component of the costs to train and operate modern AI systems, the Wikimedia Foundation, as a non-profit organization, might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and non-profit researchers with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-... [2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
I see our biggest challenge is going to be detecting these AI tools adding content, whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency: readers are not able to identify when it has been used. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures created by copying someone else's work without acknowledging it as derivative in any way.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
From: Gnangarra [mailto:gnangarra@gmail.com] Sent: 04 February 2023 17:04 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
I see our biggest challenge as detecting when these AI tools have been used to add content, whether media or articles, and identifying when they are in use by our sources. The failing of the new AI is not its ability but its lack of transparency: readers cannot tell when it has been used. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures created by copying someone else's work without acknowledging it as derivative in any way.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
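As a minimal sketch of the query-string approach (the parameter names here are hypothetical, not an existing Wikimedia API):

```python
from urllib.parse import urlencode

def wikipedia_link_from_dialogue(title: str, dialogue_context: str) -> str:
    """Build a Wikipedia URL carrying chatbot dialogue context as
    query-string arguments (parameter names are illustrative only)."""
    base = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    params = urlencode({"utm_source": "chatbot", "dialogue_ctx": dialogue_context})
    return f"{base}?{params}"

url = wikipedia_link_from_dialogue("The Menu (2022 film)", "asked about reviews")
```

Aggregating such parameters server-side would give the bulk usage data mentioned above without identifying individual users.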
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
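Deterministic bucketing for such A/B tests can be sketched by hashing a stable user identifier, so the same user always sees the same variant (bucket names and the 50/50 split here are illustrative):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Assign a user to a variant deterministically: hashing the
    experiment name together with the user id keeps assignments
    stable per experiment and independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```
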
Best regards,
Adam
_____
From: Victoria Coleman vstavridoucoleman@gmail.com Sent: Saturday, February 4, 2023 8:10 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that's what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example, to your point below on data sets, we have the amazing Wikidata as well as the excellent work on Abstract Wikipedia. We have Wikimedia Enterprise, which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
Hi,
On the product side, my biggest concern with NLP-based AI is that it would drastically decrease traffic to our websites and apps. Which means fewer new editors and fewer donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists; I can assure you, most companies would be happy to launch a search product with an 80% confidence in answer quality.
From a financial perspective, large industrial investments like this are usually a pool of money you can draw from over x years. You can expect they have not drawn all of it yet.
Second, GPT 3 and ChatGPT are far from being the most expensive products they have. On top of people you need:
* datasets
* people to tag the dataset
* people to correct the algo
* computing power
I simplify here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not dismiss so easily the option of the movement doing it. That being said, it would mean a new project needing substantial resources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
With respect to cloud computing costs, these being a significant component of the costs to train and operate modern AI systems, as a non-profit organization, the Wikimedia Foundation might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and non-profit researchers with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-...
[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
_____
From: Steven Walling steven.walling@gmail.com Sent: Saturday, February 4, 2023 1:59 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
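To illustrate the recall problem, a fuzzy matcher recovers the title from an inexact query where strict prefix matching fails (a toy sketch, not the actual production search pipeline):

```python
import difflib

TITLES = ["Made in the Dark", "The Menu (2022 film)", "Hot Chip"]

def prefix_search(query: str, titles=TITLES):
    # Strict prefix matching: roughly how a naive typeahead behaves.
    return [t for t in titles if t.lower().startswith(query.lower())]

def fuzzy_search(query: str, titles=TITLES):
    # Case-insensitive approximate matching trades precision for recall.
    lowered = {t.lower(): t for t in titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=3, cutoff=0.4)
    return [lowered[h] for h in hits]
```

Here `prefix_search("made in dark")` finds nothing, while `fuzzy_search("made in dark")` still surfaces "Made in the Dark".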
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display the three sentence snippet of our article about the critical response section of the article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely do a lot to learn from what people like about ChatGPT and apply to Wikipedia search.
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scraping publicly visible text for text-based AIs like ChatGPT means scraping copyrighted text. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes its sources, because they did not acquire content in legal and ethical ways [1]. Large language models won't be large, and releases won't happen fast, if they actually start acquiring content gradually from trustworthy sources. There is a reason it took so many years for hundreds of thousands of Wikimedians to take the Wikipedias in different languages to where they are.
1. https://time.com/6247678/openai-chatgpt-kenya-workers/
Subhashish
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display a three-sentence snippet from the critical response section of our article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely learn a lot from what people like about ChatGPT and apply it to Wikipedia search.
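To make the gap concrete: the public MediaWiki Action API already exposes full-text search, and the ranking problems described above live behind that interface. A minimal sketch of building such a query (the endpoint and the list=search parameters are the standard API; the helper name and example query are just illustrative):

```python
from urllib.parse import urlencode

# Standard MediaWiki Action API endpoint for English Wikipedia.
API = "https://en.wikipedia.org/w/api.php"

def search_url(query, limit=5):
    # list=search is the built-in full-text (CirrusSearch) interface;
    # any ranking/recall improvements would happen server-side.
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = search_url("Made in the Dark Hot Chip album")
print(url)
```

Fetching that URL returns JSON search hits; the point is that the plumbing exists, and it is the relevance of what comes back that needs work.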
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
--
Boodarwun Gnangarra
'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training large models is a very sophisticated and technical process. Data annotation, among many other forms of labour, is done by real people. The article I had linked earlier tells a lot about the real-world consequences of AI. I'm certain AI/ML systems, especially language models like ChatGPT, are far from innocent looking/reading. For starters, derivatives of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT, and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scraping publicly visible text for text-based AIs like ChatGPT would mean scraping copyrighted text. But even reusing CC BY-SA content would require attribution. None of the AI platforms attribute their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds of thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
[1] https://time.com/6247678/openai-chatgpt-kenya-workers/
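For illustration, the attribution itself is mechanically cheap. A reuser of CC BY-SA text could emit a credit line like the sketch below (the function and all names are hypothetical; a real contributor list could come from the page history, e.g. via the MediaWiki API's prop=contributors):

```python
def attribution_line(title, url, contributors, license_name="CC BY-SA 4.0"):
    # Build a minimal credit line of the kind CC BY-SA reuse expects:
    # work title, source link, authors, and license. Contributor names
    # here are placeholders, not a real article history.
    names = ", ".join(contributors[:3])
    more = " et al." if len(contributors) > 3 else ""
    return f'"{title}" ({url}) by {names}{more}, licensed under {license_name}'

line = attribution_line(
    "Example article",
    "https://en.wikipedia.org/wiki/Example",
    ["Editor A", "Editor B", "Editor C", "Editor D"],
)
print(line)
```

That a one-line credit is this easy to generate is exactly why its absence reads as a choice rather than a technical limitation.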
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood <peter.southwood@telkomsa.net> wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge as detecting these AI tools adding content, whether media or articles, along with identifying when they are in use by our sources. The failing of all new AI is not in its ability but in the lack of transparency: readers cannot identify when it has been used. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures created by copying someone else's work yet not acknowledged as being derivative of any kind.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
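As a sketch of the A/B mechanics described above, variant assignment can be made deterministic by hashing, so the same end-user always sees the same variant for a given experiment with no per-user state stored server-side (everything here, names included, is a hypothetical illustration, not an existing Wikimedia API):

```python
import hashlib

def ab_bucket(user_id, experiment, variants=("A", "B")):
    # Hash the (experiment, user) pair so assignment is stable across
    # sessions and uncorrelated between different experiments.
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return variants[h[0] % len(variants)]

print(ab_bucket("user-123", "chatbot-snippet-test"))
```

The feedback buttons and behavioral data mentioned above would then be aggregated per bucket to compare engagement between variants.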
Best regards,
Adam
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models.” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not already know. Someone will suffer a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI-generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
From: Subhashish [mailto:psubhashish@gmail.com] Sent: 05 February 2023 06:37 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training with large models is a very sophisticated and technical process. Data annotation among many other forms of labour are done by real people. the article I had linked earlier tells a lot about the real world consequences of AI. I'm certain AI/ML, especially when we're talking about language models like ChatGPT, are far from innocent looking/reading. For starters, derivative of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scaping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds and thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
1. https://time.com/6247678/openai-chatgpt-kenya-workers/
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood peter.southwood@telkomsa.net wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
From: Gnangarra [mailto:gnangarra@gmail.com] Sent: 04 February 2023 17:04 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency with that being able to be identified by the readers. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures that have been created by copying someone else's work yet not acknowledging it as being derivative of any kind.
Our big problems will be in ensuring that copyright is respected in legally, and not hosting anything that is even remotely dubious
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
Best regards,
Adam
_____
From: Victoria Coleman vstavridoucoleman@gmail.com Sent: Saturday, February 4, 2023 8:10 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that?s what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example to your point below on data sets, we have the amazing Wikidata as well the excellent work on abstract Wikipedia. We have Wikipedia Enterprise which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
?Hi,
On the product side, NLP based AI biggest concern to me is that it would drastically decrease traffic to our websites/apps. Which means less new editors ans less donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists, I can assure you, most companies would be happy to launch a search product with a 80% confidence in answers quality.
From a financial perspective, large industrial investment like this are usually a pool of money you can draw from in x years. You can expect they did not draw all of it yet.
Second, GPT 3 and ChatGPT are far from being the most expensive products they have. On top of people you need:
* datasets
* people to tag the dataset
* people to correct the algo
* computing power
I simplify here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not discard the option of the movement doing it so easily. That being said, it would mean a new project with the need of substantial ressources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
?
With respect to cloud computing costs, these being a significant component of the costs to train and operate modern AI systems, as a non-profit organization, the Wikimedia Foundation might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and non-profit researchers with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-...
[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
_____
From: Steven Walling steven.walling@gmail.com Sent: Saturday, February 4, 2023 1:59 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews", the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant, because it doesn't know anything from 2022. If we could just manage to display a three-sentence snippet from the critical response section of the article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely learn a lot from what people like about ChatGPT and apply it to Wikipedia search.
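To Steven's point about recall, the public MediaWiki Action API already exposes a full-text search endpoint (`action=query&list=search`) that one could experiment with today. A minimal sketch in Python; the helper names are my own, only the API parameters are the standard ones:

```python
# Sketch: querying English Wikipedia's full-text search via the public
# MediaWiki Action API. Function names are illustrative, not an official client.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def build_search_url(query: str, limit: int = 5) -> str:
    """Build a full-text search request URL against the MediaWiki API."""
    params = {
        "action": "query",
        "list": "search",      # full-text search, better recall than typeahead
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def search(query: str, limit: int = 5) -> list[str]:
    """Return matching page titles (requires network access)."""
    with urlopen(build_search_url(query, limit)) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]

# Example (requires network):
#   search("Made in the Dark Hot Chip")
```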
_______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
And this is a problem.
If ChatGPT uses open content, there is an infringement of the license.
Specifically, of the CC BY-SA if it uses Wikipedia; in that case, attribution must be present.
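For illustration only, the attribution notice CC BY-SA expects is not hard to generate mechanically. A hypothetical helper (the wording is illustrative, not legal advice):

```python
# Hypothetical sketch: building the attribution notice CC BY-SA requires
# when reusing Wikipedia text. Wording is illustrative only.
def cc_by_sa_attribution(title: str) -> str:
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    return (f'Contains text adapted from the Wikipedia article "{title}" '
            f"({url}), licensed under CC BY-SA.")
```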
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
*From:* Subhashish [mailto:psubhashish@gmail.com] *Sent:* 05 February 2023 06:37 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training large models is a very sophisticated and technical process. Data annotation, among many other forms of labour, is done by real people. The article I linked earlier tells a lot about the real-world consequences of AI. I'm certain AI/ML systems, especially language models like ChatGPT, are far from innocent looking/reading. For starters, derivatives of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT, and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scraping publicly visible text for text-based AIs like ChatGPT would mean scraping copyrighted text. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources, because they did not acquire content in legal and ethical ways [1]. Large language models won't be large, and releases won't happen fast, if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds of thousands of Wikimedians to take the Wikipedias in different languages to where they are, for a reason.
[1] https://time.com/6247678/openai-chatgpt-kenya-workers/
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood peter.southwood@telkomsa.net wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge as detecting these AI tools adding content, whether media or articles, along with identifying when they are in use by our sources. The failing of all new AI is not in its ability but in the lack of transparency: readers cannot tell when it has been used. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures created by copying someone else's work without acknowledging it as derivative in any way.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
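Adam's query-string idea could be sketched as follows; the parameter names (`utm_*`, `dialogue_turn`) are illustrative assumptions, not an existing Wikimedia convention:

```python
# Sketch: a chatbot "read more" button linking back to a Wikipedia article,
# carrying dialogue context as URL query-string arguments so bulk usage
# data (and A/B test arms) can be analyzed later. Parameter names are
# hypothetical.
from urllib.parse import urlencode

def read_more_link(title: str, dialogue_turn: int, variant: str) -> str:
    """Build an article link annotated with the dialogue context it came from."""
    base = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    context = {
        "utm_source": "chatbot",       # where the click originated
        "utm_campaign": variant,       # A/B test arm
        "dialogue_turn": dialogue_turn,  # position in the conversation
    }
    return base + "?" + urlencode(context)
```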
Best regards,
Adam
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from ChatGPT, but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF, as the tech provider for the mission (because first and foremost, in my view, that's what the WMF is, as well as the financial engine of the movement, of course), needs to pay attention and experiment to maintain the long-term viability of the mission. In fact, I think the cluster of our projects offers compelling options. For example, to your point below on data sets, we have the amazing Wikidata as well as the excellent work on Abstract Wikipedia. We have Wikimedia Enterprise, which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
Hi,
On the product side, my biggest concern with NLP-based AI is that it would drastically decrease traffic to our websites and apps, which means fewer new editors and fewer donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists; I can assure you that most companies would be happy to launch a search product with an 80% confidence in answer quality.
From a financial perspective, large industrial investments like this are usually a pool of money you can draw from over x years. You can expect they have not drawn all of it yet.
Second, GPT-3 and ChatGPT are far from being the most expensive products they have. On top of people, you need:
* datasets
* people to tag the datasets
* people to correct the algorithms
* computing power
--
Boodarwun Gnangarra
'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
It would depend on whether it uses the text or the information/data. My guess is that the more it uses its own words, the more drift in meaning there will be, and the less reliable the result, but I have no way to test this hypothesis.
Cheers, Peter
From: Ilario Valdelli [mailto:valdelli@gmail.com] Sent: 06 February 2023 09:38 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
And this is a problem.
If ChatGPT uses open content, there is an infringement of the license.
Specifically, of the CC BY-SA if it uses Wikipedia; in that case, attribution must be present.
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
From: Subhashish [mailto:psubhashish@gmail.com] Sent: 05 February 2023 06:37 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training with large models is a very sophisticated and technical process. Data annotation among many other forms of labour are done by real people. the article I had linked earlier tells a lot about the real world consequences of AI. I'm certain AI/ML, especially when we're talking about language models like ChatGPT, are far from innocent looking/reading. For starters, derivative of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scaping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds and thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
1. https://time.com/6247678/openai-chatgpt-kenya-workers/
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood peter.southwood@telkomsa.net wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
From: Gnangarra [mailto:gnangarra@gmail.com] Sent: 04 February 2023 17:04 To: Wikimedia Mailing List Subject: [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency with that being able to be identified by the readers. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures that have been created by copying someone else's work yet not acknowledging it as being derivative of any kind.
Our big problems will be in ensuring that copyright is respected in legally, and not hosting anything that is even remotely dubious
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
Best regards,
Adam
_____
From: Victoria Coleman vstavridoucoleman@gmail.com Sent: Saturday, February 4, 2023 8:10 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that?s what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example to your point below on data sets, we have the amazing Wikidata as well the excellent work on abstract Wikipedia. We have Wikipedia Enterprise which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
?Hi,
On the product side, NLP based AI biggest concern to me is that it would drastically decrease traffic to our websites/apps. Which means less new editors ans less donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists, I can assure you, most companies would be happy to launch a search product with a 80% confidence in answers quality.
From a financial perspective, large industrial investment like this are usually a pool of money you can draw from in x years. You can expect they did not draw all of it yet.
Second, GPT 3 and ChatGPT are far from being the most expensive products they have. On top of people you need:
* datasets
* people to tag the dataset
* people to correct the algo
* computing power
I simplify here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not discard the option of the movement doing it so easily. That being said, it would mean a new project with the need of substantial ressources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
?
With respect to cloud computing costs, these being a significant component of the costs to train and operate modern AI systems, as a non-profit organization, the Wikimedia Foundation might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and non-profit researchers with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-...
[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
_____
From: Steven Walling steven.walling@gmail.com Sent: Saturday, February 4, 2023 1:59 AM To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org Subject: [Wikimedia-l] Re: Chat GPT
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
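For anyone who wants to reproduce this, here is a minimal Python sketch of the kind of query involved, using the public MediaWiki full-text search API (the helper names are mine, not WMF code); comparing its hits against the typeahead suggestions makes the recall gap easy to see:

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def build_search_params(query: str, limit: int = 5) -> dict:
    """Parameters for a full-text search via the public MediaWiki API."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }

def search_titles(query: str, limit: int = 5) -> list:
    """Return the titles of the top full-text hits (performs a network call)."""
    url = API + "?" + urlencode(build_search_params(query, limit))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]
```

Running `search_titles("Hot Chip third album")` against the live API is an easy way to compare full-text recall with what the typeahead offers for the same words.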
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display the three sentence snippet of our article about the critical response section of the article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely do a lot to learn from what people like about ChatGPT and apply to Wikipedia search.
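The three-sentence snippet part is already within reach via the TextExtracts API; a rough Python sketch (my own helpers, and note that TextExtracts serves lead text, so surfacing a specific section such as "Critical response" would need extra work with revision content and section parsing):

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def build_extract_params(title: str, sentences: int = 3) -> dict:
    """Parameters for a short plain-text extract via the TextExtracts API."""
    return {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1,       # plain text instead of HTML
        "exsentences": sentences,
        "format": "json",
    }

def extract_snippet(title: str, sentences: int = 3) -> str:
    """Fetch a short plain-text snippet of an article (network call)."""
    url = API + "?" + urlencode(build_extract_params(title, sentences))
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```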
_______________________________________________ Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
Hi,
This podcast might be interesting for some on this thread: https://www.nytimes.com/2023/02/15/podcasts/the-daily/chat-gpt-microsoft-bin...
There might be a chance that something different or new is happening.
Who knows...
Best,
On Mon, Feb 6, 2023, 07:26 Peter Southwood peter.southwood@telkomsa.net wrote:
It would depend on whether it uses the text or the information/data. My guess is that the more it uses its own words, the more drift in meaning there will be, and the less reliable the result, but I have no way to test this hypothesis.
Cheers, Peter
*From:* Ilario Valdelli [mailto:valdelli@gmail.com] *Sent:* 06 February 2023 09:38 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
And this is a problem.
If ChatGPT uses open content, it infringes the license.
Specifically the CC BY-SA if it uses Wikipedia: in that case, attribution must be present.
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
*From:* Subhashish [mailto:psubhashish@gmail.com] *Sent:* 05 February 2023 06:37 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training large models is a very sophisticated and technical process. Data annotation, among many other forms of labour, is done by real people. The article I linked earlier tells a lot about the real-world consequences of AI. I'm certain AI/ML systems, especially language models like ChatGPT, are far from innocent looking/reading. For starters, derivatives of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT, and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scraping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds of thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood < peter.southwood@telkomsa.net> wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content, whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency: readers cannot identify when it has been used. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures that have been created by copying someone else's work without acknowledging them as derivatives of any kind.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
Best regards,
Adam
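The query-string navigation idea above can be sketched in a few lines of Python; the parameter names here are invented for illustration, not an existing Wikimedia convention:

```python
from urllib.parse import urlencode, quote

def chatbot_link(title: str, context: dict) -> str:
    """Build a Wikipedia article URL carrying dialogue-context parameters in
    the query string, so bulk navigation data could later be mined from
    request logs. The parameter names are hypothetical."""
    base = "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))
    return base + "?" + urlencode(context)

link = chatbot_link("The Menu (2022 film)", {
    "wm_src": "chatbot",   # hypothetical: which conversational client referred the reader
    "wm_turn": 7,          # hypothetical: the dialogue turn navigated from
})
```

An HTTP POST variant would keep the context out of the visible URL, at the cost of links that can't be bookmarked or shared.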
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that's what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example to your point below on data sets, we have the amazing Wikidata as well the excellent work on abstract Wikipedia. We have Wikipedia Enterprise which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner christophe.henner@gmail.com wrote:
Hi,
On the product side, my biggest concern with NLP-based AI is that it would drastically decrease traffic to our websites and apps, which means fewer new editors and fewer donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists; I can assure you, most companies would be happy to launch a search product with an 80% confidence in answer quality.
From a financial perspective, large industrial investments like this are usually a pool of money you can draw from over x years. You can expect they have not drawn all of it yet.
Second, GPT-3 and ChatGPT are far from being their most expensive products. On top of people, you need:
datasets
people to tag the dataset
people to correct the algo
computing power
This is almost definitely the case.
On Mon, Feb 6, 2023, 2:39 AM Ilario Valdelli valdelli@gmail.com wrote:
And this is a problem.
If ChatGPT uses open content, it infringes the license.
Specifically the CC BY-SA if it uses Wikipedia: in that case, attribution must be present.
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
*From:* Subhashish [mailto:psubhashish@gmail.com] *Sent:* 05 February 2023 06:37 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training with large models is a very sophisticated and technical process. Data annotation among many other forms of labour are done by real people. the article I had linked earlier tells a lot about the real world consequences of AI. I'm certain AI/ML, especially when we're talking about language models like ChatGPT, are far from innocent looking/reading. For starters, derivative of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scaping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds and thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood < peter.southwood@telkomsa.net> wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency with that being able to be identified by the readers. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures that have been created by copying someone else's work yet not acknowledging it as being derivative of any kind.
Our big problems will be in ensuring that copyright is respected in legally, and not hosting anything that is even remotely dubious
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
Best regards,
Adam
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that?s what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example to your point below on data sets, we have the amazing Wikidata as well the excellent work on abstract Wikipedia. We have Wikipedia Enterprise which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner < christophe.henner@gmail.com> wrote:
?Hi,
On the product side, NLP based AI biggest concern to me is that it would drastically decrease traffic to our websites/apps. Which means less new editors ans less donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists, I can assure you, most companies would be happy to launch a search product with a 80% confidence in answers quality.
From a financial perspective, large industrial investment like this are usually a pool of money you can draw from in x years. You can expect they did not draw all of it yet.
Second, GPT 3 and ChatGPT are far from being the most expensive products they have. On top of people you need:
datasets
people to tag the dataset
people to correct the algo
computing power
I simplify here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not discard the option of the movement doing it so easily. That being said, it would mean a new project with the need of substantial ressources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
?
With respect to cloud computing costs, these being a significant component of the costs to train and operate modern AI systems, as a non-profit organization, the Wikimedia Foundation might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and *non-profit researchers* with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-...
[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
*From:* Steven Walling steven.walling@gmail.com *Sent:* Saturday, February 4, 2023 1:59 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display the three sentence snippet of our article about the critical response section of the article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely do a lot to learn from what people like about ChatGPT and apply to Wikipedia search.
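The "three sentence snippet" idea above is already close to what the public MediaWiki API can do. A rough sketch, not an official WMF tool: answering a free-text query with a short plain-text snippet via the action=query API, using only the standard library. list=search and prop=extracts (TextExtracts) are documented API modules; the three-sentence limit and the User-Agent string are this sketch's own choices.

```python
# Sketch: full-text search plus a short plain-text extract, via the
# public MediaWiki action=query API (stdlib only).
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_url(params):
    return API + "?" + urllib.parse.urlencode({**params, "format": "json"})

def api_get(params):
    req = urllib.request.Request(build_url(params),
                                 headers={"User-Agent": "snippet-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def first_hit_snippet(query, sentences=3):
    # list=search returns ranked titles even for non-exact queries.
    hits = api_get({"action": "query", "list": "search",
                    "srsearch": query, "srlimit": 1})["query"]["search"]
    if not hits:
        return None
    title = hits[0]["title"]
    # TextExtracts: a plain-text extract of the first few sentences.
    pages = api_get({"action": "query", "prop": "extracts", "explaintext": 1,
                     "exsentences": sentences, "titles": title})["query"]["pages"]
    return title, next(iter(pages.values())).get("extract", "")
```

For "The Menu reviews", first_hit_snippet would search, take the top-ranked title, and return its opening sentences; ranking quality is exactly the recall problem described above.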
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
--
Boodarwun Gnangarra
'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
Hi. Thank you for your cooperation.
On Thursday, February 16, 2023 at 18:18, The Cunctator cunctator@gmail.com wrote:
This is almost definitely the case.
On Mon, Feb 6, 2023, 2:39 AM Ilario Valdelli valdelli@gmail.com wrote:
And this is a problem.
If ChatGPT uses open content without attribution, there is a license infringement.
Specifically of CC BY-SA if it uses Wikipedia: in that case the attribution must be present.
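To make the attribution requirement concrete, here is a minimal sketch of one possible shape for such a notice. This is illustrative only, not legal advice; the license name and URL below assume CC BY-SA 3.0, which was Wikipedia's text license at the time of this thread, and the exact wording is this sketch's invention.

```python
# Illustrative only, not legal advice: one possible shape for an
# attribution notice when reusing Wikipedia text.
def attribution(title, lang="en",
                license_name="CC BY-SA 3.0",
                license_url="https://creativecommons.org/licenses/by-sa/3.0/"):
    page_url = f"https://{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
    return (f'Text from the Wikipedia article "{title}" ({page_url}), '
            f"licensed under {license_name} ({license_url}).")
```

The point is how little it takes: a title, a URL, and a license link. Systems like ChatGPT ship none of the three.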
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
*From:* Subhashish [mailto:psubhashish@gmail.com] *Sent:* 05 February 2023 06:37 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training large models is a very sophisticated and technical process. Data annotation, among many other forms of labour, is done by real people. The article I had linked earlier tells a lot about the real-world consequences of AI. I'm certain AI/ML systems, especially language models like ChatGPT, are far from innocent looking/reading. For starters, derivatives of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT, and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scraping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large, and releases won't happen fast, if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds of thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood < peter.southwood@telkomsa.net> wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content, whether media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in its lack of transparency, which keeps readers from identifying when it is in use. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures created by copying someone else's work without acknowledging it as derivative of any kind.
Our big problems will be in ensuring that copyright is respected legally, and in not hosting anything that is even remotely dubious.
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query-string arguments or HTTP POST. In either case, bulk usage data, e.g., the dialogue contexts navigated from, could be useful.
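A hypothetical sketch of the query-string option described above: a chatbot deep-links into an article while carrying its dialogue context in the URL. The parameter names (source, dialogue_id, turn) are invented for illustration; no such parameters exist on Wikimedia sites today.

```python
# Hypothetical: deep-link to an article with invented query-string
# parameters carrying the chatbot dialogue context.
import urllib.parse

def wiki_link(title, dialogue_id, turn):
    base = ("https://en.wikipedia.org/wiki/"
            + urllib.parse.quote(title.replace(" ", "_")))
    ctx = urllib.parse.urlencode({"source": "chatbot",
                                  "dialogue_id": dialogue_id,
                                  "turn": turn})
    return f"{base}?{ctx}"
```

Aggregating the query strings server-side would yield exactly the bulk usage data mentioned above: which dialogue contexts send readers to which articles.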
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
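One standard way to implement the A/B assignment step described above is deterministic bucketing: hash a stable user identifier together with an experiment name, so each user always sees the same variant without any stored state. The variant names here are placeholders.

```python
# Sketch of deterministic A/B bucketing: stateless and stable per user.
import hashlib

def ab_bucket(user_id, experiment, variants=("control", "treatment")):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Because the bucket is a pure function of (experiment, user), the same user lands in different buckets across different experiments, which keeps experiments statistically independent.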
Best regards,
Adam
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT, but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF, as the tech provider for the mission (because first and foremost in my view that's what the WMF is, as well as the financial engine of the movement of course), needs to pay attention and experiment to maintain the long-term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example, to your point below on data sets, we have the amazing Wikidata as well as the excellent work on Abstract Wikipedia. We have Wikimedia Enterprise, which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner < christophe.henner@gmail.com> wrote:
Hi again,
There was a very quick follow up: https://www.nytimes.com/2023/02/17/podcasts/the-daily/the-online-search-wars...
If you found the prior podcast interesting, you won't regret checking this one out as well.
Best!
On Fri, Feb 17, 2023, 05:24 Ali Kia alikia621@gmail.com wrote:
Hi. Thank you for your cooperation.
در تاریخ پنجشنبه ۱۶ فوریهٔ ۲۰۲۳، ۱۸:۱۸ The Cunctator cunctator@gmail.com نوشت:
This is almost definitely the case.
On Mon, Feb 6, 2023, 2:39 AM Ilario Valdelli valdelli@gmail.com wrote:
And this is a problem.
If ChatGPT uses open content, there is an infringement of license.
Specifically the CC-by-sa if it uses Wikipedia. In this case the attribution must be present.
Kind regards
On Sun, 5 Feb 2023, 08:12 Peter Southwood, peter.southwood@telkomsa.net wrote:
“Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models” This may be a choice that comes back to bite them. Without citing their sources, they are unreliable as a source for anything one does not know already. Someone will have a bad consequence from relying on the information and will sue the publisher. It will be interesting to see how they plan to weasel their way out of legal responsibility while retaining any credibility. My guess is there will be a requirement to state that the information is AI generated and of entirely unknown and untested reliability. How soon to the first class action, I wonder. Lots of money for the lawyers. Cheers, Peter.
*From:* Subhashish [mailto:psubhashish@gmail.com] *Sent:* 05 February 2023 06:37 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
Just to clarify, my point was not about Getty to begin with. Whether Getty would win and whether a big corporation should own such a large amount of visual content are questions outside this particular thread. It would certainly be interesting to see how things roll.
But AI/ML is way more than just looking. Training with large models is a very sophisticated and technical process. Data annotation among many other forms of labour are done by real people. the article I had linked earlier tells a lot about the real world consequences of AI. I'm certain AI/ML, especially when we're talking about language models like ChatGPT, are far from innocent looking/reading. For starters, derivative of works, except Public Domain ones, must attribute the authors. Any provision for attribution is deliberately removed from systems like ChatGPT and that only gives corporations like OpenAI a free ride sans accountability.
Subhashish
On Sat, Feb 4, 2023, 4:41 PM Todd Allen toddmallen@gmail.com wrote:
I'm not so sure Getty's got a case, though. If the images are on the Web, is using them to train an AI something copyright would cover? That to me seems more equivalent to just looking at the images, and there's no copyright problem in going to Getty's site and just looking at a bunch of their pictures.
But it will be interesting to see how that one shakes out.
Todd
On Sat, Feb 4, 2023 at 11:47 AM Subhashish psubhashish@gmail.com wrote:
Not citing sources is probably a conscious design choice, as citing sources would mean sharing the sources used to train the language models. Getty has just sued Stability AI, alleging the use of 12 million photographs without permission or compensation. Imagine if Stability had to purchase from Getty through a legal process. For starters, Getty might not have agreed in the first place. Bulk-scaping publicly visible text in text-based AIs like ChatGPT would mean scraping text with copyright. But even reusing CC BY-SA content would require attribution. None of the AI platforms attributes their sources because they did not acquire content in legal and ethical ways [1]. Large language models won't be large and releases won't happen fast if they actually start acquiring content gradually from trustworthy sources. It took so many years for hundreds and thousands of Wikimedians to take Wikipedias in different languages to where they are for a reason.
Subhashish
On Sat, Feb 4, 2023 at 1:06 PM Peter Southwood < peter.southwood@telkomsa.net> wrote:
From what I have seen the AIs are not great on citing sources. If they start citing reliable sources, their contributions can be verified, or not. If they produce verifiable, adequately sourced, well written information, are they a problem or a solution?
Cheers,
Peter
*From:* Gnangarra [mailto:gnangarra@gmail.com] *Sent:* 04 February 2023 17:04 *To:* Wikimedia Mailing List *Subject:* [Wikimedia-l] Re: Chat GPT
I see our biggest challenge is going to be detecting these AI tools adding content whether it's media or articles, along with identifying when they are in use by sources. The failing of all new AI is not in its ability but in the lack of transparency with that being able to be identified by the readers. We have seen people impersonating musicians and writing songs in their style. We have also seen pictures that have been created by copying someone else's work yet not acknowledging it as being derivative of any kind.
Our big problems will be in ensuring that copyright is respected in legally, and not hosting anything that is even remotely dubious
On Sat, 4 Feb 2023 at 22:24, Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming on how to drive traffic to Wikimedia content from conversational media, UI/UX designers could provide menu items or buttons on chatbots' applications or webpage components (e.g., to read more about the content, to navigate to cited resources, to edit the content, to discuss the content, to upvote/downvote the content, to share the content or the recent dialogue history on social media, to request review/moderation/curation for the content, etc.). Many of these envisioned menu items or buttons would operate contextually during dialogues, upon the most recent (or otherwise selected) responses provided by the chatbot or upon the recent transcripts. Some of these features could also be made available to end-users via spoken-language commands.
At any point during hypertext-based dialogues, end-users would be able to navigate to Wikimedia content. These navigations could utilize either URL query string arguments or HTTP POST. In either case, bulk usage data, e.g., those dialogue contexts navigated from, could be useful.
The capability to perform A/B testing across chatbots’ dialogues, over large populations of end-users, could also be useful. In this way, Wikimedia would be better able to: (1) measure end-user engagement and satisfaction, (2) measure the quality of provided content, (3) perform personalization, (4) retain readers and editors. A/B testing could be performed by providing end-users with various feedback buttons (as described above). A/B testing data could also be obtained through data mining, analyzing end-users’ behaviors, response times, responses, and dialogue moves. These data could be provided for the community at special pages and could be made available per article, possibly by enhancing the “Page information” system. One can also envision these kinds of analytics data existing at the granularity of portions of, or selections of, articles.
Best regards,
Adam
*From:* Victoria Coleman vstavridoucoleman@gmail.com *Sent:* Saturday, February 4, 2023 8:10 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
Hi Christophe,
I had not thought about the threat to Wikipedia traffic from Chat GPT but you have a good point. The success of the projects is always one step away from the next big disruption. So the WMF as the tech provider for the mission (because first and foremost in my view that's what the WMF is - as well as the financial engine of the movement of course) needs to pay attention and experiment to maintain the long term viability of the mission. In fact I think the cluster of our projects offers compelling options. For example to your point below on data sets, we have the amazing Wikidata as well as the excellent work on abstract Wikipedia. We have Wikipedia Enterprise which has built some avenues of collaboration with big tech. A bold vision is needed to bring all of it together and build an MVP for the community to experiment with.
Best regards,
Victoria Coleman
On Feb 4, 2023, at 4:14 AM, Christophe Henner < christophe.henner@gmail.com> wrote:
Hi,
On the product side, my biggest concern with NLP-based AI is that it would drastically decrease traffic to our websites/apps. Which means fewer new editors and fewer donations.
So first from a strictly positioning perspective, we have here a major change that needs to be managed.
And to be honest, it will come faster than we think. We are perfectionists; I can assure you, most companies would be happy to launch a search product with an 80% confidence in answer quality.
From a financial perspective, large industrial investments like this are usually a pool of money you can draw from over x years. You can expect they have not drawn all of it yet.
Second, GPT 3 and ChatGPT are far from being the most expensive products they have. On top of people you need:
datasets
people to tag the dataset
people to correct the algo
computing power
I'm simplifying here, but we already have the capacity to muster some of that, which drastically lowers our costs :)
I would not discard the option of the movement doing it so easily. That being said, it would mean a new project requiring substantial resources.
Sent from my iPhone
On Feb 4, 2023, at 9:30 AM, Adam Sobieski adamsobieski@hotmail.com wrote:
With respect to cloud computing costs, these being a significant component of the costs to train and operate modern AI systems, as a non-profit organization, the Wikimedia Foundation might be interested in the National Research Cloud (NRC) policy proposal: https://hai.stanford.edu/policy/national-research-cloud .
"Artificial intelligence requires vast amounts of computing power, data, and expertise to train and deploy the massive machine learning models behind the most advanced research. But access is increasingly out of reach for most colleges and universities. A National Research Cloud (NRC) would provide academic and *non-profit researchers* with the compute power and government datasets needed for education and research. By democratizing access and equity for all colleges and universities, an NRC has the potential not only to unleash a string of advancements in AI, but to help ensure the U.S. maintains its leadership and competitiveness on the global stage.
"Throughout 2020, Stanford HAI led efforts with 22 top computer science universities along with a bipartisan, bicameral group of lawmakers proposing legislation to bring the NRC to fruition. On January 1, 2021, the U.S. Congress authorized the National AI Research Resource Task Force Act as part of the National Defense Authorization Act for Fiscal Year 2021. This law requires that a federal task force be established to study and provide an implementation pathway to create world-class computational resources and robust government datasets for researchers across the country in the form of a National Research Cloud. The task force will issue a final report to the President and Congress next year.
"The promise of an NRC is to democratize AI research, education, and innovation, making it accessible to all colleges and universities across the country. Without a National Research Cloud, all but the most elite universities risk losing the ability to conduct meaningful AI research and to adequately educate the next generation of AI researchers."
See also: [1][2]
[1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial-...
[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
*From:* Steven Walling steven.walling@gmail.com *Sent:* Saturday, February 4, 2023 1:59 AM *To:* Wikimedia Mailing List wikimedia-l@lists.wikimedia.org *Subject:* [Wikimedia-l] Re: Chat GPT
On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza gtisza@gmail.com wrote:
Just to give a sense of scale: OpenAI started with a $1 billion donation, got another $1B as investment, and is now getting a larger investment from Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of their previous funding, which seems likely, their operational costs are in the ballpark of $300 million per year. The idea that the WMF could just choose to create conversational software of a similar quality if it wanted seems detached from reality to me.
Without spending billions on LLM development to aim for a conversational chatbot trying to pass a Turing test, we could definitely try to catch up to the state of the art in search results. Our search currently does a pretty bad job (in terms of recall especially). Today's featured article in English is the Hot Chip album "Made in the Dark", and if I enter anything but the exact article title the typeahead results are woefully incomplete or wrong. If I ask an actual question, good luck.
Google is feeling vulnerable to OpenAI here in part because everyone can see that their results are often full of low quality junk created for SEO, while ChatGPT just gives a concise answer right there.
https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the top viewed English articles. If I search "The Menu reviews" the Google results are noisy and not so great. ChatGPT actually gives you nothing relevant because it doesn't know anything from 2022. If we could just manage to display a three-sentence snippet from the critical response section of our article, it would be awesome. It's too bad that the whole "knowledge engine" debacle poisoned the well when it comes to a Wikipedia search engine, because we could definitely learn a lot from what people like about ChatGPT and apply it to Wikipedia search.
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/... To unsubscribe send an email to wikimedia-l-leave@lists.wikimedia.org
--
Boodarwun Gnangarra
'ngany dabakarn koorliny arn boodjera dardoon ngalang Nyungar koortaboodjar'
There are several incompatibilities between Wikimedia projects and AI systems like ChatGPT.
1st: verifiability. Wikipedia is not a primary source; references are essential. ChatGPT makes statements but provides no references to support them.
2nd: bias. In Wikipedia, all positions on an issue must be represented. ChatGPT is not able to describe the different positions; it generally takes only one.
3rd: disambiguation. Systems like ChatGPT do not handle disambiguation well. When a term is ambiguous, they appear to pick only one meaning.
4th: neutral point of view. Systems like ChatGPT do not give neutral answers; they are frequently trained toward a specific answer.
However, I personally consider that investing in AI makes sense, because AI is making a lot of progress and Wikimedia projects can benefit.
But ChatGPT is a bad example for Wikimedia projects.
Kind regards
On Fri, 30 Dec 2022, 01:09 Victoria Coleman, vstavridoucoleman@gmail.com wrote:
Hi everyone. I have seen some of the reactions to the narratives generated by Chat GPT. There is an obvious question (to me at least) as to whether a Wikipedia chat bot would be a legitimate UI for some users. To that end, I would have hoped that it would have been developed by the WMF but the Foundation has historically massively underinvested in AI. That said, and assuming that GPT Open source licensing is compatible with the movement norms, should the WMF include that UI in the product?
My other question is around the corpus that Open AI is using to train the bot. It is creating very fluid narratives that are massively false in many cases. Are they training on Wikipedia? Something else?
And to my earlier question, if GPT were to be trained on Wikipedia exclusively would that help abate the false narratives?
This is a significant matter for the community and seeing us step to it would be very encouraging.
Best regards,
Victoria Coleman