Re: Chat GPT - Wikimedia-l

4 Feb 2023

Hi,

I think the Wikimedia community is generally well-positioned to create
high-quality training data for machine learning models. Improving the
crowdsourcing of Wikidata and structured data is therefore essential for
turning it into easy-to-use, curated training data. The focus should be on
making our data more widely used and on using existing open-source NLP/NLU
models from organizations such as universities, EleutherAI, Facebook, etc.,
rather than developing new models from scratch ourselves.

The bottleneck in utilizing these is the need for more human skills, which
can be addressed through documentation and examples demonstrating how to
use machine learning tools in real-life use cases such as image
classification, description/summary generation, or automated error testing.
It would also be essential to develop ML tools that can run on commodity
hardware, currently GPUs with 24 GB of RAM, for broader
accessibility. These could run on people's computers, home labs, and
hacklabs. It would also steer our development toward less
resource-intensive ML tools.
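
As a rough illustration of what the 24 GB limit means in practice, here is a
minimal back-of-the-envelope sketch. It assumes that the model weights
dominate memory use and ignores activations, caches, and framework overhead,
so real headroom will be smaller. The function names and the round parameter
counts are hypothetical and chosen only for illustration; 1 GB is treated as
10**9 bytes.

```python
# Weights-only memory estimate: parameter count x bytes per parameter.
# Assumption: activations, caches and framework overhead are ignored.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(n_params: float, precision: str) -> float:
    """Approximate size of the model weights in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits(n_params: float, precision: str, gpu_gb: float = 24.0) -> bool:
    """True if the weights alone fit in gpu_gb of GPU memory."""
    return weights_gb(n_params, precision) <= gpu_gb


if __name__ == "__main__":
    # Illustrative round parameter counts, not exact figures for any model.
    for n_params in (6e9, 20e9):
        for precision in ("fp32", "fp16", "int8"):
            size = weights_gb(n_params, precision)
            print(f"{n_params:.0e} params @ {precision}: "
                  f"{size:.0f} GB, fits={fits(n_params, precision)}")
```

Under these assumptions, a model with about 6 billion parameters needs about
12 GB for its weights at 16-bit precision, while one with about 20 billion
parameters only fits once it is quantized to 8 bits.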

Br,
-- Kimmo Virtanen, Zache

 On Sat, Feb 4, 2023 at 12:23 PM Christophe Henner
 <christophe.henner(a)gmail.com> wrote:

  Hi,

 On the product side, my biggest concern with NLP-based AI is that it would
 drastically decrease traffic to our websites/apps, which means fewer new
 editors and fewer donations.

 So first from a strictly positioning perspective, we have here a major
 change that needs to be managed.

 And to be honest, it will come faster than we think. We are
 perfectionists; I can assure you, most companies would be happy to launch a
 search product with an 80% confidence in answer quality.

 From a financial perspective, large industrial investments like this are
 usually a pool of money you can draw from over x years. You can expect they
 have not drawn all of it yet.

 Second, GPT-3 and ChatGPT are far from being the most expensive products
 they have. On top of people, you need:
 * datasets
 * people to tag the dataset
 * people to correct the algo
 * computing power

 I am simplifying here, but we already have the capacity to muster some of
 that, which drastically lowers our costs :)

 I would not discard the option of the movement doing it so easily. That
 being said, it would mean a new project needing substantial
 resources.

 Sent from my iPhone

 On Feb 4, 2023, at 9:30 AM, Adam Sobieski <adamsobieski(a)hotmail.com>
 wrote:

 Cloud computing costs are a significant component of the cost of training
 and operating modern AI systems. As a non-profit organization, the
 Wikimedia Foundation might be interested in the National Research Cloud
 (NRC) policy proposal:
 https://hai.stanford.edu/policy/national-research-cloud .

 "Artificial intelligence requires vast amounts of computing power, data,
 and expertise to train and deploy the massive machine learning models
 behind the most advanced research. But access is increasingly out of reach
 for most colleges and universities. A National Research Cloud (NRC) would
 provide academic and *non-profit researchers* with the compute power and
 government datasets needed for education and research. By democratizing
 access and equity for all colleges and universities, an NRC has the
 potential not only to unleash a string of advancements in AI, but to help
 ensure the U.S. maintains its leadership and competitiveness on the global
 stage.

 "Throughout 2020, Stanford HAI led efforts with 22 top computer science
 universities along with a bipartisan, bicameral group of lawmakers
 proposing legislation to bring the NRC to fruition. On January 1, 2021, the
 U.S. Congress authorized the National AI Research Resource Task Force Act
 as part of the National Defense Authorization Act for Fiscal Year 2021.
 This law requires that a federal task force be established to study and
 provide an implementation pathway to create world-class computational
 resources and robust government datasets for researchers across the country
 in the form of a National Research Cloud. The task force will issue a final
 report to the President and Congress next year.

 "The promise of an NRC is to democratize AI research, education, and
 innovation, making it accessible to all colleges and universities across
 the country. Without a National Research Cloud, all but the most elite
 universities risk losing the ability to conduct meaningful AI research and
 to adequately educate the next generation of AI researchers."

 See also: [1][2]

 [1] https://www.whitehouse.gov/ostp/news-updates/2023/01/24/national-artificial…
 [2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf

 ------------------------------
 *From:* Steven Walling <steven.walling(a)gmail.com>
 *Sent:* Saturday, February 4, 2023 1:59 AM
 *To:* Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
 *Subject:* [Wikimedia-l] Re: Chat GPT

 On Fri, Feb 3, 2023 at 9:47 PM Gergő Tisza <gtisza(a)gmail.com> wrote:

 Just to give a sense of scale: OpenAI started with a $1 billion donation,
 got another $1B as investment, and is now getting a larger investment from
 Microsoft (undisclosed but rumored to be $10B). Assuming they spent most of
 their previous funding, which seems likely, their operational costs are in
 the ballpark of $300 million per year. The idea that the WMF could just
 choose to create conversational software of a similar quality if it wanted
 seems detached from reality to me.

 Without spending billions on LLM development to aim for a
 conversational chatbot trying to pass a Turing test, we could definitely
 try to catch up to the state of the art in search results. Our search
 currently does a pretty bad job (in terms of recall especially). Today's
 featured article in English is the Hot Chip album "Made in the Dark", and
 if I enter anything but the exact article title the typeahead results are
 woefully incomplete or wrong. If I ask an actual question, good luck.

 Google is feeling vulnerable to OpenAI here in part because everyone can
 see that their results are often full of low-quality junk created for SEO,
 while ChatGPT just gives a concise answer right there.

 https://en.wikipedia.org/wiki/The_Menu_(2022_film) is one of the
 top-viewed English articles. If I search "The Menu reviews" the Google
 results are noisy and not so great. ChatGPT actually gives you nothing
 relevant because it doesn't know anything from 2022. If we could just
 manage to display a three-sentence snippet from the critical response
 section of our article, it would be awesome. It's too bad that the
 whole "knowledge engine" debacle poisoned the well when it comes to a
 Wikipedia search engine, because we could definitely do a lot to learn from
 what people like about ChatGPT and apply it to Wikipedia search.
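
 The three-sentence snippet idea could be sketched roughly as follows. This
 is only an illustration under stated assumptions: it assumes the plain text
 of the relevant section has already been retrieved by some other means, the
 function name section_snippet is hypothetical, and the sentence splitter is
 deliberately naive (it would split wrongly after abbreviations such as
 "U.S."). The sample text is a placeholder, not real article content.

```python
import re


def section_snippet(section_text: str, max_sentences: int = 3) -> str:
    """Return the first max_sentences sentences of a plain-text section."""
    # Naive split: after '.', '!' or '?', then whitespace, then an
    # uppercase letter or an opening double quote.
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z"])', section_text.strip())
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    # Placeholder text standing in for a section fetched elsewhere.
    sample = "First point. Second point! Third point? Fourth point."
    print(section_snippet(sample))
```

 A production search feature would need a proper sentence segmenter and a
 way to pick the section that matches the query; this sketch covers only the
 final trimming step.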

 _______________________________________________
 Wikimedia-l mailing list -- wikimedia-l(a)lists.wikimedia.org, guidelines
 at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
 https://meta.wikimedia.org/wiki/Wikimedia-l
 Public archives at

https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org…
 To unsubscribe send an email to wikimedia-l-leave(a)lists.wikimedia.org
