Wikianswers Proposal - Wikimedia-l

25 Sep 2023

Wikimedia,

Hello. After listening to the feedback from our previous discussion, I have revised the
Wikianswers proposal: https://meta.wikimedia.org/wiki/Wikianswers . I would also like to
call your attention to its technical discussion section:
https://meta.wikimedia.org/wiki/Wikianswers#Technical_discussion . The current version of
that section is included below.

Per the feedback, the revised proposal now includes two options: a sister project at a new
domain, e.g., https://en.wikianswers.org , or integration into the search systems of
Wikipedia, Wikidata, and Commons. Under the latter option, the LLM-generated responses to
end-users' questions would still be URL-addressable, human-editable content, e.g.:
https://en.wikipedia.org/qa/2b106ea8-4d1b-441f-9dc8-4555a9999ae9 .

Thank you for checking out the revised proposal and for any feedback.

Technical discussion
Overview
Relevant artificial intelligence topics include retrieval-augmented generation,
retrieval-augmented generation with guardrails, and agent-based approaches.
As presently considered, those parts of the question-and-answer data which could be
human-editable include: (1) the template of the prompts, (2) the task, (3) the retrieved
context data, (4) the questions, and (5) the answers.
The template is the overall structure of the prompts to the LLM. It includes some natural
language and slots where the other parts will be placed. The template should be locked so
as to be editable only by administrators. Editing it would invalidate every cached and
unlocked answer, meaning that every unlocked answer would be updated, refreshed, or
regenerated.
The task is an instruction, e.g., "You are a helpful system which will answer the
user's question using the following information". This should be locked so as to
be editable only by administrators. Editing this would invalidate every dependent cached
and unlocked answer, meaning that every unlocked answer would be updated, refreshed, or
regenerated.
The retrieved context data are chunks or excerpts, e.g., of Wikipedia articles, which
inform the answer to a particular question. Users could edit them, and each edit would
cascade, invalidating the dependent cached and unlocked answers. In terms of user
experience, editors might click a displayed chunk or excerpt to navigate to its location
in the source page and edit it there. Such updates to the underlying page would then
propagate to the chunk and to the dependent unlocked answers.
Editing the questions would be unusual, except to correct typographical errors.
The answers, abstractly, result from processing the other ingredients. They could be
edited by humans but, as described above, they could subsequently be revised by the system
through cascading updates, refreshes, or regenerations. In some cases, editors might want
to edit an answer and then lock it against subsequent revisions by the system.
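
As a minimal sketch of how the five parts above might fit together, the following Python
assembles a prompt from a template, a task, retrieved chunks, and a question. The template
text, the class name PromptParts, and the function name build_prompt are illustrative
assumptions, not part of the proposal.

```python
from dataclasses import dataclass
from string import Template
from typing import Tuple

# Hypothetical template text; the proposal does not specify one.
# The $-slots are where the other parts are placed.
PROMPT_TEMPLATE = Template(
    "$task\n\nInformation:\n$context\n\nQuestion: $question\nAnswer:"
)


@dataclass(frozen=True)
class PromptParts:
    task: str                 # administrator-editable instruction
    chunks: Tuple[str, ...]   # retrieved context excerpts
    question: str             # the end-user's question


def build_prompt(parts: PromptParts, template: Template = PROMPT_TEMPLATE) -> str:
    """Fill the template's slots with the task, numbered chunks, and question."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(parts.chunks, start=1)
    )
    return template.substitute(
        task=parts.task, context=context, question=parts.question
    )
```

The answer, the fifth part, would be whatever the LLM returns for this prompt, so it is
not a slot in the template.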
In conclusion, as presently considered, users would ordinarily tend to edit the retrieved
chunks of content drawn from Wikipedia pages, since these chunks augment the prompts to
the LLMs. Revisions to those pages would then cascade, automatically updating the
dependent unlocked answers.
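
One possible way to detect which answers need regenerating is to store, with each answer,
a hash of the inputs it was generated from and to compare that hash against the current
inputs. The sketch below assumes this hashing approach; the names Answer, input_hash, and
stale_answer_ids are hypothetical, and the proposal does not prescribe a mechanism.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def fingerprint(*parts: str) -> str:
    """Hash an ordered sequence of strings, separating them with a NUL byte."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


@dataclass
class Answer:
    question: str
    chunk_ids: List[str]   # ids of the chunks this answer depends on
    text: str = ""
    locked: bool = False   # locked answers are never regenerated
    inputs_hash: str = ""  # fingerprint of the inputs used at generation time


def input_hash(answer: Answer, template: str, task: str,
               chunks: Dict[str, str]) -> str:
    """Fingerprint of everything the answer currently depends on."""
    return fingerprint(template, task, answer.question,
                       *(chunks[c] for c in answer.chunk_ids))


def stale_answer_ids(answers: Dict[str, Answer], template: str, task: str,
                     chunks: Dict[str, str]) -> List[str]:
    """Return ids of unlocked answers whose inputs changed since generation."""
    stale = []
    for answer_id, answer in answers.items():
        if answer.locked:
            continue  # a human locked this answer against system revisions
        if input_hash(answer, template, task, chunks) != answer.inputs_hash:
            stale.append(answer_id)
    return stale
```

Under this sketch, editing the template or the task changes every fingerprint, so every
unlocked answer becomes stale, whereas editing one chunk affects only the answers that
list it in chunk_ids.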
Database schemas
Wikianswers database schemas would include one or more tables with vector columns for
embedding vectors. A project goal, then, would be to efficiently combine into a database
schema the existing concepts of revision tables, page tables, and text tables with the
newer concepts of embedding vectors and vector databases. Relevant tools include pgvector,
an open-source extension which adds vector-similarity search to PostgreSQL.
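
To illustrate how page, revision, and chunk tables might sit alongside embedding vectors,
here is a self-contained sketch. It uses SQLite with JSON-encoded embeddings and a
brute-force cosine search purely as a stand-in, so that it runs without a PostgreSQL
server; with pgvector, the embedding column would instead use the extension's vector type
and the ranking would be done in SQL. All table, column, and function names are
assumptions for illustration.

```python
import json
import math
import sqlite3

# Hypothetical, simplified schema: a chunk belongs to a revision of a page.
SCHEMA = """
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
CREATE TABLE revision (
    rev_id  INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES page(page_id),
    content TEXT NOT NULL
);
CREATE TABLE chunk (
    chunk_id  INTEGER PRIMARY KEY,
    rev_id    INTEGER NOT NULL REFERENCES revision(rev_id),
    excerpt   TEXT NOT NULL,
    embedding TEXT NOT NULL  -- JSON list of floats; stand-in for a vector column
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_chunk(conn, rev_id, excerpt, embedding):
    """Store an excerpt together with its embedding vector."""
    conn.execute(
        "INSERT INTO chunk (rev_id, excerpt, embedding) VALUES (?, ?, ?)",
        (rev_id, excerpt, json.dumps(embedding)),
    )


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_chunks(conn, query_vec, k=3):
    """Return the k (chunk_id, excerpt) pairs most similar to query_vec."""
    rows = conn.execute("SELECT chunk_id, excerpt, embedding FROM chunk").fetchall()
    scored = [(cosine(query_vec, json.loads(emb)), cid, ex) for cid, ex, emb in rows]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(cid, ex) for _, cid, ex in scored[:k]]
```

Because each chunk row points at a revision, a new revision of a page gives a natural
point at which to recompute the affected chunks and their embeddings.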
URL-addressability
Instead of requiring a new domain, e.g., https://en.wikianswers.org/ , Wikianswers
features could be integrated into the search systems of Wikipedia, Wikidata, and Commons.
In this case, human-editable responses could still be URL-addressable, e.g.:
https://en.wikipedia.org/qa/2b106ea8-4d1b-441f-9dc8-4555a9999ae9 .
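
A minimal sketch of building and parsing such URLs follows, assuming that the identifier
in the path is a UUID, as the example above suggests. The function names qa_url and
parse_qa_url are hypothetical.

```python
import uuid
from urllib.parse import urlparse


def qa_url(answer_id: uuid.UUID, host: str = "en.wikipedia.org") -> str:
    """Build the address of a question-and-answer resource."""
    return f"https://{host}/qa/{answer_id}"


def parse_qa_url(url: str) -> uuid.UUID:
    """Extract the identifier from an undated /qa/ URL, or raise ValueError."""
    path = urlparse(url).path
    prefix = "/qa/"
    if not path.startswith(prefix):
        raise ValueError("not a /qa/ URL")
    return uuid.UUID(path[len(prefix):].strip("/"))


# A new identifier could be minted with uuid.uuid4() when a question is first asked.
new_url = qa_url(uuid.uuid4())
```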
Datetime encoding
Some questions have impermanent answers, which change over time, and others are volatile,
meaning that their answers could vary each time that the question was asked. To
accommodate both, date and time data could be encoded into URLs in a human-readable
manner, e.g.,
https://en.wikipedia.org/qa/2023/09/21/21/29/00/2b106ea8-4d1b-441f-9dc8-455… .
Some questions and answers might involve different granularities of time. For example, a
natural-language question "Which teams are in the Super Bowl?" might have a
number of URLs, one for each year, e.g.,
https://en.wikipedia.org/qa/2022/40a7338d-fe75-4897-aee6-ec87141020a6 and
https://en.wikipedia.org/qa/2021/40a7338d-fe75-4897-aee6-ec87141020a6 .
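
The following sketch shows one way the date and time could be encoded at different
granularities, following the year/month/day/hour/minute/second ordering of the example
above. The function name dated_qa_path and the granularity labels are assumptions, and
the question of time zones (e.g., always using UTC) is left open here.

```python
from datetime import datetime

# Coarsest to finest; each level adds one more path segment.
GRANULARITIES = ("year", "month", "day", "hour", "minute", "second")


def dated_qa_path(answer_id: str, when: datetime, granularity: str = "second") -> str:
    """Build a /qa/ path whose leading segments encode `when` down to `granularity`."""
    depth = GRANULARITIES.index(granularity) + 1  # ValueError if unknown label
    fields = [
        f"{when.year:04d}",
        f"{when.month:02d}",
        f"{when.day:02d}",
        f"{when.hour:02d}",
        f"{when.minute:02d}",
        f"{when.second:02d}",
    ]
    return "/qa/" + "/".join(fields[:depth]) + "/" + answer_id


# Year granularity, as in the Super Bowl example.
yearly = dated_qa_path("40a7338d-fe75-4897-aee6-ec87141020a6",
                       datetime(2022, 1, 1), "year")
```

Here the same identifier appears under each year, so the question keeps one identity
while each time period has its own editable answer.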
User experience
In the approach where Wikianswers features are integrated into Wikipedia, Wikidata, and
Commons search, user experiences could utilize the existing text search boxes atop pages.
Perhaps the "magnifying glass" icon in those search boxes could be accompanied
by a "question mark" icon. End-users would select, or activate, one of these two
icons, and the active icon would determine whether their input went to the existing
keyword-based content search or to the described Wikianswers human-editable
question-answering subsystem. Still under consideration is whether and how end-users could
indicate that their current page, or a selection within it, should serve as the focus when
their question is answered.
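
As a rough sketch of the toggle described above, a search request could carry the
selected mode and, optionally, a focal page. The names SearchMode, SearchRequest, and
route, as well as the returned labels, are hypothetical; how focus would actually be
specified is still an open question.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SearchMode(Enum):
    KEYWORD = "keyword"    # "magnifying glass" icon active
    QUESTION = "question"  # "question mark" icon active


@dataclass
class SearchRequest:
    text: str
    mode: SearchMode
    focus_page: Optional[str] = None  # e.g., title of the page being viewed


def route(request: SearchRequest) -> Tuple[str, str, Optional[str]]:
    """Pick the subsystem for a request; focus applies only to questions."""
    if request.mode is SearchMode.KEYWORD:
        return ("keyword_search", request.text, None)
    return ("question_answering", request.text, request.focus_page)
```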

Best regards,

Adam Sobieski