Wikimedia,
Hello. After receiving and listening to the feedback from our previous discussion, I have revised the Wikianswers proposal: https://meta.wikimedia.org/wiki/Wikianswers . I would also like to call your attention to its technical discussion section: https://meta.wikimedia.org/wiki/Wikianswers#Technical_discussion . A current version of this section is available below.
Per the feedback, the revised proposal includes, in addition to an option for a sister project at a new domain, e.g., https://en.wikianswers.org , an option for integration into the search systems of Wikipedia, Wikidata, and Commons. With respect to this latter option, AI systems' (LLMs') responses to end-users' questions would still be URL-addressed, human-editable content, e.g.: https://en.wikipedia.org/qa/2b106ea8-4d1b-441f-9dc8-4555a9999ae9 .
Thank you for checking out the revised proposal and for any feedback.
Technical discussion

Overview

Relevant artificial intelligence topics include retrieval-augmented generation, retrieval-augmented generation with guardrails, and agent-based approaches.

As presently considered, those parts of the question-and-answer data which could be human-editable include: (1) the template of the prompts, (2) the task, (3) the retrieved context data, (4) the questions, and (5) the answers.

The template is the overall structure of the prompts to the LLM. It includes some natural language and slots where the other parts will be placed. This should be locked so as to be editable only by administrators. Editing this would invalidate every cached and unlocked answer, meaning that every unlocked answer would be updated, refreshed, or regenerated.

The task is an instruction, e.g., "You are a helpful system which will answer the user's question using the following information". This should be locked so as to be editable only by administrators. Editing this would invalidate every dependent cached and unlocked answer, meaning that every unlocked answer would be updated, refreshed, or regenerated.

The retrieved context data are chunks or excerpts, e.g., of Wikipedia articles, which enhance the answering of a particular question. Users could edit them, resulting in the cascading invalidations of dependent cached and unlocked answers. With respect to user experiences, editors might click on these displayed chunks or excerpts of content to navigate to them as they occurred in source pages and edit them there, these updates to the underlying pages resulting in updates to the chunks and dependent unlocked answers.

The questions would be unusual to edit, except in the cases of typographical errors.

The answers, abstractly, result from processing the other ingredients. These could be edited by humans but, as shown above, they could be subsequently revised by the system per cascading updates, refreshes, or regenerations.
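As a rough illustration of the parts described above, here is a minimal Python sketch of how a prompt might be assembled from the template, task, retrieved chunks, and question, and how an edit might cascade to dependent unlocked answers. The names `Answer`, `build_prompt`, and `invalidate`, the slot names in the template, and the `stale` flag are all hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    """A cached answer and the retrieved chunks it depends on (hypothetical model)."""
    question: str
    chunk_ids: List[str]
    text: str
    locked: bool = False  # a locked answer is never revised by the system
    stale: bool = False   # marked when the answer needs to be regenerated


def build_prompt(template: str, task: str, chunks: List[str], question: str) -> str:
    """Fill the template's slots; the slot names here are an assumption."""
    return template.format(task=task, context="\n".join(chunks), question=question)


def invalidate(answers: List[Answer], changed_chunk_id: Optional[str] = None) -> List[Answer]:
    """Mark dependent unlocked answers as stale and return them.

    changed_chunk_id=None models an edit to the template or the task,
    which affects every unlocked answer.
    """
    affected = []
    for answer in answers:
        if answer.locked:
            continue  # locked answers are skipped
        if changed_chunk_id is None or changed_chunk_id in answer.chunk_ids:
            answer.stale = True
            affected.append(answer)
    return affected
```

In this sketch, editing a chunk marks only the unlocked answers that cite it, while editing the template or the task marks every unlocked answer. Regeneration itself (calling the LLM again with the rebuilt prompt) is left out.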
In some cases, editors might want to edit an answer and then to lock it from subsequent revisions by the system.

In conclusion, as presently considered, users would ordinarily tend to want to edit the retrieved chunks of content drawn from Wikipedia pages, these chunks augmenting the prompts to the LLMs, the cascading of these page revisions updating dependent unlocked answers automatically.

Database schemas

Wikianswers database schemas would include one or more tables with vector columns for embedding vectors. A project goal, then, would be to efficiently combine into a database schema the existing concepts of revision tables, page tables, and text tables with the newer concepts of embedding vectors and vector databases. Relevant tools include pgvector, a database extension which provides open-source vector-similarity search to PostgreSQL.

URL-addressability

Instead of requiring a new domain, e.g., https://en.wikianswers.org/ , Wikianswers features could be integrated into the search systems of Wikipedia, Wikidata, and Commons. In this case, human-editable responses could still be URL-addressable, e.g.: https://en.wikipedia.org/qa/2b106ea8-4d1b-441f-9dc8-4555a9999ae9 .

Datetime encoding

Some questions have impermanent answers and others are volatile, meaning that their answers could vary each time that the question was asked. In these regards, date and time data could be encoded into URLs in a human-readable manner, e.g., https://en.wikipedia.org/qa/2023/09/21/21/29/00/2b106ea8-4d1b-441f-9dc8-4555... . Some questions and answers might involve different granularities of time. For example, a natural-language question "Which teams are in the Super Bowl?" might have a number of URLs, one for each year, e.g., https://en.wikipedia.org/qa/2022/40a7338d-fe75-4897-aee6-ec87141020a6 and https://en.wikipedia.org/qa/2021/40a7338d-fe75-4897-aee6-ec87141020a6 .
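The datetime encoding described above could be sketched as follows in Python. The function name `qa_url`, its parameters, and the list of granularities are assumptions made for illustration. The sketch also leaves open which time zone the encoded values would use.

```python
from datetime import datetime
from typing import Optional

# Granularities from coarsest to finest; each one adds a path segment.
_GRANULARITIES = ["year", "month", "day", "hour", "minute", "second"]


def qa_url(base: str, question_id: str,
           when: Optional[datetime] = None,
           granularity: str = "second") -> str:
    """Build a question-and-answer URL, optionally with date and time segments."""
    if when is None:
        return f"{base}/qa/{question_id}"
    segments = [
        f"{when.year:04d}", f"{when.month:02d}", f"{when.day:02d}",
        f"{when.hour:02d}", f"{when.minute:02d}", f"{when.second:02d}",
    ]
    # Keep only the segments up to and including the requested granularity.
    kept = segments[: _GRANULARITIES.index(granularity) + 1]
    path = "/".join(kept)
    return f"{base}/qa/{path}/{question_id}"


super_bowl = "40a7338d-fe75-4897-aee6-ec87141020a6"
print(qa_url("https://en.wikipedia.org", super_bowl, datetime(2022, 1, 1), "year"))
# prints https://en.wikipedia.org/qa/2022/40a7338d-fe75-4897-aee6-ec87141020a6
```

With this layout, the same question identifier yields one URL per year for a yearly question and one URL per second for a volatile one.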
User experience

In the approach where Wikianswers features are integrated into Wikipedia, Wikidata, and Commons search, user experiences could utilize the existing text search boxes atop pages. Perhaps the "magnifying glass" icon in those search boxes could be accompanied by a "question mark" icon. End-users would select, or activate, one of these two icons. The activated icon would determine whether the search box uses the existing keyword-based content search or the described Wikianswers human-editable question-answering subsystem. Still under consideration is whether and how end-users could indicate that their current page, or selections thereof, should be the focus when their question is answered.
Best regards,
Adam Sobieski
Wikimedia,
Hello. With respect to the Wikianswers proposal, I've been striving to find the best level of technical detail. The current revision is available at: https://meta.wikimedia.org/wiki/Wikianswers and your feedback is welcomed.
I would like to follow up with this quick letter about some new (and hopefully interesting) ideas pertaining to answering simple and complex natural-language questions using multiple sources, i.e., Wikipedia, Wikidata, and Commons. These ideas involve processing natural-language questions to retrieve human-editable procedural knowledge with which to answer the questions.
Artificial intelligence systems can analyze questions (and abstractions and generalizations thereof) for cues which remind the systems of stored recipes or procedures with which to answer the questions. Retrieved recipes or procedures might involve structured queries (e.g., SQL or SPARQL), source code (e.g., Python), "wiki functions", or visualizable diagrams (i.e., human-editable extensible workflow diagrams).
In more technical detail, instances of questions can be categorized, abstracted, or generalized into templates and data. These templates could be used to retrieve bits of abstract procedural knowledge, and these data could be used to instantiate the bits of procedural knowledge into concrete structured queries, algorithms, procedures, or workflows.
For clarity, here is an example:
1. Starting with a question: "What is the population of Montpelier, Vermont?"
2. Are the question, procedure, and answer cached? If so, provide the cached answer. If not, continue with this sequence.
3. Abstract or generalize over the question instance, into a template: "What is the population of {{X : City}}?" with data: "X = 'Montpelier, Vermont'"
   * Or, perhaps, "What is the population of {{X : Q1093829}}?" with data: "X = Q26426"
4. Using the resultant template and data, retrieve one or more abstract structured queries, algorithms, procedures, or workflows.
5. Provide these retrieved abstract structured queries, algorithms, procedures, or workflows with the data (X = "Montpelier, Vermont") to produce concrete structured queries, algorithms, procedures, or workflows.
6. Execute the concrete procedures to obtain candidate answers.
7. Cache the question, procedure, and answer.
Adding human curation and wiki aspects to the architectures under discussion, editors with adequate privileges might be able to modify or edit the stored recipes or procedures. Editors with adequate privileges might also be able to adjust the weights on the stored recipes or procedures for sorting them during contextual retrieval.
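As one possible reading of the weights mentioned above, the following Python sketch sorts retrieved procedures by a retrieval score multiplied by an editor-adjustable weight. The multiplicative combination, the `StoredProcedure` class, and its field names are assumptions for illustration and are not specified in these ideas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoredProcedure:
    name: str
    similarity: float    # retrieval score for the current question (assumed 0..1)
    weight: float = 1.0  # adjustable by editors with adequate privileges


def rank(procedures: List[StoredProcedure]) -> List[StoredProcedure]:
    """Sort procedures so that the highest weighted score comes first."""
    return sorted(procedures, key=lambda p: p.similarity * p.weight, reverse=True)
```

Under this assumption, an editor could promote a well-curated procedure over a slightly better-matching one by raising its weight.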
In summary, these ideas intend to describe, beyond "wiki data", "wiki procedures" with which to query multiple data sources with natural-language questions. It may be a bit before I update the proposal with these ideas. I wanted to share them here for discussion purposes. Thank you.
Best regards,
Adam Sobieski
wikimedia-l@lists.wikimedia.org