The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-11-17
--
Code of Conduct for Wikifunctions
An important non-technical goal for Wikifunctions is to have a friendly and
welcoming environment for newcomers, both for people from the existing
Wikimedia communities and from beyond, right from the start.
One way to help ensure this is to establish a set of behavioral
policies to which all community members must adhere. As of now, we
have several (non-exclusive) options.
As a Wikimedia project, the Universal Code of Conduct
<https://meta.wikimedia.org/wiki/Universal_Code_of_Conduct> will apply to
Wikifunctions automatically. That is a great starting point.
The first question is whether we should also adopt the Technical Code of
Conduct <https://www.mediawiki.org/wiki/Code_of_Conduct>, which in some
places is more specific than the Universal Code of Conduct. Since
Wikifunctions is a technical project, adopting it seems to make a lot of
sense.
The second question is whether we should have additional
behavioral/conduct policies in place that are either more specific or
cover additional ground compared to the Universal and Technical Codes
of Conduct. Inspiration can be taken from the lists of existing
behavioral/conduct policies
<https://meta.wikimedia.org/wiki/Wikimedia_community_code_of_conduct>.
Also, should we decide not to adopt the Technical Code of Conduct, we
should write our own version of it.
I would like to see suggestions for policies around giving newcomers a bit
of extra protection, particularly given the complexity of our project. I'd
also like to hear thoughts on policies regarding the multilinguality of
Wikifunctions, which can hopefully learn from the best examples on
Wikimedia Commons or Wikidata, the large multilingual projects we already
have. Similarly, a policy that limits any discussion about “vim vs emacs”
to no more than two posts per month per contributor might be needed, and
some of you may have a few thoughts on how to avoid edit wars around code
style.
As with the previous recommendation
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-09-27> to
start drafting new policies before Wikifunctions launches, we encourage
everyone to discuss options, and perhaps draft content. Let’s centralize
the discussion on this update's talk page, and link to draft policies from
there.
We will set up a space where folks can state their agreement or
disagreement with adopting the Technical Code of Conduct (as well as state
their indifference, so we can estimate engagement). Besides that, the page
is open to suggestions for further behavioral policies, and even drafts of
these.
We are aware that we will not start with a perfect set of policies, and
this is not the goal. The goal is to at least try to have the most
important pieces in place from day one, so that we don’t start with an
entirely blank slate. This is similar to the initial staff editing policy
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Staff_editing> that was
recently drafted. And just as with that policy, it is clear that the
results are not written in stone, but will be amenable to change and will
evolve as the actual community of Wikifunctions starts forming. But it is a
good idea to have the first few guidelines at hand right from the
beginning, and not to scramble reactively too much.
As is always the case with such policies, a strong turnout would show a
strong commitment to them. I hope that the nascent proto-community forming
around Wikifunctions will show up and demonstrate our commitment to a set
of policies that will lead to an inclusive and civil community in the
future. Please take the time to let us know your thoughts.
WikiConference North America
Last Saturday, we presented Wikifunctions virtually at WikiConference
North America / OpenStreetMaps USA. The session was recorded, but at the
very end Denny’s Internet connection failed, which took away the
community's opportunity to ask questions live. However, we still collected
the questions and answered them on the wiki of the conference
<https://wikiconference.org/wiki/Submissions:2022/Wikifunctions_-_a_new_Wiki…>.
Thanks to all attendees, and thanks for these great questions!
Development updates
Experience & Performance
- Fixed more front-end (FE) bugs
- Enabled websockets in the evaluator, allowing two-way communication
with orchestrator (T318359 <https://phabricator.wikimedia.org/T318359>)
- Implemented versioning of Avro schema (T321752
<https://phabricator.wikimedia.org/T321752>)
- Submitted fixes and test coverage improvements for current
perform_test flow (T321495 <https://phabricator.wikimedia.org/T321495>,
T321492 <https://phabricator.wikimedia.org/T321492>, T312290
<https://phabricator.wikimedia.org/T312290>)
- Made function view page implementation and test tables mobile-friendly
(T310162 <https://phabricator.wikimedia.org/T310162>)
- Implemented FE integration test for connecting implementations and
testers to functions (T318426 <https://phabricator.wikimedia.org/T318426>
)
Meta-data
- Finished the revised version of recording which implementation gets
selected (T320457 <https://phabricator.wikimedia.org/T320457>)
- Further work on caching tester results in MediaWiki DB (T297707
<https://phabricator.wikimedia.org/T297707>)
- Dropped backwards-compatibility code in orchestrator & evaluator
(T291136 <https://phabricator.wikimedia.org/T291136>)
Abstract Wikipedia,
Hello. I have recently been thinking about the generation of
natural-language stories from Wikidata data, e.g., from graphs of
interrelated real-world historical events.
I am wondering whether the resulting machine-generated stories would be
narrated objectively or subjectively. These topics appear to pertain to
the philosophy of history [1] and neutrality [2], resembling
encyclopedists' ideals of neutrality with respect to point of view [3].
In my opinion, there would be much to learn from developing
natural-language story-generating systems whose parameters could be set,
or which could receive secondary input data, in order to produce
subjective stories. With such systems, developers could control and vary
the subjectivity of the resulting natural-language output, e.g., with
respect to sentiment.
What do you think about the idea that natural-language story-generating
systems could use parameters or additional inputs to vary the subjectivity
of their output?
Without a means of controlling and varying the subjectivity of output
stories and language, shouldn't one want the output to be as measurably
objective as possible?
What do you think about providing the capability for developers to trace
backwards from natural-language outputs (from words, phrases, sentences,
and paragraphs) into source code and data? Developers would then be able
to more readily version software and data using metrics and evaluation
tools, e.g., Grammarly or sentiment analysis. In theory, systems could
provide accompanying “debugging data” alongside natural-language outputs,
with this data including mappings from selections of natural language,
wikitext, or hypertext to stack traces or other data structures.
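For illustration, such debugging data might resemble the following
hypothetical JavaScript sketch; every field name in it is invented, and no
such format exists yet:

    // Hypothetical "debugging data" accompanying one generated sentence.
    // All names are invented for illustration only.
    const debugData = {
      output: "The battle took place in 1814.",
      spans: [
        {
          start: 25, end: 29,        // character offsets of "1814"
          source: {
            item: "Q…",              // the Wikidata item the fact came from
            property: "P585",        // e.g. "point in time"
            renderer: "renderYear",  // the function that produced the text
          },
        },
      ],
    };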
Best regards,
Adam Sobieski
[1] https://en.wikipedia.org/wiki/Philosophy_of_history
[2] https://en.wikipedia.org/wiki/Philosophy_of_history#Philosophy_of_neutrality
[3] https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view
The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-11-09
--
Checking lexical forms
Previously we have discussed morphological paradigms
<https://meta.wikimedia.org/wiki/Special:MyLanguage/Abstract_Wikipedia/Updat…>,
and how lexemes and paradigms
<https://meta.wikimedia.org/wiki/Special:MyLanguage/Abstract_Wikipedia/Updat…>
could
be used. To summarize and simplify, paradigms are patterns of inflection
<https://en.wikipedia.org/wiki/Inflection> of a word (or lexeme), and
functions can implement paradigms and specific inflections. To give an
example, the usual way to get the plural of a noun in English is to add the
letter s to its basic form, the so-called 'lemma'.
On Notwikilambda, the community-run preview version of Wikifunctions, we
started implementing a few such functions. Correspondingly, we recreated
some of them in the Wikifunctions Beta: *e.g.* add s to end
<https://wikifunctions.beta.wmflabs.org/wiki/Z10210> and replace y at end
with ies <https://wikifunctions.beta.wmflabs.org/wiki/Z10238>.
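As a rough JavaScript sketch of what these two functions compute (the
actual implementations on the Beta may differ; the function names here are
ours):

    // Two common English pluralization rules, in plain JavaScript.
    function addSToEnd(lemma) {
      return lemma + "s";                    // cat -> cats
    }
    function replaceYAtEndWithIes(lemma) {
      return lemma.endsWith("y")
        ? lemma.slice(0, -1) + "ies"         // city -> cities
        : lemma;                             // unchanged otherwise
    }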
In order to demonstrate their use, we developed a small, browser-based
tool, form check <https://vrandezo.github.io/formcheck/>. Form check allows
you to select a language and a part of speech (*e.g.* English nouns), and
then state which forms you want to generate (*e.g.* the plural). Then you
choose the function from the Wikifunctions Beta, and the tool checks
whether the form as recorded in Wikidata corresponds to the output of the
function.
If it doesn’t, this may indicate an error, either in the function or in the
data, or an irregular form.
Form check has at least one major shortcoming: it currently does not allow
you to filter on further statements on the lexeme. In many languages this
is crucial: for example, in German, nouns are inflected differently
depending on their grammatical gender. It also doesn’t automatically
update the list of available functions (but you can enter an arbitrary
ZID). The code is open source <https://github.com/vrandezo/formcheck>, and
contributions (or, indeed, someone wanting to take over the code) would be
more than welcome.
It is said that it is better to show than to tell. In this spirit, we
created a 13-minute video. It demonstrates how the form check tool is
used, how it helped to find an error in a lexeme on Wikidata, and how it
was used to discover a paradigm and implement the respective function.
The video can be watched here:
https://meta.wikimedia.org/wiki/File:Checking_forms.webm
We invite you to implement more morphological functions in Wikifunctions
Beta, and try them out with the form check tool. Please report errors that
you find on the way, so we can fix them. And also share your results, and
how well you can cover all the different linguistic variations in your
language with your functions!
------------------------------
There are a number of interesting aspects to this demonstration.
Firstly, it shows a possible use of Wikifunctions, as currently
implemented, for natural-language-related functions. It ties in directly
with the data on Wikidata, and offers both a way to find errors in the
data and a way to explore the data that might help with finding patterns
and creating more such functions. Although I don’t speak Ukrainian, I was
able to create a function that captured the morphology of a specific
Ukrainian form. These functions can then, in turn, help us discover more
inconsistencies, or even let us enter data faster and in a way less prone
to errors. For example, I would really love it if there were a way to
attach functions to the fields in the Wikidata Lexeme Forms tool
<https://www.wikidata.org/wiki/Wikidata:Wikidata_Lexeme_Forms>, so that I
would only enter the lemma, the other fields would be filled in
automatically based on the results from Wikifunctions, and then, if
needed, I could manually correct the results before publishing.
Secondly, it shows how relatively easy it is to write functions, testers,
and implementations. In this case, it took us less than four minutes to
define the functions, write a tester, and provide an implementation. Our
UX is currently being improved to make many of these steps easier and more
intuitive. Not all functions will be as easy to implement. But in this
case, no coding was required at all, since we had a relevant function that
we could use for composition, replace at end
<https://wikifunctions.beta.wmflabs.org/wiki/Z10220> (see the sketch
below). Our hope is that a solid library of such versatile functions can
take us a long way towards pretty good coverage of morphological
functions. But even when an implementation turns out to be more complex,
defining the function and providing test cases is something we expect will
be possible for many potential contributors.
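To illustrate what such a composition computes, here is the equivalent
logic in plain JavaScript (this stands in for Wikifunctions' composition
mechanism; it is not actual ZObject syntax):

    // A generic "replace at end" function, analogous to Z10220.
    function replaceAtEnd(word, oldEnding, newEnding) {
      return word.endsWith(oldEnding)
        ? word.slice(0, word.length - oldEnding.length) + newEnding
        : word;
    }
    // The composed function merely fixes two of the three arguments:
    const replaceYAtEndWithIes = (word) => replaceAtEnd(word, "y", "ies");
    // replaceYAtEndWithIes("city") === "cities"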
And thirdly, it shows, probably for the first time, an external tool
calling a function from Wikifunctions (albeit the Beta). It is just a
website, standing in front of Wikifunctions, asking it to evaluate a
function. Form check calls the SPARQL endpoint of Wikidata, and then uses
the data from there to ask Wikifunctions to evaluate a function (a sketch
of this pattern follows below). The whole thing is a static website, needs
no libraries at all, merely plain old JavaScript, and could be hosted
anywhere (in fact, you can also download the HTML and load the page
locally; it should work just as well).
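A minimal sketch of that pattern in plain JavaScript: the SPARQL query
uses the standard Wikidata lexeme vocabulary, but the name and parameters
of the Wikifunctions call API are an assumption here; check the Beta's
api.php documentation for the actual module.

    // Sketch: fetch English noun plurals from Wikidata, then ask the
    // Wikifunctions Beta to compute the expected plural and compare.
    // Assumption: the wikilambda_function_call API module and its
    // parameter name; verify against the Beta's api.php documentation.
    const sparql = `
      SELECT ?lemma ?plural WHERE {
        ?lexeme dct:language wd:Q1860 ;                # English
                wikibase:lexicalCategory wd:Q1084 ;    # noun
                wikibase:lemma ?lemma ;
                ontolex:lexicalForm ?form .
        ?form ontolex:representation ?plural ;
              wikibase:grammaticalFeature wd:Q146786 . # plural
      } LIMIT 10`;
    const data = await fetch(
      "https://query.wikidata.org/sparql?format=json&query=" +
        encodeURIComponent(sparql)
    ).then((r) => r.json());
    for (const row of data.results.bindings) {
      // A ZObject calling Z10210 ("add s to end") with the lemma:
      const call = { Z1K1: "Z7", Z7K1: "Z10210", Z10210K1: row.lemma.value };
      const resp = await fetch(
        "https://wikifunctions.beta.wmflabs.org/w/api.php" +
          "?action=wikilambda_function_call&format=json" +
          "&wikilambda_function_call_zobject=" +
          encodeURIComponent(JSON.stringify(call))
      ).then((r) => r.json());
      // Compare the returned value with row.plural.value; a mismatch
      // points at an error in the function, an error in the data, or
      // simply an irregular form.
    }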
Note that I am rather unsure whether the form check tool is a good and
useful tool. Does each individual user really need to check thousands of
forms themselves? We would probably want a shared resource for doing this
evaluation instead. The tool is meant as an early inspiration that will
hopefully lead to other tools, libraries, and workflows that are more
robust, more reusable, and more closely aligned with how the community
works.
Volunteer’s corner
Thanks to everyone who joined the volunteer’s corner on Monday; it was a
lively session. The next one will be on Monday, December 5 at 18:30 UTC
<https://zonestamp.toolforge.org/1670265038>.
WikiConference North America 2022
This weekend, Wikifunctions will be presented at WikiConference North
America <https://meta.wikimedia.org/wiki/WikiConference_North_America/2022>,
jointly held with OpenStreetMaps USA. The presentation
<https://wikiconference.org/wiki/Submissions:2022/Wikifunctions_-_a_new_Wiki…>
will be on Saturday, November 12 at 20:15 UTC
<https://zonestamp.toolforge.org/1668284148>, and we will focus on
Wikifunctions and possible use cases in the world of maps.
Staff editing policy
We are, for now, closing the hot phase of the staff editing policy
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Staff_editing>. The
policy belongs to the community, and can always be evolved and adapted by
you. We will, on launch, copy it over to Wikifunctions, and will follow
this policy.
Development updates
Experience & Performance:
- Fixed more FE bugs
- Merged patches related to error management
- Made great progress on drafting the Default Component technical specs
Meta-data:
- Completed readable summaries of all error types (T312611
<https://phabricator.wikimedia.org/T312611>) and ability to record which
implementation gets selected (T320457
<https://phabricator.wikimedia.org/T320457>)
Natural Language Generation:
- Finalized template language document
- More analysis on dependencies for isiZulu
The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-11-04
--
One Ring, or a thousand flowers?
<https://meta.wikimedia.org/wiki/File:One_Ring_or_a_thousand_flowers.jpg>
One of the Abstract Wikipedia workstreams is focused on the natural
language generation tasks that will be necessary for creating and
maintaining Wikipedia articles in hundreds of languages. Unlike the other
workstreams, this work is not focused on the immediate future and launch of
Wikifunctions, but explores the next steps necessary once Wikifunctions is
available and connected to the other Wikimedia projects, particularly
Wikidata and Wikipedia.
In previous newsletters we have talked about some of the approaches and
work around natural language generation for Abstract Wikipedia: Mahir
Morshed talked about Ninai and Udiron
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-03>; we
talked about Grammatical Framework
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-06-24>,
which has been a major influence on the development and design of the
project; and Ariel Gutman and Maria Keet presented a template language
specification
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-08-19>
(now accompanied by a Scribunto implementation
<https://meta.wikimedia.org/wiki/Module:Sandbox/AbstractWikipedia>) and
gave a recent update on Diff
<https://diff.wikimedia.org/2022/09/21/the-state-of-abstract-wikipedia-natur…>.
With these three solutions, are we done? Do we now know what the solution
to natural language generation will look like?
I don’t know. It might be. These solutions have been developed by very
smart folks, with cumulative decades of experience under their belts. The
most important thing these implementations do right now is provide an
existence proof: they demonstrate that a solution is possible. They show
us that the goals of Abstract Wikipedia are not too lofty. Grammatical
Framework did that for Abstract Wikipedia as a whole. I can genuinely say
that without Grammatical Framework, the Abstract Wikipedia project as it
is wouldn’t exist.
But are any of these solutions the approach that Abstract Wikipedia will
ultimately take?
I don’t know. One major novelty is that Abstract Wikipedia would benefit
from being able to scale to a large number of contributors with very
diverse skill levels: some might be experienced programmers, some might be
trained linguists, others might bring native-level language skills. Which
solution really scales well for a community of volunteer Wikimedians? This
is very difficult to predict in advance. And this is why I don’t want us
to commit to a specific solution yet.
I would like to see a Cambrian explosion
<https://en.wikipedia.org/wiki/Cambrian_explosion> of possible solutions.
This is one of the reasons why Wikifunctions allows for all kinds of
functions, and why it is explicitly Turing complete
<https://en.wikipedia.org/wiki/Turing_completeness>: so we don’t lock
ourselves prematurely into a single architecture, into a single solution.
I am looking forward to a large number of different approaches being tried
out, and to the communities building around these approaches discussing
the advantages and disadvantages, and also simply voting with their feet,
through activity.
Yes, in the end we should make sure that we unify on a single solution. It
would obviously be a tragic mistake if natural language generation for
Bengali worked entirely differently from that for Hausa, using different
abstract contents. But sometimes it might be necessary to develop
morphological or grammatical functions that are unique to a specific
language, and then integrate them into the overall architecture for
generating whole texts. Examples of this are the noun classes in
Niger-Congo languages
<https://en.wikipedia.org/wiki/Noun_class#Nominal_classes_in_Swahili>, or
the morphology of Arabic and Hebrew, which interleaves vowels and
consonants <https://en.wikipedia.org/wiki/Nonconcatenative_morphology>.
I see the community entirely taking the lead on which solution to choose,
implement, and pursue. But I see that happening mostly implicitly, through
the community’s actions, and not so much by the explicit means of debating
and voting on a single solution. I don’t want us to prematurely decide on a
single way, but rather to stay open and invite experimentation and new
ideas. The space of possible ideas is so vast, and the benefit of choosing
a solution better fitted to our community is so big, that it makes sense
for us to be creative.
I expect the development to go through four different levels of evolution.
First level: we might start with simple lookup tables. We might have a
function that returns a short description for a city, selecting the
relevant word or phrase just based on the language. There are 550 items in
Wikidata that have the English short description “city”, more than 1,700
items with “scientific paper”, and more than 400 with the French short
description *“article scientifique”*. Even with such a simple lookup table
we could already improve the descriptions of thousands of items on
Wikidata.
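A minimal sketch of such a first-level function, in plain JavaScript (the
language codes and strings are illustrative):

    // Level 1: a pure lookup table. The short description for "city" is
    // selected by language alone.
    const cityDescription = { en: "city", fr: "ville", de: "Stadt" };
    function describeCity(languageCode) {
      return cityDescription[languageCode];
    }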
Second level: we can widely extend the possibilities by allowing for
arguments in a templated structure, say in order to make a short
description such as “city in Azerbaijan” or “city in Israel” (each with
fifty occurrences), or “French author” or “Argentinian chemist”. Such
simple patterns will be useful for a large number of items, and will also
already uncover a surprising number of edge cases. This will be useful for
Wikidata descriptions, but can also already be useful for Wikipedia
articles: many bots, such as Rambot on the English Wikipedia or LSJbot on
the Swedish Wikipedia, have been working exactly like this.
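Continuing the illustrative JavaScript sketch, a second-level template
adds arguments (the patterns are again illustrative):

    // Level 2: a template with an argument.
    const cityInTemplate = {
      en: (country) => `city in ${country}`,
      // French already surfaces an edge case: the preposition depends on
      // the country ("en Azerbaïdjan", but "au Japon"), so a plain
      // pattern like the following is not enough.
      fr: (country) => `ville en ${country}`,
    };
    function describeCityIn(languageCode, countryLabel) {
      return cityInTemplate[languageCode](countryLabel);
    }
    // describeCityIn("en", "Azerbaijan") === "city in Azerbaijan"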
Third level: we add constructors to the templates. The constructors allow
us to build whole articles from individual sentences. Instead of having model
articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07> for
a whole category, we now allow for manually written articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-21>.
This makes the task at the same time easier and harder: because the
constructors now need to be reusable, they need to be more like modular
sentences than whole articles.
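In the same illustrative JavaScript, a constructor is language-independent
structured data that per-language sentence renderers consume (all names
and data here are invented):

    // Level 3: constructors.
    const sentence = { type: "BirthInfo", person: "Marie Curie",
                       year: 1867, place: "Warsaw" };
    const renderBirthInfo = {
      en: (c) => `${c.person} was born in ${c.place} in ${c.year}.`,
      de: (c) => `${c.person} wurde ${c.year} in ${c.place} geboren.`,
    };
    // An article is assembled from a list of such constructors (a real
    // system would dispatch on each constructor's type):
    function renderArticle(constructors, lang) {
      return constructors.map((c) => renderBirthInfo[lang](c)).join(" ");
    }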
Fourth level: since, as with the third level, the number of constructors
will grow, we should aim to rein in the amount of work that needs to be
done by the smaller language communities. This can be achieved by having
abstract renderers for constructors: winning an award can then, instead of
having a direct template in English such as (the example is simplified)
“{person} received the {award} on {date}”
have an abstract (i.e. language-independent) renderer such as (again
simplified)
“Clause(subject=person, predicate=Q76664785
<https://www.wikidata.org/wiki/Q76664785>, object=award, time=date)”
which in turn has language-dependent renderers, but fewer of those. This
would often lead to less idiomatic results.
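In sketch form (again illustrative JavaScript, modeled on the simplified
Clause example above; the lexicon and word orders are invented and heavily
simplified):

    // Level 4: one abstract clause renderer per language replaces many
    // per-constructor templates. Real languages need far more grammar
    // than word order.
    const verbFor = {  // hypothetical lexicon: predicate item to verb
      "Q76664785": { en: "received", de: "erhielt" },
    };
    function renderClause(c, lang) {
      const verb = verbFor[c.predicate][lang];
      if (lang === "en")
        return `${c.subject} ${verb} the ${c.object} on ${c.time}.`;
      if (lang === "de")
        return `${c.subject} ${verb} am ${c.time} den ${c.object}.`;
    }
    // renderClause({ subject: "Ada", predicate: "Q76664785",
    //                object: "award", time: "May 1" }, "en")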
Some of the solutions so far fare really well at the second or third
level, and others also seem capable of dealing with the fourth level. The
third level lends itself well to a certain kind of user experience, which
the fourth level does not. There will be advantages and disadvantages to
balance.
The goal of this newsletter is not to prescribe such a development. Unlike
with the development plan
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Plan>, we are not going
to actively work on making these levels happen. We do not have to go
through these particular levels, and we don’t have to be uniform about the
levels across the different domains. In some areas, the first level might
be entirely sufficient; in others, we might flourish by using the ideas
described for the fourth level; and yet others might simply not fit into
the described levels at all. And that’s OK.
The goal of this newsletter is to explain my thinking and decisions, show
how the different systems and approaches we have previously mentioned fit
together, and to allow for rational predictions of where we are going and
what kind of contributions we are looking for. This is also an invitation
to all of you: the NLG system will be developed by all of us together.
Volunteer corner
Next week, on Monday, November 7, 18:30 UTC, we are going to host our next
volunteer corner. You can join us here: https://meet.google.com/evj-ktbq-hbn
Staff editing discussion is closing
Given that activity has calmed down, we are planning to close the staff
editing <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Staff_editing>
discussion soon.
New developer channel
We will use the #wikipedia-abstract-tech
<https://web.libera.chat/?channel=#wikipedia-abstract-tech> channel on IRC
(also bridged to Telegram <https://t.me/abstract_wikipedia_tech>) as a
space more focused on developers and technology around Wikifunctions and
Abstract Wikipedia. Our channels are documented here: Abstract
Wikipedia#Participate
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia#Participate>
Development update for the week of October 28, 2022
Experience & Performance:
- Avoidance of type expansion in orchestrator (T297742
<https://phabricator.wikimedia.org/T297742>)
- Removed Work Summary component
- Aligned on what fields are mandatory and what fields are optional
during ZFunction and ZObject creation
- Finalized designs for Publish component
- Fixed more FE bugs
- Completed another round of testing
Meta-data:
- Initial implementation of readable summaries of all error types
(T312611 <https://phabricator.wikimedia.org/T312611>)
Natural Language Generation:
- Shared document on UI and grammaticality judgments
- Made progress on the template language
- Made a demo of a possible template creation GUI (here
<https://github.com/mkeet/ToCTeditor>)