The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-09-27
--
Right from the start, Wikifunctions will be somewhat more complex than many
other Wikimedia projects. Sure, over time many of the Wikimedia projects
have accrued a lot of complexity: just think about Lua modules
<https://www.mediawiki.org/wiki/Lua>, conditional templates
<https://www.mediawiki.org/wiki/Template:If>, or the sophisticated use of
MediaWiki features to support the workflows of the community.
These complexities, however, have grown in scale alongside their wikis'
communities, and all of the projects have started out with a very simple
workflow. Wikifunctions will not. Though we are trying our best to make the
project as accessible and usable as possible, we also want to help the
community as best we can to get the project off the ground.
In order to do so, we at the Wikimedia Foundation want to do something that
we usually don’t: directly support the community by contributing on-wiki to
the main namespace of Wikifunctions in our capacity as staff, from our
staff accounts.
Usually, edits to the main namespace by staff accounts are limited to
exceptional circumstances. This is primarily to make it very clear that the
content of each of the projects belongs to its community. This is
particularly important for projects like Wikipedia, where sometimes subtle
changes in wording can be very important and have significant real-world
consequences, as we were just reminded again recently
<https://en.wikipedia.org/wiki/Talk:Recession>.
Wikifunctions is different. Its content will be functions, their
implementations, and other supporting objects. We would like to be able to
work together with the community, in our paid time as staff members. This
means working on functions, helping to improve implementations, showing
exemplary cases of how to use new features, and also speeding up the
creation of implementations for functions requested by the community.
One particular domain where we are planning to contribute is functions
for natural language generation. I think that, without staff support,
the necessary functions to make Abstract Wikipedia possible might take a
long time to develop, and that support by staff can speed up that key area
considerably.
Despite this approach, we also want to make sure that Wikifunctions remains
under the full ownership of the community. Whereas in the beginning our
staff accounts might have certain special rights on Wikifunctions (e.g. the
right to create instances of certain types), we want these roles to be
transferred to the community sooner rather than later. We don’t want to be
the ones making policy decisions beyond what is technically necessary (e.g.
for platform performance or code-security reasons). We don’t want to be
assigning sysop rights or other community leadership positions. We don’t
want to make policy decisions about which functions, implementations, or
which test cases are deemed necessary, valuable, or acceptable. All of
these areas, and more, should be fully owned by the Wikifunctions community.
It seems prudent and necessary, in order to be transparent, that the
community drafts a policy together with us in order to define how we will
be editing the Wikifunctions main namespace as staff. Since we will need
this policy to be in place from the beginning of Wikifunctions, we are
proposing to go the unusual path of creating a preliminary policy here on
Meta with interested community members, which we will then transfer to
Wikifunctions upon launch. That policy should be revisited once the
Wikifunctions community has formed, and once we have hands-on experience
with such edits.
*Request*: We are calling for contributors to lead the creation of this
preliminary policy, and asking everyone to comment and contribute to the
policy. If no contributors step forward, the Abstract Wikipedia team will
take the lead on drafting the preliminary policy. The policy will be
drafted at Abstract Wikipedia/Staff editing
<https://meta.wikimedia.org/w/index.php?title=Abstract_Wikipedia/Staff_editi…>
and discussed at Talk:Abstract Wikipedia/Staff editing
<https://meta.wikimedia.org/w/index.php?title=Talk:Abstract_Wikipedia/Staff_…>.
There are many questions to be answered: what limitations should staff
accounts face, if any? What about staff who are also volunteers? Should
staff also apply for sysop rights and other roles, or should they
automatically have certain rights and thus also responsibilities? How
should staff engage in debates, if at all? These are difficult questions
that would benefit from a preliminary answer, given to staff by the
community.
Note that all of this is strictly regarding Wikifunctions, and does not
have any implications for the other Wikimedia projects.
We are looking forward to working together!
The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-09-30
--
As you might recall, the Abstract Wikipedia team's Cory Massaro
<https://meta.wikimedia.org/wiki/User:CMassaro_(WMF)> recently finished an arts
residency in İstanbul
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-07-12>,
which he attended as part of the creative duo *Tecnologías Silvestres*. He
will share here, in his own voice, some highlights of the trip as well as some
conclusions about knowledge democratization and the technological
challenges facing specific language communities.
Photo Tour
<https://meta.wikimedia.org/wiki/File:Cat_in_istanbul.jpg>
Cat in İstanbul
İstanbul is mostly cats, like 90%. Sometimes they stand under the
streetlights and meow at you like catnip dealers. Sometimes they just judge
you from the rocks
<https://commons.wikimedia.org/wiki/File:Cat_in_istanbul.jpg>.
<https://meta.wikimedia.org/wiki/File:Cathedral_underground_city.jpg>
Rock cathedral in Cappadocia
We took a field trip to Cappadocia for a few days to see what that was
about. The thing to do there, because it is warm as the devil's dust
jacuzzi, has historically been to live in a lava dome or cave. There are
some underground cities in Cappadocia where people used to hang out when it
was hot or there was a war outside. One such city contains an ancient Rock Cathedral
<https://commons.wikimedia.org/wiki/File:Cathedral_underground_city.jpg> (i.e.,
a cathedral made of a rock, not the site of the greatest Van Halen concert).
<https://meta.wikimedia.org/wiki/File:Underground_power.jpg>
Underground power cables
The underground cities have lights hooked up for us pampered moderns. The
caves are full of cables and electric boxes
<https://commons.wikimedia.org/wiki/File:Underground_power.jpg>, creating a
climate apocalypse vibe which is delicious.
<https://meta.wikimedia.org/wiki/File:Shahmaran.jpg>
Shahmaran
There were sculptures like this all over the city
<https://commons.wikimedia.org/wiki/File:Shahmaran.jpg>. I was embarrassed
that I couldn't identify this twice-crowned snake-butt centilady, so I put
the mythological expertise of the Abstract Wikipedia team to the test.
Quiddity finally identified her as the Shahmaran
<https://en.wikipedia.org/wiki/Shahmaran>.
<https://meta.wikimedia.org/wiki/File:Istanbul_Museum_of_the_History_of_Scie…>
Water clock
There's a whole Museum of the History of Science and Technology in Islam
<https://en.wikipedia.org/wiki/Istanbul_Museum_of_the_History_of_Science_and…>.
The museum begins with three galleries containing photos of European
Christian men. After that, it gets really fascinating. One highlight was this
gorgeous water clock
<https://commons.wikimedia.org/wiki/File:Istanbul_Museum_of_the_History_of_S…>!
Art
I spent a lot of time staring pensively into the middle distance in a
scribal reverie. I made important literary sketches on cats fighting with
seagulls, the behavior of people in coffee shops, and other snippets of
daily life. Poems were written, short stories edited, and multiple visual
art installations created with other residents at the space. I also gave
two writing workshops using natural language processing and Surrealist
techniques to generate ideas, which we then used to create poetry and songs
(I made a word2vec <https://en.wikipedia.org/wiki/Word2vec> oracle!).
Language and Technology and Hegemony and Abstract Wikipedia
What kind of knowledge do people want to share? Many of us (or at least I)
intuitively believe that certain knowledge is more-or-less "objective" and
"neutral," but those categories are inadequate. Let us consider, for
example, standard objective facts about geography and biology. A city has a
certain population and square mileage, a founding date, a governing body
(usually), landmarks. A city also has history, and in many places, that
history cannot be discussed without reference to geopolitics. As I shared
information about personal history with people at the residency, I learned
facts about where they came from. For some of their home cities, one
interesting, useful, and very sad fact concerned recent violence. Other
facts had to do with the global superpowers which encouraged, condoned,
supplied arms for, or directly perpetrated that violence. There are plants,
like particular varieties of fig tree, which are now threatened or
endangered due to how war has terraformed their environment. These are
real, unimpeachable facts about cities and organisms, but it is impossible
to state those facts plainly without making a political statement.
While the propositional truth value of such a fact cannot be denied,
subjective domains like a person's political values inform whether that
fact is included in particular discourses. This is the art of creating
narrative or stories. I would consider it a noble goal to make Abstract
Wikipedia a platform where stories, not just facts, can be expressed and
shared. Abstract Wikipedia is the right platform for this because it allows
those stories to be shared outside the linguistic communities to which they
are directly relevant. Just as Abstract Wikipedia is intended to convey
objective information in less-resourced languages, I also hope that
speakers of these languages will represent their knowledge in Wikidata so
that Abstract Wikipedia can complicate the narratives of highly-resourced
languages' Wikipedias.
I also talked with people about how language informs their interactions
with technology. Some of the observations were unsurprising (but still
important to hear and hear again): certain software is hard to use in one
language or another; the Internet opens up if someone speaks a hegemonic
language, etc. One thing I hadn't anticipated was how often the discussions
turned to literacy. It was fascinating to speak with people who were fluent
and literate in multiple hegemonic languages but didn't read, or didn't
read well, the language they spoke at home. A speaker of Kurmanji
<https://en.wikipedia.org/wiki/Kurmanji> (a Kurdish dialect) mentioned that,
when he exchanged messages with his Kurdish-speaking friends, they used
voice messages; using text felt unnatural.
Abstract Wikipedia has been conceived primarily as a text-based project.
This makes technical sense. However, if literacy is an impediment that
affects how and in what language a person might choose to access a website,
then it can be compared with other accessibility concerns. Vision-impaired
persons likewise suffer when projects only consider the text interface. In
both cases, it seems the same tools (screen-reader-friendly user interfaces
and better text-to-speech technology in all languages) can help solve the
problem.
In summary, I left the residency with two big questions about the work our
team is doing.
First: how can Abstract Wikipedia serve challenging, controversial
information, and expose people to perspectives they might not otherwise
have access to?
Second: issues of literacy and accessibility intersect in the languages
Abstract Wikipedia wants to serve. What discussions can we have about that
intersection?
(Apologies for being quiet the last few weeks; we will catch up.)
This update went to the Diff blog. You can find the version on the Web here:
https://diff.wikimedia.org/2022/09/21/the-state-of-abstract-wikipedia-natur…
Here is a copy of the text for your convenience and for the archive.
The State of Abstract Wikipedia Natural Language Generation
21 September 2022 by Natural Language Generation workstream of Abstract
Wikipedia
<https://diff.wikimedia.org/author/natural-language-generation-workstream-of…>
The Abstract Wikipedia team has taken further steps toward representing
abstract content in natural languages!
When Denny introduced the proposal for Abstract Wikipedia here on Diff
<https://diff.wikimedia.org/2020/05/07/a-proposal-for-a-new-wikimedia-projec…>,
he noted the need for “functions that can translate the content of Abstract
Wikipedia into the natural language text of every Wikipedia.” Those
“functions” will eventually comprise a community-driven natural language
generation <https://en.wikipedia.org/wiki/Natural_language_generation>
pipeline.
Research and prototyping for that NLG pipeline have now begun. In this
post, we will outline how the architecture of the NLG templating system
(part of the NLG pipeline) fits in with other components. We’ll also
highlight open questions in the hopes of encouraging discussion and further
contribution by the community.
As the AW team discussed a few weeks ago
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-08-19>,
the planned NLG realizer (also called the Renderer), a component of the NLG
system, will use a template language for writing templates and will then
transform those templates into natural language text. The template language
will provide a high-level, readable, declarative syntax to steer text
generation from the abstract content (captured with the constructors).
Then, the template language parser will produce a series of function
compositions, whose details are further described in Google.org Fellow
Ariel Gutman and Professor Maria Keet’s template language specification
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Template_Language_for_Wi…>.
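To make the idea of a template becoming a series of function compositions
a little more concrete, here is a minimal sketch in Python. It is not the
syntax or the parser from the template language specification: the
curly-brace slot notation and the names literal, slot, parse_template, and
realize are all assumptions made up for this illustration.

```python
import re
from typing import Callable, Dict, List

# A rendering step takes the Constructor's arguments and returns text.
Step = Callable[[Dict[str, str]], str]


def literal(text: str) -> Step:
    """Step that always emits the same fixed text."""
    return lambda args: text


def slot(name: str) -> Step:
    """Step that emits the value of one named argument."""
    return lambda args: args[name]


def parse_template(template: str) -> List[Step]:
    """Turn a toy '{slot}' template into a list of steps (hypothetical syntax)."""
    steps: List[Step] = []
    # The capturing group keeps the '{...}' pieces in the split result.
    for piece in re.split(r"(\{[^{}]+\})", template):
        if not piece:
            continue
        if piece.startswith("{") and piece.endswith("}"):
            steps.append(slot(piece[1:-1]))
        else:
            steps.append(literal(piece))
    return steps


def realize(steps: List[Step], args: Dict[str, str]) -> str:
    """Compose the steps by applying each one and joining the results."""
    return "".join(step(args) for step in steps)


steps = parse_template("{entity} is {age_in_years} years old.")
sentence = realize(steps, {"entity": "Malala Yousafzai", "age_in_years": "25"})
print(sentence)
```

A real realizer would also need to handle agreement, word order, and
morphology, which this toy version ignores entirely.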
It’s important for us to begin creating some standards for these functions
now in order to limit complexity and ensure interoperability, so that
abstract content can indeed benefit all languages and so that the community
can write Constructors and Renderers on Wikifunctions
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia#Project> with relative
ease. Maria Keet addressed some of the complexities of NLG in agglutinating
African languages in a TechTalk
<https://commons.wikimedia.org/wiki/File:Knowledge-to-text_Natural_Language_…>
she gave to Google fellows at a meeting they held in Zurich in August.
To have a better idea of how the NLG realizer’s implementation may look,
Ariel Gutman has started creating a Scribunto prototype
<https://meta.wikimedia.org/wiki/Module:Sandbox/AbstractWikipedia>, which
will inform the Wikifunctions implementation. Mahir Morshed has also created
the Ninai <https://gitlab.com/mahir256/ninai/> and Udiron
<https://gitlab.com/mahir256/udiron/> libraries in Python to prototype the
realizer. We will share more about the prototype in a future Diff post. At
the same time, Google.org Fellow Sandy Woodruff has started reflecting
on a dedicated UI for the NLG system. You can learn about some of her
ideas in a brainstorming session
<https://commons.wikimedia.org/wiki/File:Natural_Language_Generation_in_Wiki…>
held at the aforementioned meeting.
One open question concerns the Constructors themselves. A Constructor
represents a piece of abstract content. Let’s adapt an example from the
template language specification:
Age(
  entity: Malala Yousafzai (Q32732)
  age_in_years: 25
)
This is a Constructor that represents a fact true at the time of writing,
namely the age of Malala Yousafzai, which would be rendered in English as
“Malala Yousafzai is 25 years old.” Note that, in reality, “age_in_years”
would itself likely be defined by a function call that calculates age based
on birth date and the present date, but this detail is omitted here for
clarity.
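The omitted detail can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the Age class, the age_in_years
helper, and render_english are hypothetical names, not Wikifunctions
objects, and the birth date (12 July 1997) comes from outside this post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Age:
    """Hypothetical stand-in for the Age Constructor shown above."""
    entity_label: str
    entity_qid: str
    age_in_years: int


def age_in_years(birth: date, today: date) -> int:
    """Count the completed years between a birth date and a given day."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def render_english(constructor: Age) -> str:
    """Toy English renderer for the Age Constructor."""
    unit = "year" if constructor.age_in_years == 1 else "years"
    return f"{constructor.entity_label} is {constructor.age_in_years} {unit} old."


# The post is dated 21 September 2022, used here as "the present date".
malala = Age(
    entity_label="Malala Yousafzai",
    entity_qid="Q32732",
    age_in_years=age_in_years(date(1997, 7, 12), date(2022, 9, 21)),
)
print(render_english(malala))
```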
Many of our open questions concern how representative this example
Constructor is. This example represents a single proposition and can be
realized as a sentence in most (maybe all?) natural languages, but will
that be true of all Constructors? What if some Constructors embed multiple
propositions? Is it possible for a Constructor to correspond to an
incomplete proposition?
Another set of questions concerns how general the relationship between a
Constructor and its participant entities should be. We might imagine a
Constructor for the sentence, “Bi Sheng invented movable type in 1040 AD.”
In order to make Constructors reusable across languages and for multiple
propositions, we would want to enshrine more general scenes or frames like
“Age” above or, in this case, “Invent.” What, if any, linguistic
formalization should be adopted for this purpose? FrameNet
<https://framenet.icsi.berkeley.edu/fndrupal/> is one possibility, but
might another work better, or does Abstract Wikipedia demand an *ad
hoc* solution?
How do we handle information which belongs in a sentence but isn’t
intrinsically part of a proposition, e.g. “in 1040 AD” from the given
example, which isn’t a “core” part of the notion of inventing something the
way that the inventor and invention are? Kutz Arrieta from Google has begun
thinking about these questions
<https://docs.google.com/document/d/1CDqpNgynN34qcRBi__KwxdeKLvNm6EQq/edit?p…>.
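One way to picture the distinction between core participants and optional
information is the sketch below. It is not a proposal from the team: the
Frame class, its core and adjuncts fields, and the role names are
assumptions chosen for this example, loosely inspired by frame semantics.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Frame:
    """Hypothetical frame: required core roles plus optional adjuncts."""
    name: str
    core: Dict[str, str]
    adjuncts: Dict[str, str] = field(default_factory=dict)


def render_invent_english(frame: Frame) -> str:
    """Toy English renderer for an 'Invent' frame."""
    # Core roles must be present; a missing one raises KeyError.
    sentence = f"{frame.core['inventor']} invented {frame.core['invention']}"
    # Adjuncts such as time are appended only when they are supplied.
    if "time" in frame.adjuncts:
        sentence += f" in {frame.adjuncts['time']}"
    return sentence + "."


with_time = Frame(
    name="Invent",
    core={"inventor": "Bi Sheng", "invention": "movable type"},
    adjuncts={"time": "1040 AD"},
)
without_time = Frame(
    name="Invent",
    core={"inventor": "Bi Sheng", "invention": "movable type"},
)
print(render_invent_english(with_time))
print(render_invent_english(without_time))
```

Keeping adjuncts in a separate mapping means the same frame can be reused
whether or not the date is known, which is one possible answer to the
question above, not the only one.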
Once the Constructors have done their job, the Renderers’ work begins. The
working NLG proposal presumes that the lexical forms in Wikidata will be
marked with grammatical features (*e.g.*, number for nouns and verbs,
gender or class for substantives, aspect and tense and mood for verbs, …).
Mahir Morshed and the rest of the NLG contributors have begun work on
standardizing these representations in Wikidata’s lexicographical content
<https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation/L…>,
but our NLG system can’t assume the data will always be present or
complete. Therefore, our questions here concern how to address missing
lexical data. When the system generates a sentence, can it provide multiple
possibilities for words it’s uncertain about? Should it allow the user to
add new terms at that time? If so, how would it guide them to contribute to
Wikidata from another project’s context?
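As a sketch of the first option (offering several possibilities when the
system is uncertain), consider the toy lookup below. The in-memory LEXICON
and the candidate_forms function are hypothetical; they only imitate forms
tagged with grammatical features and do not use Wikidata's actual data
model or API.

```python
from typing import Dict, FrozenSet, Iterable, List

# Toy lexicon: lemma -> forms keyed by their grammatical features.
LEXICON: Dict[str, Dict[FrozenSet[str], str]] = {
    "year": {
        frozenset({"singular"}): "year",
        frozenset({"plural"}): "years",
    },
    "invent": {
        frozenset({"present", "singular"}): "invents",
        frozenset({"present", "plural"}): "invent",
        # The past-tense form is deliberately missing from this toy data.
    },
}


def candidate_forms(lemma: str, features: Iterable[str]) -> List[str]:
    """Return the exact form if known, otherwise every known form."""
    forms = LEXICON.get(lemma)
    if forms is None:
        # Unknown lemma: nothing to offer, so the user must add the term.
        return []
    exact = forms.get(frozenset(features))
    if exact is not None:
        return [exact]
    # No exact match: offer all known forms for the user to choose from.
    return sorted(set(forms.values()))


print(candidate_forms("year", ["plural"]))
print(candidate_forms("invent", ["past"]))
print(candidate_forms("fig", ["plural"]))
```

An empty result is the point at which a user interface could invite the
reader to contribute the missing term.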
These are big questions, but hopefully the challenges they present look
exciting, rather than intimidating. As always, we welcome your contributions
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia#Participate>. We hope
that the breadth of experience and sheer number of languages present within
the community will help us find the most equitable solutions possible.