The on-wiki version of this newsletter can be found here:
https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-09-26
--
Quarterly planning for October–December 2024
We have planned the last quarter of the Western calendar year (or, as we
call it internally, Q2 of FY 2024/25), and as with the last few times, we
are making our plan public. This time, we have even finished our planning
before the quarter starts (yay!). It is a “short” quarter (due to a planned
team offsite and the end of year holidays), and yet we've picked up a lot
of work. Here’s a quick overview:
- *Enable one Wikifunctions use case in one language Wikipedia*: Just
two weeks ago, we announced that we aim to have our first integration
with Wikipedia
<https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-09-13>
on
the Dagbani Wikipedia. We aim to develop everything needed for that
integration this quarter, and will likely deploy it very early next year (
*i.e.*, January 2025).
- *Wikipedia integration usability improvements*: We will continue to
research, design, and user-test usability enhancements that make the
integration of Wikifunctions into Wikipedia easier. The implementation of
these design improvements will happen afterwards.
- *Iterate the Wikidata integration, and plan its and the Type system's
evolution*: We are very close to the first integration of Wikidata
into Wikifunctions. The next quarter will see us extend that integration to
cover more parts of the Wikidata data model, and to evolve the
Wikifunctions type system to work with that.
- *Wikifunctions services alert monitoring*: We want to be automatically
notified when the Wikifunctions services are having issues.
- *Service platform improvements*: Our services are built on top of an
outdated "template" of how to write a back-end service, originally created
a decade ago before many changes in how Wikimedia manages them. We want to
modernise our services, replacing the base platform with a simpler, faster
framework. We also will explore rewriting the evaluator in a different
language better suited to process management.
- *On-wiki tooling to improve content and help editors onboard*: We plan
to create a set of related special pages to support the Wikifunctions
community with maintenance, like finding proposed Implementations that need
to be connected, or Functions that don't have any labels in a given
language like French or Igbo.
- *Testing Wikifunctions Services with Catalyst*: Catalyst is the
Wikimedia Foundation’s platform to support development through Continuous
Integration and testing. We want to integrate the Wikifunctions back-end
services with it.
- *Improve performance of the PHP layer*: We want to give the MediaWiki
layer a proper audit. For example, we know that we are validating objects
more often than needed. The goal is to cut unnecessary work and improve
performance.
- *Make Phabricator more useful for the team*: Phabricator is our main
task and bug management system, but our board needs some work so that we
can better handle its many tasks and stay focussed on working on the right
things at the right time.
- *Establish team chores practice*: As a team, we want to adopt
practices to help us improve the reliability of the site and our
responsiveness to the issues and questions that you raise on the site.
We are looking forward to these developments, and hope you share in our
excitement about the upcoming few months.
Abstract Wikipedia presentation at Celtic Knot on Friday
The Celtic Knot conference 2024
<https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2024> started
yesterday in Waterford City, Ireland. On Friday, September 27, 2024, our
own Genoveva Galarza Heredero will be presenting about Abstract Wikipedia at
11:30 local time / 10:30 UTC <https://zonestamp.toolforge.org/1727433000>.
You can attend the talk in person or online
<https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2024/Attend>, and
the talk will also be recorded.
Abstract: *The availability of knowledge across different language editions
of Wikipedia is far from equal. In this presentation, we will discuss
Wikifunctions, a new initiative by the Wikimedia Foundation that empowers
users to collaboratively build and maintain a shared library of code
functions to enhance Wikimedia projects. Following this, we will introduce
Abstract Wikipedia and explore how it utilizes Natural Language Generation
functions maintained in Wikifunctions, combined with structured data from
Wikidata, to create abstract representations of Wikipedia articles. These
abstract articles can then be converted into natural language, helping to
bridge the gap between languages on Wikipedia.*
Recent Changes in the software
Here's an update for the last two weeks' work on technical matters.
Our main work was, as discussed last week, focused on the performance
issues many of you encountered (T374241
<https://phabricator.wikimedia.org/T374241> and others). We worked around
the issue with an on-wiki edit on Thursday, 19 September. Going out this
week is a fix for the infinite-recursion crasher we identified as the
immediate cause, but there is more work to come in this area.
This week, as part of the Quarterly work to support using Wikidata Lexemes
and other entities in Wikifunctions calls, we have landed the custom
selector for Wikidata Lexemes (T373589
<https://phabricator.wikimedia.org/T373589>). We look forward to announcing
more on this soon, including creating on-wiki all of the various new Types (
T370341 <https://phabricator.wikimedia.org/T370341>, T370343
<https://phabricator.wikimedia.org/T370343>, T370344
<https://phabricator.wikimedia.org/T370344>, T370346
<https://phabricator.wikimedia.org/T370346>, T370347
<https://phabricator.wikimedia.org/T370347>, T372594
<https://phabricator.wikimedia.org/T372594>) and their custom references
and utility Functions (T374533 <https://phabricator.wikimedia.org/T374533>).
We also added these new Types to the prohibited local content list; users
should always load them from Wikidata, not try to store them on
Wikifunctions (T373371 <https://phabricator.wikimedia.org/T373371>).
In terms of user-facing bug fixes, we fixed the editing of Objects so that
regular users are no longer stopped from making an edit when the Object is
not fully initialised, e.g. if you were trying to add or adjust a label on
a Type that doesn't have an Equality Function set (T374931
<https://phabricator.wikimedia.org/T374931>).
We think this may have been the issue underlying being unable to edit some
pre-defined Objects, which I know has been irritating to a few of you (
T362011 <https://phabricator.wikimedia.org/T362011>).
We adjusted the behaviour of the general Object (and the more specific
language selector) to work with how Codex now expects us to use it,
following-up on our immediate bug-fix two weeks ago (T374248
<https://phabricator.wikimedia.org/T374248>). We improved the restrictions
in the object selector for situations like in Types or Implementations
where you're selecting function calls (T372995
<https://phabricator.wikimedia.org/T372995>).
We landed a coding tweak that made our front-end JS import format
consistent with our new standard, already implemented in our Vue code (
T334939 <https://phabricator.wikimedia.org/T334939>).
We added support to the Wikifunctions stack for the new languages Haryanvi
in Arabic script, Z1938/bgc-arab
<https://www.wikifunctions.org/view/en/Z1938> (T373561
<https://phabricator.wikimedia.org/T373561>) and Negeri Sembilan Malay,
Z1939/zmi <https://www.wikifunctions.org/view/en/Z1939> (T373931
<https://phabricator.wikimedia.org/T373931>), which have been added to
MediaWiki for translations.
We, along with all Wikimedia-deployed code, are now using the latest
version of the Codex UX library, v1.13.0, as of this week. The colours used
in Codex have slightly changed to be more accessible and work better with
dark mode; otherwise, we believe that there should be no further
user-visible changes on Wikifunctions, so please comment on the Project
chat or file a Phabricator task if you spot an issue.
Function of the Week: multiply two natural numbers
Last week, Yuntian Deng <https://yuntiandeng.com/>, a professor at the
University of Waterloo, was discussing the accuracy of large language
models when multiplying large integers
<https://x.com/yuntiandeng/status/1836114401213989366>, and it’s a really
interesting discussion: it shows how much better the new GPT-o1 model is
compared to the previous GPT-4o model on this particular task. He also
points out that a “small” model with more than 100 million parameters can
solve this task with over 99% accuracy for twenty-digit numbers.
In case you want 100% accuracy even with larger numbers and using far less
energy, you are invited to use Wikifunctions: we have a function for
multiplying two natural numbers
<https://www.wikifunctions.org/view/en/Z13539> that does the job very well!
The function offers five implementations:
- One in Python using Python’s * operator
<https://www.wikifunctions.org/view/en/Z13543>
- One in JavaScript using JavaScript’s * operator
<https://www.wikifunctions.org/view/en/Z13540> for BigInt
- One composition implements a recursive function
<https://www.wikifunctions.org/view/en/Z14073>, which first checks
whether either of the two arguments is zero (and returns zero if so); if
not, it adds the greater number to the result of multiplying the greater
number by the lesser number minus one. Or, put differently, it turns a×b
into a + a×(b-1) (see the sketch after this list)
- One composition <https://www.wikifunctions.org/view/en/Z14760> using
the hyperoperation <https://en.wikipedia.org/wiki/Hyperoperation> of
rank 2
- And one more composition
<https://www.wikifunctions.org/view/en/Z17374> that
turns the natural numbers into integers, uses integer multiplication, and
then converts the result back to a natural number
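As an illustration of the recursive composition mentioned above, here is a
minimal sketch in Python (the function name and argument handling are ours,
not the on-wiki definitions):

    # a×b rewritten as a + a×(b-1), with zero as the base case,
    # always recursing on the lesser of the two numbers
    def multiply(a: int, b: int) -> int:
        greater, lesser = max(a, b), min(a, b)
        if lesser == 0:
            return 0
        return greater + multiply(greater, lesser - 1)

    print(multiply(2, 3))   # 6
    print(multiply(10, 0))  # 0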
And five tests confirm that these implementations all work with 100%
accuracy:
- 1×1 is 1 <https://www.wikifunctions.org/view/en/Z13541> (testing the
unit)
- 2×3 is 6 <https://www.wikifunctions.org/view/en/Z13542>
- 0×10 is 0 <https://www.wikifunctions.org/view/en/Z13544> (testing the
zero)
- 10×0 is 0 <https://www.wikifunctions.org/view/en/Z13545> too (and
commutativity for the last one)
- 2⁴²×2²⁴=2⁶⁶, or 4398046511104×16777216=73786976294838206464
<https://www.wikifunctions.org/view/en/Z13550> (not 20 digits each, but
pretty large numbers)
So, next time you need two large integers multiplied, we recommend
Wikifunctions instead of ChatGPT.
The on-wiki version of this newsletter can be found here:
https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-09-06
======
Recording of September’s Volunteer’s Corner is now available on Commons
As always, the recording of this month’s Volunteer’s Corner is available for everyone to watch on Wikimedia Commons[https://commons.wikimedia.org/wiki/File:Abstract_Wikipedia_Voluntee…].
Recent Changes in the software
This week was a quieter one in terms of features shipped, as several people were out and we were focused on finishing the bigger Quarterly pieces of work.
In terms of user-facing changes, we tweaked the code that integrates the special view of Objects with MediaWiki. This means that the "Tools" menu for the page now has the "What Links Here" and "Page Information" tools, amongst others (T343594[https://phabricator.wikimedia.org/T343594]). If you're editing a Test case to change the target Function, we now immediately clear the results widget rather than require you to manually re-run it. The message that warns you that you cannot run a Function because it has no connected Implementations now uses that term, rather than the old "approved" wording (T345848[https://phabricator.wikimedia.org/T345848]).
Function of the Week: lists have unequal length[https://www.wikifunctions.org/view/en/Z13310]
The list inequality function accepts two lists and returns a Boolean that indicates if the input lists are of different lengths. This simple function provides an efficient way to compare the lengths of two lists (or other iterables). It's helpful in scenarios where you want to ensure that two lists are not of the same size before proceeding with other more complicated operations that depend on lengths of lists.
We appreciate simple functions like this which have exemplary uses. A function that checks if two lists are of unequal length has numerous practical applications across various fields. We can use it in inventory[https://en.wikipedia.org/wiki/Inventory] management, to verify that product and quantity lists are aligned to prevent inventory mismatches. It can be used in data validation[https://en.wikipedia.org/wiki/Data_validation] and in matching financial records, to help detect missing or incomplete data by checking transaction lists for unequal lengths. Similarly, we can use it in form validation, parallel processing, and many other, more complex areas.
We currently have three implementations for this function - one in JavaScript, one in Python, and one as a composition. Both the JavaScript[https://www.wikifunctions.org/view/en/Z14509] and Python[https://www.wikifunctions.org/view/en/Z13313] implementations compare the two inputs by evaluating their lengths and return the result of the inequality (!==) between these lengths, providing an efficient way to check if the two inputs are of different sizes.
The composition[https://www.wikifunctions.org/view/en/Z13311] uses the list equality function and negates its result to provide the necessary inequality result for this function.
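A minimal Python sketch of the length comparison described above (illustrative only, not the on-wiki code):

    # return True when the two lists have different lengths
    def lists_have_unequal_length(first: list, second: list) -> bool:
        return len(first) != len(second)

    print(lists_have_unequal_length([1, 2, 3], [4, 5]))  # True
    print(lists_have_unequal_length(["a"], ["b"]))       # False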
The function has two tests that demonstrate its use.
The on-wiki version of this newsletter can be found here:
https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-09-20
--
Introducing focus topic areas
As we are moving closer to making it possible to generate natural language
text, we are starting to think about introducing topical foci. We already
have focus languages
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-04-15>,
but we are also considering introducing focus topic areas.
<https://www.wikifunctions.org/wiki/File:Bokeh_Example.jpg>
As we have discussed before, we expect communities to create (at least) two
different types of articles using Abstract Wikipedia: on the one hand, we
will have highly-standardised articles based mostly on Wikidata, called model
articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07>;
and on the other hand, we will have bespoke, hand-crafted articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-21>,
assembled sentence by sentence.
We suggest introducing (at least) two different focus topic areas: one
focus area for the model articles, and a different focus area for the
manually-written articles. The two types of articles benefit from
different topic areas: model articles are better suited to areas with many
individual items that can be described uniformly in Wikidata, whereas
manually-written articles are better suited to topic areas where each
article is quite different from the others, and where the corresponding
items in Wikidata are often rather empty.
We would prefer to choose topic areas that are not highly contentious,
either within a single language or across language barriers. This is
particularly true for the model articles approach: both subtle and blatant
differences may easily escape the careful consideration they need,
especially if we create thousands or millions of articles!
We would prefer a topic area that can invite contributions from all over
the world. For example, articles about kabuki theater
<https://en.wikipedia.org/wiki/Kabuki> would make a great contribution to
the knowledge of the world, but it is expected that most contributors would
come from one single country, speaking mostly one language.
We would prefer a topic area that is of interest to the wider population.
Whereas Wikidata is known for its coverage of scientific papers and
astronomical objects, both of those topic areas seem to have a limited
readership, which also limits the value they would bring.
Having said that, we propose *food* as the topic area for hand-crafted
articles. We will write a more detailed weekly update about why food makes
such a great topic area, and if there are no vetoes or better suggestions,
we will adopt it as one of our two focus areas.
For model articles, we are looking for a discussion to see what we should
select: the two most obvious topic areas would be human settlements and
people, but both have a lot of potential for being contentious. Biological
species are another interesting topic area, but are often much more
complicated than expected. There are many other interesting topic areas,
and we would love to hear your suggestions, thoughts, considerations, and
see the discussion.
Note that we most definitely won’t stop anyone from creating the content
they care about. You will be absolutely free to create articles on the
topic areas you care about, and they can be model articles or they can be
manually-written articles. The focus area is merely to help the development
team focus and to help set expectations when working together with you as
communities. If you want to write an abstract article about a specific
1980s fashion fad or create model articles for crochet patterns, you are
more than welcome to do so. We just want to help you understand our
prioritisation.
Please chime in on which topic areas you think would make particular sense,
so that we can come to a preliminary decision in the following weeks. Thank
you!
Site instability update
We had an ongoing incident, which we already reported in last week’s
update. Together with our colleagues in SRE, we spent quite a while trying
to figure out what was going on, without success. Most frustratingly, we
could not reproduce the issues we saw in production in any other
environment, which made debugging really difficult.
The issue surfaced as about 10–20% of function pages timing out, numerous
test and implementation pages failing, and other problems. All tests failed
consistently. SRE got paged so often that they had to switch off the
monitoring on Wikifunctions.
We kicked off our incident procedure, and continuously increased the
resources dedicated to the issue. We found several possible culprits, but
since we were unable to replicate the issue locally, trying to fix it was
often a frustrating cycle of deploying to production and checking whether
that helped. As of now, we hope that the site has recovered. As we threw
our net wider, we were able to find an edit to Wikifunctions that seemed to
have caused an infinite loop on certain validations. We were able to roll
back that edit.
We are still investigating how that edit led to such a big impact, why we
didn’t catch this issue sooner, and what to do to ensure that we don’t run
into a similar situation again. If you continue to see pages time out,
please let us know.
Thanks everyone for your patience. We apologize for being a bit vague
still, but we want to first understand the root cause of the issue a bit
better before making it easily visible how one may break the site again. We
understand that some of you could easily look into the site to figure out
details right now, but we would like to ask you to not share that knowledge
too widely just yet. We plan to say a bit more about this in the coming
weeks, once we have had a bit more time to understand the root cause. Thank
you for your patience!
Function of the Week: Caesar cipher for Bengali alphabets
We had talked about the Caesar cipher before
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2024-01-17#Funct…>
in
the Function of the Week. It is an old form of cryptography, where each
letter is shifted by a certain number of positions. Traditionally, it is
applied to texts written in the Latin alphabet.
One large advantage of Wikifunctions is that we can easily create and
deploy new functions, even functions that haven’t been implemented before,
and make them available to everyone in the world (or, at least, everyone
with access to the Web). This way, our contributors can create functions
which have probably been thought about, but haven’t seen a widely available
implementation before: for example, one could take the idea of the Caesar
cipher and apply it to other alphabets.
And this is exactly what we have in our current Function of the Week: it
applies the idea of the Caesar cipher - of shifting letters along a given
alphabet - to the Bengali alphabet
<https://en.wikipedia.org/wiki/Bengali_alphabet>. The Bengali alphabet is a
script of the Indian subcontinent, used for a good thousand years in a
number of languages, including classical languages such as Sanskrit and
living languages such as Bangla, which has more than a quarter of a
*billion* speakers.
The function Caesar cipher (Bengali alphabets) (Z17530)
<https://www.wikifunctions.org/view/en/Z17530> takes two arguments: the
Bengali string to be encoded, and the shift value, a number that decides by
how many letters to shift it. The return value is a string representing the
encoded Bengali input.
There is currently one implementation in Python
<https://www.wikifunctions.org/view/en/Z17532>, which is based on an array
containing the Bengali alphabet; it goes through the input string and
replaces each letter with the shifted letter.
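As a rough illustration of that approach, here is a minimal Python sketch
using a small, illustrative subset of the alphabet (the on-wiki
implementation uses the full Bengali alphabet, and how it treats characters
outside the alphabet is not shown here, so passing them through unchanged
is our assumption):

    # illustrative subset only; the real alphabet list is much longer
    ALPHABET = ["অ", "আ", "ই", "ঈ", "ক", "খ", "গ", "ঘ"]

    def caesar_cipher(text: str, shift: int) -> str:
        result = []
        for char in text:
            if char in ALPHABET:
                index = ALPHABET.index(char)
                result.append(ALPHABET[(index + shift) % len(ALPHABET)])
            else:
                result.append(char)  # assumption: leave other characters as-is
        return "".join(result)

    print(caesar_cipher("অআকখ", 2))  # ইঈগঘ, as in the first test below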
The function has six tests:
1. Shifting অআকখ by 2 <https://www.wikifunctions.org/view/en/Z17533> results
in ইঈগঘ
2. Shifting ক by 38 <https://www.wikifunctions.org/view/en/Z17576> results
in অ
3. Shifting হ by 1 <https://www.wikifunctions.org/view/en/Z17627> results
in ড়
4. Shifting অ by 49 <https://www.wikifunctions.org/view/en/Z17531> results
in অ (i.e. that’s the identity shift)
5. Shifting য় by 1 <https://www.wikifunctions.org/view/en/Z17626> results
in ৎ
6. Shifting অ by 11 <https://www.wikifunctions.org/view/en/Z17577> results
in ক (which is the reverse of the second test)
The existing implementation passes all six tests.
The tests look great, and cover a few interesting cases (such as the
identity shift, or a reverse). I would add tests for a zero shift, for a
shift beyond 49 (e.g. for 98 or for 100), and tests that shift whole words
instead of single letters, including letters and characters which are not
in the Bengali alphabet. It would also be good to have more than just one
implementation.
But the really interesting part is that this gives a widely available
implementation of the Caesar cipher for an alphabet where that wasn’t
available before. I am looking forward to seeing what other functions we
can make available in novel contexts, and to seeing whether they gain any
traction.
Dagbani Wikipedia will be our first wiki for Wikifunctions integration
As we wrap up this quarter’s work and begin planning for the next, I want
to discuss the progress of our biggest initiative this fiscal year:
integrating Wikifunctions with Wikipedia articles. Our focus languages for
the project
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-04-15> are
Bangla, Igbo, Hausa, Dagbani, and Malayalam. We have chosen *Dagbani* as
the first Wikipedia for this integration.
<https://www.wikifunctions.org/wiki/File:Dagbani_Wikimedians_User_Group.jpg>
This Quarter
<https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-07-03>,
we have focused on building a design prototype, consulting with internal
teams to make key decisions, and drawing valuable insights from Wikidata’s
initial integration experience. Now, we're ready to apply these learnings
to Dagbani Wikipedia, creating a design that not only fits seamlessly with
this wiki but will also scale effectively to larger ones in the future.
Involving user groups in shaping a new feature like this is essential
for gaining insights into users' needs and pain points. Your feedback
ensures that the product meets real expectations, enhancing usability and
relevance. Early input helps identify issues before launch, reducing costly
redesigns. Ultimately, involving users fosters ownership, driving product
adoption and success.
We want to adopt this product philosophy in our integration work. We will
be reaching out to our Dagbani Wikipedia community in the coming weeks to
form a working group that can help us deliver this project to the Dagbani
Wikipedia in a meaningful way. We want to form a diverse team of new
editors, experienced editors and passionate readers. We are aiming for a
small group of 3–5 people. Our idea is to meet with this group regularly,
involving them in design prototype reviews, local demo reviews, exchanging
product ideas and vision, and in turn building our confidence in the
usefulness and readiness of our solutions.
Site reliability issues
As many of you will know, we've been having some stability challenges with
the site for the past few days, in part caused by a surge of Web crawler
traffic overloading the servers set aside for running Wikifunctions. This
has taken the form of several issues, including the whole site appearing
down (T374318 <https://phabricator.wikimedia.org/T374318>), or
intermittently breaking on pages that work sometimes (T374305
<https://phabricator.wikimedia.org/T374305> and T374241
<https://phabricator.wikimedia.org/T374241>). We have put in place a few
mitigations to try to reduce the load generated by non-human users. This has
included temporarily banning Anthropic's ClaudeBot via robots.txt
<https://www.wikifunctions.org/wiki/MediaWiki:Robots.txt>, and replacing
the standard, Wikimedia-wide site reliability monitoring suite with a
custom, more relevant, simpler one with less load (T374442
<https://phabricator.wikimedia.org/T374442>). However, these have so far
had limited effect, and we continue to review and try to improve the
situation. Our apologies for the disruptions.
At the same time, we are also noticing a novel issue with validation, and
are currently simplifying the validation workflows. This might lead to
issues, as some objects might go unvalidated. Please let us know if you see
weird new errors, and particularly if error messages are missing where you
would expect them.
Recent Changes in the software
Disconnected from the above site issues, we were alerted to a bug in last
week's code that meant you couldn't select instances of Types in the
selector; we made a quick fix for this, with a test to avoid future
regressions, and back-ported it into production on Monday (T374199
<https://phabricator.wikimedia.org/T374199>). We are thankful to
GrounderUK and other community members who noticed this, and sorry for the
disruption.
As an additional breakage, all of our end-to-end API testing that uses the
Beta Cluster unfortunately broke last week, so we have temporarily disabled
those tests and are now relying on manual testing alone (T374242
<https://phabricator.wikimedia.org/T374242>).
One of the big parts of our Quarterly work is preparing for the "Wikipedia
integration", in which you will be able to embed Wikifunctions call results
in wikitext (T261472 <https://phabricator.wikimedia.org/T261472>). We
landed some improvements there, in particular changes to separate the
concerns between the 'client' code, running on Wikipedias, and the 'repo'
code, running on Wikifunctions.org. More of this work should land soon,
including a demonstration.
Another part of our Quarterly work is preparing for being able to reference
Wikidata items in Function calls (T282926
<https://phabricator.wikimedia.org/T282926>). We’ve made some changes to
our conceptual model of references, which we expect to be a temporary
work-around for the next few months, so that Wikidata references can be
used ahead of any wider Type calculus reforms. This means that our code to
check if something is a reference will, at least for now, stop recognising
the forms "Q1234" or "L1234", and only accept "Z1234" or "Z1234K1" (T373859
<https://phabricator.wikimedia.org/T373859>). The back-end code to access
these and formalise them into Types (T370072) continues, and we hope to
demonstrate it soon.
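For illustration, the stricter check could look roughly like this in Python
(the exact pattern used in our code is an assumption on our part):

    import re

    # only ZIDs ("Z1234") and their keys ("Z1234K1") count as references now
    REFERENCE_PATTERN = re.compile(r"^Z[1-9]\d*(K[1-9]\d*)?$")

    def is_reference(value: str) -> bool:
        return bool(REFERENCE_PATTERN.match(value))

    print(is_reference("Z1234"))    # True
    print(is_reference("Z1234K1"))  # True
    print(is_reference("Q1234"))    # False: Wikidata item IDs no longer match
    print(is_reference("L1234"))    # False: nor do Lexeme IDs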
We made a handful of UX improvements that go out this week. When making
changes via the About control (rather than the whole-page editing flow), we
fixed the "Cancel" button in the publish dialog to return you to the
editor, rather than throw away all your changes – sorry for that (T360062
<https://phabricator.wikimedia.org/T360062>). When a Z6/String value is
very long, we now ask your browser to wrap the text rather than have it
overflow (T373987 <https://phabricator.wikimedia.org/T373987>). We fixed
the width of the Function editor when on narrow screens (below 500px), such
as mobile phones (T366675 <https://phabricator.wikimedia.org/T366675>). We
updated the object selector to be smarter about the restrictions on what
Functions to search for in particular contexts where we know the 'shape'
expected (T372995 <https://phabricator.wikimedia.org/T372995>).
We also made some general technical improvements. We have replaced our old
temporary Tooltip component with Codex's proper one, now that it exists (
T298040 <https://phabricator.wikimedia.org/T298040>). This nearly completes
our replacement with upstream components. We have just the Table used on
Function pages to list Implementations and Test cases to go (T373197
<https://phabricator.wikimedia.org/T373197>). We're hugely thankful to the
Design System team for their work developing the Codex library to the point
where our *ad hoc* versions are no longer needed. As part of our
long-running migration from strings to references for Z61/Programming
language objects (T287153 <https://phabricator.wikimedia.org/T287153>), we
have completed the dropping of support for them in the UX layer. All
existing content on Wikifunctions.org was migrated back in May/June, so
this should not have any disruptive effect.
We, along with all Wikimedia-deployed code, are now using the latest
version of the Codex UX library, v1.12.0, as of this week. We found one
change that broke how we were using the "lookup" component, which we have
worked around (T374248 <https://phabricator.wikimedia.org/T374248>) ahead
of an upstream fix (T374246 <https://phabricator.wikimedia.org/T374246>);
we believe that there should be no further user-visible changes on
Wikifunctions, so please comment on the Project chat or file a Phabricator
task if you spot an issue.
Function of the Week: count substrings
Recently, LLMs <https://en.wikipedia.org/wiki/Large_language_models> made a
small news cycle because they failed at the question “how often does the
letter *‘r’* appear in *‘strawberry’*?” (You can easily find coverage about
this on various fora and news sites.)
Wikifunctions has no such problem: with the function “count substrings” (count
substrings (Z14450) <https://www.wikifunctions.org/view/en/Z14450>) we can
easily ask how often the substring *‘r’* appears in the string
*‘strawberry’*, and, unsurprisingly, it returns 3.
The function has two implementations, one in JavaScript and one in Python:
- the Python implementation
<https://www.wikifunctions.org/view/en/Z14451> simply relies on the
built-in .count method of Python's String object
- the JavaScript implementation
<https://www.wikifunctions.org/view/en/Z15718> uses the match function,
which takes a regular expression based on the string to search for, and
which returns all matches of the second string in the first. The matches
are then counted using the length attribute. There is a special case for
when there are no matches, in which case 0 is returned.
The JavaScript implementation is a good example of a seemingly simple
functionality with a surprisingly complex implementation. And yet, the
suggested implementation is susceptible to errors. Since the second
argument gets turned into a regular expression, some symbols mess up the
search. I added a test for that case
<https://www.wikifunctions.org/view/en/Z19022>, the second to last in the
list below. Fortunately, only the Python implementation is connected, so
the error in the JavaScript implementation is not actually used -- the
approval system worked as intended.
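A minimal Python sketch of the two approaches (illustrative, not the
on-wiki code), showing why treating the substring as a regular expression
goes wrong:

    import re

    def count_substrings(haystack: str, needle: str) -> int:
        # the connected approach: count non-overlapping literal occurrences
        return haystack.count(needle)

    def count_substrings_regex(haystack: str, needle: str) -> int:
        # the naive regex approach: the needle is interpreted as a pattern
        return len(re.findall(needle, haystack))

    print(count_substrings("strawberry", "r"))  # 3
    print(count_substrings("aaaaa", "aa"))      # 2 (non-overlapping)
    print(count_substrings("aa", "a+"))         # 0: "a+" is not a literal substring
    print(count_substrings_regex("aa", "a+"))   # 1: "a+" matched as a pattern instead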
The function offers seven tests:
- *“hello, hello, hello”* has 3 times
<https://www.wikifunctions.org/view/en/Z14452> *“hello”*
- *“hello, hello”* has no <https://www.wikifunctions.org/view/en/Z14455>
*“world”*
- *“aaaaa”* has 2 times <https://www.wikifunctions.org/view/en/Z14453>
*“aa”*
- *"Талалаївка"* has 2 times
<https://www.wikifunctions.org/view/en/Z16041> *"ла"*
- *“[Ti++++]”* has 4 times <https://www.wikifunctions.org/view/en/Z16726>
*“+”*
- *“aa”* has no <https://www.wikifunctions.org/view/en/Z19022> *“a+”*
- And finally, *“strawberry”* has 3 times
<https://www.wikifunctions.org/view/en/Z19021> *“r”*
Since I added the last two tests while writing this entry, they are not
connected yet.
The third test here is particularly interesting, showing how such a
seemingly simple function can have very different interpretations: one
could argue that *“aaaaa”* has the string *“aa”* four times, at positions
1, 2, 3, and 4, but the function counts non-overlapping occurrences: *“aa”*
only fits twice into *“aaaaa”* that way. This is why tests are so
important, to show and agree on the exact meaning of the function.
It would be great to see more tests with other scripts, such as for example
Arabic or Chinese. It is nice to see Cyrillic represented in one of the
tests.
It is not surprising that current LLMs struggle with this question, due to
the way they work. We strongly believe that a good future architecture for
a question-answering machine doesn’t only use the model itself, but also a
large document store, such as the Web, a knowledge base, such as Wikidata,
and a repository of functions, such as Wikifunctions. Each of these
drastically expands the kinds of questions the system will be able to
answer with high accuracy and confidence.