Data inconsistency, MediaWiki + SMW, job queue and high-use templates. - MediaWiki-l

26 Apr 2022

Hi!

I am building a wiki to document and collect information on the Palme case
(https://en.wikipedia.org/wiki/Assassination_of_Olof_Palme). The case was closed two years
ago and since then a lot of documents have been released.  The police investigation is one
of the three largest in world history. The complete material is around 60 000 documents
consisting of around 1 000 000 pages, of which we have roughly 5%.
My wiki, https://wpu.nu, collects these documents, OCRs them with Google Cloud Vision,
and publishes them using the Proofread Page extension. This is done by a Python script
running on the server and accessing the wiki via the API. Some users also write
"regular pages" and help me sort through the material and proofread it. This
bit works very well for the most part.

The wiki is running on a bare metal server with an AMD Ryzen 5 3600 6-Core Processor (12
logical cores) and 64 GB of RAM. MariaDB (10.3.34) is used for the database. I have used
Elasticsearch for SMW data and fulltext search but have been switching back and forth in
my debugging efforts.

From the investigation we have almost 60 000 sections (namespace Uppslag), 22 000 chapters
(namespace Avsnitt) and 6000 documents (namespace Index).  The documents and the Index
namespace are handled by the Proofread Page extension. I have changed the PRP templates to
suit the annotation and UI needs of wpu. For instance, each Index page has a semantic
attribute pointing out which section it is attached to. Between all these pages there are
semantic links that represent relations between the sections. This can be, for instance,
the relation between a person and a specific gun, or an organisation or place.

Each namespace is rendered with its corresponding template, which in turn includes several
other templates. The templates render the UI but also contain a lot of business logic
that adds pages to categories, sets semantic data etc.

I will use an example to try to explain it better. This is an example of a section page:
https://wpu.nu/wiki/Uppslag:E13-00 which in the header shows information such as date,
document number and relations to other pages. Below that is the meta-information from the
semantic data in the corresponding Index page, followed by the pages of that document.
The metadata of the page is entered using Page Forms and rendered using the
Uppslag_visning template. I use the Uppslag template to set a few variables that are used
a lot in the Uppslag_visning template. Uppslag_visning also sets the page semantic data
and categories. A semantic query is used to find if there is a corresponding Index page,
if so a template is used to render its metadata. Another semantic query is used to get the
pages of the Index and render them using template calls in the query.

Oh, and the skin is custom. It is based on the Pivot skin but changed extensively.

I have run into a few problems which have led me to question if MediaWiki + SMW are the
right tools for the job, question my sanity and the principle of cause and effect :) It is
not a specific problem or bug as such.

Naturally, I often make changes to templates used by the 60 000 section pages. This queues
a lot of refreshLinks jobs in the job queue, initially taking a few hours to clear. I run
the jobs as a service and experimented with the options to runJobs.php to make good use
of the resources. I optimized the templates to reduce the resources needed for each job
(e.g. using proxy templates to instantiate "variables" to reduce the number of
identical function calls, saving calculated data in semantic properties etc). This helped
a little.

I noticed that a large portion of the refreshLinks jobs failed with issues locking the
Localisation Cache table. At that moment I had the runJobs.php --maxjobs parameter set
quite high, like 100-500, and --procs around 16 or 32. I lowered --maxjobs to around 5
and the problem seemed solved. CPU utilization went down, and so did iowait. The job queue
still took a very long time to clear. Looking at the MySQL process list I found that a lot
of time was spent by jobs trying to delete the Localisation Cache.
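
To make the runner settings above concrete, here is a rough sketch of how a service
wrapper could assemble one runJobs.php batch. The install path and the default values are
assumptions for illustration, not my actual setup; --type, --maxjobs and --procs are the
runJobs.php options discussed above.

```python
import shlex


def build_run_jobs_command(install_path="/var/www/mediawiki",
                           job_type="refreshLinks",
                           maxjobs=5,
                           procs=4):
    """Return the argument list for a single runJobs.php batch.

    install_path and the defaults are hypothetical; adjust them to the wiki.
    """
    if maxjobs < 1 or procs < 1:
        raise ValueError("maxjobs and procs must be positive")
    return [
        "php",
        f"{install_path}/maintenance/runJobs.php",
        "--type", job_type,          # only run this job type
        "--maxjobs", str(maxjobs),   # small batches, as described above
        "--procs", str(procs),       # number of parallel worker processes
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_run_jobs_command()))
```

A service would then call this in a loop (for example with subprocess.run) and sleep
briefly between batches.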

I switched the cache store to 'array' (and tested 'files' as well). This sped up the
queue processing a bit but caused various errors. Sometimes the localisation cache data
was read as an int(1) and sometimes the data seemed truncated. Looking at the source I
found that the cache file writes and reads were not protected by any mutex or lock. That
caused one job to read the LC file while another was writing it, so a truncated file
was read. I implemented locks and exception handling in the LC code so that jobs recover
should they read corrupted data. I also mounted the $IP/cache dir on a RAM disk. The jobs
now went through without LC errors and a bit faster, but...
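
To illustrate the general idea (this is not the PHP change I made to the LC code, just a
minimal Python sketch of the same pattern): writers take an exclusive advisory lock,
readers take a shared one, and a reader that still hits corrupted data falls back to a
default instead of failing. The function names and the JSON format are made up for the
example, and fcntl only works on Unix-like systems.

```python
import fcntl
import json
import os


def write_cache(path, data):
    """Write data as JSON while holding an exclusive lock on the file."""
    # Open without truncating: the file is only emptied once the lock is held,
    # so a reader holding the shared lock never sees a half-written file.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            handle.seek(0)
            handle.truncate()
            json.dump(data, handle)
            handle.flush()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def read_cache(path, default=None):
    """Read the cache under a shared lock; return default if missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            fcntl.flock(handle, fcntl.LOCK_SH)
            try:
                return json.load(handle)
            finally:
                fcntl.flock(handle, fcntl.LOCK_UN)
    except (FileNotFoundError, json.JSONDecodeError):
        # Recover instead of crashing the job; the caller can rebuild the cache.
        return default
```

An alternative that avoids locks altogether is to write to a temporary file in the same
directory and rename it over the target, since a rename within one filesystem is atomic
on POSIX systems.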

MediaWiki must be one of the most battle tested software systems written by man. How come
they forgot to lock files that are used in a concurrent setting? I must be doing something
wrong here? 

Should the jobs be run serially? Why is the LC cleared for each job when the cache should
be clean? Maybe there is some kind of development and deployment path that circumvents
this problem?

I could have lived with the wiki lagging an hour or so, but I also experience data
inconsistency. For instance, sometimes the query for the Index finds an index but the
query for Pages finds nothing, so the metadata is filled in on the page but
no document is shown. Sometimes when I purge a page the document is shown, and if I purge
again, it is gone. Data from other sections in the same chapter is shown wrongly, and
various sequences of refresh and purge may fix it. The entire wiki is plagued with this
type of inconsistency, making it a very unreliable source of information for its users.
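
For reference, the purges mentioned above can also be issued in bulk through the API. The
sketch below only builds the POST parameters for action=purge with forcelinkupdate; the
batch size of 50 titles is an assumption based on the usual limit for non-bot accounts,
and the helper name is made up. Sending the requests to api.php is left out.

```python
def build_purge_requests(titles, batch_size=50):
    """Split titles into batches and return one action=purge parameter dict each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    requests = []
    for start in range(0, len(titles), batch_size):
        chunk = titles[start:start + batch_size]
        requests.append({
            "action": "purge",
            "titles": "|".join(chunk),   # the API takes pipe-separated titles
            "forcelinkupdate": "1",      # also update the links tables
            "format": "json",
        })
    return requests


if __name__ == "__main__":
    for params in build_purge_requests(["Uppslag:E13-00"]):
        print(params)
```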

Any help, tips and pointers would be greatly appreciated.

A first step should probably be to get to a consistent state.

Best regards,
Simon

PS: I am probably running a refreshLinks or rebuildData when you visit the wiki, so the
info found there might vary.

__________
Versions:
Ubuntu 20.04 64bit / Linux 5.4.0-107-generic
MediaWiki	1.35.4
Semantic MediaWiki	3.2.3
PHP	7.4.3 (fpm-fcgi)
MariaDB	10.3.34-MariaDB-0ubuntu0.20.04.1-log
ICU	66.1
Lua	5.1.5
Elasticsearch	6.8.23