Data inconsistency, MediaWiki + SMW, job queue and high-use templates. - MediaWiki-l

26 Apr 2022


Hi!
I am building a wiki to document and collect information on the Palme case (https://en.wikipedia.org/wiki/Assassination_of_Olof_Palme). The case was closed two years ago and since then a lot of documents have been released. The police investigation is one of the three largest in world history. The complete material is around 60 000 documents consisting of around 1 000 000 pages, of which we have roughly 5%.
My wiki, https://wpu.nu, collects these documents, OCRs them with Google Cloud Vision, and publishes them using the Proofread Page extension. This is done by a Python script that runs on the server and accesses the wiki via the API. Some users also write "regular pages" and help me sort through the material and proofread it. This part works very well for the most part.
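To make the upload step a bit more concrete, here is a minimal sketch of the parameters such a script might send to the MediaWiki Action API to save one OCRed page. It is not the actual wpu.nu script: the function name, the page title and the edit summary are made up for the example, and obtaining the CSRF token and sending the HTTP request are left out.

```python
def build_edit_params(page_title, ocr_text, csrf_token):
    """Return the POST parameters for an action=edit request.

    page_title, ocr_text and csrf_token are supplied by the caller;
    the summary text is a placeholder chosen for this example.
    """
    return {
        "action": "edit",        # Action API module that saves a page
        "title": page_title,     # e.g. one page of a Proofread Page document
        "text": ocr_text,        # full new wikitext of the page
        "summary": "OCR import", # hypothetical edit summary
        "token": csrf_token,     # CSRF token fetched beforehand
        "format": "json",
    }


if __name__ == "__main__":
    # "Page:Example.pdf/1" is a hypothetical title, not a real wpu.nu page.
    params = build_edit_params("Page:Example.pdf/1", "OCR text here", "TOKEN")
    print(sorted(params))
```

The dictionary would be sent as the body of a POST request to the wiki's api.php by whatever HTTP client the script uses.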
The wiki is running on a bare metal server with an AMD Ryzen 5 3600 6-core processor (12 logical cores) and 64 GB of RAM. MariaDB (10.3.34) is used for the database. I have used Elasticsearch for SMW data and full-text search but have been switching back and forth in my debugging efforts.
From the investigation we have almost 60 000 sections (namespace Uppslag), 22 000 chapters (namespace Avsnitt) and 6000 documents (namespace Index). The documents and the Index namespace are handled by the Proofread Page extension. I have changed the PRP templates to suit the annotation and UI needs of wpu. For instance, each Index page has a semantic attribute pointing out which section it is attached to. Between all these pages there are semantic links that represent relations between the sections. This can be, for instance, the relation between a person and a specific gun, or an organisation or place.
Each namespace is rendered with its corresponding template, which in turn includes several other templates. The templates render the UI but also contain a lot of business logic that adds pages to categories, sets semantic data, etc.
I will use an example to try to explain it better. This is an example of a section page: https://wpu.nu/wiki/Uppslag:E13-00 which in the header shows information such as date, document number and relations to other pages. Below that is the meta-information from the semantic data in the corresponding Index page, followed by the pages of that document.
The metadata of the page is entered using Page Forms and rendered using the Uppslag_visning template. I use the Uppslag template to set a few variables that are used a lot in the Uppslag_visning template. Uppslag_visning also sets the page's semantic data and categories. A semantic query is used to find out whether there is a corresponding Index page; if so, a template is used to render its metadata. Another semantic query is used to get the pages of the Index and render them using template calls in the query.
Oh, and the skin is custom. It is based on the Pivot skin but has been changed extensively.
I have run into a few problems which have led me to question whether MediaWiki + SMW are the right tools for the job, and to question my sanity and the principle of cause and effect :) It is not a specific problem or bug as such.
Naturally, I often make changes to templates used by the 60 000 section pages. This queues a lot of refreshLinks jobs in the job queue, which initially took a few hours to clear. I run the jobs as a service and experimented with the options to runJobs.php to make good use of the resources. I also optimized the templates to reduce the resources needed for each job (e.g. using proxy templates to instantiate "variables" to reduce the number of identical function calls, saving calculated data in semantic properties, etc.). This helped a little.
I noticed that a large portion of the refreshLinks jobs failed with issues locking the localisation cache table. At that moment I had the runJobs.php --maxjobs parameter set quite high, around 100-500, and --procs around 16 or 32. I lowered --maxjobs to around 5 and the problem seemed solved. CPU utilization went down, and so did iowait. The job queue still took a very long time to clear. Looking at the MySQL process list, I found that a lot of time was spent by jobs trying to delete the localisation cache.
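As a rough illustration of the kind of invocation described above, here is a minimal Python sketch that assembles a runJobs.php command line. The --type, --maxjobs and --procs options belong to runJobs.php; the wiki path, the choice to restrict the run to refreshLinks jobs, and the default of 4 processes are assumptions for the example, not the settings actually used on wpu.nu.

```python
import shlex


def run_jobs_command(wiki_root, job_type="refreshLinks", maxjobs=5, procs=4):
    """Build the argument list for one runJobs.php invocation.

    wiki_root is the MediaWiki installation directory ($IP).
    The default values are illustrative only.
    """
    return [
        "php",
        f"{wiki_root}/maintenance/runJobs.php",
        f"--type={job_type}",   # only pick up jobs of this type
        f"--maxjobs={maxjobs}", # stop after this many jobs
        f"--procs={procs}",     # number of worker processes to fork
    ]


if __name__ == "__main__":
    # /var/www/wiki is a placeholder path.
    print(shlex.join(run_jobs_command("/var/www/wiki")))
```

A service wrapper would run this command in a loop, so a low --maxjobs value means each batch is short-lived and PHP restarts often.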
I switched the localisation cache store to 'array' (and tested 'files' as well). This sped up the queue processing a bit but caused various errors. Sometimes the localisation cache data was read as an int(1) and sometimes the data seemed truncated. Looking at the source, I found that the cache file writes and reads were not protected by any mutex or lock. As a result, one job could read the LC file while another was writing it, and so read a truncated file. I implemented locks in the LC code, plus exception handling so that jobs can recover should they read corrupted data. I also mounted the $IP/cache dir on a RAM disk. The jobs now went through without LC errors and a bit faster, but...
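The idea behind the locking can be sketched as follows. This is a Python illustration of the general technique, not the actual change made to MediaWiki's PHP code: the function names, the ".lock" and ".tmp" file suffixes and the JSON format are all assumptions for the example. Writers take an exclusive lock and replace the file atomically, readers take a shared lock, and a reader that still hits unreadable data gets None back instead of crashing.

```python
import fcntl
import json
import os


def write_cache(path, data):
    """Write data as JSON so that readers never see a half-written file."""
    with open(path + ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # wait until no one else holds the lock
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX file systems
    # closing the lock file releases the lock


def read_cache(path):
    """Return the cached data, or None if it is missing or corrupted."""
    with open(path + ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_SH)  # several readers may hold this at once
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return None  # caller can rebuild the cache entry
```

Note that fcntl is only available on Unix-like systems, and that flock locks are advisory: they only help if every reader and writer goes through the same functions.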
MediaWiki must be one of the most battle tested software systems written by man. How come they forgot to lock files that are used in a concurrent setting? I must be doing something wrong here?
Should the jobs be run serially? Why is the LC cleared for each job when the cache should be clean? Maybe there is some kind of development and deployment path that circumvents this problem?
I could have lived with the wiki lagging an hour or so, but I also experience data inconsistency. For instance, sometimes the query for the Index finds an index, but the query for Pages finds nothing, so the metadata is filled in on the page but no document is shown. Sometimes when I purge a page the document is shown, and if I purge again, it is gone. Data from other sections in the same chapter is shown wrongly, and various sequences of refresh and purge may fix it. The entire wiki is plagued with this type of inconsistency, making it a very unreliable source of information for its users.
Any help, tips and pointers would be greatly appreciated.
A first step should probably be to get to a consistent state.
Best regards,
Simon
PS: I am probably running a refreshLinks or rebuildData when you visit the wiki, so the info found there might vary.
__________
Versions:
Ubuntu 20.04 64bit / Linux 5.4.0-107-generic
MediaWiki	1.35.4
Semantic MediaWiki	3.2.3
PHP	7.4.3 (fpm-fcgi)
MariaDB	10.3.34-MariaDB-0ubuntu0.20.04.1-log
ICU	66.1
Lua	5.1.5
Elasticsearch	6.8.23