Tue 26 Apr 2022 at 17:23, s.lundell@gmail.com wrote:
I noticed that a large portion of the refreshLinks jobs failed with errors locking the localisation cache table. At that time I had the runJobs.php --maxjobs parameter set quite high, around 100-500, and --procs at around 16 or 32. I lowered --maxjobs to around 5. The problem seemed solved, and CPU utilization and iowait both went down. The job queue still took a very long time to clear, though. Looking at the MySQL process list, I found that jobs were spending a lot of time trying to delete the localisation cache.
I switched the localisation cache store to 'array' (and tested 'files' as well). This sped up queue processing a bit but caused various errors. Sometimes the localisation cache data was read as an int(1), and sometimes the data seemed truncated. Looking at the source, I found that the cache file reads and writes were not protected by any mutex or lock. That allowed one job to read the LC file while another was writing it, so the reader saw a truncated file. I implemented locks in the LC code, plus exception handling so that jobs can recover if they read corrupted data. I also mounted the $IP/cache dir on a RAM disk. The jobs now went through without LC errors and a bit faster, but...
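For illustration only, the kind of locking and recovery described above could be sketched like this. It is written in Python rather than MediaWiki's PHP and is not the actual patch; the function names, the ".lock" sidecar file, and the JSON format are all assumptions made for the sketch. It relies on fcntl, so it is Unix-only.

```python
import fcntl
import json
import os
import tempfile


def write_cache(path, data):
    """Write cache data under an exclusive lock, then swap it in atomically."""
    # Hypothetical sidecar lock file; "a" mode creates it without truncating.
    with open(path + ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # wait until no reader or writer holds it
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
                f.flush()
                os.fsync(f.fileno())
            # Rename is atomic on POSIX, so a reader sees the old or the new file,
            # never a partially written one.
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise
        # The lock is released when the lock file is closed.


def read_cache(path):
    """Read cache data under a shared lock; return None if missing or corrupt."""
    with open(path + ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_SH)  # many readers may hold this at once
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # Treat unreadable data as a cache miss so the caller can rebuild.
            return None
```

The shared/exclusive lock keeps readers out while a writer is active, and the temporary file plus rename covers the case where some other process ignores the lock. Returning None on corrupt data mirrors the "recover instead of failing the job" idea.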
There might be an issue with your setup which causes the localisation cache to be rebuilt constantly. Most production sites I know of disable automatic rebuilding. See the manualRecache option in https://www.mediawiki.org/wiki/Manual:$wgLocalisationCacheConf and the associated script (rebuildLocalisationCache.php) to build the cache separately. If this helps, you may want to investigate what was causing the cache to be rebuilt. Are the contents of the localisation files changing for some reason? Are their modification times changing for some reason?
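One rough way to check the modification-time question is to snapshot the mtimes of the message files, let some jobs run, and compare. The sketch below is a standalone Python helper, not part of MediaWiki; the function names are made up for this example, and which directory to scan (and that the message files end in ".json") are assumptions you would adjust for your install.

```python
import os


def snapshot_mtimes(root, suffix=".json"):
    """Map every file under root with the given suffix to its mtime in ns."""
    mtimes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(suffix):
                full = os.path.join(dirpath, name)
                mtimes[full] = os.stat(full).st_mtime_ns
    return mtimes


def diff_snapshots(before, after):
    """Report files added, removed, or touched between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "touched": sorted(
            p for p in before.keys() & after.keys() if before[p] != after[p]
        ),
    }
```

Take one snapshot before running a batch of jobs and one after; if "touched" or "added" is non-empty, something (a deploy step, a sync job, a backup tool) is modifying the files and could be what invalidates the cache.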
-Niklas