Thanks, Rob, for starting the conversation about this.
I have laid out our questions about how to run updates in a separate mail titled "Running periodic updates on a large number of wikis", because I feel this is a more general issue and I'd like to decouple it a bit from the Wikidata specifics.
I'll try to reply and clarify some other points below.
On 03.01.2013 23:57, Rob Lanphier wrote:
> The thing that isn't covered here is how it works today, which I'll try to quickly sum up. Basically, it's a single cron job, running on hume[1].
> [..]
> When a change is made on wikidata.org with the intent of updating an arbitrary wiki (say, Hungarian Wikipedia), one has to wait for this single job to get around to running the update on whatever wikis are in line prior to Hungarian WP before it gets around to updating that wiki, which could be hundreds of wikis. That isn't *such* a big deal, because the alternative is to purge the page, which will also work.
Worse: currently, we would need one cron job for each wiki that needs to be updated. I have explained this some more in the "Running periodic updates" mail.
> Another problem is that this is running on a specific, named machine. This will likely get to be a big enough job that one machine won't be enough, and we'll need to scale this up.
My concern is not so much scalability (the updater will just be a dispatcher, shoveling notifications from one wiki's database to another) but the lack of redundancy. We can't simply configure the same cron job on another machine in case the first one crashes. That would lead to conflicts and duplicates. See the "Running periodic updates" mail for more.
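To make the conflict concrete: if the same cron job were simply configured on two machines, both could pick up the same pending changes and deliver them twice. Any redundant setup needs some form of mutual exclusion. Here is a minimal sketch of one way to do that, using a MySQL advisory lock (Python for illustration only; the connection handling and the deliver_pending_changes helper are hypothetical, not our actual code):

    import pymysql  # assumption: dispatcher state lives in a shared MySQL database

    def dispatch_with_lock(conn, wiki_id):
        """Deliver pending changes for one wiki, but only if no other
        dispatcher instance is currently working on that wiki."""
        lock_name = "dispatch-" + wiki_id
        with conn.cursor() as cur:
            # MySQL advisory lock: returns 1 on success, 0 if another
            # process already holds the lock (timeout 0 = don't wait).
            cur.execute("SELECT GET_LOCK(%s, 0)", (lock_name,))
            (got_lock,) = cur.fetchone()
            if not got_lock:
                return  # another dispatcher has this wiki; skip it
            try:
                deliver_pending_changes(wiki_id)  # hypothetical helper
            finally:
                cur.execute("SELECT RELEASE_LOCK(%s)", (lock_name,))

With something like this, a second machine could run the same cron job and the two instances would simply skip past each other instead of duplicating work.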
> The problem is that we don't have a good plan for a permanent solution nailed down. It feels like we should make this work with the job queue, but the worry is that once Wikidata clients are on every single wiki, we're going to basically generate hundreds of jobs (one per wiki) for every change made on the central wikidata.org wiki.
The idea is for the dispatcher jobs to look at all the updates on wikidata that have not yet been handed to the target wiki, batch them up, wrap them in a Job, and post that to the target wiki's job queue. When the job is executed on the target wiki, the notifications can be further filtered, combined, and batched using local knowledge. Based on this, the required purging is performed on the client wiki, rerender/link-update jobs are scheduled, etc. A rough sketch of that flow is below.
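Python pseudocode, for illustration only; all the names here (fetch_changes_since, ChangeNotificationJob, and so on) are made up, and the real implementation would of course be a MediaWiki job in PHP:

    def run_dispatcher_pass(wikidata_db, target_wiki):
        """One dispatcher pass for a single client wiki: batch up the
        undelivered wikidata changes and hand them to that wiki's job queue."""
        # Position up to which this wiki has already been updated.
        last_seen = get_dispatch_position(wikidata_db, target_wiki)

        # All changes on wikidata.org that this wiki has not seen yet.
        pending = fetch_changes_since(wikidata_db, last_seen)
        if not pending:
            return

        # One job for the whole batch, not one job per change, posted to
        # the *target* wiki's queue. Filtering by local knowledge (which
        # pages actually use the changed items) happens later, when the
        # job runs on the target wiki.
        push_to_job_queue(target_wiki, ChangeNotificationJob(changes=pending))

        # Remember how far we got, so the next pass continues from here.
        set_dispatch_position(wikidata_db, target_wiki, pending[-1].change_id)

The point is that each dispatcher pass produces a single batched job per client wiki instead of one job per change, which is what should keep us from generating hundreds of jobs for every edit made on wikidata.org.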
However, the question of where, when and how to run the dispatcher process itself is still open, which is what I hope to change with the "Running periodic updates" mail.
-- daniel