Hi folks,
One item that comes up pretty frequently in our regular conversations with the Wikidata folks is the question of how change propagation should work. This email is largely directed at the relevant folks in WMF's Ops and Platform Eng groups (and obviously, also the Wikidata team), but I'm erring on the side of distributing too widely rather than too narrowly. I originally asked Daniel to send this (earlier today my time, which was late in his day), but decided that even though I'm not going to be as good at describing the technical details (and I'm hoping he chimes in), I know a lot better what I was asking for, so I should just write it.
The spec is here: https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation#Dispatching_Changes
The thing that isn't covered here is how it works today, which I'll try to quickly sum up. Basically, it's a single cron job, running on hume[1]. So, that means that when a change is made on wikidata.org, one has to wait for this job to get around to running before the item. It'd be good for someone from the Wikidata team to
We've declared that Good Enough(tm) for now, where "now" is the period of time where we'll be running the Wikidata client on a small number of wikis (currently test2, soon Hungarian Wikipedia).
The problem is that we don't have a good plan for a permanent solution nailed down. It feels like we should make this work with the job queue, but the worry is that once Wikidata clients are on every single wiki, we're going to basically generate hundreds of jobs (one per wiki) for every change made on the central wikidata.org wiki.
Guidance on what a permanent solution should look like? If you'd like to wait for Daniel to clarify some of the tech details before answering, that's fine.
Rob
On Thu, Jan 3, 2013 at 2:57 PM, Rob Lanphier robla@wikimedia.org wrote:
> The thing that isn't covered here is how it works today, which I'll try to quickly sum up. Basically, it's a single cron job, running on hume[1]. So, that means that when a change is made on wikidata.org, one has to wait for this job to get around to running before the item. It'd be good for someone from the Wikidata team to
*sigh* the dangers of sending email in haste (and being someone who frequently composes email non-linearly). What I meant to say was this:
When a change is made on wikidata.org with the intent of updating an arbitrary wiki (say, Hungarian Wikipedia), one has to wait for this single job to get around to running the update on whatever wikis are in line prior to Hungarian WP before it gets around to updating that wiki, which could be hundreds of wikis. That isn't *such* a big deal, because the alternative is to purge the page, which will also work.
Another problem is that this is running on a specific, named machine. This will likely get to be a big enough job that one machine won't be enough, and we'll need to scale this up.
It would be good for Daniel or someone else from the Wikidata team to chime in to verify I'm characterizing the problem correctly.
Rob
This is a follow-up to Rob's mail "Wikidata change propagation". I feel that the question of running periodic jobs on a large number of wikis is a more generic one, and deserves a separate thread.
Here's what I think we need:
1) Only one process should be performing a given update job on a given wiki. This avoids conflicts and duplicates during updates.
2) No single server should be responsible for running updates on a given wiki. This avoids a single point of failure.
3) The number of processes running update jobs (let's call them workers) should be independent of the number of wikis to update. For better scalability, we should not need one worker per wiki.
Such a system could be used in many scenarios where a scalable periodic update mechanism is needed. For Wikidata, we need it to let the Wikipedias know when data they are using from Wikidata has been changed.
Here is what we have come up with so far for that use case:
Currently:
* there is a maintenance script that has to run for each wiki
* the script is run periodically from cron on a single box
* the script uses a pid file to make sure only one instance is running
* the script saves its last state (continuation info) in a local state file
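For illustration, the pattern above amounts to roughly this (just a sketch, not the actual maintenance script; the file paths and the processPendingChanges() helper are made up):

<?php
// Rough sketch of the current per-wiki updater: a pid file as a lock,
// a state file for continuation.

$wiki = $argv[1]; // e.g. "huwiki"
$pidFile = "/var/run/wikidata-update-$wiki.pid";
$stateFile = "/var/lib/wikidata-update-$wiki.state";

// Refuse to start if another instance appears to be running.
if ( file_exists( $pidFile ) ) {
    exit( 0 );
}
file_put_contents( $pidFile, getmypid() );

// Continue from wherever the last run stopped (e.g. a change ID).
$lastChangeId = file_exists( $stateFile ) ? (int)file_get_contents( $stateFile ) : 0;

// Stand-in for the real update logic: apply all changes newer than
// $fromId to $wiki and return the highest change ID handled.
function processPendingChanges( $wiki, $fromId ) {
    return $fromId;
}

$lastChangeId = processPendingChanges( $wiki, $lastChangeId );

file_put_contents( $stateFile, $lastChangeId );
unlink( $pidFile );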
This isn't good: It will require one process for each wiki (soon, all 280 or so Wikipedias), and one cron entry for each wiki to fire up that process.
Also, the update process for a given wiki can only be configured on a single box, creating a single point of failure. If we had a cron entry for wiki X on two boxes, both processes could end up running concurrently, because they won't see each other's pid file (and even if they did, via NFS or so, they wouldn't be able to detect whether the process with the id in the file is alive or not).
And, if the state file or pid file gets lost or inaccessible, hilarity ensues.
Soon:
* We will implement a DB-based locking/coordination mechanism that ensures that only one worker will be updating any given wiki, starting where the previous run left off. The details are described in https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation#Dispatching_Changes.
* We will still be running these jobs from cron, but we can now configure a generic "run update jobs" call on any number of servers. Each one will create one worker, which will then pick a wiki to update and lock it against other workers until it is done.
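To make the coordination part concrete, the general shape could be something like the following (only a sketch; the table and column names are made up rather than what the spec prescribes, and plain PDO stands in for the real DB layer):

<?php
// Hypothetical sketch: pick the most-lagged wiki and claim it, so that no
// other worker updates the same wiki concurrently. Schema names are made up.

function lockNextWiki( PDO $db ) {
    $db->beginTransaction();

    // Row lock: concurrent workers serialize here and will not claim a wiki
    // that another worker has just claimed.
    $row = $db->query(
        "SELECT wiki_id, last_change_id
           FROM change_dispatch
          WHERE lock_expiry IS NULL OR lock_expiry < NOW()
          ORDER BY last_change_id ASC
          LIMIT 1
            FOR UPDATE"
    )->fetch( PDO::FETCH_ASSOC );

    if ( !$row ) {
        $db->rollBack();
        return null; // nothing to do, or everything is already claimed
    }

    // Claim the wiki for a limited time, so a crashed worker cannot hold
    // the lock forever.
    $stmt = $db->prepare(
        "UPDATE change_dispatch
            SET lock_expiry = NOW() + INTERVAL 10 MINUTE
          WHERE wiki_id = :wiki"
    );
    $stmt->execute( array( 'wiki' => $row['wiki_id'] ) );

    $db->commit();

    // The caller dispatches changes newer than last_change_id to wiki_id,
    // then advances last_change_id and clears lock_expiry.
    return $row;
}

The important property is simply that the lock lives in the shared database rather than in a pid file on one particular box, so any server can run a worker.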
There is however no mechanism to keep worker processes from piling up if performing an update run takes longer than the time it takes for the next worker to be launched. So the frequency of the cron job has to be chosen fairly low, increasing update latency.
Note that each worker decides at runtime which wiki to update. That means it can not be a maintenance script running with the target wiki's configuration. Tasks that need wiki specific knowledge thus have to be deferred to jobs that the update worker posts to the target wiki's job queue.
Later:
* Let the workers run persistently, each running its own poll-work-sleep loop with configurable batch size and sleep time.
* Monitor the workers and re-launch them if they die.
This way, we can easily scale by tuning the expected number of workers (or the number of servers running workers). We can further adjust the update latency by tuning the batch size and sleep time for each worker.
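A persistent worker would then be little more than a loop around the locking logic above, something like this (again only a sketch; dispatchOneBatch() is a stand-in, not a real function):

<?php
// Hypothetical persistent worker: poll-work-sleep loop with configurable
// batch size and sleep time.

$batchSize = 1000; // changes handed over per pass
$sleepSecs = 10;   // idle time when there is nothing to do

// Stand-in: pick and lock a wiki, hand up to $batchSize pending changes
// to its job queue, release the lock, return the number dispatched.
function dispatchOneBatch( $batchSize ) {
    return 0;
}

while ( true ) {
    $dispatched = dispatchOneBatch( $batchSize );

    if ( $dispatched === 0 ) {
        // Nothing pending: back off so idle workers don't hammer the DB.
        sleep( $sleepSecs );
    }
}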
One way to implement this would be via puppet: puppet would be configured to ensure that a given number of update workers is running on each node. For starters, two or three boxes running one worker each, for redundancy, would be sufficient.
Is there a better way to do this? Using start-stop-daemon or something like that? A grid scheduler?
Any input would be great!
-- daniel
Thanks Rob for starting the conversation about this.
I have explained our questions about how to run updates in the mail titled "Running periodic updates on a large number of wikis", because I feel that this is a more general issue, and I'd like to decouple it a bit from the Wikidata specifics.
I'll try to reply and clarify some other points below.
On 03.01.2013 23:57, Rob Lanphier wrote:
> The thing that isn't covered here is how it works today, which I'll try to quickly sum up. Basically, it's a single cron job, running on hume[1].
> [..]
> When a change is made on wikidata.org with the intent of updating an arbitrary wiki (say, Hungarian Wikipedia), one has to wait for this single job to get around to running the update on whatever wikis are in line prior to Hungarian WP before it gets around to updating that wiki, which could be hundreds of wikis. That isn't *such* a big deal, because the alternative is to purge the page, which will also work.
Worse: currently, we would need one cron job for each wiki to update. I have explained this some more in the "Running periodic updates" mail.
> Another problem is that this is running on a specific, named machine. This will likely get to be a big enough job that one machine won't be enough, and we'll need to scale this up.
My concern is not so much scalability (the updater will just be a dispatcher, shoveling notifications from one wiki's database to another) but the lack of redundancy. We can't simply configure the same cron job on another machine in case the first one crashes. That would lead to conflicts and duplicates. See the "Running periodic updates" mail for more.
> The problem is that we don't have a good plan for a permanent solution nailed down. It feels like we should make this work with the job queue, but the worry is that once Wikidata clients are on every single wiki, we're going to basically generate hundreds of jobs (one per wiki) for every change made on the central wikidata.org wiki.
The idea is for the dispatcher jobs to look at all the updates on wikidata that have not yet been handed to the target wiki, batch them up, wrap them in a Job, and post them to the target wiki's job queue. When the job is executed on the target wiki, the notifications can be further filtered, combined and batched using local knowledge. Based on this, the required purging is performed on the client wiki, re-render/link update jobs are scheduled, etc.
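In rough pseudo-PHP, one dispatch pass for a single target wiki might look like the sketch below. This is under assumptions: the table/column names and the postJobToWiki() helper are placeholders, not the actual Wikibase schema or MediaWiki API.

<?php
// Hypothetical shape of one dispatch pass for a single target wiki.
// Names are illustrative, not the real Wikibase code or schema.

// Stand-in: enqueue a job of type $jobType with $params on $wiki's job queue.
function postJobToWiki( $wiki, $jobType, array $params ) {
}

function dispatchToWiki( PDO $repoDb, $targetWiki, $lastSeenId, $batchSize ) {
    $limit = (int)$batchSize;

    // 1. Collect the changes this wiki has not been handed yet.
    $stmt = $repoDb->prepare(
        "SELECT change_id, entity_id, change_type, change_time
           FROM changes
          WHERE change_id > :last
          ORDER BY change_id ASC
          LIMIT $limit"
    );
    $stmt->execute( array( 'last' => $lastSeenId ) );
    $changes = $stmt->fetchAll( PDO::FETCH_ASSOC );

    if ( !$changes ) {
        return $lastSeenId; // nothing to hand over
    }

    // 2. Wrap the whole batch in a single job and post it to the target
    //    wiki's job queue; filtering and combining then happens over there,
    //    with local knowledge.
    postJobToWiki( $targetWiki, 'wikidataChangeNotification', array(
        'changes' => $changes,
    ) );

    // 3. Remember how far this wiki has been served, so the next pass
    //    continues from here.
    $lastRow = $changes[ count( $changes ) - 1 ];
    return (int)$lastRow['change_id'];
}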
However, the question of where, when and how to run the dispatcher process itself is still open, which is what I hope to change with the "Running periodic updates" mail.
-- daniel