Hi all, First post to the list. I've got a bunch of questions, and I hope this is the right place to ask them.
I'm interested in the idea of wiki 'mirroring': updating a second wiki ('B') periodically with content from wiki A. (There's of course some discussion of this on the web, so I'm aware that there's been quite a bit of thinking on this already, but I couldn't quite find the solution I was looking for.)
A first stab at mirroring would be to do a Special:Export on the whole of A, and then do a Special:Import on B. But this becomes impractical for larger wikis: Ideally, I just want to update what needs updating.
The best way to do this would probably be something like list=recentchanges (going back to the date of last transfer). Of course this doesn't work, because recentchanges are periodically purged, so cannot be used between arbitrary dates. The log doesn't seem to record edits (is this correct?), so this can't be used to get a list of changes between two arbitrary dates.
So, question 1: Is it possible to get a list of all changes (including edits) between two dates (in a single query)?
If one wanted the complete version history, then another way to do this would be to get all revisions since the last transfer made, i.e. something like: action=query&prop=revisions&revids=1450|1451|1452|...&rvprop=content (then transform xml to Special:Import format, and upload). Together with a query of the log, this would give you all changes.
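A minimal sketch of the revids approach described above, in Python. The endpoint URL and the batch size of 50 are assumptions made for illustration (check the limits of the wiki you query); only the action=query&prop=revisions&revids=...&rvprop=content parameters come from the message itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for wiki A; replace with the real api.php URL.
API_URL = "https://wiki-a.example.org/w/api.php"
# Assumed cap on the number of revids per request.
BATCH_SIZE = 50


def revision_batches(first_revid, last_revid, batch_size=BATCH_SIZE):
    """Yield lists of consecutive revision IDs, at most batch_size per list."""
    for start in range(first_revid, last_revid + 1, batch_size):
        yield list(range(start, min(start + batch_size, last_revid + 1)))


def revisions_url(revids):
    """Build one prop=revisions request for a batch of revision IDs."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in revids),
        "rvprop": "ids|timestamp|content",
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)
```

Each URL returned by revisions_url would then be fetched, and the XML transformed into the Special:Import format as described above.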
But suppose the wiki is very active or you don't have much bandwidth or you simply don't want the whole version history, but just the latest versions (since the last transfer). The only way I can see is to do something like this:
1. Fetch the list of namespaces
2. Get the list of revisions in each namespace (action=query&prop=revisions&generator=allpages for each namespace)
3. See what needs updating, and then fetch all the changed pages.
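The three steps above could be sketched roughly as follows. This is only an illustration: the endpoint URL is hypothetical, the mirror's state is assumed to be a simple mapping from page title to latest revision ID, and the actual HTTP fetching and paging are left out.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for wiki A; replace with the real api.php URL.
API_URL = "https://wiki-a.example.org/w/api.php"


def namespaces_url():
    """Step 1: request the list of namespaces via meta=siteinfo."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def allpages_revisions_url(namespace):
    """Step 2: request the top revision of pages in one namespace."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": str(namespace),
        "prop": "revisions",
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def pages_needing_update(remote, local):
    """Step 3: titles whose latest revid on wiki A differs from the mirror.

    Both arguments map page title -> latest revision ID (an assumed
    representation of the mirror state, not something the API dictates).
    """
    return sorted(t for t, revid in remote.items() if local.get(t) != revid)
```

For example, pages_needing_update({"A": 5, "B": 7}, {"A": 5, "B": 6}) returns ["B"], so only page B would be fetched again.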
Question 2: Can you see a better way of doing this? Also, why won't generator=allpages work across namespaces? (I guess there may be a reason why that isn't possible to do easily.)
One way would be to try something like:
action=query&prop=revisions&generator=allpages&rvstart=20090521000000 but this doesn't work.
So, my question 3: Do you know why this doesn't work? I assume there isn't an efficient mysql query to accomplish this, or are there other reasons?
Finally, I guess I am wondering whether there are people actively interested in discussing issues around wiki mirroring/synchronisation more. If so, what's the best mailing list for this?
Sorry, the post got a bit longer than I expected - thanks for considering this!
All the best, Bjoern
2009/5/31 sl contrib sl.contrib@googlemail.com:
Hi all, First post to the list. I've got a bunch of questions, and I hope this is the right place to ask them. I'm interested in the idea of wiki 'mirroring': updating a second wiki ('B') periodically with content from wiki A. (There's of course some discussion of this on the web, so I'm aware that there's been quite a bit of thinking on this already, but I couldn't quite find the solution I was looking for.) A first stab at mirroring would be to do a Special:Export on the whole of A, and then do a Special:Import on B. But this becomes impractical for larger wikis: Ideally, I just want to update what needs updating. The best way to do this would probably be something like list=recentchanges (going back to the date of last transfer). Of course this doesn't work, because recentchanges are periodically purged, so cannot be used between arbitrary dates.
If you have control over wiki A, you can set $wgRCMaxAge to a higher value. You could also do the updates more often so there's never more than $wgRCMaxAge between them.
The log doesn't seem to record edits (is this correct?), so this can't be used to get a list of changes between two arbitrary dates.
Correct.
So, question 1: Is it possible to get a list of all changes (including edits) between two dates (in a single query)?
Only list=recentchanges, but you already knew that.
If one wanted the complete version history, then another way to do this would be to get all revisions since the last transfer made, i.e. something like: action=query&prop=revisions&revids=1450|1451|1452|...&rvprop=content (then transform xml to Special:Import format, and upload). Together with a query of the log, this would give you all changes. But suppose the wiki is very active or you don't have much bandwidth or you simply don't want the whole version history, but just the latest versions (since the last transfer). The only way I can see is to do something like this:
1. Fetch the list of namespaces
2. Get the list of revisions in each namespace (action=query&prop=revisions&generator=allpages for each namespace)
3. See what needs updating, and then fetch all the changed pages.
Question 2: Can you see a better way of doing this? Also, why won't generator=allpages work across namespaces? (I guess there may be a reason why that isn't possible to do easily.)
Because other parameters like apprefix don't work cross-namespace. Requests to make list=allpages work cross-namespace have been made in the past and denied because the benefits of the slight increase in convenience (there are few namespaces anyway) don't outweigh the complexity of preventing certain parameters from being used cross-namespace.
One way would be to try something like:
action=query&prop=revisions&generator=allpages&rvstart=20090521000000
but this doesn't work. So, my question 3: Do you know why this doesn't work?
This'll probably result in an error, since rvstart can't be used in multi-page mode.
I assume there isn't an efficient mysql query to accomplish this, or are there other reasons? Finally, I guess I am wondering whether there are people actively interested in discussing issues around wiki mirroring/synchronisation more. If so, what's the best mailing list for this? Sorry, the post got a bit longer than I expected - thanks for considering this! All the best, Bjoern
I think your best bet is to use list=recentchanges and update frequently.
Roan Kattouw (Catrope)
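As an illustration of the list=recentchanges approach recommended above, here is a minimal Python sketch. The endpoint URL and the shape of the sample response handled by changed_titles are assumptions; the request only covers changes that are still within the source wiki's $wgRCMaxAge window.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for wiki A; replace with the real api.php URL.
API_URL = "https://wiki-a.example.org/w/api.php"


def recentchanges_url(last_sync, now, limit=100):
    """Build a list=recentchanges request covering last_sync..now.

    Timestamps are 14-digit strings such as 20090521000000. With
    rcdir=newer, rcstart is the older bound and rcend the newer one.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": last_sync,
        "rcend": now,
        "rcdir": "newer",
        "rcprop": "title|ids|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def changed_titles(response):
    """Return the distinct page titles in a decoded JSON response."""
    entries = response.get("query", {}).get("recentchanges", [])
    return sorted({entry["title"] for entry in entries})
```

A mirror would call recentchanges_url with the timestamp of its last transfer, follow the API's continuation until the list is exhausted, and then re-fetch each title returned by changed_titles.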
Hi Roan, thanks for the reply!
because recentchanges are periodically purged, so cannot be used between arbitrary dates.
If you have control over wiki A, you can set $wgRCMaxAge to a higher value. You could also do the updates more often so there's never more than $wgRCMaxAge between them.
...
I think your best bet is to use list=recentchanges and update frequently.
Sure. However, for the cases I am thinking of, this isn't always under my control. E.g. I might not be able to convince Wikipedia to keep much longer logs. (Or would keeping recent changes for, say, several months be feasible?)
Question 2: Can you see a better way of doing this? Also, why won't generator=allpages work across namespaces? (I guess there may be a reason why that isn't possible to do easily.)
Because other parameters like apprefix don't work cross-namespace. Requests to make list=allpages work cross-namespace have been made in the past and denied because the benefits of the slight increase in convenience (there are few namespaces anyway) don't outweigh the complexity of preventing certain parameters from being used cross-namespace.
I guess there isn't a simple way to allow something like apnamespace=0|1|2|3, for the same reason?
action=query&prop=revisions&generator=allpages&rvstart=20090521000000
but this doesn't work. So, my question 3: Do you know why this doesn't work?
This'll probably result in an error, since rvstart can't be used in multi-page mode.
Sure, it generates an error. I guess it's not implemented for the same reason as the apnamespace issue, in that it would just add a lot of complexity.
I can see that there are certain things one doesn't want to allow generally because the complexity outweighs the benefits. On the other hand, it seems strange that I can't easily get all 'events' between two dates. Would it somehow be possible to build an intermediate solution? E.g. would it be feasible to build a dedicated action=query&prop=allchanges&start=...&end=... that just solved that problem?
I guess in principle it's possible to build this, but it might be quite inefficient, seeing as maintenance/rebuildrecentchanges.php says "This takes several hours, depending on the database size and server configuration."
Thanks, Bjoern
2009/5/31 sl contrib sl.contrib@googlemail.com:
On the other hand, it seems strange that I can't easily get all 'events' between two dates.
You can, with recentchanges. It has its limitations, but IMO you should be able to cope with them.
Would it somehow be possible to build an intermediate solution? E.g. would it be feasible to build a dedicated action=query&prop=allchanges&start=...&end=... that just solved that problem?
For revisions, possibly. It wouldn't include log events, though.
I guess in principle it's possible to build this, but it might be quite inefficient, seeing as maintenance/rebuildrecentchanges.php says "This takes several hours, depending on the database size and server configuration."
...and as you might've guessed, inefficient stuff has no chance of being enabled on larger wikis.
Roan Kattouw (Catrope)
sl contrib sl.contrib@googlemail.com writes:
I guess there isn't a simple way to allow something like apnamespace=0|1|2|3, for the same reason?
Don't tell anybody, but in http://rt.cpan.org/Public/Bug/Display.html?id=46061 it says: "Wait, via my accident, I discovered how to print all namespaces in one command."
2009/5/31 jidanni@jidanni.org:
sl contrib sl.contrib@googlemail.com writes:
I guess there isn't a simple way to allow something like apnamespace=0|1|2|3, for the same reason?
Don't tell anybody, but in http://rt.cpan.org/Public/Bug/Display.html?id=46061 it says: "Wait, via my accident, I discovered how to print all namespaces in one command."
That's a bug which seems to have been fixed on trunk, since I can reproduce it on enwp but not on TranslateWiki.
Roan Kattouw (Catrope)
Hi Roan,
thanks again for the reply. Comments in line.
On Sun, May 31, 2009 at 8:05 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
2009/5/31 sl contrib sl.contrib@googlemail.com:
On the other hand, it seems strange that I can't easily get all 'events' between two dates.
You can, with recentchanges. It has its limitations, but IMO you should be able to cope with them.
While looking at this I noticed that log entries for moved pages don't contain revids:
'logaction' => 'move',
'move' => {
  'new_ns' => 0,
  'new_title' => 'Sandpit/test2'
},
'logtype' => 'move',
'revid' => 0,
'timestamp' => '2009-05-31T21:47:11Z',
'old_revid' => 0,
This seems to be inconsistent: For edits, there's an old_revid and a revid (which are recorded in the log), and when moving a page, there's also an old_revid and a revid. However, those are not recorded in the log.
Any ideas as to why that is, and if it doesn't make sense, which bug tracker should it go on?
Would it somehow be possible to build an intermediate solution? E.g. would it be feasible to build a dedicated action=query&prop=allchanges&start=...&end=... that just solved that problem?
For revisions, possibly. It wouldn't include log events, though.
To be able to query: (a) all pages that changed between two dates (with the latest revision of that page) and (b) all revisions that were made between two dates would be useful, with similar options to prop=revisions and in particular rvprop (and going across all namespaces).
Merging this with log information would not be essential, as most things would be visible from the revisions themselves. Only the deletion log would have to be taken into account, but this could be done in a second query.
Would that be feasible? Something like that would make a mirroring process very easy, as you could just feed in the date of your last update, and get the pages back that you need.
All the best, Bjoern
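Until such a query exists, case (a) above can be approximated on the client side by collapsing a list of changes down to the newest revision per page. A minimal sketch, assuming each change is a dict with "title" and "revid" keys (as in the recentchanges output) and that entries with revid 0, such as the move log entry shown earlier, should be skipped:

```python
def latest_revision_per_page(changes):
    """Map each page title to the highest revision ID seen for it.

    Entries with revid 0 (for example log entries that carry no
    revision) are ignored, since they do not identify page content.
    """
    latest = {}
    for change in changes:
        revid = change.get("revid", 0)
        if revid == 0:
            continue
        title = change["title"]
        if revid > latest.get(title, 0):
            latest[title] = revid
    return latest
```

Deletions are not visible from revisions alone, so the deletion log would still have to be checked in a second query, as noted above.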
Hi Roan,
Would it somehow be possible to build an intermediate solution? E.g. would it be feasible to build a dedicated action=query&prop=allchanges&start=...&end=... that just solved that problem?
For revisions, possibly. It wouldn't include log events, though.
I've had a go at modifying the code for allpages.
Basically if this is made conditional:
    $this->addWhereFld('page_namespace', $params['namespace']);
then all pages can be searched (irrespective of namespace). Has this got a massive impact on efficiency? The maximum number of entries returned is limited anyway, and it shouldn't really matter which namespace they come from. (Of course some things like apfrom no longer work as expected, but for my use case, it would be ok for them to be disabled.)
You then introduce new parameters: startid, endid, start, end (for start/end of revid, or start/end of last touched), and amend the query:
    if (isset($params['start'])) {
        $this->addWhere('page_touched>=' . $params['start']);
    }
Finally you need something like:
    $this->addOption('ORDER BY', 'page_touched');
and
    $this->setContinueEnumParameter('start', $this->keyToTitle($row->page_latest));
With those changes (and a few conditionals) 'allpages' can produce a list of pages that were touched between two dates, or a set of pages that have new revisions between two revision numbers. Not sure yet whether last touched will work as well as the revision timestamp, but at least from the revision number you could easily update an offline set of wiki pages.
Do you think this looks good so far? Should I post the code somewhere so that people can have a look?
Cheers, Bjoern
2009/6/3 sl contrib sl.contrib@googlemail.com:
Hi Roan,
Would it somehow be possible to build an intermediate solution? E.g. would it be feasible to build a dedicated action=query&prop=allchanges&start=...&end=... that just solved that problem?
For revisions, possibly. It wouldn't include log events, though.
I've had a go at modifying the code for allpages. Basically if this is made conditional: $this->addWhereFld('page_namespace', $params['namespace']);
then all pages can be searched (irrespective of namespace). Has this got a massive impact on efficiency?
Yes, for queries with certain oft-used parameters, this'll harm efficiency a lot.
The maximum number of entries returned is limited anyway, and it shouldn't really matter which namespace they come from. (Of course some things like apfrom no longer work as expected, but for my usecase, it would be ok to be disabled.)
Not only do they no longer work as expected, they also cause inefficiency.
You then introduce new parameters: startid, endid, start, end (for start/end of revid, or start/end of last touched), and amend the query: if (isset ($params['start'])) { $this->addWhere('page_touched>=' . $params['start']); }
Finally you need something like: $this->addOption('ORDER BY', 'page_touched'); and $this->setContinueEnumParameter('start', $this->keyToTitle($row->page_latest));
Since there's no index on page_latest, sorting and paging on it the way you do is inefficient. Especially the ORDER BY page_latest part causes a filesort of the entire page table, which has over 10 million entries on English Wikipedia.
With those changes (and a few conditionals) 'allpages' can produce a list of pages that were touched between two dates, or a set of pages that have new revisions between two revision numbers. Not sure yet whether last touched will work as well as the revision timestamp, but at least from the revision number you could easily update an offline set of wiki pages. Do you think this looks good so far? Should I post the code somewhere so that people can have a look?
This'll probably work (albeit breaking a few things such as apfrom, as you mentioned), but due to the inefficient queries involved, it won't make it into the MediaWiki core.
Roan Kattouw (Catrope)
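The cost of sorting on an unindexed column can be illustrated with a small, self-contained Python sketch. Note the assumptions: it uses SQLite from the standard library rather than MySQL, and a toy page table with only a few columns, so it is an analogy for the filesort described above rather than a reproduction of MediaWiki's schema or query planner.

```python
import sqlite3


def order_by_plan(with_index):
    """Return SQLite's query plan steps for sorting pages by page_latest."""
    conn = sqlite3.connect(":memory:")
    # Toy stand-in for MediaWiki's page table (assumed, simplified columns).
    conn.execute(
        "CREATE TABLE page ("
        "page_id INTEGER PRIMARY KEY, "
        "page_namespace INTEGER, "
        "page_title TEXT, "
        "page_latest INTEGER)"
    )
    if with_index:
        conn.execute("CREATE INDEX page_latest_idx ON page (page_latest)")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT page_id FROM page ORDER BY page_latest LIMIT 10"
    ).fetchall()
    conn.close()
    # The last column of each row holds the human-readable plan step.
    return [row[-1] for row in rows]


if __name__ == "__main__":
    print(order_by_plan(with_index=False))
    print(order_by_plan(with_index=True))
```

Without the index, the plan includes a temporary B-tree step to perform the sort over the scanned rows; with the index, the rows can be read in sorted order directly. On a table with millions of rows, that extra sort is the expensive part.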
Hi Roan,
thanks for the answers!
You then introduce new parameters: startid, endid, start, end (for start/end of revid, or start/end of last touched), and amend the query: if (isset ($params['start'])) { $this->addWhere('page_touched>=' . $params['start']); }
Finally you need something like: $this->addOption('ORDER BY', 'page_touched'); and $this->setContinueEnumParameter('start', $this->keyToTitle($row->page_latest));
Since there's no index on page_latest, sorting and paging on it the way you do is inefficient. Especially the ORDER BY page_latest part causes a filesort of the entire page table, which has over 10 million entries on English Wikipedia.
I guess it would be the same if one sorted on revision id (rather than page_latest)?
Is there a proposal one could forward to make this more efficient, by somehow also indexing on revision id?
that people can have a look? This'll probably work (albeit breaking a few things such as apfrom, as you mentioned), but due to the inefficient queries involved, it won't make it into the MediaWiki core.
Again, is there something that could be done to make it more efficient?
Or perhaps one could put some less efficient code in, but with a switch to disable it on large wikis?
Just to give a little background on why I think this is important: MediaWiki is an important platform for Open Educational Resources, and when considering scenarios in developing countries, bandwidth is expensive, and so mirroring is important. Of course once you mirror, you want to stay up to date, so being able to get updates since a revision or a date is important.
Would really like to hear your ideas on what the best way forward is!
Thanks again, Bjoern
2009/6/5 sl contrib sl.contrib@googlemail.com:
I guess it would be the same if one sorted on revision id (rather than page_latest)? Is there a proposal one could forward to make this more efficient, by somehow also indexing on revision id?
that people can have a look?
This'll probably work (albeit breaking a few things such as apfrom, as you mentioned), but due to the inefficient queries involved, it won't make it into the MediaWiki core.
Again, is there something that could be done to make it more efficient? Or perhaps one could put some less efficient code in, but with a switch to disable it on large wikis? Just to give a little background on why I think this is important: MediaWiki is an important platform for Open Educational Resources, and when considering scenarios in developing countries, bandwidth is expensive, and so mirroring is important. Of course once you mirror, you want to stay up to date, so being able to get updates since a revision or a date is important. Would really like to hear your ideas on what the best way forward is!
IMO the best way forward is still to use list=recentchanges whenever possible. A list=allrevisions module, which just grabs all revisions from the revision table (paging by rev_id), could easily be written to cover revisions older than $wgRCMaxAge. For initial imports without history, generator=allpages&prop=revisions (gets the top revision of every page) should do.
Roan Kattouw (Catrope)
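The three cases in this last reply can be summarised as a small decision function. This is only a sketch of the reasoning: the strategy names are labels invented here, rc_max_age stands for whatever $wgRCMaxAge is on the source wiki, and list=allrevisions is the module proposed in the thread, not one assumed to exist.

```python
from datetime import datetime, timedelta


def choose_sync_strategy(last_sync, now, rc_max_age):
    """Pick how a mirror should fetch updates from the source wiki.

    last_sync is None for a mirror that has never been populated.
    """
    if last_sync is None:
        # No local copy yet: generator=allpages&prop=revisions per namespace.
        return "initial-import"
    if now - last_sync <= rc_max_age:
        # Changes are still inside the recentchanges window.
        return "recentchanges"
    # Too old for recentchanges: would need something like the
    # proposed list=allrevisions, or a fresh initial import.
    return "older-revisions"
```

For example, with rc_max_age=timedelta(days=30), a mirror last synchronised five days ago gets "recentchanges", while one last synchronised five months ago gets "older-revisions".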
mediawiki-api@lists.wikimedia.org