Hi!
From looking at the DB schema I cannot find an efficient way of getting the list of null revisions, or its opposite (the list of non-null revisions), with LIMIT paging (for a custom API). When I GROUP, then ORDER and LIMIT, it behaves extremely slowly. It seems I would have to use a very inefficient GROUP BY rev_text_id (MySQL also does not offer FIRST / LAST aggregate functions), and there is no index on rev_text_id by default :-( I wish there were a field like rev_minor_edit, but for detecting null revisions, such as the ones generated by XML import / export; they confuse the logic of my wiki synchronization script. However, even if I were able to persuade anyone to include these features in the schema, 1.15, which customers use, was already released some time ago anyway :-( So probably a core patch is the only efficient way to solve my problem?
Dmitriy
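For illustration, here is the kind of GROUP BY rev_text_id query I have in mind, sketched in Python against a toy SQLite stand-in for the revision table (the real tables are MySQL, and all of the sample data below is invented):

```python
import sqlite3

# Toy stand-in for the core revision table; only the columns
# relevant to the question are modelled here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,"
    " rev_text_id INTEGER, rev_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?, ?)",
    [
        (1, 10, 100, "20101201000000"),  # normal edit
        (2, 10, 101, "20101201010000"),  # normal edit
        (3, 10, 101, "20101201020000"),  # null revision: reuses text blob 101
        (4, 11, 102, "20101201030000"),  # normal edit on another page
    ],
)

# Candidate null revisions: for each text blob used more than once,
# every revision except the earliest one that used it.  Without
# FIRST / LAST aggregates this needs a GROUP BY plus a self-join,
# which is what makes it slow on a large revision table.
rows = conn.execute("""
    SELECT r.rev_id
    FROM revision r
    JOIN (SELECT rev_text_id, MIN(rev_id) AS first_rev
          FROM revision
          GROUP BY rev_text_id
          HAVING COUNT(*) > 1) dup
      ON r.rev_text_id = dup.rev_text_id AND r.rev_id > dup.first_rev
""").fetchall()
print([r[0] for r in rows])  # → [3]
```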
On Thu, Dec 2, 2010 at 6:23 PM, Dmitriy Sintsov questpc@rambler.ru wrote:
So probably the core patch is the only efficient way to solve my problem?
You can always supply a database patch with your extension to add the indices you need to core tables.
Bryan
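For example, such a patch could boil down to a single index on rev_text_id; a sketch against a toy SQLite table (the index name is invented, and a real extension would ship this as a .sql patch applied to the MySQL tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER)")

# The extension's schema patch would carry something like this:
conn.execute("CREATE INDEX rev_text_id_idx ON revision (rev_text_id)")

# With the index in place, lookups by rev_text_id no longer scan the table;
# the query plan reports an index search instead.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT rev_id FROM revision WHERE rev_text_id = 42"
).fetchall()
print(plan)
```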
* Bryan Tong Minh bryan.tongminh@gmail.com [Thu, 2 Dec 2010 19:38:47 +0100]:
On Thu, Dec 2, 2010 at 6:23 PM, Dmitriy Sintsov questpc@rambler.ru wrote:
So probably the core patch is the only efficient way to solve my problem?
You can always supply a database patch with your extension to add the indices you need to core tables.
Indices are not hard to add, that's true. However, even with indexes, the GROUP BY rev_text_id query on a large revision set is slow. I will probably have to patch Revision::newNullRevision to add a new field value there (for existing revisions it is possible to fill the new field with an UPDATE; however, there will be new null revisions).
Dmitriy
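The backfill idea can be sketched like this (the rev_null column name is hypothetical, and SQLite stands in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER)")
conn.executemany("INSERT INTO revision VALUES (?, ?)",
                 [(1, 100), (2, 101), (3, 101)])

# Hypothetical flag column; a patched Revision::newNullRevision would
# set it to 1 for every null revision created from now on.
conn.execute("ALTER TABLE revision ADD COLUMN rev_null INTEGER NOT NULL DEFAULT 0")

# One-off backfill for existing rows: mark every revision that reuses
# a text blob already used by an earlier revision.
conn.execute("""
    UPDATE revision SET rev_null = 1
    WHERE EXISTS (SELECT 1 FROM revision AS r2
                  WHERE r2.rev_text_id = revision.rev_text_id
                    AND r2.rev_id < revision.rev_id)
""")
flags = dict(conn.execute("SELECT rev_id, rev_null FROM revision"))
print(flags)  # → {1: 0, 2: 0, 3: 1}
```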
On Thu, Dec 2, 2010 at 10:43 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
Indices are not hard to add, that's true. However, even with indexes, the GROUP BY rev_text_id query on a large revision set is slow. I will probably have to patch Revision::newNullRevision to add a new field value there (for existing revisions it is possible to fill the new field with an UPDATE; however, there will be new null revisions).
What is it that your system actually needs to be able to do this for? Is there an issue with loading up the previous text items, or are you trying to optimize storage on your end by not storing text twice when it happened to use the same text blob on the origin site?
Beware that there's not anything that really distinguishes null revisions from their predecessors, other than that they come later than the previous ones. Note that it's also possible for the earlier revision to get deleted while a later revision using the same text blob still remains.
The previously referenced text blob might also have originally come in with a much older revision, not the immediately preceding one; this may be legit for certain kinds of reverts, for instance.
-- brion
* Brion Vibber brion@pobox.com [Thu, 2 Dec 2010 12:15:18 -0800]:
What is it that your system actually needs to be able to do this for? Is there an issue with loading up the previous text items, or are you trying to optimize storage on your end by not storing text twice when it happened to use the same text blob on the origin site?
I am trying to synchronize the "recent changes" of two wiki sites via XML chunks (consecutive groups of 10 revisions) created by WikiExporter. It mostly works (though I still haven't checked everything thoroughly; what happens if a revision with an earlier timestamp is imported over a revision with a later timestamp?). However, ImportReporter::reportPage also creates an extra null revision for every imported page, for "informational purposes" ("Imported by WikiSync" in my case). Unfortunately, at the next run of synchronization, such a revision becomes a difference between the sites, and synchronization reports that the sites are not equal (even though there really were no changes, except for the informational null revision).
Beware that there's not anything that really distinguishes null revisions from their predecessors, other than that they come later than the previous ones. Note that it's also possible for the earlier revision to get deleted while a later revision using the same text blob still remains.
That's really bad for me - I should probably patch deletion as well, to clear a rev_null flag field on a null revision row when its non-null rev_text_id match is deleted :-( Too many core patches, and I am not even sure that I can intercept all kinds of revision deletion - I should check that.
With GROUP BY on a large set being slow and FIRST / LAST aggregates unavailable, it would probably be easier for me simply not to call ImportReporter from my derived WikiImporter class; informational null revisions just would not be created in that case. They are nice for the end user, though; that's why I have tried to keep them.
The previously referenced text blob might also have originally come in with a much older revision, not the immediately preceding one; this may be legit for certain kinds of reverts, for instance.
Thanks for the explanation.
Dmitriy
On Thu, Dec 2, 2010 at 8:09 PM, Dmitriy Sintsov questpc@rambler.ru wrote:
* Brion Vibber brion@pobox.com [Thu, 2 Dec 2010 12:15:18 -0800]:
What is it that your system actually needs to be able to do this for? Is there an issue with loading up the previous text items, or are you trying to optimize storage on your end by not storing text twice when it happened to use the same text blob on the origin site?
I am trying to synchronize the "recent changes" of two wiki sites via XML chunks (consecutive groups of 10 revisions) created by WikiExporter. It mostly works (though I still haven't checked everything thoroughly; what happens if a revision with an earlier timestamp is imported over a revision with a later timestamp?). However, ImportReporter::reportPage also creates an extra null revision for every imported page, for "informational purposes" ("Imported by WikiSync" in my case). Unfortunately, at the next run of synchronization, such a revision becomes a difference between the sites, and synchronization reports that the sites are not equal (even though there really were no changes, except for the informational null revision).
It sounds to me like what you need to do is recognize and skip your tool's edits, not null edits generally.
If these are all created by a particular user account, for instance, that should be pretty straightforward: compare the user ID value and skip the revision.
-- brion
* Brion Vibber brion@pobox.com [Thu, 2 Dec 2010 20:45:16 -0800]:
It sounds to me like what you need to do is recognize and skip your tool's edits, not null edits generally.
If these are all created by a particular user account, for instance, that should be pretty straightforward: compare the user ID value and skip the revision.
A good idea - I'll make a mandatory account name for synchronization. That should probably work; however, is there any way to disable "interactive" edits for a particular account while still allowing it to use Import / Export and the API in general? I'll check whether denying the 'edit' action for the synchronization account would still allow it to perform "automatic" imports. Otherwise, one can only hope that the synchronization bot account will not be misused for ordinary edits (which should not be skipped from synchronization). In any case, I can at least provide such a warning on the extension page.
Dmitriy
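On the comparison side, the skip-by-account approach might look roughly like this (the account name, fingerprints, and data structures are all invented for the sketch):

```python
# Sketch of the skip-by-account idea: when comparing revision lists
# between the two sites, ignore revisions made by the dedicated
# synchronization account, so its informational null revisions never
# count as a difference.
SYNC_USER = "WikiSyncBot"  # hypothetical dedicated account

revisions_site_a = [
    {"rev_id": 1, "user": "Alice", "sha1": "aaa"},
    {"rev_id": 2, "user": "Bob",   "sha1": "bbb"},
]
revisions_site_b = [
    {"rev_id": 1, "user": "Alice",    "sha1": "aaa"},
    {"rev_id": 2, "user": "Bob",      "sha1": "bbb"},
    {"rev_id": 3, "user": SYNC_USER,  "sha1": "bbb"},  # informational null revision
]

def comparable(revs):
    """Revision fingerprints with the sync account's edits filtered out."""
    return [r["sha1"] for r in revs if r["user"] != SYNC_USER]

in_sync = comparable(revisions_site_a) == comparable(revisions_site_b)
print(in_sync)  # → True
```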
wikitech-l@lists.wikimedia.org