I propose that a checksum of the revision text be added to the MediaWiki database and that this checksum be made available via the database dumps and API calls.
This additional field would allow many computations such as revert and noop detection without having to ask the system to provide the full text of revisions. For example, if I were to build a user script to show users which revisions have been reverted, it would be beneficial to not have to ask the API for the full text of a large list of revisions. On that same note, even when I need the full text of revisions, I could determine which revisions I do not need to request by determining that their content is exactly the same as one that has already been retrieved.
It does not seem that such a field would require considerably more storage or computational power, since computing an MD5 checksum in PHP is cheap and storing 32 hex characters is negligible compared to the size of an article's text.
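To illustrate the revert and no-op detection described above, here is a minimal sketch in Python. It assumes that each revision comes with an MD5 hex digest of its text; the function names `checksum` and `find_reverts` and the `(rev_id, digest)` input format are made up for this example and are not part of MediaWiki or its API.

```python
import hashlib


def checksum(text: str) -> str:
    """Return the MD5 hex digest (32 hex characters) of a revision's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_reverts(revisions):
    """Detect reverts from checksums alone.

    `revisions` is a list of (rev_id, digest) pairs for one page in
    chronological order (a hypothetical input format). Returns a list of
    (reverting_rev_id, restored_rev_id, reverted_rev_ids) tuples. An empty
    reverted_rev_ids list means the edit left the text unchanged (a no-op).
    """
    last_seen = {}  # digest -> index of the most recent revision with that text
    results = []
    for i, (rev_id, digest) in enumerate(revisions):
        if digest in last_seen:
            j = last_seen[digest]
            # Everything strictly between the two matching revisions was undone.
            reverted = [r for r, _ in revisions[j + 1:i]]
            results.append((rev_id, revisions[j][0], reverted))
        last_seen[digest] = i
    return results


history = [
    (1, checksum("good text")),
    (2, checksum("vandalism")),
    (3, checksum("good text")),  # restores revision 1, reverting revision 2
    (4, checksum("good text")),  # same text as revision 3: a no-op
]
reverts = find_reverts(history)
```

Only the digests are compared, so no revision text needs to be fetched. Matching digests are treated as identical text here; a tool that cannot tolerate a hash collision would still want to confirm with the full text.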
Thanks, -Aaron Halfaker
Also it could be used to say "do I really need to store this revision in the 'page' or 'archive' tables, or can I just refer to an existing identical revision".
The text storage backend could quite legitimately do that on its own. I'm not quite sure why you refer to the page/archive tables: no two revisions are "identical" (they differ in rev_timestamp if nothing else). Each revision has a text_id pointing to the text of the revision in the text table, so I take it you mean that a revision entry could refer to an existing text_id if the text was demonstrably identical, rather than creating a new entry and potentially duplicating the text itself. But the text table is not the final stage in the process, or at least it doesn't have to be. MediaWiki is happy as long as throwing that text_id into the database and cranking the handle churns out the appropriate text; it doesn't care how that text is stored or retrieved. Only in the default setting is each old_text field populated with the full text.
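As a rough sketch of what "refer to an existing text_id" could look like, here is a toy in-memory store in Python. The `TextStore` class and its methods are hypothetical stand-ins for the text table, not MediaWiki code; the sketch only assumes that a checksum is computed when text is saved.

```python
import hashlib


class TextStore:
    """Toy stand-in for the text table: reuses a text_id for identical text."""

    def __init__(self):
        self._texts = {}      # text_id -> stored text
        self._by_digest = {}  # MD5 hex digest -> list of text_ids
        self._next_id = 1

    def put(self, text: str) -> int:
        """Store text and return its text_id, reusing an existing id if possible."""
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        for text_id in self._by_digest.get(digest, []):
            # Compare the actual text so a hash collision cannot merge
            # two different revisions' content.
            if self._texts[text_id] == text:
                return text_id
        text_id = self._next_id
        self._next_id += 1
        self._texts[text_id] = text
        self._by_digest.setdefault(digest, []).append(text_id)
        return text_id

    def get(self, text_id: int) -> str:
        """Return the text for a text_id."""
        return self._texts[text_id]


store = TextStore()
first = store.put("original wording")
vandal = store.put("vandalised wording")
restored = store.put("original wording")  # reuses the first text_id
```

The revision rows stay distinct (each keeps its own timestamp and metadata); only the stored text is shared.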
That said, I do agree that this should be done. We do it for images, we should do it for text, because it's useful for more than just data compression, as suggested by the OP. It could be used to make evaluation of reversions in extensions like AbuseFilter and FlaggedRevs much more effective and efficient, for instance. And it probably *could* be used to improve the compression of the fully-written text table.
--HM
Yeah, I meant the text table. You see this list is purposely not indexed in Google as per WMF policy... wait, a reference to what I was talking about is on the bottom of https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 .
2009/7/9 jidanni@jidanni.org:
Also it could be used to say "do I really need to store this revision in the 'page' or 'archive' tables, or can I just refer to an existing identical revision".
Careful - think what happens when a single revision is deleted, oversighted or suppressed.
(We will want to warn about all uses of that revision.)
- d.
How's that different from images? Currently I'm not aware that we warn about all uses of that image when taking comparable actions.
-Mike
2009/7/10 Mike.lifeguard mikelifeguard@fastmail.fm:
On Thu, 2009-07-09 at 22:31 +0100, David Gerard wrote:
2009/7/9 jidanni@jidanni.org:
Also it could be used to say "do I really need to store this revision in the 'page' or 'archive' tables, or can I just refer to an existing identical revision".
Careful - think what happens when a single revision is deleted, oversighted or suppressed. (We will want to warn about all uses of that revision.)
How's that different from images? Currently I'm not aware that we warn about all uses of that image when taking comparable actions.
Because we do it with text WAY more.
- d.
On Fri, Jul 10, 2009 at 7:31 AM, David Gerard dgerard@gmail.com wrote:
Careful - think what happens when a single revision is deleted, oversighted or suppressed.
Isn't this an argument in favour of storing the text once and linking to it? If the text contains some personal information deemed worthy of suppressing, surely you'd want to suppress all copies of it. Well, most of the time, anyway.
Steve
2009/7/13 Steve Bennett stevagewp@gmail.com:
On Fri, Jul 10, 2009 at 7:31 AM, David Gerard dgerard@gmail.com wrote:
Careful - think what happens when a single revision is deleted, oversighted or suppressed.
Isn't this an argument in favour of storing the text once and linking to it? If the text contains some personal information deemed worthy of suppressing, surely you'd want to suppress all copies of it. Well, most of the time, anyway.
Certainly. But you want to know when this is happening.
- d.
Oversight applies to revisions, not text: in both the Oversight extension and RevDelete, the text table (where this change would have an effect) is not touched. Both processes simply block users' access to the text through the corresponding revision. I don't see any reason for that to change.
--HM
wikitech-l@lists.wikimedia.org