Why does MW 1.5 still use HistoryBlobStubs? Wouldn't it be better to add another id (or hash) column to the revision table that references a specific text in a HistoryBlob, and move the flags column there, too? This way one would also get rid of the $mDefaultHash variable.
elwp@gmx.de wrote:
Why does MW 1.5 still use HistoryBlobStubs?
Because the code was already there.
Wouldn't it be better to add another id (or hash) column to the revision table that references a specific text in a HistoryBlob, and move the flags column there, too? This way one would also get rid of the $mDefaultHash variable.
Yes, it would be better; we know this. Maybe for 1.6. The nice thing about the revision table is that it's small enough to be altered quickly. I think you can expect to see a VARCHAR(255) field containing something similar to a URI.
Speaking of which, are we going to have a report about external storage in 1.4 from JeLuF or Domas?
-- Tim Starling
Tim Starling:
The nice thing about the revision table is that it's small enough to be altered quickly. I think you can expect to see a VARCHAR(255) field containing something similar to a URI.
Is such a field really needed? The software could assemble a URI from existing information like http://$server/$project/$rev_id. This would also be more flexible because if data needs to be transferred to another server there is no need to update all rows in the revision table.
But I don't want to give unwanted advice here. The background of my question is that I have written a Perl program that compresses page histories much better than the currently used algorithm. And now I want to write PHP code so that MediaWiki can access the data. But HistoryBlobStubs make this more complicated.
This is how my method works: All revision texts are split into sections (the delimiter is "\n=="). Unchanged sections are stored only once. Sections are sorted by their headings. Then everything is compressed with deflate().
This has several advantages:

* Less memory is used because many sections don't change from revision to revision. This is especially true for discussion pages.
* Compression and decompression are fast because deflate() is used and it needs to (de)compress less data.
* If the pages are larger than 32 kB, compression is much better than can be achieved by simply concatenating revisions and compressing them with deflate() or even bzip2.

On average, my method compresses about five times better than the currently used one. I have listed some values on http://meta.wikipedia.org/wiki/User:El/History_compression
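For illustration, a rough PHP sketch of the compression step could look like this. compressHistory() is just a placeholder name, and the header is simplified to a serialized PHP array; the real blob layout is different.

// A rough sketch of the compression step, for illustration only.
// compressHistory() is a placeholder name and the header is simplified
// to a serialized PHP array; the real blob format differs.
function compressHistory( array $revisionTexts ) {
    $sections = array();          // unique section text => temporary id
    $revisionSections = array();  // one list of section ids per revision

    foreach ( $revisionTexts as $text ) {
        // split into sections, keeping the "\n==" delimiter with each section
        $parts = preg_split( '/(?=\n==)/', $text );
        $ids = array();
        foreach ( $parts as $part ) {
            if ( !isset( $sections[$part] ) ) {
                // unchanged sections are stored only once
                $sections[$part] = count( $sections );
            }
            $ids[] = $sections[$part];
        }
        $revisionSections[] = $ids;
    }

    // sort the unique sections (effectively by their headings, since each
    // starts with one) so similar sections sit close together for deflate()
    $sectionTexts = array_keys( $sections );
    sort( $sectionTexts, SORT_STRING );
    $finalId = array_flip( $sectionTexts );      // section text => final id
    $remap = array();
    foreach ( $sections as $sectionText => $oldId ) {
        $remap[$oldId] = $finalId[$sectionText];
    }
    foreach ( $revisionSections as $r => $ids ) {
        foreach ( $ids as $i => $oldId ) {
            $revisionSections[$r][$i] = $remap[$oldId];
        }
    }

    // header (which sections make up which revision) plus the section
    // texts themselves, all compressed with deflate
    return gzdeflate( serialize( array( $revisionSections, $sectionTexts ) ) );
}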
Now that I've read that the revision texts are about to be relocated to external storage servers, I wonder whether it would be worthwhile to write the PHP classes at all. What software is going to run on the storage servers? Is my method still competitive there?
elwp@gmx.de wrote:
Tim Starling:
The nice thing about the revision table is that it's small enough to be altered quickly. I think you can expect to see a VARCHAR(255) field containing something similar to a URI.
Is such a field really needed?
Yes, unless you store that information somewhere else.
The software could assemble a URI from existing information like http://$server/$project/$rev_id.
Now you're back to square one, with the requirement to store the actual location somewhere else and do two data fetches. Remember that revision records are independent of backend storage (multiple revisions may reference the same text, for metadata-only updates or reversions, and some text records may not be hooked to a currently active revision, for deletions). Backend storage may be in multiple places or different formats (compression, in-progress upgrades from previous storage systems, etc.).
This would also be more flexible because if data needs to be transferred to another server there is no need to update all rows in the revision table.
You've got to update it *somewhere*.
-- brion vibber (brion @ pobox.com)
On 27/05/05, Brion Vibber brion@pobox.com wrote:
This would also be more flexible because if data needs to be transferred to another server there is no need to update all rows in the revision table.
You've got to update it *somewhere*.
Updating $serverurl or similar in a PHP file is easier than using str_replace or similar for every entry of your revision table to change the URL from the old to the new one.
Tomer Chachamu wrote:
On 27/05/05, Brion Vibber brion@pobox.com wrote:
This would also be more flexible because if data needs to be transferred to another server there is no need to update all rows in the revision table.
You've got to update it *somewhere*.
Updating $serverurl or similar in a PHP file is easier than using str_replace or similar for every entry of your revision table to change the URL from the old to the new one.
The opposite is actually true.
If you have a global setting that controls where you look, you have to migrate everything in one chunk. That means a lot of downtime (copying... copying... copying...), which helps nobody and gives you NO flexibility in partitioning or migrating data smoothly.
-- brion vibber (brion @ pobox.com)
On Fri, May 27, 2005 at 11:38:27PM +0100, Tomer Chachamu wrote:
On 27/05/05, Brion Vibber brion@pobox.com wrote:
This would also be more flexible because if data needs to be transferred to another server there is no need to update all rows in the revision table.
You've got to update it *somewhere*.
Updating $serverurl or similar in a PHP file is easier than using str_replace or similar for every entry of your revision table to change the URL from the old to the new one.
The current implementation uses logical names within the database and a logical name -> physical server mapping in LocalSettings.php.
The URL for an external revision is DB://cluster1/765765
In LocalSettings.php, the name cluster1 is mapped to a farm of three mysql servers:
$wgExternalServers = array(
    'cluster1' => array(
        array( 'host' => 'srv30', 'load' => 1 ) + $templateServer,
        array( 'host' => 'srv28', 'load' => 1 ) + $templateServer,
        array( 'host' => 'srv29', 'load' => 1 ) + $templateServer,
    )
);
This provides both redundancy and flexibility.
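For illustration, a minimal sketch of resolving such a URL against that mapping could look like the following. resolveExternalUrl() and the weighted server pick are only assumptions for this example, not the actual ExternalStore implementation.

// Illustrative only: resolve a DB://cluster/id URL against $wgExternalServers.
// resolveExternalUrl() is a made-up name, not the actual ExternalStore code.
function resolveExternalUrl( $url, array $externalServers ) {
    if ( !preg_match( '!^DB://([^/]+)/(\d+)$!', $url, $m ) ) {
        return false;                       // not an external-storage URL
    }
    list( , $cluster, $id ) = $m;
    if ( !isset( $externalServers[$cluster] ) ) {
        return false;                       // unknown cluster name
    }
    // pick one server from the farm, weighted by its 'load' value
    $total = 0;
    foreach ( $externalServers[$cluster] as $server ) {
        $total += $server['load'];
    }
    $pick = mt_rand( 1, $total );
    foreach ( $externalServers[$cluster] as $server ) {
        $pick -= $server['load'];
        if ( $pick <= 0 ) {
            return array( 'host' => $server['host'], 'id' => (int)$id );
        }
    }
    return false;
}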
Regards, JeLuF
elwp@gmx.de wrote:
The background of my question is that I have written a Perl program that compresses page histories much better than the currently used algorithm. And now I want to write PHP code so that MediaWiki can access the data. But HistoryBlobStubs make this more complicated.
This is how my method works: All revision texts are split into sections (the delimiter is "\n=="). Unchanged sections are stored only once. Sections are sorted by their headings. Then everything is compressed with deflate().
Two questions spring to mind:
Firstly, when you say "unchanged sections are stored only once", does this apply even if someone changes a section and someone else reverts it, or if someone copies a section to another page? Maybe all the pages should be split into sections, and all the sections stored individually?
Secondly, how great will the dependence between a revision and the previous revision be? In other words, how many (compressed) revisions will have to be retrieved in order to reconstruct the (uncompressed) text of just one revision?
Timwi
Timwi:
Two questions spring to mind:
Firstly, when you say "unchanged sections are stored only once", does this apply even if someone changes a section and someone else reverts it,
Yes, if both revision texts reside in the same history blob. Up to 20 consecutive revisions are stored in one blob.
or if someone copies a section to another page?
No.
Maybe all the pages should be split into sections, and all the sections stored individually?
I doubt that this would improve the compression much, because texts aren't copied that often.
Secondly, how great will the dependence between a revision and the previous revision be? In other words, how many (compressed) revisions will have to be retrieved in order to reconstruct the (uncompressed) text of just one revision?
The complete history blob must be decompressed of course. But no previous revisions need to be reconstructed. At the beginning of the uncompressed history blob there is a section index for each revision followed by a list of (position, length)-pairs for each section. So when a revision text is to be extracted, this is what happens:

* uncompress the history blob
* look up the section list for the requested revision
* look up the section offsets and lengths
* concatenate the sections
This is an example header (first 20 revisions of the German article "Stern"):
00000020 00000025 00000142 00000260 00000001   # 20 revisions, 25 different sections
0            # first revision has no heading: only one section
1 2 3
4 5 6
4 5 6        # conversion script: nothing changed
7 5 8
7 9 8
10 9 8
7 9 8
7 11 8
12 11 8
12 13 8
12 14 8
15 14 8
16 14 8
17 14 8
18 14 8
19 14 8
19 20 8
21 22 23
21 24 23
0 2579       # offset and length of the first section
2579 1176
...
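For illustration, the extraction step could look roughly like this in PHP. The function name and the two already-parsed header arrays are placeholders, not the actual on-disk format.

// Rough sketch of extracting one revision from an uncompressed blob.
// extractRevision() and the two header arrays are placeholders: the header
// is assumed to be already parsed into PHP arrays.
function extractRevision( $blob, array $sectionLists, array $sectionOffsets, $rev ) {
    $text = '';
    // $sectionLists[$rev]  = list of section ids for the requested revision
    // $sectionOffsets[$id] = array( offset, length ) into the uncompressed blob
    foreach ( $sectionLists[$rev] as $id ) {
        list( $offset, $length ) = $sectionOffsets[$id];
        $text .= substr( $blob, $offset, $length );
    }
    return $text;
}

// usage: $blob = gzinflate( $compressedBlob );
//        ... parse the header into $sectionLists and $sectionOffsets ...
//        $text = extractRevision( $blob, $sectionLists, $sectionOffsets, 3 );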
elwp@gmx.de wrote:
The complete history blob must be decompressed of course. But no previous revisions need to be reconstructed. At the beginning of the uncompressed history blob there is a section index for each revision followed by a list of (position, length)-pairs for each section. So when a revision text is to be extracted, this is what happens: [...]
Okay, thank you very much for this detailed explanation. I wasn't aware of exactly how the "history blobs" work, so I didn't know there are "only" 20 revisions within each blob. I am relieved to learn that these history blobs are completely independent of each other, so you always need to retrieve only one of them in order to reconstruct a particular revision of a particular article.
Of course, if an article has 100 consecutive revisions with a common section, you would still be storing that same section 5 times, but I guess that's not too bad.
Thanks again, Timwi