As I assume most people here know, each revision in the full history dumps for MediaWiki reports the complete page text. So even though an edit may have changed only a few characters, the entire page is reported for each revision. This is one of the reasons that full history dumps are very large.
Recently I've written some Python code to re-express the revision history in an "edit syntax", using an XML-compatible notation for changes with expressions like:
<replace>, <delete>, <insert>, etc.
Since many revisions really only consist of small changes to the text, using the notation I've been developing can greatly reduce the size of the dump while still maintaining a human-readable syntax. For example, I recently ran it against the full history dump of ruwiki (179 GB uncompressed, 1.2 M pages, 11.2 M revisions) and got a 94% reduction in size (11.1 GB). Because it is still a text-based format, it stacks well with traditional file compressors (bz2: 89% reduction - 1.24 GB; 7z: 91% reduction - 1.07 GB).
It also could be a precursor to analysis designed to work out "primary" authors and other tasks where one wants to know who is making large edits and who is making small, housekeeping edits.
Obviously, as a compressor it is most successful with large pages which have a large number of relatively minor revisions. For example, the enwiki history of [[Saturn]] (current size 57 kb, 4741 revisions) sees a 99.1% size reduction. I suspect that the size reduction on large wikis, like en or de, would be even greater than the 94% for ruwiki since larger wikis tend to have larger pages and more revisions per page.
The current version of my compressor averaged a little better than 250 revisions per second on ruwiki (about 12 hours total) on an 18-month-old desktop. However, as the CPU utilization was only 50-70% of a full processing core most of the time, I suspect that my choice to read and write from an external hard drive may have been the limiting factor. On a good machine, 400+ rev/s might be a plausible number for the current code. In short, the overhead of generating my edit syntax is relatively small compared to the generation time for the current dumps (which I'm guessing is limited by communication with the text data store).
My code has some quirks and known bugs, and I'd describe it as a late-stage alpha version at the moment. It still needs considerable work (not to mention documentation) before I would consider it to be something ready for general use.
However, I wanted to know if this is a project of interest to MediaWiki developers or other people. Placed in the dump chain, it could substantially reduce the size of the human-readable dumps (at the expense that one would need to process through a series of edits to see the full text of any specific revision). Or, utilized for different purposes, it could help figure out major vs. minor editors, etc. If this project is mostly just a curiosity for my own use, then I will probably keep the code pretty crude. However, if other people are interested in using something like this, then I am willing to put more effort into developing something that is cleaner and more generally usable.
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
-Robert Rohde
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde rarohde@gmail.com wrote:
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
AFAIR, Tim or Brion are working on a DB compressor with diffs, which has to be decompressed for dumping again, so I think your approach, in addition to the diff compression for storage, is a really cool idea. Do you have the source anywhere so people can look at it?
Marco
Marco Schuster wrote:
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde rarohde@gmail.com wrote:
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
AFAIR, Tim or Brion are working on a DB compressor with diffs, which has to be decompressed for dumping again, so I think your approach, in addition to the diff compression for storage, is a really cool idea. Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14 from 628 GB down to 30 GB. It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange. See the DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into another diff format for the dump output, as long as the operation is properly parallelized. I'm not sure how much value there is in trying to preserve the diff itself from storage to output.
The main difference between my code and the usual solutions is that I reorder the revisions in order to preserve a good compression ratio in the scenario of regular page-blanking vandalism. Otherwise, every time the page is blanked and replaced with a small message, you have to store the entire revision again, which was the dominant use of space in certain test cases.
-- Tim Starling
On Wed, Jan 7, 2009 at 8:22 PM, Tim Starling tstarling@wikimedia.org wrote:
It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange.
Why not?
Aryeh Gregor wrote:
On Wed, Jan 7, 2009 at 8:22 PM, Tim Starling tstarling@wikimedia.org wrote:
It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange.
Why not?
Because it's an undocumented binary diff format with sequence information compressed in a MediaWiki-specific way, which is packed into an array and then serialized using PHP's undocumented serialization format, then compressed with zlib using undocumented parameters, then stored in a MediaWiki-specific object, then serialized again. It's not a process that's interchange-friendly.
-- Tim Starling
On Wed, Jan 7, 2009 at 5:22 PM, Tim Starling tstarling@wikimedia.org wrote:
Marco Schuster wrote:
On Wed, Jan 7, 2009 at 10:43 PM, Robert Rohde rarohde@gmail.com wrote:
So, I'd like to know whether there are people (besides myself) who are interested in seeing the full history dumps expressed in an edit syntax rather than the full-text syntax currently used.
AFAIR, Tim or Brion are working on a DB compressor with diffs, which has to be decompressed for dumping again, so I think your approach, in addition to the diff compression for storage, is a really cool idea. Do you have the source anywhere so people can look at it?
Yes, I just finished compressing the external storage clusters 13 and 14 from 628 GB down to 30 GB. It should help improve dump speed by reducing the disk read rate, but the format isn't suitable for interchange. See the DiffHistoryBlob class in HistoryBlob.php.
It would be possible to uncompress from my format and then recompress into another diff format for the dump output, as long as the operation is properly parallelized. I'm not sure how much value there is in trying to preserve the diff itself from storage to output.
The main difference between my code and the usual solutions is that I reorder the revisions in order to preserve a good compression ratio in the scenario of regular page-blanking vandalism. Otherwise, every time the page is blanked and replaced with a small message, you have to store the entire revision again, which was the dominant use of space in certain test cases.
My approach to that case was to hash each revision and keep a record of the hashes already seen for the article. If the same hash appeared more than once, my diff syntax instructed it to "<revert>" to the previous revision. My solution doesn't address the case that someone reverts page-blanking vandalism and edits the text at the same time, but that case is quite rare in practice.
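For anyone curious what that looks like in practice, here is a minimal sketch of hash-based revert detection; the function and variable names are hypothetical and not taken from my actual code.

# Hypothetical sketch of revert detection: hash every revision and emit a
# revert marker when the exact text has been seen before, instead of
# re-encoding it.  SHA-1 is used here, but any stable hash would do.
import hashlib

def classify_revisions(revisions):
    """revisions: iterable of (rev_id, text).  Yields (rev_id, action)."""
    seen = {}  # text hash -> rev_id of the first revision with that text
    for rev_id, text in revisions:
        digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
        if digest in seen:
            yield rev_id, ('revert', seen[digest])   # e.g. a <revert> element in the dump
        else:
            seen[digest] = rev_id
            yield rev_id, ('diff-or-full', text)     # handled by the normal diff path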
For a bit more background, my current implementation breaks the article into lines (splitting on newline characters) and intelligently handles the following kinds of changes (a rough sketch of how the line-level pieces can be derived follows the list):
- line insertion
- line deletion
- line replacement (major edits to a line)
- text replacement (small edits within a line)
- line reordering (handling the case of sections being reordered improves significantly upon diff generators that ignore this case)
- article truncation
- article replacement (or so many edits that it is more compact to simply specify the new version)
- article reversion to a prior version
- appending to the article
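As a rough illustration only (this is not the actual implementation), Python's difflib can already supply the line-level building blocks for several of these operations:

# Illustrative only: derive line-level insert/delete/replace operations with
# difflib.  Reorder detection, within-line text replacement, truncation and
# whole-article replacement would need additional passes on top of this.
import difflib

def line_operations(old_text, new_text):
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'insert':
            yield ('insert', i1, new[j1:j2])       # line insertion
        elif tag == 'delete':
            yield ('delete', i1, i2)               # line deletion
        elif tag == 'replace':
            yield ('replace', i1, i2, new[j1:j2])  # line replacement
        # 'equal' blocks carry no edit and are skipped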
-Robert Rohde
Anthony was doing some similar work in October, based on RCS and reverse deltas. Results also reduced the sizes notably.
On Wed, Jan 7, 2009 at 6:25 PM, Keisial keisial@gmail.com wrote:
Anthony was doing some similar work in October, based on RCS and reverse deltas. Results also reduced the sizes notably.
http://blog.p2pedia.org/2008/10/anarchism.html
Using skip-deltas I think you could make a system fast enough to run live. At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
I never managed to create an implementation, though. In particular, I [oddly] couldn't find a standalone diff program to make the deltas in RCS format, and never got around to writing my own.
One improvement over the diff format used by RCS would be to use smarter breakpoints, since wikitext tends to have a lot of really long lines with no line breaks. Using some simple heuristics to guess at sentence breaks would probably be useful there. It wouldn't have to be perfect, since reconstruction just concatenates all the text together regardless of where it was broken apart. Of course, results would be somewhat language specific.
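A hedged sketch of that breakpoint idea, assuming a crude end-of-sentence regex (the exact heuristic is an open choice):

# Split overlong wikitext lines at likely sentence boundaries before diffing.
# The pattern is a simple guess; since reconstruction just concatenates the
# pieces, an imperfect split costs nothing in correctness.
import re

_SENTENCE_END = re.compile(r'([.!?]\s+)')

def soft_break(line, max_len=200):
    if len(line) <= max_len:
        return [line]
    parts = _SENTENCE_END.split(line)
    # Re-attach each separator to the sentence it ends, so that
    # ''.join(soft_break(line)) == line exactly.
    pieces = [''.join(parts[i:i + 2]) for i in range(0, len(parts), 2)]
    return [p for p in pieces if p]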
I like Robert's approach, though. I'd love to see the results of applying the concept of skip-deltas to it, to achieve O(lg(N)) random access reads (with N being the number of revisions to the article being accessed). Or maybe I'm misreading him and he's already accomplished that?
Anthony wrote:
Using skip-deltas I think you could make a system fast enough to run live. At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
One more possibility is to take an MD5 of every revision, then diff only between those that have unique MD5s.
One improvement over the diff format used by RCS would be to use smarter breakpoints, since wikitext tends to have a lot of really long lines with no line breaks. Using some simple heuristics to guess at sentence breaks would probably be useful there. It wouldn't have to be perfect, since
I suggest looking into wdiff ( http://www.gnu.org/software/wdiff/ ).
On Thu, Jan 8, 2009 at 7:29 AM, Nikola Smolenski smolensk@eunet.yu wrote:
Anthony wrote:
Using skip-deltas I think you could make a system fast enough to run live. At the very least it could be used as part of an incremental dump system. Using *smart* skip-deltas, you'd resolve the inefficiencies due to page-blanking vandalism.
One more possibility is to take an MD5 of every revision, then diff only between those that have unique MD5s.
My thought was that each stored revision would have two pieces of data: the diff and the rev id of the prior version on which it was based. By default the prior revision would be chosen using the standard algorithm to ensure lg(N) chains (see http://svn.collab.net/repos/svn/trunk/notes/skip-deltas), but in the case of an exact match with a prior revision you'd set the prior version to that exact match and leave the diff blank (for performance you'd want to throw in some checks to handle multiple reverts while keeping the chains as short as possible). This would probably be much easier with FSFS deltas (a la Subversion), but you could keep one (or a few) of the most recent revisions in full to handle the normal case efficiently.
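For concreteness, the base-selection rule from the Subversion notes linked above boils down to a one-liner; the exact-match shortcut described above would simply override this base with the matching revision and an empty diff. A minimal sketch:

# The skip-delta rule from the Subversion notes: store each revision as a
# delta against its revision number with the lowest set bit cleared, which
# keeps every reconstruction chain at roughly lg(N) steps.
def skip_delta_base(rev):
    return rev & (rev - 1)          # clear the lowest set bit

def delta_chain(rev):
    """Revisions touched when rebuilding `rev` from full revision 0."""
    chain = []
    while rev > 0:
        chain.append(rev)
        rev = skip_delta_base(rev)
    return chain                    # e.g. delta_chain(13) == [13, 12, 8]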
MD5s could be kept to make the search for exact duplicates fast, but the advantage of this over just doing MD5s is that it could be extended to cover near-exact matches (e.g. an optimizer could run in the background finding near-duplicates and rearranging the chains appropriately). It would also handle the case of an MD5 match on non-matching text without special-casing. That case definitely needs to be handled: while an MD5 collision is unlikely in random data, you also have to worry about intentional attacks, since MD5 is cryptographically insecure.
One improvement over the diff format used by RCS would be to use smarter breakpoints, since wikitext tends to have a lot of really long lines with no line breaks. Using some simple heuristics to guess at sentence breaks would probably be useful there. It wouldn't have to be perfect, since
I suggest looking into wdiff ( http://www.gnu.org/software/wdiff/ ).
I'll have to think about whether or not wdiff would work with delta combining, which is part of the key to ensuring fast lookup of prior revisions.
Of course, this is probably all just a pipe dream for me - barring the discovery of tools that do exactly what I need, I don't have the time and/or the money to watch this get implemented.
Subversion does most of this, and tweaking Subversion to be more wiki-aware would be another possibility, but Subversion is incredibly bloated for this purpose. Plus I'd probably have to rewrite MediaWiki. MediaWiki doesn't allow easily pluggable "alternative databases", does it? (I seem to recall the database access not being very well encapsulated in MW.)
On Wed, Jan 7, 2009 at 4:43 PM, Robert Rohde rarohde@gmail.com wrote:
reduction in size (11.1 GB). Because it is still a text based format, it stacks well with traditional file compressors (bz2: 89% reduction - 1.24 GB; 7z: 91% reduction - 1.07 GB).
Ruwiki dumps currently show: pages-meta-history.xml.7z 1.3 GB
Not really all that much of a win post 7z-ing considering the current performance numbers you mentioned. (No doubt your code could be made faster... but at the same time 7z is not the state of the art in raw compression ratio)
Not that your format wouldn't have many uses... but it doesn't appear to offer significant gains for bulk transport. (in the future it would be helpful if you cited the current compressed size when comparing new compressed sizes)
On Wed, Jan 7, 2009 at 8:31 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Wed, Jan 7, 2009 at 4:43 PM, Robert Rohde rarohde@gmail.com wrote:
reduction in size (11.1 GB). Because it is still a text based format, it stacks well with traditional file compressors (bz2: 89% reduction - 1.24 GB; 7z: 91% reduction - 1.07 GB).
Ruwiki dumps currently show: pages-meta-history.xml.7z 1.3 GB
Not really all that much of a win post 7z-ing considering the current performance numbers you mentioned. (No doubt your code could be made faster... but at the same time 7z is not the state of the art in raw compression ratio)
Not that your format wouldn't have many uses... but it doesn't appear to offer significant gains for bulk transport. (in the future it would be helpful if you cited the current compressed size when comparing new compressed sizes)
Yes, you are right about that. For bulk transport and storage it is not a big improvement.
However, to work with ruwiki, for example, one generally needs to decompress it to the full 170 GB. Working with enwiki's full revision history, if such a dump ever exists again, would probably mean decompressing ~2 TB. 7z and bz2 are not great formats if one wants to extract only portions of the dump, since there are few tools that would allow one to do so without first reinflating the whole file. Hence, one of the advantages I see in my format is being able to have a dump that is still <10% of the full inflated size while also being able to parse out selected articles or selected revisions in a straightforward manner.
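To make the "parse out selected articles" point concrete, here is a rough sketch (not something in my converter) of streaming a single page out of an uncompressed dump without loading the rest into memory:

# Rough sketch: stream through an XML dump with ElementTree.iterparse and
# keep only the one <page> we care about.  Tag names are matched with the
# export namespace stripped, since the namespace version changes between
# dump generations.
import xml.etree.ElementTree as ET

def localname(tag):
    return tag.rsplit('}', 1)[-1]

def extract_page(dump_path, wanted_title):
    for event, elem in ET.iterparse(dump_path, events=('end',)):
        if localname(elem.tag) == 'page':
            titles = [c.text for c in elem if localname(c.tag) == 'title']
            if titles and titles[0] == wanted_title:
                return ET.tostring(elem)
            elem.clear()   # discard the contents of finished pages to keep memory low
    return None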
-Robert Rohde
On 1/7/09 9:16 PM, Robert Rohde wrote:
Yes, you are right about that. For bulk transport and storage it is not a big improvement.
However, to work with ruwiki, for example, one generally needs to decompress it to the full 170 GB. Working with enwiki's full revision history, if such a dump ever exists again, would probably mean decompressing ~2 TB. 7z and bz2 are not great formats if one wants to extract only portions of the dump, since there are few tools that would allow one to do so without first reinflating the whole file. Hence, one of the advantages I see in my format is being able to have a dump that is still <10% of the full inflated size while also being able to parse out selected articles or selected revisions in a straightforward manner.
*nod* this is an attractive option for that sort of case; you can pull the metadata and grep for interesting changes in a much more manageable file.
Note we'll also want to be considering different options for breaking up the dump into smaller files, which makes parallel generation more feasible as well as assisting some downloaders.
-- brion
Robert Rohde wrote:
However, to work with ruwiki, for example, one generally needs to decompress it to the full 170 GB. Working with enwiki's full revision history, if such a dump ever exists again, would probably mean decompressing ~2 TB. 7z and bz2 are not great formats if one wants to extract only portions of the dump, since there are few tools that would allow one to do so without first reinflating the whole file. Hence, one of the advantages I see in my format is being able to have a dump that is still <10% of the full inflated size while also being able to parse out selected articles or selected revisions in a straightforward manner.
-Robert Rohde
bzipping the pages by blocks, as I did for my offline reader, produces a file size similar to the original.* There may be ways to get similar results without having to rebuild the revisions. Also note that in both cases you still need an intermediate app to provide input dumps for those tools.
*112% measuring enwiki-20081008-pages-meta-current. Looking at ruwiki-20081228-history, both the original bz2 and my faster-access one are 8.2G.
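For anyone wondering what "bzipping the pages by blocks" means in practice, a minimal sketch (the block size and index format here are arbitrary choices, not a description of my reader):

# Compress pages in fixed-size groups and remember where each group starts,
# so a reader can seek to one block and decompress only that block.
import bz2

def write_blocks(pages, out_path, pages_per_block=100):
    """pages: iterable of (title, page_xml).  Returns [(first_title, offset)]."""
    index, offset, block = [], 0, []
    with open(out_path, 'wb') as out:
        for title, page_xml in pages:
            if not block:
                index.append((title, offset))
            block.append(page_xml)
            if len(block) == pages_per_block:
                data = bz2.compress(''.join(block).encode('utf-8'))
                out.write(data)
                offset += len(data)
                block = []
        if block:
            out.write(bz2.compress(''.join(block).encode('utf-8')))
    return index

Reading back is the mirror image: look up the title in the index, seek to the stored offset, and bz2-decompress that one block.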
On Sat, Jan 10, 2009 at 9:14 AM, Keisial keisial@gmail.com wrote:
bzipping the pages by blocks, as I did for my offline reader, produces a file size similar to the original.* There may be ways to get similar results without having to rebuild the revisions. Also note that in both cases you still need an intermediate app to provide input dumps for those tools.
*112% measuring enwiki-20081008-pages-meta-current. Looking at ruwiki-20081228-history, both the original bz2 and my faster-access one are 8.2G.
-history dumps and one-off page dumps are pretty distinct cases: the history dumps have a lot more available redundancy.
For fast access to articles you might want to consider compressing articles one-off with a dictionary-based pre-pass such as http://xwrt.sourceforge.net/
On 1/7/09 1:43 PM, Robert Rohde wrote:
Recently I've written some Python code to re-express the revision history into an "edit syntax", using an xml compatible notation for changes with expressions like:
<replace>, <delete>, <insert>, etc.
Cool!
The current version of my compressor averaged a little better than 250 revisions per second on ruwiki (about 12 hours total) on an 18-month-old desktop. However, as the CPU utilization was only 50-70% of a full processing core most of the time, I suspect that my choice to read and write from an external hard drive may have been the limiting factor. On a good machine, 400+ rev/s might be a plausible number for the current code.
It'd be good to compare this against the general-purpose bzip2 and 7zip LZMA compression...
However, I wanted to know if this is a project of interest to MediaWiki developers or other people. Placed in the dump chain, it could substantially reduce the size of the human-readable dumps (at the expense that one would need to process through a series of edits to see the full text of any specific revision).
Definitely of interest! If you haven't already, I'd love to see some documentation on the format on mediawiki.org, and it'd be great if we can host the dev code in source control, under extensions or tools for now, until we can integrate something directly into the export code.
-- brion
On Wed, Jan 7, 2009 at 9:53 PM, Brion Vibber brion@wikimedia.org wrote:
The current version of my compressor averaged a little better than 250 revisions per second on ruwiki (about 12 hours total) on an 18-month-old desktop. However, as the CPU utilization was only 50-70% of a full processing core most of the time, I suspect that my choice to read and write from an external hard drive may have been the limiting factor. On a good machine, 400+ rev/s might be a plausible number for the current code.
It'd be good to compare this against the general-purpose bzip2 and 7zip LZMA compression...
I started a process to recompress the ruwiki dump using the default settings on 7-Zip. After 5 minutes, it told me I had 16 hours remaining. So I would estimate that my revision compressor is on the same timescale and perhaps somewhat faster than 7-Zip. Again, I was reading and writing to an external drive, so there could be an I/O effect in there as well.
it'd be great if we can host the dev code in source control, under extensions or tools for now, until we can integrate something directly into the export code.
Could someone walk me through how I would do that?
-Robert Rohde
2009/1/8 Brion Vibber brion@wikimedia.org:
Definitely of interest! If you haven't already, I'd love to see some documentation on the format on mediawiki.org, and it'd be great if we
I did some similar work a while ago using Python's difflib[1] as the diffing engine. Since difflib was much too slow when feeding it lists of single characters, I broke up the articles into sequences of words which improved the speed dramatically (but it's still not as fast as Robert's).
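Roughly, that tokenisation step looks like this (a sketch, not my actual blame code; the token pattern here is illustrative only):

# Sketch of word-level diffing: difflib is far too slow on per-character
# sequences, so split each revision into word and non-word tokens first.
# Concatenating the tokens reproduces the original text exactly.
import difflib
import re

_TOKEN = re.compile(r'\w+|\W', re.UNICODE)

def word_opcodes(old_text, new_text):
    old, new = _TOKEN.findall(old_text), _TOKEN.findall(new_text)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return matcher.get_opcodes()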
My goal was slightly different, and rather than producing exact revision deltas I was looking for "blame" information[2]. I also computed the SHA1-matching graph of reverts, which calculates the shortest path between the current revision and the first one, consequently skipping over page-blanking events in most cases.
The output for the first 1400 or so articles in enwiki can be found here: http://hewgill.com/~greg/wikiblame/
I would be interested in adapting my blame processor to use a faster diffing algorithm, since it took my machine many hours to process those 1400 articles.
[1]: http://python.org/doc/2.5/lib/module-difflib.html
[2]: http://hewgill.com/journal/entries/461-wikipedia-blame
Greg Hewgill http://hewgill.com
Hoi,
How many characters are there according to your software in the word Mbɔ́tɛ? The correct answer is 5.
Thanks,
GerardM
2009/1/10 Greg Hewgill greg@hewgill.com
2009/1/8 Brion Vibber brion@wikimedia.org:
Definitely of interest! If you haven't already, I'd love to see some documentation on the format on mediawiki.org, and it'd be great if we
I did some similar work a while ago using Python's difflib[1] as the diffing engine. Since difflib was much too slow when feeding it lists of single characters, I broke up the articles into sequences of words which improved the speed dramatically (but it's still not as fast as Robert's).
My goal was slightly different, and rather than producing exact revision deltas I was looking for "blame" information[2]. I also computed the SHA1-matching graph of reverts, which calculates the shortest path between the current revision and the first one, consequently skipping over page-blanking events in most cases.
The output for the first 1400 or so articles in enwiki can be found here: http://hewgill.com/~greg/wikiblame/
I would be interested in adapting my blame processor to use a faster diffing algorithm, since it took my machine many hours to process those 1400 articles.
Greg Hewgill http://hewgill.com
2009/1/11 Gerard Meijssen gerard.meijssen@gmail.com:
How many characters are there according to your software in the word Mbɔ́tɛ? The correct answer is 5
Since I was working with the enwiki dump, I did not pay much attention to internationalisation issues. I arbitrarily defined a "word" as the Python regular expression: [\w\d]+
So, the answer to your question depends on how Python implements the \w word-matching regular expression atom:
"When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. "
Greg Hewgill http://hewgill.com
Hoi,
We do have languages that are not supported with CLDR locales. Does Unicode on its own suffice?
Thanks.
GerardM
2009/1/10 Greg Hewgill greg@hewgill.com
2009/1/11 Gerard Meijssen gerard.meijssen@gmail.com:
How many characters are there according to your software in the word Mbɔ́tɛ? The correct answer is 5
Since I was working with the enwiki dump, I did not pay much attention to internationalisation issues. I arbitrarily defined a "word" as the Python regular expression: [\w\d]+
So, the answer to your question depends on how Python implements the \w word-matching regular expression atom:
"When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. "
Greg Hewgill http://hewgill.com
I've set Robert up with a SVN account (rarohde) to put up the proof of concept converter code.
-- brion
I've uploaded my demonstration code to:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/
The three files are
EditSyntax.py - the main file providing the functions for re-expressing the revision history into my "edit syntax".
ConvertToEditSyntax.py - a utility to perform compressions
ConvertFromEditSyntax.py - a utility to perform decompressions
Both utility files can be run from the command line, i.e.:
ConvertToEditSyntax.py [-v] input_file output_file
ConvertFromEditSyntax.py [-v] input_file output_file
"-v" is an optional flag for "verbose mode" which gives rolling feedback on the program's progress.
ConvertToEditSyntax expects as input either a full history database dump (i.e. "pages-meta-history.xml") or the output from Special:Export with revisions included.
A round-trip via ConvertTo followed by ConvertFrom should be 100% identical to the original.
For simple demonstration purposes, a smallish wiki (something with a 7z size of 5-10 MB) can provide a few hundred thousand revisions and takes a couple of minutes to process. I've been partial to cowiki during testing, for no particular reason.
One can also use Special:Export to get histories of large pages to test. Using large pages from enwiki might be a good place to start because the content is easy to understand.
My largest test case has been ruwiki (12M revisions) which currently takes 10 hours and was described in the original email in this thread. Since that original email I've increased both the execution speed and the compression ratio by a little bit.
I've provided a little bit of internal documentation, but not a lot so far, so obviously feel free to ask questions if you have them.
-Robert Rohde
On Fri, Jan 16, 2009 at 11:21 AM, Brion Vibber brion@wikimedia.org wrote:
I've set Robert up with a SVN account (rarohde) to put up the proof of concept converter code.
-- brion