[Foundation-l] Wikipedia meets git

Gregory Maxwell gmaxwell at gmail.com
Sat Oct 17 08:18:06 UTC 2009


On Fri, Oct 16, 2009 at 10:31 AM, Anthony <wikimail at inbox.org> wrote:
> On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont at googlemail.com
>> if you want only the last 3 revisions checked out , it takes about 10
>> seconds and produces 300k of data.
>
> 10 seconds?  That's horrible.  Have you tried using svn?

On a reasonably fast network it actually takes only about 10 seconds to
pull the entire edit history from his repo. It would take less if the
history had been repacked as I described, but that kind of tight
repacking makes it take longer when you only want a portion of the
history.

Still, many of the neat things that can be done by having the article
in git are only possible if you have the complete history. For
example, generating a blame map needs the entire history.
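
To make the blame point concrete, here is a minimal sketch in Python. It
is not how git blame works internally; it is a toy word-level version
built on the standard difflib module, and the blame() function, the
author names and the sample revisions are all made up for illustration.

```python
import difflib

def blame(revisions):
    """Attribute each word of the latest revision to an author.

    `revisions` is a list of (author, text) pairs, oldest first.
    Returns a list of (author, word) pairs for the final text.
    """
    attributed = []  # (author, word) pairs for the revision processed so far
    for author, text in revisions:
        words = text.split()
        old_words = [w for _, w in attributed]
        matcher = difflib.SequenceMatcher(a=old_words, b=words, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words stay credited to whoever introduced them.
                updated.extend(attributed[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten words are credited to this revision's author.
                updated.extend((author, w) for w in words[j1:j2])
            # "delete": the old words are gone, nothing to carry forward.
        attributed = updated
    return attributed

# Hypothetical three-revision history of one sentence.
history = [
    ("alice", "Git is a version control system"),
    ("bob", "Git is a distributed version control system"),
    ("carol", "Git is a distributed version control system created in 2005"),
]
result = blame(history)
for author, word in result:
    print(author, word)
```

If you run blame() on only the last revision, every word gets credited
to the last editor, which is why a truncated checkout is not enough for
this kind of tool.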

It would be nice if the git archival format were more efficient for the
kinds of changes made in Wikipedia articles. Source code tends to have
short lines, and changes tend to alter a significant portion of the
lines, while edits on Wikipedia are far more likely to change only part
of a very long line (really, a paragraph). So working with line-level
deltas is efficient for source code but inefficient for Wikipedia data.
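
A rough way to see the granularity effect, as a sketch only: the Python
below counts how many bytes of new text a delta has to carry when it
works on whole lines versus on words. It does not model git's actual
storage format; inserted_bytes() and the sample text are made up for
illustration.

```python
import difflib

def inserted_bytes(old_tokens, new_tokens):
    """Bytes of new material a token-level delta would have to store."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    total = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            total += sum(len(t.encode("utf-8")) for t in new_tokens[j1:j2])
    return total

# One paragraph stored as a single long line, as in wikitext.
paragraph = (
    "Wikipedia articles are usually stored with one paragraph per line, "
    "so a single line can run to several hundred characters, "
    "and most edits touch only a word or two of it."
)
old = "Intro line.\n" + paragraph + "\nClosing line."
new = old.replace("several hundred", "several thousand")  # a one-word edit

line_cost = inserted_bytes(old.splitlines(), new.splitlines())
word_cost = inserted_bytes(old.split(), new.split())
print("line-level delta carries", line_cost, "bytes")
print("word-level delta carries", word_cost, "bytes")
```

With tokens as lines, the one-word edit forces the whole paragraph to be
stored again; with tokens as words, only the changed word is.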

On this repository, git fast-export --all | lzma -9 produces about
900 kB of output (505783 bytes if you want to be silly and use
PAQ8HP12, which is pretty much the state of the art for English text,
instead of LZMA). These methods don't provide fast random access, but
it's still clear that there is a lot of room for improvement. ;)  I'm
not sure if anyone is working on improved compression in git for
these kinds of documents.
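
As a toy illustration of why compressing the whole history as one stream
does so well, the Python sketch below compresses a synthetic set of
revisions two ways with the standard lzma module: each revision on its
own, and all of them concatenated. The data and make_revisions() are
invented for the example and say nothing about the real repository's
numbers.

```python
import lzma

def make_revisions(n_words=2000, n_revisions=50):
    """Build synthetic revisions where each one changes a single word."""
    words = ["word%d" % i for i in range(n_words)]
    revisions = []
    for k in range(n_revisions):
        words[k] = "edit%d" % k  # the one-word edit made by revision k
        revisions.append(" ".join(words).encode("utf-8"))
    return revisions

revisions = make_revisions()

# Each revision compressed independently: no sharing between revisions.
separate = sum(len(lzma.compress(r, preset=9)) for r in revisions)

# All revisions in one stream: later revisions are mostly back-references.
whole = len(lzma.compress(b"\n".join(revisions), preset=9))

print("raw bytes:              ", sum(len(r) for r in revisions))
print("compressed separately:  ", separate)
print("compressed as one stream:", whole)
```

The single stream comes out much smaller because nearly all of each
revision repeats the previous one, but you give up the ability to pull
out one revision without decompressing what comes before it.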

Getting the entire history of a frequently edited article like this
down to ~1-2 MB is roughly where I think it becomes reasonable for
someone doing continued non-trivial work on the article to fetch the
entire history and thus gain access to functionality that needs most of
the history.


