[Foundation-l] Wikipedia meets git

Anthony wikimail at inbox.org
Sat Oct 17 14:05:15 UTC 2009


On Sat, Oct 17, 2009 at 4:40 AM, jamesmikedupont at googlemail.com
<jamesmikedupont at googlemail.com> wrote:
>> It would be nice if the git archival format was more efficient for the
>> kinds of changes made in Wikipedia articles: Source code tends
>> to have short lines, and changes tend to touch a significant portion
>> of the lines, while edits on Wikipedia are far more likely to change
>> only part of a very long line (really, a paragraph).... so working
>> with line level deltas is efficient for source code while inefficient
>> for Wikipedia data.
>
> I have started to work on the blame code
> to bring it down to the char level and learn about it.

Character-level deltas would probably make merging too inefficient.
Treating a period followed by a space as a line separator (that is,
diffing sentence by sentence) would probably be more efficient.
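
A rough sketch of the sentence-as-line idea, using only Python's
standard difflib. The function names and the sample text are made up
for illustration, and this is not how git or MediaWiki computes diffs.
It only shows that a one-word edit in a long paragraph then costs one
sentence rather than the whole paragraph:

```python
import difflib
import re


def split_sentences(text):
    """Split after a period followed by whitespace, keeping the period."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def sentence_delta(old, new):
    """Return only the removed ("- ") and added ("+ ") sentences."""
    diff = difflib.ndiff(split_sentences(old), split_sentences(new))
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical paragraph with a single-word edit in the middle sentence.
old = "Git stores snapshots. Wikipedia paragraphs are long. Edits are small."
new = "Git stores snapshots. Wikipedia paragraphs are very long. Edits are small."

for line in sentence_delta(old, new):
    print(line)
```

A naive split like this will also break on abbreviations such as
"e.g. ", so a real implementation would need a better separator rule.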

The key to efficiency is to use skip deltas, though.  You build a
binary tree so that reconstructing any revision requires applying only
about log(n) deltas, rather than up to n in a linear delta chain.
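
To make the log(n) claim concrete, here is a minimal sketch of one way
to pick skip-delta bases. It assumes revision 0 is stored as full text
and that revision n is stored as a delta against n with its lowest set
bit cleared. That is my understanding of the general scheme; svn's
actual storage differs in its details, and the function names here are
made up:

```python
def skip_delta_base(n):
    """Base revision for revision n: n with its lowest set bit cleared."""
    return n & (n - 1)


def delta_chain(n):
    """Revisions whose deltas are applied, in order, on top of the
    full text of revision 0 to rebuild revision n."""
    chain = []
    while n > 0:
        chain.append(n)
        n = skip_delta_base(n)
    return chain[::-1]


print(delta_chain(54))         # [32, 48, 52, 54]
print(len(delta_chain(1000)))  # 6
```

The chain length equals the number of 1 bits in n, so it never exceeds
log2(n) + 1.  The cost is that each delta spans a larger distance than
a delta against the immediately preceding revision, so it is bigger.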

I asked whether you had tried svn because svn already uses skip deltas.

Is the idea that the entire file would need to be transferred over the
Internet, though?  If so, I guess you wouldn't want to use skip deltas
- they greatly reduce access time to early revisions, but at a
slight space penalty.


