Protocol Buffers are not a bad idea, but I'm not sure about their overhead.
AFAIK, PB adds an overhead of at least 1 byte per field (the field tag). If I'm counting correctly, with enwiki's 600M revisions and 8 fields per revision, that means a total overhead of more than 4 GB. The fixed-size part of all revisions (i.e. without comment and text) amounts to ~22 GB, so I think PB's overhead is too high.
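To make the arithmetic explicit (this assumes the protobuf minimum of one tag byte per field, which holds for field numbers 1-15; varint encoding of the values themselves can add more on top):

revisions = 600_000_000        # approximate enwiki revision count
fields_per_revision = 8
tag_bytes = revisions * fields_per_revision  # 1 tag byte per field
print(tag_bytes / 10**9, "GB")  # -> 4.8 GB of pure field-tag overhead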
The overhead could be alleviated by compression, but I wasn't planning to compress the metadata.
So, I think I will start without PB. If I later decide to compress metadata, I will also try to use PB and see if it works.
Also, I don't think reading the binary format is going to be the biggest issue when implementing your own library for incremental dumps, especially if I end up using delta compression for revision texts (a rough sketch below).
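To illustrate what I mean by delta compression, here's a quick Python sketch (the real format would store compact binary deltas; difflib and these function names are just for illustration):

import difflib

def make_delta(old, new):
    # Encode `new` as ops against `old`: either copy a range of the
    # old text, or insert literal new text.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        else:
            ops.append(("insert", new[j1:j2]))
    return ops

def apply_delta(old, ops):
    # Reconstruct the new revision from the previous one plus the delta.
    parts = [old[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops]
    return "".join(parts)

Only the first revision of a page would need to be stored in full; each later revision is reconstructed from its predecessor plus a (usually small) delta.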
Petr Onderka
On Mon, Jul 1, 2013 at 9:16 PM, Daniel Friesen daniel@nadir-seen-fire.com wrote:
Instead of XML or a proprietary binary format, could we try using a standard binary format such as Protocol Buffers as a base, to reduce the issues with having to implement the reading/writing in multiple languages?
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On Mon, 01 Jul 2013 11:56:50 -0700, Tyler Romeo tylerromeo@gmail.com wrote:
Petr is right on the mark with this one. The purpose of this version 2 of the dumps is to allow protocol-specific incremental updating of the dump, which would be significantly more difficult in a non-binary format.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l