Protocol Buffers are not a bad idea, but I'm not sure about their overhead.
AFAIK, PB adds an overhead of at least 1 byte per field (the field tag). If I'm counting correctly, with enwiki's 600M revisions and 8 fields per revision, that means a total overhead of more than 4 GB. The fixed-size part of all revisions (i.e. without comment and text) amounts to ~22 GB, so I think PB's overhead is too high.
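To make the arithmetic explicit (this assumes the protobuf minimum of one tag byte per field, which holds for field numbers 1-15; varint encoding of the values themselves can add more on top):

revisions = 600_000_000        # approximate enwiki revision count
fields_per_revision = 8
tag_bytes = revisions * fields_per_revision  # 1 tag byte per field
print(tag_bytes / 10**9, "GB")  # -> 4.8 GB of pure field-tag overhead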
The overhead could be alleviated by compression, but I wasn't planning to compress the metadata.
So, I think I will start without PB. If I later decide to compress metadata, I will also try to use PB and see if it works.
Also, I don't think reading the binary format is going to be the biggest issue when implementing your own library for incremental dumps, especially if I end up using delta compression for revision texts (a rough sketch below).
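To illustrate what I mean by delta compression, here's a quick Python sketch (the real format would store compact binary deltas; difflib and these function names are just for illustration):

import difflib

def make_delta(old, new):
    # Encode `new` as ops against `old`: either copy a range of the
    # old text, or insert literal new text.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        else:
            ops.append(("insert", new[j1:j2]))
    return ops

def apply_delta(old, ops):
    # Reconstruct the new revision from the previous one plus the delta.
    parts = [old[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops]
    return "".join(parts)

Only the first revision of a page would need to be stored in full; each later revision is reconstructed from its predecessor plus a (usually small) delta.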
Petr Onderka
On Mon, Jul 1, 2013 at 9:16 PM, Daniel Friesen daniel@nadir-seen-fire.com wrote:
Instead of XML or a proprietary binary format, could we try using a standard binary format such as Protocol Buffers as a base, to reduce the issues with having to implement the reading/writing in multiple languages?
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On Mon, 01 Jul 2013 11:56:50 -0700, Tyler Romeo tylerromeo@gmail.com wrote:
Petr is right on the mark with this one. The purpose of this version 2 of the dumps is to allow protocol-specific incremental updating of the dump, which would be significantly more difficult in a non-binary format.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l