On 01/07/13 23:21, Nicolas Torzec wrote:
In principle, I understand the need for binary formats and compression in a context with data at this scale.
On the other hand, plain text formats are easy to work with, especially for third-party
users and organizations.
Playing devil's advocate, I could even argue that you should keep the data dumps in
plain text, keep your processing dead simple, and let distributed processing
systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale,
computing diffs on the fly whenever they are needed.
Reading the wiki page mentioned at the beginning of this thread, it is not clear to me
what the requirements for this new incremental update format are, or why they exist.
That makes it hard to provide useful input and help.
- Nicolas Torzec.
The simplest possible dump format is the best, and there's already a
thriving ecosystem around the current XML dumps, which would be broken
by moving to a binary format. Binary file formats and APIs defined by
code are not the way to go if you want long-term archival that can
endure through decades of technological change.
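To illustrate why that ecosystem is worth preserving: the current XML dumps can be processed as a stream with nothing beyond a standard library. The sketch below uses a tiny inline stand-in for a pages-articles dump; real dumps wrap everything in a <mediawiki> element carrying an XML namespace, which is omitted here for brevity.

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a pages-articles dump (real dumps add an XML
# namespace and many more fields; this is illustrative only).
dump = io.BytesIO(b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>Example article text.</text></revision>
  </page>
</mediawiki>""")

titles = []
# iterparse streams the file, so even multi-gigabyte dumps can be read
# one <page> at a time, provided each subtree is cleared after use.
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the subtree we just processed

print(titles)
```

Any language with an XML pull parser can do the same, which is exactly the kind of decades-stable accessibility a binary format puts at risk.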
If more money is needed for dump processing, it should be budgeted for
and added to the IT budget, instead of over-optimizing by using a
potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a
core part of the Foundation's mission. The value is in the data, which
is priceless. Computers and storage are (relatively) cheap by
comparison, and Wikipedia is growing significantly more slowly than the
year-on-year improvements in storage, processing and communication
links. Moreover, re-making the dumps from scratch every time provides defence in
depth against subtle database corruption that an incremental, diff-based format
would silently carry forward from one update to the next.
Please keep the dumps themselves simple and their format stable and, as
Nicolas says, do the clever stuff elsewhere, where you can use
whatever efficient representation you like for the processing.
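The "compute diffs elsewhere" approach really is dead simple with plain text. A minimal sketch, using made-up sample wikitext and the Python standard library, of deriving a diff on demand rather than baking it into the dump format:

```python
import difflib

# Two hypothetical revisions of one article's text (illustrative only).
old = """Paris is the capital of France.
It lies on the Seine.""".splitlines()
new = """Paris is the capital and largest city of France.
It lies on the Seine.""".splitlines()

# A unified diff is cheap to compute on the fly from plain text,
# so it need not be precomputed and shipped inside the dump itself.
diff = list(difflib.unified_diff(old, new,
                                 fromfile="rev1", tofile="rev2",
                                 lineterm=""))
for line in diff:
    print(line)
```

Scaled up, the same per-article diffing is an embarrassingly parallel job, which is precisely what MapReduce-style systems are built for.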