On 01/07/13 23:21, Nicolas Torzec wrote:
Hi there,
In principle, I understand the need for binary formats and compression in a context with limited resources. On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
Playing devil's advocate, I could even argue that you should keep the data dumps in plain text and keep your processing dead simple, then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed, or on the fly.
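For illustration, a minimal sketch of that approach, assuming line-oriented plain-text dumps and hypothetical file paths; a real job would diff at the page or revision level rather than raw lines:

# Minimal PySpark sketch: diff two plain-text dump snapshots.
# Paths and the line-per-record layout are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dump-diff").getOrCreate()
sc = spark.sparkContext

old = sc.textFile("dumps/enwiki-20130601-pages.txt")  # hypothetical path
new = sc.textFile("dumps/enwiki-20130701-pages.txt")  # hypothetical path

added = new.subtract(old)    # lines present only in the newer dump
removed = old.subtract(new)  # lines present only in the older dump

added.saveAsTextFile("diffs/20130601-20130701/added")
removed.saveAsTextFile("diffs/20130601-20130701/removed")

The point being that the diff computation can live in whatever framework suits the consumer, while the published dumps stay plain and simple.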
Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements for this new incremental update format are, or why they exist. That makes it hard to provide useful input and help.
Cheers.
- Nicolas Torzec.
+1
The simplest possible dump format is the best, and there's already a thriving ecosystem around the current XML dumps, which would be broken by moving to a binary format. Binary file formats and APIs defined by code are not the way to go if you want long-term archival that can endure through decades of technological change.
If dump processing needs more money, that should be budgeted for in the IT budget, rather than over-optimized away with a potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a core part of the Foundation's mission. The value is in the data, which is priceless. Computers and storage are (relatively) cheap by comparison, and Wikipedia is growing significantly more slowly than the year-on-year improvements in storage, processing and communication links. Moreover, regenerating the dumps from scratch each time provides defence in depth against subtle database corruption slowly creeping into the dumps.
Please keep the dumps themselves simple and their format stable, and, as Nicolas says, do the clever stuff elsewhere, where you can use whatever efficient representation you like for the processing.
Neil