What I hope for is a format that allows dumps to be produced much more
rapidly, where the time to produce the incrementals grows only as the
number of edits per time frame grows.
Curious: what's happening currently that makes the time to produce
incrementals grow more quickly than that?
On Tue, Jul 2, 2013 at 4:41 AM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
On Tue, 02-07-2013 at 11:47 +0100, Neil Harris wrote:
The simplest possible dump format is the best,
and there's already a
thriving ecosystem around the current XML dumps, which would be broken
by moving to a binary format. Binary file formats and APIs defined by
code are not the way to go if you want long-term archival that can
endure through decades of technological change.
If dump processing needs more money, it should be budgeted for in the
IT budget, rather than over-optimizing by using a potentially fragile,
and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a
core part of the Foundation's mission. The value is in the data, which
is priceless. Computers and storage are (relatively) cheap by
comparison, and Wikipedia is growing significantly more slowly than the
year-on-year improvements in storage, processing and communication
links. Moreover, re-making the dumps every time provides defence in
depth against subtle database corruption that might slowly corrupt a
database dump.
A point of information: we already do not produce dumps from scratch
every time; we re-use old revisions, because if we did not it would take
months and months to generate the en wikipedia dumps, which is clearly
untenable.
The question now is how we are going to use those old revisions. Right
now we uncompress the entire previous dump, write new information where
needed, and recompress it all (which would take several weeks for en
wikipedia history dumps if we didn't run 27 jobs at once).
What I hope for is a format that allows dumps to be produced much more
rapidly, where the time to produce the incrementals grows only as the
number of edits per time frame grows, and where the time to produce new
fulls via the incrementals is bounded in a much better fashion than we
have now.
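As an illustration of the property Ariel is asking for, here is a minimal sketch (not the actual dump format, which is undecided; all names and the per-page layout are invented for this example). If each page's text is stored as an independently compressed block keyed by page id, an incremental update only recompresses the pages that were edited, so update time grows with the number of edits rather than the size of the whole dump:

```python
import zlib

def build_dump(pages):
    """pages: {page_id: text} -> {page_id: independently compressed block}."""
    return {pid: zlib.compress(text.encode()) for pid, text in pages.items()}

def apply_increment(dump, edits):
    """Recompress only the edited pages; unchanged blocks are reused as-is,
    so the cost is proportional to the number of edits."""
    new_dump = dict(dump)  # shallow copy: shares the untouched blocks
    for pid, text in edits.items():
        new_dump[pid] = zlib.compress(text.encode())
    return new_dump

def read_page(dump, pid):
    """Decompress a single page without touching the rest of the dump."""
    return zlib.decompress(dump[pid]).decode()
```

The design point is that compression boundaries align with pages: a full recompress (the current several-weeks process) is replaced by touching only the blocks an edit invalidates.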
And I expect that we would have a library or scripts that provide for
conversion of a new-format dump to the good old XML, so that all the
tools folks use now will continue to work.
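A converter of that kind could be as simple as a streaming generator that walks the new-format dump and emits the familiar XML. This is a sketch only, assuming the per-page compressed layout from the example above; the element names are simplified stand-ins for the real dump schema:

```python
import zlib
from xml.sax.saxutils import escape

def to_xml(dump):
    """Stream a per-page compressed dump ({page_id: zlib block}) back out
    as XML lines, so existing XML-based tools keep working."""
    yield "<mediawiki>"
    for pid in sorted(dump):
        text = zlib.decompress(dump[pid]).decode()
        # escape() protects <, >, & in page text
        yield f"  <page><id>{pid}</id><text>{escape(text)}</text></page>"
    yield "</mediawiki>"
```

Because it is a generator, the converter never holds the whole dump uncompressed in memory, which matters at en wikipedia scale.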
Ariel
Please keep the dumps themselves simple and their format stable, and, as
Nicolas says, do the clever stuff elsewhere, in which you can use
whatever efficient representation you like to do the processing.
Neil
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l