Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

3 Jul 2013

I'm primarily a Windows guy, so I'm trying to write the code in a portable
way and I will make sure the application works on both Linux and Windows.

Petr Onderka

On Wed, Jul 3, 2013 at 4:49 PM, Erik Zachte <ezachte(a)wikimedia.org> wrote:

 ... it will now be a command line application that outputs the data as
 uncompressed XML, in the same format as current dumps.

 That will help a great deal. But I assume your application will be for
 Linux only?
 So it would help to still generate the current compressed dumps, as a
 post-processing step, and store them online for download.

 One of the reasons for XML dumps is platform independence, both on the
 producer side (we had ever-evolving SQL dumps earlier) and on the consumer
 side (not everyone uses Linux).

 Erik Zachte

 -----Original Message-----
 From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:
 wikitech-l-bounces(a)lists.wikimedia.org] On Behalf Of Petr Onderka
 Sent: Wednesday, July 03, 2013 4:04 PM
 To: Wikimedia developers; Wikipedia Xmldatadumps-l
 Subject: Re: [Wikitech-l] [Xmldatadumps-l] Suggested file format of new
 incremental dumps

 A reply to all those who basically want to keep the current XML dumps:

 I have decided to change the primary way of reading the dumps: it will now
 be a command line application that outputs the data as uncompressed XML, in
 the same format as current dumps.

 This way, you should be able to use the new dumps with minimal changes to
 your code.

 Keeping the dumps in a text-based format doesn't make sense, because that
 can't be updated efficiently, which is the whole reason for the new dumps.

 Petr Onderka

 On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk>
 wrote:

 Hi,

 As a regular user of dump files I would not want a "fancy" file
 format with indexes stored as trees etc.

 I parse all the dump files (both for SQL tables and the XML files)
 with a one pass parser which inserts the data I want (which sometimes
 is only a small fraction of the total amount of data in the file) into
 my local database. I will normally never store uncompressed dump
 files, but pipe the uncompressed data directly from bunzip or gunzip
 to my parser to save disk space. Therefore it is important to me that
 the format is simple enough for a one pass parser.
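
The one-pass workflow described above can be sketched roughly as follows. This is only an illustration, not the poster's actual parser: the function name `iter_page_titles`, the tiny sample document, and the choice to extract only `<title>` elements are assumptions made for the example.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_page_titles(stream):
    """Yield the text of each <title> element in a single forward pass."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop any XML namespace prefix, e.g. "{uri}title" -> "title".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title":
            yield elem.text
        elif tag == "page":
            # Discard the finished page's children so the whole dump
            # is never held in memory at once.
            elem.clear()


# Hypothetical miniature dump, compressed in memory for the demonstration.
sample = (
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>x</text></revision></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
compressed = io.BytesIO(bz2.compress(sample))

# bz2.open decompresses lazily, so nothing uncompressed is written to disk.
titles = list(iter_page_titles(bz2.open(compressed)))
print(titles)
```

When the decompressed data arrives through a shell pipe instead, the same function could be given `sys.stdin.buffer` as its stream.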

 I cannot really imagine who would use a library with an object-oriented
 API to read dump files. No matter what, it would be inefficient and
 have fewer features and possibilities than using a real database.

 I could live with a binary format, but I have doubts whether it is a
 good idea. It will be harder to make sure that your parser is working
 correctly, and you have to consider things like endianness, size of
 integers, format of floats etc., which give no problems in text formats.
 The binary files may be smaller uncompressed (which I don't store anyway)
 but not necessarily when compressed, as the compression will do better
 on text files.
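
The endianness concern can be illustrated with a short sketch. The value and the 32-bit unsigned field width are assumptions chosen for the example; nothing here reflects the actual layout of the proposed dump format.

```python
import struct

revision_id = 1234567890  # hypothetical value of a 32-bit unsigned field

# The same integer serialized with two explicit byte orders.
little = struct.pack("<I", revision_id)  # least significant byte first
big = struct.pack(">I", revision_id)     # most significant byte first

# Decoding with the matching byte order recovers the value.
assert struct.unpack("<I", little)[0] == revision_id

# Decoding with the wrong byte order raises no error; it just
# produces a different number, which is why such bugs are easy to miss.
misread = struct.unpack(">I", little)[0]
print(little.hex(), big.hex(), misread)
```

A binary format therefore has to fix the byte order and field widths in its specification, whereas a decimal number written as text reads the same on every platform.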

 Regards,
 - Byrial

 _______________________________________________
 Xmldatadumps-l mailing list
 Xmldatadumps-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

