Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2 Jul 2013

+1

And given how messy the revision data can be, having the possibility of
actually inspecting it with a text editor is a great boon.

That said, there may be other use cases that I am not aware of for which a
binary format might be useful, but if you just need to parse and pipe to a
DB, text is the best option.

Giovanni
On Jul 1, 2013 5:10 PM, "Byrial Jensen" <byrial(a)vip.cybercity.dk> wrote:

 Hi,

 As a regular user of dump files I would not want a "fancy" file format
 with indexes stored as trees etc.

 I parse all the dump files (both for SQL tables and the XML files) with a
 one pass parser which inserts the data I want (which sometimes is only a
 small fraction of the total amount of data in the file) into my local
 database. I will normally never store uncompressed dump files, but pipe the
 uncompressed data directly from bunzip or gunzip to my parser to save disk
 space. Therefore it is important to me that the format is simple enough for
 a one pass parser.
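
A one-pass pipeline of the kind described above could be sketched roughly as follows. This is a minimal illustration, not the author's actual parser: the function names `local` and `iter_titles`, the sample document, and its namespace URL are all hypothetical, and the element names (`page`, `title`, `revision`) are assumed to match the XML dump layout.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield (title, revision_count) for each <page>, reading the stream once."""
    title, revisions = None, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # drop the revision text once it has been counted
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()  # drop the finished page's children


# Tiny made-up sample standing in for a decompressed dump.
SAMPLE = (
    b"<mediawiki xmlns='http://example.invalid/export'>"
    b"<page><title>A</title>"
    b"<revision><text>x</text></revision>"
    b"<revision><text>y</text></revision></page>"
    b"<page><title>B</title>"
    b"<revision><text>z</text></revision></page>"
    b"</mediawiki>"
)

result = list(iter_titles(io.BytesIO(SAMPLE)))
print(result)
```

To read from a pipe instead of the sample, pass `sys.stdin.buffer` to `iter_titles` and run something like `bunzip2 -c dump.xml.bz2 | python parser.py` (file names hypothetical). Nothing is written to disk uncompressed, and each page is discarded as soon as it has been handled.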

 I cannot really imagine who would use a library with an object-oriented API
 to read dump files. No matter what, it would be inefficient and have fewer
 features and possibilities than using a real database.

 I could live with a binary format, but I doubt that it is a good idea.
 It will be harder to make sure that your parser is working correctly, and
 you have to consider things like endianness, size of integers, format of
 floats etc., which cause no problems in text formats. The binary files may be
 smaller uncompressed (which I don't store anyway) but not necessarily when
 compressed, as the compression will do better on text files.
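
The endianness concern can be illustrated with a short sketch. The value below is a made-up stand-in for an integer field such as a revision id; no actual binary dump format is assumed.

```python
import struct

value = 0x12345678  # hypothetical integer field, e.g. a revision id

# The same 32-bit unsigned integer, serialized with each byte order.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

# A reader that assumes the wrong byte order gets a different number
# and no error to signal that anything went wrong.
wrong = struct.unpack(">I", little)[0]

# In a text format the value is just its decimal digits, so there is
# no byte order or integer width to agree on.
text = str(value).encode("ascii")

print(little.hex(), big.hex(), hex(wrong), text)
```

A writer and a reader of a binary format have to agree on details like the `<` or `>` above in advance, whereas the text form can be checked by eye in an editor.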

 Regards,
 - Byrial

 _______________________________________________
 Xmldatadumps-l mailing list
 Xmldatadumps-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
