A reply to all those who basically want to keep the current XML dumps:
I have decided to change the primary way of reading the dumps: it will now
be a command line application that outputs the data as uncompressed XML, in
the same format as current dumps.
This way, you should be able to use the new dumps with minimal changes to
your code.
Keeping the dumps in a text-based format doesn't make sense, because that
can't be updated efficiently, which is the whole reason for the new dumps.
Petr Onderka
On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk> wrote:
Hi,
As a regular user of dump files I would not want a "fancy" file format
with indexes stored as trees etc.
I parse all the dump files (both for SQL tables and the XML files) with a
one pass parser which inserts the data I want (which sometimes is only a
small fraction of the total amount of data in the file) into my local
database. I will normally never store uncompressed dump files, but pipe the
uncompressed data directly from bunzip or gunzip to my parser to save disk
space. Therefore it is important to me that the format is simple enough for
a one pass parser.
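
To illustrate the kind of one-pass parsing described above, here is a minimal sketch in Python. It is an assumption-laden example, not anyone's actual parser: the function names (local, iter_pages) are made up for illustration, and it assumes the usual <page>, <title> and <text> elements of the current XML dumps. It reads the stream once, keeps only a small piece of each page, and discards each page as soon as it is finished.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text_length) for each <page>, in one pass over the stream.

    text_length is the length of the last <text> element seen inside the page.
    """
    title, text_len = None, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text_len = len(elem.text or "")
        elif name == "page":
            yield title, text_len
            title, text_len = None, 0
            # Drop the finished page's children so memory use stays low.
            elem.clear()
```

Passing sys.stdin.buffer as the stream lets the function consume XML piped from a decompressor (or from any program that writes the same XML to standard output) without the uncompressed data ever being stored on disk.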
I cannot really imagine who would use a library with an object-oriented API
to read dump files. No matter what it would be inefficient and have fewer
features and possibilities than using a real database.
I could live with a binary format, but I have doubts if it is a good idea.
It will be harder to make sure that your parser is working correctly, and
you have to consider things like endianness, size of integers, format of
floats etc. which give no problems in text formats. The binary files may be
smaller uncompressed (which I don't store anyway) but not necessarily when
compressed, as the compression will do better on text files.
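
The endianness concern can be shown with a small Python sketch. This is only an illustration using the standard struct module with an arbitrary example value; it does not describe the layout of any actual dump format.

```python
import struct

value = 1025  # 0x00000401, an arbitrary example number

# The same 32-bit unsigned integer in the two byte orders.
little = struct.pack("<I", value)  # bytes 01 04 00 00
big = struct.pack(">I", value)     # bytes 00 00 04 01

# A reader that assumes the wrong byte order silently gets a wrong number.
misread = struct.unpack("<I", big)[0]

# In a text format the number is just its decimal digits, so there is
# no byte order or integer width to agree on.
as_text = str(value).encode("ascii")
```

Here misread comes out as 17039360 instead of 1025, with no error raised, which is the kind of mistake that is hard to notice in a binary parser.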
Regards,
- Byrial
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l