The problem is that appending is not enough, especially if you want to keep
the current format.
1. With the current format, you could almost append new pages, but not new
revisions of existing pages: revisions are nested inside their page's
<page> element, so they belong in the middle of the XML, not at the end.
2. We also need to handle deletions (and undeletions) of pages and
revisions.
3. There are also "current" dumps, which always contain only the most
recent revision of a page, so an update has to replace the old revision
rather than be appended after it.
And another advantage of the binary format is that you *can* seek easily.
If you're looking for a specific page or revision, you don't have to go
through the whole file; you tell the application what you want, and it
looks it up and outputs only that.
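A rough sketch of the idea (the actual format is still being designed, so
the fixed-width index of (page id, byte offset) pairs below is purely
hypothetical):

    import struct

    # Hypothetical on-disk index: sorted, fixed-width little-endian
    # records of (page id, byte offset). '<qq' pins the byte order and
    # integer size explicitly.
    RECORD = struct.Struct('<qq')

    def find_page_offset(dump, index_start, n_records, page_id):
        """Binary-search the sorted index for one page's byte offset."""
        lo, hi = 0, n_records
        while lo < hi:
            mid = (lo + hi) // 2
            dump.seek(index_start + mid * RECORD.size)
            pid, offset = RECORD.unpack(dump.read(RECORD.size))
            if pid == page_id:
                return offset
            elif pid < page_id:
                lo = mid + 1
            else:
                hi = mid
        return None
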
Also, even if you couldn't seek, I don't see how this is any worse than the
current situation, where you also can't seek to a specific position in the
compressed XML (unless you use the multistream dumps).
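For comparison, seeking into a multistream dump works roughly like this (a
sketch; the offsets come from the accompanying index file, whose lines give
offset:page_id:title):

    import bz2

    def read_stream_at(dump_path, offset, chunk_size=64 * 1024):
        """Decompress the single bz2 stream starting at byte `offset`
        (offsets come from the multistream index file)."""
        parts = []
        decomp = bz2.BZ2Decompressor()
        with open(dump_path, 'rb') as f:
            f.seek(offset)
            while not decomp.eof:
                data = f.read(chunk_size)
                if not data:
                    break
                parts.append(decomp.decompress(data))
        return b''.join(parts)  # raw XML for one block of pages
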
Petr Onderka
On Wed, Jul 3, 2013 at 4:45 PM, Giovanni Luca Ciampaglia <glciampagl(a)gmail.com> wrote:
Petr, could you please elaborate more on this last claim? If turning the
dump generation into an incremental process is the task you are interested
in solving, then I don't understand how text constitutes a problem. Text
files can be appended to like any regular file, and it shouldn't be
difficult to do this in a way that keeps the XML structure valid.
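For instance, something along these lines would do (a sketch, assuming the
file ends with exactly the closing tag):

    def append_pages(path, new_pages_xml, closing=b'</mediawiki>\n'):
        """Append serialized <page> elements to an uncompressed XML dump
        while keeping it well-formed. Assumes the file really ends with
        `closing`."""
        with open(path, 'r+b') as f:
            f.seek(-len(closing), 2)  # step back over the closing tag
            f.write(new_pages_xml)
            f.write(closing)
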
As I said, having the possibility to seek and inspect the files manually
is a tremendous boon when debugging your code. With what you propose that
would be possible but more complicated, since one cannot seek to a specific
position in stdout without going through the whole output.
Best
Giovanni
On Jul 3, 2013 4:05 PM, "Petr Onderka" <gsvick(a)gmail.com> wrote:
> A reply to all those who basically want to keep the current XML dumps:
>
> I have decided to change the primary way of reading the dumps: it will
> now be a command line application that outputs the data as uncompressed
> XML, in the same format as the current dumps.
>
> This way, you should be able to use the new dumps with minimal changes to
> your code.
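>
> For example (the tool name below is just a placeholder, since the
> interface isn't final):
>
>     import subprocess
>
>     # Hypothetical command name. Its stdout is plain uncompressed XML,
>     # so an existing parser can read it exactly as it reads the output
>     # of bzcat today.
>     proc = subprocess.Popen(['incremental-dump-reader', 'enwiki.dump'],
>                             stdout=subprocess.PIPE)
>     parse_dump(proc.stdout)  # your current one-pass parser, unchanged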
>
> Keeping the dumps in a text-based format doesn't make sense, because text
> can't be updated in place efficiently, and efficient updates are the whole
> point of the new dumps.
>
> Petr Onderka
>
>
> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk> wrote:
>
>> Hi,
>>
>> As a regular user of dump files, I would not want a "fancy" file
>> format with indexes stored as trees etc.
>>
>> I parse all the dump files (both the SQL tables and the XML files) with
>> a one-pass parser which inserts the data I want (which sometimes is only a
>> small fraction of the total amount of data in the file) into my local
>> database. I normally never store uncompressed dump files, but pipe the
>> uncompressed data directly from bunzip2 or gunzip to my parser to save
>> disk space. Therefore it is important to me that the format is simple
>> enough for a one-pass parser.
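>>
>> For the XML files the pattern is roughly this (in Python for
>> illustration; handle_title stands in for whatever extraction and
>> database insertion I actually do):
>>
>>     import sys
>>     import xml.etree.ElementTree as ET
>>
>>     # One pass over stdin, e.g.: bunzip2 -c dump.xml.bz2 | parser.py
>>     for event, elem in ET.iterparse(sys.stdin.buffer):
>>         if elem.tag.endswith('title'):
>>             handle_title(elem.text)  # insert into the local database
>>         elem.clear()  # discard each element to keep memory flat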
>>
>> I cannot really imagine who would use a library with an object-oriented
>> API to read dump files. No matter what, it would be inefficient and have
>> fewer features and possibilities than using a real database.
>>
>> I could live with a binary format, but I have doubts whether it is a good
>> idea. It will be harder to make sure that your parser is working correctly,
>> and you have to consider things like endianness, sizes of integers, and
>> formats of floats, which cause no problems in text formats. The binary
>> files may be smaller uncompressed (which I don't store anyway), but not
>> necessarily when compressed, as compression does better on text files.
>>
>> Regards,
>> - Byrial