Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

1 Jul 2013

...

 What is the intended format of the dump files? The page makes it sound like
 it will have a binary format, which I'm not opposed to, but is definitely
 something you should decide on.

Yes, it is a binary format, I will make that clearer on the page.

The advantage of a binary format is that it's smaller, which I think is
quite important.
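
To illustrate the size argument, here is a minimal Python sketch that encodes the same record once as JSON and once as a fixed-layout binary record. The field names and the layout are made up for this example and are not the proposed dump format.

```python
import json
import struct

# Hypothetical revision metadata; the field names are illustrative only.
revision = {
    "page_id": 12345,
    "revision_id": 987654321,
    "timestamp": 1372672800,
    "user_id": 4242,
}

# Text encoding: field names and decimal digits are stored for every record.
as_json = json.dumps(revision).encode("utf-8")

# Assumed binary layout: little-endian, no padding,
# u32 page_id, u64 revision_id, u32 timestamp, u32 user_id.
RECORD = struct.Struct("<IQII")
as_binary = RECORD.pack(
    revision["page_id"],
    revision["revision_id"],
    revision["timestamp"],
    revision["user_id"],
)

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as binary")
```

The binary record needs no field names or delimiters, which is where most of the saving comes from, even before any compression is applied.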

I think the main advantages of text-based formats are that there are lots of
tools for the common ones (XML and JSON) and that they are human readable.
But those tools wouldn't be very useful, because we certainly want to have
some sort of custom compression scheme and the tools wouldn't be able to
work with that.
And I think human readability is mostly useful if we want others to be able
to write their own code that directly accesses the data.
And, because of the custom compression, doing that won't be that easy
anyway. And hopefully, it won't be necessary, because there will be a nice
library usable by everyone (see below).

...
 Also, I really like the idea of writing it in a low-level language and then
 having bindings for something higher. However, unless you plan on having
 multiple language bindings (e.g., *both* C# and Python), you may want to
 pick a different route. For example, if you decide to only bind to Python,
 you can use something like Cython, which would allow you to write
 pseudo-Python that is still compiled to C. Of course, if you want multiple
 language bindings, this is likely no longer an option.

Right now, everyone can read the dumps in their favorite language.
If I write the library interface well, writing bindings for it for another
language should be relatively trivial, so everyone can keep using their
favorite language.
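
As a rough illustration of why a plain C-style interface keeps bindings cheap, here is a Python sketch using ctypes. The dump library does not exist yet, so the C standard library's strlen stands in for a library function; the wrapper name c_strlen is made up for the example, and the sketch assumes a Unix-like system where the C library can be located.

```python
import ctypes
import ctypes.util

# Stand-in for the future dump library: load the C standard library.
# Assumption: a Unix-like system; find_library may return None, in which
# case CDLL(None) falls back to the symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Hypothetical wrapper: pass a Python string to the C function."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("incremental dumps"))
```

The whole binding is a signature declaration plus a thin wrapper, and the same approach works from any language that can call C functions.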

And I admit, I'm proposing doing it this way partially because of selfish
reasons: I'd like to use this library in my future C# code.
But I realize creating something that works only in C# doesn't make sense,
because most people in this community don't use it.
So, to me, writing the code so that it can be used from anywhere makes the
most sense.

Petr Onderka

...
 On Mon, Jul 1, 2013 at 10:00 AM, Petr Onderka <gsvick(a)gmail.com> wrote:

 For my GSoC project Incremental data dumps [1], I'm creating a new file
 format to replace Wikimedia's XML data dumps.
 A sketch of how I imagine the file format will look is at
 http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

 What do you think? Does it make sense? Would it work for your use case?
 Any comments or suggestions are welcome.

 Petr Onderka
 [[User:Svick]]

 [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l 
