For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of what I imagine the file format will look like is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome.
Petr Onderka [[User:Svick]]
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
What is the intended format of the dump files? The page makes it sound like it will have a binary format, which I'm not opposed to, but is definitely something you should decide on.
Also, I really like the idea of writing it in a low level language and then having bindings for something higher. However, unless you plan on having multiple language bindings (e.g., *both* C# and Python), you may want to pick a different route. For example, if you decide to only bind to Python, you can use something like Cython, which would allow you to write pseudo-Python that is still compiled to C. Of course, if you want multiple language bindings, this is likely no longer an option.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
What is the intended format of the dump files? The page makes it sound like it will have a binary format, which I'm not opposed to, but is definitely something you should decide on.
Yes, it is a binary format, I will make that clearer on the page.
The advantage of a binary format is that it's smaller, which I think is quite important.
I think the main advantages of text-based formats are that there are lots of tools for the common ones (XML and JSON) and that they are human readable. But those tools wouldn't be very useful here, because we certainly want some sort of custom compression scheme, and the tools wouldn't be able to work with that. And I think human readability is mostly useful if we want others to be able to write their own code that directly accesses the data; because of the custom compression, doing that won't be that easy anyway. And hopefully, it won't be necessary, because there will be a nice library usable by everyone (see below).
Also, I really like the idea of writing it in a low level language and then having bindings for something higher. However, unless you plan on having multiple language bindings (e.g., *both* C# and Python), you may want to pick a different route. For example, if you decide to only bind to Python, you can use something like Cython, which would allow you to write pseudo-Python that is still compiled to C. Of course, if you want multiple language bindings, this is likely no longer an option.
Right now, everyone can read the dumps in their favorite language. If I write the library interface well, writing bindings for it for another language should be relatively trivial, so everyone can keep using their favorite language.
And I admit, I'm proposing doing it this way partially for selfish reasons: I'd like to use this library in my future C# code. But I realize creating something that works only in C# doesn't make sense, because most people in this community don't use it. So, to me, writing the code so that it can be used from anywhere makes the most sense.
Petr Onderka
On Monday, 01-07-2013, at 16:00 +0200, Petr Onderka wrote:
For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of what I imagine the file format will look like is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome.
Petr Onderka [[User:Svick]]
Dumps v 2.0 finally on the horizon!
A few comments/questions:
I was envisioning that we would produce "diff dumps" in one pass (presumably in a much shorter time than the fulls we generate now) and would apply those against previous fulls (in the new format) to produce new fulls, hopefully also in less time. What do you have in mind for the production of the new fulls?
It might be worth seeing how large the resulting en wp history files are going to be if you compress each revision separately for version 1 of this project. My fear is that even with 7z it's going to make the size unwieldy. If the thought is that it's a first round prototype, not meant to be run on large projects, that's another story.
I'm not sure about removing the restrictions data; someone must have wanted it, like the other various fields that have crept in over time. And we should expect there will be more such fields over time...
We need to get some of the wikidata users in on the model/format discussion, to see what use they plan to make of those fields and what would be most convenient for them.
It's quite likely that these new fulls will need to be split into chunks much as we do with the current en wp files. I don't know what that would mean for the diff files. Currently we split in an arbitrary way based on sequences of page numbers, writing out separate stub files and using those for the content dumps. Any thoughts?
Ariel
I was envisioning that we would produce "diff dumps" in one pass (presumably in a much shorter time than the fulls we generate now) and would apply those against previous fulls (in the new format) to produce new fulls, hopefully also in less time. What do you have in mind for the production of the new fulls?
What I originally imagined is that the full dump would be modified directly and a description of the changes made to it would also be written to the diff dump. But now I think that creating the diff and then applying it makes more sense, because it's simpler. I also think that doing the two at the same time will be faster, because it's less work (no need to read and parse the diff). So what I imagine now is something like this:
1. Read information about a change in a page/revision
2. Create the diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
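In code, that main loop would look roughly like this (a sketch only; the class and function names are illustrative placeholders, not the actual design):

    // Rough sketch of the intended produce-and-apply loop; Change, DiffWriter
    // and FullDump are hypothetical stand-ins for the real classes.
    #include <cstdint>
    #include <string>
    #include <vector>

    enum class ChangeKind { NewPage, NewRevision, DeleteRevision, DeletePage };

    struct Change {                      // one change read from the wiki
        ChangeKind kind;
        uint32_t pageId;
        uint32_t revisionId;
        std::string text;                // revision text, if any
    };

    struct DiffObject { Change change; };    // in-memory diff entry

    struct DiffWriter { void Write(const DiffObject&) { /* serialize to the diff file */ } };
    struct FullDump   { void Apply(const DiffObject&) { /* update pages/revisions in place */ } };

    void ProduceDumps(const std::vector<Change>& changes,
                      DiffWriter& diffDump, FullDump& fullDump)
    {
        for (const Change& c : changes) {   // 1. read information about a change
            DiffObject d{c};                // 2. create the diff object in memory
            diffDump.Write(d);              // 3. write it to the diff file
            fullDump.Apply(d);              // 4. apply it to the full dump
        }
    }

    int main() {
        std::vector<Change> changes;        // would really come from the database/stubs
        DiffWriter diffDump;
        FullDump fullDump;
        ProduceDumps(changes, diffDump, fullDump);
    }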
It might be worth seeing how large the resulting en wp history files are going to be if you compress each revision separately for version 1 of this project. My fear is that even with 7z it's going to make the size unwieldy. If the thought is that it's a first round prototype, not meant to be run on large projects, that's another story.
I do expect that a full dump of enwiki using this compression would be way too big. So yes, this was meant just to have something working, so that I can concentrate on doing compression properly later (after the mid-term).
I'm not sure about removing the restrictions data; someone must have wanted it, like the other various fields that have crept in over time. And we should expect there will be more such fields over time...
If I understand the code in XmlDumpWriter.openPage correctly, that data comes from the page_restrictions field [1], which doesn't seem to be used in non-ancient versions of MediaWiki.
I did think about versioning the page and revision objects in the dump, but I'm not sure how exactly to handle upgrades from one version to another. For now, I think I'll have just one global "data version" per file, but I'll make sure that adding a version to each object in the future will be possible.
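For illustration, that global version could live in a small fixed header at the start of the file; the magic value, field names and sizes below are made up, just to show the idea:

    // Illustrative file header carrying one file-wide data version.
    // The magic value, field names and sizes are assumptions, not the real format.
    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct DumpHeader {
        char    magic[4];          // file identification, e.g. "WIKD"
        uint8_t fileFormatVersion; // layout of the container itself
        uint8_t dataVersion;       // version of the page/revision objects in this file
        uint8_t flags;             // reserved for future use
    };
    #pragma pack(pop)

    int main() {
        DumpHeader h = {{'W', 'I', 'K', 'D'}, 1, 1, 0};
        std::FILE* f = std::fopen("dump.id", "wb");
        if (f) {
            std::fwrite(&h, sizeof h, 1, f);  // readers check dataVersion before parsing objects
            std::fclose(f);
        }
    }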
We need to get some of the wikidata users in on the model/format discussion, to see what use they plan to make of those fields and what would be most convenient for them.
It's quite likely that these new fulls will need to be split into chunks much as we do with the current en wp files. I don't know what that would mean for the diff files. Currently we split in an arbitrary way based on sequences of page numbers, writing out separate stub files and using those for the content dumps. Any thoughts?
If possible, I would prefer to keep everything in a single file. If that won't be possible, I think it makes sense to split on page ids, but make the split id visible (probably in the file name) and unchanging from month to month. If it turns out that a single chunk grows too big, we might consider adding a "split" instruction to diff dumps, but that's probably not necessary now.
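To illustrate what I mean by a visible, unchanging split id: the chunk a page belongs to could simply be derived from fixed page-id boundaries, so the same page always lands in the same chunk in every dump run (the range size and naming below are arbitrary examples):

    // Sketch: map a page id to a chunk file name using fixed-size id ranges.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    std::string ChunkFileName(uint32_t pageId) {
        const uint32_t kPagesPerChunk = 2000000;  // fixed boundary, example value
        uint32_t firstId = (pageId / kPagesPerChunk) * kPagesPerChunk + 1;
        char buf[64];
        std::snprintf(buf, sizeof buf, "enwiki-pages-from-%u.id", firstId);
        return buf;
    }

    int main() {
        // prints "enwiki-pages-from-4000001.id"
        std::printf("%s\n", ChunkFileName(4321987).c_str());
    }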
Petr Onderka
[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions
Hi,
As a regular user of dump files I would not want a "fancy" file format with indexes stored as trees etc.
I parse all the dump files (both for SQL tables and the XML files) with a one pass parser which inserts the data I want (which sometimes is only a small fraction of the total amount of data in the file) into my local database. I will normally never store uncompressed dump files, but pipe the uncompressed data directly from bunzip or gunzip to my parser to save disk space. Therefore it is important to me that the format is simple enough for a one pass parser.
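(For readers unfamiliar with this style, a minimal example of such a one-pass parse, using expat and counting <page> elements; the real parsers of course extract data and insert it into a database:)

    // Minimal one-pass parse: read XML from stdin
    // (e.g. "bzcat pages-articles.xml.bz2 | ./count_pages") and count <page>
    // elements without ever holding the whole dump in memory.
    #include <cstdio>
    #include <cstring>
    #include <expat.h>

    static void XMLCALL OnStartElement(void* userData, const XML_Char* name,
                                       const XML_Char** /*attributes*/) {
        if (std::strcmp(name, "page") == 0)
            ++*static_cast<long*>(userData);
    }

    static void XMLCALL OnEndElement(void*, const XML_Char*) {}

    int main() {
        long pageCount = 0;
        XML_Parser parser = XML_ParserCreate(NULL);
        XML_SetUserData(parser, &pageCount);
        XML_SetElementHandler(parser, OnStartElement, OnEndElement);
        char buf[1 << 16];
        size_t len;
        while ((len = std::fread(buf, 1, sizeof buf, stdin)) > 0)
            XML_Parse(parser, buf, static_cast<int>(len), 0);   // feed chunk by chunk
        XML_Parse(parser, buf, 0, 1);                            // signal end of input
        XML_ParserFree(parser);
        std::printf("%ld pages\n", pageCount);
        return 0;
    }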
I cannot really imagine who would use a library with an object-oriented API to read dump files. No matter what, it would be inefficient and have fewer features and possibilities than using a real database.
I could live with a binary format, but I have doubts about whether it is a good idea. It will be harder to make sure that your parser is working correctly, and you have to consider things like endianness, size of integers, format of floats etc., which cause no problems in text formats. The binary files may be smaller uncompressed (which I don't store anyway) but not necessarily when compressed, as the compression will do better on text files.
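To make the endianness and integer-size issues concrete, a binary format would have to pin them down explicitly in code along these lines (an illustration only, not tied to any particular proposal):

    // Illustration of the bookkeeping a binary format needs: always store
    // fixed-width little-endian integers and convert explicitly, instead of
    // writing raw host-order structs.
    #include <cstdint>
    #include <cstdio>

    void WriteUint32LE(std::FILE* f, uint32_t value) {
        unsigned char bytes[4] = {
            static_cast<unsigned char>(value & 0xFF),
            static_cast<unsigned char>((value >> 8) & 0xFF),
            static_cast<unsigned char>((value >> 16) & 0xFF),
            static_cast<unsigned char>((value >> 24) & 0xFF),
        };
        std::fwrite(bytes, 1, 4, f);   // same byte order on every architecture
    }

    int main() {
        std::FILE* f = std::fopen("ints.bin", "wb");
        if (f) { WriteUint32LE(f, 123456789u); std::fclose(f); }
    }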
Regards, - Byrial
+1
And given how messy the revision data can be, having the possibility of actually inspecting it with a text editor is a great boon.
That said, there may be other use cases that I am not aware of for which a binary format might be useful, but if you just need to parse and pipe to a DB, text is the best option.
Giovanni
A reply to all those who basically want to keep the current XML dumps:
I have decided to change the primary way of reading the dumps: it will now be a command line application that outputs the data as uncompressed XML, in the same format as current dumps.
This way, you should be able to use the new dumps with minimal changes to your code.
Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.
Petr Onderka
Thanks, that sounds like a good solution.
Petr, could you please elaborate on this last claim? If turning the dump generation into an incremental process is the task you are interested in solving, then I don't understand how text constitutes a problem. Text files can be appended to like any regular file, and it shouldn't be difficult to do this in a way that keeps the XML structure valid.
As I said, having the possibility to seek and inspect the files manually is a tremendous boon when debugging your code. With what you propose that would still be possible but more complicated, since one cannot seek to a specific position of stdout without going through the whole contents.
Best
Giovanni
The problem is that appending is not enough, especially if you want to keep the current format.
1. With the current format you almost could append new pages, but not new revisions of existing pages, because they belong in the middle of the XML.
2. We also need to handle deletions (and undeletions) of pages and revisions.
3. There are also "current" dumps, which always contain only the most recent revision of a page.
And another advantage of the binary format is that you *can* seek easily. If you're looking for a specific page or revision, you don't have to go through the whole file, you can tell the application what you want, it will look it up and output only that.
Also, even if you couldn't seek, I don't see how this is any worse than the current situation, where you also can't seek to a specific position in the compressed XML (unless you use multistream dumps).
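To sketch what that lookup amounts to (the on-disk index layout here is invented purely for illustration): the application reads a small page-id-to-offset index and then seeks straight to the record it needs.

    // Sketch of index-based random access: load a (pageId -> file offset) index,
    // then seek directly to the wanted page instead of scanning the whole file.
    #include <cstdint>
    #include <fstream>
    #include <map>

    std::map<uint32_t, uint64_t> LoadIndex(std::ifstream& dump,
                                           uint64_t indexOffset, uint32_t entryCount) {
        std::map<uint32_t, uint64_t> index;
        dump.seekg(static_cast<std::streamoff>(indexOffset));
        for (uint32_t i = 0; i < entryCount; ++i) {
            uint32_t pageId = 0;
            uint64_t offset = 0;
            dump.read(reinterpret_cast<char*>(&pageId), sizeof pageId);
            dump.read(reinterpret_cast<char*>(&offset), sizeof offset);
            index[pageId] = offset;
        }
        return index;
    }

    bool SeekToPage(std::ifstream& dump, const std::map<uint32_t, uint64_t>& index,
                    uint32_t pageId) {
        auto it = index.find(pageId);
        if (it == index.end()) return false;                  // page not in this dump
        dump.seekg(static_cast<std::streamoff>(it->second));  // jump straight to the page record
        return true;
    }

    int main() {
        std::ifstream dump("dump.id", std::ios::binary);
        // offset and count are placeholders; a real file would store them in its header
        auto index = LoadIndex(dump, 0, 0);
        SeekToPage(dump, index, 12345);
    }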
Petr Onderka
it will now be a command line application that outputs the data as uncompressed XML, in the same format as current dumps.
That will help a great deal. But I assume your application will be for Linux only? So it would help to still generate the current compressed dumps, as a post-processing step, and store them online for download.
One of the reasons for XML dumps is platform independence, both from the producer side (we had ever-evolving SQL dumps earlier) and the consumer side (not everyone uses Linux).
Erik Zachte
I'm primarily a Windows guy, so I'm trying to write the code in a portable way and I will make sure the application works on both Linux and Windows.
Petr Onderka
At 03-07-2013 18:29, Petr Onderka wrote:
I'm primarily a Windows guy, so I'm trying to write the code in a portable way and I will make sure the application works on both Linux and Windows.
That sounds good. Just remember that portable not only means that it works on different operating systems on the same computer architecture, but also on different architectures. What programming language do you intend to use?
I'm writing it in C++. If you want, you can follow my progress in the operations/dumps/incremental repo, branch gsoc [1] (though there's almost nothing there yet). And I don't have any computers with a non-x86 architecture, so I won't be able to test that.
[1]: https://git.wikimedia.org/log/operations%2Fdumps%2Fincremental/refs%2Fheads%...
You should look into maybe using cmake or some other automated build system to handle the cross-platform compatibility. Also, are you planning on using C++11 features? (Just asking because I'm a big C++11 fan. ;) ).
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On Wed, Jul 3, 2013 at 11:29 PM, Tyler Romeo tylerromeo@gmail.com wrote:
You should look into maybe using cmake or some other automated build system to handle the cross-platform compatibility.
I will look into that.
Also, are you planning on using C++11 features? (Just asking because I'm a big C++11 fan. ;) ).
Yeah, I'm already using unique_ptr. And I will use lambdas if I think they would be useful in some code.
Petr Onderka
Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.
First, glad to see there's motion here.
It's definitely true that recompressing the entire history to .bz2 or .7z goes very, very slowly. Also, I don't know of an existing tool that lets you just insert new data here and there without compressing all of the unchanged data as well. Those point towards some sort of format change.
I'm not sure a new format has to be sparse or indexed to get around those two big problems.
For full-history dumps, delta coding (or the related idea of long-range redundancy compression) runs faster than bzip2 or 7z and produces good compression ratios on full-history dumps, based on some tests (https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3). (I'm going to focus mostly on full-history dumps here because they're the hard case and the one Ariel said is currently painful--not everything here will apply to latest-revs dumps.)
For inserting data, you do seemingly need to break the file up into independently-compressed sections containing just one page's revision history or a fragment of it, so you can add new diff(s) to a page's revision history without decompressing and recompressing the previous revisions. (Removing previously-dumped revisions is another story, but it's rarer.) You'd be in new territory just doing that; I don't know of existing compression tools that really allow that.
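For what it's worth, the core of that is doable with an off-the-shelf compressor: compress each page's batch of revisions as its own block and remember where it starts, so a later batch can be appended without touching earlier blocks. A rough sketch with zlib (the layout is invented for illustration, error handling kept minimal):

    // Append one page's revision batch as an independently deflate-compressed
    // block and record (pageId, offset, length) so it can be read back alone later.
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>
    #include <zlib.h>

    struct BlockIndexEntry { uint32_t pageId; long offset; unsigned long length; };

    BlockIndexEntry AppendPageBlock(std::FILE* f, uint32_t pageId,
                                    const std::string& revisions) {
        uLongf destLen = compressBound(revisions.size());
        std::vector<Bytef> dest(destLen);
        compress2(dest.data(), &destLen,
                  reinterpret_cast<const Bytef*>(revisions.data()),
                  revisions.size(), Z_BEST_COMPRESSION);
        std::fseek(f, 0, SEEK_END);                 // new blocks always go at the end
        long offset = std::ftell(f);
        std::fwrite(dest.data(), 1, destLen, f);
        return BlockIndexEntry{pageId, offset, destLen};
    }

    int main() {
        std::FILE* f = std::fopen("history.blocks", "ab+");
        if (!f) return 1;
        BlockIndexEntry e = AppendPageBlock(f, 42, "<revision>...</revision>");
        std::printf("page %u stored at offset %ld, %lu compressed bytes\n",
                    e.pageId, e.offset, e.length);
        std::fclose(f);
    }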
You could do those two things, though, while still keeping full-history dumps a once-every-so-often batch process that produces a sorted file. The time to rewrite the file, stripped of the big compression steps, could be bearable--a disk can read or write about 100 MB/s, so just copying the 70G of the .7z enwiki dumps is well under an hour; if the part bound by CPU and other steps is smallish, you're OK.
A format like the proposed one, with revisions inserted wherever there's free space when they come in, will also eventually fragment the revision history for one page (I think Ariel alluded to this in some early notes). Unlike sequential reads/writes, seeks are something HDDs are sadly pretty slow at (hence the excitement about solid-state disks); if thousands of revisions are coming in a day, it eventually becomes slow to read things in the old page/revision order, and you need fancy techniques to defrag (maybe a big external-memory sort, http://en.wikipedia.org/wiki/External_sorting) or you need to only read the dump on fast hardware that can handle the seeks. Doing occasional batch jobs that produce sorted files could help avoid the fragmentation question.
There's a great quote about the difficulty of "constructing a software design...to make it so simple that there are obviously no deficiencies." (Wikiquote came through with the full text/attribution, of course: http://en.wikiquote.org/wiki/C._A._R._Hoare.) I admit it's tricky and people can disagree about what's simple enough or even what approach is simpler of two choices, but it's something to strive for.
Anyway, I'm wary about going into the technical weeds of other folks' projects, because, hey, it's your project! I'm trying to map out the options in the hope that you could get a product you're happier with and maybe give you more time in a tight three-month schedule to improve on your work and not just complete it. Whatever you do, good luck and I'm interested to see the results!
On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer randall@wawd.com wrote:
Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.
First, glad to see there's motion here.
It's definitely true that recompressing the entire history to .bz2 or .7z goes very, very slowly. Also, I don't know of an existing tool that lets you just insert new data here and there without compressing all of the unchanged data as well. Those point towards some sort of format change.
I'm not sure a new format has to be sparse or indexed to get around those two big problems.
For full-history dumps, delta coding (or the related idea of long-range redundancy compression) runs faster than bzip2 or 7z and produces good compression ratios on full-history dumps, based on some tests (https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3). (I'm going to focus mostly on full-history dumps here because they're the hard case and the one Ariel said is currently painful--not everything here will apply to latest-revs dumps.)
For inserting data, you do seemingly need to break the file up into independently-compressed sections containing just one page's revision history or a fragment of it, so you can add new diff(s) to a page's revision history without decompressing and recompressing the previous revisions. (Removing previously-dumped revisions is another story, but it's rarer.) You'd be in new territory just doing that; I don't know of existing compression tools that really allow that.
You could do those two things, though, while still keeping full-history dumps a once-every-so-often batch process that produces a sorted file. The time to rewrite the file, stripped of the big compression steps, could be bearable--a disk can read or write about 100 MB/s, so just copying the 70G of the .7z enwiki dumps is well under an hour; if the part bound by CPU and other steps is smallish, you're OK.
A format like the proposed one, with revisions inserted wherever there's free space when they come in, will also eventually fragment the revision history for one page (I think Ariel alluded to this in some early notes). Unlike sequential reads/writes, seeks are something HDDs are sadly pretty slow at (hence the excitement about solid-state disks); if thousands of revisions are coming in a day, it eventually becomes slow to read things in the old page/revision order, and you need fancy techniques to defrag (maybe a big external-memory sort, http://en.wikipedia.org/wiki/External_sorting) or you need to only read the dump on fast hardware that can handle the seeks. Doing occasional batch jobs that produce sorted files could help avoid the fragmentation question.
These are some interesting ideas.
You're right that copying the whole dump is fast enough (it would probably add about an hour to a process that currently takes several days). But it would also pretty much force the use of delta compression. And while I would like to use delta compression, I don't think it's a good idea to be forced to use it, because I might not have the time for it or it might not be good enough.
Because of that, I decided to stay with my indexed approach.
There's a great quote about the difficulty of "constructing a software design...to make it so simple that there are obviously no deficiencies." (Wikiquote came through with the full text/attribution, of course: http://en.wikiquote.org/wiki/C._A._R._Hoare.) I admit it's tricky and people can disagree about what's simple enough or even what approach is simpler of two choices, but it's something to strive for.
Anyway, I'm wary about going into the technical weeds of other folks' projects, because, hey, it's your project! I'm trying to map out the options in the hope that you could get a product you're happier with and maybe give you more time in a tight three-month schedule to improve on your work and not just complete it. Whatever you do, good luck and I'm interested to see the results!
Feel free to comment more. I am the one implementing the project, but that's all. Input from others is always welcome.
Petr Onderka
Feel free to comment more. I am the one implementing the project, but that's all. Input from others is always welcome.
If I had one more (very late) comment, it would be to glance at existing libraries for building blocks.
For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level options like SQLite are widely used. LevelDB (https://code.google.com/p/leveldb/) is pretty cool too.
For delta coding, there's xdelta3 (http://xdelta.org/), open-vcdiff (https://code.google.com/p/open-vcdiff/), and Git's delta code (https://github.com/git/git/blob/master/diff-delta.c and https://github.com/git/git/blob/master/patch-delta.c; background at http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized). (rzip/rsync, http://rzip.samba.org/, are wicked awesome, but not as easy to just drop in as a library.)
Some of the libraries are already used by a lot of projects, which helps make their formats a sort of de-facto standard--i.e., we'll need tools to read BDB and SQLite for a long time. That might help slightly with concerns some folks raised about relying on a new binary format for long-term archival. BDB 4, GDBM, and SQLite3 modules are part of the Python distribution, for example, and xdelta3 reads/writes VCDIFF (a format, http://tools.ietf.org/html/rfc3284, also used by Chrome for Shared Dictionary Compression over HTTP, http://en.wikipedia.org/wiki/Shared_Dictionary_Compression_Over_HTTP) and is supposed to be buildable as a module too (https://code.google.com/p/xdelta/wiki/LanguageInterface).
Hope this helps.
Randall
For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level options like SQLite are widely used. LevelDB (https://code.google.com/p/leveldb/) is pretty cool too.
I think that with the amount of data we're dealing with, it makes sense to have the file format under tight control. For example, saving a single byte on each revision means total savings of ~500 MB for enwiki.
In any case, at this point it would be more work to switch to one of those than to keep using the format I created.
For delta coding, there's xdelta3 (http://xdelta.org/), open-vcdiff (https://code.google.com/p/open-vcdiff/), and Git's delta code (https://github.com/git/git/blob/master/diff-delta.c and https://github.com/git/git/blob/master/patch-delta.c; background at http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized). (rzip/rsync, http://rzip.samba.org/, are wicked awesome, but not as easy to just drop in as a library.)
I'm certainly going to try to use some library for delta compression, because they seem to do pretty much exactly what's needed here. Thanks for the suggestions.
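To show why it's worth the trouble, here is a deliberately naive stand-in (shared prefix/suffix only; VCDIFF/xdelta find matches anywhere in the text, so they do far better): most revisions differ from the previous one by only a small edit, so storing just the difference shrinks them dramatically.

    // Toy delta coder, for illustration only: store just the part of the new
    // revision that differs from the previous one (shared prefix and suffix
    // are dropped). Real delta libraries find matches anywhere in the text.
    #include <algorithm>
    #include <cstdio>
    #include <string>

    struct Delta { size_t prefixLen; size_t suffixLen; std::string middle; };

    Delta Encode(const std::string& prev, const std::string& next) {
        size_t prefix = 0;
        size_t maxPrefix = std::min(prev.size(), next.size());
        while (prefix < maxPrefix && prev[prefix] == next[prefix]) ++prefix;
        size_t suffix = 0;
        size_t maxSuffix = std::min(prev.size(), next.size()) - prefix;
        while (suffix < maxSuffix &&
               prev[prev.size() - 1 - suffix] == next[next.size() - 1 - suffix]) ++suffix;
        return Delta{prefix, suffix,
                     next.substr(prefix, next.size() - prefix - suffix)};
    }

    std::string Decode(const std::string& prev, const Delta& d) {
        return prev.substr(0, d.prefixLen) + d.middle +
               prev.substr(prev.size() - d.suffixLen);
    }

    int main() {
        std::string r1 = "The quick brown fox jumps over the lazy dog.";
        std::string r2 = "The quick brown fox leaps over the lazy dog.";
        Delta d = Encode(r1, r2);
        std::printf("stored %zu bytes instead of %zu; round-trip ok: %d\n",
                    d.middle.size(), r2.size(), Decode(r1, d) == r2 ? 1 : 0);
    }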
Petr Onderka
Hi there,
In principle, I understand the need for binary formats and compression in a context with limited resources. On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
Playing the devil's advocate, I could even argue that you should keep the data dumps in plain text, and keep your processing dead simple, and then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed or on the fly.
Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements are for this new incremental update format, and why. Therefore, it is not easy to provide input and help.
Cheers. - Nicolas Torzec.
PS: Anyway, thanks a lot for your great work on the data backends, behind the scenes ;)
+1
The simplest possible dump format is the best, and there's already a thriving ecosystem around the current XML dumps, which would be broken by moving to a binary format. Binary file formats and APIs defined by code are not the way to go if you want long-term archival that can endure through decades of technological change.
If more money is needed for dump processing, it should be budgeted for and added to the IT budget, instead of over-optimizing by using a potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a core part of the Foundation's mission. The value is in the data, which is priceless. Computers and storage are (relatively) cheap by comparison, and Wikipedia is growing significantly more slowly than the year-on-year improvements in storage, processing and communication links. Moreover, re-making the dumps every time provides defence in depth against subtle database corruption that might slowly corrupt a database dump.
Please keep the dumps themselves simple and their format stable, and, as Nicolas says, do the clever stuff elsewhere, in which you can use whatever efficient representation you like to do the processing.
Neil
On Tuesday, 02-07-2013, at 11:47 +0100, Neil Harris wrote:
The simplest possible dump format is the best, and there's already a thriving ecosystem around the current XML dumps, which would be broken by moving to a binary format. Binary file formats and APIs defined by code are not the way to go if you want long-term archival that can endure through decades of technological change.
If more money is needed for dump processing, it should be budgeted for and added to the IT budget, instead of over-optimizing by using a potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a core part of the Foundation's mission. The value is in the data, which is priceless. Computers and storage are (relatively) cheap by comparison, and Wikipedia is growing significantly more slowly than the year-on-year improvements in storage, processing and communication links. Moreover, re-making the dumps every time provides defence in depth against subtle database corruption that might slowly corrupt a database dump.
A point of information: we already do not produce dumps every time from scratch; we re-use old revisions because if we did not it would take months and months to generate the en wikipedia dumps, something which is clearly untenable.
The question now is how we are going to use those old revisions. Right now we uncompress the entire previous dump, write new information where needed, and recompress it all (which would take several weeks for en wikipedia history dumps if we didn't run 27 jobs at once).
What I hope for is a format that allows dumps to be produced much more rapidly, where the time to produce the incrementals grows only as the number of edits per time frame grows, and where the time to produce new fulls via the incrementals is bounded in a much better fashion than we have now.
And I expect that we would have a library or scripts that provide for conversion of a new-format dump to the good old XML, so that all the tools folks use now will continue to work.
Ariel
Sorry, reading back over this thread late.
What I hope for is a format that allows dumps to be produced much more rapidly, where the time to produce the incrementals grows only as the number of edits per time frame grows
Curious: what's happening currently that makes the time to produce incrementals grow more quickly than that?
On Sunday, 07-07-2013, at 21:09 -0700, Randall Farmer wrote:
Sorry, reading back over this thread late.
What I hope for is a format that allows dumps to be produced much more rapidly, where the time to produce the incrementals grows only as the number of edits per time frame grows
Curious: what's happening currently that makes the time to produce incrementals grow more quickly than that?
We don't produce true incrementals now; we produce 'adds/changes' dumps which don't account for deletions, oversights, page moves, etc. And you can't add them onto a full to get a new full. When they are produced, I want them to behave as described above.
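Concretely, a true incremental format would need record types for more than just added revisions; something along these lines (the names are invented for illustration, not an actual design):

    // Hypothetical set of diff-dump record types needed for true incrementals:
    // adds/changes alone are not enough once deletions, undeletions, oversights
    // and page moves have to be replayed onto a previous full dump.
    #include <cstdint>
    #include <string>

    enum class DiffRecordType : uint8_t {
        NewPage,
        NewRevision,
        DeleteRevision,        // e.g. oversighted/suppressed revision
        UndeleteRevision,
        DeletePage,
        UndeletePage,
        MovePage               // title changes, page id stays the same
    };

    struct DiffRecord {
        DiffRecordType type;
        uint32_t pageId;
        uint32_t revisionId;   // unused for page-level records
        std::string newTitle;  // only used for MovePage
    };

    int main() {
        DiffRecord moved{DiffRecordType::MovePage, 12345, 0, "New title"};
        (void)moved;           // a real diff dump would serialize records like this one
    }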
Ariel