For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of how I imagine the file format will look is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome.
Petr Onderka [[User:Svick]]
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
What is the intended format of the dump files? The page makes it sound like it will have a binary format, which I'm not opposed to, but is definitely something you should decide on.
Also, I really like the idea of writing it in a low-level language and then having bindings for something higher. However, unless you plan on having multiple language bindings (e.g., *both* C# and Python), you may want to pick a different route. For example, if you decide to only bind to Python, you can use something like Cython, which would allow you to write pseudo-Python that is still compiled to C. Of course, if you want multiple language bindings, this is likely no longer an option.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
What is the intended format of the dump files? The page makes it sound like it will have a binary format, which I'm not opposed to, but is definitely something you should decide on.
Yes, it is a binary format, I will make that clearer on the page.
The advantage of a binary format is that it's smaller, which I think is quite important.
I think the main advantages of text-based formats are that there are lots of tools for the common ones (XML and JSON) and that they are human readable. But those tools wouldn't be very useful here, because we certainly want to have some sort of custom compression scheme and the tools wouldn't be able to work with that. And I think human readability is mostly useful if we want others to be able to write their own code that directly accesses the data. And, because of the custom compression, doing that won't be that easy anyway. And hopefully, it won't be necessary, because there will be a nice library usable by everyone (see below).
Also, I really like the idea of writing it in a low-level language and then having bindings for something higher. However, unless you plan on having multiple language bindings (e.g., *both* C# and Python), you may want to pick a different route. For example, if you decide to only bind to Python, you can use something like Cython, which would allow you to write pseudo-Python that is still compiled to C. Of course, if you want multiple language bindings, this is likely no longer an option.
Right now, everyone can read the dumps in their favorite language. If I write the library interface well, writing bindings for it for another language should be relatively trivial, so everyone can keep using their favorite language.
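To illustrate what I mean (this is only a sketch; the library and every function name in it are hypothetical, nothing is implemented yet), a Python binding over a C core could be little more than a thin ctypes wrapper:

    import ctypes

    # Hypothetical C core library; none of these names exist yet.
    lib = ctypes.CDLL("libincrdumps.so")

    # Assumed C signatures that the core library would expose.
    lib.idump_open.argtypes = [ctypes.c_char_p]
    lib.idump_open.restype = ctypes.c_void_p
    lib.idump_revision_text.argtypes = [ctypes.c_void_p, ctypes.c_uint64]
    lib.idump_revision_text.restype = ctypes.c_char_p
    lib.idump_close.argtypes = [ctypes.c_void_p]

    class IncrementalDump:
        """Thin wrapper; bindings for other languages would wrap the same C API."""
        def __init__(self, path):
            self._handle = lib.idump_open(path.encode("utf-8"))

        def revision_text(self, rev_id):
            return lib.idump_revision_text(self._handle, rev_id).decode("utf-8")

        def close(self):
            lib.idump_close(self._handle)

A C# binding would do the same thing through P/Invoke, so the per-language work really would be small.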
And I admit, I'm proposing doing it this way partially for selfish reasons: I'd like to use this library in my future C# code. But I realize creating something that works only in C# doesn't make sense, because most people in this community don't use it. So, to me, writing the code so that it can be used from anywhere makes the most sense.
Petr Onderka
On 07/01/2013 12:48:11 PM, Petr Onderka - gsvick@gmail.com wrote:
What is the intended format of the dump files? The page makes it sound like it will have a binary format, which I'm not opposed to, but is definitely something you should decide on.
Yes, it is a binary format, I will make that clearer on the page.
The advantage of a binary format is that it's smaller, which I think is quite important.
In my experience binary formats have very little to recommend them.
They are definitely more obscure. They sometimes suffer from endian problems. They require special code to read and write.
In my experience, the notion that they offer an advantage by being "smaller" is somewhat misguided.
In particular, with XML, there is generally a very high degree of redundancy in the text, far more than in normal writing.
The consequence of this regularity is that text based XML often compresses very, very well.
I remember one particular instance where we were generating 30-50 Megabytes of XML a day and needed to send it from the USA to the UK every day, in a situation where our leased data rate was really limiting. We were surprised and pleased to discover that zipping the files reduced them to only 1-2 MB. I have been skeptical of claims that binary formats are more efficient on the wire (where it matters most) ever since.
I think you should do some experiments versus compressed XML to justify your claimed benefits of using a binary format.
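Such an experiment is easy to script; here is a rough sketch with Python's standard bz2 and lzma modules (the XML is synthetic and only stands in for real dump data, so the exact ratios will differ):

    import bz2
    import lzma

    # Synthetic stand-in for dump data: many near-identical <revision> records.
    record = (
        "<revision><id>%d</id><timestamp>2013-07-01T00:00:00Z</timestamp>"
        "<contributor><username>Example</username></contributor>"
        "<text>Some mostly unchanged article text goes here.</text></revision>\n"
    )
    xml = ("<page>" + "".join(record % i for i in range(10000)) + "</page>").encode("utf-8")

    print("raw bytes: ", len(xml))
    print("bz2 bytes: ", len(bz2.compress(xml)))
    print("lzma bytes:", len(lzma.compress(xml)))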
Jim
<snip>
Compressed XML is what the current dumps use and it doesn't work well because:
- it can't be edited
- it doesn't support seeking
I think the only way to solve this is "obscure" and requires special code to read and write. (And endianness is not a problem if the specification says which one it uses and the implementation sticks to it.)
Theoretically, I could use compressed XML in internal data structures, but I think that just combines the disadvantages of both.
So, the size is not the main reason not to use XML, it's just one of the reasons.
Petr Onderka
Petr is right on this one. The purpose of this version 2 for dumps is to allow protocol-specific incremental updating of the dump, which would be significantly more difficult in a non-binary format.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 01.07.2013 22:56, Tyler Romeo wrote:
Petr is right on this one. The purpose of this version 2 for dumps is to allow protocol-specific incremental updating of the dump, which would be significantly more difficult in a non-binary format.
Why can't the dumps just be split into daily or weekly XML files (optionally compressed ones)? That way, seeking would be performed by simply opening the YYYY.MM.DD.xml file. It is so much simpler than going for binary git-like formats, which would take a bit less space but are more prone to bugs and impossible to extract and analyze/edit via text/XML processing utils.
Dmitriy
I think this would work well only for the use case where you're always looking through the whole history of all pages.
How would you find the current revision of a specific page? Or all revisions of a page? What if you don't want the whole history, just current versions of all pages? And don't forget about deletions (and undeletions).
You could somewhat solve some of these problems (e.g. by adding indexes), but I don't think you can solve all of them.
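To make the first of those questions concrete, here is a purely illustrative sketch (not code for any existing tool) of what "get the current text of one page" would look like against date-split XML files; you end up scanning whole files newest-first, and even per-file indexes would still leave you probing file after file:

    import glob
    import xml.etree.ElementTree as ET

    def find_current_revision(title, dump_dir):
        # Assumed layout: one YYYY.MM.DD.xml file per day, so sorted names = sorted dates.
        for path in sorted(glob.glob(dump_dir + "/*.xml"), reverse=True):
            for _, elem in ET.iterparse(path):
                if elem.tag == "page" and elem.findtext("title") == title:
                    # First <page> found in the newest file that mentions the page;
                    # a real tool would still have to pick the latest revision inside it.
                    return elem.findtext("revision/text")
        return None  # never edited in the covered period, or deleted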
Petr Onderka
Instead of XML "or" a proprietary binary format, could we try using a standard binary format such as Protocol Buffers as a base, to reduce the issues with having to implement the reading/writing in multiple languages?
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
Protocol Buffers are not a bad idea, but I'm not sure about their overhead.
AFAIK, PB have overhead of 1 byte per field. If I'm counting correctly, with enwiki's 600M revisions and 8 fields per revision, that means total overhead of more than 4 GB. The fixed-size part of all revisions (i.e. without comment and text) amounts to ~22 GB. I think this means PB have too much overhead.
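For reference, the arithmetic behind that estimate (all numbers rough):

    revisions = 600 * 10**6        # rough number of enwiki revisions
    fields_per_revision = 8
    tag_bytes_per_field = 1        # assumed ~1 byte of field-tag overhead per field

    overhead = revisions * fields_per_revision * tag_bytes_per_field
    print(overhead / 2**30, "GiB")   # ~4.5 GiB of pure framing overhead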
The overhead could be alleviated by using compression, but I didn't intend to compress metadata.
So, I think I will start without PB. If I later decide to compress metadata, I will also try to use PB and see if it works.
Also, I think that reading the binary format isn't going to be the biggest issue if you're implementing your own library for incremental dumps, especially if I'm going to use delta compression of revision texts.
Petr Onderka
On Mon, 01-07-2013, at 16:00 +0200, Petr Onderka wrote:
For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of how I imagine the file format will look is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome.
Petr Onderka [[User:Svick]]
Dumps v 2.0 finally on the horizon!
A few comments/questions:
I was envisioning that we would produce "diff dumps" in one pass (presumably in a much shorter time than the fulls we generate now) and would apply those against previous fulls (in the new format) to produce new fulls, hopefully also in less time. What do you have in mind for the production of the new fulls?
It might be worth seeing how large the resulting en wp history files are going to be if you compress each revision separately for version 1 of this project. My fear is that even with 7z it's going to make the size unwieldy. If the thought is that it's a first-round prototype, not meant to be run on large projects, that's another story.
I'm not sure about removing the restrictions data; someone must have wanted it, like the other various fields that have crept in over time. And we should expect there will be more such fields over time...
We need to get some of the wikidata users in on the model/format discussion, to see what use they plan to make of those fields and what would be most convenient for them.
It's quite likely that these new fulls will need to be split into chunks much as we do with the current en wp files. I don't know what that would mean for the diff files. Currently we split in an arbitrary way based on sequences of page numbers, writing out separate stub files and using those for the content dumps. Any thoughts?
Ariel
I was envisioning that we would produce "diff dumps" in one pass (presumably in a much shorter time than the fulls we generate now) and would apply those against previous fulls (in the new format) to produce new fulls, hopefully also in less time. What do you have in mind for the production of the new fulls?
What I originally imagined is that the full dump will be modified directly and a description of the changes made to it will also be written to the diff dump. But now I think that creating the diff and then applying it makes more sense, because it's simpler. But I also think that doing the two at the same time will be faster, because it's less work (no need to read and parse the diff). So what I imagine now is something like this:
1. Read information about a change in a page/revision
2. Create diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
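Or, as a rough sketch in code (every name here is a placeholder, none of this exists yet):

    def produce_dumps(changes, full_dump, diff_dump):
        # Single pass: the diff dump and the updated full dump are produced together.
        for change in changes:              # 1. read info about a changed page/revision
            diff = make_diff(change)        # 2. create the diff object in memory
            diff_dump.write(diff)           # 3. write it to the diff file
            full_dump.apply(diff)           # 4. apply the same object to the full dump
        diff_dump.flush()
        full_dump.flush()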
It might be worth seeing how large the resulting en wp history files are going to be if you compress each revision separately for version 1 of this project. My fear is that even with 7z it's going to make the size unwieldy. If the thought is that it's a first-round prototype, not meant to be run on large projects, that's another story.
I do expect that a full dump of enwiki using this compression would be way too big. So yes, this was meant just to have something working, so that I can concentrate on doing compression properly later (after the mid-term).
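Measuring that should be straightforward once there is some sample data; a sketch of the comparison (LZMA stands in for 7z here, and the revision texts are synthetic, so treat the numbers only as an illustration of the effect):

    import lzma

    # Synthetic revisions: consecutive versions of one page differ only slightly,
    # which is exactly why per-revision compression loses so much.
    base = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
    revisions = [(base + "Edit number %d." % i).encode("utf-8") for i in range(500)]

    per_revision = sum(len(lzma.compress(r)) for r in revisions)
    whole_stream = len(lzma.compress(b"".join(revisions)))

    print("each revision compressed separately:", per_revision)
    print("whole history compressed together: ", whole_stream)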
I'm not sure about removing the restrictions data; someone must have wanted it, like the other various fields that have crept in over time. And we should expect there will be more such fields over time...
If I understand the code in XmlDumpWriter.openPage correctly, that data comes from the page_restrictions field [1], which doesn't seem to be used in non-ancient versions of MediaWiki.
I did think about versioning the page and revision objects in the dump, but I'm not sure how exactly to handle upgrades from one version to another. For now, I think I'll have just one global "data version" per file, but I'll make sure that adding a version to each object in the future will be possible.
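Concretely, the header could carry the version like this (the layout is only an illustration, not the final format):

    import struct

    MAGIC = b"MWID"                      # hypothetical magic bytes
    HEADER = struct.Struct("<4sBBQ")     # magic, file-format version, data version, page count

    def write_header(f, data_version, page_count):
        f.write(HEADER.pack(MAGIC, 1, data_version, page_count))

    def read_header(f):
        magic, format_version, data_version, page_count = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not an incremental dump file")
        return format_version, data_version, page_count

Adding a per-object version later would then just mean one extra byte in the page/revision records and a bump of the file-format version.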
We need to get some of the wikidata users in on the model/format discussion, to see what use they plan to make of those fields and what would be most convenient for them.
It's quite likely that these new fulls will need to be split into chunks much as we do with the current en wp files. I don't know what that would mean for the diff files. Currently we split in an arbitrary way based on sequences of page numbers, writing out separate stub files and using those for the content dumps. Any thoughts?
If possible, I would prefer to keep everything in a single file. If that won't be possible, I think it makes sense to split on page ids, but make the split id visible (probably in the file name) and unchanging from month to month. If it turns out that a single chunk grows too big, we might consider adding a "split" instruction to diff dumps, but that's probably not necessary now.
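If splitting does turn out to be needed, the stable split could be as simple as deriving the chunk number from the page id; everything below (chunk size, naming) is just an illustration:

    PAGES_PER_CHUNK = 1000000   # hypothetical fixed chunk size

    def chunk_file_name(page_id, wiki="enwiki", date="20130701"):
        # A page always lands in the same chunk, regardless of the dump date.
        chunk = page_id // PAGES_PER_CHUNK
        return "%s-%s-pages-chunk%04d.dump" % (wiki, date, chunk)

    print(chunk_file_name(12))        # enwiki-20130701-pages-chunk0000.dump
    print(chunk_file_name(2500000))   # enwiki-20130701-pages-chunk0002.dump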
Petr Onderka
[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions
How are you dealing with extensibility?
We need to be able to extend the format. The fields of data we need to export change over time (just look at the changelog for our export's XSD file https://www.mediawiki.org/xml/export-0.7.xsd).
Here are some things in that XML format you are missing in the incremental:
- Redirect info
- Upload info
- Log items
- Liquid Threads support
And something that I don't think we've thought about support for in our current export format, ContentHandler. There's metadata for it missing from our dumps and the data format is somewhat different than our text dumps have traditionally expected.
Hello,
My wiki's giving an error generating SVG thumbnails, e.g.
Cannot parse integer value '-h214' for -w
Has anyone come across a solution for this? I'm seeing it on many sites around the net including my own - I think it started after I upgraded to 1.19.
Here's a live example: http://www.organicdesign.co.nz/File:Nginx-logo.svg
Thanks, Aran
Hi,
On Mon, 2013-07-01 at 20:54 -0300, Aran wrote:
My wiki's giving an error generating SVG thumbnails, e.g. Cannot parse integer value '-h214' for -w
Does this refer to creating bitmap thumbnails from SVG files? In that case, which SVGConverter is used to generate thumbnails?
andre
This is most likely bug 45054, fixed in MediaWiki 1.21. It has a rather simple workaround, too, see https://bugzilla.wikimedia.org/show_bug.cgi?id=45054 .
Yep that's my problem, thanks a lot :-)
Hi Guys,
I've just upgraded my wiki from 1.19.2 to 1.21.1 to fix the SVG rendering problem which now is all fine, but now my Math rendering has broken. I'm getting the following error:
Failed to parse (PNG conversion failed; check for correct installation of latex and dvipng (or dvips + gs + convert))
This error seems very common, but none of the solutions I've found have worked (creating latex.fmt, running fmtutil-sys --all, setting $wgTexvc etc).
All the packages are installed and were running fine for 1.19, I've downloaded Extension:Math for 1.21 and ran 'make' which generated a texvc binary with no errors.
Any ideas what may be wrong?
Thanks, Aran
I've found that the logged shell command actually does execute properly and creates the .png when executed manually from shell - even when I execute it as the www-data user that the web-server runs as.
But from the wiki it creates the tmp/hash.tex file, but not the png, and there's nothing logged anywhere to say why it's not been able to do it.
On Mon, Jul 1, 2013 at 10:15 PM, Daniel Friesen daniel@nadir-seen-fire.com wrote:
How are you dealing with extensibility?
We need to be able to extend the format. The fields of data we need to export change over time (just look at the changelog for our export's XSD file https://www.mediawiki.org/xml/export-0.7.xsd).
I have touched on this in answer to Ariel's email. I think that for now, there will be just a single data version number in the header of the dump file. But I will make sure to leave the possibility of having a version number on each object open.
Here are some things in that XML format you are missing in the incremental:
- Redirect info
- Upload info
- Log items
- Liquid Threads support
I should have gone to the source instead of assuming that looking at a few samples is enough. I will add redirect and upload info to the format description.
As far as I know, log items are in a separate XML dump and I'm not planning to replace that one.
Unless I'm mistaken, Liquid Threads don't have much of a future and are used only on a few wikis like mediawiki.org. Does anyone actually use this information from the dumps?
And something that I don't think we've thought about support for in our current export format, ContentHandler. There's metadata for it missing from our dumps and the data format is somewhat different than our text dumps have traditionally expected.
The current dumps already store model and format. Is there something else needed for ContentHandler? The dumps don't really care what the format or encoding of the revision text is; it's just a byte stream to them.
Petr Onderka
On Tue, Jul 2, 2013 at 2:18 PM, Petr Onderka gsvick@gmail.com wrote:
Unless I'm mistaken, Liquid Threads don't have much of a future and are used only on few wikis like mediawiki.org. Does anyone actually use this information from the dumps?
LiquidThreads is an extension. I don't think extension dumps are within the scope of this, unless we provide some sort of generic "extensions can add stuff to the dump" hook.
The current dumps already store model and format.
Is there something else needed for ContentHandler? The dumps don't really care what is the format or encoding of the revision text, it's just a byte stream to them.
I'm not familiar with the current dump format, but what is being referred to here is that if you set $wgContentHandlerUseDB to true, then the content type (i.e., whether it is Wikitext, or JS/CSS, etc.) can be stored in the database rather than being determined statically by namespace.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com