Hi,
after a month of work on my GSoC project Incremental Dumps [1], I think I now have something worth sharing and talking about, though it's still far from complete.
What the code can do now is to read a pages-history XML dump and create the various kinds of dumps (pages/stub, current/history) in the new format from that. It can then convert a dump in the new format back to XML.
The XML output is almost the same as existing XML dumps, but there are some differences [2]. The current state of the new format also now has a detailed specification [3] (this describes the current version, the format is still in flux and can change daily).
If you want, you can also try running the code. [4] It's not production-quality yet (e.g. it doesn't report errors properly), but it should work. Compilation instructions are in the README file.
Any comments or questions are welcome.
Petr Onderka User:Svick
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
[2]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_o...
[3]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Speci...
[4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc
Petr Onderka wrote:
The XML output is almost the same as existing XML dumps, but there are some differences [2]. The current state of the new format also now has a detailed specification [3] (this describes the current version, the format is still in flux and can change daily).
I didn't participate in the earlier discussion, but here is some late feedback:
- The magic number WMID (WikiMedia Incremental Dump, I guess) should be MWID or MWBD instead.
- The flags are a bit convoluted. Sometimes a flag is used for a feature being present, sometimes for a feature being absent, it can be mingled with options.
- I think the timestamps *are* the number of seconds from the start date (not taking leap seconds into account).
- I don't see the benefit of this storage of maps. If you are searching in-file, you still need to traverse O(n) values after you checked the keys for the one you wanted. If you first load it in memory, it seems preferable to have the value alongside its key.
- Add aliases inside the namespaces map?
- I would consider adding to the page/revision objects pointers/lengths to the next one, for easy traversal.
- Is the redirect target useful?
- I would consider allowing revdeleted fields available in the dump (for private dumps by the owner).
- You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
- On mediawiki, hiding the page text doesn't hide the rev_len (I hid one revision on that page to show this: https://www.mediawiki.org/w/index.php?title=User:Svick/Incremental_dumps/Fil... )
Proposal for the revision flag:
0x01: minor edit
Bits 2-3 deal with the user:
0x00: only user text is provided
0x02: there is userid + user text
0x04: the contributor is an IPv4 anonymous user
0x06: the contributor is an IPv6 anonymous user
The high nibble matches the rev_deleted field:
0x08: this revision has a non-default model (else the format is text/x-wiki)
0x10: the text of this revision was deleted
0x20: the comment of this revision was deleted
0x40: the contributor of this revision was deleted
0x80: the deleted contents are restricted
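To make the proposed layout concrete, here is a minimal sketch of how a reader might decode such a flag byte. The bit values are the ones listed in the proposal; the function name, the dictionary keys and the labels are hypothetical and not part of any specification.

```python
# Bit values copied from the proposal; every name here is hypothetical.
MINOR = 0x01
USER_MASK = 0x06  # bits 2-3 select how the contributor is stored
USER_KINDS = {
    0x00: "user text only",
    0x02: "user id + user text",
    0x04: "IPv4 anonymous",
    0x06: "IPv6 anonymous",
}
NON_DEFAULT_MODEL = 0x08
TEXT_DELETED = 0x10
COMMENT_DELETED = 0x20
CONTRIBUTOR_DELETED = 0x40
RESTRICTED = 0x80


def decode_revision_flags(flags):
    """Split one flag byte into its fields."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return {
        "minor": bool(flags & MINOR),
        "user": USER_KINDS[flags & USER_MASK],
        "non_default_model": bool(flags & NON_DEFAULT_MODEL),
        "text_deleted": bool(flags & TEXT_DELETED),
        "comment_deleted": bool(flags & COMMENT_DELETED),
        "contributor_deleted": bool(flags & CONTRIBUTOR_DELETED),
        "restricted": bool(flags & RESTRICTED),
        # The high nibble shifted down is the rev_deleted bitfield.
        "rev_deleted": flags >> 4,
    }
```

For example, decode_revision_flags(0x13) describes a minor edit by a registered user whose text was deleted, with rev_deleted equal to 1.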
I didn't participate in the earlier discussion, but here is some late feedback:
Thanks for that, better late than never.
- The magic number WMID (WikiMedia Incremental Dump, I guess) should be MWID or MWBD instead.
I guess you're right. If BD is supposed to mean binary dumps, then that could be a reasonable name, except that it's not used anywhere. So I think that MWID makes most sense.
- The flags are a bit convoluted. Sometimes a flag is used for a feature being present, sometimes for a feature being absent, it can be mingled with options.
What specifically do you mean? I think the only place where I didn't go with what was most logical to me is having the flag on for current dump (and off for history dump). I did it this way because it was convenient in my implementation, but I could easily change that.
- I think the timestamps *are* the number of seconds from the start date (not taking leap seconds into account).
Only if each month had 31 days, which is obviously not true.
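One reading of this exchange is that the format packs the date fields in mixed radix with a fixed 31 day slots per month. The sketch below illustrates why such a value is not a count of elapsed seconds. The layout, BASE_YEAR and the function names are assumptions made for illustration; the specification [3] is the authority on the actual encoding.

```python
from datetime import datetime

BASE_YEAR = 2000  # hypothetical start year, not taken from the specification


def pack_timestamp(dt):
    """Mixed-radix packing that gives every month 31 day slots (assumed layout)."""
    months = (dt.year - BASE_YEAR) * 12 + (dt.month - 1)
    days = months * 31 + (dt.day - 1)
    return ((days * 24 + dt.hour) * 60 + dt.minute) * 60 + dt.second


def unpack_timestamp(value):
    """Inverse of pack_timestamp."""
    value, second = divmod(value, 60)
    value, minute = divmod(value, 60)
    value, hour = divmod(value, 24)
    months, day = divmod(value, 31)
    year, month = divmod(months, 12)
    return datetime(BASE_YEAR + year, month + 1, day + 1, hour, minute, second)


def elapsed_seconds(dt):
    """True number of seconds since the start date (no leap seconds)."""
    return int((dt - datetime(BASE_YEAR, 1, 1)).total_seconds())
```

Within the first month the two functions agree, but from the first short month onward the packed value runs ahead of the real second count, because unused day slots are still counted.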
- I don't see the benefit of this storage of maps. If you are searching in-file, you still need to traverse O(n) values after you checked the keys for the one you wanted. If you first load it in memory, it seems preferable to have the value alongside its key.
I am always loading the whole map into memory. You're right, your way makes more sense, I have changed that.
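As a rough illustration of the interleaved layout (each value stored next to its key), here is a sketch that serializes a small namespace-like map. The field widths (32-bit count, 16-bit key, one length byte) and the function names are assumptions chosen for the example, not the layout used by the actual format.

```python
import struct


def encode_map(mapping):
    """Write a count, then each key immediately followed by its value."""
    out = [struct.pack("<I", len(mapping))]
    for key, name in mapping.items():
        data = name.encode("utf-8")
        if len(data) > 255:
            raise ValueError("value too long for a one-byte length prefix")
        out.append(struct.pack("<hB", key, len(data)))
        out.append(data)
    return b"".join(out)


def decode_map(buf):
    """Read the map back in a single pass."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    result = {}
    for _ in range(count):
        key, length = struct.unpack_from("<hB", buf, pos)
        pos += 3  # two bytes of key plus one length byte
        result[key] = buf[pos:pos + length].decode("utf-8")
        pos += length
    return result
```

With this layout a reader can fill the in-memory map in one sequential pass, without first collecting all keys and then matching them to values by position.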
- I would consider adding to the page/revision objects pointers/lengths to the next one, for easy traversal.
I don't see how that would add anything. You can already use indexes for relatively easy traversal.
- Add aliases inside the namespaces map?
- Is the redirect target useful?
- I would consider allowing revdeleted fields available in the dump (for private dumps by the owner).
I'm just mirroring XML dumps (the only exception is the <restrictions> tag, which I'm quite sure is not useful). So, if you think these changes make sense, they would have to be first made in XML dumps and then I would implement them in incremental dumps.
Specifically about redirect target, I was told that this was a requested feature, so some people thought it was useful.
- You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
I think it makes sense to be paranoid about this, since for enwiki history dump, one wasted byte for each revision means 0.5 GB wasted in total.
Saving the SHA-1s directly was on my TODO list, because I assumed it would be relatively complicated. But now I found out that porting wfBaseConvert() would be simple, so I did just that.
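For readers who want to see what the conversion amounts to, here is a minimal sketch in Python rather than a port of wfBaseConvert(). It assumes the base-36 form is the 31-character, zero-padded string MediaWiki uses for rev_sha1, and it picks big-endian byte order for the raw form; both the byte order and the function names are choices made for this example, and the format itself may order the bytes differently.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def raw_to_base36(raw):
    """20 raw SHA-1 bytes -> 31-character base-36 string."""
    n = int.from_bytes(raw, "big")
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)).rjust(31, "0")


def base36_to_raw(text):
    """31-character base-36 string -> 20 raw SHA-1 bytes."""
    return int(text, 36).to_bytes(20, "big")


digest = hashlib.sha1(b"example revision text").digest()
encoded = raw_to_base36(digest)
assert base36_to_raw(encoded) == digest
print(len(encoded), len(digest))  # 31 characters versus 20 bytes
```

Under that assumption the raw form saves 11 bytes per revision, which by the 0.5 GB per byte figure above would come to roughly 5.5 GB for the enwiki history dump.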
- On mediawiki, hiding the page text doesn't hide the rev_len
In XML dumps, it does.
Proposal for the revision flag:
0x01: minor edit
Bits 2-3 deal with the user:
0x00: only user text is provided
0x02: there is userid + user text
0x04: the contributor is an IPv4 anonymous user
0x06: the contributor is an IPv6 anonymous user
The high nibble matches the rev_deleted field:
0x08: this revision has a non-default model (else the format is text/x-wiki)
0x10: the text of this revision was deleted
0x20: the comment of this revision was deleted
0x40: the contributor of this revision was deleted
0x80: the deleted contents are restricted
I don't see how flipping the bit for model helps anything. Basically, I consider "all pages have their model explicitly specified" as default and text/x-wiki as a special case to save space for the most common model. On the other hand, I think that you consider "text/x-wiki is the model of all pages" as the default and "this page has the model explicitly specified" as the special case. I think that both options make some sense, but I don't see a big difference between them.
Also, why is "the deleted contents are restricted" flag on each revision instead of a single global flag for the whole dump? So that you can support admin-level dumps, where some deleted fields are visible and some not?
Petr Onderka User:Svick
On 29/08/13 16:07, Petr Onderka wrote:
- The flags are a bit convoluted. Sometimes a flag is used for a feature being present, sometimes for a feature being absent, it can be mingled with options.
What specifically do you mean? I think the only place where I didn't go with what was most logical to me is having the flag on for current dump (and off for history dump). I did it this way because it was convenient in my implementation, but I could easily change that.
For instance the 1 byte dump kind flags:
0x01 for pages dump: a dump with revision text
0x02 for current dump: a dump without old revisions of pages
0x04 for articles dump: a dump that doesn't contain pages from talk namespaces and the User namespace
I would have expected a flag for “contains current revisions”, another for “contains old revisions”, another for “contains logs”... Probably there would be a bit for “includes meta namespaces”, although a bitmap might have been better suited (or included in the namespace map?).
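The three dump kind flags quoted above can be modelled as a small bit set, which also makes the mix of "present" and "absent" meanings visible. The bit values are the ones quoted; the class, member and function names are hypothetical.

```python
from enum import IntFlag


class DumpKind(IntFlag):
    """The three flag bits quoted above; the names are hypothetical."""
    PAGES = 0x01     # set: revision text is present
    CURRENT = 0x02   # set: old revisions are absent
    ARTICLES = 0x04  # set: talk and User namespaces are absent


def describe(kind):
    """Spell out what each bit means when set and when clear."""
    return [
        "with revision text" if DumpKind.PAGES in kind
        else "stub (no revision text)",
        "current revisions only" if DumpKind.CURRENT in kind
        else "full history",
        "articles only" if DumpKind.ARTICLES in kind
        else "all namespaces",
    ]
```

Here describe(DumpKind(0x03)) reports a current dump with revision text covering all namespaces. One bit marks content that is included while the other two mark content that is left out, which is the asymmetry being pointed out.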
- Add aliases inside the namespaces map?
- Is the redirect target useful?
- I would consider allowing revdeleted fields available in the dump (for private dumps by the owner).
I'm just mirroring XML dumps (the only exception is the <restrictions> tag, which I'm quite sure is not useful). So, if you think these changes make sense, they would have to be first made in XML dumps and then I would implement them in incremental dumps.
Specifically about redirect target, I was told that this was a requested feature, so some people thought it was useful.
I think it's a legacy field and you should check the redirect table, but I would have to look at the code to confirm it.
- You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
I think it makes sense to be paranoid about this, since for enwiki history dump, one wasted byte for each revision means 0.5 GB wasted in total.
I'm not saying it's too bad to be a bit paranoid about it. You may be interested in using variable-length integers, by the way, perhaps in a similar way to the sqlite ones.
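As a sketch of the idea, here is a generic base-128 variable-length integer (the little-endian LEB128 scheme). It is not SQLite's exact encoding, which is big-endian and capped at nine bytes; the function names are made up for this example.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Values below 128 take a single byte and values below 16384 take two, so small ids and lengths cost far less than a fixed 4- or 8-byte field.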
Saving the SHA-1s directly was on my TODO list, because I assumed it would be relatively complicated. But now I found out that porting wfBaseConvert() would be simple, so I did just that.
I find the "(little endian)" mention a bit strange. If I convert the chars into hex byte by byte I get the normal sha-1, right?
Proposal for the revision flag:
0x01: minor edit
Bits 2-3 deal with the user:
0x00: only user text is provided
0x02: there is userid + user text
0x04: the contributor is an IPv4 anonymous user
0x06: the contributor is an IPv6 anonymous user
The high nibble matches the rev_deleted field:
0x08: this revision has a non-default model (else the format is text/x-wiki)
0x10: the text of this revision was deleted
0x20: the comment of this revision was deleted
0x40: the contributor of this revision was deleted
0x80: the deleted contents are restricted
I don't see how flipping the bit for model helps anything. Basically, I consider "all pages have their model explicitly specified" as default and text/x-wiki as a special case to save space for the most common model. On the other hand, I think that you consider "text/x-wiki is the model of all pages" as the default and "this page has the model explicitly specified" as the special case. I think that both options make some sense, but I don't see a big difference between them.
Yes, it's completely equivalent. It makes more sense to me this way, probably because it used to be text/x-wiki everywhere.
Also, why is "the deleted contents are restricted" flag on each revision instead of a single global flag for the whole dump? So that you can support admin-level dumps, where some deleted fields are visible and some not?
If you make an "owner dump" with all data, you need that bit to differentiate between deleted data any sysop may see and deleted data only oversighters can view.
Finally, I would also support different encodings of the revision text (e.g. a byte in the header with values identity, lzma, mixed, pointer...)
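A minimal sketch of what such an encoding byte could look like, covering only the identity and lzma cases. The numeric tag values, the function names and the choice of Python's lzma container are all assumptions for illustration; "mixed" and "pointer" are left out because their meaning is not spelled out here.

```python
import lzma

# Hypothetical tag values; the thread only names the encodings.
IDENTITY = 0
LZMA = 1


def encode_text(text, encoding):
    """Prefix the stored revision text with a one-byte encoding tag."""
    raw = text.encode("utf-8")
    if encoding == IDENTITY:
        return bytes([IDENTITY]) + raw
    if encoding == LZMA:
        return bytes([LZMA]) + lzma.compress(raw)
    raise ValueError("unknown encoding tag")


def decode_text(blob):
    """Read the tag byte and undo the matching encoding."""
    tag, payload = blob[0], blob[1:]
    if tag == IDENTITY:
        return payload.decode("utf-8")
    if tag == LZMA:
        return lzma.decompress(payload).decode("utf-8")
    raise ValueError("unknown encoding tag")
```

A writer could then choose the encoding per revision, for instance storing very short texts uncompressed where compression overhead would outweigh the gain.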
Regards
xmldatadumps-l@lists.wikimedia.org