Petr Onderka wrote:
The XML output is almost the same as existing XML dumps, but there are some differences [2]. The current state of the new format also now has a detailed specification [3] (this describes the current version, the format is still in flux and can change daily).
I didn't participate in the earlier discussion, but here is some late feedback:
- The magic number WMID (WikiMedia Incremental Dump, I guess) should be MWID or MWBD instead.
- The flags are a bit convoluted. Sometimes a flag marks a feature as present, sometimes as absent, and flags are mingled with options.
- I think the timestamps *are* the number of seconds from the start date (not taking leap seconds into account).
- I don't see the benefit of storing maps this way. If you are searching in-file, you still need to traverse O(n) values after you have checked the keys for the one you wanted. If you load it into memory first, it seems preferable to have each value alongside its key.
- Add aliases inside the namespaces map?
- I would consider adding to the page/revision objects pointers/lengths to the next one, for easy traversal.
- Is the redirect target useful?
- I would consider making revdeleted fields available in the dump (for private dumps by the owner).
- You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
- On MediaWiki, hiding the page text doesn't hide the rev_len (I hid one revision on that page to demonstrate it: https://www.mediawiki.org/w/index.php?title=User:Svick/Incremental_dumps/Fil... )
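To put a number on the SHA-1 point: MediaWiki stores rev_sha1 as a 31-character base-36 string, while the raw digest is 20 bytes. Below is a rough Python sketch of the round trip; the helper names to_base36 and from_base36 are made up for illustration and are not taken from the dump code.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(raw: bytes, width: int = 31) -> str:
    """Encode a raw digest as zero-padded base-36 text (hypothetical helper)."""
    n = int.from_bytes(raw, "big")
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(width, "0")

def from_base36(text: str, size: int = 20) -> bytes:
    """Decode base-36 text back into the raw digest bytes (hypothetical helper)."""
    return int(text, 36).to_bytes(size, "big")

raw = hashlib.sha1(b"example revision text").digest()
encoded = to_base36(raw)

# The raw form is 11 bytes shorter per revision and converts back losslessly.
print(len(raw), len(encoded), from_base36(encoded) == raw)
```

So storing the raw digest would save 11 bytes per revision, at the cost of one conversion when writing XML.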
Proposal for the revision flag:
0x01: minor edit
Bits 2-3 deal with the user:
0x00: only user text is provided
0x02: there is userid + user text
0x04: the contributor is an IPv4 anonymous user
0x06: the contributor is an IPv6 anonymous user
0x08: this revision has a non-default model (else the format is text/x-wiki)
The high nibble matches the rev_deleted field:
0x10: the text of this revision was deleted
0x20: the comment of this revision was deleted
0x40: the contributor of this revision was deleted
0x80: the deleted contents are restricted
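As an illustration of how a reader might decode such a flag byte, here is a minimal Python sketch of the proposal above. The function name, constant names, and dictionary keys are my own inventions for the example, not part of the specification.

```python
# Constants follow the proposed layout; all names here are hypothetical.
MINOR_EDIT = 0x01
USER_MASK = 0x06           # bits 2-3 select how the contributor is stored
NON_DEFAULT_MODEL = 0x08

USER_KINDS = {
    0x00: "user text only",
    0x02: "user id and user text",
    0x04: "IPv4 anonymous",
    0x06: "IPv6 anonymous",
}

def decode_revision_flags(flags: int) -> dict:
    """Split one flag byte into its fields under the proposed layout."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return {
        "minor": bool(flags & MINOR_EDIT),
        "user": USER_KINDS[flags & USER_MASK],
        "non_default_model": bool(flags & NON_DEFAULT_MODEL),
        "text_deleted": bool(flags & 0x10),
        "comment_deleted": bool(flags & 0x20),
        "user_deleted": bool(flags & 0x40),
        "restricted": bool(flags & 0x80),
        # Shifting the high nibble down yields the rev_deleted bitfield.
        "rev_deleted": flags >> 4,
    }

# 0x35: minor edit, IPv4 contributor, text and comment deleted.
info = decode_revision_flags(0x35)
print(info["user"], info["rev_deleted"])
```

The point of the layout is that every bit means "feature present", and the rev_deleted value can be recovered with a single shift instead of per-flag translation.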