Petr Onderka wrote:
The XML output is almost the same as existing XML dumps, but there are
some differences [2].
The current state of the new format also now has a detailed
specification [3] (this describes the current version, the format is
still in flux and can change daily).
I didn't participate in the earlier discussion, but here is some late
feedback:
- The magic number WMID (WikiMedia Incremental Dump, I guess) should be
MWID or MWBD instead.
- The flags are a bit convoluted. Sometimes a flag marks a feature as
present, sometimes as absent, and sometimes it is mingled with
options.
- I think the timestamps *are* the number of seconds from the start date
(not taking leap seconds into account).
- I don't see the benefit of storing maps this way. If you are searching
in-file, you still need to traverse O(n) values after you have checked
the keys for the one you wanted. If you first load it in memory, it
seems preferable to have each value alongside its key.
- Add aliases inside the namespaces map?
- I would consider adding to the page/revision objects pointers/lengths
to the next one, for easy traversal.
- Is the redirect target useful?
- I would consider allowing revdeleted fields to be available in the dump
(for private dumps by the owner).
- You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
- On MediaWiki, hiding the page text doesn't hide the rev_len (I hid
one revision on that page to show it:
https://www.mediawiki.org/w/index.php?title=User:Svick/Incremental_dumps/Fi…
)
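To put a number on the SHA-1 point, here is a rough Python sketch. It assumes the base-36 form is the 31-character, zero-padded, lowercase string that MediaWiki keeps in rev_sha1; the helper names are made up for illustration and are not part of the dump format.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes) -> str:
    """Return the SHA-1 of data as a 31-character base-36 string (assumed layout)."""
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    # Pad on the left so every hash has the same length.
    return "".join(reversed(digits)).rjust(31, "0")


def base36_to_raw(text: str) -> bytes:
    """Convert a base-36 SHA-1 string back to its 20 raw bytes."""
    return int(text, 36).to_bytes(20, "big")


encoded = sha1_base36(b"some revision text")
raw = base36_to_raw(encoded)
print(len(encoded), len(raw))
```

Under that assumption, storing the raw digest would take 20 bytes per revision instead of 31.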
Proposal for the revision flag:
0x01: minor edit
Bits 2-3 deal with the user:
0x00: only user text is provided
0x02: there is userid + user text
0x04: the contributor is an IPv4 anonymous user
0x06: the contributor is an IPv6 anonymous user
0x08: this revision has a non-default model (else the format is
text/x-wiki)
The high nibble matches the rev_deleted field:
0x10: the text of this revision was deleted
0x20: the comment of this revision was deleted
0x40: the contributor of this revision was deleted
0x80: the deleted contents are restricted
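
A rough sketch of how a reader might decode a flag byte laid out as proposed above (Python; the function and key names are hypothetical, only the bit values come from the proposal):

```python
MINOR_EDIT = 0x01
USER_MASK = 0x06          # bits 2-3: how the contributor is stored
NON_DEFAULT_MODEL = 0x08

USER_KINDS = {
    0x00: "user text only",
    0x02: "userid + user text",
    0x04: "IPv4 anonymous user",
    0x06: "IPv6 anonymous user",
}


def decode_revision_flags(flags: int) -> dict:
    """Split one proposed revision flag byte into its parts."""
    # The high nibble is meant to equal rev_deleted, so one shift recovers it.
    rev_deleted = (flags >> 4) & 0x0F
    return {
        "minor": bool(flags & MINOR_EDIT),
        "user": USER_KINDS[flags & USER_MASK],
        "non_default_model": bool(flags & NON_DEFAULT_MODEL),
        "rev_deleted": rev_deleted,
        "text_deleted": bool(flags & 0x10),
        "comment_deleted": bool(flags & 0x20),
        "contributor_deleted": bool(flags & 0x40),
        "restricted": bool(flags & 0x80),
    }


# 0x35 = 0x20 | 0x10 | 0x04 | 0x01
print(decode_revision_flags(0x35))
```

Here 0x35 decodes as a minor edit by an IPv4 anonymous user whose text and comment were deleted, with rev_deleted equal to 3.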