Re: [Xmldatadumps-l] First preview version of incremental dumps

29 Aug 2013

...
  I didn't participate in the earlier discussion, but here is some late
  feedback:
Thanks for that, better late than never.

...
  - The magic number WMID (WikiMedia Incremental Dump, I guess) should be
  MWID or MWBD instead.
I guess you're right. If BD is supposed to mean binary dumps, then
that could be a reasonable name, except that it's not used anywhere.
So I think that MWID makes the most sense.

...
  - The flags are a bit convoluted. Sometimes a flag is used for a feature
  being present, sometimes for a feature being absent, it can be mingled with
  options.
What specifically do you mean?
I think the only place where I didn't go with what was most logical to
me is having the flag on for the current dump (and off for the history
dump). I did it this way because it was convenient in my implementation,
but I could easily change that.

...
  - I think the timestamps *are* the number of seconds from the start date
  (not taking leap seconds into account).
Only if each month had 31 days, which is obviously not true.
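To illustrate the difference, here is a minimal sketch. It assumes the timestamp is packed field by field with 31 day slots per month; the exact packing and the BASE_YEAR value are hypothetical, chosen only for illustration and not taken from the format itself.

```python
from datetime import datetime

BASE_YEAR = 2000  # hypothetical start year, not taken from the format


def pack_timestamp(dt):
    """Mixed-radix packing that gives every month 31 day slots."""
    value = (dt.year - BASE_YEAR) * 12 + (dt.month - 1)
    value = value * 31 + (dt.day - 1)
    value = value * 24 + dt.hour
    value = value * 60 + dt.minute
    value = value * 60 + dt.second
    return value


def elapsed_seconds(dt):
    """Real number of seconds since the start date (no leap seconds)."""
    return int((dt - datetime(BASE_YEAR, 1, 1)).total_seconds())


march_first = datetime(2000, 3, 1)
packed = pack_timestamp(march_first)   # counts 2 * 31 = 62 days
real = elapsed_seconds(march_first)    # counts 31 + 29 = 60 days
```

The two values agree only while no month shorter than 31 days has passed, so a packed value like this cannot be read as a plain seconds count.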

...
  - I don't see the benefit of this storage of maps. If you are searching
  in-file, you still need to traverse O(n) values after you checked the keys
  for the one you wanted. If you first load it in memory, it seems preferable
  to have the value alongside its key.
I always load the whole map into memory.
You're right, your way makes more sense; I have changed that.
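A minimal sketch of what "value alongside its key" could look like on disk. The field widths (a 32-bit count, a 32-bit signed key, a 16-bit length prefix) and the function names are assumptions made for this example, not the actual dump format.

```python
import struct


def write_interleaved(mapping):
    """Serialize as: count, then (key, value length, value bytes) per entry."""
    out = [struct.pack("<I", len(mapping))]
    for key, value in mapping.items():
        data = value.encode("utf-8")
        out.append(struct.pack("<iH", key, len(data)))
        out.append(data)
    return b"".join(out)


def read_interleaved(buf):
    """Load the whole map into memory in a single pass."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    result = {}
    for _ in range(count):
        key, length = struct.unpack_from("<iH", buf, offset)
        offset += 6  # 4-byte key + 2-byte length, no padding with "<"
        result[key] = buf[offset:offset + length].decode("utf-8")
        offset += length
    return result


namespaces = {-2: "Media", 0: "", 1: "Talk", 2: "User"}
restored = read_interleaved(write_interleaved(namespaces))
```

With each value stored right after its key, one sequential read rebuilds the map, and there is no separate block of values to index into afterwards.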

...
  - I would consider adding to the page/revision objects pointers/lengths to
  the next one, for easy traversal.
I don't see how that would add anything.
You can already use indexes for relatively easy traversal.

...
  - Add aliases inside the namespaces map?
  - Is the redirect target useful?
  - I would consider making revdeleted fields available in the dump (for
  private dumps by the owner).
I'm just mirroring XML dumps (the only exception is the <restrictions>
tag, which I'm quite sure is not useful).
So, if you think these changes make sense, they would first have to be
made in XML dumps, and then I would implement them in incremental
dumps.

Specifically about redirect target, I was told that this was a
requested feature, so some people thought it was useful.

...
  - You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.
I think it makes sense to be paranoid about this, since for enwiki
history dump, one wasted byte for each revision means 0.5 GB wasted in
total.

Saving the SHA-1s directly was still on my TODO list, because I assumed
it would be relatively complicated.
But now I found out that porting wfBaseConvert() would be simple, so I
did just that.
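As a rough illustration of the conversion (not the actual port of wfBaseConvert(), which is general-purpose), a Python sketch can lean on built-in big integers. The function names are made up for this example, and the 31-character width is what MediaWiki uses for its base-36 SHA-1 strings.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36_to_bytes(text):
    """Turn a base-36 SHA-1 string into its 20 raw bytes."""
    return int(text, 36).to_bytes(20, "big")


def sha1_bytes_to_base36(raw, width=31):
    """Turn 20 raw SHA-1 bytes back into a zero-padded base-36 string."""
    n = int.from_bytes(raw, "big")
    chars = []
    while n:
        n, rem = divmod(n, 36)
        chars.append(DIGITS[rem])
    return "".join(reversed(chars)).rjust(width, "0")


raw = hashlib.sha1(b"example text").digest()
encoded = sha1_bytes_to_base36(raw)
decoded = sha1_base36_to_bytes(encoded)
```

If the base-36 form were stored as 31 one-byte characters, the raw form would save 11 bytes per revision, which by the estimate above would come to roughly 5.5 GB for the enwiki history dump.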

...
  - On mediawiki, hiding the page text doesn't hide the rev_len
In XML dumps, it does.

...
  Proposal for the revision flag:

     0x01: minor edit

  Bits 2-3 deal with the user:

     0x00: only user text is provided
     0x02: there is userid + user text
     0x04: the contributor is an IPv4 anonymous user
     0x06: the contributor is an IPv6 anonymous user

  The high nibble matches the rev_deleted field:

     0x08: this revision has a non-default model (else the format is
           text/x-wiki)
     0x10: the text of this revision was deleted
     0x20: the comment of this revision was deleted
     0x40: the contributor of this revision was deleted
     0x80: the deleted contents are restricted
I don't see how flipping the bit for model helps anything.
Basically, I consider "all pages have their model explicitly
specified" as default and text/x-wiki as a special case to save space
for the most common model.
On the other hand, I think that you consider "text/x-wiki is the model
of all pages" as the default and "this page has the model explicitly
specified" as the special case.
I think that both options make some sense, but I don't see big
difference between them.

Also, why is "the deleted contents are restricted" flag on each
revision instead of a single global flag for the whole dump? So that
you can support admin-level dumps, where some deleted fields are
visible and some not?
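For reference, the flag byte proposed in the quoted text could be decoded along these lines. This is only a sketch of the proposal as quoted, not of what the dumps currently use; the constant and function names are made up for this example.

```python
MINOR_EDIT = 0x01
USER_MASK = 0x06           # bits 2-3 describe the contributor
NON_DEFAULT_MODEL = 0x08
TEXT_DELETED = 0x10
COMMENT_DELETED = 0x20
CONTRIBUTOR_DELETED = 0x40
RESTRICTED = 0x80

USER_KINDS = {
    0x00: "user text only",
    0x02: "user id + user text",
    0x04: "IPv4 anonymous",
    0x06: "IPv6 anonymous",
}


def decode_revision_flags(flags):
    """Split one proposed flag byte into named fields."""
    return {
        "minor": bool(flags & MINOR_EDIT),
        "user": USER_KINDS[flags & USER_MASK],
        "non_default_model": bool(flags & NON_DEFAULT_MODEL),
        "text_deleted": bool(flags & TEXT_DELETED),
        "comment_deleted": bool(flags & COMMENT_DELETED),
        "contributor_deleted": bool(flags & CONTRIBUTOR_DELETED),
        "restricted": bool(flags & RESTRICTED),
    }


# A minor edit by an IPv4 anonymous user whose comment was deleted.
example = decode_revision_flags(0x01 | 0x04 | 0x20)
```

Note that the contributor field is a two-bit value rather than two independent flags, so it has to be masked out before it is looked up.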

Petr Onderka
User:Svick
