Re: [Xmldatadumps-l] First preview version of incremental dumps

5 Sep 2013

      On 29/08/13 16:07, Petr Onderka wrote:
...
...

The flags are a bit convoluted. Sometimes a flag is used for a feature
being present, sometimes for a feature being absent, it can be mingled with
options.
What specifically do you mean?
I think the only place where I didn't go with what was most logical to
me is having the flag on for current dump (and off for history dump).
I did it this way because it was convenient in my implementation, but
I could easily change that.
For instance the 1 byte dump kind flags:
     0x01 for pages dump: a dump with revision text
     0x02 for current dump: a dump without old revisions of pages
     0x04 for articles dump: a dump that doesn't contain pages from talk 
namespaces and the User namespace
I would have expected a flag for “contains current revisions”, another 
for “contains old revisions”, another for “contains logs”...
Probably there would be a bit for “includes meta namespaces”, although a 
bitmap may have been more suited (or included in the namespace map?).
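To make the difference concrete, here is a minimal sketch of the two layouts. DumpKind mirrors the three bits listed above; DumpContents is a purely hypothetical "every bit means something is present" layout, and its bit values, its REVISION_TEXT bit and the to_contents() helper are assumptions for illustration, not part of the actual format.

```python
from enum import IntFlag


class DumpKind(IntFlag):
    """The 1-byte dump kind flags as described in the preview format."""
    PAGES = 0x01     # a dump with revision text
    CURRENT = 0x02   # a dump without old revisions of pages
    ARTICLES = 0x04  # no talk namespaces and no User namespace


class DumpContents(IntFlag):
    """Hypothetical alternative: each bit states that something is present."""
    CURRENT_REVISIONS = 0x01
    OLD_REVISIONS = 0x02
    LOGS = 0x04
    META_NAMESPACES = 0x08
    REVISION_TEXT = 0x10  # assumed bit, not named in the thread


def to_contents(kind: DumpKind) -> DumpContents:
    """Translate the existing flags into the presence-only layout."""
    contents = DumpContents.CURRENT_REVISIONS  # every dump kind has these
    if DumpKind.PAGES in kind:
        contents |= DumpContents.REVISION_TEXT
    if DumpKind.CURRENT not in kind:    # "current" means old revisions absent
        contents |= DumpContents.OLD_REVISIONS
    if DumpKind.ARTICLES not in kind:   # "articles" means meta namespaces absent
        contents |= DumpContents.META_NAMESPACES
    return contents


history_with_text = DumpKind.PAGES
print(to_contents(history_with_text))
```

The two "not in" tests are where the current layout uses a set bit to mean that something is missing, which is the part I find convoluted.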
...
...

Add aliases inside the namespaces map?
Is the redirect target useful?
I would consider allowing revdeleted fields available in the dump (for
private dumps by the owner).
I'm just mirroring XML dumps (the only exception is the <restrictions>
tag, which I'm quite sure is not useful).
So, if you think these changes make sense, they would have to be first
made in XML dumps and then I would implement them in incremental
dumps.
Specifically about redirect target, I was told that this was a
requested feature, so some people thought it was useful.
I think it's a legacy field and you should check the redirect table, but 
I would have to look at the code to confirm it.
...
...

You are paranoid about wasting bytes, but left the SHA-1 base-36 encoded.

I think it makes sense to be paranoid about this, since for enwiki
history dump, one wasted byte for each revision means 0.5 GB wasted in
total.
I'm not saying it's too bad to be a bit paranoid about it. You may be 
interested in using variable-length integers, by the way, perhaps in a 
similar way to the sqlite ones.
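As an illustration of the idea, here is a minimal sketch of a variable-length integer codec. Note that this is the common little-endian base-128 scheme (7 payload bits per byte, high bit as a continuation marker), not the exact SQLite format, which is big-endian and capped at 9 bytes. The function names are made up for this example.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


for value in (0, 127, 128, 300, 2**32):
    encoded = encode_varint(value)
    assert decode_varint(encoded) == (value, len(encoded))
    print(value, len(encoded))
```

Values below 128 take a single byte, so fields that are usually small (namespace ids, short lengths) would cost one byte instead of a fixed four or eight.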
...
Saving the SHA-1s directly was on my TODO list, because I assumed it
would be relatively complicated.
But now I found out that porting wfBaseConvert() would be simple, so I
did just that.
I find the "(little endian)" mention a bit strange. If I convert the 
chars into hex byte by byte I get the normal sha-1, right?
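What I would expect is something like the following sketch. It assumes the base-36 string is zero-padded to 31 characters and that the 20 raw bytes are written most significant byte first; the helper names are hypothetical and this is not the code from the actual implementation.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int, width: int = 31) -> str:
    """Render a non-negative integer in base 36, zero-padded to width."""
    chars = []
    while n:
        n, remainder = divmod(n, 36)
        chars.append(DIGITS[remainder])
    return "".join(reversed(chars)).rjust(width, "0")


def base36_to_raw(text: str) -> bytes:
    """Convert a base-36 SHA-1 string to its 20 raw bytes (big-endian)."""
    return int(text, 36).to_bytes(20, "big")


digest = hashlib.sha1(b"example revision text").digest()
b36 = to_base36(int.from_bytes(digest, "big"))
raw = base36_to_raw(b36)

assert raw == digest
print(len(b36), len(raw))  # 31 20
# With big-endian storage, hex-encoding byte by byte gives the usual SHA-1.
print(raw.hex() == hashlib.sha1(b"example revision text").hexdigest())
```

Under these assumptions the raw form is 11 bytes shorter per revision than the 31-character string, which by the 0.5 GB per byte estimate above would be about 5.5 GB for the enwiki history dump.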
...
...
Proposal for the revision flag:
...
 0x01: minor edit

Bits 2-3 deal with the user:
...
 0x00: only user text is provided
 0x02: there is userid + user text
 0x04: the contributor is an IPv4 anonymous user
 0x06: the contributor is an IPv6 anonymous user

The high nibble matches the rev_deleted field:
...
 0x08: this revision has a non-default model (else the format is
text/x-wiki)
     0x10: the text of this revision was deleted
     0x20: the comment of this revision was deleted
     0x40: the contributor of this revision was deleted
...
 0x80: the deleted contents are restricted
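
A reader for this proposed byte could look roughly like the sketch below. The constant names, the dictionary layout and decode_revision_flags() are made up for illustration; only the bit values come from the proposal.

```python
MINOR = 0x01
USER_MASK = 0x06            # bits 2-3, read together as a 2-bit field
NON_DEFAULT_MODEL = 0x08
TEXT_DELETED = 0x10
COMMENT_DELETED = 0x20
CONTRIBUTOR_DELETED = 0x40
RESTRICTED = 0x80

USER_KINDS = {
    0x00: "user text only",
    0x02: "user id + user text",
    0x04: "IPv4 anonymous user",
    0x06: "IPv6 anonymous user",
}


def decode_revision_flags(flags: int) -> dict:
    """Split the proposed 1-byte revision flag into named fields."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("revision flags must fit in one byte")
    return {
        "minor": bool(flags & MINOR),
        "user": USER_KINDS[flags & USER_MASK],
        "non_default_model": bool(flags & NON_DEFAULT_MODEL),
        "text_deleted": bool(flags & TEXT_DELETED),
        "comment_deleted": bool(flags & COMMENT_DELETED),
        "contributor_deleted": bool(flags & CONTRIBUTOR_DELETED),
        "restricted": bool(flags & RESTRICTED),
        # The high nibble shifted down is the rev_deleted value itself.
        "rev_deleted": flags >> 4,
    }


# Minor edit by a registered user, text deleted and restricted.
print(decode_revision_flags(0x93))
```

Since the high nibble is just rev_deleted shifted by four bits, a writer could build the byte as (rev_deleted << 4) | low_bits without any per-bit translation.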

I don't see how flipping the bit for model helps anything.
Basically, I consider "all pages have their model explicitly
specified" as default and text/x-wiki as a special case to save space
for the most common model.
On the other hand, I think that you consider "text/x-wiki is the model
of all pages" as the default and "this page has the model explicitly
specified" as the special case.
I think that both options make some sense, but I don't see big
difference between them.
Yes, it's completely equivalent. It makes more sense to me this 
way, probably because it used to be text/x-wiki everywhere.
...
Also, why is "the deleted contents are restricted" flag on each
revision instead of a single global flag for the whole dump? So that
you can support admin-level dumps, where some deleted fields are
visible and some not?
If you made an "owner dump" with all data, you would need that bit to 
differentiate between deleted data any sysop may see and deleted data 
only oversighters can view.
Finally, I would also support different encodings of the revision text 
(e.g. a byte in the header with values identity, lzma, mixed, pointer...)
Regards
