Hi,
Some questions on the new dump options. I noticed that the XML dump files declare exactly the same content model and format for the new JSON format as they did for the old one. This is unfortunate, since the <model> information is much less useful if the same model is declared for incompatible content. I am now trying to find a way to write code that supports both old and new dumps. Hence my questions:
(1) The most recent full dump that is available contains the old format. The most recent current dump that is available contains the new format. Is it possible that a single dump contains both formats?
(2a) If the answer to (1) is no: what are/will be the first (or last) full/current/daily dump files that use the new format?
(2b) If the answer to (1) is yes: what is the revision number at which the change was made (i.e., what is the largest revision number that is still in the old format)?
Many thanks,
Markus
Hi again,
I am in Berlin today and got my answers first hand, so for the record, here they are:
On 01.09.2014 16:07, Markus Krötzsch wrote:
(1) The most recent full dump that is available contains the old format. The most recent current dump that is available contains the new format. Is it possible that a single dump contains both formats?
No, the dump-creating code transforms all content into the appropriate JSON during export. The data you see in a dump is always in the format generated by the code that was current when the dump file was created, and hence all revisions in a file are in the same format.
Currently, the XML-based revision dumps use different code for this than the JSON dumps and the API. In the near future, this will be unified.
(2a) If the answer to (1) is no: what are/will be the first (or last) full/current/daily dump files that use the new format?
I did not get an answer to this question. However, since each file is certain to be in a single format, a viable strategy is to parse with the new format first and, if there are errors, to try parsing with the old format. If that succeeds even once, the rest of the file should be parsed in the old format.
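For anyone who wants to implement this, the fallback strategy could look roughly like the following Python sketch. The names parse_revisions, FormatError, toy_parse_new and toy_parse_old are made up for illustration, and the two toy parsers only check for a placeholder key; they are stand-ins, not the actual old or new Wikidata JSON formats.

```python
import json
from typing import Callable, Iterable, Iterator


class FormatError(ValueError):
    """Raised by a parser when the text is not in the format it expects."""


def parse_revisions(
    texts: Iterable[str],
    parse_new: Callable[[str], dict],
    parse_old: Callable[[str], dict],
) -> Iterator[dict]:
    """Yield parsed revisions from one dump file, trying the new format first.

    Once the old-format parser has succeeded after a new-format failure,
    it is used for the rest of the file, because each dump file is
    assumed to be in a single format.
    """
    use_old = False
    for text in texts:
        if use_old:
            result = parse_old(text)
        else:
            try:
                result = parse_new(text)
            except FormatError:
                # New format failed: fall back to the old format. If this
                # also fails, the error propagates to the caller.
                result = parse_old(text)
                use_old = True
        yield result


# Hypothetical stand-in parsers. The keys "id" and "entity" are
# placeholders chosen for this example only.
def toy_parse_new(text: str) -> dict:
    data = json.loads(text)
    if "id" not in data:
        raise FormatError("missing 'id'")
    return {"format": "new", "id": data["id"]}


def toy_parse_old(text: str) -> dict:
    data = json.loads(text)
    if "entity" not in data:
        raise FormatError("missing 'entity'")
    return {"format": "old", "id": data["entity"]}
```

With these toy parsers, a file whose revisions all carry the placeholder "entity" key triggers the fallback on the first revision and never calls the new-format parser again. A real implementation would plug in the actual parsers for the two formats in place of the toy ones.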
(2b) If the answer to (1) is yes: what is the revision number at which the change was made (i.e., what is the largest revision number that is still in the old format)?
Not applicable.
Markus
On Tue, Sep 2, 2014 at 5:21 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
(2a) If the answer to (1) is no: what are/will be the first (or last) full/current/daily dump files that use the new format?
I did not get an answer to this question,
It looks like http://dumps.wikimedia.org/wikidatawiki/20140823/ uses the new format, although these dumps started before the format switch and ended after it. There's a possibility that they contain some strange mix of both formats. (?)
The next full XML dumps will have the new format. The switch for the daily dumps should have been on August 27.
Cheers, Katie
but since it is certain that each file is in a single format, a viable strategy is to parse with the new format first; if there are errors, try parsing with the old format; if this succeeds even once, the whole remaining file should be parsed in the old format.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech