tl;dr:
In the xml dumps, I want to change <text> <sha1> <model> <format> to <model> <format> <text> <sha1>
However, this is a breaking change to our XML schema. See https://bugzilla.wikimedia.org/show_bug.cgi?id=72417
Background:
While trying to fix bug 72361, I ran into an issue with our current XML dump format:
The <model> and <format> tags are placed *after* the <text> tag. This means that we don't know how to handle the text when we process XML events in a stream - we'd have to buffer the text, wait until we know model and format, and then process it. A pain.
The current order has no deeper meaning - it is, indeed, my own fault: I didn't think this through when adding these tags. I propose to change the order of the tags now, to make stream processing easier.
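To illustrate the streaming problem, here is a minimal sketch (not code from MediaWiki or any dump tool) of a SAX consumer that passes text chunks on as they arrive. It can only do so if <model> and <format> have already been seen; with the current order it would have to buffer the whole text. The names RevisionHandler and stream_revision, the sample documents and the sha1 placeholder are made up for the example.

```python
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Tags each <text> chunk with model and format as soon as it arrives."""

    def __init__(self):
        super().__init__()
        self.current = None   # element we are currently inside, if any
        self.model = None
        self.format = None
        self.buffer = []      # collects the short <model>/<format> values
        self.chunks = []      # (model, format, chunk) tuples, for demonstration

    def startElement(self, name, attrs):
        if name == "revision":
            self.model = self.format = None
        elif name == "text" and (self.model is None or self.format is None):
            # Old tag order: we cannot know how to handle the text yet.
            raise ValueError("<text> seen before <model> and <format>")
        self.current = name
        self.buffer = []

    def characters(self, content):
        if self.current == "text":
            # No buffering needed: model and format are already known.
            self.chunks.append((self.model, self.format, content))
        elif self.current in ("model", "format"):
            self.buffer.append(content)

    def endElement(self, name):
        if name == "model":
            self.model = "".join(self.buffer).strip()
        elif name == "format":
            self.format = "".join(self.buffer).strip()
        self.current = None


def stream_revision(xml_bytes):
    """Parse one document and return the (model, format, chunk) tuples."""
    handler = RevisionHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks


# Invented sample data; "PLACEHOLDER" is not a real sha1 value.
NEW_ORDER = b"""<revision>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text>Hello world</text>
  <sha1>PLACEHOLDER</sha1>
</revision>"""

OLD_ORDER = b"""<revision>
  <text>Hello world</text>
  <sha1>PLACEHOLDER</sha1>
  <model>wikitext</model>
  <format>text/x-wiki</format>
</revision>"""

if __name__ == "__main__":
    for model, fmt, chunk in stream_revision(NEW_ORDER):
        print(model, fmt, repr(chunk))
```

Feeding OLD_ORDER to stream_revision raises the ValueError, which is the case where a real consumer would have to fall back to buffering.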
That would technically be a breaking change to the dump format, incompatible with https://www.mediawiki.org/xml/export-0.8.xsd and export-0.9.xsd. I doubt, however, that any consumers rely on the current placement of <model> and <format>, as it is extremely inconvenient (compare bug 72361), but you never know.
I propose to release a new XSD version 0.10 with the order changed, and mention it in the release notes. Should be fine.
Any objections?
-- daniel
I spend a lot of time processing the XML dumps that this will affect. I just wanted to chime in to say that this change makes sense to me and it won't affect my work.
-Aaron
On Thu, Oct 23, 2014 at 9:06 AM, Daniel Kinzler daniel@brightbyte.de wrote:
tl;dr:
In the xml dumps, I want to change <text> <sha1> <model> <format> to <model> <format> <text> <sha1>
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Thank you Google for hiding the start of this thread in my spam folder >_<
I'm going to have to change my import tools for the new format, but that's the way it goes; it's a reasonable change. Have you checked with folks on the xml data dumps list to see who might be affected?
Ariel
On Thu, 23-10-2014, at 09:52 -0500, Aaron Halfaker wrote:
I spend a lot of time processing the XML dumps that this will affect. I just wanted to chime in to say that this change makes sense to me and it won't affect my work.
-Aaron
Hoi, You may want to wait until the dumps are fixed. Magnus fixed the second-to-last dump by hand. The following dump is still broken. Wait until we KNOW the dumps are ok. Thanks, GerardM
On 27 October 2014 21:58, Ariel T. Glenn aglenn@wikimedia.org wrote:
Thank you Google for hiding the start of this thread in my spam folder >_<
I'm going to have to change my import tools for the new format, but that's the way it goes; it's a reasonable change. Have you checked with folks on the xml data dumps list to see who might be affected?
Ariel
On 27.10.2014 at 22:08, Gerard Meijssen wrote:
Hoi, You may want to wait until the dumps are fixed. Magnus fixed the second-to-last dump by hand. The following dump is still broken. Wait until we KNOW the dumps are ok.
Gerard, what exactly do you mean? The only problem I know of is the fact that we are still outputting content using the old serialization format for some revisions. Changing the tag order is, strange as it may sound, needed to fix that problem. Bugs:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72348
https://bugzilla.wikimedia.org/show_bug.cgi?id=72361
https://bugzilla.wikimedia.org/show_bug.cgi?id=72417
On 27.10.2014 at 21:58, Ariel T. Glenn wrote:
Thank you Google for hiding the start of this thread in my spam folder >_<
I'm going to have to change my import tools for the new format, but that's the way it goes; it's a reasonable change. Have you checked with folks on the xml data dumps list to see who might be affected?
Not yet, shall do that now.
Thanks! -- daniel
I noticed that the dump format version number went from "0.9" to "0.10".
I wonder if this format is documented somewhere or if some code might expect "1.0"?
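To make the concern concrete: code that compares the version as a float or as a plain string would sort "0.10" before "0.9". Below is a small sketch of a comparison that treats each dotted part as an integer instead. The name parse_version is made up here, and whether any real dump consumer compares versions the naive way is not established in this thread.

```python
def parse_version(version):
    """Split a dotted version string such as "0.10" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    # Naive comparisons get the order wrong:
    print(float("0.10") > float("0.9"))   # False: 0.1 is less than 0.9
    print("0.10" > "0.9")                 # False: "1" sorts before "9"

    # Comparing integer tuples gives the intended order:
    print(parse_version("0.10") > parse_version("0.9"))   # True: (0, 10) > (0, 9)
```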
Andrew Dunbar (hippietrail)
On 28 October 2014 20:45, Daniel Kinzler daniel@brightbyte.de wrote:
Not yet, shall do that now.
Thanks! -- daniel
On 23.10.2014 at 16:06, Daniel Kinzler wrote:
tl;dr:
In the xml dumps, I want to change <text> <sha1> <model> <format> to <model> <format> <text> <sha1>
However, this is a breaking change to our XML schema. See https://bugzilla.wikimedia.org/show_bug.cgi?id=72417
There is now a patch up for review:
wikitech-l@lists.wikimedia.org