David Gerard wrote:
On 06/02/2008, Brion Vibber <brion(a)wikimedia.org> wrote:
While reviewing some other code, I went in and started ripping up some
of the file type & validity checks in MediaWiki's upload system, as
they've been driving me nuts for some time.
One quick subproject was tossing in an XML well-formedness check for SVG
files. For the curious, here's a report on the invalid files I
encountered while testing this with files from Commons:
http://meta.wikimedia.org/wiki/SVG_validity_checks
This is worth noting on mediawiki.org, really.
Of particular interest are invalid SVGs created by editing tools. I
have a Bastard SVG From Hell I like to throw at things (I hope to have
a copy I can release soon ;-) ) created by OmniGraffle.
Oooh ooh can I have a copy? Right now I only really care that we can
pass it as well-formed XML and recognize it as SVG, but that's the sort
of thing that's great to test. :D
The W3C validator hates it. Inkscape, rsvg, Safari, WebKit, Opera,
Firefox and Minefield all misrender it to a greater or lesser degree.
(I've yet to throw it at Batik.) But it's an SVG created by an editing
program in current use ...
I was surprised to see a bad SVG from Inkscape - does opening and
saving it in the current stable Inkscape sanitise it?
Technically it's invalid XML -- Inkscape should refuse to open it. :)
On my spot checks, Inkscape is willing to take the ones with undeclared
namespaces (and saves them correctly, yay) but won't open the ones that
are outright malformed (bad char encoding, bad element nesting).
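
For illustration, the distinction drawn above (undeclared namespaces
versus outright malformed XML) can be sketched in a few lines of Python.
This is not the MediaWiki upload code; the function name classify_svg
and its return labels are made up for this example, and it assumes the
expat-based parser in Python's standard library. The standard-library
parser is not hardened against hostile XML, so a real upload check
would need more care.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import errors

SVG_NS = "http://www.w3.org/2000/svg"
# Numeric expat error code for a prefix used without an xmlns declaration.
UNBOUND_PREFIX = errors.codes[errors.XML_ERROR_UNBOUND_PREFIX]


def classify_svg(data: bytes) -> str:
    """Hypothetical helper: sort an upload into one of four buckets."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        if exc.code == UNBOUND_PREFIX:
            # e.g. xlink:href used with no xmlns:xlink on any ancestor
            return "undeclared-namespace"
        # bad encoding, bad element nesting, and so on
        return "malformed"
    # Well-formed XML; check that the root element is <svg> in the SVG namespace.
    if root.tag == f"{{{SVG_NS}}}svg":
        return "svg"
    return "not-svg"
```

The parser stops at the first error, so a file that has both an
undeclared prefix and a nesting error is reported under whichever
problem comes first.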
How sanitisable are the bad SVGs you found? How automatable would a
sanitisation process be, e.g. from a command-line invocation of
Inkscape?
For some well-known typical prefixes (xlink, sodipodi, RDF) we could
fairly easily insert a namespace declaration. For others we might have
to give up. :)
The mystery 'ns:' one seems to be specific to Adobe Illustrator's
export, for instance, and variously shows up as 'ns:' or 'ns0:' prefix
in files I googled up.
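
A rough sketch of the "insert a namespace declaration" idea for the
well-known prefixes, again in Python and purely illustrative. The
function name declare_known_prefixes and the KNOWN table are
assumptions for this example. It works on the raw text with regular
expressions because the file cannot be parsed until the declaration is
present, so it will be fooled by odd inputs (a '>' inside an attribute
value on the root tag, prefixes that only appear inside comments, and
so on). Unknown prefixes such as 'ns:' are deliberately left alone.

```python
import re

# Well-known prefixes and their standard namespace URIs.
KNOWN = {
    "xlink": "http://www.w3.org/1999/xlink",
    "sodipodi": "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def declare_known_prefixes(svg_text: str) -> str:
    """Hypothetical helper: add missing xmlns:* declarations to the root <svg> tag."""
    root = re.search(r"<svg\b[^>]*>", svg_text)
    if root is None:
        return svg_text  # no recognisable root tag; give up

    additions = []
    for prefix, uri in KNOWN.items():
        # Prefix used on an element or attribute name somewhere in the file?
        used = re.search(rf"[<\s]{prefix}:[\w.-]+", svg_text)
        # Prefix already declared anywhere in the file?
        declared = re.search(rf"\bxmlns:{prefix}\s*=", svg_text)
        if used and not declared:
            additions.append(f' xmlns:{prefix}="{uri}"')

    if not additions:
        return svg_text
    # Insert the new declarations right after "<svg" in the root start tag.
    insert_at = root.start() + len("<svg")
    return svg_text[:insert_at] + "".join(additions) + svg_text[insert_at:]
```

Running the result back through a well-formedness check would show
whether the repair was enough or whether the file has other problems.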
Again, most likely the original files were fine, but some combination of
editing manually or with other tools may have corrupted them.
-- brion