However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags that MediaWiki uses to actually display the thing, rather than an abstract representation of it.
DIV tags *are* part of the abstract representation - it's the CSS that handles the display.