On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
I don't think that a Flag Day for some exceedingly esoteric construction which needs to be cleaned up to make a formal parser necessary is completely impossible, but it would have to be pretty negligible, pretty important, or both... it goes back to that circle I mentioned.
So what if we had a "lossless" wikisyntax to XML converter? It seems like that wouldn't be an impossibility (given we're already parsing wikisyntax to _HTML_).
What are the reactions to e.g. converting the backend to use that XML storage, then enforcing it on the editor side, as well?
Obviously we'd have to be clever on the conversion (like making VERY sure it's a "lossless" switch, and finding a computationally feasible way to get it done - maybe update every article as it's touched?).
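A minimal sketch of the "update every article as it's touched" idea, assuming each stored revision carries a format tag. The `Revision` class, the `store` dict and the `to_xml` callback are all hypothetical names for illustration, not anything that exists in MediaWiki:

```python
from dataclasses import dataclass


@dataclass
class Revision:
    text: str
    fmt: str  # "wiki" for legacy wikisyntax, "xml" for converted pages


def load_as_xml(store, title, to_xml):
    """Return the page as XML, converting and re-saving it on first touch.

    `store` is a plain dict of title -> Revision (hypothetical backend);
    `to_xml` is whatever wikisyntax -> XML converter ends up being trusted.
    """
    rev = store[title]
    if rev.fmt == "wiki":
        # Convert lazily, so the cost is paid one article at a time.
        rev = Revision(to_xml(rev.text), "xml")
        store[title] = rev
    return rev.text
```

Pages nobody touches stay in wikisyntax indefinitely, so a background sweep would presumably still be needed to finish the job.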
To my way of thinking, if we had an XML backend store and a reliable conversion path, then we could:
a) Provide wikisyntax editing to those who want it (by filtering through the converter)
b) Develop meaningful wysiwyg editing tools without having to first reimplement the wikisyntax parser in javascript and every other language we want to touch.
c) Allow direct access to the XML, making all kinds of researchers happy.
d) Incrementally roll out changes to bring things more in line with Semantic Web, again with conversion paths.
Engineering-wise, I think a "lossless" path could be built from these components:
1. Wikisyntax <-> WikiXML converters.
2. WikiXML -> HTML renderer.
Determining that it is working properly can be done by testing against the Wikipedia corpus. If we can go from WikiXML to Wikisyntax and back, byte-exact, we've achieved our goal. Maybe it's ok to relax that restriction (especially if we can determine in some other way the page is corrupt or invalid - or maybe we have a list of exceptions), but I think it's one that's both achievable and reasonable.
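A rough sketch of what the byte-exact check could look like. The converter pair here is a toy that only understands '''bold''' and the `<page>`, `<t>` and `<b>` element names are invented for the example; a real WikiXML schema would be far richer. The harness starts from the wikisyntax side, on the assumption that wikisyntax is what the existing corpus holds:

```python
import xml.etree.ElementTree as ET

BOLD = "'''"


def wiki_to_xml(wikitext):
    """Toy converter: only bold markers become markup, the rest is text."""
    parts = wikitext.split(BOLD)
    if len(parts) % 2 == 0:
        # Odd number of markers: the last one is unmatched, so keep it
        # as literal text instead of silently dropping or closing it.
        parts[-2:] = [parts[-2] + BOLD + parts[-1]]
    root = ET.Element("page")
    for i, part in enumerate(parts):
        ET.SubElement(root, "b" if i % 2 else "t").text = part
    return ET.tostring(root, encoding="unicode")


def xml_to_wiki(xml):
    """Inverse of wiki_to_xml for the toy schema above."""
    out = []
    for el in ET.fromstring(xml):
        text = el.text or ""  # empty elements parse back as None
        out.append(BOLD + text + BOLD if el.tag == "b" else text)
    return "".join(out)


def round_trip_failures(corpus, to_xml=wiki_to_xml, to_wiki=xml_to_wiki):
    """Return titles whose text does not survive wiki -> XML -> wiki."""
    failures = []
    for title, text in corpus.items():
        try:
            result = to_wiki(to_xml(text))
            ok = result.encode("utf-8") == text.encode("utf-8")
        except ET.ParseError:
            ok = False  # the converter emitted XML that does not parse
        if not ok:
            failures.append(title)
    return failures
```

Run over a dump, the returned list doubles as the "list of exceptions": those are the pages to fix by hand or to exempt from the byte-exact rule.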
We may also want to do validation on the HTML render path; if we want to be really strict we can require that the conversion path gives identical output (perhaps sans whitespace?) to the current parser & renderer.
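One possible reading of "identical output sans whitespace", sketched below: collapse runs of whitespace and drop whitespace that sits purely between tags before comparing. The function names are made up for the example, and a real check would probably want to leave `<pre>` blocks alone:

```python
import re


def normalize_ws(html):
    """Collapse whitespace runs, then drop whitespace between tags."""
    collapsed = re.sub(r"\s+", " ", html).strip()
    return re.sub(r">\s+<", "><", collapsed)


def same_render(old_html, new_html):
    """True if the two renderings differ only in whitespace layout."""
    return normalize_ws(old_html) == normalize_ws(new_html)
```

Whitespace inside running text is collapsed rather than removed, so "a b" and "ab" still count as different output.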
Once we have everything in XML, there are a number of good tools and standards to enable us to be Unicode compliant, to do various kinds of conversions and updates on the XML, and otherwise process our data, so we can evolve it forward to meet our needs.
In any case - if we find that having a lossless path would satisfy the constraints, then those who are interested can focus on writing a validation framework... and then they can go implement it. :)