On Tue, May 25, 2004 at 09:00:29PM -0400, delirium@hackish.org wrote:
This seems like a pretty hackish solution long-term. The HTML dump has some semantic information, but it also has a lot of HTML-ish cruft in it. The wikitext doesn't have all the semantic information anyone might want, but it's much better than the HTML version. If you're going to do anything reasonably intelligent with the output (other than just display the rendered HTML), or output it to some different format (TeX is the one I've been working on), and want it automated, a lot of that information will be useful.
But that's not the purpose. The CD is just meant for reading the articles and perhaps printing them. If someone wanted to do data mining or convert to a different format or whatever else they want to do with the semantic markup, they should get the database dump.
So, basically: Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of: Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
The latter version is sort of like compiling a C++ program into x86 assembly and then transforming it into PowerPC assembly from that, rather than doing what gcc does--compiling C++ into an abstract intermediate representation, which can then be output to x86 assembly or PowerPC assembly or whatever you might like.
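To make the first pipeline concrete, here's a rough Python sketch of it. It handles only two constructs, <math>...</math> and ''...'', and everything in it (the node classes, parse, to_html, to_tex, the "math" CSS class) is made up for illustration; it is not how MediaWiki actually parses or renders anything. The point is just that both output formats are produced from the same node list, never from each other.

```python
import html
import re
from dataclasses import dataclass

# Hypothetical node types for a tiny "abstract syntax" of wikitext.
@dataclass
class Text:
    value: str

@dataclass
class Italic:
    value: str

@dataclass
class Math:
    value: str

# Assumption: only <math>...</math> and ''...'' are recognised; bold ('''),
# links, templates and nesting are ignored in this sketch.
TOKEN = re.compile(r"<math>(.*?)</math>|''(.*?)''", re.DOTALL)

# Incomplete on purpose: backslash, ~ and ^ are not handled.
TEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
                "_": r"\_", "{": r"\{", "}": r"\}"}

def escape_tex(s):
    """Escape a few TeX special characters in plain text."""
    return re.sub(r"[&%$#_{}]", lambda m: TEX_SPECIALS[m.group()], s)

def parse(wikitext):
    """Wikitext --> abstract syntax (a flat list of nodes)."""
    nodes, pos = [], 0
    for m in TOKEN.finditer(wikitext):
        if m.start() > pos:
            nodes.append(Text(wikitext[pos:m.start()]))
        if m.group(1) is not None:
            nodes.append(Math(m.group(1)))
        else:
            nodes.append(Italic(m.group(2)))
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(Text(wikitext[pos:]))
    return nodes

def to_html(nodes):
    """Abstract syntax --> HTML."""
    out = []
    for n in nodes:
        if isinstance(n, Math):
            out.append('<span class="math">' + html.escape(n.value) + "</span>")
        elif isinstance(n, Italic):
            out.append("<i>" + html.escape(n.value) + "</i>")
        else:
            out.append(html.escape(n.value))
    return "".join(out)

def to_tex(nodes):
    """Abstract syntax --> TeX, without going through HTML."""
    out = []
    for n in nodes:
        if isinstance(n, Math):
            out.append("$" + n.value + "$")  # math source passed through as-is
        elif isinstance(n, Italic):
            out.append(r"\emph{" + escape_tex(n.value) + "}")
        else:
            out.append(escape_tex(n.value))
    return "".join(out)

nodes = parse("the variable <math>g</math> is ''not'' small")
print(to_html(nodes))  # the variable <span class="math">g</span> is <i>not</i> small
print(to_tex(nodes))   # the variable $g$ is \emph{not} small
```

Once the Math node exists, the TeX backend can emit $g$ directly. Going HTML --> TeX, an italic g would look like any other emphasized text, and that distinction would have to be guessed.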
It does bring up another point though: even in the wikitext there isn't as much semantic information as might be nice. Some is hard to come up with good markup for, but some is fairly easy--for example, encouraging people to use <math> tags for everything that's logically math, even short things like "the variable <math>g</math> is..." instead of using manual non-logical formatting commands like "italicize". Or even worse, using HTML-specific stuff like fancy divs.
That'd be nice, but unfortunately it's not enforceable. The web was originally intended to have semantic tags. But guess what happened? There are two reasons why it won't work on Wikipedia. First, people generally think in terms of presentational rather than structural markup. Might be a result of WYSIWYG word processors, I don't know. Second, even those who feel semantic markup is important (and I offer myself as an example) aren't going to be bothered to write <math>x</math> instead of ''x''. The primary goal of Wikipedia is parsability by humans rather than computers, and the former application is currently so predominant over the latter that I'm not willing to inconvenience myself for the sake of semantic markup.
Arvind