On Tue, May 25, 2004 at 09:00:29PM -0400, delirium(a)hackish.org wrote:
This seems like a pretty hackish solution long-term.
The HTML dump has some
semantic information, but it also has a lot of HTML-ish cruft in it. The
wikitext doesn't have all the semantic information anyone might want, but it's
much better than the HTML version. If you're going to do anything reasonably
intelligent with the output (other than just display the rendered HTML), or
output it to some different format (TeX is the one I've been working on), and
want it automated, a lot of that information will be useful.
But that's not the purpose. The CD is just meant for reading the articles and
perhaps printing them. If someone wanted to do data mining or convert to a
different format or whatever else they want to do with the semantic markup,
they should get the database dump.
So, basically:
Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of:
Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
The latter version is sort of like compiling a C++ program into x86 assembly and
then transforming it into PowerPC assembly from that, rather than doing what gcc
does--compiling C++ into an abstract intermediate representation, which can
then be output to x86 assembly or PowerPC assembly or whatever you might like.
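
A minimal sketch of the first pipeline, in Python, might look roughly like
this. Everything in it is made up for illustration: the node types (Text,
Italic, Math), the tiny regex, and both renderers are assumptions, not the
actual MediaWiki parser, and it only understands <math>...</math> and ''...''.

```python
import re
from dataclasses import dataclass

# Hypothetical abstract-syntax nodes; a real wikitext AST would need many more.
@dataclass
class Text:
    value: str

@dataclass
class Italic:   # presentational: only says "slant this"
    value: str

@dataclass
class Math:     # semantic: says "this is mathematics"
    value: str

# Toy tokenizer: matches <math>...</math> or ''...'' and nothing else.
TOKEN = re.compile(r"<math>(.*?)</math>|''(.*?)''", re.S)

def parse(wikitext):
    """Wikitext --> abstract syntax (a flat list of nodes)."""
    nodes, pos = [], 0
    for m in TOKEN.finditer(wikitext):
        if m.start() > pos:
            nodes.append(Text(wikitext[pos:m.start()]))
        if m.group(1) is not None:
            nodes.append(Math(m.group(1)))
        else:
            nodes.append(Italic(m.group(2)))
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(Text(wikitext[pos:]))
    return nodes

def to_html(nodes):
    """Abstract syntax --> HTML (no escaping; sketch only)."""
    out = []
    for n in nodes:
        if isinstance(n, Math):
            out.append(f'<span class="math">{n.value}</span>')
        elif isinstance(n, Italic):
            out.append(f"<i>{n.value}</i>")
        else:
            out.append(n.value)
    return "".join(out)

def to_tex(nodes):
    """Abstract syntax --> TeX (no escaping; sketch only)."""
    out = []
    for n in nodes:
        if isinstance(n, Math):
            out.append(f"${n.value}$")
        elif isinstance(n, Italic):
            out.append(f"\\textit{{{n.value}}}")
        else:
            out.append(n.value)
    return "".join(out)
```

Both back ends read the same node list, so adding another output format means
writing one more function like to_tex() rather than reverse-engineering HTML.
Note also that a Math node can become real math mode in TeX, while an Italic
node can only ever become italics.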
It does bring up another point though: even in the wikitext there isn't as much
semantic information as might be nice. Some is hard to come up with good
markup for, but some is fairly easy--for example, encouraging people to use
<math> tags for everything that's logically math, even short things like "the
variable <math>g</math> is..." instead of using manual non-logical formatting
commands like "italicize". Or even worse, using HTML-specific stuff like fancy
divs.
That'd be nice, but unfortunately it's not enforceable. The web was originally
intended to have semantic tags. But guess what happened? There are two reasons
why it won't work on Wikipedia. First, people generally think in terms of
presentational rather than structural markup. Might be a result of WYSIWYG word
processors, I don't know. Second, even those who feel semantic markup is
important (and I offer myself as an example) aren't going to be bothered to
write <math>x</math> instead of ''x''. The primary goal of Wikipedia is
parsability by humans rather than computers, and the former application is
currently so predominant over the latter that I'm not willing to inconvenience
myself for the sake of semantic markup.
Arvind
--
It's all GNU to me