Hi,
Still, doesn't this mean the parser needs to
recognise
"#REDIRECT <linkpattern>" as a special token? And doesn't that, in
turn, present a problem if we want to retain MagicWord i18n?
Not really. We can still recognise redirects with a regexp (or anything
else in PHP) before passing the page to the parser.
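Something along these lines would do (just a sketch, with a made-up
keyword list; the real list would come from the MagicWord i18n data for
the wiki's language):

    <?php
    // Sketch only: detect a redirect before the text reaches the parser.
    // The keyword list would be supplied per-language by the MagicWord
    // i18n data; the entries below are just examples.
    $redirectKeywords = array( '#REDIRECT', '#WEITERLEITUNG' );

    function isRedirect( $text, $keywords ) {
        // Look for "#KEYWORD [[Target]]" at the very start of the page,
        // case-insensitively, for any of the localised keywords.
        $alternation = implode( '|', array_map( 'preg_quote', $keywords ) );
        return (bool) preg_match(
            '/^(?:' . $alternation . ')\s*\[\[[^\]]+\]\]/i',
            ltrim( $text )
        );
    }
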
I'm not
sure why you think allowing all translations on all Wikipedias
would be a "step backwards"? Or do you seriously think someone would use
the Chinese translation of <math> on the English Wikipedia? :)
It's a step backwards because instead of being able to say "yes, we
have an i18n system for you to customise these 'magic words' according
to your language and preference", we will be saying "the parser knows
a few variants for each of these 'magic words'; if you want to add any
more, tell us, and we'll recompile the parser for you".
First of all, even in the current system there is no way for server
admins to customise the magic words without modifying actual source
code. Secondly, you're making it sound like recompiling the parser was
some sort of monumental task.
Here's an idea. One could provide a .c or .h file where #define
statements are used to define the magic words, and then make sure that
if you modify it, you only need to recompile the binary (i.e. invoke
gcc) but you don't need flex, bison, or swig. But even if you were to
require flex, bison and swig, even then the recompilation can be
automated by a simple script.
Here's another reason why I think the parser should recognise all
variants of the magic words. Think about the alternative. The
alternative is to have "<xyz>" mean "invoke the math extension" on one
Wikipedia, but be plain literal text on another. What is the point in
allowing articles to exist that rely on "<xyz>" meaning the literal
text "<xyz>" when it means "math" elsewhere?
And no, I don't think anyone will need the Chinese translation of
<math> on the English Wikipedia, and for that precise reason I see no
reason the English Wikipedia should be parsing for it.
It doesn't make a performance difference, if that's what you're worrying
about.
* We could
replace the "other-language" words with the "this-language"
words upon save. I.e. if someone wrote <math> on the Chinese
Wikipedia, it would automatically be changed into "<" + some Chinese
characters + ">" before storing it in the DB.
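A rough sketch of what I mean (the function and mapping are made up,
just to illustrate; the real mapping would come from the language file):

    <?php
    // Sketch only: before storing the page, rewrite every known foreign
    // variant of a magic word into this wiki's own form. The mapping
    // would come from the local language file; these names are invented.
    function canonicaliseMagicWords( $text, $variantToLocal ) {
        foreach ( $variantToLocal as $variant => $local ) {
            // Rewrite both the opening and the closing tag.
            $text = preg_replace(
                '/<(\/?)' . preg_quote( $variant, '/' ) . '>/i',
                '<$1' . $local . '>',
                $text
            );
        }
        return $text;
    }
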
Sorry, I'm not with you on this one - are you suggesting that the
Chinese parser be specifically compiled to only cope with the Chinese
magic words? If so, perhaps you misunderstood my problem with having
all variants coded in: it's not that they will all work everywhere, but
that adding new ones would (if I understand the whole yacc concept
correctly) require recompiling a new parser, rather than just tweaking
the appropriate language or settings file.
To the first part: No, the idea was to have one parser that recognises
everything, but to expose to the Chinese users only the Chinese variant
even if someone typed in the Swahili one. (Haha.) But of course that
doesn't solve your problem. I mentioned above that recompiling the
parser is by no means difficult, and that the current system also
requires editing source code, but if you still think that it's a
problem, then we're stuck because I don't know what else we can do.
So maybe you're right, and the only workable
solution is to have all
variants hard-coded in the parser. I guess this is where we come to
regret adopting an "extension" syntax that matches/conflicts with the
syntax used by "allowed bits of HTML".
True. If we had something like [!math x^2 + y^2 = z^2 !], then we could
say "everything in [! ... !] is an extension". Would make life much easier.
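Just as an illustration (not working parser code), a single pattern
would then find every extension block, whatever its name:

    <?php
    // Sketch only: find all [!name ... !] blocks in one pass, without
    // the parser having to know any tag names in advance.
    $wikiText = 'Pythagoras: [!math x^2 + y^2 = z^2 !] and so on.';
    preg_match_all(
        '/\[!\s*(\w+)(.*?)!\]/s',   // name in group 1, raw contents in group 2
        $wikiText,
        $matches,
        PREG_SET_ORDER
    );
    // Here $matches[0][1] is "math" and $matches[0][2] is the formula.
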
If we want to keep true customisation of magic words (where "editing
the source" != "customisation"), the best idea I've come up with is:
1) hard-code all allowed HTML into the parser. This means maximum
efficiency for those bits, and the ability to handle relationships
between them, etc.
2) treat everything else matching "<"+some_letters+">" as an
"extension" and spew out its contents as one element of the parse
tree. If the receiving PHP script then says "there's no such
extension", it escapes the "<" and ">", and passes the
contents back
to be parsed normally.
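Roughly like this, as a sketch (the function names are invented, just
to show the fallback):

    <?php
    // Sketch only: what the PHP side might do with an "extension" node
    // coming out of the parse tree. The names here are invented.
    function handleExtensionNode( $name, $contents, $extensions ) {
        if ( isset( $extensions[$name] ) ) {
            // A registered extension renders its own contents.
            return call_user_func( $extensions[$name], $contents );
        }
        // No such extension: escape the tags; $contents would then be
        // handed back to the parser to be treated as ordinary wiki text.
        return '&lt;' . htmlspecialchars( $name ) . '&gt;'
             . $contents
             . '&lt;/' . htmlspecialchars( $name ) . '&gt;';
    }
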
Unfortunately, this opens a whole other can of worms. What if there is
no end tag? What if there is other mark-up partly inside and partly
outside the "extension" block?
He said, "What the ''<swearword>'' is going on?"
I think, considering all of these problems we have discussed, it makes
a lot of sense to formulate a "rule" that the design of the parser
should fulfil: The parser must know in advance how to parse everything.
The resulting parse tree must not depend on anything other than the
input wiki text.
Greetings,
Timwi