On Mon, 20 Sep 2004 16:40:30 +0100, Timwi <timwi@gmx.net> wrote:
Rowan Collins wrote:
I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]".
This has already been done in 1.4.
Hm, figures! I really must start finishing what I start, rather than leaving it on a "to-do" list for six months and then discovering it's redundant... Still, doesn't this mean the parser needs to recognise "#REDIRECT <linkpattern>" as a special token? And doesn't that, in turn, present a problem if we want to retain MagicWord i18n?
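For what it's worth, the way I'd pictured doing it - purely a sketch with made-up names, not whatever 1.4 actually does - was to keep the recognised words in the language file and just check the start of the text:

<?php
// Sketch only: spot "#REDIRECT [[Target]]" at the top of an article,
// with the recognised words coming from a per-language list rather
// than being baked into a grammar.  (Assumes none of the synonyms
// contains a "/", since preg_quote isn't told the delimiter here.)
function getRedirectTarget( $text, $synonyms = array( '#REDIRECT' ) ) {
    $words = implode( '|', array_map( 'preg_quote', $synonyms ) );
    if ( preg_match( "/^(?:$words)\s*:?\s*\[\[([^\]\|]+)/i", $text, $m ) ) {
        return trim( $m[1] );   // the page being redirected to
    }
    return false;               // not a redirect
}

Printing "This is a redirect to [[foo]]" is then just a matter of calling something like that at page-view time, and the i18n stays in the language file rather than in the grammar.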
- Things like __NOTOC__ and stuff can be handled like this:
- Regard *everything* of the form __CAPITALLETTERS__ as a special token
Actually, it can be lower case currently.
Indeed. I didn't know that. But it isn't a problem at all. Even with it being case-insensitive, I don't think it's asking too much of the users to put <nowiki> around anything that looks like these, since they are rarely intended to be actual text. I highly doubt that any significant number of articles currently relies on them being plain text.
True. So long as the ones that function now will still function, we should be fine.
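Whether it ends up as a token rule in the grammar or a preg pass like the current code, the behaviour I'm hoping for is roughly this (a sketch, not real code):

<?php
// Sketch: anything of the form __word__ is a *candidate* magic word,
// matched case-insensitively; only the ones in the known list are
// stripped and recorded, and anything else is left alone as text.
$text = "__NoToC__ Some article text, with an unrelated __FOO__ in it.";
$magicWords = array( 'NOTOC', 'FORCETOC', 'NOEDITSECTION' );
$found = array();
foreach ( $magicWords as $word ) {
    if ( preg_match( "/__{$word}__/i", $text ) ) {
        $found[$word] = true;
        $text = preg_replace( "/__{$word}__/i", '', $text );
    }
}
// Afterwards $found is array( 'NOTOC' => true ), and "__FOO__" stays
// in the text untouched - i.e. __notoc__ keeps working, and anything
// unrecognised still renders as literal text.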
- HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.
<snip>
I'm not sure why you think allowing all translations on all Wikipedias would be a "step backwards"? Or do you seriously think someone would use the Chinese translation of <math> on the English Wikipedia? :)
It's a step backwards because instead of being able to say "yes, we have an i18n system for you to customise these 'magic words' according to your language and preference", we will be saying "the parser knows a few variants for each of these 'magic words'; if you want to add any more, tell us and we'll recompile the parser for you". In my opinion this is quite a big deal - essentially, we are dropping a feature of the software. Not that I've come up with a workable alternative yet; it just seems a shame.
And no, I don't think anyone will need the Chinese translation of <math> on the English Wikipedia - and for precisely that reason I see no need for the English Wikipedia to be parsing for it.
- We could replace the "other-language" words with the "this-language" words upon save. I.e. if someone wrote <math> on the Chinese Wikipedia, it would automatically be changed into "<" + some Chinese characters + ">" before storing it in the DB.
Sorry, I'm not with you on this one - are you suggesting that the Chinese parser be specifically compiled to cope only with the Chinese magic words? If so, perhaps you misunderstood my problem with having all variants coded in: it's not that they would all work everywhere, but that adding new ones would (if I understand the whole yacc concept correctly) require recompiling a new parser, rather than just tweaking the appropriate language or settings file.
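If I *have* read you right, the rewrite-on-save step would amount to something like this (a sketch with invented names; the mapping would live in the language file):

<?php
// Sketch of what I think is being suggested: on save, rewrite the
// canonical tag names into the local language's names, so that wiki's
// parser only ever needs to know one spelling of each.
// $localNames is imagined as coming from the language file,
// e.g. array( 'math' => whatever the local word for <math> is ).
function localiseTagsOnSave( $text, $localNames ) {
    foreach ( $localNames as $canonical => $local ) {
        $text = preg_replace( "/<(\/?)\s*$canonical\s*>/i", "<\$1$local>", $text );
    }
    return $text;
}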
- Alternatively, we could have the parser recognise only the canonical (English) words, and have the PHP software replace non-English magic words with the canonical (English) words before invoking the parser. I am uncomfortable with this solution because it resorts to the same kind of patchwork that is irking me about the current not-a-parser.
I agree, this would not be at all elegant.
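In practice it boils down to a blind pre-pass along these lines (sketch only, names invented):

<?php
// Sketch of the "patchwork" pre-pass: the PHP side maps every
// localised magic word back to its canonical English form before the
// real parser ever sees the text.  $translations imagined as coming
// from the language file, e.g. array( '#WEITERLEITUNG' => '#REDIRECT' ).
function canonicaliseMagicWords( $text, $translations ) {
    foreach ( $translations as $local => $canonical ) {
        $text = preg_replace( '/' . preg_quote( $local, '/' ) . '/i', $canonical, $text );
    }
    return $text;
}

A straight text substitution like that would presumably also clobber things inside <nowiki> and comments, which is exactly the sort of patchwork in question.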
Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use an HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags.
This is an interesting thought, but I think it is inefficient with regards to performance. If the parser knows about allowed HTML tags (and the difference between an HTML tag and an extension) beforehand, this extra step would be saved.
Yeah, as soon as I wrote it, I realised that it would end up rather expensive efficiency-wise.
Additionally, your idea works only for tags that are independent of other tags; it would not work well with tables.
Indeed, I hadn't thought of the necessity to parse whole bunches of HTML tags relative to each other. I suppose you could have the allowed HTML tags explicitly defined, and a "not-extension" that spat back anything which looked like an extension but wasn't; but it would still mean that every disallowed HTML or HTML-like tag added an extra call to the parser, and I'm not sure whether that would be an acceptable price.
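For what it's worth, the shape I had in mind was roughly this (pseudo-PHP, every name invented, just to pin down what I meant):

<?php
// Rough shape of the "extension returns a parse sub-tree" idea.
// None of this is real MediaWiki code.
interface ParserExtension {
    // Given the raw text between its tags, return a node/sub-tree to
    // be spliced into the main parse tree.
    function handle( $innerText, $parser );
}

// The "HTML extension": registered for every allowed HTML tag, it just
// hands the contents straight back to the parser, so allowed HTML ends
// up as an ordinary sub-tree.  This call-back-into-the-parser step is
// the extra cost being discussed.
class HtmlTagExtension implements ParserExtension {
    function handle( $innerText, $parser ) {
        return $parser->parseToTree( $innerText );   // invented method name
    }
}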
So maybe you're right, and the only workable solution is to have all variants hard-coded in the parser. I guess this is where we come to regret adopting an "extension" syntax that matches/conflicts with the syntax used by "allowed bits of HTML".
If we want to keep true customisation of magic words (where "editing the source" != "customisation"), the best idea I've come up with is:

1) Hard-code all allowed HTML into the parser. This means maximum efficiency for those bits, and the ability to handle relationships between them, etc.

2) Treat everything else matching "<" + some_letters + ">" as an "extension" and spew out its contents as one element of the parse tree. If the receiving PHP script then says "there's no such extension", it escapes the "<" and ">" and passes the contents back to be parsed normally.
This would be acceptable IFF a) cases of erroneous <tags> were pretty rare, so the load created by parsing them separately was not too high; AND b) it was acceptable that erroneous tags inside complex HTML structures would break that structure - e.g. "<table><tr><foo><td>Some text</td></foo></tr></table>" would not parse correctly, because the second call to the parser would contain just "<td>Some text</td>", which would probably be impossible to parse properly in isolation. Of course, even this could be avoided if the entire text were re-parsed, with the erroneous tags escaped, but that's adding yet more overhead...
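To make 2) concrete, the fallback I have in mind looks something like this (again a sketch, all names invented):

<?php
// Sketch of step 2): the parser has produced a node for something that
// looked like <foo>...</foo>; the PHP side checks its extension
// registry, and if nothing is registered it escapes the brackets and
// runs the contents back through the parser as ordinary wikitext.
function handleExtensionNode( $name, $inner, $extensions, $parser ) {
    if ( isset( $extensions[$name] ) ) {
        // A real extension: it renders its own contents.
        return call_user_func( $extensions[$name], $inner );
    }
    // No such extension: the tag becomes literal text, and the
    // contents get a second trip through the parser - the extra load
    // in a), and the loss of surrounding context in b).
    return htmlspecialchars( "<$name>" )
         . $parser->parse( $inner )               // invented method name
         . htmlspecialchars( "</$name>" );
}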