On Mon, 20 Sep 2004 16:40:30 +0100, Timwi <timwi@gmx.net> wrote:
Rowan Collins wrote:
I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]".
This has already been done in 1.4.
Hm, figures! I really must start finishing what I start, rather than leaving it on a "to-do" list for six months and then discovering it's redundant... Still, doesn't this mean the parser needs to recognise "#REDIRECT <linkpattern>" as a special token? And doesn't that, in turn, present a problem if we want to retain MagicWord i18n?
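For what it's worth, the way I'd pictured doing it - purely a sketch with made-up names, not whatever 1.4 actually does - was to keep the recognised words in the language file and just check the start of the text:

<?php
// Sketch only: spot "#REDIRECT [[Target]]" at the top of an article,
// with the recognised words coming from a per-language list rather
// than being baked into a grammar.  (Assumes none of the synonyms
// contains a "/", since preg_quote isn't told the delimiter here.)
function getRedirectTarget( $text, $synonyms = array( '#REDIRECT' ) ) {
    $words = implode( '|', array_map( 'preg_quote', $synonyms ) );
    if ( preg_match( "/^(?:$words)\s*:?\s*\[\[([^\]\|]+)/i", $text, $m ) ) {
        return trim( $m[1] );   // the page being redirected to
    }
    return false;               // not a redirect
}

Printing "This is a redirect to [[foo]]" is then just a matter of calling something like that at page-view time, and the i18n stays in the language file rather than in the grammar.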
- Things like __NOTOC__ and stuff can be handled like this:
- Regard *everything* of the form __CAPITALLETTERS__ as a special token
Actually, it can be lower case currently.
Indeed. I didn't know that. But it isn't a problem at all. Even with it being case-insensitive, I don't think it's asking too much of the users to put <nowiki> around anything that looks like these, since they are rarely intended to be actual text. I highly doubt that any significant number of articles currently relies on them being plain text.
True. So long as the ones that function now will still function, we should be fine.
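Whether it ends up as a token rule in the grammar or a preg pass like the current code, the behaviour I'm hoping for is roughly this (a sketch, not real code):

<?php
// Sketch: anything of the form __word__ is a *candidate* magic word,
// matched case-insensitively; only the ones in the known list are
// stripped and recorded, and anything else is left alone as text.
$text = "__NoToC__ Some article text, with an unrelated __FOO__ in it.";
$magicWords = array( 'NOTOC', 'FORCETOC', 'NOEDITSECTION' );
$found = array();
foreach ( $magicWords as $word ) {
    if ( preg_match( "/__{$word}__/i", $text ) ) {
        $found[$word] = true;
        $text = preg_replace( "/__{$word}__/i", '', $text );
    }
}
// Afterwards $found is array( 'NOTOC' => true ), and "__FOO__" stays
// in the text untouched - i.e. __notoc__ keeps working, and anything
// unrecognised still renders as literal text.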
- HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.
<snip>
I'm not sure why you think allowing all translations on all Wikipedias would be a "step backwards"? Or do you seriously think someone would use the Chinese translation of <math> on the English Wikipedia? :)
It's a step backwards because instead of being able to say "yes, we have an i18n system for you to customise these 'magic words' according to your language and preference", we will be saying "the parser knows a few variants for each of these 'magic words'; if you want to add any more, tell us and we'll recompile the parser for you". In my opinion this is quite a big deal - essentially, we are dropping a feature of the software. Not that I've come up with a workable alternative yet; it just seems a shame.
And no, I don't think anyone will need the Chinese translation of <math> on the English Wikipedia - and for precisely that reason I see no need for the English Wikipedia to be parsing for it.
- We could replace the "other-language" words with the "this-language" words upon save. I.e. if someone wrote <math> on the Chinese Wikipedia, it would automatically be changed into "<" + some Chinese characters + ">" before storing it in the DB.
Sorry, I'm not with you on this one - are you suggesting that the Chinese parser be specifically compiled to cope only with the Chinese magic words? If so, perhaps you misunderstood my problem with having all variants coded in: it's not that they would all work everywhere, but that adding new ones would (if I understand the whole yacc concept correctly) require recompiling a new parser, rather than just tweaking the appropriate language or settings file.
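If I *have* read you right, the rewrite-on-save step would amount to something like this (a sketch with invented names; the mapping would live in the language file):

<?php
// Sketch of what I think is being suggested: on save, rewrite the
// canonical tag names into the local language's names, so that wiki's
// parser only ever needs to know one spelling of each.
// $localNames is imagined as coming from the language file,
// e.g. array( 'math' => whatever the local word for <math> is ).
function localiseTagsOnSave( $text, $localNames ) {
    foreach ( $localNames as $canonical => $local ) {
        $text = preg_replace( "/<(\/?)\s*$canonical\s*>/i", "<\$1$local>", $text );
    }
    return $text;
}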
- Alternatively, we could have the parser recognise only the canonical (English) words, and have the PHP software replace non-English magic words with the canonical (English) words before invoking the parser. I am uncomfortable with this solution because it resorts to the same kind of patchwork that is irking me about the current not-a-parser.
I agree, this would not be at all elegant.
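In practice it boils down to a blind pre-pass along these lines (sketch only, names invented):

<?php
// Sketch of the "patchwork" pre-pass: the PHP side maps every
// localised magic word back to its canonical English form before the
// real parser ever sees the text.  $translations imagined as coming
// from the language file, e.g. array( '#WEITERLEITUNG' => '#REDIRECT' ).
function canonicaliseMagicWords( $text, $translations ) {
    foreach ( $translations as $local => $canonical ) {
        $text = preg_replace( '/' . preg_quote( $local, '/' ) . '/i', $canonical, $text );
    }
    return $text;
}

A straight text substitution like that would presumably also clobber things inside <nowiki> and comments, which is exactly the sort of patchwork in question.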
Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use an HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags.
This is an interesting thought, but I think it is inefficient with regards to performance. If the parser knows about allowed HTML tags (and the difference between an HTML tag and an extension) beforehand, this extra step would be saved.
Yeah, as soon as I wrote it, I realised that it would end up rather expensive efficiency-wise.
Additionally, your idea works only for tags that are independent of other tags; it would not work well with tables.
Indeed, I hadn't thought of the necessity to parse whole bunches of HTML tags relative to each other. I suppose you could have the allowed HTML tags explicitly defined, and a "not-extension" that spat back anything which looked like an extension but wasn't; but it would still mean that every disallowed HTML or HTML-like tag added an extra call to the parser, and I'm not sure whether that would be an acceptable price.
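For what it's worth, the shape I had in mind was roughly this (pseudo-PHP, every name invented, just to pin down what I meant):

<?php
// Rough shape of the "extension returns a parse sub-tree" idea.
// None of this is real MediaWiki code.
interface ParserExtension {
    // Given the raw text between its tags, return a node/sub-tree to
    // be spliced into the main parse tree.
    function handle( $innerText, $parser );
}

// The "HTML extension": registered for every allowed HTML tag, it just
// hands the contents straight back to the parser, so allowed HTML ends
// up as an ordinary sub-tree.  This call-back-into-the-parser step is
// the extra cost being discussed.
class HtmlTagExtension implements ParserExtension {
    function handle( $innerText, $parser ) {
        return $parser->parseToTree( $innerText );   // invented method name
    }
}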
So maybe you're right, and the only workable solution is to have all variants hard-coded in the parser. I guess this is where we come to regret adopting an "extension" syntax that matches/conflicts with the syntax used by "allowed bits of HTML".
If we want to keep true customisation of magic words (where "editing the source" != "customisation"), the best idea I've come up with is:

1) Hard-code all allowed HTML into the parser. This means maximum efficiency for those bits, and the ability to handle relationships between them, etc.

2) Treat everything else matching "<" + some_letters + ">" as an "extension" and spew out its contents as one element of the parse tree. If the receiving PHP script then says "there's no such extension", it escapes the "<" and ">" and passes the contents back to be parsed normally.
This would be acceptable IFF a) cases of erroneous <tags> were pretty rare, so the load created by parsing them separately was not too high; AND b) it was acceptable that erroneous tags inside complex HTML structures would break that structure - e.g. "<table><tr><foo><td>Some text</td></foo></tr></table>" would not parse correctly, because the second call to the parser would contain just "<td>Some text</td>", which would probably be impossible to parse properly in isolation. Of course, even this could be avoided if the entire text were re-parsed, with the erroneous tags escaped, but that's adding yet more overhead...
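To make 2) concrete, the fallback I have in mind looks something like this (again a sketch, all names invented):

<?php
// Sketch of step 2): the parser has produced a node for something that
// looked like <foo>...</foo>; the PHP side checks its extension
// registry, and if nothing is registered it escapes the brackets and
// runs the contents back through the parser as ordinary wikitext.
function handleExtensionNode( $name, $inner, $extensions, $parser ) {
    if ( isset( $extensions[$name] ) ) {
        // A real extension: it renders its own contents.
        return call_user_func( $extensions[$name], $inner );
    }
    // No such extension: the tag becomes literal text, and the
    // contents get a second trip through the parser - the extra load
    // in a), and the loss of surrounding context in b).
    return htmlspecialchars( "<$name>" )
         . $parser->parse( $inner )               // invented method name
         . htmlspecialchars( "</$name>" );
}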