Test my lex/yacc parser!

List overview All Threads
Download

newer

older

HEADS UP: major database schema...

No Webalizer stats since September...

Timwi

18 Sep 2004 18 Sep '04

11:19 a.m.

Hi!

I've posted about this before, but here it is again! :) In my lex/yacc-based parser, I have now finished the table wiki syntax as well as template inclusions. Please test everything! :)

http://timwi.dyndns.org/wiki/tmp.php

In particular, please test to see whether any invalid mark-up produces an error. It shouldn't. It is always supposed to generate *something*. If you do find an error, please e-mail me the full test case you entered that produces it. (Alternatively, of course, ICQ and AIM work just as well.)

Things I know I haven't done yet: - HTML tags (currently everything between <something> and </something> is treated as an 'extension'; I need to limit that to <nowiki>, <pre>, <math>, <hiero>, <music> and <chem> for now) - [http://url/ These sort of links] - hrs (horizontal rules) - the new -{ language variants }- syntax

If there is absolutely anything else missing, please let me know!

(Please notice that the above URL is a bit slow at times when my computer is a bit loaded. Also notice that it is only available when my computer is turned on, which is during the day-time in the UK.)

Thanks! Timwi

Show replies by date

Jens Frank

18 Sep 18 Sep

11:45 a.m.

On Sat, Sep 18, 2004 at 10:19:56AM +0100, Timwi wrote:

...

http://timwi.dyndns.org/wiki/tmp.php
In particular, please test to see whether any invalid mark-up produces an error.

There is no invalid Wiki markup :-)

JeLuF

Jens Frank

11:48 a.m.

On Sat, Sep 18, 2004 at 10:19:56AM +0100, Timwi wrote:

...

Hi!

I've posted about this before, but here it is again! :) In my lex/yacc-based parser, I have now finished the table wiki syntax as well as template inclusions. Please test everything! :)
http://timwi.dyndns.org/wiki/tmp.php

INPUT: {{stub}} OUTPUT: <error />

Regards,

JeLuF

Jens Frank

11:53 a.m.

On Sat, Sep 18, 2004 at 10:19:56AM +0100, Timwi wrote:

...

Hi!

I've posted about this before, but here it is again! :) In my lex/yacc-based parser, I have now finished the table wiki syntax as well as template inclusions. Please test everything! :)
http://timwi.dyndns.org/wiki/tmp.php

INPUT: [[stub|''link]]trail'' OUTPUT: Proxy error message: "Document contains no data". Segmentation fault?

Regards,

JeLuF

Timwi

12:47 p.m.

Jens Frank wrote:

...

INPUT: {{stub}} OUTPUT:

<error />

Please send me your *entire* test string, as otherwise I can't reproduce the problem! If I type just "{{stub}}", with or without a trailing newline, I do not get an error. I get:

...

INPUT: [[stub|''link]]trail'' OUTPUT: Proxy error message: "Document contains no data". Segmentation fault?

Whoa! Thanks loads for this one, Segmentation faults are of course very important to catch.

The error was that in parsetree.c:convertPipeSeriesToText(), I used a freeRecursively() where it should have been just free(). While I was at it, I've also performance-optimised it. :)

Thanks, Timwi

Stephan Walter

8:08 p.m.

Timwi wrote:

...

If there is absolutely anything else missing, please let me know!

Try some UTF-8 salad:

* a Umlaut: ä * o Umlaut: ö * u Umlaut: ü * c cedilla: ç * Euro: € * en-dash: –

The euro and en-dash sign just get removed, the others are replaced by some &#xxx; stuff.

This is probably just a problem with the different encodings of php/lex, but still annoying for most non-english speakers...

Cheers, Stephan

Timwi

19 Sep 19 Sep

12:06 a.m.

Stephan Walter wrote:

...

Try some UTF-8 salad:

The page has its encoding declared as ISO-8859-1. The behaviour you described is correct in this case.

Timwi

Rowan Collins

7:14 p.m.

On Sat, 18 Sep 2004 10:19:56 +0100, Timwi timwi@gmx.net wrote:

...

Things I know I haven't done yet:

HTML tags (currently everything between <something> and </something> is treated as an 'extension'; I need to limit that to <nowiki>, <pre>, <math>, <hiero>, <music> and <chem> for now)

[http://url/ These sort of links]

hrs (horizontal rules)

the new -{ language variants }- syntax

If there is absolutely anything else missing, please let me know!

Have you worked out how to deal with "MagicWord" i18n yet? I ask because in the current not-a-parser, there is no definition of what a magic word looks like, just a class that can be asked "is this a one of those". Probably with a properly defined grammar, we would have to limit the style of magic-words, so that just likely tokens could be checked against the current list of magicwords. [I don't actually know if any translations make use of this feature, but it would be a shame to lose it.] I think all default magic words currently fall into one of: "#"<word> (as in "#REDIRECT") "__"<word>"__" (as in "__NOTOC__" et al) "<"<word>">" and matching "</"<word>">" (for extensions, and whitelisted HTML tags)

You may already have thought this through and come to your own conclusions, but this approach certainly seems more efficient than having to check *every* token against a run-time list.

-- Rowan Collins BSc [IMSoP]

Timwi

20 Sep 20 Sep

1:05 a.m.

Rowan Collins wrote:

...

Have you worked out how to deal with "MagicWord" i18n yet?

Not entirely, but mostly :)

Here are my thoughts:

* Redirects are not passed through the lex/yacc parser at all. They can be recognised with a regular expression that takes the magic words into account.

* Things like __NOTOC__ and stuff can be handled like this: * Regard *everything* of the form __CAPITALLETTERS__ as a special token * Have the post-processing step remove and process them * What to do with unrecognised ones is open to debate. Options: 1. turn them back into text 2. ignore them. They're rare enough to require <nowiki> if you want to actually write them.

* The "Media", "Image" and other namespaces are handled in post-processing. The parser sees them as text within the link target.

* The template pseudo-variables (e.g. CURRENTMONTH) are similarly handled in post-processing.

* HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.

Timwi

Rowan Collins

2:56 p.m.

On Mon, 20 Sep 2004 00:05:23 +0100, Timwi timwi@gmx.net wrote:

...

Rowan Collins wrote:

...
Have you worked out how to deal with "MagicWord" i18n yet?

Not entirely, but mostly :)

Here are my thoughts:

Redirects are not passed through the lex/yacc parser at all. They can be recognised with a regular expression that takes the magic words into account.

I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]". It's always annoyed me that it parses as though it were a numbered list. I was hoping we could then post-process it to say "This is a *broken* redirect", and even "This is a double redirect (and therefore broken)" etc. How hard would it be to recognise "first token of text begins with #"

...

Things like __NOTOC__ and stuff can be handled like this:

Regard *everything* of the form __CAPITALLETTERS__ as a special token

Actually, it can be lower case currently. Unless we're going to hunt the database for examples where it is, best just treat __anystringofletters__ as needing to be investigated.

...

The template pseudo-variables (e.g. CURRENTMONTH) are similarly handled in post-processing.

By which, do you mean they are treated as templates and then recognised as magic after? Just curious.

...

HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.

That seems a bit of a step backwards to me. Actually, everything that looks like a SGML tag has to be treated one of three ways:

a) it is an extension, and everything from there to its partner should be unparsed / sent somewhere else for parsing b) it's an allowed HTML tag, and should be put in the parse-tree as that kind of element, with its contents parsed "independently" (sort of) c) it is neither of the above, and needs entity escaping so that it doesn't get as far as the browser still looking like HTML

Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use a HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags. Similarly, a no-match handler would escape the tags in question and then parse the whole string back for normal parsing. Or would that be hopelessly inefficient?

-- Rowan Collins BSc [IMSoP]

Ævar Arnfjörð Bjarmason

3:18 p.m.

On Mon, 20 Sep 2004 13:56:36 +0100, Rowan Collins rowan.collins@gmail.com wrote:

...

On Mon, 20 Sep 2004 00:05:23 +0100, Timwi timwi@gmx.net wrote:

...
Rowan Collins wrote:

...
Have you worked out how to deal with "MagicWord" i18n yet?

Not entirely, but mostly :)

Here are my thoughts:

Redirects are not passed through the lex/yacc parser at all. They can be recognised with a regular expression that takes the magic words into account.

I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]". It's always annoyed me that it parses as though it were a numbered list. I was hoping we could then post-process it to say "This is a *broken* redirect", and even "This is a double redirect (and therefore broken)" etc. How hard would it be to recognise "first token of text begins with #"

How about just changing the first redirect to point to the page the second one is pointing to? That is change: A -> B -> C to A -> C so when people show up at a they just see "this is a redirect" rather than "this is a broken redirect" due to the software having solved that automatically.

...

...

Things like __NOTOC__ and stuff can be handled like this:

Regard *everything* of the form __CAPITALLETTERS__ as a special token

Actually, it can be lower case currently. Unless we're going to hunt the database for examples where it is, best just treat __anystringofletters__ as needing to be investigated.

...

The template pseudo-variables (e.g. CURRENTMONTH) are similarly handled in post-processing.

By which, do you mean they are treated as templates and then recognised as magic after? Just curious.

...

HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.

That seems a bit of a step backwards to me. Actually, everything that looks like a SGML tag has to be treated one of three ways:

a) it is an extension, and everything from there to its partner should be unparsed / sent somewhere else for parsing b) it's an allowed HTML tag, and should be put in the parse-tree as that kind of element, with its contents parsed "independently" (sort of) c) it is neither of the above, and needs entity escaping so that it doesn't get as far as the browser still looking like HTML

Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use a HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags. Similarly, a no-match handler would escape the tags in question and then parse the whole string back for normal parsing. Or would that be hopelessly inefficient?

-- Rowan Collins BSc [IMSoP]

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Timwi

5:40 p.m.

Rowan Collins wrote:

...

I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]".

This has already been done in 1.4.

...

...

Things like __NOTOC__ and stuff can be handled like this:

Regard *everything* of the form __CAPITALLETTERS__ as a special token

Actually, it can be lower case currently. Unless we're going to hunt the database for examples where it is, best just treat __anystringofletters__ as needing to be investigated.

Indeed. I didn't know that. But it isn't a problem at all. Even with it being case-insensitive, I don't think it's asking too much of the users to put <nowiki> around anything that looks like these, since they are rarely enough intended to be actual text. I would highly doubt that any significant amount of articles is currently relying on them being text.

...

...

The template pseudo-variables (e.g. CURRENTMONTH) are similarly handled in post-processing.

By which, do you mean they are treated as templates and then recognised as magic after? Just curious.

Yep, that's right.

...

...

HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.

That seems a bit of a step backwards to me. Actually, everything that looks like a SGML tag has to be treated one of three ways:

a) it is an extension, and everything from there to its partner should be unparsed / sent somewhere else for parsing b) it's an allowed HTML tag, and should be put in the parse-tree as that kind of element, with its contents parsed "independently" (sort of) c) it is neither of the above, and needs entity escaping so that it doesn't get as far as the browser still looking like HTML

I am perfectly happy with this, but since the parser is a stand-alone module, I cannot treat a particular word as case (a) on one Wikipedia but case (c) on another.

I'm not sure why you think allowing all translations on all Wikipedias would be a "step backwards"? Or do you seriously think someone would use the Chinese translation of <math> on the English Wikipedia? :)

But if you still insist on this, then I have two suggestions:

* We could replace the "other-language" words with the "this-language" words upon save. I.e. if someone wrote <math> on the Chinese Wikipedia, it would automatically be changed into "<" + some Chinese characters + ">" before storing it in the DB.

* Alternatively, we could have the parser recognise only the canonical (English) words, and have the PHP software replace non-English magic words with the canonical (English) words before invoking the parser. I am uncomfortable with this solution because it resorts to the same kind of patchwork that is erking me about the current not-a-parser.

...

Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use a HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags.

This is an interesting thought, but I think it is inefficient with regards to performance. If the parser knows about allowed HTML tags (and the difference between an HTML tag and an extension) beforehand, this extra step would be saved. Additionally, your idea works only for tags that are independent of other tags; it would not work well with tables.

Timwi

Rowan Collins

23 Sep 23 Sep

7:16 p.m.

On Mon, 20 Sep 2004 16:40:30 +0100, Timwi timwi@gmx.net wrote:

...

Rowan Collins wrote:

...
I know the current version doesn't do anything, but I've been meaning for a while to finalise a patch to show a message saying "This is a redirect to [[foo]]".

This has already been done in 1.4.

Hm, figures! I really must start finishing what I start, rather than leaving it on a "to-do" list for six months and then discovering its redundant... Still, doesn't this mean the parser needs to recognise "#REDIRECT <linkpattern>" as a special token? And doesn't that, in turn, present a problem if we want to retain MagicWord i18n?

...

...
...

Things like __NOTOC__ and stuff can be handled like this:

Regard *everything* of the form __CAPITALLETTERS__ as a special token

Actually, it can be lower case currently.

Indeed. I didn't know that. But it isn't a problem at all. Even with it being case-insensitive, I don't think it's asking too much of the users to put <nowiki> around anything that looks like these, since they are rarely enough intended to be actual text. I would highly doubt that any significant amount of articles is currently relying on them being text.

True. So long as the ones that function now will still function, we should be fine.

...

...
...

HTML tags and extension names are either not internationalised, or all translations of them are made to work on all Wikipedias.

<snip>

...

I'm not sure why you think allowing all translations on all Wikipedias would be a "step backwards"? Or do you seriously think someone would use the Chinese translation of <math> on the English Wikipedia? :)

It's a step backwards because instead of being able to say "yes, we have an i18n system for you to customise these 'magic words' according to your language and preference", we will be saying "the parser knows a few variants for each of these 'magic words'; if you want to add any more, tell us, and we'll recompile the parser for you". In my opinion this is quite a big deal - essentially, we are dropping a feature of the software. Not that I've come up with a workable alternative yet, it just seems a shame.

And no, I don't think anyone will need the Chinese translation of <math> on the English Wikipedia, and for that precise reason see no reason the English wikipedia should be parsing for it.

...

We could replace the "other-language" words with the "this-language" words upon save. I.e. if someone wrote <math> on the Chinese Wikipedia, it would automatically be changed into "<" + some Chinese characters + ">" before storing it in the DB.

Sorry, I'm not with you on this one - are you suggesting that the Chinese parser be specifically compiled to only cope with the Chinese magic words? If so, perhaps you misunderstood my problem with having all variants coded in: its not that they will all work everywhere, but that adding new ones would (if I understand the whole yacc concept correctly) require recompiling a new parser, rather than just tweaking the appropriate language or settings file.

...

Alternatively, we could have the parser recognise only the canonical (English) words, and have the PHP software replace non-English magic words with the canonical (English) words before invoking the parser. I am uncomfortable with this solution because it resorts to the same kind of patchwork that is erking me about the current not-a-parser.

I agree, this would not be at all elegant.

...

...
Perhaps extensions could be made to return a parse sub-tree (even if it only has one element). Then we could use a HTML "extension" bound to all allowed HTML tags, which just called the original parser back on the contents of the tags.

This is an interesting thought, but I think it is inefficient with regards to performance. If the parser knows about allowed HTML tags (and the difference between an HTML tag and an extension) beforehand, this extra step would be saved.

Yeah, as soon as I wrote it, I realised that it would end up rather expensive efficiency-wise.

...

Additionally, your idea works only for tags that are independent of other tags; it would not work well with tables.

Indeed, I hadn't thought of the necessity to parse whole bunches of HTML tags relative to each other. I suppose you could have HTML tags explicitly defined, and have a not-extension that spat back things which looked like extensions but weren't; it would still mean that any disallowed HTML or HTML-like tag would add an extra call to the parser, and I'm not sure if that would be an acceptable price or not.

So maybe you're right, and the only workable solution is to have all variants hard-coded in the parser. I guess this is where we come to regret adopting an "extension" syntax that matches/conflicts with the syntax used by "allowed bits of HTML".

If we want to keep true customisation of magic words (where "editing the source" != "customisation") the best idea I've come up with is: 1) hard-code all allowed HTML into the parser. This means maximum efficiency for those bits, and the ability to handle relationships between them, etc. 2) treat everything else matching "<"+some_letters+">" as an "extension" and spew out its contents as one element of the parse tree. If the receiving PHP script then says "there's no such extension", it escapes the "<" and ">", and passes the contents back to be parsed normally.

This would be acceptable IFF a) cases of erroneous <tags> were pretty rare, so the load created by parsing them separately was not too high; AND b) it was acceptable that erroneous tags inside complex HTML structures would break that structure - e.g. "<table><tr><foo><td>Some text</td></foo></tr></table>" would not parse correctly, because the second call to the parser would contain just "<td>Some text</td>", which would probably be impossible to parse properly in isolation. Of course, even this could be avoided if the entire text were re-parsed, with the erroneous tags escaped, but that's adding yet more overhead...

-- Rowan Collins BSc [IMSoP]

Timwi

9:09 p.m.

Hi,

...

Still, doesn't this mean the parser needs to recognise "#REDIRECT <linkpattern>" as a special token? And doesn't that, in turn, present a problem if we want to retain MagicWord i18n?

Not really. We can still recognise redirects with a regexp (or anything else in PHP) before passing the page to the parser.

...

...
I'm not sure why you think allowing all translations on all Wikipedias would be a "step backwards"? Or do you seriously think someone would use the Chinese translation of <math> on the English Wikipedia? :)

It's a step backwards because instead of being able to say "yes, we have an i18n system for you to customise these 'magic words' according to your language and preference", we will be saying "the parser knows a few variants for each of these 'magic words'; if you want to add any more, tell us, and we'll recompile the parser for you".

First of all, even in the current system there is no way for server admins to customise the magic words without modifying actual source code. Secondly, you're making it sound like recompiling the parser was some sort of monumental task.

Here's an idea. One could provide a .c or .h file where #define statements are used to define the magic words, and then make sure that if you modify it, you only need to recompile the binary (i.e. invoke gcc) but you don't need flex, bison, or swig. But even if you were to require flex, bison and swig, even then the recompilation can be automated by a simple script.

Here's another reason why I think the parser should recognise all variants of the magic words. Think about the alternative. The alternative is to have "<xyz>" mean "invoke the math extension" on one Wikipedia, but "<xyz>" on another. What is the point in allowing for articles to exist that rely on "<xyz>" meaning "<xyz>" when it means "math" elsewhere?

...

And no, I don't think anyone will need the Chinese translation of <math> on the English Wikipedia, and for that precise reason see no reason the English wikipedia should be parsing for it.

It doesn't make a performance difference, if that's what you're worrying about.

...

...

We could replace the "other-language" words with the "this-language" words upon save. I.e. if someone wrote <math> on the Chinese Wikipedia, it would automatically be changed into "<" + some Chinese characters + ">" before storing it in the DB.

Sorry, I'm not with you on this one - are you suggesting that the Chinese parser be specifically compiled to only cope with the Chinese magic words? If so, perhaps you misunderstood my problem with having all variants coded in: its not that they will all work everywhere, but that adding new ones would (if I understand the whole yacc concept correctly) require recompiling a new parser, rather than just tweaking the appropriate language or settings file.

To the first part: No, the idea was to have one parser that recognises everything, but to expose to the Chinese users only the Chinese variant even if someone typed in the Swahili one. (Haha.) But of course that doesn't solve your problem. I mentioned above that recompiling the parser is by no means difficult, and that the current system also requires editing source code, but if you still think that it's a problem, then we're stuck because I don't know what else we can do.

...

So maybe you're right, and the only workable solution is to have all variants hard-coded in the parser. I guess this is where we come to regret adopting an "extension" syntax that matches/conflicts with the syntax used by "allowed bits of HTML".

True. If we had something like [!math x^2 + y^2 = z^2 !], then we could say "everything in [! ... !] is an extension". Would make life much easier.

...

If we want to keep true customisation of magic words (where "editing the source" != "customisation") the best idea I've come up with is:

hard-code all allowed HTML into the parser. This means maximum

efficiency for those bits, and the ability to handle relationships between them, etc. 2) treat everything else matching "<"+some_letters+">" as an "extension" and spew out its contents as one element of the parse tree. If the receiving PHP script then says "there's no such extension", it escapes the "<" and ">", and passes the contents back to be parsed normally.

Unfortunately, this opens a whole other can of worms. What if there is no end tag? What if there is other mark-up partly inside and partly outside the "extension" block?

He said, "What the ''<swearword>'' is going on?"

I think, considering all of these problems we have discussed, it makes a real lot of sense to formulate a "rule" that the design of the parser should fulfill: The parser must know in advance how to parse everything. The resulting parse tree must not depend on anything other than the input wiki text.

Greetings, Timwi

Rowan Collins

24 Sep 24 Sep

1:36 a.m.

On Thu, 23 Sep 2004 20:09:21 +0100, Timwi timwi@gmx.net wrote:

...

...
Still, doesn't this mean the parser needs to recognise "#REDIRECT <linkpattern>" as a special token? And doesn't that, in turn, present a problem if we want to retain MagicWord i18n?

Not really. We can still recognise redirects with a regexp (or anything else in PHP) before passing the page to the parser.

But why make that a special case? Why say "before using the nice eficient real parser, use a not-a-parser to check for the #REDIRECT directive, and have it do some voodoo" Far better to just have the parser recognise "#REDIRECT" (and any variants anyone wants) and output a parse tree with a special redirect node.

...

First of all, even in the current system there is no way for server admins to customise the magic words without modifying actual source code.

Well, technically, no, but Language*.php and LocalSettings.php are more like configuration files that happen to be executable for convenience. Editing the declaration of $wgMagicWordsEn in Language.php is no more difficult or involved than, say, editing a .ini file.

...

Secondly, you're making it sound like recompiling the parser was some sort of monumental task.

Actually, I have to admit I had no idea how difficult it would be, but I assumed it would mean having at least a compiler, if not a compiler-compiler and a whole load of other tools. Editing PHP doesn't need that kind of thing, and the way its designed now, you needn't notice your editing code.

...

Here's an idea. One could provide a .c or .h file where #define statements are used to define the magic words, and then make sure that if you modify it, you only need to recompile the binary (i.e. invoke gcc) but you don't need flex, bison, or swig. But even if you were to require flex, bison and swig, even then the recompilation can be automated by a simple script.

If it were possible to only require a c compiler, it would certainly be a favour to other admins running MediaWiki. It's going to be annoying enough for some of them to have to deal with a binary part as well as PHP.

...

...
So maybe you're right, and the only workable solution is to have all variants hard-coded in the parser. I guess this is where we come to regret adopting an "extension" syntax that matches/conflicts with the syntax used by "allowed bits of HTML".

True. If we had something like [!math x^2 + y^2 = z^2 !], then we could say "everything in [! ... !] is an extension". Would make life much easier.

It's oh so tempting to say "let's change it" but a) I'd be mobbed by everyone who voted for the current syntax (which includes myself) and b) we'd have to go through changing exisitng uses of <math>, or make it a special case, or something.

...

I think, considering all of these problems we have discussed, it makes a real lot of sense to formulate a "rule" that the design of the parser should fulfill: The parser must know in advance how to parse everything. The resulting parse tree must not depend on anything other than the input wiki text.

Yep, I think you're probably right on that one. And as you say, the more things that can e done inside the parser, the better, since outside means PHP, and is likely to be less efficient.

-- Rowan Collins BSc [IMSoP]

Timwi

29 Sep 29 Sep

10:36 p.m.

Sorry for the late reply.

Rowan Collins wrote:

...

On Thu, 23 Sep 2004 20:09:21 +0100, Timwi timwi@gmx.net wrote:

...
Not really. We can still recognise redirects with a regexp (or anything else in PHP) before passing the page to the parser.

But why make that a special case? Why say "before using the nice eficient real parser, use a not-a-parser to check for the #REDIRECT directive, and have it do some voodoo" Far better to just have the parser recognise "#REDIRECT" (and any variants anyone wants) and output a parse tree with a special redirect node.

Why is that "better"? I prefer my suggestion because: * it might be more efficient because it means that we don't have to invoke the external parser just to find out whether what the user just submitted is a redirect or not * it means the parser needn't be programmed to recognise redirects (makes the code simpler) * it means we can assume that parse trees will be articles. Otherwise all output code would have to consider this special case. How should a class that is supposed to output LaTeX code react when you give it a redirect?

...

...
First of all, even in the current system there is no way for server admins to customise the magic words without modifying actual source code.

Well, technically, no, but Language*.php and LocalSettings.php are more like configuration files that happen to be executable for convenience. Editing the declaration of $wgMagicWordsEn in Language.php is no more difficult or involved than, say, editing a .ini file.

True. As I said, we can make it work quite the same way using #defines, except of course that the module would need to be recompiled.

...

Actually, I have to admit I had no idea how difficult it would be, but I assumed it would mean having at least a compiler, if not a compiler-compiler and a whole load of other tools. Editing PHP doesn't need that kind of thing, and the way its designed now, you needn't notice your editing code.

That is also true. But I really don't see why it's so hard to have a compiler?

...

If it were possible to only require a c compiler, it would certainly be a favour to other admins running MediaWiki. It's going to be annoying enough for some of them to have to deal with a binary part as well as PHP.

As I mentioned before, it is *not* necessary for anyone to "deal with" anything. People can continue to use the old not-a-parser if they want!

...

...
I think, considering all of these problems we have discussed, it makes a real lot of sense to formulate a "rule" that the design of the parser should fulfill: The parser must know in advance how to parse everything. The resulting parse tree must not depend on anything other than the input wiki text.

Yep, I think you're probably right on that one. And as you say, the more things that can e done inside the parser, the better, since outside means PHP, and is likely to be less efficient.

I had an alternative idea. Currently I'm passing the wiki text as a string parameter to the function that does the actual parsing. I could have it accept a second parameter, a language code, which would influence the magic words.

Just an idea.

Timwi

Rowan Collins

30 Sep 30 Sep

12:21 a.m.

On Wed, 29 Sep 2004 22:36:36 +0200, Timwi timwi@gmx.net wrote:

...

Sorry for the late reply.

No probs: it's about 3 months quicker than my replies to messages from friends seem to end up being... :-/

...

...
...
Not really. We can still recognise redirects with a regexp (or anything else in PHP) before passing the page to the parser.

But why make that a special case? Why say "before using the nice eficient real parser, use a not-a-parser to check for the #REDIRECT directive, and have it do some voodoo" Far better to just have the parser recognise "#REDIRECT" (and any variants anyone wants) and output a parse tree with a special redirect node.

Why is that "better"? I prefer my suggestion because:

Well, why is the new parser in general better? Because it separates out parsing into a proper parser rather than mingling at as special-case checks everywhere, and thus makes the software easier to maintain. This is parsing too, so lets put it in the parser, not in special-case checks everywhere.

...

it might be more efficient because it means that we don't have to invoke the external parser just to find out whether what the user just submitted is a redirect or not

We wouldn't invoke the parser *just* to do that. We'll still need a parse-on-save for things like updating link tables; if it comes back saying "this is a redirect", we use that information to set the is_redirect flag.

...

it means the parser needn't be programmed to recognise redirects (makes the code simpler)

OTOH, the code *outside* the parser will have to, and more: if you want to avoid putting redirects through the parser at all, that code has to do the following: 1] spot that the page is a redirect 2] determine what it is a redirect to 3] find all the article's entries in the links tables (links, brokenlinks, categorylinks) and delete all those except the target of the redirect. 1 has to be done one way or another anyway; 2 and 3 would be natural side-effects of parsing *any* page before save (unless I've completely misunderstood the current code structure, or you have some magic way of avoiding this), so why duplicate them in special case code just for redirects?

...

it means we can assume that parse trees will be articles. Otherwise all output code would have to consider this special case. How should a class that is supposed to output LaTeX code react when you give it a redirect?

No, all output code will need a way of rendering a <redirect> element - that's no more a special case than any other element. If you had an output system that was guaranteed *never* to deal with an "&redirect=no" request, you could simply leave this output undefined; otherwise, the parser and output will have to do *something* with the result, and I've always hated the 'misinterpret it as a numbered list' approach.

I imagine an outputter to something static like LaTeX would want to either: * always follow redirects to their destination (so it won't see any actual redirect pages anyway; except for double-redirects, but they're broken anyway); note that this has nothing to do with the parser whatsoever, since it is a navigation issue, and shouldn't require access to the page's content at all (currently, the page's text *is* accessed, somewhere in Title.php I believe, but it needn't be) or: * render redirects as cross-references [e.g. "Such and such: See So and so"] (in which case having the parser output some explanation that this is a redirect would be very helpful indeed).

...

...
Actually, I have to admit I had no idea how difficult it would be, but I assumed it would mean having at least a compiler, if not a compiler-compiler and a whole load of other tools. Editing PHP doesn't need that kind of thing, and the way its designed now, you needn't notice your editing code.

That is also true. But I really don't see why it's so hard to have a compiler?

No, it's not very hard; it's just harder than not needing one. Remember that most web-hosting is accessed by FTP, not SSH; any compilation has to happen on a different system, usually a home PC; with any luck, things will end up binary compatible with the server. So: plain-text options: great; needs-a-c-compiler options: slightly awkward, but perhaps a necessary evil; needs-a-compiler-compiler options: you're no longer an administrator, you're a developer.

...

...
If it were possible to only require a c compiler, it would certainly be a favour to other admins running MediaWiki. It's going to be annoying enough for some of them to have to deal with a binary part as well as PHP.

As I mentioned before, it is *not* necessary for anyone to "deal with" anything. People can continue to use the old not-a-parser if they want!

For how long? As new features come along, what are the chances that they will be back-ported to the not-a-parser? How many versions will be released before the not-a-parser is completely incompatible with large parts of the codebase? I'm not saying this whole thing is a bad idea - I think it's extremely sensible - just that we do need to minimise the hassle for other MediaWiki admins who want to use the latest version, but don't want to play the role of developer.

BTW, did anyone ever find a compiler-compiler that could output PHP?

-- Rowan Collins BSc [IMSoP]

Timwi

1 Oct 1 Oct

1:55 p.m.

Rowan Collins wrote:

...

OTOH, the code *outside* the parser will have to, and more: if you want to avoid putting redirects through the parser at all, that code has to do the following: 1] spot that the page is a redirect 2] determine what it is a redirect to 3] find all the article's entries in the links tables (links, brokenlinks, categorylinks) and delete all those except the target of the redirect. 1 has to be done one way or another anyway; 2 and 3 would be natural side-effects of parsing *any* page before save (unless I've completely misunderstood the current code structure, or you have some magic way of avoiding this), so why duplicate them in special case code just for redirects?

This was very enlightening and convincing. You're right. The parser should parse redirects.

...

...
As I mentioned before, it is *not* necessary for anyone to "deal with" anything. People can continue to use the old not-a-parser if they want!

For how long? As new features come along, what are the chances that they will be back-ported to the not-a-parser?

You're talking as if you're regarding myself as some sort of authoritative figure who determines the future direction of development. Open source software doesn't work like that, and especially not MediaWiki.

If people need certain features or certain functionality, then they need to make sure that they are programmed; if nobody volunteers to do it, they need to do it themselves. I can't, won't, and should not be required to, take any responsibility for the effects of my parser on other system administrators. If people need new parser features backported, then they need to do it themselves; whether or not this is difficult should not and cannot be a criterion for the direction of the development of my parser.

In the hypothetical scenario you are describing, i.e. the old not-a-parser becoming completely obsolete and outdated, administrators will either have to use the new parser or enhance the old parser for themselves. This is their problem and their decision, not mine.

...

BTW, did anyone ever find a compiler-compiler that could output PHP?

Myself, I haven't even looked. However, I had already retracted my original hypothesis that porting a C-based yacc/bison file to PHP would be easy. I have done quite some C-specific programming here (see parsetree.c), which can certainly be translated to PHP, but not trivially.

Greetings, Timwi

Delirium

20 Sep 20 Sep

9:41 p.m.

Timwi wrote:

...

The "Media", "Image" and other namespaces are handled in post-processing. The parser sees them as text within the link target.

From the point of view of semantics and abstract syntax, it would make some sense for these to be represented in the AST. If it turns out to be difficult to do in lex/yacc that's fine, but perhaps a first-pass postprocessor could stick them into the tree? It'd be particularly useful when retargeting the parser to non-HTML outputs. Depending on implementation, it might even be as efficient or more efficient, since it'd avoid an "if this is really an image, render it this way" check on every link node in the AST.

-Mark

Timwi

23 Sep 23 Sep

9:16 p.m.

Delirium wrote:

...

...

The "Media", "Image" and other namespaces are handled in post-processing. The parser sees them as text within the link target.

perhaps a first-pass postprocessor could stick them into the tree?

This is what I meant by post-processing. The post-processing step is an operation on the parse tree. Its output is the minorly transformed parse tree. After post-processing, the finished parse tree is handed to the output generator ("compiler", if you will).

...

it'd avoid an "if this is really an image, render it this way" check on every link node in the AST.

This sort of check is not only negligibly unnoticeably fast, but it's also unavoidable: Whether the parser does it during parsing, or the post-processor during post-processing, or the output generator during output generation, makes no difference.

Timwi

Nikola Smolenski

20 Sep 20 Sep

11:06 a.m.

On Saturday 18 September 2004 11:19, Timwi wrote:

...

HTML tags (currently everything between <something> and </something> is treated as an 'extension'; I need to limit that to <nowiki>, <pre>, <math>, <hiero>, <music> and <chem> for now)

Will this list be extensible (without recompiling the parser)?

Timwi

23 Sep 23 Sep

9:20 p.m.

Nikola Smolenski wrote:

...

On Saturday 18 September 2004 11:19, Timwi wrote:

...

HTML tags (currently everything between <something> and </something> is treated as an 'extension'; I need to limit that to <nowiki>, <pre>, <math>, <hiero>, <music> and <chem> for now)

Will this list be extensible (without recompiling the parser)?

Not without recompiling the parser, of course. But as I said in an earlier mail, recompiling the parser is easy and fast.

What would the alternative be? If you don't want to have to recompile it, then the parser would have to read and interpret the contents of some file at the beginning of every single invocation. That would take away quite a lot of performance.

Timwi

7387

Age (days ago)

7400

Last active (days ago)

wikitech-l@lists.wikimedia.org

21 comments

7 participants

tags (0)

participants (7)

Delirium
Jens Frank
Nikola Smolenski
Rowan Collins
Stephan Walter
Timwi
Ævar Arnfjörð Bjarmason