2010-09-26 20:57, Aryeh Gregor wrote:
> On Thu, Sep 23, 2010 at 8:47 AM, Andreas Jonsson <andreas.jonsson@kreablo.se> wrote:
>> . . . You can come up with thousands of situations like this, and without a consistent plan on how to handle them, you will need to add thousands of border cases to the code to handle them all.
I have avoided this by simply disabling all HTML block tokens inside wikitext list items. Of course, it may be that someone is actually relying on being able to mix markup in this way, but it doesn't seem likely, as the result tends to be strange.
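To illustrate what I mean by disabling a token production in a context, here is a minimal sketch in C, assuming a lexer that tracks its current block context; the names are hypothetical and this is not taken from the actual grammar:

    #include <stdbool.h>

    /* Block contexts the lexer can be in; names are hypothetical. */
    enum block_context { CTX_TOP_LEVEL, CTX_LIST_ITEM };

    /* HTML block elements such as <div>, <table> and <pre> are only
     * recognized as tokens outside of list items; inside a list item
     * the same text is passed through as ordinary characters. */
    static bool html_block_token_enabled(enum block_context ctx)
    {
        return ctx != CTX_LIST_ITEM;
    }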
> The way the parser is used in real life is that people just write random stuff until it looks right. They wind up hitting all sorts of bizarre edge cases, and these are propagated to thousands of pages by templates.
Yes, this is a problem.
> A pure-PHP parser is needed for end users who can't install binaries, and any replacement parser must be compatible with it in practice, not just on the cases where the pure-PHP parser behaves sanely. In principle, we might be able to change parser behavior in lots of edge cases and let users fix the broken stuff, if the benefit is large enough. But we'd have to have a pure-PHP parser that implements the new behavior too.
ANTLR is a multi-language parser generator. Unfortunately, PHP is not currently among its target languages. Porting the back end to PHP is a feasible task, however. Likewise, porting my parser implementation to PHP is feasible. The question then becomes whether you want to maintain two language versions in order to also have the performance advantage of the C parser.
> The parts you considered to be the hard parts are not that hard.
What support do you have for this claim? Parsing wikitext is difficult because of the any-input-is-valid-wikitext philosophy. Parsing MediaWiki wikitext is very difficult, since it is not designed to be parsable.
I consider the parts I pointed out to be hard because they cannot be implemented with standard parser techniques. I've developed a state machine for enabling/disabling individual token productions depending on context, and I've employed speculative execution in the lexical analyser to support context-sensitive lookahead. I don't believe that you will find these techniques in any textbook on compiler design. So I consider these items hard in the sense that, before I started working on my implementation, it was not at all clear that I would be able to find a workable algorithm.
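As a rough illustration of the speculative lookahead, consider the sketch below; it uses a hypothetical hand-written lexer rather than the ANTLR-generated code, and the exact rule is simplified. Before "[[" can be emitted as a link-opening token, the lexer scans ahead for a matching "]]" and then rewinds, so the speculation consumes no input; if no close is found, the "[[" falls back to plain text.

    #include <stdbool.h>
    #include <stddef.h>

    struct lexer {
        const char *input;   /* NUL-terminated article text */
        size_t      pos;     /* current read position       */
    };

    /* Speculatively decide whether a "[[" at the current position is
     * closed by "]]" before the end of the line (the real rule is more
     * involved).  The position is saved and restored, so the lookahead
     * has no side effects. */
    static bool link_open_is_closed(struct lexer *lx)
    {
        size_t saved = lx->pos;
        bool closed = false;

        while (lx->input[lx->pos] != '\0' && lx->input[lx->pos] != '\n') {
            if (lx->input[lx->pos] == ']' && lx->input[lx->pos + 1] == ']') {
                closed = true;
                break;
            }
            lx->pos++;
        }

        lx->pos = saved;
        return closed;
    }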
As for the apostrophe heuristics, as much as 30% of the CPU time seems to be spent on them, regardless of whether there are any apostrophes in the text. So I consider that hard in the sense that it is a very high cost for very little functionality. It might be possible to get rid of this overhead, though at the cost of higher implementation complexity.
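For readers who have not looked at this corner of the syntax, here is a condensed sketch of what the apostrophe handling involves per line; it is simplified and not taken from either implementation. The point is that every line has to be scanned and the resulting bold/italic toggles rebalanced, whether or not the line contains any apostrophes at all.

    #include <stddef.h>

    enum quote_kind { QUOTE_NONE, QUOTE_ITALIC, QUOTE_BOLD, QUOTE_BOLD_ITALIC };

    /* Classify a run of n consecutive apostrophes, roughly as MediaWiki does:
     * 2 -> italic, 3 -> bold, 4 -> one literal ' plus bold, 5 or more -> both. */
    static enum quote_kind classify_run(size_t n)
    {
        if (n < 2)  return QUOTE_NONE;
        if (n == 2) return QUOTE_ITALIC;
        if (n == 3 || n == 4) return QUOTE_BOLD;
        return QUOTE_BOLD_ITALIC;
    }

    /* Count apostrophe runs on one line.  The scan itself runs over every
     * character of every line, apostrophes present or not. */
    static void scan_line(const char *line, size_t *italic, size_t *bold)
    {
        for (size_t i = 0; line[i] != '\0' && line[i] != '\n'; ) {
            size_t n = 0;
            while (line[i] == '\'') { n++; i++; }
            if (n == 0) { i++; continue; }
            switch (classify_run(n)) {
            case QUOTE_ITALIC:      (*italic)++;            break;
            case QUOTE_BOLD:        (*bold)++;              break;
            case QUOTE_BOLD_ITALIC: (*italic)++; (*bold)++; break;
            default: break;
            }
        }
        /* If *italic and *bold are both odd, MediaWiki reinterprets one bold
         * run as italic plus a literal apostrophe, which requires a second
         * pass over the runs on the line. */
    }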
> We've had lots of parser projects, and I'm sure some have handled those.
Point me to one that has.
> The hard part is coming up with a practical way to integrate the new and simplified parser into MediaWiki in such a way that it can actually be used at some point on sites like Wikipedia. Do you have any plans for how to do that?
Developing a fully featured integration would require a large amount of work, but I wouldn't call it hard. I haven't analysed all of the hooks, and it is possible that some difficulties will turn up when emulating them. Other than that, I cannot see that the integration work would consist of anything but a long list of relatively simple tasks. If a project were launched to perform the integration, I would feel confident that it would reach its goal. Considering the large body of information that is encoded in MediaWiki syntax, I would guess that there is strong interest in actually spending effort on this.
/Andreas