Hello,
During our recent discussion, Jan Hidders said:
Have you seen the parsing code? There is nothing _very light_ about it at the moment.
As soon as I had the opportunity, I located the current Wiki parsing code
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/wikipedia/phpwiki/fpw/wikiPage.php?rev=1.44&content-type=text/vnd.viewcvs-markup).
It occurred to me that while the current code is fine as an initial version, it should be optimized if we expect Wikipedia's traffic to grow significantly, since using PHP regular expressions and string-handling functions to process the markup is inefficient.
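To illustrate what I mean, here is a minimal hypothetical sketch (not the actual wikiPage.php code; the function name and URL scheme are made up) of the multi-pass regex style: each markup rule is a separate preg_replace, so the whole page is re-scanned once per rule.

<?php
// Hypothetical sketch of the multi-pass approach: each markup rule
// is its own preg_replace, so the page text is rescanned per rule.
function renderWikiText($text) {
    // Bold: '''text''' -> <b>text</b>  (one full pass over the page)
    $text = preg_replace("/'''(.*?)'''/s", '<b>$1</b>', $text);
    // Italic: ''text'' -> <i>text</i>  (another full pass)
    $text = preg_replace("/''(.*?)''/s", '<i>$1</i>', $text);
    // Free links: [[Page]] -> a hyperlink  (yet another full pass)
    $text = preg_replace('/\[\[([^]|]+)\]\]/',
        '<a href="/wiki.phtml?title=$1">$1</a>', $text);
    return $text;
}
?>

With a dozen or more rules, every rule costs another scan of the whole article.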
The obvious solution would be to write a "real" one-pass parser. A stand-alone version (or a CGI) would be quickest to write in C; I do not know whether an efficient solution exists in PHP, given the way strings are implemented there. However, executing C code from PHP is complicated in its own right (UNIX pipes being the simplest solution, in my opinion).
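For illustration, here is a rough sketch of the pipes idea; the parser binary and its path (/usr/local/bin/wikiparse) are made up for the example.

<?php
// Rough sketch of the UNIX-pipe approach: feed the raw wiki text to a
// hypothetical stand-alone C parser on stdin, read the HTML from stdout.
function renderViaPipe($wikiText) {
    $spec = array(
        0 => array('pipe', 'r'),   // child's stdin
        1 => array('pipe', 'w'),   // child's stdout
    );
    $proc = proc_open('/usr/local/bin/wikiparse', $spec, $pipes);
    if (!is_resource($proc)) {
        return false;              // fall back to the PHP parser
    }
    fwrite($pipes[0], $wikiText);
    fclose($pipes[0]);             // EOF tells the parser the input is done
    $html = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);
    return $html;
}
?>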
I could try to rewrite the parser in either PHP or C, but first I wanted to ask the members of the list what they think on the subject (how far the code can be optimized, which language to choose, what technique the parser should be written with, which wiki syntax changes should be made, and so on).
Sincerely yours, Uri Yanover
From: "Uri Yanover" uriyan_subscribe@yahoo.com
It occurred to me that while the current code is fine as an initial version, it should be optimized if we expect Wikipedia's traffic to grow significantly, since using PHP regular expressions and string-handling functions to process the markup is inefficient.
The obvious solution would be to write a "real" one-pass parser.
Actually, that is what I first had in mind too, but now I think it would be a bad idea. It makes the code more complicated, and therefore harder to debug and harder to extend with new or changed mark-up features. It is important that the code be kept simple enough that newcomers who find a bug can, in principle, if they know PHP and MySQL, locate the problem and send a patch.
Moreover, I don't think that doing it with regular expressions is all that inefficient. Regular expressions, especially the Perl-compatible ones, are very well implemented in PHP, and most wiki systems do it that way. If you are clever with Perl regular expressions there is still a lot of optimization you can do; one example follows below. If you want inspiration, look at the implementation of PhpWiki.
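To give one example of the kind of trick I mean (a hypothetical sketch of my own, not code from PhpWiki or our tree; the URL scheme in the link rule is made up): combine the per-rule patterns into a single alternation, so PCRE scans the text only once and a callback dispatches on whichever branch matched.

<?php
// Hypothetical sketch: one combined PCRE alternation instead of one
// preg_replace per rule, so the text is scanned only once.
function renderOnePassPcre($text) {
    $pattern = "/'''(.*?)'''|''(.*?)''|\[\[([^]|]+)\]\]/s";
    return preg_replace_callback($pattern, function ($m) {
        // count($m) tells us which alternation branch matched:
        // 2 => bold (group 1), 3 => italic (group 2), 4 => link (group 3).
        switch (count($m)) {
            case 2:  return '<b>' . $m[1] . '</b>';
            case 3:  return '<i>' . $m[2] . '</i>';
            default: return '<a href="/wiki.phtml?title=' . $m[3] . '">'
                          . $m[3] . '</a>';
        }
    }, $text);
}
?>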
I could try to rewrite the parser in either PHP or C, but first I wanted to ask the members of the list what they think on the subject (how far the code can be optimized, which language to choose, what technique the parser should be written with, which wiki syntax changes should be made, and so on).
My advice would be: first try it in PHP with the PCREs. If that doesn't work, write a real one-pass parser in PHP. If that also doesn't work, then you can start thinking about C and, for example, yacc/bison. As for wiki syntax changes, IMO we should first get the current syntax working efficiently and error-free, and only then start thinking about changes and extensions.
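For completeness, here is a toy sketch of what a "real" one-pass parser in PHP could look like (handling only bold; the function name is made up): a single scan over the text with explicit state and no regexes at all.

<?php
// Hypothetical sketch of a one-pass parser in PHP: a single scan over
// the text with explicit state, no regular expressions.
function onePassBold($text) {
    $out  = '';
    $bold = false;                     // state: are we inside '''...'''?
    $len  = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        if (substr($text, $i, 3) === "'''") {
            $out .= $bold ? '</b>' : '<b>';
            $bold = !$bold;            // toggle state
            $i += 2;                   // skip the rest of the marker
        } else {
            $out .= $text[$i];
        }
    }
    if ($bold) $out .= '</b>';         // close any unbalanced markup
    return $out;
}
?>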
Kind regards,
-- Jan Hidders