[Wikipedia-l] Optimizing the Wiki parser

Uri Yanover uriyan_subscribe at yahoo.com
Fri Feb 8 15:15:50 UTC 2002


> Actually that is what I also first had in mind, but now I
> think that this would be a bad idea. It makes the code
> more complicated and therefore harder to debug and
> harder to introduce new and/or changed mark-up
> features. It is important that the code is kept so simple
> that newcomers who find a bug should in principle, if
> they know PHP and MySQL, be able to find the
> problem and send a patch.

Well, while ease of debugging is important, it 
is still possible to write a good parser that is 
easy to debug; keeping the code simple for as long as 
possible might lead to very hasty and ugly attempts 
to optimize when it's too late.

> Moreover, I don't think that doing it with
> regular expressions is that inefficient. The regular
> expressions, especially the Perl compatible ones, are
> very well implemented in PHP and most wiki systems
> do it that way. If you are clever with perl regular expressions
> there is still a lot of optimization that you can do. If you
> want inspiration look at the implementation of PhpWiki.

Looking at PhpWiki was exactly what I did. And I
know that PHP regexps are good, but they still take 
time to be evaluated. As of now (v. 1.51), the code 
uses regexps (and str_replaces) for the following:

1. Image linking
2. External links
3. Wiki variable substitution
4. Wiki-style text formatting
5. Removal of "forbidden tags"
6. Auto-number headings
7. ISBNs

In addition to that, the internal links check explodes the 
text into an array, which is expensive both in terms of passing 
over the text again and in the added use of memory.
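To make the cost concrete, here is a rough sketch (not the actual
Wikipedia code; the function names and patterns are invented) of what
two such passes might look like in PHP:

    // Hypothetical ISBN pass: one preg_replace over the entire text.
    function linkISBNs( $text ) {
        return preg_replace(
            '/ISBN (\d[\d -]{8,}[\dX])/',
            '<a href="/wiki/Special:BookSources/$1">ISBN $1</a>',
            $text );
    }

    // Hypothetical internal-link pass: exploding on "[[" copies the
    // whole text into an array before the pieces are reassembled.
    function linkInternal( $text ) {
        $parts = explode( '[[', $text );
        $out = array_shift( $parts );
        foreach ( $parts as $part ) {
            // ... find the closing "]]", build the link, append ...
            $out .= '[[' . $part;
        }
        return $out;
    }

Each such function walks the whole article text once, so seven or
eight of them in a row means seven or eight full passes.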

Therefore, the code discussed here passes over the text 
at least 8 times, often checking it against 
complicated regular expressions. However good PHP's 
regexp performance may be, converting from one 
complicated markup (Wiki) to another (HTML) is 
simply not the task regular expressions were intended for.

> My advice would be: first try it in PHP with the PCRE's.
> If that doesn't work, write a real one-pass parser in PHP. If that
> also doesn't work, then you can start thinking about C and, for example,
> yacc/bison. As for Wiki syntax changes, IMO we should first get the
> current syntax efficiently and error-free working, and only then start
> thinking about changes and extensions.

For reasons stated above, I don't think PCREs will do
much better; all regexps will have the same problems.
Having thought a bit about the subject, I agree that a one-pass
PHP parser is certainly the easiest solution to write and
maintain; however, I'd feel much better if I knew for sure
that PHP does not impose a penalty for fetching 
characters out of a string one at a time.
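
If it would help, a quick micro-benchmark along these lines
(throwaway code, names invented) could answer that question before
committing to the one-pass design:

    // Rough timing sketch: walk a large string character by character
    // and compare against a single regexp pass over the same text.
    $text = str_repeat( "Some '''wiki''' text with [[links]] in it. ", 20000 );

    $start = microtime( true );
    $count = 0;
    $len = strlen( $text );
    for ( $i = 0; $i < $len; $i++ ) {
        if ( $text[$i] == '[' ) $count++;  // stand-in for real per-character work
    }
    $charTime = microtime( true ) - $start;

    $start = microtime( true );
    preg_replace( '/\[\[([^\]]+)\]\]/', '<a href="$1">$1</a>', $text );
    $regexpTime = microtime( true ) - $start;

    printf( "per-char: %.4fs  regexp: %.4fs\n", $charTime, $regexpTime );

If the per-character loop turns out to be within a small factor of the
regexp pass, the one-pass parser looks viable in pure PHP; if not,
that would be an argument for the C route mentioned above.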

Sincerely yours,
                    Uri Yanover



