Actually, that is what I also first had in mind, but now I think it would be a bad idea. It makes the code more complicated and therefore harder to debug, and harder to introduce new and/or changed mark-up features. It is important that the code is kept simple enough that newcomers who find a bug can, in principle, if they know PHP and MySQL, find the problem and send a patch.
Well, while ease of debugging is important, it is still possible to write a good parser that would be easy to debug; keeping it simple for as long as possible might lead to very hasty and ugly attempts to optimize when it's too late.
Moreover, I don't think that doing it with regular expressions is that inefficient. Regular expressions, especially the Perl-compatible ones, are very well implemented in PHP, and most wiki systems do it that way. If you are clever with Perl regular expressions there is still a lot of optimization that you can do. If you want inspiration look at the implementation of PhpWiki.
Looking at PhpWiki was exactly what I did. And I know that PHP regexps are good, but they still take time to evaluate. As of now (v. 1.51), the code uses regexps (and str_replace calls) for the following:
1. Image linking
2. External links
3. Wiki variable substitution
4. Wiki-style text formatting
5. Removal of "forbidden tags"
6. Auto-numbered headings
7. ISBNs
In addition to that, the internal links check explodes the text into an array, which is expensive both in terms of passing over the text again and in terms of the added memory use.
Therefore, the code discussed here passes over the text at least 8 times, often checking it against complicated regular expressions. However good PHP's regexp performance may be, converting from one complicated markup (Wiki) to another (HTML) is simply not the task regular expressions were intended for.
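Just to make that concrete, each of those passes is roughly of this shape (an illustrative sketch only; the function name and output tags are made up, this is not the actual Wikipedia code):

    // Hypothetical example of one formatting pass: every feature means
    // at least one more preg_replace() walk over the whole article text.
    function format_quotes( $text ) {
        // '''bold''' -> <strong>bold</strong>
        $text = preg_replace( "/'''(.*?)'''/s", '<strong>$1</strong>', $text );
        // ''italic'' -> <em>italic</em>
        $text = preg_replace( "/''(.*?)''/s", '<em>$1</em>', $text );
        return $text;
    }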
My advice would be: first try it in PHP with PCREs. If that doesn't work, write a real one-pass parser in PHP. If that also doesn't work, then you can start thinking about C and, for example, yacc/bison. As for Wiki syntax changes, IMO we should first get the current syntax working efficiently and error-free, and only then start thinking about changes and extensions.
For the reasons stated above, I don't think PCREs will do much better; all regexps will have the same problems. Having thought a bit about the subject, I believe a one-pass PHP parser is certainly the easiest solution to write and maintain; however, I'd feel much better if I knew for sure that PHP does not impose a penalty for fetching characters out of a string one at a time.
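(If anyone wants to measure that, a crude and unscientific test might look something like the following; I have not benchmarked it myself:)

    // Rough timing sketch: per-character string access vs. one PCRE pass
    // over the same text. Only meant for a ballpark comparison.
    function msec() {
        list( $usec, $sec ) = explode( ' ', microtime() );
        return ( (float)$sec + (float)$usec ) * 1000.0;
    }

    $text = str_repeat( "Some ''wiki'' text with [[links]] and '''bold'''. ", 2000 );
    $len  = strlen( $text );

    $t0 = msec();
    $n = 0;
    for ( $i = 0; $i < $len; $i++ ) {      // fetch one character at a time
        if ( $text[$i] == "'" ) $n++;
    }
    $t1 = msec();
    preg_match_all( "/'/", $text, $m );    // let the PCRE engine walk the string
    $t2 = msec();

    print "char loop: " . ( $t1 - $t0 ) . " ms, pcre: " . ( $t2 - $t1 ) . " ms\n";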
Sincerely yours, Uri Yanover
From: "Uri Yanover" uriyan_subscribe@yahoo.com
Well, while ease of debugging is important, it is still possible to write a good parser that would be easy to debug;
Ok. If you say so. :-)
keeping it simple for as long as possible might lead to very hasty and ugly attempts to optimize when it's too late.
You don't know how ugly those attempts will be until you try (or let somebody else try).
In addition to that, the internal links check explodes the text into an array, which is expensive both in terms of passing over the text again and in terms of the added memory use.
A good point, and indeed something that needs to be looked at. But you can also avoid this if you use the regular expressions approach.
[...] However good PHP's regexp performance may be, converting from one complicated markup (Wiki) to another (HTML) is simply not the task regular expressions were intended for.
When the original Wiki mark-up was designed, it was done so as to be efficiently implementable with Perl regular expressions.
-- Jan Hidders
From: "Uri Yanover" uriyan_subscribe@yahoo.com
[...] If you want inspiration look at the implementation of PhpWiki.
Looking at PhpWiki was exactly what I did.
Er, I may have been a bit unclear here. I meant: http://phpwiki.sourceforge.net/
-- Jan Hidders
Uri Yanover wrote:
Well, while ease of debugging is important, it is still possible to write a good parser that would
How do you know that the current parser is bad? Do you have numbers (measurements from the live website) that indicate that the regexp parsing is the bottleneck in the performance of today's Wikipedia website? Or do you just want to write a parser? (I know writing parsers can be great fun, seriously, but I think this discussion should focus on fixing the performance problems.)
Sometimes, a simple access to http://www.wikipedia.com/wiki/Biology takes 11 seconds, sometimes it takes 2 seconds. When it takes longer, is it because too much CPU time is spent in regexp parsing? How can we know this? From profiling the running server? Or is my HTTP request put on hold for some other reason (database locks, swapping, I/O waits, network congestion, ..., or too much CPU time spent on some other task)? If regexp parsing really is the bottleneck, how much more efficient can a new parser be? Twice as fast? Is it possible to add more hardware (multiprocessor) instead?
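If someone with access to the server wants to find out, one crude approach (just a sketch; "wikiTextToHtml" is a made-up name for whatever the conversion function is actually called) would be to time that step and write the result to a log, so it can be compared with the total request time and the time spent in MySQL:

    // Sketch: measure how long the wiki-to-HTML conversion takes per request.
    list( $usec, $sec ) = explode( ' ', microtime() );
    $start = (float)$sec + (float)$usec;

    $html = wikiTextToHtml( $wikiText );   // stand-in for the real parse call

    list( $usec, $sec ) = explode( ' ', microtime() );
    $elapsed = ( (float)$sec + (float)$usec ) - $start;
    error_log( "parse took $elapsed seconds for " . strlen( $wikiText ) . " bytes" );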
Lars Aronsson wrote:
[...] how much more efficient can a new parser be? Twice as fast? Is it possible to add more hardware (multiprocessor) instead?
I'm looking at adding hardware -- this machine is a single-processor Pentium III/866, but it has a dual-processor motherboard. I don't know how much this will help, though.
However, I will say this: based on my extensive experience running busy websites (Bomis), it seems very unlikely that the problem is just _hardware_. I think we have a long way to go in terms of optimizing first.
Just switching from mysql_connect to mysql_pconnect has resulted in a speedup from 1/2 page per second to over 2 pages per second. So that's a fourfold improvement right there. Adding a second processor may speed things up by what, well 100% at _best_, but realistically probably 30%.
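(For anyone curious, the change is essentially this one-line swap; the connection arguments here are placeholders for whatever the wiki's configuration supplies:)

    // Before: a new MySQL connection was opened on every page view.
    $db = mysql_connect( $server, $user, $password );

    // After: mysql_pconnect() reuses a persistent connection held by the
    // web server process when one is available, instead of opening a new one.
    $db = mysql_pconnect( $server, $user, $password );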
Anyhow, I agree with you 100% -- I don't think that regex parsing is the primary bottleneck (although I think we should profile to be sure!). I think it's excessive or inefficient SQL queries.
Is there an easy way in mysql to turn on some logging to see which queries are taking the longest? If we had some metric telling us what our bottleneck is, i.e. things that get called a lot AND that take a long time, then I bet we could end up very pleased with the results.
--Jimbo
Is there an easy way in mysql to turn on some logging to see which queries are taking the longest?
See http://www.mysql.com/doc/L/o/Log_Files.html and esp. http://www.mysql.com/doc/S/l/Slow_query_log.html
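From memory it is roughly a matter of adding something like the following to my.cnf and restarting mysqld, but check the pages above for the exact syntax in the MySQL version we run:

    [mysqld]
    # write queries that take longer than long_query_time seconds to this file
    log-slow-queries = /var/log/mysql/slow-queries.log
    set-variable     = long_query_time=2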
Let me know the results ASAP because I would really like some real data on where the real problems are.
-- Jan Hidders