(Nick Reinking nick@twoevils.org):
I'm actually in the middle of a C project to reduce the wikitext parser to a two-pass parser...
Just to update everybody on my progress with the C wikitext parser:
To do:
 * Lists of any sort
Done:
 * Ignores <math>
 * Converts < > and & inside <nowiki>
 * <pre> (space at beginning of line)
 * <hr> (---- at beginning of line)
 * Sections, subsections, and subsubsections (==, ===, and ==== respectively)
 * Emphasis, strong emphasis, and very strong emphasis ('', ''', and ''''')
 * {{CURRENTMONTH}}, {{CURRENTDAY}}, {{CURRENTYEAR}}, {{CURRENTTIME}}
 * Basic links (http://, ftp://, gopher://, news://, etc.)
 * Complex basic links ([http://... Blah Blah])
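Several of the items above (<pre>, <hr>, and sections) are recognized by what appears at the beginning of a line. A minimal sketch of that classification step, with illustrative names that are not the actual parser's:

```c
#include <string.h>

/* Hypothetical sketch: classify a line of wikitext by its leading
 * characters. The enum and function names are made up for illustration. */
enum line_kind { LINE_TEXT, LINE_PRE, LINE_HR, LINE_SECTION };

enum line_kind classify_line(const char *line)
{
    if (line[0] == ' ')
        return LINE_PRE;                  /* leading space -> <pre> */
    if (strncmp(line, "----", 4) == 0)
        return LINE_HR;                   /* ---- -> <hr> */
    if (strncmp(line, "==", 2) == 0)
        return LINE_SECTION;              /* ==, ===, ==== headings */
    return LINE_TEXT;
}
```

The real parser would of course also have to distinguish ==, ===, and ==== once a section line is found.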
Possibly later:
 * ISBN lookups
 * Handle <math> conversion
Must be done by PHP:
 * Handle links / link lookup
 * Ignore links in <nowiki>
 * ~~~ and ~~~~
 * {{NUMBEROFARTICLES}}, {{CURRENTMONTHNAME}}, {{CURRENTDAYNAME}}
A couple of quick questions: when wikitext is pulled from the database, what are the newlines? Are they always \n? If so, I can clean up the parsing a bit and eke a bit more performance out of it (not a big deal). Also, in what format is the wikitext stored in the database? UTF-8? UTF-16?
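If \n is not guaranteed, the cleanup step being asked about would look something like this in-place normalization (a sketch, assuming CR and CRLF are the only other forms that can appear):

```c
#include <string.h>

/* Sketch: normalize CRLF and bare CR line endings to \n, in place.
 * If the database always stores \n, this pass can be dropped entirely. */
void normalize_newlines(char *s)
{
    char *dst = s;
    for (char *src = s; *src; src++) {
        if (*src == '\r') {
            *dst++ = '\n';
            if (src[1] == '\n')    /* CRLF: skip the LF that follows */
                src++;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
}
```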
As far as performance goes, with what I'm handling now, parsing all the .txt data files in the testsuite (x256 = 492672 lines) runs at about 86600 lines/sec (in an 18KB executable).
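For reference, a lines/sec figure like that can be derived from two clock() samples around the parse loop; at the quoted rate the 492672-line run takes roughly 5.7 seconds. A small helper (the names are illustrative, not from the actual benchmark):

```c
#include <time.h>

/* Sketch: convert a line count plus two clock() samples into a
 * lines-per-second throughput figure. */
long lines_per_sec(long lines_parsed, clock_t start, clock_t end)
{
    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (long)(lines_parsed / secs) : 0;
}
```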