Nick Reinking <nick@twoevils.org> wrote:
Couple quick questions: When Wikitext is pulled from the database, what are the newlines?
MySQL gives back whatever you give it. We generally give it Unix-style text with just \n, but a few browsers might add CRs.
Are they always \n? If so, I can clean up the parsing a bit and eke out a bit more performance (not a big deal).
It shouldn't hurt performance to just ignore and skip CRs. That can be done in the lexer. You should never encounter CR-only line ends.
Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?
Some of the foreign-language wikis use UTF-8; the English one is ISO-8859-1.
As far as performance goes, running everything I handle now against all the .txt data files in the test suite (x256 = 492672 lines), I'm seeing parsing speeds of about 86600 lines/sec (in an 18KB executable).
So on a typical page of, say, 40-50 lines, that makes half a millisecond spent in parsing. If PHP were 100 times worse, it would account for 1/20th of a second per page fetch. Doesn't sound like much of a problem to me, and I doubt it's 1000 times worse.
Just curious: what does your parser do with Quotes.txt from the test suite?