- Should be implemented in the same language (i.e. PHP) so that any
comparisons are comparing apples with apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too); just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
That condition seems bizarre. The parser is either faster or it's slower. Whether it's faster because it's implemented in C is irrelevant: it's faster.
In any case I thought it had been decided that it had to be in PHP?
No, I think if we can get a 20:1 speedup for a C version, they'd take it. :-0
I don't doubt it in the case of most large wiki farms - but numerically, most installations of MediaWiki are small wikis, probably running on shared hosts, and in those situations using a C-based parser is either not possible or significantly more complicated than running a PHP script. So for those installs, if the speed of a PHP parser suddenly gets much worse, then I expect those admins would complain. Whilst a faster parser is a faster parser, if it requires running code that you can't run, then it ain't going to do you much good. A custom super-fast wiki-farm parser is great, but the general-case parser should have similar performance characteristics and the same software requirements (i.e. the test is that nobody should be noticeably worse off).
- Should have a worst-case render time no more than 2x slower than the
current parser on any given input.
Any given? That's not reasonable. Perhaps "Any given existing Wikipedia page"? It would be too easy to find some construct that is rendered quickly by the existing parser but is slow with the new one, then create a page that contained 5000 examples of that construct.
Sure; pathological cases are always possible. Let's say "on any 10 randomly chosen already extant pages of wikitext."
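To be concrete, the kind of measurement I have in mind is just a rough timing comparison along these lines. This is a PHP sketch only - OldParser, NewParser and the pages/ directory are placeholders for whatever the real classes and test data end up being, not existing MediaWiki code:

<?php
// Rough sketch only: OldParser and NewParser stand in for the current
// parser and the candidate replacement; pages/*.wiki would hold the
// wikitext of the 10 randomly chosen pages.

function timeParse( $parser, array $texts ) {
    $start = microtime( true );
    foreach ( $texts as $text ) {
        $parser->parse( $text );
    }
    return microtime( true ) - $start;
}

$texts = array_map( 'file_get_contents', glob( 'pages/*.wiki' ) );

$oldTime = timeParse( new OldParser(), $texts );
$newTime = timeParse( new NewParser(), $texts );

printf( "old: %.3fs  new: %.3fs  ratio: %.2f\n",
    $oldTime, $newTime, $newTime / $oldTime );
// The requirement would be that the ratio stays at or below 2.0.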
The current parser (from my perspective) seems to cope quite well with malformed input, so all I'm saying is that if a replacement parser could behave similarly then that would be good. I take your point, though, that what counts as pathological input could differ between parsers - so let's say that the render time on randomly generated malformed input should be equivalent on average.
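And by "randomly generated malformed input" I don't mean anything more sophisticated than random runs of markup-significant characters - something as crude as this sketch would do:

<?php
// Crude generator of "malformed" wikitext for fuzz-style timing runs:
// just a random stream of markup-significant characters and filler.

function randomMalformedWikitext( $length = 2000 ) {
    $chars = array( '[', ']', '{', '}', '|', '=', "'", '*', '#', ':',
        '<', '>', '!', '-', ' ', "\n", 'a', 'b', 'c' );
    $out = '';
    for ( $i = 0; $i < $length; $i++ ) {
        $out .= $chars[mt_rand( 0, count( $chars ) - 1 )];
    }
    return $out;
}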
The English Wikipedia does provide an excellent environment for testing the English-language case. It does not do the same for other languages. Remember that MediaWiki supports over 250 languages.
Indeed - it's only intended as a test for performance and most functionality. For a more complete compatibility test with a variety of languages, you'd probably need to test against all the database dumps at: http://download.wikimedia.org/
- When running parserTests, should introduce a net total of no more than
(say) 2 regressions (e.g. if you break 5 parser tests, then you have to fix 3 or more parser tests that are currently broken).
I'm not familiar enough with the current set of tests to comment on that.
The core tests are in maintenance/parserTests.txt ( http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/parserTes... ) and generally follow a structure with the name of the test, the wikitext input, and the expected XHTML output, for example:
!! test
Preformatted text
!! input
 This is some
 Preformatted text
 With ''italic''
 And '''bold'''
 And a [[Main Page|link]]
!! result
<pre>This is some
Preformatted text
With <i>italic</i>
And <b>bold</b>
And a <a href="/wiki/Main_Page" title="Main Page">link</a>
</pre>
!! end
It's probably a pretty good place to start with writing a parser, in terms of what the expected behaviour is. Then probably after that comes testing against user-generated input versus the current parser.
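For what it's worth, the file format is simple enough to read yourself - something along these lines would do as a rough sketch (the real runner is maintenance/parserTests.php, which also handles sections like !! article and !! options that I've ignored here):

<?php
// Rough sketch of reading parserTests.txt-style cases; only the test
// name, input and result sections are collected.

function readParserTests( $path ) {
    $tests = array();
    $current = null;
    $section = null;
    foreach ( file( $path ) as $line ) {
        if ( preg_match( '/^!!\s*(\w+)/', $line, $m ) ) {
            $keyword = strtolower( $m[1] );
            if ( $keyword === 'test' ) {
                $current = array( 'test' => '', 'input' => '', 'result' => '' );
                $section = 'test';
            } elseif ( $keyword === 'end' ) {
                if ( $current !== null ) {
                    $tests[] = $current;
                }
                $current = null;
                $section = null;
            } else {
                $section = $keyword; // e.g. 'input' or 'result'
            }
        } elseif ( $current !== null && isset( $current[$section] ) ) {
            $current[$section] .= $line;
        }
    }
    return $tests;
}

// For each case: run the candidate parser on $test['input'] and diff
// the output against $test['result'].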
--
All the best,
Nick.