On Fri, 26 Oct 2007 16:27:46 -0400, Simetrical wrote:
On 10/26/07, Steve Sanbeg ssanbeg@ask.com wrote:
On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:
On 10/26/07, Steve Sanbeg ssanbeg@ask.com wrote:
That depends on a number of things. Twelve passes in C is certainly a *lot* faster than twelve passes in PHP. Remember that the difference engine used to be one of the slowest components of MediaWiki until it was rewritten (using an identical algorithm) in C++ -- now it's far faster than rendering the exact same page.
My own experience with Perl and C hasn't shown such dramatic differences, and some operations scale linearly with the number of passes. I was assuming PHP would be similar, although I haven't benchmarked the difference across languages or number of passes for this.
It really depends on what you're doing. If you're just running a simple regex over the input, almost all the heavy lifting is done in C anyway. But the Parser is 5000 lines of PHP code, and the most troublesome parts of it are called repeatedly for complicated templates. Computation tends to be between ten and a hundred times faster in C than in interpreted languages, according to various benchmarks, depending on the exact task. The difference in performance between wikidiff2 and the built-in diff engine isn't made up.
Of course, there would be many other possible parser optimizations. If templates inserted HTML rather than wikitext, for instance, they could be cached separately from the including articles, so that a header or infobox template wouldn't need to be rerendered every time there was a change to article content. But that would be a major change to functionality, I suspect.
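As a very rough sketch of the caching half of that idea (nothing here is existing MediaWiki functionality -- renderTemplateToHtml() is invented, and the $wgMemc/wfMemcKey bits are just the usual cache idiom):

    // hypothetical fragment cache for template output, assuming templates
    // expand straight to HTML; renderTemplateToHtml() does not exist today
    function renderTemplateCached( $title, $args ) {
        global $wgMemc;
        $key = wfMemcKey( 'template-html', md5( $title . serialize( $args ) ) );
        $html = $wgMemc->get( $key );
        if ( $html === false ) {
            $html = renderTemplateToHtml( $title, $args );
            $wgMemc->set( $key, $html, 3600 );  // keep it for an hour
        }
        return $html;
    }

Then an edit to the article body would only invalidate the article's own cache entry, not the template fragments it includes.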
The number of individual characters that are significant to wiki markup is actually fairly small. Changing it to one pass would significantly alter the language in a lot of cases. But I still think if we could do it in three or so passes it would be faster, even if we did have to deal with dozens, or even hundreds, of individual characters.
So preg_split on every significant character, and iterate through each of those? Maybe. I'm really overstepping my expertise by venturing to comment much here.
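Roughly something like this, I imagine -- the character class and the two handler functions here are invented, not anything the Parser actually uses:

    $pieces = preg_split(
        "/([\\[\\]{}'|=*#:;]+)/",          // runs of markup-significant characters
        $wikitext,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    foreach ( $pieces as $piece ) {
        if ( preg_match( "/^[\\[\\]{}'|=*#:;]/", $piece ) ) {
            handleMarkup( $piece );    // hypothetical: decide what this run opens or closes
        } else {
            handleText( $piece );      // hypothetical: plain text passes straight through
        }
    }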
Ideally, just skip ahead to the next run of interesting characters, then match the markup with anchored regular expressions, which should only need a few characters each to match, then repeat.
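Something like this, say -- the character list, the single link pattern, and the append* functions are just placeholders, not the real markup inventory:

    $offset = 0;
    $len = strlen( $wikitext );
    while ( $offset < $len ) {
        // skip over plain text that can't start any markup
        $literal = strcspn( $wikitext, "[{'=", $offset );
        if ( $literal > 0 ) {
            appendText( substr( $wikitext, $offset, $literal ) );  // hypothetical output
            $offset += $literal;
            continue;
        }
        // try an anchored pattern at the current position (only links shown here)
        if ( preg_match( "/\\[\\[([^\\]]+)\\]\\]/A", $wikitext, $m, 0, $offset ) ) {
            appendLink( $m[1] );                                   // hypothetical
            $offset += strlen( $m[0] );
        } else {
            appendText( $wikitext[$offset] );  // stray markup character: emit it literally
            $offset++;
        }
    }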
I guess you could get the same effect by preg_splitting in two, parsing the beginning of the wiki part, then repeating on the leftover.
In the short term, using more complex regular expressions would just make some passes disappear. But that would affect some corner cases, such as breaking things like <<noinclude>includeonly>, which could break stuff that hacks around not having proper subst detection, or the exact behavior of which = gets skipped when there are more than 6.
The side effect might be that large classes of those spaghetti templates become inoperable.
Which is really the idea, isn't it? It's not what I'd call a side effect; the point is to kill them.
The problem now is to fix the few pages that have rendering problems, so I think killing them on pages where they don't cause problems yet is just a happy side effect. But if everything that shouldn't work suddenly stopped working, that would certainly create some short-term problems for Wikipedia.