On Fri, 26 Oct 2007 16:27:46 -0400, Simetrical wrote:
On 10/26/07, Steve Sanbeg ssanbeg@ask.com wrote:
On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:
On 10/26/07, Steve Sanbeg ssanbeg@ask.com wrote:
That depends on a number of things. Twelve passes in C is certainly a *lot* faster than twelve passes in PHP. Remember that the difference engine used to be one of the slowest components of MediaWiki until it was rewritten (using an identical algorithm) in C++ -- now it's far faster than rendering the exact same page.
My own experience with Perl and C hasn't shown such dramatic differences, and some operations scale linearly with the number of passes. I was assuming PHP would be similar, although I haven't benchmarked the difference across languages or number of passes for this.
It really depends on what you're doing. If you're just running a simple regex over the input, almost all the heavy lifting is done in C anyway. But the Parser is 5000 lines of PHP code, and the most troublesome parts of it are called repeatedly for complicated templates. Computation tends to be between ten and a hundred times faster in C than in interpreted languages, according to various benchmarks, depending on the exact task. The difference in performance between wikidiff2 and the built-in diff engine isn't made up.
Of course, there would be many other possible parser optimizations. If templates inserted HTML rather than wikitext, for instance, they could be cached separately from the including articles, so that a header or infobox template wouldn't need to be rerendered every time there was a change to article content. But that would be a major change to functionality, I suspect.
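As a very rough sketch of the caching half of that idea (nothing here is existing MediaWiki functionality -- renderTemplateToHtml() is invented, and the $wgMemc/wfMemcKey bits are just the usual cache idiom):

    // hypothetical fragment cache for template output, assuming templates
    // expand straight to HTML; renderTemplateToHtml() does not exist today
    function renderTemplateCached( $title, $args ) {
        global $wgMemc;
        $key = wfMemcKey( 'template-html', md5( $title . serialize( $args ) ) );
        $html = $wgMemc->get( $key );
        if ( $html === false ) {
            $html = renderTemplateToHtml( $title, $args );
            $wgMemc->set( $key, $html, 3600 );  // keep it for an hour
        }
        return $html;
    }

Then an edit to the article body would only invalidate the article's own cache entry, not the template fragments it includes.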
The number of individual characters that are significant to wiki markup is actually fairly small. Changing it to one pass would significantly alter the language in a lot of cases. But I still think if we could do it in three or so passes it would be faster, even if we did have to deal with dozens, or even hundreds, of individual characters.
So preg_split on every significant character, and iterate through each of those? Maybe. I'm really overstepping my expertise by venturing to comment much here.
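Roughly something like this, I imagine -- the character class and the two handler functions here are invented, not anything the Parser actually uses:

    $pieces = preg_split(
        "/([\\[\\]{}'|=*#:;]+)/",          // runs of markup-significant characters
        $wikitext,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    foreach ( $pieces as $piece ) {
        if ( preg_match( "/^[\\[\\]{}'|=*#:;]/", $piece ) ) {
            handleMarkup( $piece );    // hypothetical: decide what this run opens or closes
        } else {
            handleText( $piece );      // hypothetical: plain text passes straight through
        }
    }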
Ideally, just skip ahead to the next run of interesting characters, then match the markup with anchored regular expressions, which should only need a few characters each to match, then repeat.
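Something like this, say -- the character list, the single link pattern, and the append* functions are just placeholders, not the real markup inventory:

    $offset = 0;
    $len = strlen( $wikitext );
    while ( $offset < $len ) {
        // skip over plain text that can't start any markup
        $literal = strcspn( $wikitext, "[{'=", $offset );
        if ( $literal > 0 ) {
            appendText( substr( $wikitext, $offset, $literal ) );  // hypothetical output
            $offset += $literal;
            continue;
        }
        // try an anchored pattern at the current position (only links shown here)
        if ( preg_match( "/\\[\\[([^\\]]+)\\]\\]/A", $wikitext, $m, 0, $offset ) ) {
            appendLink( $m[1] );                                   // hypothetical
            $offset += strlen( $m[0] );
        } else {
            appendText( $wikitext[$offset] );  // stray markup character: emit it literally
            $offset++;
        }
    }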
I guess you could get the same effect by preg_splitting in two, parsing the beginning of the wiki part, then repeating on the leftover.
In the short term, using more complex regular expressions would just make some passes disappear. But that would affect some corner cases, such as breaking things like <<noinclude>includeonly>, which could break stuff that hacks around not having proper subst detection, or the exact behavior of which = gets skipped when there are more than 6.
The side effect might be that large classes of those spaghetti templates become inoperable.
Which is really the idea, isn't it? It's not what I'd call a side effect; the point is to kill them.
The problem now is to fix the few pages that have rendering problems, so I think killing them on pages where they don't cause problems yet is just a happy side effect. But if everything that shouldn't work suddenly stopped working, that would certainly create some short-term problems for Wikipedia.