20071018 dumps have more problems. "United States" does not render.

List overview All Threads
Download

newer

older

Bug in URL spam filter

error while running mwdumper

jmerkey＠wolfmountaingroup.com

25 Oct 2007 25 Oct '07

12:34 p.m.

The article "United States" and "Antarctica" (and lots of others) do not render on MediaWiki 1.9.3 release with the 20071018 dumps. I also have a test setup with MediaWiki 1.11 and the performance is very poor vs. 1.9.3. The errors are mysql timeouts and no useful output from the http logs. I will attempt debug next. Any ideas on what this could be? The page is: http://www.wikigadugi.org/wiki/United_States Dumps prior to September do not exhibit this breakage. Jeff

Show replies by date

Brion Vibber

25 Oct 25 Oct

7:50 p.m.

jmerkey(a)wolfmountaingroup.com wrote:

...

The article "United States" and "Antarctica" (and lots of others) do not

Ugly slow templates. Please look at the actual content before complaining about dumps. -- brion vibber (brion @ wikimedia.org)

David A. Desrosiers

7:50 p.m.

...

I just tried on my install here, running latest trunk of phase3, and although it took about 10 seconds at the first query, it did come up. There's so much on that page that it took my browser a lot longer to render it than a fetch with wget did, but the content is there, and did display. This is enwiki fetched on September 10th. You want me to try with a newer dump?

jmerkey＠wolfmountaingroup.com

2:21 p.m.

...

The Sep dump has the same problem, just not as bad. It appears the templates are exceeding the 30 second timeout and causing the connection to abort (based on tracing through wfDebug logs). It appears the fix is to set the 30 second timeout higher. On my system, it looks to be exceeding the preconfigured timeouts. 1.11 seems to exacerbate the problem when used. Jeff

...

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Thomas Dalton

8:44 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

...

It appears the fix is to set the 30 second timeout higher.

The fix is to remove/fix the templates. 30 seconds is a long time for a page to take to render...

jmerkey＠wolfmountaingroup.com

2:43 p.m.

...

It appears the fix is to set the 30 second timeout higher.

The fix is to remove/fix the templates. 30 seconds is a long time for a page to take to render...

Yep. Since Wikimedia publishes the templates and the dumps, the fixing needs to happen at the source -- the English Wikipedia. Some sort of constraint should be placed into MediaWiki to limit the call depth and complexity for some of these templates by refusing to save changes for templates which are so obviously broken. Jeff

...

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Steve Summit

9:11 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

Jeff Merkey wrote:

...

...Some sort of constraint should be placed into MediaWiki to limit the call depth and complexity for some of these templates by refusing to save changes for templates which are so obviously broken.

Yup. See this message from Simetrical, over in the "slowness tonight?" thread: * * * Date: Thu, 25 Oct 2007 11:42:23 -0400 From: Simetrical <Simetrical+wikilist(a)gmail.com> To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org> Subject: Re: [Wikitech-l] slowness tonight? Yes, those load bog-slowly. I tested WP:RD/S. If I leave in the header, but remove the archives, it loads in 40191 ms on preview according to Tamper Data. With the archives but not the header, it's just 6766 ms. Without either, it's 5836 ms. With both, it's 49430 ms. I think the conclusion is clear: the header adds over 30 seconds to the page load time. It needs to be killed, stone-dead. This is the sort of thing that the wikitext include limit was supposed to prevent -- it seems not to be doing its job.

RLS

9:15 p.m.

jmerkey(a)wolfmountaingroup.com wrote:

...

Thing is, they're not "broken" on en.wikipedia.org. The WMF hardware is capable of rendering these pages in a reasonable amount of time, ~12s for me for either [[United States]] and [[Antarctica]], including download time for the images etc. I agree that's higher than most pages, but I wouldn't call it "broken." Broken is relative, and I don't see why the English Wikipedia needs to be crippled because your hardware, as a downstream user, isn't capable of matching the performance of the Foundation's hardware. --en.wp Darkwind

GerardM

9:28 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

Hoi, There is another reason why such templates might be reconsidered. The complexity has become such that it become increasingly difficult to find people who still understand it. The technical sophistication needed has become such that it is beyond many advanced users of MediaWiki. This in itself is a problem because it also raises the threshold for starting Wikimedians. Thanks, GerardM On 10/25/07, RLS <evendell(a)gmail.com> wrote:

...

jmerkey(a)wolfmountaingroup.com wrote:

RLS

11:41 p.m.

GerardM wrote:

...

Well, I certainly agree that template programming is entirely beyond the pale, but that's an entirely separate issue that actually has several possible solutions (an extremely simplified programming language that could be enabled for only certain namespaces, artificially limiting template recursion to improve simplicity, keeping the current style of functions but developing an easier-to-read syntax, etc.) --en.wp Darkwind

Nick Jenkins

26 Oct 26 Oct

7:34 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

...

"Not our problem" is potentially a dangerous argument. Let's take as a given that some normal non-malicious pages as currently written take 12 seconds to render on WMF hardware. Suppose that an actively malicious user then systematically identifies and repeatedly calls the slowest operations contributing to that render time, and eliminates all the fast operations, thus allowing them to increase the "efficacy"/slowness of the wikitext rendering, such that a page that's only 5% of the size takes 4 times longer to render (so we're up to around 50 seconds to render 8 KB of wikitext). We then take the number of MediaWiki Apache servers (let's assume 170 for the sake of argument). So for a DoS we need to request from each server (say) two preview renderings of each attack page per 50 seconds, and assuming 170 servers, that's 170/(50/2) seconds * 8 KB = 54 KB per second upstream bandwidth required. Downstream bandwidth doesn't matter because we don't care about the response, and we won't be listening anyway. My connection now for example is 1017 kilobits per second upstream, equals 127 KB per second. So, if the above assumptions are reasonable and my maths is okay, then any single reasonably modern broadband connection is more than sufficient to make every Wikipedia unusable. ... remind me again of how this is not our problem? ;-) It might be better to think of Jeff's servers as the gasping canary, WMF servers as the miner, and slow render time as the toxic gas, the Internet as the mine, the people who will make useful contributions as the gold, the trolls as trolls, and ... actually I think I'm overextending the metaphor, so I'll stop there! -- All the best, Nick.

RLS

7:57 a.m.

Nick Jenkins wrote:

...

It might be better to think of Jeff's servers as the gasping canary, WMF servers as the miner, and slow render time as the toxic gas, the Internet as the mine, the people who will make useful contributions as the gold, the trolls as trolls, and ... actually I think I'm overextending the metaphor, so I'll stop there!

I agree there are limits to even what the WMF hardware can do; I just don't necessarily see that we've reached them, when we don't know anything about his servers or configuration and the issue is not causing problems on en.wp, which is likely to hit such problems first as the largest WMF wiki. There *are* template-related problems on en.wp at the moment, discussed in the thread from http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034323.html, but I'm still not sure that's an indication that additional limits are needed - but it might be. --Darkwind

Steve Summit

5:37 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

Darkwind wrote:

...

There *are* template-related problems on en.wp at the moment, discussed in the thread from http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034323.html, but I'm still not sure that's an indication that additional limits are needed - but it might be.

Also relevant are the threads at [[Wikipedia:Village pump (technical)#any work on the incredible sluggishness of deeply edited articles?]] and [[Wikipedia talk:Reference desk#Houston, we have a problem.]], and Simetrical's statement at <http://lists.wikimedia.org/pipermail/wikitech-l/2007-October/034338.html> in the aforementioned thread that "This is the sort of thing that the wikitext include limit was supposed to prevent". The question is, how true is it that "almost every very-high-traffic page on Wikipedia is having extreme problems right now". I suspect not, but if so, is it because there are more pages with say, heavy use of the {cite} template, or because templates like {cite} have gotten more complicated, or because template interpolation has somehow gotten slower, or simply because there are more hits and edits being processed every day, such that our headroom is going down?

Simetrical

8:09 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/26/07, Steve Summit <scs(a)eskimo.com> wrote:

...

The question is, how true is it that "almost every very-high-traffic page on Wikipedia is having extreme problems right now". I suspect not, but if so, is it because there are more pages with say, heavy use of the {cite} template, or because templates like {cite} have gotten more complicated, or because template interpolation has somehow gotten slower, or simply because there are more hits and edits being processed every day, such that our headroom is going down?

Well, whatever the problem is, I suspect I know one way that would fix it: rewriting the parser in C(++). Unfortunately, that's a whole lot easier said than done. Rewriting even part of it, though, say replaceVariables, might be a big benefit. For now it might be best to refine our heuristics of what's slow to render. Currently we use a simple text-length heuristic, but perhaps it would make more sense to incorporate additional criteria. Maximum number of template inclusions? Maximum template depth? It would require testing to see what would be effective.

Thomas Dalton

8:22 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

...

Working out what the parser is actually meant to do would be required first, though. At the moment it does what it does and that the best anyone can say. Trying to translate that idiosyncratic behaviour into a new language would be a nightmare.

...

For now it might be best to refine our heuristics of what's slow to render. Currently we use a simple text-length heuristic, but perhaps it would make more sense to incorporate additional criteria. Maximum number of template inclusions? Maximum template depth? It would require testing to see what would be effective.

I suspect depth would be the best one to try. People can tell by looking at an article's source how many templates there are, and can keep that under control. Telling how deep templates go is often impossible for anyone that isn't an expert on MediaWiki template syntax, so they could easily end up with 100s of templates being processed without noticing.

Steve Sanbeg

8:48 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On Fri, 26 Oct 2007 14:09:38 -0400, Simetrical wrote:

...

On 10/26/07, Steve Summit <scs(a)eskimo.com> wrote:

I'm not sure simply porting to a different language would have such a huge affect, and certainly isn't easy with a grammar that's not well defined. Currently, even if you were to render a large plain-text page with no markup, MW would still have to make about dozen passes over the text to determine that there's really nothing to do; that's going to be slow, no matter what language it's done in. I think a much simpler interpreted parser would beat a complex compiled one, unless you're dealing with small pages where initial overhead is significant.

...

I don't think the text length is very accurate; we definitely need something better. Also, I think a big part of the problem is with the parser functions; they tend to first expand every template passed into them, then decide which one to keep. Deferring that expansion, which could be done by adding a keyword to each nested template call, should help there, although there may be a better way.

Simetrical

9:05 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/26/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

...

That depends on a number of things. Twelve passes in C is certainly a *lot* faster than twelve passes in PHP. Remember that the difference engine used to be one of the slowest components of MediaWiki, until it was rewritten (using an identical algorithm) in C++ -- now it's far faster than rendering the exact same page.

...

I think a much simpler interpreted parser would beat a complex compiled one, unless you're dealing with small pages where initial overhead is significant.

Tim once remarked to me on IRC that he suspected a one-pass PHP parser would be slower than our current one, simply because the current one avoids going through each character in PHP. Something like preg_split is fast precisely because it's executed in C: then PHP only has to deal with ten or twenty or two hundred chunks of text, rather than a hundred thousand individual characters.

...

Well, if the expansion is deferred that should be decided by the individual parser function, not by the call syntax for the template. Either way, I think some more careful benchmarking is needed here before anyone can say what limits are best to add. One thing that's for sure is that it's the templates/conditionals specifically that are the problem, not refs or links or whatever: replaceVariables takes up something like 50% of CPU time now, or what? There are charts around somewhere.

Steve Sanbeg

10:07 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:

...

On 10/26/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

My own experiences with perl & C haven't shown such dramatic differences, and that some operations scale linearly with the number of passes. I was assuming PHP would be similar, although I haven't benchmarked differences in language or passes for this.

...

I think a much simpler interpreted parser would beat a complex compiled one, unless you're dealing with small pages where initial overhead is significant.

The number of individual characters that are significant to wiki markup is actually fairly small. Changing it to one pass would significantly alter the language in a lot of cases. But I still think if we could do it in three or so passes it would be faster, even if we did have to deal with dozens, or even hundreds, of individual characters.

...

Yes, certainly variable replacement. I think it's clear that something like {{#if{{a}}:{{defer:b}}|{{defer:c}}}} would be more efficient than {{#if:{{a}}|{{b}}|{{c}}}}. If that behavior was implicit in #if, rather than adding a new modifier and plugging it into all the templates, so much the better. I agree that there should be benchmarking to suggest new limits. Really, we should have a cost per transclusion/function, which could vary by function, that the caller would be charged. This would much more accurately address the issue. The side affect might be that large classes of those spaghetti templates become inoperable.

Simetrical

10:27 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/26/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

...

On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:

On 10/26/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote: That depends on a number of things. Twelve passes in C is certainly a *lot* faster than twelve passes in PHP. Remember that the difference engine used to be one of the slowest components of MediaWiki, until it was rewritten (using an identical algorithm) in C++ -- now it's far faster than rendering the exact same page.

It really depends on what you're doing. If you're doing some simple regex of input data, almost all the heavy lifting is done in C anyway. But the Parser is 5000 lines of PHP code, the most troublesome parts of which are called repeatedly for complicated templates. Computation tends to be between ten and a hundred times faster in C than in interpreted languages, according to various benchmarks, depending on the exact task. The differences in performance when using wikidiff2 versus the built-in diff engine aren't made up. Of course, there would be many other possible parser optimizations. If templates inserted HTML rather than wikitext, for instance, they could be cached separately from the including articles, so that a header or infobox template wouldn't need to be rerendered every time there was a change to article content. But that would be a major change to functionality, I suspect.

...

So preg_split on every significant character, and iterate through each of those? Maybe. I'm really overstepping my expertise by venturing to comment much here.

...

The side affect might be that large classes of those spaghetti templates become inoperable.

Which is really the idea, isn't it? It's not what I'd call a side effect, the point is to kill them.

Steve Sanbeg

27 Oct 27 Oct

12:33 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On Fri, 26 Oct 2007 16:27:46 -0400, Simetrical wrote:

...

On 10/26/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

On Fri, 26 Oct 2007 15:05:44 -0400, Simetrical wrote:

So preg_split on every significant character, and iterate through each of those? Maybe. I'm really overstepping my expertise by venturing to comment much here.

Ideally, just skip over sequences of interesting characters, then match markup with anchored regular expressions, which should only need a few characters to match, then repeat. I guess you could get the same affect by preg_splitting in two, parsing the beginning of the wiki part, the repeating on the leftover. In the short term, just using more complex regular expressions would just make some passes disappear. But that would affect some corner cases, such as breaking things like <<noinclude>includeonly> which could break stuff that hack around not having proper subst detection, or the exact behavior or which = gets skipped when there's more than 6.

...

> The side affect might be that large classes

of those spaghetti

...

templates become inoperable.

Which is really the idea, isn't it? It's not what I'd call a side effect, the point is to kill them.

The problem now is to fix the few pages that have rendering problems. So I think killing them on pages where they don't cause problems yet is just a happy side affect. But if everything that shouldn't work suddenly doesn't work, that would certainly create some short-term problems for wikipedia.

Rolf Lampa

26 Oct 26 Oct

11:31 p.m.

Simetrical skrev:

...

Interestingly enough I'm coding the ReplaceVariable in Delphi Pascal right now using highly optimized code. Although implementing it "my way" I'll soon be able to produce "any metrics" about exactly what's time consuming in which template, on which page, in the entire enWP. I can already say (after having profiled it "hundreds of times") that you are perfectly right in that the "links or whatever" is NOT taking much CPU at all, especially not in comparison with ReplaceVariables. My dump-processor expands templates (most of it), but not parsing HTML, at least not yet. Regards, // Rolf Lampa

Simetrical

28 Oct 28 Oct

2:35 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/26/07, Rolf Lampa <rolf.lampa(a)rilnet.com> wrote:

...

Neat. I might not have chosen Delphi Pascal, for maintainability reasons, but we already have stuff written in OCaml, so . . . Is your intent to get this into the software proper (as an option of course), like wikidiff2? On 10/27/07, Lars Aronsson <lars(a)aronsson.se> wrote:

...

Can you please stop arguing about the speed of hypothetical programs? Go write the C parser or shut up.

Was that a productive remark?

Jay R. Ashworth

4:47 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On Sat, Oct 27, 2007 at 08:35:25PM -0400, Simetrical wrote:

...

On 10/27/07, Lars Aronsson <lars(a)aronsson.se> wrote:

Can you please stop arguing about the speed of hypothetical programs? Go write the C parser or shut up.

Was that a productive remark?

It might be. If the productive time (not to mention morale) of people who can actually do the work is taken up continuing to have to tell optimistic by poorly informed people just how difficult that work is, then scaring such people off to do their homework on that point, preemptively, can in fact be useful. :-) Cheers -- jra -- Jay R. Ashworth Baylink jra(a)baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Lars Aronsson

1:15 p.m.

Simetrical wrote:

...

On 10/27/07, Lars Aronsson <lars(a)aronsson.se> wrote:

Can you please stop arguing about the speed of hypothetical programs? Go write the C parser or shut up.

Was that a productive remark?

That was my intention. The current discussion on this list is not productive. And that is why I sent that comment to you off-list. If you really think that your dad's C parser is stronger than what we currently have, you should bring your dad here. Where is he? -- Lars Aronsson (lars(a)aronsson.se) Aronsson Datateknik - http://aronsson.se

Simetrical

8:13 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/28/07, Lars Aronsson <lars(a)aronsson.se> wrote:

...

Simetrical wrote:

On 10/27/07, Lars Aronsson <lars(a)aronsson.se> wrote:

Can you please stop arguing about the speed of hypothetical programs? Go write the C parser or shut up.

Was that a productive remark?

That was my intention. The current discussion on this list is not productive. And that is why I sent that comment to you off-list.

Oops, I didn't notice that part. Gmail doesn't make the difference between on-list and off-list very clear. Sorry. I'll reply off-list, since this is *really* not on-topic.

Thomas Dalton

29 Oct 29 Oct

2:44 p.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

...

Oops, I didn't notice that part. Gmail doesn't make the difference between on-list and off-list very clear. Sorry. I'll reply off-list, since this is *really* not on-topic.

There is an option in gmail's preferences which puts symbols next to threads if they contain emails sent explicitly to you (one symbol if you're the only recipient, and one if there are multiple). Very useful thing to have turned on!

Jay R. Ashworth

28 Oct 28 Oct

4:42 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On Thu, Oct 25, 2007 at 02:15:04PM -0500, RLS wrote:

...

jmerkey(a)wolfmountaingroup.com wrote:

Datapoint: FF2.0, HP nc6000 on wall power, [[US]] renders in 10.244, and two reloads in 7.3ish; optimum online (probably DSL), New Brunswick NJ. Timing according to LoRI Yup; serviceable. Dell's selling SC1430 dual quad-core Xeons for $779 for the rest of the month, Jeff: go buy one. http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=04&kc=6W… Cheers, -- jra -- Jay R. Ashworth Baylink jra(a)baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Simetrical

5:40 a.m.

New subject: 20071018 dumps have more problems. "United States" does not render.

On 10/27/07, Jay R. Ashworth <jra(a)baylink.com> wrote:

...

Datapoint: FF2.0, HP nc6000 on wall power, [[US]] renders in 10.244, and two reloads in 7.3ish; optimum online (probably DSL), New Brunswick NJ. Timing according to LoRI Yup; serviceable.

You do realize that the question is server render times, not client render times?

6023

days inactive

6027

days old

wikitech-l@lists.wikimedia.org

Manage subscription

27 comments

13 participants

tags (0)

participants (13)

Brion Vibber
David A. Desrosiers
GerardM
Jay R. Ashworth
jmerkey＠wolfmountaingroup.com
Lars Aronsson
Nick Jenkins
RLS
Rolf Lampa
Simetrical
Steve Sanbeg
Steve Summit
Thomas Dalton