Less than two hours ago, I found the page http://www.wikipedia.com/wiki/Sweden blank again. So I clicked Edit, then Save, to restore its full contents. This has happened before. See the page's History.
Is this "the caching bug"? How does it work? Wouldn't the obsolete cached copy me destroyed the first time I did this? Why does the blank page come back?
Is caching really necessary for performance now? When caching was activated, I thought several other functions were disabled at the same time. Do we know the impact on response times of not using caching?
On dim, 2002-05-12 at 17:38, Lars Aronsson wrote:
Less than two hours ago, I found the page http://www.wikipedia.com/wiki/Sweden blank again. So I clicked Edit, then Save, to restore its full contents. This has happened before. See the page's History.
Is this "the caching bug"?
Not as far as I know...
How does it work? Wouldn't the obsolete cached copy me destroyed the first time I did this?
Sure should have been. Here's how the caching works:
* When a page is loaded, the contents of the cur_cache field are checked. If it is not empty (!= ""), the cached HTML is dropped into the output and various parsing/rendering steps are bypassed. Otherwise, the page is rendered anew, shown, and stored in cur_cache for the next time the page is shown.
* When a page is saved, the cur_cache field is cleared (set to ""), so that the next time it's loaded (which should be immediately!) it will be re-rendered.
* When a new page is created, the "unlinked" table is checked to see if any other pages were linking to the previously-nonexistant page. If so, those pages have their cur_cache cleared (set to ""), so that when next viewed, they can show the link as being to an existing page. (This won't happen currently, as the "unlinked" table is broken.)
"The caching bug" is due to the fact that, if caching is disabled, the cur_cache field isn't touched at all, not even to clear caches when old pages are saved: the software has no way of knowing the cache is old (because there's no cache timestamp), so the old version is displayed as if it were the current version of the page. Basically, it assumes that caching will either be always off, or start off and turn on forever. Turning it on, then turning it off for a while, then back on without manually clearing the cache leaves bad cached pages.
This can easily be solved in the future by performing the cache-clears whether caching is enabled or not. (I assume this was not done in the first place so that the code would still work until the cur_cache field was added to the database!)
The solution for our current problem is simply to clear all the cache fields: UPDATE cur SET cur_cache="",cur_timestamp=cur_timestamp
which can be done by someone with is_developer access (that would be nobody, currently!) or by someone with direct access to the db server (messieurs Wales and Richey). That will get rid of all the old cached pages, and the _correct_ caches can start to fill up again. (Except that unlinked links may stick around for newly created pages due to the separate "unlinked" table problem.)
Why does the blank page come back?
I have no idea! I haven't actually seen this problem yet, I've only seen a couple notes that said roughly "page was blank, so I saved it again, fine now."
Does the page _stay_ blank after reloading; ie does it consistently come up blank until you resave it? Or does it come up blank once, but is okay if you reload? Is it completely blank, or does it contain spaces or something? Are there any error messages? Can you see anything suspicious in the HTML source of the page?
If anyone with is_sysop status notices a page doing this, try doing a direct SQL query before doing anything to the article:
SELECT * FROM cur WHERE cur_title="Title_of_the_page"
What's in the cur_cache field? Does anything else look out of place? (Hmm, looks like special_asksql.php needs to slip in a couple of <nowiki></nowiki> tags.)
Is caching really necessary for performance now?
I wouldn't doubt that it's _helping_.
When caching was activated, I thought several other functions were disabled at the same time. Do we know the impact on response times of not using caching?
Alas, yes and no in that order. Since the main limiting factor is still the multiple/complex/slow database queries, not CPU time in PHP, caching may be a smaller help than anything else we're doing; but I'd be quite surprised if it's not helping at all. And what's the sense in throwing away a performance gain if we've got one?
(Note that caching does cut down a bit on database queries, since links don't need to be checked for existence.)
-- brion vibber (brion @ pobox.com)
Brion L. VIBBER wrote:
Does the page _stay_ blank after reloading; ie does it consistently come up blank until you resave it?
Yes it stays blank. I can tell you how I discovered this. I suspected that my response time monitoring script was seeing timeouts instead of real pages when response times where in the 500+ seconds range. So I added HTTP status code and document length to my log file. I can tell you the Main Page is 18K in normal cases and 4K when only the page header and footer are returned.
With times in GMT, the normal sequence is like this:
date time resp status url bytes 20020510 Fri 11:40:00 9.02 200 http://www.wikipedia.com/ 18459 20020510 Fri 11:50:00 1.13 200 http://www.wikipedia.com/ 18459
As you can see, the status code is 200 and response time very reasonable (1.13 - 9.02 seconds). But then something happens at 12:00 GMT (5:20 am PDT) Friday:
20020510 Fri 12:00:00 0.90 200 http://www.wikipedia.com/ 3999 20020510 Fri 12:10:00 1.09 200 http://www.wikipedia.com/ 3999
It stays this way until restored at 14:10 GMT (7:10 am PDT) Friday:
20020510 Fri 13:50:00 7.80 200 http://www.wikipedia.com/ 3999 20020510 Fri 14:00:00 8.71 200 http://www.wikipedia.com/ 3999 20020510 Fri 14:10:00 1.27 200 http://www.wikipedia.com/ 18444 20020510 Fri 14:20:00 1.18 200 http://www.wikipedia.com/ 18444
The same thing happens at 13:20 GMT (6:20 am PDT) Saturday:
20020511 Sat 13:00:00 1.65 200 http://www.wikipedia.com/ 18443 20020511 Sat 13:10:00 1.12 200 http://www.wikipedia.com/ 18443 20020511 Sat 13:20:00 174.32 200 http://www.wikipedia.com/ 3998 20020511 Sat 13:30:00 3.31 200 http://www.wikipedia.com/ 3998
And is restored at 15:40 GMT (8:40 am PDT) Saturday:
20020511 Sat 15:20:00 0.59 200 http://www.wikipedia.com/ 3998 20020511 Sat 15:30:00 0.59 200 http://www.wikipedia.com/ 3998 20020511 Sat 15:40:00 7.68 200 http://www.wikipedia.com/ 18445 20020511 Sat 15:50:00 1.18 200 http://www.wikipedia.com/ 18445
And 15:30-16:30 GMT (8:30 am - 9:30 am PDT) Sunday:
20020512 Sun 15:00:00 13.99 200 http://www.wikipedia.com/ 18443 20020512 Sun 15:20:00 1.88 200 http://www.wikipedia.com/ 18443 20020512 Sun 15:30:00 0.59 200 http://www.wikipedia.com/ 3998 20020512 Sun 15:40:00 0.68 200 http://www.wikipedia.com/ 3998 [...] 20020512 Sun 16:10:00 0.61 200 http://www.wikipedia.com/ 3998 20020512 Sun 16:20:00 0.62 200 http://www.wikipedia.com/ 3998 20020512 Sun 16:30:00 4.79 200 http://www.wikipedia.com/ 18397 20020512 Sun 16:40:00 1.22 200 http://www.wikipedia.com/ 18289
Obviously, I didn't restore the Main Page since I don't have the privileges to do this. And I have no idea who or what did this.
For the [[Sweden]] page some response time observations are so long or timed out (zero length, no status code) that log entries appear in the wrong order. The two restore times (15:50 GMT = 8:50 am PDT Saturday, and 22:50 GMT = 3:50 pm PDT Sunday) correspond to the revision History page (8:48 and 15:48), when I restored the page:
20020511 Sat 13:10:00 45.22 200 http://www.wikipedia.com/wiki/Sweden 13360 20020511 Sat 13:20:00 769.98 200 http://www.wikipedia.com/wiki/Sweden 13360 20020511 Sat 13:30:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0 20020511 Sat 13:40:00 2288.51 200 http://www.wikipedia.com/wiki/Sweden 4577 20020511 Sat 13:50:00 940.10 200 http://www.wikipedia.com/wiki/Sweden 4577 [...] 20020511 Sat 15:30:00 0.80 200 http://www.wikipedia.com/wiki/Sweden 4577 20020511 Sat 15:40:00 0.81 200 http://www.wikipedia.com/wiki/Sweden 4577 20020511 Sat 15:50:00 1.22 200 http://www.wikipedia.com/wiki/Sweden 12807 20020511 Sat 16:00:00 1.09 200 http://www.wikipedia.com/wiki/Sweden 12807 [...] 20020512 Sun 10:40:00 384.72 200 http://www.wikipedia.com/wiki/Sweden 12807 20020512 Sun 11:00:00 327.15 200 http://www.wikipedia.com/wiki/Sweden 12807 20020512 Sun 11:10:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 11:20:00 2549.71 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 11:30:00 2549.82 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 12:10:00 481.58 200 http://www.wikipedia.com/wiki/Sweden 4576 20020512 Sun 11:40:00 2286.75 200 http://www.wikipedia.com/wiki/Sweden 4576 [...] 20020512 Sun 22:30:00 1.90 200 http://www.wikipedia.com/wiki/Sweden 4576 20020512 Sun 22:40:00 0.80 200 http://www.wikipedia.com/wiki/Sweden 4576 20020512 Sun 22:50:00 1.10 200 http://www.wikipedia.com/wiki/Sweden 12805 20020512 Sun 23:10:00 1.10 200 http://www.wikipedia.com/wiki/Sweden 12805
On lun, 2002-05-13 at 00:36, Lars Aronsson wrote:
20020511 Sat 13:20:00 769.98 200 http://www.wikipedia.com/wiki/Sweden 13360 20020511 Sat 13:30:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0 20020511 Sat 13:40:00 2288.51 200 http://www.wikipedia.com/wiki/Sweden 4577
[...]
20020512 Sun 11:00:00 327.15 200 http://www.wikipedia.com/wiki/Sweden 12807 20020512 Sun 11:10:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 11:20:00 2549.71 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 11:30:00 2549.82 http://www.wikipedia.com/wiki/Sweden 0 20020512 Sun 12:10:00 481.58 200 http://www.wikipedia.com/wiki/Sweden 4576
Woo-hoo! Thanks for the detailed stats, I think I've got it figured out now. The timeouts just before the blanking look VERY suspicious...
Here's what I think is going on:
* The main database query times out on a page load.
* wikiPage::load() doesn't check the return code from mysql_query() (oops!) and ends up with nothing for, among other things, $this->cache and $this->text.
* Since the cache is empty (not != ""), the text is re-rendered. Blank text actually ends up rendering to "<p><p>\n<p>\n". (Is that what you see if you check the HTML code of the blank pages?)
* The new rendered blankness gets saved in the cache. (For some reason, the PHP process for the page hasn't been killed for being around too long by this point.)
* Subsequent page views get shown with the pseudo-blank cache, since it's not actually a null string (hence, != "")
So, it's not THE caching bug, but it's A caching bug.
Fix: always check return codes! I've added a quick check to wikiPage::load() for the main query, so if it can't execute the query it should spit out an error message and _not_ cache the result. All the other queries really should be checked too, though.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org