It's continuing to get these errors after a reboot.
It seems to be the primary controller, but I'm not a SCSI guru, so I
don't know what's what.
I'll get all the data backed up somewhere if I can...
sigh
-- brion vibber (brion @ pobox.com)
memtest86, the quintessential memory tester app, requires physical
console access and a reboot. In the meantime, I've run some tests on
geoffrin from within Linux with a program called 'memtester' which was
recommended to me:
http://www.qcc.ca/~charlesc/software/memtester/
It only seemed willing to lock about 2 gigs at a time, so I ran two
memtester processes, which between them covered most of the available
memory. I didn't tell it to test _all_ memory as I was afraid Linux
might start killing sshd processes for daring to ask for more memory to
send me the results. ;)
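
For anyone wanting to repeat the split-across-processes approach, here is a rough sketch in Python. The memory figures are made up for illustration (the real numbers for geoffrin aren't given here), the helper names are hypothetical, and it assumes a memtester build that accepts a size with an "M" suffix followed by a loop count:

```python
import subprocess


def plan_memtester_runs(total_mb, reserve_mb, chunk_mb=2048, loops=1):
    """Return one memtester command line per process.

    reserve_mb is left untested so the kernel, sshd and friends keep
    some breathing room. All sizes here are illustrative assumptions.
    """
    testable_mb = max(total_mb - reserve_mb, 0)
    commands = []
    while testable_mb > 0:
        size_mb = min(chunk_mb, testable_mb)
        # Assumed CLI shape: memtester <size>M <loops>
        commands.append(["memtester", f"{size_mb}M", str(loops)])
        testable_mb -= size_mb
    return commands


def run_all(commands):
    """Start every planned process at once, wait for all, return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

With a hypothetical 5120 MB box and 1024 MB held back, plan_memtester_runs(5120, 1024) yields two 2048M runs, which matches the "two processes of about 2 gigs each" setup described above. Locking that much memory normally needs root.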
Well, it's spewing out errors right and left; 19 failures caught
between all the tests on the first pass. At this point we can't be sure
if these are actual physical RAM defects, or artifacts of kernel bugs
or other problems (or even bugs in memtester, which might not have been
thoroughly tested on amd64), but it's not a good sign.
I'll post the logs once it's run a while longer.
-- brion vibber (brion @ pobox.com)
The database is back on pliny for now.
Replication isn't currently set up, mainly because pliny is short on
disk space, and the binlogs accumulate at a rate approaching a gig per
day.
Once we're satisfied with geoffrin, we can move everything back, yay...
:P
We waste a lot of disk space on the old table; gzipping old text should
save a fair chunk of space relatively easily at minor cpu cost, without
the complications of trying to make consistent diff-based storage. Even
if it were only 50% savings, that's several gigabytes. I may try to
throw something together for this...
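
A minimal sketch of the idea, in Python rather than the wiki's own code, and not a description of any existing implementation. The "gzip" flag name and the two helper functions are assumptions made up for this example; the point is only that each old revision is compressed on its own and marked, so no diff chain has to stay consistent:

```python
import gzip

FLAG_GZIP = "gzip"  # hypothetical marker stored alongside each old revision


def compress_row(text, flags=""):
    """Gzip one old revision; return (blob, flags).

    If compression does not make the text smaller (very short pages),
    the raw bytes are kept and no flag is added.
    """
    flag_set = {f for f in flags.split(",") if f}
    raw = text.encode("utf-8")
    packed = gzip.compress(raw, compresslevel=9)
    if len(packed) >= len(raw):
        return raw, ",".join(sorted(flag_set))
    flag_set.add(FLAG_GZIP)
    return packed, ",".join(sorted(flag_set))


def expand_row(blob, flags):
    """Reverse compress_row: gunzip only when the flag says so."""
    if FLAG_GZIP in flags.split(","):
        blob = gzip.decompress(blob)
    return blob.decode("utf-8")
```

Because unflagged rows are read back unchanged, old text could be converted in batches without touching the rows that have not been processed yet.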
-- brion vibber (brion @ pobox.com)
Damned computers.
Well, Tim is out of town. Jason is a couple hours away and there's no
way I'm asking him to drive down on Christmas. The colo does not have
24x7 staffing, so no one is there to reboot it for us today.
Therefore, we're doomed until tomorrow morning. Jason has a family
obligation tomorrow at 10AM, so he _could_ go down tomorrow afternoon,
but only if we can't get someone at the colo to do the reboot.
At least the website is serving cached pages. If someone has a chance,
the cached-pages message could be updated to say when we expect editing
to resume.
From now on when we think about how to spend money, we should think about
this morning, and think about redundancy, which is expensive but probably
necessary as we get bigger and bigger.
--Jimbo
Brion Vibber wrote:
> While the new server is wonderfully fast, it's crashing way *way* too
> often. :(
>
> Has the memory been tested? What kind of warranty do we have? Will
> Penguin replace any defective parts? How long would this take?
>
> We *really* need a way to reboot it remotely. Somebody's going to have
> to go in on Christmas day just to push the reset button, and that ain't
> cool.
>
> sigh...
>
> -- brion vibber (brion @ pobox.com)
I am wondering why Geoffrin is so unstable. For how
many hours was it down?
How about the Opteron's temperatures? I suppose it is
installed in an air-conditioned room.
And how about the Opteron's motherboard BIOS? Has it
been updated to a stable revision? Some Opteron
motherboards had problems with the BIOS.
For the remote reset issue, somebody suggested this on
http://openfacts.berlios.de/index-en.phtml?title=Wikipedia_Status
:
"Type "remote power management" into Google, and
you'll get listings for lots of network-controlled
power-distribution devices, which will allow you to
remotely power-cycle locked-up boxes, but without the
cost of a UPS, and more importantly, with the ability
to power cycle on a per-power socket basis.
(wikipedian, the Anome)"
--
Optim
Yann wrote:
>I suppose that we want the language to change according
>to users' settings with a "user_language" field in the user table.
A user pref to override the default UI language would be nice. In addition,
something like &lang=fr in the URL could force an interface change (to French
in this example) for anybody who has not set an override language in their
preferences.
>I see a problem with the cached pages. We will need to have
>a cache for each language.
Language category tags (via the MediaWiki category system, which is waiting
for some more bug fixing before it is implemented) would be a way to solve
that. When anons (who view cached pages by default) visit a page with a
French language category tag in it, their UI would switch to French. &lang=fr
could then be added to history and talk page links so the French UI follows
the anon.
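
The lookup order described above could be sketched roughly like this in Python. Everything named here is hypothetical (the function names, the list of supported languages, the cache key layout), and putting the URL parameter ahead of the page's language tag is an assumption; the thread only says that a user's own override should win over &lang=:

```python
DEFAULT_LANG = "en"
SUPPORTED = {"en", "fr", "de"}  # illustrative list, not the real set


def pick_ui_language(user_pref=None, url_lang=None, page_lang=None,
                     default=DEFAULT_LANG):
    """Pick the interface language.

    Order: the user's saved override, then &lang= from the URL, then the
    language category tag on the page, then the site default. Unknown or
    missing values are skipped.
    """
    for candidate in (user_pref, url_lang, page_lang):
        if candidate in SUPPORTED:
            return candidate
    return default


def cache_key(title, ui_lang):
    """One cached copy per (page, interface language) pair."""
    return f"{title}|{ui_lang}"
```

For example, an anon following a link with &lang=fr would get pick_ui_language(url_lang="fr"), and the cached page would be looked up under cache_key("Paris", "fr") instead of the default-language copy, which is one way to handle the per-language cache concern raised in the quoted message.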
-- Daniel Mayer (aka mav)
Jimbo got the machine rebooted and I restored the settings, but I have no
idea how to get the memcached stuff working right. I disabled it for now.
Regards,
Erik
Geoffrin is back up. I am trying to get the website back to normal,
but I really don't know what I'm doing, so I'm just poking around
looking at things. Probably someone could fix this pretty quickly.
I left a voicemail for Brion, but anyone else who knows what to do
could probably do it as well.
An email I received...
-------
Hi,
I am Sujith, 29 years, Male.
I visit your esteemed website on regular basis.I was
shoked to know that ur Database server has crashed
down.
I have one of important exams comming up for which my
study material is only wikipedia.org.
Pl. pull up ur socks and also help us.
Best wishes for a very happy new year ahead.
Thanks,
SUJITH
__________________________________
So, yeah, I guess I better pull up my socks.
--Jimbo