Damned computers.
Well, Tim is out of town. Jason is a couple hours away and there's no way I'm asking him to drive down on Christmas. The colo does not have 24x7 staffing, so no one is there to reboot it for us today.
Therefore, we're doomed until tomorrow morning. Jason has a family obligation tomorrow at 10AM, so he _could_ go down tomorrow afternoon, but only if we can't get someone at the colo to do the reboot.
At least the website is serving cached pages. If someone has a chance, the cached-pages message could be updated to say when we expect editing to resume.
From now on when we think about how to spend money, we should think about this morning, and think about redundancy, which is expensive but probably necessary as we get bigger and bigger.
--Jimbo
Brion Vibber wrote:
While the new server is wonderfully fast, it's crashing way *way* too often. :(
Has the memory been tested? What kind of warranty do we have? Will Penguin replace any defective parts? How long would this take?
We *really* need a way to reboot it remotely. Somebody's going to have to go in on Christmas day just to push the reset button, and that ain't cool.
sigh...
-- brion vibber (brion @ pobox.com)
On Dec 25, 2003, at 08:00, Jimmy Wales wrote:
From now on when we think about how to spend money, we should think about this morning, and think about redundancy, which is expensive but probably necessary as we get bigger and bigger.
When Geoffrin comes back online, I intend to move the live database back to Pliny. Geoffrin is unsuitable for production use until it is thoroughly tested for defects and rebootable in case of future crashes. In just three weeks we've had two hard crashes and two instances of file corruption.
A RAM test needs to be done; that's a highly likely source of errors (over 32 billion bits of memory; if just _one_ is defective, it can cause data corruption or a crash).
We *really should* get database replication going, if at some point we have two working machines with enough disk space to handle the database, so a dead database server can be taken over by the slave.
-- brion vibber (brion @ pobox.com)
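For reference, the "over 32 billion bits" figure can be checked with quick arithmetic. This is only a sketch: it assumes the 4 GB of RAM mentioned elsewhere in the thread, counted in binary gigabytes, and the function name is made up for illustration.

```python
def bits_in_ram(gib: int) -> int:
    """Return the number of individual bits in `gib` binary gigabytes of RAM."""
    bytes_total = gib * 1024 ** 3  # 1 GiB = 1024^3 bytes
    return bytes_total * 8         # 8 bits per byte


if __name__ == "__main__":
    # Assumption: 4 GiB installed, as stated later in the thread.
    print(f"{bits_in_ram(4):,} bits")
```

Any single one of those bits being defective is enough to corrupt data, which is why a full RAM test is worth the downtime.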
Brion Vibber wrote:
A RAM test needs to be done; that's a highly likely source of errors (over 32 billion bits of memory; if just _one_ is defective, it can cause data corruption or a crash).
Early next week, Jason is going to be delivering the new machine, and it can be pressed into service while we take the Opteron out of service for testing and parts swapping, if that's what the problem is.
We *really should* get database replication going, if at some point we have two working machines with enough disk space to handle the database, so a dead database server can be taken over by the slave.
And then perhaps once we get rolling again, this is what we can do with the $4000 in the bank... buy a machine to be the secondary DB server?
--Jimbo
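To make the "secondary DB server" idea concrete, here is a rough sketch of how a client could fall back to a second database host when the first stops answering. Everything in it is an assumption for illustration: the host names are just the machine names from this thread used as placeholders, port 3306 assumes a MySQL server, and `is_reachable` and `pick_db_host` are hypothetical helpers, not anything that exists in our code.

```python
import socket
from typing import Callable, Sequence


def is_reachable(host: str, port: int = 3306, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def pick_db_host(
    candidates: Sequence[str],
    probe: Callable[[str], bool] = is_reachable,
) -> str:
    """Return the first host, in priority order, that answers the probe."""
    for host in candidates:
        if probe(host):
            return host
    raise RuntimeError("no database server is reachable")


if __name__ == "__main__":
    # Placeholder host names; primary first, secondary second.
    try:
        print(pick_db_host(["geoffrin", "pliny"]))
    except RuntimeError as err:
        print(err)
```

A TCP probe only shows that something is listening. It does not prove the secondary has caught up with the primary, so a real takeover would still need replication to be working and someone (or something) to promote the secondary.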
Brion Vibber wrote:
When Geoffrin comes back online, I intend to move the live database back to Pliny. Geoffrin is unsuitable for production use until it is thoroughly tested for defects and rebootable in case of future crashes. In just three weeks we've had two hard crashes and two instances of file corruption.
We are using a 64-bit kernel on Geoffrin, right? I wonder if this is a good idea, since the kernel hackers working on the AMD64 Linux port have just fixed several nasty bugs that could lead to data corruption and crashes. According to Andi Kleen, one of the lead hackers working on x86_64 Linux, AMD64 isn't really ready for production use and still has lots of hack value.
Might it be a good idea to stick with a 32-bit system on Geoffrin until the Linux codebase for AMD64 stabilizes a bit more?
Alwin Meschede
I agree that a 32-bit operating system is currently the better choice for production use on AMD Opteron CPUs.
When I first learnt that Geoffrin runs 64-bit Linux I was very surprised. I really don't understand why Geoffrin was set up to work in 64-bit mode, since it is a production machine and stability is a critical issue, so we should use 32-bit Linux (or FreeBSD).
I don't mean that the AMD Opteron is not good hardware. I am just saying that it is too early to install 64-bit software, because it is probably buggier than the 32-bit versions.
Optim
--- Alwin Meschede ameschede@gmx.de wrote:
Brion Vibber wrote:
When Geoffrin comes back online, I intend to move the live database back to Pliny. Geoffrin is unsuitable for production use until it is thoroughly tested for defects and rebootable in case of future crashes. In just three weeks we've had two hard crashes and two instances of file corruption.
We are using a 64-bit kernel on Geoffrin, right? I wonder if this is a good idea, since the kernel hackers working on the AMD64 Linux port have just fixed several nasty bugs that could lead to data corruption and crashes. According to Andi Kleen, one of the lead hackers working on x86_64 Linux, AMD64 isn't really ready for production use and still has lots of hack value.
Might it be a good idea to stick with a 32-bit system on Geoffrin until the Linux codebase for AMD64 stabilizes a bit more?
Alwin Meschede
Wikitech-l mailing list Wikitech-l@Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On Dec 26, 2003, at 16:33, Nikos-Optim wrote:
I agree that a 32-bit operating system is currently the better choice for production use on AMD Opteron CPUs.
When I first learnt that Geoffrin runs 64-bit Linux I was very surprised. I really don't understand why Geoffrin was set up to work in 64-bit mode, since it is a production machine and stability is a critical issue, so we should use 32-bit Linux (or FreeBSD).
My understanding is that the system shipped with SuSE Professional 8.1 for AMD64 preinstalled. If that's not production ready, somebody should really tell Penguin Computing that they're shipping broken goods.
-- brion vibber (brion @ pobox.com)
AMD x86-64 (AMD64) technology was designed in 2001 and went into production in 2003. The hardware is good, but I would never trust AMD64 software to run a critical server. It's just too new.
Even if it has no bugs at all, it is too early to run AMD64 software for mission-critical applications.
I think Geoffrin has 2GB of memory? In that case we don't need 64-bit, since 32-bit kernels work fine with 2GB of memory. 64-bit is needed when a machine has more than 4GB of RAM. I suppose the extra performance of AMD64 mode is not essential, so we can downgrade to 32-bit Linux/FreeBSD to be sure that the problem is not coming from the AMD64 kernel.
--- Brion Vibber brion@pobox.com wrote:
My understanding is that the system shipped with SuSE Professional 8.1 for AMD64 preinstalled. If that's not production ready, somebody should really tell Penguin Computing that they're shipping broken goods.
-- brion vibber (brion @ pobox.com)
The machine has 4 gig of memory, so yes, a 32 bit kernel would be fine. I have no objection to trying that, although we should recognize that we have no specific information to cause us to think that this is actually the problem.
SuSE ships this as production software, and I generally trust them. But, the machine is crashing, so...
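The 4GB figure comes straight from the width of a 32-bit address, and the arithmetic is easy to check. The sketch below assumes a flat address space with one byte per address and binary gigabytes; the function names are invented for illustration. It deliberately ignores kernel features such as PAE and any address ranges reserved for devices, both of which change the picture in practice.

```python
GIB = 1024 ** 3  # bytes in one binary gigabyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with `address_bits`-wide addresses, one byte per address."""
    return 2 ** address_bits


def fits_flat(ram_gib: int, address_bits: int = 32) -> bool:
    """True if `ram_gib` GiB of RAM fits inside a flat address space of that width."""
    return ram_gib * GIB <= addressable_bytes(address_bits)


if __name__ == "__main__":
    print(addressable_bytes(32) // GIB, "GiB addressable with 32 bits")
    for ram in (2, 4, 8):
        print(ram, "GiB fits in 32 bits:", fits_flat(ram))
```

So 4GB sits exactly at the 32-bit limit, while anything larger needs either wider addresses or an extension on top of a 32-bit kernel.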
Nikos-Optim wrote:
AMD x86-64 (AMD64) technology was designed in 2001 and went into production in 2003. The hardware is good, but I would never trust AMD64 software to run a critical server. It's just too new.
Even if it has no bugs at all, it is too early to run AMD64 software for mission-critical applications.
I think Geoffrin has 2GB of memory? In that case we don't need 64-bit, since 32-bit kernels work fine with 2GB of memory. 64-bit is needed when a machine has more than 4GB of RAM. I suppose the extra performance of AMD64 mode is not essential, so we can downgrade to 32-bit Linux/FreeBSD to be sure that the problem is not coming from the AMD64 kernel.
--- Brion Vibber brion@pobox.com wrote:
My understanding is that the system shipped with SuSE Professional 8.1 for AMD64 preinstalled. If that's not production ready, somebody should really tell Penguin Computing that they're shipping broken goods.
-- brion vibber (brion @ pobox.com)
Since we don't know what causes the problem, we suspect everything.
Possible problems:
1. Temperature. This can be checked with software.
2. BIOS bugs. Check the motherboard manufacturer's web site for information and new versions. Do not install the very newest version; it is often unstable or untested.
3. 64-bit Linux kernel bugs. Solution: use a 32-bit kernel.
4. Defective CPU. Solutions: downclock (though I think this is impossible on Opterons, since they are locked), reduce CPU utilization, or swap the CPUs for others (use the warranty). Make sure you use a good heatsink and fan.
5. Defective memory. Solution: buy new memory and replace it (use the warranty). Always buy registered ECC memory, never non-ECC, and always from well-known manufacturers such as Kingston, Corsair, Crucial etc. The memory should have its own passive cooling (a heatsink) too.
To check the CPU, the L2 cache, and the RAM, you can use the MPrime software from www.mersenne.org
To check the memory, you can use the memtest86 software from www.memtest86.com
Also test the hard disk surface for bad sectors (although bad sectors should not cause this kind of problem).
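To give an idea of what a memory tester does in principle: it writes known bit patterns, reads them back, and reports any location where the two differ. The toy sketch below shows that idea on an ordinary buffer. It is not memtest86's actual algorithm, and a user-space program like this cannot test physical RAM properly (the operating system and caches sit in between), so treat it purely as an illustration. The names `PATTERNS` and `pattern_test` are made up.

```python
from typing import List

# All zeros, all ones, and two alternating-bit patterns.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_test(buf: bytearray) -> List[int]:
    """Fill `buf` with each pattern, read it back, and return mismatching offsets."""
    bad = set()
    for pattern in PATTERNS:
        # Write pass: store the pattern in every byte.
        for i in range(len(buf)):
            buf[i] = pattern
        # Read pass: any byte that differs from the pattern is suspect.
        for i in range(len(buf)):
            if buf[i] != pattern:
                bad.add(i)
    return sorted(bad)


if __name__ == "__main__":
    faulty = pattern_test(bytearray(1024 * 1024))
    print("suspect offsets:", faulty if faulty else "none")
```

Using complementary patterns matters because a bit stuck at 1 passes an all-ones test and only shows up once a pattern expects that bit to be 0.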
Production software often has bugs. The only thing that can assure you that something will run without problems is its years of presence in the market. AMD64 is so new that it is effectively untested. 32-bit Linux has existed for many years and is very well tested, so you can be sure that if you install a 32-bit kernel and still have the crash problem, the 64-bit kernel is not the cause. However, keep in mind that, in general, 32-bit software has 20% lower performance than 64-bit software on AMD Opterons.
--- Jimmy Wales jwales@bomis.com wrote:
The machine has 4 gig of memory, so yes, a 32 bit kernel would be fine. I have no objection to trying that, although we should recognize that we have no specific information to cause us to think that this is actually the problem.
SuSE ships this as production software, and I generally trust them. But, the machine is crashing, so...
Brion Vibber wrote:
My understanding is that the system shipped with SuSE Professional 8.1 for AMD64 preinstalled. If that's not production ready, somebody should really tell Penguin Computing that they're shipping broken goods.
I agree completely. They *will* hear from me on Monday, and the more information I have by Monday morning, of course the more intelligent my complaining will be.
--Jimbo
Brion Vibber wrote:
My understanding is that the system shipped with SuSE Professional 8.1 for AMD64 preinstalled. If that's not production ready, somebody should really tell Penguin Computing that they're shipping broken goods.
I think we should use as much existing knowledge as possible; this machine is not intended as a learning environment. If nobody here has experience with 64-bit Linux kernels, we shouldn't use them. We ''have'' experience with 32-bit SMP kernels, so let's go for it. If someone has access to the remote console during boot, install GRUB instead of LILO. It helps debug kernel-related boot problems, but of course only if you can reach the console.
If nobody here has experience with SuSE Professional 8.1 for AMD64, we should replace it with a distribution we ''are'' familiar with, be it Red Hat Linux, Fedora Project, or Debian GNU/Linux. We have to eliminate as many variables as possible, and to my understanding that means we have to use an operating system base which the people who administer the machine are familiar with and trust [1].
The latest 64-bit hardware is more than enough to explore; we don't need any more SPOFs than necessary.
Whether the hardware ''is'' working properly is a totally different story [2].
Greetings & good luck, -asb
[1] Because of experiences at work with GNU/Linux-based Internet and enterprise infrastructure services, I personally don't trust SuSE. But I ''do'' trust Red Hat Linux and Debian GNU/Linux, though only if they are administered by someone with enough experience with Red Hat Linux or Debian GNU/Linux (and ''only'' then).
[2] I personally don't like SMP machines based on AMD CPUs like the Athlon MP because (a) neither I nor my dealer nor the manufacturer of the mainboards (Tyan) was ever able to chill them down to acceptable temperatures without watercooling, which definitely ain't no fun on remote SMP systems (lm_sensors will tell you more than you want to know about this), and (b) the AMD BIOSes I have seen so far are too buggy (whatever this Opteron is using might be totally different).
wikitech-l@lists.wikimedia.org