Hi Marcin,
Marcin Cieslak wrote:
- Put documentation in our documentation wiki,
http://wikitech.wikimedia.org. If you need access, please contact me and I'll get you set up.
Can you create accounts for Aude and myself (Saper)? Is Ævar there as well?
Yes, I will get the details to you.
Yes, here are my questions:
(1) It has been reported that the RAID controller has serious stability problems (it causes kernel abends). I think this should be fixed in the new firmware OR the new driver, see below.
(2) What are the kernel upgrade procedures on the WMF servers?
(3) What are the OS upgrade procedures on the WMF servers?
<snip>
I like your enthusiasm in managing these systems. :-) However, especially for the long run I think it would be easiest if you let us (WMF operations) deal with these system level things, and you concentrate on the OSM-specific software setup. Otherwise these systems will naturally diverge from our other servers, and for these production systems that's something we'll need to avoid.
What we normally do:
We upgrade the OS and kernels as needed (for features or stability), or when serious (remote) security issues are identified. We tend to upgrade to the latest Ubuntu kernel and use that unless we experience problems in practice. In this case there is indeed an issue with that RAID controller, so we'll upgrade it. We try to stick with Ubuntu Hardy as long as we can for miscellaneous servers. If you feel you really need newer versions for the OSM setup, then let us know and we can upgrade it.
Partitioning: we know that it's traditional to separate /usr, /var, etc., but we have found that this usually has very little use in practice, and is more often a nuisance. These days we put everything in one large enough / and only split off data partitions on servers where it matters. Of course your databases should be running off a dedicated partition, but for the rest there is probably no real need. If you think otherwise and have good arguments, we can surely change it. We do tend to use LVM for everything non-root in those cases.
The same holds for the RAID setup: on our databases and big storage systems, most often we just run it off the same big RAID-10 array. It's more convenient and flexible and if well-configured the rest of the OS is not hitting that array much at all. If you feel there is a need, we can of course change it - but we'll need to reinstall the OS. A different RAID level would be totally fine as well of course - this is very much dependent on your needs. I picked RAID-10 as neither Aevar nor Katie knew what was necessary, and RAID-10 tends to be the best choice for databases and high performance I/O systems.
Serial console/LOM access cannot easily be handed out, but should not usually be necessary either. In the unlikely event that the system becomes unmanageable in-band, just contact us directly (ask on #wikimedia-tech for example) and we'll restore it quickly.
I hope that we can have a joint project on maps and use resources efficiently. For example, we might not have space for the full OSM database anywhere else than on ptolemy. However, I think we can find a way to provide production-level stability and stay within our resource base. Besides, I have no objections to having exactly the same production/monitoring features on cassini as well.
I really want to stress that these systems need to be *separate*; they cannot be used together at all. Ideally there is no traffic between those servers at all, except in the form of cassini generating visitor traffic like the rest of the Internet. Cassini is meant for playing around where lots of people have access; the other two are (in the end) really meant for production use with limited access. Stable operation is simply not possible when arbitrary users can do arbitrary things on a system, and that's why we intended these systems to be very isolated from the start. Cassini is also managed by WMDE / Toolserver, whereas ptolemy and ortelius are managed by the Wikimedia Foundation. So I'm afraid that we really cannot use those servers in one resource pool... If those separate clusters do not have enough resources/space to do what we need, I think we should look into buying more hardware. That is really not impossible. :)
Thanks for sharing your ideas. :)