[Maps-l] Server admin procedures wrt ptolemy and ortelius

Fri Sep 18 12:18:22 UTC 2009

Hi Marcin,

Marcin Cieslak wrote:
>> - Put documentation in our documentation wiki,
>> http://wikitech.wikimedia.org. If you need access, please contact me and
>> I'll get you set up.
> 
> Can you create accounts for Aude and myself (Saper)? Is Ævar there as well?

Yes, I will get the details to you.

> Yes, here are my questions:
> 
> (1) It has been reported that the RAID controller has serious stability
> problems (causes kernel abends). I think this should be fixed in the new
> firmware OR the new driver, see below.
> 
> (2) What are the kernel upgrade procedures on the WMF servers?
> (3) What are the OS upgrade procedures on the WMF servers?

<snip>

I like your enthusiasm in managing these systems. :-) However,
especially for the long run I think it would be easiest if you let us
(WMF operations) deal with these system level things, and you
concentrate on the OSM-specific software setup. Otherwise these systems
will naturally diverge from our other servers, and for these production
systems that's something we'll need to avoid.

What we normally do:

We upgrade the OS and kernels as needed (features/stability), or when
serious (remote) security issues are identified. We tend to upgrade to
the latest Ubuntu kernel and use that unless we experience problems in
practice. In this case there is indeed an issue with that RAID
controller, so we'll upgrade it. We try to stick with Ubuntu Hardy as
long as we can for miscellaneous servers. If you feel you really need
newer versions for OSM setup, then let us know and we can upgrade it.

Partitioning: we know that it's traditional to separate /usr /var etc,
but we have found that this usually has very little use in practice, and
is more often a nuisance. These days we put everything in one large
enough / and only split off data partitions on servers where it matters.
Of course your databases should be running off a special partition, but
for the rest there is probably no real need. If you think otherwise and
have good arguments, we can surely change it, of course. We do tend to
use LVM for everything non-root in those cases.

The same holds for the RAID setup: on our databases and big storage
systems, most often we just run it off the same big RAID-10 array. It's
more convenient and flexible and if well-configured the rest of the OS
is not hitting that array much at all. If you feel there is a need, we
can of course change it - but we'll need to reinstall the OS. A
different RAID level would be totally fine as well of course - this is
very much dependent on your needs. I picked RAID-10 as neither Aevar nor
Katie knew what was necessary, and RAID-10 tends to be the best choice
for databases and high performance I/O systems.
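
To make the capacity side of that trade-off concrete, here is a rough
sketch of how usable space compares across common RAID levels. The disk
count and disk size in the usage note are made-up illustration values,
not the actual ptolemy/ortelius hardware, and the function name is
hypothetical. It only covers capacity; it says nothing about performance
or rebuild behaviour.

```python
def usable_capacity_gb(level, disks, disk_size_gb):
    """Return usable capacity in GB for equally sized disks.

    Illustrative only: ignores controller overhead, hot spares and
    filesystem overhead.
    """
    if level == "raid0":
        # Striping only, no redundancy.
        return disks * disk_size_gb
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID-5 needs at least 3 disks")
        # One disk's worth of space goes to parity.
        return (disks - 1) * disk_size_gb
    if level == "raid6":
        if disks < 4:
            raise ValueError("RAID-6 needs at least 4 disks")
        # Two disks' worth of space goes to parity.
        return (disks - 2) * disk_size_gb
    if level == "raid10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID-10 needs an even number of disks, at least 4")
        # Every disk is mirrored, so half the raw space is usable.
        return (disks // 2) * disk_size_gb
    raise ValueError("unknown RAID level: %s" % level)
```

For a hypothetical array of 8 disks of 500 GB each, RAID-10 gives
2000 GB usable, against 3500 GB for RAID-5 and 3000 GB for RAID-6, so
RAID-10 trades capacity for mirroring without parity calculations.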

Serial console/LOM access cannot easily be handed out, but it should not
usually be necessary either. In the unlikely event that the system becomes
unmanageable in-band, just contact us directly (ask on #wikimedia-tech
for example) and we'll restore it quickly.

> I hope that we can have a joint project on maps and use resources
> efficiently. For example, we might not have space for the full OSM
> database anywhere else than on ptolemy. However, I think we can find
> a way to provide production-level stability and stay within our resource
> base. Besides, I have no objections to having exactly the same
> production/monitoring features on cassini as well.

I really want to stress that these systems need to be *separate*; they
cannot be used together at all. Ideally there is no traffic between
those servers at all, except in the form of cassini generating visitor
traffic like the rest of the Internet. Cassini is meant for playing
around where lots of people have access, the other two are (in the end)
really meant for production use with limited access. Stable operation is
simply not possible when arbitrary users can do arbitrary things on a
system, and that's why we intended these systems to be very isolated
from the start. Cassini is also managed by WMDE / Toolserver, ptolemy
and ortelius are Wikimedia Foundation managed. So I'm afraid that we
really cannot use those servers in one resource pool... If those
separate clusters do not have enough resources/space to do what we need,
I think we should look into buying more hardware. That is really not
impossible. :)

Thanks for sharing your ideas. :)

-- 
Mark Bergsma <mark at wikimedia.org>
System & Network Administrator, Wikimedia Foundation