Mark Bergsma wrote:
Ævar Arnfjörð Bjarmason wrote: Please keep in mind that ptolemy and ortelius are meant to be WMF production boxes. That means they're (also) managed by the Wikimedia Ops team. I think that for the near future we're happy to let you play with the boxes and experiment with what OSM/integration software/architecture works best. But eventually, when these maps are integrated into our core web sites, the servers and software will need to be managed by WMF as well as you guys. Especially since you volunteers might lose interest in the long run... :)
I think being "production" is very good - we will be on monitoring from the very beginning :-)
That means:
- Please work with us; keep us informed
So far the only updates for ptolemy:
- LOM firmware update by river
- Sun STK RAID INT firmware update from 5.2-0 (15825) to 5.2-0 (16732)
- Tool to manage STK RAID INT installed in /usr/sbin/arcconf (Version 6.10 (B17551) from Intel website). I think that Adaptec's version (Version 6.10 (B18359)) is a bit more informative; what are you using on other servers?
- Installed as dependencies for arcconf: libstdc++5 gcc-3.3-base libgcc1
- Put documentation in our documentation wiki,
http://wikitech.wikimedia.org. If you need access, please contact me and I'll get you set up.
Can you create accounts for Aude and myself (Saper)? Is Ævar there as well?
Looking at http://wikitech.wikimedia.org/view/Platform-specific_documentation, how different is the Sun Fire X4250?
- Logging of server actions can be done on #wikimedia-tech using the log
bot. Just use "!log <message>" in the channel, it will work. Put the server-name in the line.
Cool, thanks.
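For illustration, a log line along those lines could be put together as below. This is only a minimal sketch: the helper name `format_log_line` and the "server: message" layout are my assumptions, since the only stated requirements are the "!log" prefix and that the server name appears in the line.

```python
def format_log_line(server: str, message: str) -> str:
    """Build a "!log" line that names the server, for pasting into #wikimedia-tech.

    The "<server>: <message>" layout is an assumed convention, not a documented one.
    """
    server = server.strip()
    # Collapse newlines and repeated whitespace, since an IRC message is a single line.
    message = " ".join(message.split())
    if not server or not message:
        raise ValueError("server and message must both be non-empty")
    return f"!log {server}: {message}"


print(format_log_line("ptolemy", "updated Sun STK RAID INT firmware to 5.2-0 (16732)"))
```

The printed line can then be pasted into the channel as-is.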
- If you have any problems/issues/needs related to managing the servers
in general (RAID controller/driver issues?), as opposed to OSM software specific things, then certainly ask us! Chances are we've already solved it or have a certain way of doing things, and there is no need for you to reinvent the wheel. :)
Yes, here are my questions:
(1) It has been reported that the RAID controller has serious stability problems (it causes kernel abends). I think this should be fixed in the new firmware OR the new driver, see below.
(2) What are the kernel upgrade procedures on the WMF servers?
(3) What are the OS upgrade procedures on the WMF servers?
(4) /home/saper/raid/linux_x86_x64_driver_v1.1.5-2463 contains Linux driver version 2463 for Sun STK RAID INT that we probably should be running. I can do that given (2) above :)
(5) I asked on #ts-admins about management console access, which would be beneficial for performing changes to the kernel and partitioning; see the next points for what we need to do from there.
(6) I think we should reconfigure RAID - for now, I would like to put the current filesystem on a single RAID1 pair of drives. It's root, so I think this shouldn't be done from the running system. I think we can disband the current RAID 10 setup for now, we will be testing one or two possible RAID setups for Postgres as soon as we have space.
(7) I'd love to have the OS repartitioned - small /, large /usr, mid-large /var, small /tmp in a traditional UNIX way. All of this on the RAID 1 volume created in step #6.
(8) It would be nice to have a different OS (FreeBSD or Solaris), but I understand that you'd probably like to have a uniform setup across WMF, and I think I can live with it. It would be nice to have information re #3 though if we stick to Ubuntu.
I think I could do (4)...(7) myself given access to the management console and with some possibility to have some netboot/CD-boot from there. This leads me to:
(9) I've seen this: http://wikitech.wikimedia.org/view/Automated_installation Do you have some kind of minimal netboot/recovery system to be invoked from LOM to do stuff like total repartitioning?
Ptolemy and ortelius are, in the long run, *not* meant to be used by toolserver users. Those boxes are explicitly separate. You can't run a production database when users are running all kinds of inefficient and uncoordinated queries on it. :) For now it doesn't matter, but keep this in mind.
Cassini is a toolserver, and managed by Wikimedia Germany. They do things differently than WMF, coordinate with them to see what works there.
I hope that we can have a joint project on maps and use resources efficiently. For example, we might not have space for the full OSM database anywhere other than on ptolemy. However, I think we can find a way to provide production-level stability and stay within our resource base. Besides, I have no objections to having exactly the same production/monitoring features on cassini as well.
Uff, that's all from me for now :)