Computer Room Shutdown and Recovery
Shutting down servers and network devices in the computer room follows one of two scenarios:
Critical Situation with Extended ETA:
If the computer room is in a critical state with no immediate solution or an estimated time for resolution that exceeds a reasonable wait time (typically a couple of hours), a decision will be made to power off all servers.
Critical Situation with Proximate ETA:
In instances where the computer room is in a critical state but a solution is imminent, only a select group of servers, identified as “safe to power off,” will undergo shutdown procedures. This procedure aims to balance the need for thermal management with the continuity of essential services during server shutdowns.
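The choice between the two scenarios can be sketched as a small function. This is only an illustration: the function name `shutdown_scope`, the return labels, and the 2.0 hour threshold (standing in for "a couple of hours") are assumptions, not part of the procedure. The real decision is made by people on site.

```python
from typing import Optional

# Assumption: "a couple of hours" is modelled as 2.0 hours.
MAX_WAIT_HOURS = 2.0


def shutdown_scope(eta_hours: Optional[float],
                   max_wait_hours: float = MAX_WAIT_HOURS) -> str:
    """Return "all" for the extended-ETA scenario, "safe-only" otherwise.

    eta_hours is None when there is no known solution or estimate.
    """
    if eta_hours is None or eta_hours > max_wait_hours:
        # Extended ETA (or no ETA at all): power off all servers.
        return "all"
    # Proximate ETA: only the "safe to power off" servers are shut down.
    return "safe-only"
```

For example, `shutdown_scope(None)` and `shutdown_scope(5)` both select a full power off, while `shutdown_scope(0.5)` limits the shutdown to the safe group.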
Computer Shutdown Tiers
The tiers are defined as follows. The computers in each tier are listed in the sections below.
Tier 1: Computers that can be powered off during an emergency in the computer room by IT/Devops, without authorization or help from the system owners.
Tier 2: Computers that will be powered off in the computer room by IT/Devops after alerting the system owners, but without taking down the control system.
Tier 3: Computers that will be powered off in the computer room by IT/Devops, which will take down the control system but not the communications to the summit.
Tier 4: Complete power off of the computer room.
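The tier definitions above can be captured in a small data structure, which may be convenient if the lists are ever scripted. The names `Tier`, `TierImpact`, `IMPACT`, and `tiers_up_to` are hypothetical. The sketch also assumes that escalation is cumulative (reaching a tier implies the lower tiers are already off) and that Tier 1 leaves the control system running; neither is stated explicitly in the procedure.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Tier(IntEnum):
    ONE = 1
    TWO = 2
    THREE = 3
    FOUR = 4


@dataclass(frozen=True)
class TierImpact:
    control_system_down: bool
    summit_comms_down: bool


# Impact of each tier as described in the definitions above.
IMPACT = {
    Tier.ONE: TierImpact(control_system_down=False, summit_comms_down=False),
    Tier.TWO: TierImpact(control_system_down=False, summit_comms_down=False),
    Tier.THREE: TierImpact(control_system_down=True, summit_comms_down=False),
    Tier.FOUR: TierImpact(control_system_down=True, summit_comms_down=True),
}


def tiers_up_to(max_tier: Tier) -> List[Tier]:
    """Tiers to work through, in order (assumes cumulative escalation)."""
    return [tier for tier in Tier if tier <= max_tier]
```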
Tier 1
The following servers are safe to power off. They can be powered off without notifying or alerting the system owner.
IT
vsphere[2-3].cp.lsst.org - snapshot and shut down the VMs first. The hvaccp and dccp1 VMs migrate to vsphere01.cp.lsst.org. Shut down the VMs on vsphere2, put vsphere2 in maintenance mode, and then shut it down from vcenter.cp.lsst.org/ui. Then repeat for vsphere03.
lukay[1-5].cp.lsst.org
perfsonar01.cp.lsst.org
Control
love01.cp.lsst.org - development machine
Comcam:
comcam-dc01
comcam-daq-mgt
comcam-archiver
Lsstcam:
Any lsstcam machines that are part of the diagnostic cluster (lsstcam-dc01.cp.lsst.org - 10).
Auxtel
auxtel-dc01.cp.lsst.org - can always be shut down first; it is only used for image visualization
If we are not actively taking data, e.g. during the daytime, during the weekend, or during a week when the AuxTel is NOT operating on sky:
auxtel-archiver.cp.lsst.org - can always be shut down if we are not taking data; if we are taking data, we will lose the ability to ingest data into the Butler
auxtel-daq-mgt.cp.lsst.org - can always be shut down if we are not taking data, will lose connection/monitoring of the WREB a
Tier 2
The following computers will be powered off after alerting the system owners, without taking down the control system.
Control System
azar[02-03].cp.lsst.org
Use foreman - remote job on machine(s) with ipmitool chassis power off || shutdown -h now
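Host lists in this document use a bracket shorthand such as azar[02-03].cp.lsst.org or core[02,03].cp.lsst.org. When preparing a remote job it can help to expand these into explicit hostnames. The helper below is a minimal sketch of that expansion; `expand_hosts` is a hypothetical name and is not part of foreman or any other tool mentioned here. It only understands numeric ranges and comma-separated numbers, so a placeholder such as core[remaining] is rejected.

```python
import re
from typing import List

# prefix[body]suffix, with exactly one bracket group.
_PATTERN = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)$")


def expand_hosts(pattern: str) -> List[str]:
    """Expand 'name[01-03].domain' or 'name[02,03].domain' into hostnames."""
    match = _PATTERN.match(pattern)
    if match is None:
        # No bracket group: treat the pattern as a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    hosts = []
    for item in body.split(","):
        start, sep, end = item.strip().partition("-")
        if not start.isdigit() or (sep and not end.isdigit()):
            raise ValueError(f"cannot expand {item!r} in {pattern!r}")
        if not sep:
            end = start
        width = len(start)  # keep zero padding, e.g. 02 -> 02
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts
```

For example, `expand_hosts("azar[02-03].cp.lsst.org")` yields the two azar hostnames listed above, which can then be selected as the targets of the remote job.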
Tier 3
The following computers will be powered off after alerting the system owner. This takes down the control system but not the communications to the summit.
IT
core[02,03].cp.lsst.org (not dns or foreman hypervisor). Log in to core03 first and shut down the libvirt VMs that are running on the node, using the virsh shutdown <name> command. To list the VMs on the node, use virsh list --all. Once this is complete, the node can be safely powered off and the procedure can be repeated on core02.
elqui[01-18].cp.lsst.org
ipsec switches (can be done in tier 4)
leafs of each rack (except A1), but not the spine switches in A5 and A6 (can be done in tier 4)
ipmi of each rack (except A1) (can be done in tier 4)
all vms except hvac monitoring (hvaccp) and domain controller (dccp)
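The virsh steps for the core nodes can be sketched as follows. The code only parses the text printed by virsh list --all and builds the matching virsh shutdown commands; it does not run anything. The function names and the VM names in the usage note are hypothetical, and the parser assumes the usual three-column table (Id, Name, State) with VM names that contain no spaces.

```python
from typing import List


def running_vms(virsh_list_output: str) -> List[str]:
    """Return the names of VMs reported as 'running' by `virsh list --all`."""
    names = []
    for line in virsh_list_output.splitlines():
        parts = line.split()
        # Skip blank lines, the dashed separator and the header row.
        if len(parts) < 3 or parts[0] == "Id":
            continue
        name, state = parts[1], " ".join(parts[2:])
        if state == "running":
            names.append(name)
    return names


def shutdown_commands(virsh_list_output: str) -> List[str]:
    """Build one `virsh shutdown <name>` command per running VM."""
    return [f"virsh shutdown {name}" for name in running_vms(virsh_list_output)]
```

Given output that lists vm-a as running and vm-b as shut off, `shutdown_commands` returns only `virsh shutdown vm-a`. Running virsh list --all again afterwards confirms that every VM is shut off before the node itself is powered down.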
Control System
First bring down the control system, then bring down yagan[01-20].cp.lsst.org using foreman -> remote job on machines: ‘ipmitool chassis power off’
azar01.cp.lsst.org
chonchon[01-03].cp.lsst.org
nfs1.cp.lsst.org
tma-controller01 (open tekniker and phase cabinet doors)
hexrot-vm01 (shut off topend machines)
m1m3-dev (already disconnected from hardware)
comcam-mcm
comcam-db01
auxtel-mcm
auxtel-db01
fp01 (only if warmup cameras)
daq mgt (only if warmup cameras)
daq ATCA crates (only if warmup cameras)
Tier 4
The following computers will be powered off. This takes down the communications to the summit.
IT:
vsphere1.cp.lsst.org
core[remaining].cp.lsst.org (includes foreman and dns)
yepun[01-05].cp.lsst.org
nvr01.cp.lsst.org
network devices (spines, agg, leafs, wlc, cucm, etc)
dwdm