Computer Room Shutdown and Recovery

The process of shutting down servers and network devices in the computer room will adhere to the following two scenarios:

  • Critical Situation with Extended ETA:

If the computer room is in a critical state with no immediate solution or an estimated time for resolution that exceeds a reasonable wait time (typically a couple of hours), a decision will be made to power off all servers.

  • Critical Situation with Proximate ETA:

In instances where the computer room is in a critical state but a solution is imminent, only a select group of servers, identified as “safe to power off,” will undergo shutdown procedures. This procedure aims to balance the need for thermal management with the continuity of essential services during server shutdowns.

Computer Shutdown Tiers

The following is the list of computers in each tier.

Tier 1: Computers that can be powered off during an emergency in the computer room by IT/Devops, without the authorization or help from system owners.

Tier 2: Computer that will be powered off in the computer room by IT/Devops, alerting the system owners but without taking down the control system

Tier 3: Computers that will be powered off in the computer room by IT/Devops, that will take down the control system but not the communications to the summit.

Tier 4: Complete power off of the computer room.

Tier 1

The following is the list of safe to power-off servers. The computers can be powered off without noticing or alerting the system owner.

IT

  • vsphere[2-3].cp.lsst.org - Previous movement of VMs to vsphere01.cp.lsst.org

  • lukay[1-5].cp.lsst.org

  • perfsonar01.cp.lsst.org

Comcam:

  • comcam-dc01

  • comcam-daq-mgt

  • comcam-archiver

Lsstcam:

  • Any lsstcam machines part of diagnostic cluster.

  • Auxtel

auxtel-dc01.cp.lsst.org - can always be shut down first, it is only used for image visualization

If we are not actively taking data e.g. during the daytime, during the weekend, or during a week when the AuxTel is NOT operating on sky:

auxtel-archiver.cp.lsst.org - can always be shut down if we are not taking data, will lose ability to ingest data in Butler if taking data

auxtel-daq-mgt.cp.lsst.org - can always be shut down if we are not taking data, will lose connection/monitoring of the WREB a

Tier 2

The following is the list computers will be powered off alerting the system owners, but not taking down the control system

Control System

  • yagan[13-20].cp.lsst.org

  • azar[02-03].cp.lsst.org

  • love02.cp.lsst.org

Tier 3

The following is the list computers will be powered off alerting the system owner, taking down the control system but not the communications to the summit

IT

  • vsphere1.cp.lsst.org

  • core[2 instances].cp.lsst.org (not dns of foreman hypervisor)

  • elqui[01-18].cp.lsst.org

  • ipsec switches

  • leafs of each rack (except A1)

  • ipmi of each rack (expect A1)

  • all vms except hvac monitoring

Control System

  • yagan[01-12].cp.lsst.org

  • azar01.cp.lsst.org

  • love01.cp.lsst.org

  • chonchon[01-03].cp.lsst.org

  • nfs1.cp.lsst.org

  • tma-controller01 (open tekniker and phase cabinet doors)

  • hexrot-vm01 (shut off topend machines)

  • m1m3-dev (already disconnected from hardware)

  • comcam-mcm

  • comcam-db01

  • auxtel-mcm

  • auxtel-db01

  • fp01 (only if warmup cameras)

  • daq mgt (only if warmup cameras)

  • daq ATCA crates (only if warmup cameras)

Tier 4

The following is the list computers will be powered off that will take down the communications to the summit.

IT:

  • core[remaining].cp.lsst.org (includes foremand and dns)

  • yepun[01-05].cp.lsst.org

  • nvr01.cp.lsst.org

  • network devices (spines, agg, leafs, wlc, cucm, etc)

  • dwdm