AuxTel Recovery after Shutdown¶
Overview¶
This document describes the procedures necessary to recover the AuxTel systems from a major event. This might include a power shutdown, a major software upgrade, or a loss of network connectivity.
Precondition¶
These procedures should be applied any time the AuxTel systems are not operating normally. Not all procedures will need to be followed in every case, and the user should use judgement and only apply the recovery procedures to the systems that are not operating normally.
Post-Condition¶
After completing these procedures, the AuxTel systems should be operating normally. It is recommended to perform a full set of daytime checkouts after completing these procedures to confirm the recovery has been successful.
Procedure Steps¶
The recovery procedures here are divided into several sections:
Main ATCamera electronics and sensor readout cabinet recovery¶
The main ATCamera electronics and sensor readout cabinet is on the first floor, next to the chiller, and is shown in Figure 1. Figure 2 shows the inside of this cabinet after opening the door. After a loss of power or other major work, the chiller should start up, but there are typically two things that need to be done, as shown by the yellow arrows in Figure 2:
- The Pfeiffer vacuum gauge will stop reading and stop sending telemetry.
To reset it, press and hold the Up arrow key for 3 seconds. The display should then start reading a vacuum pressure and resume sending telemetry.
- The CryoCon temperature controller will stop controlling.
To resume control, press the Control button on the front panel. The blue light will come on. Typically, when first pressed, there is a short Overtemp excursion and the blue light goes off. In this case, press the button again.
The goal is for the blue light to come on and stay on. It may take 2-3 tries for it to stay on.
ATMCS/ATPneumatics recovery¶
Often the ATMCS and ATPneumatics CSCs will fail to recover after a loss of power or a software upgrade. In this case the ATMCS/ATPneumatics cRIO needs to be rebooted. This is inside the main AT control cabinet on the first floor, shown in Figures 3 and 4.
Press the reset button briefly (less than 1 second). The yellow light on the cRIO should come on. When the yellow light goes out, the reboot is completed. The CSCs should then be recovered.
It is also possible to reboot this cRIO remotely: ssh into atmcs-crio.cp.lsst.org using the credentials in the 1Password vault and send the restart command.
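When the cRIO is rebooted remotely, the yellow light is not visible, so it can help to poll until the host answers again. The following is a minimal sketch, not part of the official procedure: `wait_until` is a hypothetical helper name, and the commented usage line assumes the cRIO answers ICMP ping from the machine where you run it.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep when another attempt is still to come.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (assumption: ping reaches the cRIO from your machine):
#   wait_until 30 10 ping -c 1 atmcs-crio.cp.lsst.org && echo "cRIO is back"
```

A successful ping only shows that the host is up; the CSCs still need to be checked separately.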
ATHexapod recovery¶
Sometimes the ATHexapod CSC does not recover from a major event. The ATHexapod controller is also located in the main AT control cabinet shown in Figure 3.
In the event of a failure of the ATHexapod CSC, power cycle the controller by turning it off, waiting approximately 1 minute, and turning it back on. The location of the power button is shown in Figure 5.
ATCalSys recovery¶
The ATCalSys generates white and monochromatic light for illuminating the dome screen for calibrations. The system is shown in Figure 6. There are typically two things that need to be done after a loss of power to recover it.
- Arrow #1 in Figure 7 shows the NUC computer that is auxtel-monochromator01.cp.lsst.org.
After a power failure, it does not start back up automatically. There is a small round power button on the left side of the computer that needs to be pressed to power it up.
A configuration change that will make this unnecessary is under development, but for now it needs to be done.
Once the computer is powered up, the LabView instance needs to be relaunched. The procedure for this is outlined in the AuxTel Illumination System Handbook.
- Often the auxtel-ill-control.cp.lsst.org fails to come up properly after a loss of power.
In this case, it needs to be manually power cycled. It is the machine shown by arrow #2 in Figure 7. At the back of the computer, there is a green and orange power connector. This needs to be unplugged and then re-plugged to power cycle the computer.
ATDome recovery¶
ATDome does not usually have a problem recovering. More detail on interfacing with the ATDome hardware is in the technote SITCOMTN-094. The reset procedure is briefly outlined here:
Press the safety gate bypass button on the outside of the main drive cabinet to bypass the safety gate and then open the safety gate.
Reset the Main Box cRIO on the first floor as shown in Figure 8.
Reset the Top Box cRIO on the second floor as shown in Figure 9.
Re-lock the safety gate and press the button again to remove the bypass.
ATCamera recovery¶
Recovering the ATCamera is the most complex set of steps in this recovery procedure. This procedure assumes that the user is familiar with the CCS Camera Control System software. With the complexity of CCS, this document will not be able to cover all possible things that might go wrong. However, below are outlined some procedures that will deal with most cases. The technote AuxTel PowerUp sequence has detailed information on how to power up the camera.
Step 1 - Assess the status of the CCS subsystems¶
The easiest way to do this is to open a CCS console:
Log in to auxtel-hcu01
ssh -XY <your login>@auxtel-hcu01.cp.lsst.org
Open a CCS-console
$ ccs-console &
If you have an M1 Mac, this command will result in a black window. In that case, run this command:
$ ccs-console -Dsun.java2d.xrender=false -Dsun.java2d.pmoffscreen=false &
After the CCS-Console window opens, use the pulldown-menu to launch CCS Tools > Monitoring > Whole Camera > CCS Health.
This should give you a display like Figure 10.
All of the subsystems should be operational. However, after a major event, it is likely that one or more of the subsystems are in Engineering Fault. Proceed with step 2 to clear the faults out of those failing subsystems.
Step 2 - Bring the failing subsystems out of fault¶
Bringing the CCS subsystems out of fault requires interfacing with the CCS Shell. Once you are in the CCS Shell, you can issue commands to the various subsystems.
Remember that “tab-complete” is your friend in CCS. If you aren’t sure what commands are available, try hitting tab to see what it shows you.
The CCS subsystems have levels of permission which limit what you can do. At the lowest level, only some commands will be visible. At higher levels, you will have access to more commands. In addition, there is a normal mode and an engineering mode for each subsystem. Some commands are only accessible in engineering mode. When you access a higher level, a lock is placed on that subsystem which must be removed before the system will operate.
Here is an example of bringing one of the subsystems out of fault, in this case ats:
$ ccs-shell & # Starts the CCS shell from the bash prompt at auxtel-hcu01.cp.lsst.org
ccs> set level ats 10 # Set the ats subsystem to the highest level
ccs> ats switchToEngineeringMode
ccs> ats clearAllAlerts
ccs> ats switchToNormalMode
ccs> unlock ats # This sets the level back to 0
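The same lock, clear, and unlock sequence applies to other subsystems, so it can be handy to generate it for a given subsystem name. The sketch below only prints the commands for pasting at the `ccs>` prompt; it does not talk to CCS. `ccs_clear_alerts_commands` is a hypothetical helper name, the spelling `switchToEngineeringMode` is assumed by analogy with `switchToNormalMode`, and it is an assumption that every subsystem accepts level 10 in the same way as ats.

```shell
#!/bin/sh
# Print the CCS shell commands that clear alerts for one subsystem.
# This only prints text; paste the output at the ccs> prompt.
# $1 = subsystem name, e.g. ats
ccs_clear_alerts_commands() {
    subsystem="$1"
    if [ -z "$subsystem" ]; then
        echo "usage: ccs_clear_alerts_commands <subsystem>" >&2
        return 1
    fi
    cat <<EOF
set level $subsystem 10
$subsystem switchToEngineeringMode
$subsystem clearAllAlerts
$subsystem switchToNormalMode
unlock $subsystem
EOF
}

# Example: print the sequence for the ats subsystem.
ccs_clear_alerts_commands ats
```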
For future commands, this guide won’t go through all of the locking and unlocking steps, and it’s assumed you have brought the subsystem to the necessary level to access the command. Using the clearAllAlerts command will usually allow you to clear most of the subsystem faults after a major event. However, there are some exceptions:
- The ats-mcm (which stands for Master Control Module) cannot be cleared in this way.
However, after the other faults have been cleared, ats-mcm should come out of fault. If it doesn’t, try logging into auxtel-mcm.cp.lsst.org and running the command sudo systemctl restart ats-mcm. Of course, this requires sudo privileges.
- If the WREB board has not been powered up, then ats-fp will not be reporting.
This requires starting up the WREB board with the ats-init.py script, followed by turning on the HV bias. Detailed instructions for starting up the WREB and turning on the HV are available in the "powering up from a completely cold state" section of SITCOMTN-026.
- Sometimes, bonn-shutter has a fault which cannot be cleared with the instructions above.
When this happens, the only way that has been found to clear this is to physically power cycle the shutter controller. Figure 11 shows the location of the bonn shutter controller. Power cycle it by unplugging the power cable, waiting a few seconds, and plugging it back in. This usually clears the fault.
Step 3 - Bringing ats-ocs-bridge to the proper state¶
One of the CCS modules is ats-ocs-bridge. This is the subsystem that interfaces between CCS and the Observatory Control System (i.e. the CSCs). In this case ats-ocs-bridge is interfacing with the ATCamera CSC. It is necessary to get ats-ocs-bridge into the proper state in order to be able to control ATCamera with LOVE and the ScriptQueue. Here are the necessary steps:
Get the state of ats-ocs-bridge by running this command from the ccs-shell:
ccs> ats-ocs-bridge getState
This will return something like:
AlertState:NOMINAL CCSCommandState:IDLE CommandState:READY ConfigurationState:CONFIGURED OfflineState:OFFLINE_PUBLISH_ONLY OperationalState:ENGINEERING_OK PhaseState:OPERATIONAL SummaryState:OFFLINE
- The SummaryState is the same state of ATCamera you see with LOVE.
If the SummaryState is FAULT, it cannot be brought out of fault with the normal LOVE commands. It needs to be brought out of fault with the ccs-shell command:
ccs> ats-ocs-bridge clearFault
- Assuming the SummaryState is OFFLINE, then we look at the OfflineState.
If the OfflineState is OFFLINE_PUBLISH_ONLY, we need to transition it to OFFLINE_AVAILABLE before we can use the usual state transition commands in LOVE and the script queue to bring it online. This is done with the ccs-shell command:
ccs> ats-ocs-bridge setAvailable
- Transition ATCamera to STANDBY
Once we have it in SummaryState OFFLINE and OfflineState OFFLINE_AVAILABLE, the ATCamera can transition using the script queue and the set_summary_state.py script to bring the SummaryState to STANDBY.
- Transition LATISS to ENABLED
Once the SummaryState is STANDBY, you can run enable_latiss.py in the script queue to bring up all of LATISS. If this is successful, things should now be operating normally.
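The decision logic in Step 3 can be summarized as a small lookup on the getState output. The sketch below is illustrative only: `state_field` and `next_bridge_action` are hypothetical helper names, not CCS tools, and the parsing assumes the output consists of whitespace-separated `Name:VALUE` pairs like the sample shown in this step.

```shell
#!/bin/sh
# Extract the value of one "Name:VALUE" field from getState output.
# $1 = field name, $2 = full getState output
state_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Print the next recovery step for a given getState output.
# $1 = full getState output
next_bridge_action() {
    summary="$(state_field SummaryState "$1")"
    offline="$(state_field OfflineState "$1")"
    case "$summary" in
        FAULT)
            echo "ccs-shell: ats-ocs-bridge clearFault" ;;
        OFFLINE)
            case "$offline" in
                OFFLINE_PUBLISH_ONLY)
                    echo "ccs-shell: ats-ocs-bridge setAvailable" ;;
                OFFLINE_AVAILABLE)
                    echo "script queue: set_summary_state.py to STANDBY" ;;
                *)
                    echo "no step defined for OfflineState '$offline'" ;;
            esac ;;
        STANDBY)
            echo "script queue: enable_latiss.py" ;;
        *)
            echo "no step defined for SummaryState '$summary'" ;;
    esac
}

# Example with the sample output from this step:
next_bridge_action "AlertState:NOMINAL CCSCommandState:IDLE CommandState:READY ConfigurationState:CONFIGURED OfflineState:OFFLINE_PUBLISH_ONLY OperationalState:ENGINEERING_OK PhaseState:OPERATIONAL SummaryState:OFFLINE"
```

After each action, run getState again and re-evaluate, since each command only moves the bridge one step along the sequence.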
Contingency¶
If the procedure was not successful, report the issue on the #summit-announce channel and/or activate the Out of hours support.
This procedure was last modified Nov 19, 2024.