Fault Reporting

Reporting telescope and observatory faults, whether they are mechanical errors, software bugs, or facilities issues, is a crucial aspect of observatory operations. Understanding the observatory and its efficiency begins with robust fault reporting, documented recovery, and knowledge sharing. This section describes how to file a fault report in the Observing Operations (OBS) Jira project for any incident that happens during nighttime operations.

The OBS Jira project workflow and its process management are described in the Fault Handling Workflow section.

Guidelines For Productive Reporting

The most important part of fault reporting is making the problem easy for the team to understand. Some guidelines to keep in mind:

  • Facts first. The author of the fault report should provide as many details as possible, including screenshots, telescope telemetry, and timestamps for future investigation.

  • Leave the ticket unassigned. It will be triaged by the triage team, whose membership includes T&S and SIT-Com members. If observers think specific people need to be aware of the ticket (e.g. prospective assignees), those people should be @-mentioned in a comment.

  • Ideas are welcome, but let the facts speak first.

  • Report a problem, but do not assign blame. No one person needs to be identified unless they have a contribution or may offer a solution. Identifying the problem and reporting it effectively ensures that the Rubin team will move forward with a solution.

Filing Fault Reports

Fault reports and their resolutions are tracked in the OBS Jira project. In many cases, a Jira ticket already exists for a given problem, and the main action is to log another occurrence. Do this by adding a comment to the ticket, not by editing the ticket description.

When creating a ticket, make sure to fill in the following fields:

../_images/Fault_report_example_page_1.png

Screenshot of an example fault report.

  • Project: Select the OBS project, which covers everything affecting nighttime operations.

  • Issue type: If unsure, select “Problem”.
    • Problem: usually a hardware issue.

    • Bug: typically a software issue.

    • Improvement: a suggestion for improving a procedure, software, or anything else.

    • Information: alerts the team to a new behavior. It does not immediately impact operations, but records a change that was noticed.

  • Summary: Describe the problem in one phrase. Be as clear and succinct as possible.

  • Urgent: IMPORTANT. This field is crucial for allocating time to solve a problem. If the issue results in a significant loss of telescope efficiency, the ticket should be marked as urgent. This includes issues that affect nighttime observing or data collection, or anything that endangers equipment. Toggle this flag and alert the team as soon as possible.

  • Time lost (hr): Report the time lost in decimal hours, to a precision of 0.1 hr. More details about calculating time lost due to a fault are in the Guidelines For Calculating Time Loss section.

  • Components: Select the correct component as accurately as possible (e.g. facilities: vent gates, AuxTel). If the component does not exist, contact Alysha Shugart, who will add it to the list.

  • Description: Provide details and a timeline as accurately as possible to help people more efficiently search telemetry logs for diagnosis. Include a timestamp of the occurrence as well as the salIndex of the script (if applicable). The traceback should be added as well (if applicable). Tracebacks are best copy/pasted into the ticket rather than using a screenshot so the error is searchable.

../_images/Fault_report_example_page_2.png

Continuing fields of an example fault report.

  • Assignee: The reporter should leave the ticket unassigned. If you are certain who the correct person is to follow up on the fault report, add that person as a watcher on the ticket. A team will review the fault reports after the night is over and determine the best person or group for follow-up.

  • Primary Software Component: This is not a required field, but it may provide more information about the components involved.

  • Primary Hardware Component: If you are not sure what hardware was affected or the root cause, use “Other”.

  • Attachment: Upload any screenshots, images, or files to support the facts reported or to help the problem-solving effort.
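
The field list above can be turned into a quick pre-filing checklist. The following is a minimal sketch only: `FaultReportDraft` and `check_draft` are hypothetical names invented for illustration, the sketch does not talk to Jira, and the rules it checks are simply the ones stated in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

# Issue types listed in this section.
ISSUE_TYPES = ("Problem", "Bug", "Improvement", "Information")


@dataclass
class FaultReportDraft:
    """Hypothetical container for the fields described above (not a Jira API)."""
    summary: str
    description: str
    components: List[str]
    issue_type: str = "Problem"      # the type to select when unsure
    urgent: bool = False
    time_lost_hr: float = 0.0
    project: str = "OBS"
    assignee: Optional[str] = None   # reporters leave the ticket unassigned


def check_draft(draft: FaultReportDraft) -> List[str]:
    """Return a list of problems with the draft; an empty list means none found."""
    problems = []
    if draft.project != "OBS":
        problems.append("Project should be OBS.")
    if draft.issue_type not in ISSUE_TYPES:
        problems.append("Unknown issue type: " + draft.issue_type)
    if not draft.summary.strip():
        problems.append("Summary is empty.")
    if not draft.description.strip():
        problems.append("Description is empty.")
    if not draft.components:
        problems.append("No component selected.")
    if draft.assignee is not None:
        problems.append("Leave the ticket unassigned.")
    # Time lost is reported in decimal hours with 0.1 hr precision.
    if draft.time_lost_hr < 0 or round(draft.time_lost_hr, 1) != draft.time_lost_hr:
        problems.append("Time lost should be a non-negative multiple of 0.1 hr.")
    return problems


draft = FaultReportDraft(
    summary="Example summary of the fault",
    description="Timestamp, salIndex (if applicable), and pasted traceback go here.",
    components=["AuxTel"],
    time_lost_hr=0.5,
)
print(check_draft(draft))
```

A checklist like this only catches missing or malformed fields. It cannot judge whether the description gives enough detail for diagnosis.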

Guidelines For Calculating Time Loss

  • If the problem can be troubleshot while taking images on sky, or while proceeding with another engineering task, that time does not count as time lost to the fault.

  • If a fault occurs while the observatory is closed due to bad weather, or outside the window between evening and morning 12-degree twilight, the time lost should be reported as 0. Time loss starts accumulating only when potential on-sky science time begins.

  • In the event the amount of time lost is not well understood, it is better to provide an overestimate than an underestimate.
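
As a worked illustration of the guidelines above, the sketch below clips a fault interval to the night window and subtracts periods that do not count. The function name `time_lost_hours` and its parameters are hypothetical, the 12-degree twilight times are assumed to be supplied by the caller, and rounding to the nearest 0.1 hr is an assumption of this sketch rather than a documented rule.

```python
from datetime import datetime, timedelta


def time_lost_hours(fault_start, fault_end, night_start, night_end, excluded=()):
    """Return time lost to a fault in decimal hours, rounded to 0.1 hr.

    night_start / night_end: evening and morning 12-degree twilight
        (inputs here; how they are obtained is outside this sketch).
    excluded: (start, end) intervals that do not count, e.g. closed for bad
        weather, or observing continued during troubleshooting. They are
        assumed not to overlap each other.
    """
    # Only time inside the potential on-sky science window counts.
    start = max(fault_start, night_start)
    end = min(fault_end, night_end)
    if end <= start:
        return 0.0
    lost = end - start
    # Remove the parts of the fault that overlap an excluded interval.
    for ex_start, ex_end in excluded:
        overlap = min(end, ex_end) - max(start, ex_start)
        if overlap > timedelta(0):
            lost -= overlap
    return round(lost.total_seconds() / 3600, 1)


# Illustrative times only.
night_start = datetime(2024, 11, 28, 0, 30)
night_end = datetime(2024, 11, 28, 8, 30)
weather = [(datetime(2024, 11, 28, 2, 30), datetime(2024, 11, 28, 3, 30))]

# A two-hour fault, one hour of which fell during a weather closure.
print(time_lost_hours(datetime(2024, 11, 28, 2, 0), datetime(2024, 11, 28, 4, 0),
                      night_start, night_end, excluded=weather))  # 1.0
```

When the start or end of a fault is uncertain, the guideline to prefer an overestimate applies to the inputs: choose the earlier plausible start and the later plausible end before computing.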

Filling Out Night Logs

More details about writing night logs are provided on the Nighttime Logging page. For fault reports filed during the night, it is important that the observer lists all the problems that occurred in the fault report section of the night log. This gives the faults higher visibility and allows the total time lost to faults to be calculated at the end of the observing night.

../_images/Night_log_fault_reports_list.png

List of all the fault reports that happened during the night for the night log.
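
The end-of-night total is a simple sum over the night's fault reports. A minimal sketch, assuming each fault is available as a (ticket key, summary, time lost) tuple; the function name, the output layout, and the ticket keys are hypothetical and do not reflect the actual night log format.

```python
def night_log_fault_section(faults):
    """Format a list of (ticket_key, summary, time_lost_hr) tuples with a total."""
    lines = [f"{key}: {summary} ({lost:.1f} hr)" for key, summary, lost in faults]
    # Round the sum to 0.1 hr to avoid floating-point noise in the total.
    total = round(sum(lost for _, _, lost in faults), 1)
    lines.append(f"Total time lost to faults: {total:.1f} hr")
    return "\n".join(lines)


# Placeholder ticket keys and summaries, for illustration only.
print(night_log_fault_section([
    ("OBS-0001", "Example hardware fault", 0.3),
    ("OBS-0002", "Example software fault", 0.2),
]))
```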

This procedure was last modified Nov 28, 2024.