Version: 2.9.0

Executing Failover - DR Assistant

Introduction

Executing a failover is a critical step in disaster recovery that requires precision, careful planning, and a clear understanding of the process involved. Once you have configured your failover type, the next phase is the actual execution, which ensures continuity of service and data integrity. This guide contains detailed steps to perform a failover using Eyeglass DR Assistant, highlighting the key actions needed to maintain control over your environment during a disaster recovery event.

Controlled vs Uncontrolled Failover

Before starting any failover procedure, it's important to understand the distinction between controlled and uncontrolled failover, as well as the potential data protection implications associated with each. In Eyeglass DR Assistant, selecting the appropriate failover method directly affects data integrity, recovery efforts, and the overall success of the disaster recovery process.

Using controlled failover ensures that both clusters are synced and that no data is lost. This should always be the first option if the Eyeglass VM maintains connectivity to both clusters. On the other hand, uncontrolled failover comes with significant risks, such as immediate data loss and the need for manual intervention during recovery. This method should only be used in critical scenarios where connectivity to the source cluster is lost and there is no other option.

Understanding these implications is essential to avoid data corruption, ensure a smooth recovery, and minimize potential downtime. Therefore, before initiating any actions, you must carefully assess which failover option applies to your situation.
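
The decision rule described above can be sketched in a few lines. This is only an illustration of the reasoning, not part of Eyeglass: the function name and the `eyeglass_reaches_source` flag are hypothetical, and the flag stands for what an operator reads from the reachability status in Eyeglass.

```python
from enum import Enum


class FailoverType(Enum):
    CONTROLLED = "controlled"
    UNCONTROLLED = "uncontrolled"


def choose_failover_type(eyeglass_reaches_source: bool) -> FailoverType:
    """Pick the failover type from source-cluster reachability alone.

    Hypothetical helper for illustration. Data access problems on the
    cluster do not change the answer: while Eyeglass can still reach the
    source cluster, controlled failover is the option to use.
    """
    if eyeglass_reaches_source:
        return FailoverType.CONTROLLED
    # Last resort: the source cluster is unreachable from the Eyeglass VM.
    return FailoverType.UNCONTROLLED
```

The sketch deliberately takes a single input, because the guidance above ties the choice to reachability of the source cluster and to nothing else.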

warning

Using uncontrolled failover means you are failing away from the data and will lose ALL changes at the moment the failover starts in Eyeglass.

Use this option only when the Eyeglass VM cannot reach the source cluster.

Even if there are data access issues with PowerScale OneFS, as long as Eyeglass shows green reachability on the Continuous Op Dashboard, do not use uncontrolled failover; use controlled failover instead.

Recovery from uncontrolled failover is the customer's responsibility and is not covered by the support contract.

This will involve coordination with all vendors related to the equipment in the customer's data center, as well as receiving approval from all relevant parties (e.g., PowerScale OneFS, AD, DNS, other applications using PowerScale OneFS services, physical infrastructure such as power and networking WAN links) before resuming operations.

DO NOT bring the cluster online without planning. Resync preparation does not run automatically in this mode, meaning both clusters will be writable. You should disconnect the source cluster and carefully plan a controlled recovery from the uncontrolled failover.

Reasons you might need to execute an uncontrolled failover include the following:

  • A WAN link to the data center is severed, with a lengthy repair time expected to restore service.
  • Extended power loss at the production data center.
  • A damaged cluster or a significant issue during an upgrade.
  • Equipment failure preventing access to the cluster, or application server failures with prolonged recovery times.
  • A network failure that blocks users from accessing storage and also affects the PowerScale OneFS management network.

Pre-Failover Check

  • Do not make any changes to SyncIQ Policies or Eyeglass Configuration Replication Jobs during failover, as this can lead to unexpected results.
  • Eyeglass Assisted Failover has a 45-minute timeout for each failover step. If any step is not completed within this period, the failover will fail. This can occur if SyncIQ policies are already running or if the SyncIQ steps take longer than expected to finish. While the timeout duration can be adjusted, lowering it does not speed up the failover process.
  • If configuration data (such as shares, exports, or quotas) is deleted or modified on the target cluster—especially Share names, NFS Alias names, or NFS Export paths—without running Eyeglass Configuration Replication, these changes may cause the source cluster to delete the object after failover. To prevent this, run Eyeglass configuration replication before failover.
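
As a planning aid for the 45-minute step timeout mentioned above, the following minimal sketch flags steps whose estimated duration would exceed it. The step names and durations are hypothetical placeholders, and the function is not an Eyeglass API; realistic estimates would have to come from your own SyncIQ job history.

```python
from datetime import timedelta

# Default per-step timeout described in the pre-failover checks.
STEP_TIMEOUT = timedelta(minutes=45)


def steps_exceeding_timeout(estimates, timeout=STEP_TIMEOUT):
    """Return the names of steps whose estimated duration exceeds the timeout."""
    return [name for name, duration in estimates.items() if duration > timeout]


# Hypothetical estimates, for illustration only.
estimates = {
    "final data sync": timedelta(minutes=50),
    "resync prep": timedelta(minutes=10),
}
print(steps_exceeding_timeout(estimates))
```

If a step is flagged, the options are to shorten it (for example, by letting running SyncIQ jobs finish before starting) or to raise the timeout. Lowering the timeout never makes a step finish sooner.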

How to Failover Data with DR Assistant

Preventing Client Access During Failover

To ensure data integrity during a failover, it is crucial to prevent client access to the Failover Source cluster.

Use the SMB Data Integrity option to disconnect user sessions on shares that will fail over, and unmount NFS shares to prevent client access.

  1. Open the DR Assistant

  2. Choose failover settings

    In the Failover Wizard tab, select your source cluster, failover type, and failover options, outlined in the table below.

    1. Select the source cluster.
    2. Select the failover type.
    3. Leave the failover mode set to failover/failback.
    4. Leave all default check boxes for a planned controlled failover, or read the options below to make changes:

    Failover Options:

    Failover Option | Description
    Controlled Failover | Check if the source cluster is healthy and reachable. Uncheck ONLY for a real DR event. Uncontrolled failover skips API calls and assumes the source is destroyed.
    Data Sync | Runs a final SyncIQ data sync job during failover (Recommended).
    Config Sync | Syncs shares, exports, and NFS aliases during failover (Disabled in versions > 2.5.6).
    SMB Data Integrity Failover | Disconnects active SMB sessions and blocks new sessions to protect data integrity during failover.
    Quota Sync | Syncs quotas to the target cluster or skips syncing to improve failover performance if there are many quotas.
    Block Failover on Warning | Prevents failover if warnings are detected in the DR Dashboard (Recommended to leave enabled).
    Quota Domain Conflict Check | Overrides validation for quotas with pending scans that could block failover (Recommended to run quota scan first).
    SyncIQ Resync Prep | Prepares SyncIQ policies for failover and failback (Recommended to leave enabled).
    Disable SyncIQ Jobs on Target | Disables SyncIQ jobs on the target cluster post-failover (Recommended to leave enabled for automated failback; manual steps required if disabled).
    Rollback SMB Shares on Failure | Automatically rolls back SMB share renames if a failure occurs during failover (Recommended to leave enabled).

    Use uncontrolled failover only if it is a real DR event. To select it, uncheck "controlled failover" in the failover options. In this case, source cluster API calls are skipped and cached knowledge of shares and quotas is used to fail over. In uncontrolled failover, Eyeglass assumes the source cluster has been destroyed, so none of the steps that make failback possible are executed.

    warning

    Do not use uncontrolled failover unless you are lab testing, or prepared for manual steps to recover from the resulting end state.

    Recovery from an uncontrolled failover is a customer's responsibility, and is not covered by the support contract.

    All recovery is manual if this option is used.

    note

    Eyeglass Configuration Replication jobs will be in a 'USERDISABLED' state after an uncontrolled failover.

    Click Next after selecting the failover options.

  3. Review and confirm that you have read all material regarding the support process and customer responsibilities.

  4. Verify domain mark steps have been completed.

  5. Select the policy or policies, Access Zone, or IP pool for the failover type selected.

    important

    Check readiness again before continuing! Make sure you understand the warnings and whether they will block failover. In general, warnings will not block failover. Errors block failovers.

  6. Validate the failover configuration

    Click next to proceed, and Eyeglass will run a validation check. If you receive an error, you will need to review and address it.

    Review the validations and acknowledge the necessary conditions to proceed. Ensure all conditions are met before continuing with the failover process. Failing to read any accompanying documents or to address warnings could result in data loss.

  7. Review the final summary

    warning

    This is the point of no return. Be sure you are ready for failover before proceeding.

    Once started, the failover steps can be cancelled, but the resulting recovery steps will be manual.

  8. Start the failover

    Read and acknowledge any conditions to initiate failover.

    Select "Run Failover" to begin the failover job.

    Cancelling Failover

    Use this only if directed by support.

    Cancelling a failover requires manual recovery of networking policy state, shares, SPN, and SmartConnect. Support is unable to assist with recovery from intentionally cancelling a failover.

    Failover jobs can be cancelled by clicking the 'cancel job' link provided in the running failover job table.

  9. Monitor the failover job progress.

    Navigate to the Running Failovers tab to see the failovers currently in progress. Click Logs, then click Watch to follow the failover in real time, or click Fetch to update the log window with the current progress.

  10. Test client data access for failover success or failure.

  11. Download completed failover logs.

    To review and download failover logs, follow the steps below:

    1. Click on the "Failover History" tab in the DR Assistant interface. This tab provides a comprehensive list of past failover events.

    2. Accessing Failover Logs:

      • Locate the desired failover entry in the history list.
      • Under the "Failover Logs View/Save" column, click on the "Open" link corresponding to the selected failover event. This action will open the failover log details.
    3. Downloading SyncIQ Reports:

      • In addition to the failover logs, SyncIQ reports can also be accessed. Click on the "Open" link under the "SyncIQ Reports View/Save" column for the selected event.
      • These reports are crucial for understanding the specifics of SyncIQ steps, especially if a SyncIQ step has failed.
    4. Using SyncIQ Logs for PowerScale OneFS Support:

      • If there are issues with the SyncIQ steps, the logs can provide detailed information.
      • These logs can be downloaded and shared with PowerScale OneFS support when opening a support request (Service Request - SR) to expedite the troubleshooting process.
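
The readiness rule from step 5 (errors block failover, warnings generally do not) and the "Block Failover on Warning" option from step 2 can be pictured as a small gate. This is a sketch under assumptions, not Eyeglass behavior: the function and its parameters are hypothetical, and the way the option interacts with readiness warnings is inferred from the option description only.

```python
def failover_is_blocked(errors, warnings, block_on_warning):
    """Return True when failover should not proceed.

    Hypothetical model: errors always block; warnings block only when
    the block-on-warning choice is enabled (an assumption, see above).
    """
    if errors:
        return True
    return bool(block_on_warning and warnings)


# A warning alone blocks only when the option is enabled.
print(failover_is_blocked(errors=[], warnings=["example warning"], block_on_warning=True))
print(failover_is_blocked(errors=[], warnings=["example warning"], block_on_warning=False))
```

Treat this as a way to reason about readiness before opening the wizard; the validation results shown by Eyeglass remain the authority on whether a failover can start.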

This concludes the procedure for executing a failover with DR Assistant.


Next Steps

After the failover process, it’s important to check that your environment is functioning correctly and that all data is in the right place. Verifying that everything matches your disaster recovery plan helps ensure data integrity and keeps your operations running smoothly.

See the Post-Failover Steps documentation for more information.