Failover Design
Failover Types on PowerScale
- SyncIQ Policy
- Access Zone
- IP Pool
- DFS
What is Policy Failover?
Policy Failover is a tool for assessing and managing disaster recovery readiness for data configurations. It provides a detailed overview of replication job statuses so that issues can be identified and resolved quickly, which helps ensure that each SyncIQ policy is prepared for a successful failover. The tool also offers guidance on reviewing failover logs, identifying failed steps, and executing recovery processes, helping to minimize downtime and maintain data integrity during disaster recovery.
Which factors are involved?
When setting up Policy Failover in Superna Eyeglass, address the following factors to ensure a smooth transition:
- Cluster Compatibility: Ensure that the PowerScale OneFS clusters involved are compatible with Superna Eyeglass to avoid configuration errors.
- SyncIQ Policy Status: The SyncIQ policy and its corresponding replication job must be enabled for the failover process to proceed.
- Network Setup: The target cluster must be accessible over the network, with all necessary ports open.
- SPN Updates: Manually update Service Principal Names (SPNs) for direct mount shares to ensure proper authentication post-failover.
- Replication Topology: Avoid using the same share names for different SyncIQ policies on the source and target clusters to prevent failover issues.
- SmartConnect Zones: Use dedicated SmartConnect Zones for each SyncIQ policy to simplify network updates after a failover.
For more details, please refer to the Policy Failover Configuration Prerequisites.
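The share-name rule in the list above can be checked before a failover. The following is a minimal sketch, assuming share names have already been collected per SyncIQ policy into a plain dictionary. The function name `find_share_name_collisions` and the sample policy and share names are hypothetical; nothing here calls Eyeglass or OneFS.

```python
from collections import defaultdict


def find_share_name_collisions(shares_by_policy):
    """Return {share_name: sorted policy names} for names used by more than one policy.

    shares_by_policy is a hypothetical input: a dict that maps a SyncIQ policy
    name to the list of share names found under that policy's path.
    """
    policies_by_share = defaultdict(set)
    for policy, shares in shares_by_policy.items():
        for share in shares:
            # SMB share names are case-insensitive, so compare in lower case.
            policies_by_share[share.lower()].add(policy)
    return {
        share: sorted(policies)
        for share, policies in policies_by_share.items()
        if len(policies) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory used only for illustration.
    inventory = {
        "policy-finance": ["Reports", "Invoices"],
        "policy-hr": ["reports", "Payroll"],
    }
    for share, policies in find_share_name_collisions(inventory).items():
        print(f"share '{share}' is used by: {', '.join(policies)}")
```

An empty result means no share name is reused across the policies that were supplied; it says nothing about policies that were left out of the input.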
How is Policy Failover executed?
Policy Failover is a structured process executed through Eyeglass and involves the following key steps:
- Failover Initiation: The process begins by initiating the failover from Eyeglass.
- Final SyncIQ Run: Before the transition, the SyncIQ policy runs one last time on the source cluster to ensure all data is fully synchronized.
- Making the Target Path Writable: The SyncIQ policy on the target cluster is updated to make the target path writable.
- Resync Preparation: Eyeglass prepares a mirror SyncIQ policy on the target cluster and runs it. This resync process captures any remaining data changes, ensuring the target cluster has the most up-to-date data.
- Quota Management: During this step, quotas on the source cluster are deleted and recreated on the target cluster.
- Post-Failover Manual Steps: After the failover, some manual steps are required. SMB shares and NFS exports/aliases must be remounted on a SmartConnect Zone Name from the target cluster.
The process is largely automated by Eyeglass, but careful attention to the manual steps, particularly around SPN management and SmartConnect configuration, is necessary for a successful failover.
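The order of the steps above matters, because each step assumes the previous one succeeded. A minimal sketch of that sequencing follows, assuming each automated step can be represented as a callable that returns `True` on success. The step names mirror the list above; `run_failover_steps` and the placeholder actions are hypothetical and do not call Eyeglass.

```python
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover_steps(steps: List[Step]) -> Optional[str]:
    """Run steps in order and stop at the first failure.

    Returns the name of the failed step, or None if every step succeeded.
    """
    for name, action in steps:
        if not action():
            return name
    return None


if __name__ == "__main__":
    # Placeholder actions (hypothetical): each one simply reports success.
    steps = [
        ("Final SyncIQ run", lambda: True),
        ("Make target path writable", lambda: True),
        ("Resync prep (mirror policy)", lambda: True),
        ("Quota failover", lambda: True),
    ]
    failed = run_failover_steps(steps)
    print("all steps succeeded" if failed is None else f"failed at: {failed}")
```

Returning the name of the first failed step matches the way a failover log is reviewed: find the step that failed, then decide how to recover from that point.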
What are the results of Policy Failover?
The successful execution of a Policy Failover leads to several key outcomes that ensure business continuity and data integrity:
- Seamless Data Access: After the failover, clients can continue to access their data from the target cluster without interruption. The remounting of SMB shares and NFS exports ensures that all connections are correctly redirected to the new cluster.
- Data Consistency: The final SyncIQ run before failover ensures that all data is fully synchronized between the source and target clusters.
- Maintained Security and Authentication: Proper management of Service Principal Names (SPNs) and DNS records during the failover process ensures that all authentication protocols remain intact.
- Optimized Resource Allocation: By moving active data operations to the target cluster, the load is balanced across the infrastructure, optimizing resource utilization.
- Restored Quotas: Quotas are reestablished on the target cluster, maintaining the same data access controls that were in place on the source cluster.
- Granular Control: The process allows for granular control over which SyncIQ Policies are failed over, providing flexibility in disaster recovery scenarios.
- Manual Adjustments: While much of the failover process is automated, manual steps such as SmartConnect and SPN management are still required.
What is Access Zone Failover?
Access Zone Failover is a tool for disaster recovery in your environment. It operates on a per-access-zone basis, managing the failover of SyncIQ policies, Service Principal Name (SPN) delegations, and DNS delegations as a cohesive unit. The process involves remounting SMB shares and NFS exports, and it is particularly useful when SmartConnect Zone Names must be maintained after failover.
Which factors are involved?
A successful Access Zone failover depends on the following factors:
- Configuration Compliance
  - Ensure that all shares, exports, and aliases are correctly set within the Access Zone to be failed over. Misalignment may lead to data access outages.
  - Set up Service Principal Name (SPN) delegation for SMB shares to avoid SMB client authentication issues.
- SyncIQ Policy Readiness
  - The last run of the SyncIQ policy should be successful, and policies should not be in a paused or canceled state.
  - Avoid using exclusions or inclusions in SyncIQ policies, as they are not supported for failback.
- Network and Node Management
  - Restrict source nodes in SyncIQ policies to manage bandwidth and ensure specific nodes replicate data.
  - Ensure that no critical SyncIQ policies are disabled, as these will be skipped during the failover.
- System and Process Integrity
  - Confirm that Eyeglass configuration replication jobs for SyncIQ policies have completed without error.
  - Continuously monitor the Access Zones Readiness section in the DR Dashboard.
- Manual Intervention Risks
  - Investigate and understand any errors in pre-failover checks to prevent issues during the failover process.
For more details, please refer to the section Recommendations for Access Zone Failover.
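The SyncIQ readiness factors above lend themselves to a simple pre-failover check. The sketch below assumes the state of each policy has already been gathered into a small record. The `PolicyState` fields, the state labels such as `"finished"`, and the function `readiness_issues` are hypothetical and do not reflect the Eyeglass or OneFS data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyState:
    """Hypothetical snapshot of one SyncIQ policy."""
    name: str
    enabled: bool
    last_run_state: str  # assumed labels: "finished", "failed", "paused", "canceled"
    has_includes_or_excludes: bool


def readiness_issues(policy: PolicyState) -> List[str]:
    """Return the readiness problems found for one policy (empty list if none)."""
    issues = []
    if not policy.enabled:
        issues.append("policy is disabled and would be skipped during failover")
    if policy.last_run_state != "finished":
        issues.append(f"last run state is '{policy.last_run_state}', expected 'finished'")
    if policy.has_includes_or_excludes:
        issues.append("inclusions/exclusions are not supported for failback")
    return issues


if __name__ == "__main__":
    policies = [
        PolicyState("zone1-data", True, "finished", False),
        PolicyState("zone1-archive", False, "paused", True),
    ]
    for policy in policies:
        problems = readiness_issues(policy)
        status = "ready" if not problems else "; ".join(problems)
        print(f"{policy.name}: {status}")
```

A check like this complements the Access Zones Readiness section in the DR Dashboard and is not a substitute for it.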
How is Access Zone Failover executed?
The Access Zone failover process minimizes disruption and maintains data integrity throughout the failover. The following steps outline how the failover is executed:
- Failover Initiated from Eyeglass (Manual)
  - The failover process begins manually when initiated through Eyeglass, setting the stage for the automated steps to follow.
- SyncIQ Policy Run One Last Time (Automatic)
  - The SyncIQ policies on the source cluster are run one final time to ensure all data is fully synchronized before the failover begins.
- SmartConnect Zone Names/Aliases and SPNs Transferred to Target Cluster (Automatic)
  - SmartConnect Zone names, aliases, and Service Principal Names (SPNs) are transferred from the source cluster to the target cluster.
- HOST SPNs Deleted on Source and Created on Target in Active Directory (Automatic)
  - SPNs associated with the source cluster are deleted and recreated on the target cluster in Active Directory, maintaining secure access for SMB shares.
- SyncIQ Policy Target Path Made Writable (Automatic)
  - The target path for the SyncIQ policy on the target cluster is made writable, while the source path is set to read-only to prevent further modifications.
- Resync Prep Creates Mirror SyncIQ Policy and Runs It (Automatic)
  - A mirror SyncIQ policy is automatically created and run on the target cluster to ensure data consistency across both clusters.
- Quota Failover (Optional)
  - Quotas on the source cluster may be deleted and recreated on the target cluster as part of the failover process, depending on the configuration.
- Post-Failover Manual Steps (Manual)
  - NFS Exports/Aliases Remount: NFS exports and aliases must be manually remounted on the target cluster to restore full access.
  - SMB Share Remounts: SMB share remounts can be avoided by removing network interfaces from the failover network pools on the source cluster. If this step is not performed, manual remounting, rebooting, or logging out/in to Windows will be required to restore access.
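The post-failover manual steps above depend on two things: which protocols clients use, and whether network interfaces were removed from the failover network pools on the source cluster. A minimal sketch of that decision follows; the function name and its parameters are hypothetical and simply restate the rules from the list.

```python
from typing import List


def post_failover_manual_steps(has_nfs_clients: bool,
                               has_smb_clients: bool,
                               source_pool_interfaces_removed: bool) -> List[str]:
    """Return the manual steps an operator still has to perform after failover."""
    steps = []
    if has_nfs_clients:
        # NFS exports/aliases always need a manual remount on the target cluster.
        steps.append("Remount NFS exports/aliases on the target cluster")
    if has_smb_clients and not source_pool_interfaces_removed:
        # SMB remounts are only avoided when interfaces were removed from the
        # failover network pools on the source cluster.
        steps.append("Remount SMB shares, reboot, or log out and back in to Windows")
    return steps


if __name__ == "__main__":
    for step in post_failover_manual_steps(True, True, False):
        print("-", step)
```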
What are the results of Access Zone Failover?
A completed Access Zone failover transfers operations from the Source Cluster to the Target Cluster while maintaining data integrity and minimizing disruption. The main results are:
- Data Synchronization and Integrity
  - All data previously replicated using SyncIQ is now fully synchronized to the Target Cluster, ensuring that no data is lost during the failover process.
- Reestablished Network and Service Connectivity
  - SmartConnect Zone names and aliases have been successfully moved to the Target Cluster. This ensures that network services and client connections are seamlessly redirected to the Target Cluster.
- Updated Active Directory Configurations
  - Service Principal Names (SPNs) for the cluster have been updated in Active Directory. This allows for continued secure access to SMB shares on the Target Cluster, avoiding authentication issues and maintaining security protocols.
- Quota Management and Policy Enforcement
  - Quotas and SyncIQ policies have been recreated and enforced on the Target Cluster, ensuring that storage limits and data protection policies remain consistent and effective in the new environment.
- Minimal Service Disruption
  - While some manual steps may be required post-failover (such as remounting NFS exports or SMB shares), these are minor and typically do not result in significant downtime.
- Preparedness for Further Failback or Recovery
  - The Target Cluster is now fully prepared to handle ongoing operations, and the environment is set up for potential failback or future failover scenarios.
What is IP Pool Failover?
IP Pool Failover is a distinct process from Access Zone Failover. The two are similar, but IP Pool Failover applies at the network level, whereas Access Zone Failover applies at the storage and security level.
Specifically, IP Pool Failover is used to ensure continuous connectivity between clients and clusters if a node or interface fails.
Which factors are involved?
This type of failover involves the following factors on both the source and target PowerScale OneFS clusters:
- The Network Pool
- The Access Zone
- SyncIQ Policies
These factors interact during failover so that the process executes correctly.
Failover requires both automatic and manual interactions, for example:
- Automatic
  - SyncIQ Policy target path is made writable and source path read-only.
- Manual
  - NFS Exports must be remounted.
How is IP Pool Failover executed?
In the simplest terms, the IP Pool Failover process runs as follows:
- Failover is initiated. (Manual)
- SyncIQ Policies are run one last time. (Automatic)
- SmartConnect Zone Names and Aliases are moved to the Target Cluster. (Automatic)
- The HOST SPNs are deleted on the Source Cluster and created on the Target Cluster in AD. (Automatic)
- SyncIQ Policy target path is made writable. (Automatic)
- Resync prep creates mirror SyncIQ policy and runs it. (Automatic)
- Quotas are deleted on the Source Cluster and created on the Target Cluster. (Automatic)
- Post-Failover Steps. (Manual)
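The Manual and Automatic tags in the list above show where an operator has to act. The sketch below encodes the same steps as data and extracts the manual ones. The names `Mode`, `IP_POOL_FAILOVER_STEPS`, and `steps_requiring_operator` are hypothetical, and the step descriptions are taken from the list rather than from any Eyeglass API.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "Manual"
    AUTOMATIC = "Automatic"


# Steps in execution order, as described in the list above.
IP_POOL_FAILOVER_STEPS = [
    ("Failover is initiated", Mode.MANUAL),
    ("SyncIQ policies are run one last time", Mode.AUTOMATIC),
    ("SmartConnect Zone names and aliases are moved to the target cluster", Mode.AUTOMATIC),
    ("HOST SPNs are deleted on the source and created on the target in AD", Mode.AUTOMATIC),
    ("SyncIQ policy target path is made writable", Mode.AUTOMATIC),
    ("Resync prep creates the mirror SyncIQ policy and runs it", Mode.AUTOMATIC),
    ("Quotas are deleted on the source and created on the target", Mode.AUTOMATIC),
    ("Post-failover steps", Mode.MANUAL),
]


def steps_requiring_operator(steps):
    """Return (position, description) for every manual step, numbered from 1."""
    return [
        (position, description)
        for position, (description, mode) in enumerate(steps, start=1)
        if mode is Mode.MANUAL
    ]


if __name__ == "__main__":
    for position, description in steps_requiring_operator(IP_POOL_FAILOVER_STEPS):
        print(f"step {position}: {description}")
```

Only the first and last steps involve an operator; everything in between is tagged as automatic in the list.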
What are the results of IP Pool Failover?
Once IP Pool Failover is complete, you will find that:
- SMB Shares and NFS exports/aliases are replicated to the Target Cluster.
- Mirror SyncIQ Policies are created and have been run.
- Each Source Pool's SmartConnect Name is mapped with an igls- prefix wherever applicable, be it igls-original- or igls-ignore-.
- Clients are able to connect to the Failed Over IP Pools without issue.
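When reviewing the results, it can help to group SmartConnect names by the `igls-` prefix they carry. The sketch below only recognizes the two prefixes named in the list above; it makes no assumption about what each prefix means operationally, which this section does not define. The function names and the sample host names are hypothetical.

```python
from collections import defaultdict

# The two prefixes mentioned in the list above.
IGLS_PREFIXES = ("igls-original-", "igls-ignore-")


def igls_prefix(smartconnect_name):
    """Return the igls- prefix carried by a name, or None if it has neither."""
    lowered = smartconnect_name.lower()  # DNS names are case-insensitive
    for prefix in IGLS_PREFIXES:
        if lowered.startswith(prefix):
            return prefix
    return None


def group_by_igls_prefix(names):
    """Group names by prefix; names without a prefix are stored under None."""
    groups = defaultdict(list)
    for name in names:
        groups[igls_prefix(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical SmartConnect names used only for illustration.
    sample = [
        "igls-original-data.example.com",
        "igls-ignore-mgmt.example.com",
        "data.example.com",
    ]
    for prefix, members in group_by_igls_prefix(sample).items():
        print(prefix or "(no igls- prefix)", "->", members)
```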
What is DFS Failover?
DFS Failover is the process of automatically switching data access to a backup location during a disaster. Superna Eyeglass for PowerScale OneFS supports this by simplifying the configuration and management of DFS failover.
How DFS Failover Works
The Eyeglass DFS solution simplifies disaster recovery (DR) for PowerScale OneFS by maintaining DFS targets (UNC paths) that point to both the source and destination clusters.
Failover and failback operations are initiated directly from Eyeglass, automatically moving configuration data to the writable copy of the UNC target. By grouping shares based on SyncIQ policies, Eyeglass ensures that any newly added shares on PowerScale OneFS are automatically protected. Quotas are also detected and protected without manual intervention.
For optimal availability, Domain-based DFS namespaces are recommended over server-based DFS roots. Domain-based DFS provides a more reliable and resilient solution for clients.
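The idea of a DFS folder with two UNC targets, only one of which is writable at a time, can be modeled in a few lines. This is an illustrative sketch, not how Eyeglass or DFS is implemented: `DfsTarget`, `active_target`, `fail_over`, and the UNC paths are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class DfsTarget:
    """Hypothetical model of one UNC target behind a DFS folder."""
    unc_path: str
    writable: bool


def active_target(targets: Tuple[DfsTarget, ...]) -> DfsTarget:
    """Return the single writable target; anything else is treated as an error."""
    writable = [target for target in targets if target.writable]
    if len(writable) != 1:
        raise ValueError(f"expected exactly one writable target, found {len(writable)}")
    return writable[0]


def fail_over(targets: Tuple[DfsTarget, ...]) -> Tuple[DfsTarget, ...]:
    """Swap roles: the writable copy becomes read-only and vice versa."""
    return tuple(replace(target, writable=not target.writable) for target in targets)


if __name__ == "__main__":
    targets = (
        DfsTarget(r"\\source-cluster\projects", writable=True),
        DfsTarget(r"\\target-cluster\projects", writable=False),
    )
    print("before:", active_target(targets).unc_path)
    print("after: ", active_target(fail_over(targets)).unc_path)
```

Calling `fail_over` a second time models a failback, since the roles simply swap again.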
What are the results of DFS Failover?
- Automatic Switching of Data Access: The UNC target path automatically switches to the writable copy, ensuring continuous access to data without manual intervention.
- Protected Configuration Data: The configuration data for DFS targets is moved to the destination cluster, ensuring that all DFS target paths are updated to point to the correct, accessible location.
- Protection of Newly Added Shares: Any shares added to PowerScale OneFS after the initial SyncIQ configuration are automatically included in the failover process, ensuring that no new data is left unprotected.
- Quota Management: Quotas applied to the shares are also automatically detected and protected, meaning that the quota settings are maintained during and after the failover.
- Simplified Disaster Recovery: Overall, the DFS failover process is streamlined, with Eyeglass handling the complexity of failover, making disaster recovery operations more efficient and less error-prone.