Failover Recovery
Failover Log Analysis
Steps for Reading a Failover Log
- The Failover Logs can be found by clicking on the DR Assistant icon on the desktop of the Eyeglass Web UI.
- Click the Failover History tab to see the jobs that have been run, their run dates, and their results. (Clicking an individual job displays its details in the lower half of the window.)
- Identify the section in the log with an error message by expanding folders to locate the red "X".
- Determine which step failed and refer to the appropriate table (from this guide) for the next steps.
- Match the scenario in the failover log to the corresponding scenario in the table to identify the issue and resolution steps.
Additional Reporting for SyncIQ Job Errors
In release 1.8 and later, SyncIQ job reports are collected in a separate log to simplify troubleshooting. This log includes:
- Run Report: Provides details of the executed jobs.
- Resync Prep Report: Tracks preparation steps for synchronization.
- Resync Prep Domain Mark Report: Captures domain-specific preparations.
If the root cause of a failure is identified as a SyncIQ policy error that cannot be recovered or retried, this log can be provided to Support for faster resolution and escalation with EMC.
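If you need to cross-check a failed policy against the cluster's own records, the SyncIQ reports can also be read directly on PowerScale OneFS. The commands below are a general example of the OneFS CLI; exact flags vary by OneFS release, and any names shown are placeholders.
    isi sync policies list     (confirm the policy names referenced in the Eyeglass log)
    isi sync reports list      (list recent SyncIQ job reports for those policies)
Attach the relevant reports when escalating an unrecoverable SyncIQ policy error to Support.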
Replication Policy Failover Preparation
Below are the failover preparation steps that can fail or time out, along with concise, step-by-step fixes.
Wait for Other Failover Jobs to Complete
Impact on Failover
- Eyeglass allows only one failover job at a time.
- If another job is already running, your new failover won’t start.
Recovery Steps
- Check Running Jobs
- Go to the Eyeglass “running jobs” window. Confirm any existing failover jobs are completed or canceled.
- Wait or Cancel
- If a job is still in progress, allow it to finish or manually stop it if appropriate.
- Restart Failover
- Once no failover jobs are running, launch the new failover job again.
- Time‐Out: This step can remain in the “running” state for up to two hours before timing out.
- Data Loss Impact: Not typically applicable here since failover hasn’t started; however, no progress will be made until other jobs finish.
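In addition to the Eyeglass running jobs window, you can confirm on the clusters themselves that no SyncIQ jobs are still active. This is a general example of the OneFS CLI; syntax can vary slightly between OneFS releases.
    isi sync jobs list     (show SyncIQ jobs currently running or paused on the cluster)
Run it on both source and target clusters before restarting the failover.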
SOURCE get POLICY Info
Impact on Failover
- Eyeglass can’t communicate with the source cluster, causing an immediate failover failure.
Recovery Steps
- Check Connectivity
- Verify the network path between Eyeglass and the source cluster. Ensure DNS, IP addresses, and firewalls are correctly configured.
- Fix Communication Errors
- If there are permission or authentication issues, update credentials or correct them in Eyeglass.
- Restart Failover
- Once Eyeglass can reach the source cluster, rerun the failover job.
- Uncontrolled Failover: This step does not run during an emergency failover scenario.
- Data Loss Impact: Failover cannot begin; any delay could risk data currency until this is resolved.
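A quick way to confirm that Eyeglass can reach the source cluster is to test name resolution and the OneFS API port from the Eyeglass appliance shell. The commands below are a generic connectivity check; the addresses are placeholders, and 8080 is the default OneFS API port, which may differ in your environment.
    ping <source-cluster-mgmt-ip>                      (basic network reachability)
    nslookup <source-cluster-fqdn>                     (DNS resolution from the appliance)
    curl -vk https://<source-cluster-mgmt-ip>:8080/    (TLS connection to the OneFS API port)
If any of these fail, correct the network, DNS, firewall, or credential issue before rerunning the failover.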
Wait for Existing Policy Jobs to Complete
Impact on Failover
- If Eyeglass detects other active policy jobs, it waits for them to finish.
- The default timeout is often 180 minutes (3 hours), but this can vary by release or be modified via igls CLI commands.
Recovery Steps
- Confirm No Overlapping Jobs
- Check Eyeglass to ensure no other SyncIQ or replication jobs are running.
- Address Stuck Policies
- If a policy is stuck or returning errors from the cluster, you may need EMC support to resolve the underlying issue before failover can proceed.
- Restart Failover
- Once no conflicting jobs remain, retry the failover.
- Timeout: If the wait exceeds the configured failover timeout, the failover will fail.
- Data Loss Impact: Failover remains blocked, leaving you in a potential data loss scenario until resolved.
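If a policy job is stuck on the cluster and blocking failover, it can be inspected and, where appropriate, cancelled from the OneFS CLI. This is a general example; <policy-name> is a placeholder, and a job should only be cancelled when you are sure it cannot complete.
    isi sync jobs list                     (identify the policy job that is still running)
    isi sync jobs cancel <policy-name>     (cancel the stuck job for that policy)
After the conflicting job clears, retry the failover from Eyeglass.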
SOURCE Remove Schedule POLICY
Impact on Failover
- Eyeglass is unable to remove or adjust the schedule on the source cluster.
- Communication failure prevents the failover from proceeding.
Recovery Steps
- Verify Eyeglass–Source Connectivity
- Confirm network or permission settings allow Eyeglass to manage schedules on the source.
- Manually Remove/Update Schedule
- If needed, log into PowerScale OneFS on the source cluster and remove or modify the relevant SyncIQ schedule.
- Restart Failover
- Retry the failover job once the schedule is successfully removed or updated.
- Uncontrolled Failover: This step does not run during an emergency failover.
- Data Loss Impact: Failover is halted until the schedule is addressed, so data is not actively protected.
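If Eyeglass cannot remove the schedule, it can be cleared manually on the source cluster. The example below uses the OneFS CLI with <policy-name> as a placeholder; the exact flag for clearing a schedule can differ between OneFS versions, so verify it against your release's documentation.
    isi sync policies view <policy-name>                  (check the current schedule)
    isi sync policies modify <policy-name> --schedule ""  (clear the schedule so the policy runs only on demand)
Once the schedule is removed or updated, rerun the failover from Eyeglass.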
Replication Policy Failover (Run All Policies with “SyncIQ Data Sync”)
Impact on Failover
- The final incremental sync of data failed, causing the failover to abort.
- Source and target remain in the initial state (target cluster read‐only).
Recovery Steps
- Identify Which Policies Failed
- Check the Eyeglass Job Details to see which policies ran successfully and which timed out or failed.
- Troubleshoot or Cancel Ongoing Sync
- If a policy job is still running on PowerScale OneFS, wait for it to finish or cancel it if it’s stuck.
- Manually run the policy again to see if it can succeed.
- Open a Support Case & Retry Failover
- If the policy repeatedly fails, open a case with EMC (or relevant vendor).
- Once resolved, restart the failover job.
- Timeout: Eyeglass waits for each policy up to a set timeout. If the incremental sync takes longer than that, the step fails.
- Optional “Data Sync”: If unsynced data is not critical, you can uncheck the “Data Sync” box. This moves forward with failover but any unreplicated data will be lost.
- Uncontrolled Failover: Not run during an emergency failover.
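To test whether a failed policy can now complete its incremental sync, run it manually from the OneFS CLI on the source cluster and review the resulting report. This is a general example with <policy-name> as a placeholder; flags may vary by OneFS release.
    isi sync jobs start <policy-name>     (manually start the policy)
    isi sync jobs list                    (watch the job while it runs)
    isi sync reports list                 (confirm the job completed without errors)
If the manual run succeeds, restart the failover; if it keeps failing, open a support case as described above.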
Run Configuration Replication Now (Config Sync)
Impact on Failover
- The final sync of configuration items (e.g., shares, exports, aliases) has failed.
- Failover continues, but the target cluster remains read‐only until config replication succeeds.
Recovery Steps
- Review the Eyeglass Jobs
- In the Eyeglass interface, switch to the running jobs tab and find the recent config replication job.
- Identify & Fix the Failure Reason
- Use the Job Details to see if there’s a permissions, network, or file conflict issue.
- Correct the problem (re-auth, DNS, etc.).
- Restart Failover
- If config data remains unsynced but is nonessential, you can uncheck “Config Sync” to speed up failover.
- Otherwise, once the issue is fixed, rerun the replication and then proceed with failover.
- Skipping Config Sync: You can uncheck “Config Sync” if source/target configs are already aligned or you accept losing any changes.
- Uncontrolled Failover: Not applicable in an emergency scenario.
- Data Loss Impact: Typically minimal for config items, but the target remains in a partial failover state until replication is resolved.
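Before deciding to skip Config Sync, it can help to confirm how closely the source and target configurations already match. The commands below are a simple, general comparison using the OneFS CLI; the access zone name is a placeholder and flags may vary by release.
    isi smb shares list --zone <access-zone>     (run on both clusters and compare share names)
    isi nfs exports list --zone <access-zone>    (run on both clusters and compare export paths)
If the configurations already match, unchecking "Config Sync" carries little risk; otherwise fix the replication error first.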
Notes
- Eyeglass Timeout Values: Many failover steps rely on timeouts (e.g., 180 minutes). Adjust these if you have large datasets or slower networks.
- Support Calls: Frequent communication or schedule failures may require a vendor (EMC) support case.
- Documentation: Always consult “Best Practices for Failover with Eyeglass” and official PowerScale OneFS guides for deeper command‐level instructions.
DFS Mode
If a DFS share rename fails on the target or source cluster, DFS clients will not switch clusters.
Recovery Steps
- Remove the igls-dfs prefix manually from the target cluster shares that were not renamed (check the failover log). This completes the failover, and clients will switch automatically.
- Add the igls-dfs prefix manually to the source cluster shares that were not renamed (check the failover log). This blocks client access to the source and switches clients to the target.
- Allow writes manually from PowerScale OneFS for the selected failover policies. This applies to release 1.9 and below (see the example commands after this list).
- Run quota jobs related to the failover manually from Eyeglass.
- Run re-sync prep manually from PowerScale OneFS for selected failover policies.
- Apply SyncIQ policy schedule to target cluster policies that failed over.
Releases after 2.0 will run these steps automatically if a share rename fails.
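The manual steps above can be carried out with standard OneFS commands. The example below is a general sketch for releases where Eyeglass does not complete these steps automatically; <policy-name> is a placeholder, the share renames themselves are done from the OneFS WebUI or API, and the recovery syntax may differ on older OneFS versions.
    isi smb shares list | grep igls-dfs            (find shares that still carry the DFS prefix on either cluster)
    isi sync recovery allow-write <policy-name>    (run on the target cluster to allow writes for a failed-over policy)
    isi sync recovery resync-prep <policy-name>    (run on the source cluster to prepare the policy for resync)
Quota jobs and the SyncIQ policy schedule can then be applied from Eyeglass and OneFS as listed above.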
Networking Operations
The following subsections describe what happens if each of these steps fails to complete, the impact on failover, and how to respond.
Rename Source SC (SmartConnect) Zone Names & Aliases
Impact on Failover
- Failover fails during the networking step.
- Auto‐rollback reverts the source and target clusters to their initial states.
- The source file system remains read/write, and SmartConnect zones are returned to their original settings.
Recovery Steps
- Check the Job Details: Determine the cause of the networking failure (e.g., DNS or network misconfiguration).
- Fix & Retry: Correct the error, then rerun failover. If networking is still unstable, only select the SyncIQ portion until you’re sure network issues are resolved.
- Start Fresh: Once networking is stable, choose the required policies and rerun the failover.
Warnings
- Not performed during uncontrolled (emergency) failovers.
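To troubleshoot this networking step, it helps to check which SmartConnect zone names and aliases are currently configured and whether DNS delegation is answering from the expected cluster. These are general examples; pool names and FQDNs are placeholders, and output fields vary by OneFS release.
    isi network pools list --verbose     (show SmartConnect zone names and aliases per IP pool)
    nslookup <smartconnect-zone-fqdn>    (verify which cluster currently answers for the zone)
Once the DNS or network issue is corrected, rerun the failover.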
Modify Source SPNs
Impact on Failover
- SPN failure does not stop the failover; errors go into the log.
- Failover continues, but SPNs may be incorrect.
Recovery Steps
- Review the Failover Log: Identify failing SPNs.
- Manually Fix SPNs: Use ADSIEdit (with domain admin rights) to create/delete SPNs on the source cluster.
- Validate: Ensure the corrected SPNs match what’s required for each SmartConnect zone.
Warnings
- SPN changes on the source are proxied through target cluster ISI commands, so source cluster availability is not impacted by these corrections.
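The documented method is ADSIEdit, but SPNs can also be listed and corrected with the Windows setspn utility from a domain-joined machine with domain admin rights. The commands below are a general example; the cluster machine account and zone names are placeholders.
    setspn -L <DOMAIN>\<CLUSTER-ACCOUNT>                                 (list the SPNs on the cluster's computer account)
    setspn -S HOST/<smartconnect-zone-fqdn> <DOMAIN>\<CLUSTER-ACCOUNT>   (add a missing SPN, checking for duplicates)
    setspn -D HOST/<stale-zone-fqdn> <DOMAIN>\<CLUSTER-ACCOUNT>          (delete a stale SPN)
Verify afterward that the SPNs match the SmartConnect zones that failed over, as described above.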
Rename Target SC (SmartConnect) Zone Names & Aliases
Impact on Failover
- Failover fails during the networking step.
- Rollback restores both clusters to pre‐failover state.
- SmartConnect zones revert to original configuration.
Recovery Steps
- Identify Failure Reason: Check job details to see what caused the networking step to fail.
- Attempt Partial Failover: If networking remains unstable, try just the SyncIQ portion while you troubleshoot the network or DNS.
- Rerun Full Failover: After fixing issues, rerun failover with the necessary policies.
Warnings
- Not performed during uncontrolled failover.
Modify Target SPNs
Impact on Failover
- SPN failure does not stop the failover; it continues.
- Errors are logged, and SPNs may be incorrect afterward.
Recovery Steps
- Review the Failover Log: Locate SPN failures.
- Fix SPNs: Use ADSIEdit on the domain to create/delete SPNs for the target cluster.
- Confirm Changes: Ensure the new SPNs match each SmartConnect zone that failed over.
Warnings
- SPN operations on the target are also proxied through ISI commands, so the cluster’s availability is not affected.
Replication Policy Failover
Below are the most common issues that can occur when failing over SyncIQ policies, along with quick step-by-step fixes. For more advanced scenarios or troubleshooting, refer back to the full documentation or contact Support.
Replication Policy Failover (All Policies)
Impact on Failover
- One or more SyncIQ policies did not successfully complete their failover operation.
- This is the parent task that contains all sub‐policies.
Recovery Steps
- Determine Failure Reason
- Check the job details (policy failover logs) to see which policies failed and why.
- Look for a step called "CLUSTERNAME allow writes POLICY PATH".