Failover Recovery
Introduction
Failover Log Analysis
Steps for Reading a Failover Log
- The Failover Logs can be found by clicking on the DR Assistant icon on the desktop of the Eyeglass Web UI.
- Click the Failover History tab to see the jobs that have been run, their run dates, and their results. (Clicking an individual job displays that job’s details in the lower half of the window.)
- Identify the section in the log with an error message by expanding folders to locate the red "X".
- Determine which step failed and refer to the appropriate table (from this guide) for the next steps.
- Match the scenario in the failover log to the corresponding scenario in the table to identify the issue and resolution steps.
Additional Reporting for SyncIQ Job Errors
In version 1.8 and later, SyncIQ job reports are collected in a separate log to simplify troubleshooting. This log includes:
- Run Report: Provides details of the executed jobs.
- Resync Prep Report: Tracks preparation steps for synchronization.
- Resync Prep Domain Mark Report: Captures domain-specific preparations.
If the root cause of a failure is identified as a SyncIQ policy error that cannot be recovered or retried, this log can be provided to Support for faster resolution and escalation with EMC.
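If it helps to confirm the SyncIQ report details directly on the cluster before engaging Support, the OneFS CLI can list and display the same reports. A minimal sketch, assuming SSH access to the cluster and a hypothetical policy named MyPolicy; the --policy-name filter is available on recent OneFS releases:

    # List recent SyncIQ reports for the policy (MyPolicy is a placeholder)
    isi sync reports list --policy-name=MyPolicy
    # View the full report for a specific job, using the policy name and job ID shown above
    isi sync reports view MyPolicy <job-id>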
Replication Policy Failover Preparation
Below are the failover preparation steps that can fail or time out, along with concise, step-by-step fixes.
Wait for Other Failover Jobs to Complete
Impact on Failover
- Eyeglass allows only one failover job at a time.
- If another job is already running, your new failover won’t start.
Recovery Steps
- Check Running Jobs
- Go to the Eyeglass “running jobs” window. Confirm any existing failover jobs are completed or canceled.
- Wait or Cancel
- If a job is still in progress, allow it to finish or manually stop it if appropriate.
- Restart Failover
- Once no failover jobs are running, launch the new failover job again.
- Time‐Out: This step can remain in the “running” state for up to two hours before timing out.
- Data Loss Impact: Not typically applicable here since failover hasn’t started; however, no progress will be made until other jobs finish.
SOURCE get POLICY Info
Impact on Failover
- Eyeglass can’t communicate with the source cluster, causing an immediate failover failure.
Recovery Steps
- Check Connectivity
- Verify the network path between Eyeglass and the source cluster. Ensure DNS, IP addresses, and firewalls are correctly configured.
- Fix Communication Errors
- If there are permission or authentication issues, update or correct the cluster credentials in Eyeglass.
- Restart Failover
- Once Eyeglass can reach the source cluster, rerun the failover job.
- Uncontrolled Failover: This step does not run during an emergency failover scenario.
- Data Loss Impact: Failover cannot begin; any delay could risk data currency until this is resolved.
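Before rerunning the failover, a quick reachability check from the Eyeglass appliance can confirm whether DNS and the OneFS API port are answering. A minimal sketch, run from the appliance shell, assuming the default Platform API port 8080 and a hypothetical cluster name source-cluster.example.com:

    # Confirm the source cluster's management name resolves in DNS (hostname is a placeholder)
    nslookup source-cluster.example.com
    # Confirm the OneFS Platform API port answers; 8080 is the default unless it was changed
    curl -kv https://source-cluster.example.com:8080/ -o /dev/null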
Wait for Existing Policy Jobs to Complete
Impact on Failover
- If Eyeglass detects other active policy jobs, it waits for them to finish.
- Default timeout is often 180 minutes (3 hours), but this can vary by release or be modified via isi_gs CLI commands.
Recovery Steps
- Confirm No Overlapping Jobs
- Check Eyeglass to ensure no other SyncIQ or replication jobs are running.
- Address Stuck Policies
- If a policy is stuck or returning errors from the cluster, you may need EMC support to resolve the underlying issue before failover can proceed.
- Restart Failover
- Once no conflicting jobs remain, retry the failover.
- Timeout: If the wait exceeds the configured failover timeout, the failover will fail.
- Data Loss Impact: Failover remains blocked, leaving you in a potential data loss scenario until resolved.
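To see directly on the cluster whether any SyncIQ jobs are still active, the OneFS CLI can list running jobs and cancel a stuck one. A minimal sketch, assuming a hypothetical policy named MyPolicy; only cancel a job after confirming it is safe to stop:

    # List SyncIQ jobs currently running on the cluster
    isi sync jobs list
    # Cancel a stuck job by policy name once you have confirmed it should be stopped
    isi sync jobs cancel MyPolicy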
SOURCE Remove Schedule POLICY
Impact on Failover
- Eyeglass is unable to remove or adjust the schedule on the source cluster.
- Communication failure prevents the failover from proceeding.
Recovery Steps
- Verify Eyeglass–Source Connectivity
- Confirm network or permission settings allow Eyeglass to manage schedules on the source.
- Manually Remove/Update Schedule
- If needed, log into OneFS on the source cluster and remove or modify the relevant SyncIQ schedule.
- Restart Failover
- Retry the failover job once the schedule is successfully removed or updated.
- Uncontrolled Failover: This step does not run during an emergency failover.
- Data Loss Impact: Failover is halted until the schedule is addressed, so data is not actively protected.
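If you need to remove the schedule yourself, the OneFS CLI on the source cluster can do it. A minimal sketch, assuming a hypothetical policy named MyPolicy; an empty schedule string leaves the policy to run only on demand:

    # Review the policy and its current schedule
    isi sync policies view MyPolicy
    # Clear the schedule so the policy runs only when started manually
    isi sync policies modify MyPolicy --schedule=""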
Replication Policy Failover (Run All Policies with “SyncIQ Data Sync”)
Impact on Failover
- The final incremental sync of data failed, causing the failover to abort.
- Source and target remain in the initial state (target cluster read‐only).
Recovery Steps
- Identify Which Policies Failed
- Check the Eyeglass Job Details to see which policies ran successfully and which timed out or failed.
- Troubleshoot or Cancel Ongoing Sync
- If a policy job is still running on OneFS, wait for it to finish or cancel it if it’s stuck.
- Manually run the policy again to see if it can succeed.
- Open a Support Case & Retry Failover
- If the policy repeatedly fails, open a case with EMC (or relevant vendor).
- Once resolved, restart the failover job.
- Timeout: Eyeglass waits for each policy up to a configured timeout. If the incremental sync takes longer than that, the step fails.
- Optional “Data Sync”: If unsynced data is not critical, you can uncheck the “Data Sync” box. Failover then proceeds, but any unreplicated data will be lost.
- Uncontrolled Failover: Not run during an emergency failover.
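To retry a failed policy outside of Eyeglass and watch its progress, the OneFS CLI can start the job and report on it. A minimal sketch, assuming a hypothetical policy named MyPolicy:

    # Start the policy manually on the source cluster
    isi sync jobs start MyPolicy
    # Watch its progress; rerun this command until the job completes or reports an error
    isi sync jobs view MyPolicy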
Run Configuration Replication Now (Config Sync)
Impact on Failover
- The final sync of configuration items (e.g., shares, exports, aliases) has failed.
- Failover continues, but the target cluster remains read‐only until config replication succeeds.
Recovery Steps
- Review the Eyeglass Jobs
- In the Eyeglass interface, switch to the running jobs tab and find the recent config replication job.
- Identify & Fix the Failure Reason
- Use the Job Details to see if there’s a permissions, network, or file conflict issue.
- Correct the problem (re-auth, DNS, etc.).
- Restart Failover
- If config data remains unsynced but is nonessential, you can uncheck “Config Sync” to speed up failover.
- Otherwise, once the issue is fixed, rerun the replication and then proceed with failover.
- Skipping Config Sync: You can uncheck “Config Sync” if source/target configs are already aligned or you accept losing any changes.
- Uncontrolled Failover: Not applicable in an emergency scenario.
- Data Loss Impact: Typically minimal for config items, but the target remains in a partial failover state until replication is resolved.
Notes
- Eyeglass Timeout Values: Many failover steps rely on timeouts (e.g., 180 minutes). Adjust these if you have large datasets or slower networks.
- Support Calls: Frequent communication or schedule failures may require a vendor (EMC) support case.
- Documentation: Always consult “Best Practices for Failover with Eyeglass” and official OneFS guides for deeper command‐level instructions.
DFS Mode
If the DFS share rename fails on the target or source cluster, DFS clients will not switch clusters.
Recovery Steps
- Remove the igls-dfs prefix manually from the target cluster shares that weren't renamed (check the failover log). This will complete failover, and clients will switch automatically.
- Add the igls-dfs prefix manually to the source cluster shares that weren't renamed (check the failover log). This will block client access to the source and switch them to the target.
- Allow writes manually from OneFS for selected failover policies. This applies to release 1.9 and below.
- Run quota jobs related to the failover manually from Eyeglass.
- Run re-sync prep manually from OneFS for selected failover policies.
- Apply SyncIQ policy schedule to target cluster policies that failed over.
Releases after 2.0 will run these steps automatically if a share rename fails.
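To identify which shares still carry the prefix before renaming them by hand, the OneFS CLI can list and inspect them. A minimal sketch, assuming a hypothetical share named igls-dfs-MyShare; add --zone=<zone> if the share lives in a non-System access zone. One approach to the rename itself is to re-create the share without the prefix using the same path and permissions:

    # Find shares that still carry the DFS prefix (run on the cluster named in the failover log)
    isi smb shares list | grep igls-dfs
    # Capture the settings of a prefixed share so it can be re-created without the prefix
    isi smb shares view igls-dfs-MyShare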
Networking Operations
The following subsections cover how each of these networking steps can fail to finish, the effect this has on failover, and how to respond.
Rename Source SC (SmartConnect) Zone Names & Aliases
Impact on Failover
- Failover fails during the networking step.
- Auto‐rollback reverts Source/Target clusters to initial states.
- The source file system remains read/write, and SmartConnect zones return to their original settings.
Recovery Steps
- Check the Job Details: Determine the cause of the networking failure (e.g., DNS or network misconfiguration).
- Fix & Retry: Correct the error, then rerun failover. If networking is still unstable, only select the SyncIQ portion until you’re sure network issues are resolved.
- Start Fresh: Once networking is stable, choose the required policies and rerun the failover.
Warnings
- Not performed during uncontrolled (emergency) failovers.
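Before retrying, it can help to confirm what the SmartConnect zones and aliases currently look like and how they resolve. A minimal sketch, using OneFS 8.x syntax and a hypothetical SmartConnect name prod-data.example.com:

    # List network pools with their SmartConnect zone names and aliases
    isi network pools list -v
    # Confirm the SmartConnect name resolves to the expected cluster after the change
    nslookup prod-data.example.com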
Modify Source SPNs
Impact on Failover
- SPN failure does not stop the failover; errors go into the log.
- Failover continues, but SPNs may be incorrect.
Recovery Steps
- Review the Failover Log: Identify failing SPNs.
- Manually Fix SPNs: Use ADSIEdit (with domain admin rights) to create/delete SPNs on the source cluster.
- Validate: Ensure the corrected SPNs match what’s required for each SmartConnect zone.
Warnings
- SPN changes on the source are proxied through target cluster ISI commands, so source cluster availability is not impacted by these corrections.
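As an alternative to ADSIEdit, the setspn utility can make the same corrections from an elevated PowerShell prompt on a domain-joined machine with domain admin rights. A minimal sketch, assuming a hypothetical cluster computer account PRODCLUSTER and SmartConnect zone name prod-data.example.com:

    # List the SPNs currently registered on the cluster's machine account
    setspn -L PRODCLUSTER
    # Add a missing HOST SPN for a SmartConnect zone name (checks for duplicates first)
    setspn -S HOST/prod-data.example.com PRODCLUSTER
    # Remove a stale SPN that no longer matches a SmartConnect zone
    setspn -D HOST/old-name.example.com PRODCLUSTER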
Rename Target SC (SmartConnect) Zone Names & Aliases
Impact on Failover
- Failover fails during the networking step.
- Rollback restores both clusters to pre‐failover state.
- SmartConnect zones revert to original configuration.
Recovery Steps
- Identify Failure Reason: Check job details to see what caused the networking step to fail.
- Attempt Partial Failover: If networking remains unstable, try just the SyncIQ portion while you troubleshoot the network or DNS.
- Rerun Full Failover: After fixing issues, rerun failover with the necessary policies.
Warnings
- Not performed during uncontrolled failover.
Modify Target SPNs
Impact on Failover
- SPN failure does not stop the failover; it continues.
- Errors are logged, and SPNs may be incorrect afterward.
Recovery Steps
- Review the Failover Log: Locate SPN failures.
- Fix SPNs: Use ADSIEdit on the domain to create/delete SPNs for the target cluster.
- Confirm Changes: Ensure the new SPNs match each SmartConnect zone that failed over.
Warnings
- SPN operations on the target are also proxied through ISI commands, so the cluster’s availability is not affected.
Replication Policy Failover
Below are the most common issues that can occur when failing over SyncIQ policies, along with quick step-by-step fixes. For more advanced scenarios or troubleshooting, refer back to the full documentation or contact Support.
Replication Policy Failover (All Policies)
Impact on Failover
- One or more SyncIQ policies did not successfully complete their failover operation.
- This is the parent task that contains all sub‐policies.
Recovery Steps
- Determine Failure Reason
- Check the job details (policy failover logs) to see which policies failed and why.
- Look for a step called “CLUSTERNAME allow writes POLICY PATH” to confirm if at least one policy succeeded.
- Create a New Failover Job (If Needed)
- If some policies never started or failed partway, create a new SyncIQ failover job with the incomplete policies selected.
- Retry only those policies that did not finish.
- Review Data Loss Impact
- Some failing sub‐steps can lead to partial or inconsistent data states. Consult the next sections (or vendor docs) for safe remediation steps.
- Parent Step: Because this is a top‐level failover task, a failure here can have cascading effects on sub‐policies.
- Data Loss: Review each policy’s logs to see if data was at risk when failover halted.
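On recent OneFS releases, the target cluster also reports a failover/failback state per policy, which is another way to confirm which policies reached the allow-writes step. A minimal sketch, assuming SSH access to the target cluster and a hypothetical policy named MyPolicy:

    # List policies targeting this cluster along with their failover/failback state
    isi sync target policies list
    # View one policy in detail; a writes-enabled state indicates "allow writes" completed for it
    isi sync target policies view MyPolicy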
Target “Allow Writes” Policy Path
Impact on Failover
- The target cluster cannot be put into a writeable state (i.e., “writes allowed”).
- By default (as of 1.6.1), the failover attempts a “make writeable” command before running the Resync Prep step.
- On releases prior to 2.0, failover halts if the “make writeable” command fails.
Recovery Steps
- Check Job Details
- Investigate error messages about “allow writes” on the target cluster.
- If needed, manually run the “allow writes” command for the failing policy or fix the underlying cause (e.g., permission issues).
- Restart Failover or Run Prep
- For controlled failovers, manually run Resync Prep on the source cluster after “allow writes” succeeds on the target.
- Re‐initiate failover if the policy is still in a failed state.
- Open EMC Support Case (If Required)
- If the cluster returns repeated errors, you may need EMC assistance to resolve the underlying problem.
- After the error is fixed, re‐attempt the policy failover or proceed with post‐failover tasks (e.g., Quota sync jobs).
- Data Access Impact: Until “allow writes” succeeds, users only have read‐only access to the affected policy data.
- Failover Stops (Pre‐2.0): Releases before 2.0 will halt entirely if “make writeable” fails.
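If you need to run the allow-writes step yourself, the OneFS CLI on the target cluster exposes it directly; for controlled failovers, follow a successful run with Resync Prep on the source as described above. A minimal sketch, assuming a hypothetical policy named MyPolicy:

    # Put the local target of the policy into a writeable state (run on the target cluster)
    isi sync recovery allow-write MyPolicy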
Replication Policy Failover – Recovery
Impact on Failover
- One or more SyncIQ policies in the Failover job did not successfully complete their multi‐step failover.
- Important: No automatic rollback occurs in this failover section. Previously completed steps (networking, allow‐writes) remain in place, and the cluster is treated as “failed over” for those parts.
Recovery Steps
- Diagnose the Specific Policy Error
- Check the failover logs for each policy that returned an error.
- Identify if “resync prep,” “schedule,” or “run” actions are failing.
- Manually Complete or Retry Failing Steps
- If “resync prep” or “allow writes” steps had an error, fix the problem and re-run them manually.
- If scheduling or running the policy fails, correct the issue (quota error, DNS, permission) and retry.
- Open EMC Support Case (If Needed)
- Policies that persistently return errors might require EMC assistance to resolve.
- Completing these steps is critical to fully protect the filesystem after failover.
- Data Loss Impact: Typically none, because earlier failover steps have completed. However, any policy that remains unfinished is not actively protecting data.
- Parent Step: This step is a container for multiple sub‐steps; partial completion could leave some policies in a limbo state.
Source “Resync Prep” Policy
Impact on Failover
- The mirror policy cannot be created or prepared.
- The target cluster is active, but the overall failback readiness status is failed.
- As of 1.6.1, the “make writeable” command is attempted before Resync Prep; on 2.0 or newer, the failover tries to run Resync Prep against all included policies in sequence.
Recovery Steps
- Log In to OneFS on the Source
- Manually execute “resync prep” for the failing policy.
- Address any errors that appear (permissions, path issues, etc.).
- Set the Schedule (If Needed)
- Confirm that the target policy has a proper schedule.
- If it’s missing or broken, configure it to ensure ongoing replication.
- Re‐Run Quota or Cleanup Tasks
- If the policy completes Resync Prep successfully, you may also need to run Quota Jobs from Eyeglass or contact support for post‐failover quota sync.
- Uncontrolled Failover: This step does not run in an emergency failover scenario.
- Data Loss Impact: Typically none, but any policy not prepped for resync is left unprotected until this is fixed.
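If you are running these steps by hand, the OneFS CLI covers the Resync Prep and the check for the resulting mirror policy. A minimal sketch, assuming a hypothetical policy named MyPolicy; OneFS names the mirror policy with a _mirror suffix by default:

    # Run Resync Prep for the failing policy (run on the source cluster)
    isi sync recovery resync-prep MyPolicy
    # Confirm the mirror policy was created on the target (now-active) cluster
    isi sync policies list | grep MyPolicy_mirror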
Target “Run” Policy Mirror
Impact on Failover
- The mirror policy cannot run on the target.
- The target cluster is active, but the overall failover status is marked as failed.
Recovery Steps
- Log In to OneFS on the Target
- Identify the error preventing the policy from running (permissions, connectivity, etc.).
- Fix & Retry
- Correct the issue (e.g., reconfigure the policy, address any DNS or network errors).
- Re‐attempt running the mirror policy.
- Contact Support (If Needed)
- If the policy repeatedly fails, open a support case (EMC or vendor‐specific) to resolve deeper issues.
- Uncontrolled Failover: This step does not run during an emergency failover.
- Data Loss Impact: Typically none, but full protection is blocked until the mirror policy runs successfully.
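To run the mirror policy by hand and confirm it completes, the OneFS CLI on the target (now-active) cluster can start it and show its progress. A minimal sketch, assuming a hypothetical mirror policy named MyPolicy_mirror (OneFS appends _mirror by default):

    # Start the mirror policy manually on the cluster where Resync Prep created it
    isi sync jobs start MyPolicy_mirror
    # Watch its progress until it completes or reports an error
    isi sync jobs view MyPolicy_mirror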