Failover Estimation
Introduction​
In 2.11.0, we’ve introduced an initial implementation of Failover Time Estimation, providing users with an estimated duration for failover jobs before they run. This estimate is based on historical averages from previous failovers and is designed to give a general sense of how long the process may take.
This document outlines:
- What steps are included (and excluded) in the estimate
- How estimates are calculated for policies, zones, and pools
- The structure of the estimate object
- How these values appear in the Eyeglass UI
- Configuration options in system.xml that affect how estimates are generated
Keep in mind that this is a first version of the feature. While it can provide useful insight, the estimate is an approximation and may not fully reflect conditions at the time of failover.
What the Failover Estimate Includes​
- Running all SyncIQ policies
- Renaming SmartConnect zone names and aliases (during zone/pool failover)
- Renaming shares on source and target (for DFS failover)
- Making policies writable
- Running resync prep on all policies
- Failing over quotas from source to target (create on target, delete from source)
What the Estimate Does Not Include​
- Pre- and post-failover script execution
- SMB Data Integrity lockout steps
- SPN creation and deletion
- Enabling/disabling Eyeglass jobs or setting schedules
- Report generation
- Post-failover actions
Types of Failover​
Policy Failover Estimation​
Failover time for individual policies is estimated using a straightforward approach: we calculate the average duration of key steps based on historic data from previous successful failover runs.
The following data points are used:
- SyncIQ Runtime: Based on the average runtime of SyncIQ policy reports from the past 90 days.
- Make Writable / Resync Prep Time: Pulled from detailed job logs that record how long each step takes during failover.
- Share Redirection: The duration is estimated from the historical average time of each step. That average is scaled by how often the step is expected to run in the failover scenario, and adjusted if the operation supports concurrency.
- Quota Creation / Deletion: Quota creation can run either in parallel or serially, depending on the runquotasyncinparallel setting (default: true):
  - If true: Quotas are created in parallel, grouped into two phases: default quotas followed by regular quotas.
  - If false: All quotas are created one at a time, in strict sequence.
  Quota deletion is simpler: it always uses the average time from past jobs.
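The two quota creation modes can be pictured with a minimal Python sketch. The source does not state the actual formula or the degree of parallelism, so the `workers` parameter, the function name, and the batch-based arithmetic are all assumptions made for illustration only.

```python
import math


def quota_creation_seconds(avg_create_time, default_quotas, regular_quotas,
                           run_in_parallel, workers=4):
    """Rough estimate of quota creation time in seconds.

    `workers` is a hypothetical concurrency level; the real value used by
    the product is not documented here.
    """
    if not run_in_parallel:
        # Serial mode: every quota is created one after another.
        return avg_create_time * (default_quotas + regular_quotas)

    def phase(count):
        # Assumption: a phase finishes in ceil(count / workers) batches.
        return math.ceil(count / workers) * avg_create_time

    # Parallel mode: default quotas first, then regular quotas.
    return phase(default_quotas) + phase(regular_quotas)


parallel = quota_creation_seconds(2.0, 3, 10, run_in_parallel=True)
serial = quota_creation_seconds(2.0, 3, 10, run_in_parallel=False)
print(parallel, serial)
```

The two phases are summed rather than overlapped because the text describes default quotas as being created before regular quotas.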
When a failover job includes multiple policies, steps are typically executed in parallel. In this case:
- For each estimation element (SyncIQ, resync, etc.), we take the longest duration across all included policies.
- As an exception, share redirection remains serial if runconfigsyncinparallel is false.
Policy estimates are calculated during the Configuration Replication job, using the past failover runtime averages calculated during the Readiness job.
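As a rough illustration of the multi-policy rule, the following Python sketch takes the longest duration per element across policies and switches share redirection to a sum when it runs serially. The `PolicyEstimate` class, its field names, and the idea that the job total is the sum of the combined elements are assumptions for this example, not the product's actual data model.

```python
from dataclasses import dataclass


@dataclass
class PolicyEstimate:
    """Hypothetical per-policy estimate; all durations are in seconds."""
    synciq_run: float
    make_writable: float
    resync_prep: float
    share_redirection: float
    quota_creation: float
    quota_deletion: float


# Elements assumed to run in parallel across policies.
PARALLEL_ELEMENTS = ("synciq_run", "make_writable", "resync_prep",
                     "quota_creation", "quota_deletion")


def combine_policy_estimates(policies, run_config_sync_in_parallel):
    """Combine per-policy estimates into one job-level estimate."""
    # Parallel elements: the slowest policy determines the duration.
    combined = {name: max(getattr(p, name) for p in policies)
                for name in PARALLEL_ELEMENTS}

    shares = [p.share_redirection for p in policies]
    if run_config_sync_in_parallel:
        combined["share_redirection"] = max(shares)
    else:
        # Assumption: "serial" means the per-policy durations add up.
        combined["share_redirection"] = sum(shares)

    # Assumption: the elements run one after another, so the total is a sum.
    total = sum(combined.values())
    combined["total"] = total
    return combined


a = PolicyEstimate(100, 10, 20, 5, 8, 3)
b = PolicyEstimate(60, 15, 10, 7, 4, 6)
print(combine_policy_estimates([a, b], run_config_sync_in_parallel=False))
```

With runconfigsyncinparallel set to false, only the share redirection element grows with the number of policies; every other element is still bounded by the slowest policy.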
Zone and Pool Failover Estimations​
Zone and pool failover estimates are based on the combined estimates of all policies within that zone or pool, plus the time needed for network redirection.
Network redirection includes:
- Renaming the source SmartConnect zone name
- Renaming the target SmartConnect zone name
- Renaming any source igls-aliases
- Renaming any target igls-aliases
The estimated redirection time is calculated as:
(AVERAGE_RENAME_TIME * 2 + NUMBER_OF_ALIASES * AVERAGE_RENAME_TIME * 2) * NUMBER_OF_POOLS
- Zone failover can include multiple pools.
- Pool failover is limited to a single zone.
- Failing over multiple zones creates separate failover jobs, each with its own estimate.
Zone and Pool estimates are calculated during readiness jobs.
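The redirection formula above can be written as a small Python function. The function and parameter names are illustrative, and the sketch follows the formula in treating the alias count as the same for every pool.

```python
def network_redirection_seconds(average_rename_time, number_of_aliases,
                                number_of_pools):
    """Estimated network redirection time, following the formula above."""
    # One rename on the source and one on the target SmartConnect zone name.
    zone_renames = average_rename_time * 2
    # Each alias is renamed on both the source and the target.
    alias_renames = number_of_aliases * average_rename_time * 2
    return (zone_renames + alias_renames) * number_of_pools


# Example: 4 s average rename, 3 aliases, 2 pools.
print(network_redirection_seconds(4, 3, 2))  # 64
```

For a pool failover, `number_of_pools` would be 1, so the estimate reduces to the zone name renames plus the alias renames.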
Eyeglass UI Changes​
We've made the failover estimates visible in both the DR Dashboard and DR Assistant, giving users better visibility into how long a failover is expected to take based on historical data.
DR Dashboard​
The DR Dashboard, which provides an overview of readiness status, now includes a new Estimated Time column. This column displays the calculated failover time for each zone or policy.
- Estimates are shown per item based on the latest readiness results.
- If the item is in a FAILED_OVER or ERROR state, or the policy is DISABLED, the estimate will not be displayed.
This allows users to quickly see expected durations alongside readiness and failover status.
DR Assistant​
The DR Assistant, which guides users through executing failover operations, now integrates failover estimates in multiple steps:
- Selection Screen: The Estimated Time is shown per zone or policy, helping users understand expected durations before initiating failover.
- Review Screen: After one or more items are selected, the tool merges the individual estimates into a single combined estimate (or several estimates when multiple zones are selected).
- Summary Screen: The final estimate is displayed before the failover begins, providing insight into how long the failover is likely to take.
This integration informs users about estimations at multiple steps of the failover process.
New system.xml Variables
Two new configuration options have been added to system.xml to control how failover time estimates are calculated:
Warning Threshold​
The failover_estimate_warning_threshold sets the minimum number of historical data points required for each section of the estimate to be considered reliable.
Each estimate is broken down into separate time-based sections (or "time elements"), such as policy runtime, network redirection, or share renaming. The system checks the amount of historical data available for each of these sections individually.
- If all sections fall below the threshold, the estimate status is ERROR.
- If some sections fall below the threshold, the estimate status is WARNING.
For example:
- If a newly created policy has only run a couple of times, the policy runtime section is based on very limited data and may not be accurate.
- If only a few zone failovers have occurred, the network redirection section might lack sufficient history.
Even if other sections have reliable data, the estimate will still be marked with a WARNING to alert the user that some parts of the calculation may be less accurate.
When estimating across multiple policies, the system uses the lowest data count from any included policy to determine the final status.
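The threshold rules can be sketched in Python as follows. The section names, the function names, and the "OK" label for an estimate with enough data in every section are assumptions; the text above only names the ERROR and WARNING statuses. The sketch also assumes that the lowest data count is taken per section when several policies are combined.

```python
def lowest_counts(per_policy_counts):
    """Take the lowest data count per section across all policies.

    Assumes every policy reports the same set of sections.
    """
    sections = per_policy_counts[0].keys()
    return {s: min(counts[s] for counts in per_policy_counts)
            for s in sections}


def estimate_status(data_points_per_section, threshold):
    """Classify an estimate by how many sections lack historical data."""
    below = [name for name, count in data_points_per_section.items()
             if count < threshold]
    if not below:
        return "OK"  # hypothetical label, not named in the documentation
    if len(below) == len(data_points_per_section):
        return "ERROR"
    return "WARNING"


# Two policies with hypothetical section names and data counts.
counts = lowest_counts([
    {"policy_runtime": 12, "share_renaming": 9},
    {"policy_runtime": 2, "share_renaming": 10},
])
print(estimate_status(counts, threshold=3))  # WARNING
```

Here the second policy has only two recorded runs, so the combined policy runtime section falls below the threshold and the whole estimate is flagged, even though share renaming has enough history.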
Days of Historic Data​
The variable failover_estimate_historic_data_days specifies how many days of historical data to use when calculating averages.
Only SyncIQ reports and failover logs within this window are considered. Older data is excluded from the calculation.
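A minimal Python sketch of the windowing idea is shown below. The sample format (a finish timestamp paired with a duration in seconds) and the function name are assumptions for illustration. The value 90 is used only because the policy section mentions a 90-day window; whether that is this variable's default is not stated here.

```python
from datetime import datetime, timedelta


def average_within_window(samples, days, now):
    """Average the durations of samples that finished within `days` of `now`.

    `samples` is a list of (finished_at, duration_seconds) tuples.
    Returns None when no sample falls inside the window.
    """
    cutoff = now - timedelta(days=days)
    recent = [duration for finished_at, duration in samples
              if finished_at >= cutoff]
    if not recent:
        return None
    return sum(recent) / len(recent)


now = datetime(2025, 1, 1)
samples = [
    (datetime(2024, 12, 20), 100.0),
    (datetime(2024, 12, 1), 200.0),
    (datetime(2024, 6, 1), 900.0),  # older than 90 days, ignored
]
print(average_within_window(samples, days=90, now=now))  # 150.0
```

Shrinking the window makes the average track recent behaviour more closely, but it also reduces the number of data points, which can push sections below the warning threshold.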