Version: 2.12.0

Failover Estimation

Introduction

Version 2.11.0 introduced an initial implementation of Failover Time Estimation, which gives users an estimated duration for failover jobs before they run. The estimate is based on historical averages from previous failovers and is meant to give a general sense of how long the process may take.

This document outlines:

What steps are included (and excluded) in the estimate
How estimates are calculated for policies, zones, and pools
The structure of the estimate object
How these values appear in the Eyeglass UI
Configuration options in system.xml that affect how estimates are generated

Keep in mind that this is a first version of the feature. While it can provide useful insight, the estimate is an approximation and may not fully reflect conditions at the time of failover.

What the Failover Estimate Includes

Running all SyncIQ policies
Renaming SmartConnect zone names and aliases (during zone/pool failover)
Renaming shares on source and target (for DFS failover)
Making policies writable
Running resync prep on all policies
Failing over quotas from source to target (create on target, delete from source)

What the Estimate Does Not Include

Pre- and post-failover script execution
SMB Data Integrity lockout steps
SPN creation and deletion
Enabling/disabling Eyeglass jobs or setting schedules
Report generation
Post-failover actions

Types of Failover

Policy Failover Estimation

Failover time for an individual policy is estimated by averaging the duration of key steps, using historical data from previous successful failover runs.

The following data points are used:

SyncIQ Runtime

Based on the average runtime of SyncIQ policy reports from the past 90 days.

Make Writable / Resync Prep Time

Pulled from detailed job logs that record how long each step takes during failover.

Share Redirection

The duration is estimated from the historical average time of a single step. That average is then scaled by how often the step is expected to run in the failover scenario. If the operation supports concurrency, the estimate is adjusted accordingly.

Quota Creation / Deletion

Quota creation can run either in parallel or serially, depending on the runquotasyncinparallel setting (default: true):
- If true:

  Quotas are created in parallel, grouped into two phases: default quotas first, followed by regular quotas.
- If false:

  All quotas are created one at a time, in strict sequence.

Quota deletion is simpler: it always uses the average time from past jobs.
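
The per-step logic above can be sketched roughly as follows. This is an illustration only, not the Eyeglass implementation: the function names, the batch-based concurrency model, and the `concurrency` value are all assumptions, since the source does not state exactly how the concurrency adjustment is made.

```python
import math


def estimate_step_seconds(avg_seconds: float, run_count: int, concurrency: int = 1) -> float:
    """Estimate the total time for a step expected to run `run_count` times.

    Assumption: with concurrency > 1, runs proceed in batches of that size,
    and each batch takes roughly the historical average for one run.
    """
    if run_count <= 0:
        return 0.0
    batches = math.ceil(run_count / max(concurrency, 1))
    return avg_seconds * batches


def estimate_quota_creation_seconds(
    avg_seconds: float,
    default_quotas: int,
    regular_quotas: int,
    parallel: bool = True,   # mirrors runquotasyncinparallel (default: true)
    concurrency: int = 4,    # hypothetical worker count, not a documented value
) -> float:
    """Estimate quota creation time for the parallel and serial modes."""
    if parallel:
        # Two phases: default quotas first, then regular quotas.
        return (
            estimate_step_seconds(avg_seconds, default_quotas, concurrency)
            + estimate_step_seconds(avg_seconds, regular_quotas, concurrency)
        )
    # Serial mode: every quota is created one at a time.
    return estimate_step_seconds(avg_seconds, default_quotas + regular_quotas, 1)
```

Under these assumptions, 3 default quotas and 10 regular quotas at an average of 2 seconds each would be estimated at 26 seconds serially, and considerably less when created in parallel.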

When a failover job includes multiple policies, steps are typically executed in parallel. In this case:

For each estimation element (SyncIQ, resync, etc.), we take the longest duration across all included policies.
As an exception, share redirection remains serial if runconfigsyncinparallel is false.
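
A minimal sketch of the multi-policy combination rule, assuming each policy's estimate is held as a dictionary of per-element durations in seconds. The element names and the function name are hypothetical, and treating "serial" share redirection as a sum of the per-policy durations is an interpretation of the rule above rather than a documented formula.

```python
def combine_policy_estimates(policies, run_config_sync_in_parallel=True):
    """Combine per-policy element estimates into one per-element estimate.

    Elements run in parallel across policies, so the longest duration wins.
    Share redirection is summed instead when runconfigsyncinparallel is false.
    """
    elements = {name for policy in policies for name in policy}
    combined = {}
    for element in elements:
        durations = [policy.get(element, 0.0) for policy in policies]
        if element == "share_redirection" and not run_config_sync_in_parallel:
            combined[element] = sum(durations)   # serial: durations add up
        else:
            combined[element] = max(durations)   # parallel: slowest policy dominates
    return combined


policies = [
    {"synciq": 120.0, "resync_prep": 30.0, "share_redirection": 10.0},
    {"synciq": 300.0, "resync_prep": 20.0, "share_redirection": 15.0},
]
parallel_estimate = combine_policy_estimates(policies)
serial_share_estimate = combine_policy_estimates(policies, run_config_sync_in_parallel=False)
```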

Policy estimates are calculated during the Configuration Replication job, using the historical failover runtime averages computed during the Readiness job.

Zone and Pool Failover Estimations

Zone and pool failover estimates are based on the combined estimates of all policies within that zone or pool, plus the time needed for network redirection.

Network redirection includes:

Renaming the source SmartConnect zone name
Renaming the target SmartConnect zone name
Renaming any source igls-aliases
Renaming any target igls-aliases

The estimated redirection time is calculated as:

(AVERAGE_RENAME_TIME * 2 + NUMBER_OF_ALIASES * AVERAGE_RENAME_TIME * 2) * NUMBER_OF_POOLS
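
As a worked example, the formula can be transcribed directly into a small function. The function and parameter names are illustrative; only the arithmetic comes from the formula above. The factor of 2 covers the source and target side of each rename.

```python
def estimate_redirection_seconds(
    average_rename_time: float,
    number_of_aliases: int,
    number_of_pools: int,
) -> float:
    """Network redirection estimate, following the formula above."""
    zone_renames = average_rename_time * 2                        # source + target zone name
    alias_renames = number_of_aliases * average_rename_time * 2   # source + target aliases
    return (zone_renames + alias_renames) * number_of_pools


# 1.5 s average rename, 3 aliases, 2 pools:
# (1.5 * 2 + 3 * 1.5 * 2) * 2 = (3 + 9) * 2 = 24 seconds
redirection = estimate_redirection_seconds(1.5, 3, 2)
```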

Note:

Zone failover can include multiple pools.
Pool failover is limited to a single zone.
Failing over multiple zones creates separate failover jobs, each with its own estimate.

Zone and pool estimates are calculated during Readiness jobs.

Eyeglass UI Changes

Failover estimates now appear in both the DR Dashboard and DR Assistant, so users can see how long a failover is expected to take based on historical data.

DR Dashboard

The DR Dashboard, which provides an overview of readiness status, now includes a new Estimated Time column. This column displays the calculated failover time for each zone or policy.

Estimates are shown per item based on the latest readiness results.
If the item is in a FAILED_OVER or ERROR state, or the policy is DISABLED, the estimate will not be displayed.
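
The display rule above amounts to a simple guard. The sketch below is illustrative only: the function name, the `policy_disabled` flag, and the "OK" state used in the example are assumptions, while FAILED_OVER and ERROR are the states named above.

```python
from typing import Optional

# States for which the Estimated Time column stays empty.
HIDDEN_STATES = {"FAILED_OVER", "ERROR"}


def displayed_estimate(
    state: str,
    estimate_seconds: Optional[float],
    policy_disabled: bool = False,
) -> Optional[float]:
    """Return the estimate to show, or None when the column should be blank."""
    if state in HIDDEN_STATES or policy_disabled:
        return None
    return estimate_seconds


shown = displayed_estimate("OK", 420.0)              # hypothetical healthy state
hidden = displayed_estimate("FAILED_OVER", 420.0)
```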

This allows users to quickly see expected durations alongside readiness and failover status.

DR Assistant

The DR Assistant, which guides users through executing failover operations, now integrates failover estimates in multiple steps:

Selection Screen: The Estimated Time is shown per zone or policy, helping users understand expected durations before initiating failover.
Review Screen: After one or more items are selected, their individual estimates are merged into a single estimate (or one estimate per zone when multiple zones are selected).
Summary Screen: The final estimate is displayed before the failover begins, providing insight into how long the failover is likely to take.

As a result, users see the estimate at each stage of the workflow before the failover starts.

New `system.xml` Variables

Two new configuration options have been added to system.xml to control how failover time estimates are calculated:

Warning Threshold

The failover_estimate_warning_threshold variable sets the minimum number of historical data points required for each section of the estimate to be considered reliable.

Each estimate is broken down into separate time-based sections (or "time elements"), such as policy runtime, network redirection, or share renaming. The system checks the amount of historical data available for each of these sections individually.

If all sections fall below the threshold, the estimate status is ERROR.
If some, but not all, sections fall below the threshold, the estimate status is WARNING.

For example:

If a newly created policy has only run a couple of times, the policy runtime section is based on very limited data and may not be accurate.
If only a few zone failovers have occurred, the network redirection section might lack sufficient history.

Even if other sections have reliable data, the estimate will still be marked with a WARNING to alert the user that some parts of the calculation may be less accurate.

When estimating across multiple policies, the system uses the lowest data count from any included policy to determine the final status.
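
The threshold check can be sketched as follows. This is a sketch under stated assumptions, not the product's code: the function names and section names are hypothetical, "OK" is a placeholder for whatever status is used when no section falls below the threshold, and taking the lowest count per section is one reading of the multi-policy rule above.

```python
def lowest_counts(per_policy_counts):
    """Merge data-point counts across policies, keeping the lowest per section."""
    merged = {}
    for counts in per_policy_counts:
        for section, count in counts.items():
            merged[section] = min(count, merged.get(section, count))
    return merged


def estimate_status(data_counts, threshold):
    """Classify an estimate by how many sections lack sufficient history.

    `threshold` plays the role of failover_estimate_warning_threshold.
    """
    below = [name for name, count in data_counts.items() if count < threshold]
    if not below:
        return "OK"        # placeholder name, not confirmed by the source
    if len(below) == len(data_counts):
        return "ERROR"     # every section is below the threshold
    return "WARNING"       # only some sections are below the threshold


counts = lowest_counts([
    {"policy_runtime": 12, "share_renaming": 6},
    {"policy_runtime": 2, "share_renaming": 9},
])
status = estimate_status(counts, threshold=3)
```

Here the second policy has run only twice, so the combined estimate is flagged with a WARNING even though the share renaming section has enough history.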

Days of Historic Data

The variable failover_estimate_historic_data_days specifies how many days of historical data to use when calculating averages.

Only SyncIQ reports and failover logs within this window are considered. Older data is excluded from the calculation.
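
A minimal sketch of the windowing behavior, assuming each historical sample is a (timestamp, duration in seconds) pair. The function name and the data layout are hypothetical; `window_days` stands in for failover_estimate_historic_data_days.

```python
from datetime import datetime, timedelta


def average_within_window(samples, window_days, now):
    """Average the durations of samples no older than `window_days`.

    Returns None when no sample falls inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [duration for timestamp, duration in samples if timestamp >= cutoff]
    if not recent:
        return None
    return sum(recent) / len(recent)


now = datetime(2025, 1, 31)
samples = [
    (datetime(2025, 1, 30), 100.0),
    (datetime(2025, 1, 1), 200.0),
    (datetime(2024, 6, 1), 999.0),   # older than 90 days, so excluded
]
average_90_days = average_within_window(samples, 90, now)
average_7_days = average_within_window(samples, 7, now)
```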
