Aggregator failover
SD-WAN can automatically move a bond from one aggregator to another if the bond’s primary aggregator fails. Compared with monitoring aggregators through an external system such as Nagios and manually moving bonds off failed aggregators, this significantly reduces customer downtime in the event of an aggregator failure, limiting it to about two minutes.
With the default settings, aggregators are contacted by the management server every 30 seconds to verify that they are still online. If an aggregator fails to respond three times in a row, it is considered down and bonds on the aggregator with secondary aggregators configured are moved to their secondary aggregator. Bonds with no secondary aggregator are not moved. If a bond with a failed primary aggregator and no secondary aggregator is later assigned a secondary aggregator, the bond is moved to the secondary aggregator within 30 seconds. If both a bond’s primary and secondary aggregators fail, the bond will remain assigned to the secondary aggregator.
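The detection and move rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the management server's actual code: the `Bond` class, its field names, and the `is_down` and `fail_over` helpers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

CHECK_INTERVAL_SECONDS = 30  # default check interval, from the text
FAILURE_THRESHOLD = 3        # consecutive missed checks before an aggregator is down


@dataclass
class Bond:
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    primary: str
    secondary: Optional[str]
    current: str


def is_down(consecutive_failures: int) -> bool:
    """An aggregator is considered down after three missed checks in a row."""
    return consecutive_failures >= FAILURE_THRESHOLD


def fail_over(bonds: list[Bond], failed_aggregator: str) -> list[Bond]:
    """Move bonds off a failed primary aggregator and return the bonds that moved."""
    moved = []
    for bond in bonds:
        on_failed_primary = (
            bond.primary == failed_aggregator and bond.current == failed_aggregator
        )
        # Bonds without a secondary aggregator stay where they are.
        if on_failed_primary and bond.secondary is not None:
            bond.current = bond.secondary
            moved.append(bond)
    return moved
```

Because only bonds sitting on their failed primary are moved, a bond that has already failed over stays on its secondary aggregator even if that aggregator later fails, matching the last rule above.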
Warning
Fast aggregator failover depends on the aggregation servers being integrated into the dynamic routing setup of the partner’s core network. If no dynamic routing is configured, bonds will still be moved between aggregators, but static routes in the datacenter routers will need to be updated manually before the bonds are able to come back online.
Management server location
Since the management server controls aggregator failover, it must not be co-located with any primary aggregators. If the management server is co-located with primary aggregators, an outage that affects both the management server and aggregators (for example, a common network, electrical, or VM host failure) will not be handled and bonds will not be moved to their secondary aggregators.
To avoid this issue, a management server should be put in one of the following locations:
- Co-located with secondary aggregators. With this design, the management server will not be affected by an outage interrupting primary aggregators, and it will be able to detect the primary aggregator failure and move bonds to their secondary aggregators. If a common outage affects both the management server and secondary aggregators, the primary aggregators will still be available and no failover is necessary.
- Located at a dedicated location apart from any aggregators.
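The placement rule can be checked mechanically against an inventory of sites. The sketch below assumes a hypothetical inventory in which each aggregator is a dictionary with `name`, `site`, and `role` keys; SD-WAN does not necessarily expose data in this form.

```python
def unsafe_colocations(management_site, aggregators):
    """Return the names of primary aggregators that share a site with the management server."""
    return [
        agg["name"]
        for agg in aggregators
        if agg["role"] == "primary" and agg["site"] == management_site
    ]


# Hypothetical inventory: site labels and aggregator names are made up.
inventory = [
    {"name": "agg-1", "site": "dc-east", "role": "primary"},
    {"name": "agg-2", "site": "dc-west", "role": "secondary"},
]
```

An empty result means the management server shares no site with a primary aggregator, which covers both recommended placements.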
Configuration
There are a variety of ways to configure aggregator failover, depending on a partner’s unique reliability and cost requirements. For details, see Bond assignment strategies.
Most failover settings can be specified for each aggregator on the aggregator settings page. Default values are configured on the failover administration page.
Note
The known hosts and suspension failover features can only be configured globally from the administration page.
To enable aggregator failover for a bond, edit the bond and assign a secondary aggregator.

To disable aggregator failover, clear the secondary aggregator field for the bond.
To move the bond back to the primary aggregator automatically when it recovers, check the Aggregator failback field. If this field is not checked, you will need to move the bond back to the primary aggregator manually when it recovers.
You can enable failover for multiple bonds at once by using the edit multiple form on the bond index page or aggregator bonds list page.
To see which bonds are configured as primary or secondary on an aggregator, go to the aggregator details page and click the Bonds button.

The aggregator bonds index page has three tabs: primary, secondary, and current. Bonds listed in the primary tab have that aggregator as their primary aggregator. Similarly, bonds listed in the secondary tab have that aggregator as their secondary aggregator. Bonds listed in the current tab are currently assigned to the aggregator. When an aggregator fails, bonds in its primary tab that have failed over no longer appear in its current tab, because they have been moved to their secondary aggregator. The failed-over bonds appear in the current tab of their secondary aggregators.
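As an illustration of how the three tabs relate, the following sketch filters a list of bonds by tab. The dictionary layout and the `bonds_in_tab` helper are assumptions made for this example, not part of the SD-WAN API.

```python
def bonds_in_tab(bonds, aggregator, tab):
    """Return the names of bonds that one tab of an aggregator's bonds page would list.

    Each bond is a dict with "name", "primary", "secondary", and "current" keys
    (a hypothetical representation).
    """
    if tab not in ("primary", "secondary", "current"):
        raise ValueError(f"unknown tab: {tab}")
    return [bond["name"] for bond in bonds if bond[tab] == aggregator]


# agg-a has failed, so bond-1 has been moved to its secondary aggregator, agg-b.
example_bonds = [
    {"name": "bond-1", "primary": "agg-a", "secondary": "agg-b", "current": "agg-b"},
    {"name": "bond-2", "primary": "agg-b", "secondary": None, "current": "agg-b"},
]
```

Here `bond-1` is still listed in the primary tab of `agg-a`, but it appears in the current tab of `agg-b` rather than `agg-a`.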
Disabling failover monitoring
Failover monitoring is enabled by default, but it can be disabled on a per-aggregator basis. From the aggregator details page, you can verify whether failover monitoring is enabled. To enable or disable this setting, edit the aggregator.

Warning
When Failover monitoring is disabled on an aggregator, failure of that aggregator will not result in any of its bonds failing over to their secondary aggregator.
Notifications
SD-WAN sends email notifications about aggregator failures to administrators who have elected to receive these messages. It also sends emails when aggregators recover. To receive these emails, update your account preferences according to the instructions in Editing your account.
The SD-WAN web application tags aggregators that have failed on the aggregator index page, and a notice is also shown on the aggregator details page.

On the bond index page, bonds show when they have been assigned to their secondary aggregators. They also show a notice on the bond details page.
Recovery and flap detection
When an aggregator has failed and later comes back online, an email notification is sent to administrators. Bonds with failback enabled are moved back to the recovered aggregator, while those with failback disabled remain on the secondary aggregators. Those bonds need to be moved back to the primary aggregator manually.
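The failback decision on recovery can be sketched as follows. The bond dictionaries, the `failback` key, and the `handle_recovery` function are hypothetical names chosen for this example.

```python
def handle_recovery(bonds, recovered):
    """Apply failback for a recovered aggregator.

    Returns two lists of bond names: those moved back automatically and those
    that still need to be restored manually.
    """
    moved_back, needs_manual_restore = [], []
    for bond in bonds:
        # Only bonds whose primary is the recovered aggregator, and which are
        # currently somewhere else, are affected.
        if bond["primary"] != recovered or bond["current"] == recovered:
            continue
        if bond["failback"]:
            bond["current"] = recovered
            moved_back.append(bond["name"])
        else:
            needs_manual_restore.append(bond["name"])
    return moved_back, needs_manual_restore
```

The second list corresponds to the bonds that show a restore notice until an administrator moves them back.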
To avoid an aggregator flapping up and down, a flap damping technique is used when an aggregator recovers. To be considered available, an aggregator must respond to two consecutive checks from the management server. If the aggregator passes two checks but fails again within 15 minutes, it must respond to four consecutive checks after it recovers before it is considered available. The number of checks needed to be considered available doubles each time the aggregator fails, up to a maximum of 20 checks, or 10 minutes at the default 30-second check interval. The flap damping interval is reset after the aggregator has responded to 30 consecutive checks.
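The doubling schedule can be worked through with a small sketch. It assumes that failures are counted since the last reset and that any missed check breaks a run of consecutive responses; the `FlapDamper` class and its method names are hypothetical, and the real implementation may track state differently.

```python
BASE_CHECKS = 2              # checks required after the first failure
MAX_CHECKS = 20              # upper bound on required checks
RESET_AFTER = 30             # consecutive responses that reset the damping
CHECK_INTERVAL_SECONDS = 30  # default check interval


def required_checks(failures: int) -> int:
    """Consecutive responses needed after the given number of failures (1 or more)."""
    return min(BASE_CHECKS * 2 ** (failures - 1), MAX_CHECKS)


class FlapDamper:
    def __init__(self):
        self.failures = 0   # failures since the last reset
        self.streak = 0     # consecutive successful checks
        self.available = True

    def mark_down(self):
        """Call when the aggregator is declared down."""
        self.failures += 1
        self.streak = 0
        self.available = False

    def record_miss(self):
        """A missed check breaks the run of consecutive responses."""
        self.streak = 0

    def record_success(self) -> bool:
        """Record a successful check and return whether the aggregator is available."""
        self.streak += 1
        if not self.available and self.streak >= required_checks(self.failures):
            self.available = True
        if self.streak >= RESET_AFTER:
            self.failures = 0  # damping resets after 30 consecutive responses
        return self.available
```

Under these assumptions the required checks run 2, 4, 8, 16, then stay at 20, and 20 checks at 30 seconds each is the 600 seconds (10 minutes) mentioned above. At the default interval, 30 consecutive checks also span 15 minutes, the same length as the window in which a repeat failure counts as a flap.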
Restoring individual bonds to their primary aggregator
When an aggregator recovers, bonds that were not moved back automatically should be moved back manually. A notice is shown on the bond details page of each bond that is still on its secondary aggregator.

To assign the bond back to its primary aggregator, click the Restore to primary aggregator button.
Restoring multiple bonds to their primary aggregators
Multiple bonds can be moved back to their primary aggregators from the main bond index page or from an aggregator bond page. For each bond that should be moved back to its primary aggregator, select the bond’s checkbox on the left side of the table. When all the target bonds have been selected, open the menu beside the Edit selected button, then click Restore selected to primary aggregator.

The selected bonds are moved back to their primary aggregator.
Limitations
Aggregator failover has a few limitations. These limitations will be addressed in future versions of SD-WAN.
- Assignment of bonds to secondary aggregators needs to be done manually. See Bond assignment strategies for ideas on how to assign bonds to aggregators.