====================
Aggregator failover
====================

Bonded Internet can automatically move a bond from one aggregator to
another if the bond's primary aggregator fails. Compared to monitoring
aggregators with an external system such as Nagios and moving bonds off
failed aggregators manually, this significantly reduces customer downtime
during an aggregator failure. Customer downtime is limited to about two
minutes.

With the default settings, aggregators are contacted by the management
server every 30 seconds to verify that they are still online.
If an aggregator fails to respond three times in a row, it is considered
down and bonds on the aggregator with secondary aggregators configured
are moved to their secondary aggregator. Bonds with no secondary
aggregator are not moved. If a bond with a failed primary aggregator
and no secondary aggregator is later assigned a secondary aggregator,
the bond is moved to the secondary aggregator within 30 seconds.
If both a bond's primary and secondary aggregators fail, the bond will
remain assigned to the secondary aggregator.
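
The detection and failover rules above can be sketched in a few lines of
Python. This is only an illustration of the documented behavior, not the
management server's implementation. The ``Bond`` class and the ``is_down``,
``detection_window_seconds``, and ``fail_over`` names are hypothetical, and
the sketch assumes the secondary aggregator's health is not checked before a
bond is moved.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Iterable, Optional, Sequence, Set

    CHECK_INTERVAL_SECONDS = 30  # default check interval
    FAILURES_BEFORE_DOWN = 3     # consecutive missed checks before failover


    @dataclass
    class Bond:
        """Hypothetical model of a bond and its aggregator assignments."""
        name: str
        primary: str
        secondary: Optional[str]
        current: str


    def is_down(check_results: Sequence[bool]) -> bool:
        """True if the latest three checks all went unanswered."""
        recent = list(check_results[-FAILURES_BEFORE_DOWN:])
        return len(recent) == FAILURES_BEFORE_DOWN and not any(recent)


    def detection_window_seconds() -> int:
        """Upper bound on detection time, ignoring check timeouts."""
        return CHECK_INTERVAL_SECONDS * FAILURES_BEFORE_DOWN


    def fail_over(bonds: Iterable[Bond], failed: Set[str]) -> None:
        """Move bonds on a failed primary to their secondary, if they have one."""
        for bond in bonds:
            on_failed_primary = (
                bond.current == bond.primary and bond.primary in failed
            )
            if on_failed_primary and bond.secondary is not None:
                bond.current = bond.secondary

Running ``fail_over`` on every check cycle also covers a bond that is
assigned a secondary aggregator after its primary has already failed: the
bond is picked up on the next cycle. With the defaults, the detection window
is at most about 90 seconds, not counting check timeouts or the time needed
to move the bonds.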


.. warning::

    Fast aggregator failover depends on the aggregation servers being
    integrated into the dynamic routing setup of the partner's core network.
    If no dynamic routing is configured, bonds will still be moved between
    aggregators, but static routes in the datacenter routers will need to be
    updated manually before the bonds are able to come back online.

Management server location
---------------------------

Because the management server controls aggregator failover, it must not be
co-located with any primary aggregators. If it is, an outage that affects
both the management server and the aggregators (for example, a shared
network, electrical, or VM host failure) cannot be detected by the
management server, and bonds will not be moved to their secondary
aggregators.

To avoid this issue, a management server should be put in one of the
following locations:

#. Co-located with secondary aggregators. With this design, the
   management server will not be affected by an outage interrupting
   primary aggregators, and it will be able to detect the primary
   aggregator failure and move bonds to their secondary aggregators. If
   a common outage affects both the management server and secondary
   aggregators, the primary aggregators will still be available and no
   failover is necessary.
#. Located at a dedicated location apart from any aggregators.
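
A placement like this can be checked mechanically. The following is a
minimal sketch that assumes each aggregator is tagged with a site name; the
``colocated_primaries`` function, the site names, and the data layout are
hypothetical and are not part of Bonded Internet.

.. code-block:: python

    from typing import Dict, Iterable, List


    def colocated_primaries(
        management_site: str,
        aggregator_sites: Dict[str, str],
        primary_aggregators: Iterable[str],
    ) -> List[str]:
        """Return primary aggregators that share a site with the management server."""
        return sorted(
            agg
            for agg in primary_aggregators
            if aggregator_sites.get(agg) == management_site
        )


    # Hypothetical layout: agg1 is a primary aggregator, agg2 a secondary.
    sites = {"agg1": "dc-east", "agg2": "dc-west"}
    conflicts = colocated_primaries("dc-west", sites, ["agg1"])

An empty result means the management server does not share a site with any
primary aggregator, which matches both recommended placements.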

Configuration
--------------

There are a variety of ways to configure aggregator failover, depending
on a partner's unique reliability and cost requirements. For details,
see `Bond assignment
strategies <bond-assignment-strategies.html>`__.

Most failover settings can be set per aggregator on the aggregator settings
page. Default values are configured on the failover
`administration page <../administration/aggregator-failover-settings.html>`__.

.. note::

    The known hosts and suspension failover features can only be configured
    globally from the administration page.

To enable aggregator failover for a bond, `edit the
bond <../bonds/managing-bonds.html>`__ and assign a secondary aggregator.

|image0|

To disable aggregator failover, clear the secondary aggregator field for
the bond.

To move the bond back to the primary aggregator automatically when it
recovers, check the Aggregator failback field. If this field is not
checked, you will need to move the bond back to the primary aggregator
manually when it recovers.

You can enable failover for multiple bonds at once by using the `edit
multiple form <../bonds/updating-multiple-bonds.html>`__ on the bond
index page or aggregator bonds list page.

To see which bonds are configured as primary or secondary on an
aggregator, go to the aggregator details page and click the Bonds
button:

|image1|

The aggregator bonds index page has three tabs: primary, secondary, and
current. Bonds listed in the primary tab have that aggregator as their
primary aggregator. Similarly, bonds listed in the secondary tab have that
aggregator as their secondary aggregator. Bonds listed in the current tab
are currently assigned to the aggregator. When an aggregator fails, bonds in
its primary tab do not appear in its current tab, because they have been
moved to their secondary aggregators. The failed-over bonds appear in the
current tab of their secondary aggregators.
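
The relationship between the three tabs can be illustrated with a short
sketch. The ``bond_tabs`` function and the dictionary layout are
hypothetical and are used only to show how one failed-over bond appears on
each aggregator.

.. code-block:: python

    from typing import Dict, List, Optional

    BondRecord = Dict[str, Optional[str]]


    def bond_tabs(bonds: List[BondRecord], aggregator: str) -> Dict[str, List[str]]:
        """Group bond names into the three tabs shown for one aggregator."""
        tabs: Dict[str, List[str]] = {"primary": [], "secondary": [], "current": []}
        for bond in bonds:
            for tab in tabs:
                if bond[tab] == aggregator:
                    tabs[tab].append(str(bond["name"]))
        return tabs


    # A bond whose primary aggregator (agg1) has failed and that has been
    # moved to its secondary aggregator (agg2).
    bonds = [
        {"name": "office", "primary": "agg1", "secondary": "agg2", "current": "agg2"},
    ]
    failed_tabs = bond_tabs(bonds, "agg1")
    secondary_tabs = bond_tabs(bonds, "agg2")

Here the bond is still listed in the primary tab of ``agg1`` but not in its
current tab, and it is listed in both the secondary and current tabs of
``agg2``.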

Disabling failover monitoring
------------------------------

Failover monitoring is enabled by default, but it can be disabled for
individual aggregators. The aggregator details page shows whether failover
monitoring is enabled. To enable or disable this setting, `edit the
aggregator <adding-and-updating-aggregators.html>`__.

|image2| |image3|

.. warning::

    When failover monitoring is disabled on an aggregator, a failure of that
    aggregator does not cause any of its bonds to fail over to their
    secondary aggregators.

Notifications
--------------

Bonded Internet sends email notifications about aggregator failures to
administrators who have elected to receive these messages. It also sends
emails when aggregators recover. To receive these emails, update your
account preferences according to the instructions in `Editing your
account <../using-the-web-application/editing-your-account.html>`__.

The Bonded Internet web application tags aggregators that have failed.
For example, on the aggregator index page:

|image4|

A notice is also shown on the aggregator details page:

|image5|

On the bond index page, bonds show when they have been assigned to their
secondary aggregators:

|image6|

They also show a notice on the bond details page:

|image7|

Recovery and flap detection
----------------------------

When an aggregator has failed and later comes back online, an email
notification is sent to administrators. Bonds with failback enabled are
moved back to the recovered aggregator, while those with failback
disabled remain on the secondary aggregators. Those bonds need to be
moved back to the primary aggregator manually.
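
The failback rule can be sketched as follows. The ``Bond`` class and the
``on_primary_recovered`` function are hypothetical names used for
illustration, not part of the product.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Iterable, List


    @dataclass
    class Bond:
        """Hypothetical model of a bond for the failback decision."""
        name: str
        primary: str
        current: str
        failback: bool


    def on_primary_recovered(bonds: Iterable[Bond], recovered: str) -> List[str]:
        """Move failback-enabled bonds home; return bonds needing a manual move."""
        needs_manual_restore = []
        for bond in bonds:
            if bond.primary != recovered or bond.current == recovered:
                continue  # not affected by this recovery
            if bond.failback:
                bond.current = bond.primary
            else:
                needs_manual_restore.append(bond.name)
        return needs_manual_restore

The returned list corresponds to the bonds that remain on their secondary
aggregators until an administrator restores them.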

To avoid an aggregator flapping up and down, a flap damping technique is
used when an aggregator recovers. To be considered available, an
aggregator must respond to two consecutive checks from the management
server. If the aggregator passes two checks but fails again within 15
minutes, it must respond to four checks when it recovers before it is
considered available. The number of checks required doubles each time the
aggregator fails, up to a maximum of 20 checks, or 10 minutes at the
default 30-second check interval. The flap damping interval is reset after
the aggregator has responded to 30 consecutive checks.
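
The doubling schedule described above works out as shown in this sketch. The
function and constant names are hypothetical. The sketch covers only the
doubling and the cap, not the 15-minute window or the reset after 30
consecutive checks.

.. code-block:: python

    from typing import List

    CHECK_INTERVAL_SECONDS = 30   # default check interval
    INITIAL_REQUIRED_CHECKS = 2   # checks needed after a first failure
    MAX_REQUIRED_CHECKS = 20      # upper limit on required checks


    def next_required_checks(current: int) -> int:
        """Double the number of required checks, capped at the maximum."""
        return min(current * 2, MAX_REQUIRED_CHECKS)


    def damping_schedule() -> List[int]:
        """Required checks after each successive failure, until the cap."""
        schedule = [INITIAL_REQUIRED_CHECKS]
        while schedule[-1] < MAX_REQUIRED_CHECKS:
            schedule.append(next_required_checks(schedule[-1]))
        return schedule


    def recovery_delay_seconds(required_checks: int) -> int:
        """Minimum time an aggregator must stay up before it counts as available."""
        return required_checks * CHECK_INTERVAL_SECONDS


    schedule = damping_schedule()                            # [2, 4, 8, 16, 20]
    delays = [recovery_delay_seconds(n) for n in schedule]   # [60, 120, 240, 480, 600]

The last entry, 600 seconds, is the 10-minute ceiling mentioned above.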

Restoring individual bonds to their primary aggregator
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

When an aggregator recovers, bonds that were not moved back
automatically should be moved back manually. The following notice is
shown on the bond details page of each bond that is still on its
secondary aggregator:

|image8|

To assign the bond back to its primary aggregator, click the Restore to
primary aggregator button.

Restoring multiple bonds to their primary aggregators
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Multiple bonds can be moved back to their primary aggregators from the
main bond index page or from an aggregator bond page. For each bond that
should be moved back to its primary aggregator, click the bond's
checkbox on the left side of the table. When all the target bonds have
been selected, click the |image9| beside the Edit selected button.
Then click Restore selected to primary aggregator.

|image10|

The selected bonds are moved back to their primary aggregator.

Limitations
------------

Aggregator failover currently has the following limitation, which will be
addressed in a future version of Bonded Internet.

-  Assignment of bonds to secondary aggregators needs to be done
   manually. See `Bond assignment
   strategies <bond-assignment-strategies.html>`__ for ideas on
   how to assign bonds to aggregators.


.. |image0| image:: /attachments/2228260/2818063.png
.. |image1| image:: /attachments/2228260/2392069.png
.. |image2| image:: /attachments/2228260/10354711.png
.. |image3| image:: /attachments/2228260/10354712.png
.. |image4| image:: /attachments/2228260/2392074.png
.. |image5| image:: /attachments/2228260/2392072.png
.. |image6| image:: /attachments/2228260/2392075.png
.. |image7| image:: /attachments/2228260/2392076.png
.. |image8| image:: /attachments/2228260/2392077.png
.. |image9| image:: /attachments/2228260/2392079.png
.. |image10| image:: /attachments/2228260/2392078.png
