=====================================
Bonded Internet 2013.4 release notes
=====================================

September 23, 2013

Bonding node
-------------

Additions
^^^^^^^^^^

- The main service on nodes has been replaced with a simplified application that controls tunnels, TCP proxies, iptables, routing, and all other configuration. This service, simply called node, runs as the unprivileged user bonding.
- A new service, called config, is responsible for configuration management on the node. It accepts updates from the management server, queues updates sent to the server, and reads and writes configuration files. It also runs as the user bonding.
- A new service, known as subprocess, is responsible for running all system commands needed by bonding-related applications. It also runs tunnels and TCP proxy instances. This improves security, because only subprocess needs to run as root, and enhances monitoring and debugging capabilities because all system commands are run from a central location.
- Node services communicate with each other using Unix domain sockets and the ZeroMQ library.
- The new architecture offers major performance improvements: restarting bonding on an aggregator with about 150 bonds took 302 seconds in 2013.3 and 81 seconds in 2013.4.
- Tunnel instances run as the user bonding.
- The config service communicates with the management server using the ZeroMQ protocol. The XML-RPC protocol is no longer used. If port filtering is enabled on the partner core network, port 8001 to the aggregators can be closed when the aggregators are upgraded to 2013.4.
- Crashes in node applications are reported to the management server and are relayed to a Technical Support host. This will allow Technical Support to quickly identify and fix problems in node applications.
- Node applications are checked every five minutes to ensure they are responsive. Unresponsive applications are killed and restarted.
- Status information about leg and connected IP interfaces is reported to the management server. Details include carrier state, link speed, and duplex settings.
- DHCP lease information is reported to the management server. Details include gateway address, lease length, domain, and client error messages.
- PPPoE client information is reported to the management server. Details include remote IP and MAC addresses and client error messages.
- Node operating system uptime is reported to the management server.
- Uptimes of the node, config, and subprocess services, as well as tunnel and TCP proxy instances, are reported to the management server.
- Nodes report their SSH RSA and DSA fingerprints to the management server.
- The bonding software package will not install on a bonder if its configuration has no legs. Instead, the installer gives a warning that a leg must be added before the configuration can continue.
- DNS requests from a bonder work even when the tunnel is down, by routing them directly through a leg instead of through the bonded tunnel. Customer DNS requests continue to use the bonded tunnel, as always.
- Connectivity between bonder legs and the management server is monitored. This information is used for routing DNS requests and requests from nodeconfig and nodessl. The management server connectivity status of each leg can be shown with the legids -v command.
- The legping command shows the latest tunnel ping latencies for each leg in a given bond.
- Nodes can be put into debug mode with an option in the management server. Debug mode enables very verbose logging and is not recommended except under the direction of Technical Support.
- A BOND_ID field has been added to the environment of the leg, connected IP, CPE NAT IP, and routed block hooks.
- When bonding is restarted on a bonder, pppd and udhcpc processes are killed.
- Connected IP hooks have a new environment field, NETWORK.
- Added private WAN and DHCP server hooks.
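
As a rough illustration of the new hook environment fields, the sketch below reads BOND_ID and NETWORK from a hook's environment. Only the two field names come from these notes. The function name is hypothetical, and because the value formats are not specified here, both values are treated as opaque strings.

```python
import os


def describe_hook_environment(environ=None):
    """Return a one-line summary of the bonding fields in a hook environment."""
    environ = os.environ if environ is None else environ
    # BOND_ID and NETWORK are field names from the release notes. Their value
    # formats are not documented here, so they are handled as plain strings.
    bond_id = environ.get("BOND_ID", "unknown")
    network = environ.get("NETWORK", "unknown")
    return "bond={} network={}".format(bond_id, network)


if __name__ == "__main__":
    print(describe_hook_environment())
```

Passing a dictionary instead of relying on os.environ makes the function easy to exercise outside a real hook invocation.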

Changes
^^^^^^^^

- A 2-second timeout is now enforced for hooks. Hooks that can run longer than 2 seconds must be adapted to daemonize themselves; see the hooks documentation for details.
- The tunnel accepts configuration changes without restarting. For example, the tunnel adds and removes legs and enables and disables compression without restarting.
- Legs are restarted in fewer circumstances. For example, changing the upload speed of a PPPoE leg no longer resets the PPPoE connection; it only changes the rate limit on the bonder.
- PPP and DHCP request backoff periods have been reduced from 30 seconds to 1 second. This reduces the time to successfully bring up a PPP or DHCP leg without significantly increasing the load on the bonder.
- DHCP and PPP clients are not started on interfaces that do not have an Ethernet carrier signal. The clients are started as soon as a carrier is detected.
- Logging frequency under certain circumstances has been reduced. For example, if a DHCP lease cannot be obtained on a leg, only the first failure is logged. To reduce the number of duplicate log messages, subsequent errors are not logged.
- The getconfig application has been renamed nodeconfig. getconfig is now a link to nodeconfig, so existing scripts or work procedures will continue to work.
- legids shows considerably more information than before and its output is formatted similarly to the ip suite of tools. It also works on aggregators and takes an optional bond ID argument.
- nodeconfig and nodessl no longer have any output when they work successfully. The new "--verbose" option can be used to show details of their operation.
- On bonders, nodeconfig and nodessl contact the management server through the legs instead of the bonded tunnel, and automatically use a leg that is known to have connectivity to the management server. They also have a "--bind" option to use a specific local IP address.
- The zebra protocol for the Quagga routing software is now enabled on all bonders and aggregators. The zebraenable command has been removed.
- The bonder eth0 troubleshooting network has been changed from 192.168.1.10/24 to 10.207.35.254/29, because 192.168.1.10/24 frequently conflicted with customer and DHCP leg networks. This change is applied to existing bonders that are still configured with the 192.168.1.10/24 network. However, the change will only take effect when an existing bonder is rebooted.
- PID files have moved from /var/run/ to /var/run/bonding/.
- Supervisor scripts have moved from /usr/share/bonding/service/ to /var/lib/bonding/services/.
- The cached configuration file has moved from /var/cache/bonding to /var/lib/bonding/configuration.json.
- The management VPN client log file is now rotated and is located in /var/log/bonding/.
- The management VPN client interface is now called mtun0.
- Tunnel compression speed is slightly improved.
- The tunnel receives its initial configuration over a socket instead of by command line arguments.
- Quagga configuration is no longer committed on every change. The configuration is committed after starting, after stopping, and every 30 seconds while running, as necessary.
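
One possible way for a long-running hook to stay within the 2-second timeout is the classic Unix double fork, sketched below. This is a generic technique and not necessarily the approach the hooks documentation recommends. The run_detached helper is a hypothetical name, and the sketch assumes a hook written in Python on a Unix-like system.

```python
import os
import sys
import time


def run_detached(work):
    """Run work() in a detached grandchild process and return immediately."""
    # Flush buffered output so it is not duplicated in the forked processes.
    sys.stdout.flush()
    sys.stderr.flush()
    pid = os.fork()
    if pid > 0:
        # Parent (the hook itself): reap the short-lived first child, then
        # return so the hook can exit well within the timeout.
        os.waitpid(pid, 0)
        return
    # First child: start a new session, fork again, and exit at once.
    os.setsid()
    if os.fork() > 0:
        os._exit(0)
    # Grandchild: detach standard streams from the caller, then do the work.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
    try:
        work()
    finally:
        # Never fall back into the caller's code in the detached process.
        os._exit(0)
```

A hook would call run_detached with its slow task (for example, a function that sleeps or contacts a remote service) and then exit normally. Errors raised by the task are swallowed in this sketch, so a real hook would need its own logging.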

Removals
^^^^^^^^^

- The following fields have been removed from the leg hook environment: GIVEN_IP and DETERMINED_IP.
- The following fields have been removed from the tunnel hook environment: CLAMP_TCP and DEF_ROUTE_SRC.
- The file /etc/bonding.conf is no longer used if /etc/bonding/bonding.conf does not exist.
- dhclient processes are no longer killed when bonding is restarted on a bonder.
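
Existing hook scripts may still reference the removed environment fields. The following minimal sketch checks the text of a hook script for them. The removed field names are taken from the list above; the function name and the "leg"/"tunnel" keys are hypothetical labels chosen for illustration.

```python
import re

# Fields removed from each hook environment in 2013.4, per the list above.
REMOVED_FIELDS = {
    "leg": ("GIVEN_IP", "DETERMINED_IP"),
    "tunnel": ("CLAMP_TCP", "DEF_ROUTE_SRC"),
}


def find_removed_fields(script_text, hook_type):
    """Return the removed fields of hook_type that script_text still mentions."""
    found = []
    for field in REMOVED_FIELDS[hook_type]:
        # \b keeps GIVEN_IP from matching inside a longer name such as
        # MY_GIVEN_IP_2, because underscores count as word characters.
        if re.search(r"\b{}\b".format(re.escape(field)), script_text):
            found.append(field)
    return found
```

Running this over each hook before upgrading gives a quick list of scripts that need attention. It is a plain text search, so it also reports mentions inside comments.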

Fixes
^^^^^^

- The node cached configuration file is modified to reflect the updates it receives from the management server. This ensures that nodes use an up-to-date configuration even if the device is rebooted or the core services are restarted.
- Features that depend on Linux iptables, such as Quality of Service, speed test rate limiting, and TCP proxy, no longer fail due to occasional iptables "resource temporarily unavailable" errors.
- When the MTU field of a leg is cleared, the MTU is restored to its initial value, usually 1500 bytes.
- PPPoE legs now start properly even in the rare case that the PPP client adds its IP address to the interface after it daemonizes.
- When a node key is changed to make a bonder host a different bond, the node's SSL certificate is updated by nodeconfig. This prevents management VPN clients from being assigned incorrect IP addresses.
- DHCP legs work properly when multiple routers are specified by the DHCP server. Bonding uses the first router specified.
- DHCP legs respect the interface MTU value sent by the DHCP server. However, the DHCP server must be configured to always send MTU information, since the DHCP client cannot be configured to request it.
- Speed test results on low-latency links no longer show negative RTT pings.
- Two rare uncaught exception errors in the tunnel have been fixed.

Defects
^^^^^^^^

- The node config service sometimes quits on a ZeroMQ error after the management VPN client quits. Since the service is immediately restarted, this has no impact on system functionality.

Patches
^^^^^^^^

:2013.4-1: Fixes a possible MTU issue if a tunnel starts with no legs and fixes static IP legs failing to remove the IP address from the interface when stopped.
:2013.4-2: Fixes a number of issues: makes QoS handling more robust, increases the timeout for Quagga commands, fixes an issue checking connectivity to the management server, adjusts some logging priorities, and fixes a missing package dependency.
:2013.4-3: Adds the NETWORK field to connected IP hooks, adds private WAN and DHCP server hooks, and fixes a crashing bug in the tunnel related to MTU negotiation.
:2013.4-4: Fixes a number of issues: increases the heartbeat timeout for the node service, fixes issues seen on aggregators after moving a bond away from an aggregator and then back, fixes two unhandled exceptions in the tunnel application, batches Quagga configuration commits instead of committing on every routing change, fixes an issue adding a leg to an interface that doesn't exist on the bonder, makes DHCP and PPP legs start unless node is certain there is no Ethernet connectivity on the interface, and fixes an issue checking authentication in requests to nodes.
:2013.4-5: Fixes a potential packet loss issue after changing the balancing mode.
:2013.4-6: Fixes an issue that can prevent iptables rules from being added immediately after restarting bonding. Increases the default file descriptor limit of node and subprocess services from 1,024 to 16,384.
:2013.4-7: Fixes an issue on aggregators removing bonds that have routes.
:2013.4-8: Fixes an issue on some aggregators when starting TCP proxy applications.

Bonding admin
--------------

Additions
^^^^^^^^^^

- User management forms, accessible to superusers only, have been added. This allows superusers to add, edit, and disable user accounts without contacting Technical Support.
- A node details page has been added that shows information about a node, including its device uptime, core service uptimes, and SSH fingerprints.
- A node status indicator has been added to the bond and aggregator index pages and bond, aggregator, and node details pages. This indicator shows whether or not the node management VPN client is connected. After a client connects or disconnects, it can take up to two minutes for the indicator to change.
- A link to the bond Nagios check page has been added. It is located at the bottom-right corner of the bond details page.
- The API leg resource now includes a field indicating the type of leg.
- Config updates are delivered to 2013.4 nodes using ZeroMQ.
- A command-line application has been added to automatically upgrade bonder software.

Changes
^^^^^^^^

- The Debian installer preseed files have been updated to work with our recommended PXE provisioning server setup.
- The serial speed for installer ISOs is now 115,200 baud.
- The management VPN server interface is now called mtun0.
- Huey, the asynchronous task queue library, has been upgraded to version 0.4.
- Documentation has been expanded and reorganized.

Removals
^^^^^^^^^

- When accessed over HTTP, the API site no longer redirects to the HTTPS site. This ensures that applications don't unknowingly transmit authentication details over HTTP.
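
Because the API site no longer redirects, a client that is configured with an HTTP URL will simply fail. A client can guard against this before any credentials are sent, as in the sketch below. The require_https helper is hypothetical, and the host names in the usage note are placeholders rather than real API addresses.

```python
from urllib.parse import urlsplit


def require_https(url):
    """Return url unchanged if it uses HTTPS, otherwise raise ValueError."""
    scheme = urlsplit(url).scheme.lower()
    if scheme != "https":
        # Fail early instead of transmitting authentication details in clear text.
        raise ValueError(
            "refusing to send credentials over {!r}: {}".format(scheme or "no scheme", url)
        )
    return url
```

For example, require_https("https://api.example.invalid/") returns the URL, while require_https("http://api.example.invalid/") raises ValueError.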

Fixes
^^^^^^

- The Quality of Service packet filter form handles source and destination IPs properly.
- Updating a QoS profile sends an update to each aggregator, even ones that have no bond using the profile. This ensures the aggregator stays up-to-date in case the profile is assigned to one of the aggregator's bonds at a later time.
- Static leg details always show currently-configured IPs, instead of sometimes showing previously-used IPs.
- The speed test results page now shows an error message if the results request times out.
- TCP proxy settings are no longer saved in speed test results, reflecting the fact that speed tests never use the TCP proxy. TCP proxy settings for existing test results have been corrected as well.
- Changing the failover leg ping time or fail time no longer sets a failover leg to normal mode.

Deprecated
^^^^^^^^^^^

- The following API fields for PPPoE legs and PPPoE/Radius legs have been deprecated: rad_username and rad_password. These fields will be removed in a future release. Adapt your applications to use the ppp_username and ppp_password fields instead.
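
Applications that build leg payloads as dictionaries could adapt with a small renaming step like the one sketched here. The four field names come from the note above. The function name and the assumption that a leg is represented as a flat dictionary are illustrative only and do not describe the API's actual request format.

```python
# Deprecated field name -> replacement field name, per the deprecation note.
DEPRECATED_TO_CURRENT = {
    "rad_username": "ppp_username",
    "rad_password": "ppp_password",
}


def migrate_leg_fields(payload):
    """Return a copy of payload with deprecated field names replaced."""
    migrated = dict(payload)
    for old, new in DEPRECATED_TO_CURRENT.items():
        if old in migrated:
            value = migrated.pop(old)
            # Keep an existing ppp_* value rather than overwrite it.
            migrated.setdefault(new, value)
    return migrated
```

The input dictionary is left untouched, so the function can be applied safely to payloads that are reused elsewhere.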

Patches
^^^^^^^^

:2013.4-1: Adds node status tooltips and fixes an issue handling updates from PPPoE/Radius legs.
:2013.4-2: Fixes an issue sending authentication information to nodes and fixes an issue detecting node management VPN client connection status.
:2013.4-3: Adds support for the updated documentation package format.
