================================
Supported environments for nodes
================================

Bare metal is generally recommended for nodes. The networking software is
particularly sensitive to latency, which typically cannot be guaranteed in
virtualized environments.

In addition to bare-metal devices, the following virtualization technologies
are used during development and testing:

- LXC
- Systemd-nspawn
- QEMU/KVM

During the release-candidate phase, the software is also tested in a
production VMware environment.

Other environments will generally work, but we cannot guarantee their
operation in all situations.

.. note::

    OpenVZ is known to be incompatible.

Optimizing IRQ handling
^^^^^^^^^^^^^^^^^^^^^^^

One potential bottleneck for server performance is the rate at which the server can process interrupt requests (IRQs) from the NIC.
In the context of bonding, this typically appears as one or two server cores fully pinned by a *ksoftirqd* worker.
This issue is known to occur most frequently on private WAN routers, but it can also occur on aggregators handling many connections.
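
A quick way to confirm this symptom is to watch per-core softirq load on the
node; a minimal sketch, assuming the standard ``procps`` and ``sysstat``
packages are installed:

.. code-block:: bash

    # Per-CPU utilization, refreshed every second; a core stuck near 100% in
    # the %soft column is saturated handling softirqs (mpstat is in sysstat).
    mpstat -P ALL 1

    # List the ksoftirqd threads, the core each runs on (PSR), and its CPU use.
    ps -eo pid,psr,pcpu,comm | grep '[k]softirqd'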

.. note::

    This issue is not strictly caused by a large volume of traffic, but rather by how many interrupts the traffic generates
    (a single 100 Gb/s connection is far less IRQ-intensive than one thousand 100 Mb/s connections).

.. note::

    The TCP proxy directly increases the ratio of interrupts generated per bit of traffic,
    and as such nodes hosting many bonds using the proxy are particularly susceptible to this bottleneck.

In general, there are two ways to improve interrupt handling.
Informally, they are "distribute interrupts across cores" and "generate fewer interrupts in the first place".
Formally, these approaches are called "IRQ affinity balancing" and "IRQ coalescing", and both are explained in more detail below.

IRQ affinity balancing
----------------------

Linux uses the irqbalance daemon to manage the CPU load generated by interrupts across all CPUs.
By default, irqbalance identifies the highest-frequency interrupt sources and isolates them all to a single CPU core.
For servers handling a large volume of traffic, this can result in only one or two cores being allocated to handle all network-related interrupts.
In these situations, manually changing the system IRQ affinity (so as to distribute the interrupts across more cores) can significantly improve overall throughput.
However, this generally comes at the cost of some increased latency and jitter.
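
Per-IRQ affinity is exposed through ``/proc``; a minimal sketch of inspecting
and changing it, assuming an interface named ``eth0`` and an IRQ number of 42
(both placeholders that vary by NIC and driver):

.. code-block:: bash

    # See which IRQs the NIC raises and how they are spread across CPUs.
    grep eth0 /proc/interrupts

    # Stop irqbalance first; otherwise it may override manual affinity settings.
    systemctl stop irqbalance

    # Pin IRQ 42 to CPU 2 (the value is a hexadecimal CPU bitmask: 0x4 = CPU 2).
    echo 4 > /proc/irq/42/smp_affinity

    # Or name CPUs directly using the list format.
    echo 2 > /proc/irq/42/smp_affinity_list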

Resources:

-  `What is IRQ Affinity? <https://community.mellanox.com/s/article/what-is-irq-affinity-x>`__

-  `Interrupts and IRQ Tuning <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-cpu-irq>`__

-  `SMP IRQ affinity <https://www.kernel.org/doc/Documentation/IRQ-affinity.txt>`__

IRQ coalescing
--------------

As the name implies, IRQ coalescing is the act of batching multiple packets together so that only one interrupt is generated for the batch. In general, there are two ways to do this:

#. Only raise an interrupt after a certain number of frames have been queued.
#. Only raise an interrupt after a certain amount of time has passed since a packet was queued.

Note that this requires a NIC that supports multiple interrupt vectors.
The actual implementation and configuration details for IRQ coalescing are largely hardware-dependent.
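
On Linux, coalescing parameters are usually exposed through ``ethtool``; the
parameter names and supported values differ between drivers, so the following
is only a sketch (``eth0`` and the values shown are placeholders):

.. code-block:: bash

    # Show the coalescing settings the driver currently supports and uses.
    ethtool -c eth0

    # Raise an RX interrupt only after 64 frames are queued or 100 microseconds
    # have elapsed since the first queued frame, whichever comes first.
    ethtool -C eth0 rx-frames 64 rx-usecs 100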

Resources:

-  `Understanding Interrupt Moderation <https://community.mellanox.com/s/article/understanding-interrupt-moderation>`__

-  `Interrupt coalescence at the NIC layer <https://www.ibm.com/support/knowledgecenter/en/SSQPD3_2.6.0/com.ibm.wllm.doc/batchingnic.html>`__

Virtualization best practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SD-WAN will operate in many types of virtualization for all host
types: management servers, private WAN routers, aggregators, and bonders.
Virtualization makes it easy to provision and manage hosts, but performance is
typically negatively impacted, even when the virtual machine is the only
machine on a host.

The following best practices are intended for private WAN routers, aggregators
and bonders. As a core part of your customer data network, these nodes are
very sensitive to resource availability and efficiency. Management servers
should be configured using practices generally accepted for web and database
applications; for example, management server requirements focus on memory size
and storage performance rather than CPU and network device performance.

General recommendations for bonders, aggregators, and private WAN routers
-------------------------------------------------------------------------

CPU
+++

Due to the critical latency demands of networking, CPUs should be dedicated to
the virtual machines. Sharing CPU cores negatively affects latency, which
results in lower throughput and unstable bandwidth.

.. TIP::
    Disabling hyperthreading can yield performance improvements
    on bonders that are CPU-limited.
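
Hyperthreading is normally disabled in the BIOS/UEFI setup, but recent kernels
also expose a runtime switch through sysfs; a minimal sketch:

.. code-block:: bash

    # Report whether SMT (hyperthreading) is currently on, off, or unsupported.
    cat /sys/devices/system/cpu/smt/control

    # Disable SMT until the next reboot (requires a reasonably recent kernel).
    echo off > /sys/devices/system/cpu/smt/control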

Memory
++++++

Memory must be reserved for the nodes. Generally, 2 GB is enough for most
nodes, but this should be increased when using the TCP proxy or larger numbers
of private WAN spaces.

Storage
+++++++

Storage is generally not as critical as other resources, but care must be
taken to avoid high disk read/write latency. If disk I/O operations take too
long, service failures may occur.

Also, if the amount of memory is low, the disk will be used to swap memory
pages. If that occurs, the disk will be used more extensively and overall
system performance will be negatively impacted.
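
Whether a node is swapping can be checked with the standard ``free`` and
``vmstat`` utilities, for example:

.. code-block:: bash

    # Show total memory and how much swap is currently in use.
    free -h

    # Report swap-in (si) and swap-out (so) rates every second; sustained
    # non-zero values mean the node is actively swapping.
    vmstat 1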

Network
+++++++

Most network device virtualization methods incur an overhead on network
performance. A certain amount of CPU and memory is used to implement a virtual
interface that copies network packets between the physical interface and the
guest operating system.

Most virtualization systems have a relatively low-overhead virtual device that
should be used instead of full emulation. For example, VMware offers a
``VMXNET3`` device, while QEMU/KVM offers a ``VirtIO`` device. Container
systems such as LXC and nspawn already use a reasonably efficient ``veth``
device by default. The primary advantage of these devices is that they do not
have to emulate a physical device type, allowing the host and guest to pass
packets relatively quickly via system memory.
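
To confirm which driver a guest interface is actually bound to, ``ethtool``
reports it directly (``eth0`` is a placeholder interface name):

.. code-block:: bash

    # Print the kernel driver bound to the interface; expect "vmxnet3" on
    # VMware and "virtio_net" on QEMU/KVM when the paravirtual device is used.
    ethtool -i eth0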

However, these virtual devices are still not as efficient as using the card
directly. Most modern server network devices have advanced offloading and
acceleration features that are not always exposed via virtual devices. In
situations where the traffic load is very high, you may want to consider
passing dedicated network devices directly into the guest operating system.

Tips for specific systems
-------------------------

VMware
++++++

- Install VMware tools. The open source tools are acceptable; these can be
  installed from standard Debian repositories with:

    - ``apt-get install open-vm-tools -y``
    - ``service bonding restart``

- If you are using Private WAN with encryption, you must disable TCP
  segmentation offload (TSO) on all the aggregators and private WAN routers
  running in VMware. The VMware ``VMXNET3`` driver has an issue with TSO in
  combination with IPsec that results in greatly reduced throughput; see the
  ``ethtool`` sketch after this list.
- You may be able to reduce idle-wakeup latencies for guests by setting the
  `Latency Sensitivity <vmw-tuning-latency_>`_ option from *Normal* to *High*.
  This is found under *VM Settings* > *Options tab* > *Latency Sensitivity*.
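
As noted above, TSO can be turned off from inside the guest with ``ethtool``;
the change does not persist across reboots, so it should also be added to the
interface configuration. A sketch, assuming an interface named ``ens192`` (a
placeholder; actual VMXNET3 interface names vary):

.. code-block:: bash

    # Check whether TCP segmentation offload is currently enabled.
    ethtool -k ens192 | grep tcp-segmentation-offload

    # Disable TSO on the interface (does not persist across reboots).
    ethtool -K ens192 tso off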

Amazon Web Services (AWS)
+++++++++++++++++++++++++

- You may need to disable the "Source/Destination Checks" feature. Otherwise,
  traffic routed by nodes may be dropped by the networking infrastructure. See
  the documentation on `Disabling Source/Destination Checks
  <http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck>`__,
  or use the AWS CLI as sketched below.
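
A minimal sketch of the CLI call, with a placeholder instance ID:

.. code-block:: bash

    # Disable the source/destination check so the instance can forward traffic
    # that is not addressed to it (replace the instance ID with your own).
    aws ec2 modify-instance-attribute \
        --instance-id i-0123456789abcdef0 \
        --no-source-dest-check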


External Resources
------------------

Here are some external resources with a variety of useful information. If you
find that any of these links are no longer active, please let us know.

`VMware vSphere 5.5 Documentation
Center <http://pubs.vmware.com/vsphere-55/index.jsp>`__

`Performance Best Practices for VMware vSphere® 5.5
(PDF) <http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf>`__

`VMware KB: Troubleshooting ESX/ESXi virtual machine performance issues
(2001003) <http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2001003>`__

`Common Mistake: Using CPU reservations to solve CPU
Ready <http://www.joshodgers.com/2012/07/22/common-mistake-using-cpu-reservations-to-solve-cpu-ready/>`__

`The Performance Cost of SMP – The Reason for
Rightsizing <http://blogs.vmware.com/vsphere/2013/02/the-performance-cost-of-smp-the-reason-for-rightsizing.html>`__

`VM Right Sizing – An example of the
benefits <http://www.joshodgers.com/2012/07/25/vm-right-sizing-an-example-of-the-benefits/>`__

`VMware: Choosing a network adapter for your virtual
machine <http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805>`__

`VMware: Configuring disks to use VMware Paravirtual SCSI (PVSCSI)
adapters <http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1010398>`__

`VMware: Best Practices for Performance Tuning of Latency-Sensitive Workloads
in vSphere VMs <vmw-tuning-latency_>`__

.. _vmw-tuning-latency: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmw-tuning-latency-sensitive-workloads-white-paper.pdf
