Since the 2017 summer maintenance, MPI jobs run on certain Flux Operating Environment nodes have been issuing unusual warnings and/or failing with unusual error messages. This document describes the problem, including:
- its cause
- the symptoms it produces
- the nodes it affects
- a temporary workaround
- a permanent fix
During the maintenance, we upgraded Mellanox OFED to version 3.4, which included driver updates for the InfiniBand HCAs (host channel adapters, the technical term for InfiniBand cards) in our nodes.
Most of our InfiniBand HCAs are dual port models, and each port can run in either InfiniBand mode or Ethernet mode. By default, the HCA driver automatically detects whether each HCA port is plugged into an InfiniBand switch, an Ethernet switch, or no switch at all. On most of our compute nodes, port 1 is connected to our InfiniBand network and port 2 is unused.
On one specific model of HCA, the new driver incorrectly detects that port 2 is connected to an Ethernet switch, and also incorrectly detects that the Ethernet link it thinks exists on port 2 is up and running. On nodes affected by this problem, the driver reports to software like MPI that there are two RDMA interfaces available: the InfiniBand network that really does exist on port 1, and a high-speed Ethernet network on port 2 that does not.
Programs are, obviously, unable to communicate over the non-existent Ethernet link on port 2.
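Assuming the standard sysfs layout exposed by the mlx4 driver, the phantom link can be seen by reading the per-port state files, which report strings like "4: ACTIVE" or "1: DOWN". The `port_state` helper below is hypothetical, shown only to illustrate parsing that string:

```shell
# On an affected node, the driver reports port 2 as ACTIVE even though
# nothing is plugged in. These paths assume the standard mlx4 sysfs layout:
#   cat /sys/class/infiniband/mlx4_0/ports/1/state   # e.g. "4: ACTIVE" (real IB link)
#   cat /sys/class/infiniband/mlx4_0/ports/2/state   # also "4: ACTIVE" on affected nodes

# port_state: illustrative helper that extracts the textual state
# ("ACTIVE", "DOWN", ...) from the "N: STATE" string sysfs returns.
port_state() {
  echo "${1#*: }"
}

port_state "4: ACTIVE"   # prints "ACTIVE"
port_state "1: DOWN"     # prints "DOWN"
```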
Some code appears to detect this condition up front: it issues a warning about an unusable port 2 and then runs correctly over the IB link. Warnings generated when this happens include:
"No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port. Local device: mlx4_0 Local port: 2"
Note that the “openib” BTL referenced in the warning is used for both InfiniBand and RDMA-over-Ethernet communication, which is not obvious from its name. The device name “mlx4_0” refers to the first (and, in this case, only) mlx4-based HCA in the machine (device numbering is zero-indexed), and “Local port: 2” refers to the second port on that HCA, which the driver has told Open MPI is up and running but which, in fact, is not.
Other code does not seem to perform this check, and execution fails when it is unable to communicate over the non-existent network. Error messages generated when this happens include:
"Error number: 3 (IBV_EVENT_QP_ACCESS_ERR)"
"no cpcs for port"
The node ranges nyx0930-nyx1009 and nyx1098-nyx1233 are affected by this problem.
It is possible to instruct Open MPI to ignore the “mlx4_0:2” interface, regardless of what the driver reports, by setting the following environment variable:
$ export OMPI_MCA_btl_openib_if_exclude=mlx4_0:2
This, of course, must be done before submitting every job, whether batch or interactive, so that the variable is passed into the job's execution environment.
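As a sketch of what this looks like in practice, the variable is exported before the MPI launch so that every MPI process inherits it (the `mpirun` invocation and program name below are illustrative assumptions, not site-specific values):

```shell
# Tell Open MPI's openib BTL to skip the phantom interface on port 2.
export OMPI_MCA_btl_openib_if_exclude=mlx4_0:2

# Verify the setting made it into the environment:
echo "$OMPI_MCA_btl_openib_if_exclude"   # prints "mlx4_0:2"

# Then launch as usual, e.g. (illustrative):
#   mpirun ./my_mpi_program
```

For batch jobs, the same export can go near the top of the job script, before the `mpirun` line.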
Using a configuration file, it is possible to disable InfiniBand/Ethernet autodetection on an HCA and force both ports into InfiniBand mode. We tested this on an idle affected node; after a reboot, the node came up with both ports in InfiniBand mode, and the driver correctly determined that there was no active IB link on the second port.
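The exact file and syntax depend on the distribution's RDMA tooling; as an illustration only, assuming the `/etc/rdma/mlx4.conf` format shipped with the Red Hat rdma package (one line per HCA: PCI address followed by a per-port type), the configuration might look like:

```
# /etc/rdma/mlx4.conf -- illustrative sketch; the PCI address below is an
# assumption, not a value from our nodes. Port types are ib, eth, or auto;
# forcing both ports to "ib" disables autodetection.
0000:0b:00.0 ib ib
```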
We updated our configuration management system to deploy the autodetect-disabling configuration file on all nodes containing HCAs with board_id HP_0230240009. This file is now in place and will take effect on each node when it is rebooted.
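To check whether a particular node carries the affected HCA model, the board_id can be read from sysfs (it also appears in `ibv_devinfo` output). The `check_affected` helper below is hypothetical, shown with the board_id string from above:

```shell
# check_affected: illustrative helper that compares a board_id string
# against the model known to trigger the false link detection.
check_affected() {
  if [ "$1" = "HP_0230240009" ]; then
    echo "affected"
  else
    echo "unaffected"
  fi
}

# On a live node you would feed it the real value, e.g.:
#   check_affected "$(cat /sys/class/infiniband/mlx4_0/board_id)"
check_affected "HP_0230240009"   # prints "affected"
check_affected "OTHER_BOARD_ID"  # prints "unaffected"
```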
It is supposed to be possible to change these port configurations on a live system without a reboot, but attempting to do so has caused a hard crash on every system we tested it on. Because of this, each affected node will need to be rebooted in order for the permanent fix to take effect. We are in contact with the owners of the affected systems and will be handling the node reboots as the owners see fit.