Bad experience installing MPI-capable OpenMC through spack on shared cluster

Hi everyone,

I’ve now attempted to install MPI-capable OpenMC through spack several times, on a system that uses the Slurm job scheduler. That means OpenMPI has to be built with PMI support: as I understand it, PMI is the interface that passes the resources allocated to a scheduler job to the MPI processes when they are launched directly with Slurm’s runner, srun.
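
(As a side check, my understanding is that you can see whether a given OpenMPI build actually includes PMI/PMIx components by grepping its component list; I’m not sure this catches every possible configuration:)

ompi_info | grep -i pmi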

So, in theory, this command should install OpenMC the way I need it:

spack install py-openmc+mpi ^openmc+mpi ^openmpi+pmi schedulers=slurm
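
(Before actually installing, I believe the concretized spec should already reveal whether the external slurm gets picked up; unless I’m misreading how spack spec works, this should print the full dependency tree spack intends to build:)

spack spec py-openmc+mpi ^openmc+mpi ^openmpi+pmi schedulers=slurm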

The OpenMC I get, however, always crashes with this error whether run under srun or not:

[ridley@nsecluster ~]# salloc srun openmc
salloc: Granted job allocation 15788
[n011.cluster.com:25192] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[n011.cluster.com:25192] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: n011: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 15788

However, spack always seems to install its own copy of slurm. This is confusing, because from my reading of the spack documentation, spack should be using my system’s slurm instead. I have told spack where slurm lives on my system:

[root@nsecluster ~]# cat .spack/bootstrap/config/packages.yaml
packages:
  bison:
    externals:
    - spec: bison@3.0.4
      prefix: /usr
  gawk:
    externals:
    - spec: gawk@4.0.2
      prefix: /usr
  slurm:
    paths:
      slurm: /usr
    buildable: False
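
(One thing I notice while writing this up: the slurm entry uses the older paths: style, while bison and gawk use externals:. If I’m reading the current spack docs correctly, the externals form of the same entry would look roughly like the following, where the slurm version is only a guess on my part:)

packages:
  slurm:
    externals:
    - spec: slurm@20.11.0   # version is a guess, not what is actually on the cluster
      prefix: /usr
    buildable: False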

And /usr does indeed appear to be the correct prefix for slurm on this system:

[root@nsecluster ~]# whereis slurm
slurm: /usr/lib64/slurm /etc/slurm /usr/include/slurm /usr/share/man/man1/slurm.1.gz

My hypothesis is that the OpenMPI installed by spack actually was built with PMI support (despite what the error output claims), but that it was built against spack’s own installation of slurm, which does not carry the libpmi that the cluster’s job scheduler is actually using. Hence the two fail to communicate properly and produce the error shown above.
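
(If that hypothesis is right, I would expect the spack-built libmpi to point at a PMI/PMIx library under spack’s tree rather than the system one. Something like this should show it, though PMI support may also live in the MCA plugins under lib/openmpi/, so the check might not be conclusive:)

ldd $(spack location -i openmpi)/lib/libmpi.so | grep -i pmi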

Has anyone else run into this? Is it possible that OpenMC’s own spack package is causing this problem?