(parallel depletion) "OSError: Unable to create file"

Hi all,

I am trying to run parallel depletion to test my MPI functionality. I believe a recent PR enabled serial HDF5 use with MPI-parallel OpenMC runs. This is quite convenient for modestly sized clusters like the one I’m running on.

In trying to run the pincell_depletion case, I set OMP_NUM_THREADS and run “mpirun -n 12 python run_depletion.py”.

This seems to work fine until depletion reaction rates are written, when each MPI rank seems to emit this error message:

Traceback (most recent call last):
  File "run_depletion.py", line 95, in <module>
    integrator.integrate()
  File "/opt/openmc/gnu-dev/openmc/deplete/abc.py", line 881, in integrate
    p, self._i_res + i, proc_time)
  File "/opt/openmc/gnu-dev/openmc/deplete/results.py", line 493, in save
    results.export_to_hdf5("depletion_results.h5", step_ind)
  File "/opt/openmc/gnu-dev/openmc/deplete/results.py", line 215, in export_to_hdf5
    with h5py.File(filename, **kwargs) as handle:
  File "/opt/anaconda3/3.7/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/opt/anaconda3/3.7/lib/python3.7/site-packages/h5py/_hl/files.py", line 176, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 105, in h5py.h5f.create
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

What is the cause of this? Should the call to integrator.integrate somehow be told how many MPI processes to use, similar to openmc.run(mpi_args=['mpiexec', '-n', '32'])?

Hi Gavin,

It is very interesting that this error is raised on all MPI ranks, as only rank 0 should be writing to the file (see results.py#L211-L217 at commit 9e0bab3018c7578cace9652ac21c305f208783f3).

As for your question about telling the integrator how many MPI ranks to use: that information is obtained internally through an mpi4py communicator.
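
For context, here is a minimal sketch (plain mpi4py and h5py, not the actual results.py code) of the rank-0 guard in question and of how the rank information is obtained; if mpi4py is working, only one process should ever open the file:

import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.rank} of {comm.size}")

if comm.rank == 0:
    # Only the root process touches the results file
    with h5py.File("depletion_results.h5", "a") as handle:
        pass

If every rank reports “rank 0 of 1” when launched under mpirun, the communicator is not a real MPI one, and every process will try to create the file.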

I have a couple of investigative questions that should help determine what is (or isn’t) the problem:

  1. Does the failure occur at the first step, or at a later step? The file should be “freed” through the h5py.File context manager.
  2. Are other simulation results (e.g., the statepoint file) written?
  3. When you built this version of OpenMC, what compiler did you use? Specifically, was it an MPI compiler wrapper (e.g., mpicxx) provided by the cluster, or one from anaconda?

Regards,

Drew

Hello Andrew,

I am experiencing exactly the same error as Gavin while running a depletion with the most recent OpenMC development version. For me, it crashes after running the first transport calculation (I guess while the code is trying to create the updated h5 file with the new isotopic compositions of the depleted material).

However, if I run the OpenMC version from the master branch on the cluster, I have no problems whatsoever with an execution combining 4 MPI processes and 18 OMP threads. It is only while running the development branch that I encounter this issue. Of course, both OpenMC compilations were performed using exactly the same OMP, MPI, GCC, GXX, and PHDF5 libraries (as I mentioned to you yesterday, I just wanted to try the newest depletion capabilities from the development branch).

Best regards,

Augusto

Andrew, please disregard my previous message. I realized that before installing the new OpenMC Python wrapper from the development branch, I cleaned out the previous Python installation I used to have (from the master branch). By doing this, I accidentally erased the mpi4py libraries. After rebuilding them, my most recent installation of OpenMC is able to run on the cluster with MPI executions.

I would just suggest that Gavin compile and build the mpi4py libraries himself with the same MPI compiler he used to create the OpenMC executable in the first place.
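
For example (the exact paths and module names will differ on your cluster, so treat this as a sketch), you can force a source build of mpi4py against the cluster’s MPI compiler with something like:

MPICC=$(which mpicc) pip install --no-cache-dir --no-binary mpi4py mpi4py

The mpi4py build honors the MPICC environment variable, so this links it against the same MPI implementation used to build and launch the OpenMC executable.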

Best regards,

Augusto

Hi Augusto,

Thank you very much for your input. Upon further inspection, I can indeed confirm that my mpi4py is screwed up. In fact, it is nonexistent. How silly of me. I began maintaining this system somewhat recently, and assumed we had this installed. Guess not!

So, it seems that if a DummyCommunicator is used rather than a real MPI communicator, this error gets thrown during MPI-parallel OpenMC runs. Seeing as Augusto also ran into this error within such a short timeframe, it may be good to add an explicit check at the instantiation of DummyCommunicator that an MPI run is not being attempted. After installing mpi4py, the issue was completely resolved for me.

That’s a good point Gavin. Unfortunately I don’t think there’s a portable way of checking whether a script was launched normally or with mpirun/mpiexec. If you can think of a clever way though, I’d certainly welcome a pull request.

Hey Paul,

Yeah, I looked into this a bit, and it seems any environment variables or the like that could be used to detect this are platform- and MPI-implementation-dependent, as you say. I think it would be easiest to add a try/except block where this error gets thrown and attach an additional message warning the user to check their mpi4py installation, while still re-raising the original exception. What do you think of that?
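
Roughly what I have in mind (the function name here is illustrative, not the actual openmc.deplete code):

import h5py

def open_results_file(filename, **kwargs):
    # Hypothetical wrapper around the h5py.File call in results.py
    try:
        return h5py.File(filename, **kwargs)
    except OSError as err:
        raise OSError(
            f"Unable to open {filename}. If this is an MPI-parallel run, "
            "check that mpi4py is installed and built against the same "
            "MPI implementation used to launch the job."
        ) from err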

Hi @gridley and @andrewjohnson,

I am currently struggling with MPI parallelization for depletion. I installed the MPI-parallel version of OpenMC v0.12.0 from source, and it works perfectly well for transport calculations. Then I installed the Python API and mpi4py 3.0.3. When I tested it with the pin cell depletion sample problem using the command:
mpirun -n 4 python pindep.py
I always saw 4 independent processes running. An error comes up in the call to openmc.lib.init(intracomm=comm) in openmc/deplete/operator.py. The error message says:

[gl-login1:94265] *** Process received signal ***
[gl-login1:94265] Signal: Segmentation fault (11)
[gl-login1:94265] Signal code: Address not mapped (1)
[gl-login1:94265] Failing at address: 0x65561d88
[gl-login1:94274] *** Process received signal ***
[gl-login1:94274] Signal: Segmentation fault (11)
[gl-login1:94274] Signal code: Address not mapped (1)
[gl-login1:94274] Failing at address: 0x91e87d88
[gl-login1:94277] *** Process received signal ***
[gl-login1:94277] Signal: Segmentation fault (11)
[gl-login1:94277] Signal code: Address not mapped (1)
[gl-login1:94277] Failing at address: 0x28c81d88
[gl-login1:94278] *** Process received signal ***
[gl-login1:94278] Signal: Segmentation fault (11)
[gl-login1:94278] Signal code: Address not mapped (1)
[gl-login1:94278] Failing at address: 0xb207dd88

I put the full error output in the attached file (error_message.py, actually a text file). It seems the MPI environment was not properly set up for the OpenMC executable. Any comments on this? Is it crucial to maintain strict consistency of the dependent libraries and compile environment between OpenMC and its Python API?

error_message.py (12.1 KB)
pindep.py (1.4 KB)

Well, here’s one thing for us to start on. Have you ever installed any other versions of OpenMC on that system? This seems like something that would potentially happen if python code from one version of OpenMC was accidentally used with a libopenmc of a different version.

As another check, it would be good to see whether you get this error on the development branch of OpenMC if you pull it fresh from GitHub. Maybe this is an error that has been fixed since 0.12 came out.
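
One quick sanity check (generic commands; the output will of course depend on your system) is to confirm that the Python package and the compiled library come from the same build:

python -c "import openmc; print(openmc.__version__, openmc.__file__)"
openmc --version

If the reported versions or install locations are not what you expect, a stale installation is a likely culprit.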

@gridley Thanks for the hint. I got the source code and Python API with git, so they are from the same branch. I just tried a clean re-installation. To be clear, the build environment for OpenMC is:

  1. gcc/8.2.0
  2. cmake/3.13.2
  3. openmpi/4.0.4
  4. hdf5/1.8.21
  5. git/2.20.1

The Python API was built with pip using the package distributed along with the OpenMC source code. Third-party dependencies (numpy, pandas, h5py, mpi4py, etc.) were pre-installed in conda. Previously, I used “mpirun -n 4 python script.py”, but I should have loaded the mpi4py module with “mpirun -n 4 python -m mpi4py script.py”. With this correction, it seems MPI_COMM_WORLD is correctly initialized for Python parallelization. The problem is that the OpenMC initialization (_dll.openmc_init(argc, argv, intracomm) at line 208 of openmc/lib/core.py) does not finish. After each process reads the input files, a bad termination is encountered:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 249074 RUNNING AT gl-login1.arc-ts.umich.edu
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
@paulromano It turns out that all four processes tried to write summary.h5, causing conflicting I/O. In my understanding, and according to the source code (initialize.cpp), only the master process is supposed to do this. How can this happen?

Best and thanks in advance for your kind help.

One thing to be careful of is whether your HDF5 installation both has parallel capabilities and is compiled with the same MPI implementation as your mpirun. Installing h5py through conda may just give you the serial version, but I’m not sure about that.

If your custom HDF5 installation is already built against OpenMPI, then I believe you can have pip build just the Python library and link it against your hdf5-openmpi installation. Try pip install --no-binary=h5py h5py
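
One quick way to see what your current h5py was built against (run this in the same Python environment you use to launch the depletion):

import h5py
print(h5py.version.info)        # HDF5 version h5py was built against
print(h5py.get_config().mpi)    # True only if h5py has MPI support

If you end up needing an MPI-enabled h5py built against your OpenMPI, the h5py documentation describes setting HDF5_MPI=ON and CC=mpicc when running that pip command.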

Thank you @andrewjohnson. My HDF5 is non-parallel and pre-compiled with OpenMPI. The h5py installed by conda is also serial. Do you mean I have to use parallel HDF5 and h5py for parallel depletion?

@pdeng, can you explain what you mean by “My HDF5 is non-parallel and pre-compiled with OpenMPI”? Looking at what you have described so far, my instinct is that your HDF5 library is indeed fully serial and that this is causing issues with running OpenMC in parallel. But if it is compiled with OpenMPI, then you’ll need to build OpenMC against this specific installation of HDF5. Some additional instructions that might be helpful can be found at 2. Installation and Configuration — OpenMC Documentation.
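
In case it helps, one way (the path here is just an example) to point the OpenMC build at a specific HDF5 installation is to set HDF5_ROOT before running CMake:

HDF5_ROOT=/path/to/hdf5-openmpi cmake ..
make && make install

CMake’s HDF5 detection honors HDF5_ROOT, so the build should pick up the OpenMPI-enabled HDF5 rather than a serial one from conda.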

Best regards,

Andrew Johnson