Problems reading/writing HDF5 files in calculations on a cluster machine

Dear all,
I’m having problems while running OpenMC on a cluster machine. The software, OpenMC 0.13.3, was compiled from source using MPICH 4.0.3 and HDF5 1.12.2 (HDF5 was compiled with “-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64”). The Python interface has not been installed. Jobs are submitted with Slurm.
[root@n0009 ~]# ldd /var/sw/openmc/build/exec/bin/openmc
linux-vdso.so.1 (0x00007ffc5d68a000)
libopenmc.so => /var/sw/openmc/build/exec/lib64/libopenmc.so (0x00007f4677ddc000)
libhdf5.so.1000 => /var/sw/lib/hdf5/hdf5-install/lib/libhdf5.so.1000 (0x00007f4677653000)
libz.so.1 => /lib64/libz.so.1 (0x00007f467743c000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f4677238000)
libhdf5_hl.so.1000 => /var/sw/lib/hdf5/hdf5-install/lib/libhdf5_hl.so.1000 (0x00007f4677015000)
libpng16.so.16 => /lib64/libpng16.so.16 (0x00007f4676de0000)
libmpicxx.so.12 => /var/sw/libmpi/mpich-install/lib/libmpicxx.so.12 (0x00007f4676bbf000)
libmpi.so.12 => /var/sw/libmpi/mpich-install/lib/libmpi.so.12 (0x00007f46732c5000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f4672f30000)
libm.so.6 => /lib64/libm.so.6 (0x00007f4672bae000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f4672976000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f467275e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f467253e000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4672179000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4678261000)
librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f4671f5e000)
libefa.so.1 => /lib64/libefa.so.1 (0x00007f4671d54000)
libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f4671b34000)
librt.so.1 => /lib64/librt.so.1 (0x00007f467192c000)
libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f4671709000)
libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f4671483000)
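
For completeness, HDF5 and OpenMC were configured roughly along these lines (a simplified sketch, not the exact commands; paths match the ldd output above):

# HDF5 1.12.2 built against MPICH 4.0.3
CC=/var/sw/libmpi/mpich-install/bin/mpicc \
CFLAGS="-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64" \
./configure --prefix=/var/sw/lib/hdf5/hdf5-install
make && make install

# OpenMC 0.13.3 pointed at the same MPICH and HDF5
CXX=/var/sw/libmpi/mpich-install/bin/mpicxx \
HDF5_ROOT=/var/sw/lib/hdf5/hdf5-install \
cmake -DCMAKE_INSTALL_PREFIX=/var/sw/openmc/build/exec ..
make && make install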

Originally, I was running OpenMC calculations with the .h5 libraries located in a repository shared between the different nodes of the cluster. In that configuration, the code got stuck while reading the .h5 libraries without dumping any error.
The problem was partially solved by using the ‘sbcast’ command to broadcast the .h5 libraries to the nodes, so that each node works on its own local copy. However, even in this case the solver gets stuck while writing the statepoint.h5 file. I suspect the problem is due to the read/write permissions of the .h5 files across different nodes.
As a further detail, the problem does not occur when running calculations directly on the node I’m logged into (without using Slurm).
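
For reference, the sbcast workaround in the batch script looks roughly like this (a simplified sketch; paths are placeholders, not our actual setup):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# broadcast a tarball of the .h5 library to node-local storage on every allocated node
sbcast /shared/xs/endfb_hdf5.tar /tmp/endfb_hdf5.tar
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 tar -xf /tmp/endfb_hdf5.tar -C /tmp

# point OpenMC at the node-local copy and run
export OPENMC_CROSS_SECTIONS=/tmp/endfb_hdf5/cross_sections.xml
srun /var/sw/openmc/build/exec/bin/openmc /path/to/model
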
Can you please help me solve this issue?

Cordially,
Matteo

Maybe this will help: link

Thanks @indesperate for the help.
The problem has been (partially) solved: it was related to the version of MPICH we were using to compile HDF5. See HDF5 stuck in read/write .h5 files written by OpenMC - HDF5 - HDF Forum (hdfgroup.org).

Anyway, we are still unable to write the statepoint files when running with several MPI processes.
OpenMC does not dump any error; it simply gets stuck while writing the source points in the statepoint file. At present, to get around the problem, we are disabling the writing of source points.
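
(In practice this just means turning off source-point writing in settings.xml; if I remember the schema correctly, the relevant fragment is something like:

<source_point>
  <write>false</write>
</source_point>

With the Python API this would be settings.sourcepoint = {'write': False}.)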
We are not yet sure whether this is a bug or a problem related to the version of HDF5 we are using.
Has anyone else encountered the same problem?

Thank you in advance,
Matteo

Did you compile OpenMC with:
-DHDF5_PREFER_PARALLEL=on

I’m exploring MPI and surface source h5 files myself right now.
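
For reference, roughly what I mean (paths are placeholders; adapt to your install):

# ask CMake's FindHDF5 to prefer a parallel HDF5 build, if one is available
CXX=mpicxx HDF5_ROOT=/var/sw/lib/hdf5/hdf5-install \
cmake -DHDF5_PREFER_PARALLEL=on ..
make

# quick check of whether the HDF5 you link against was built with MPI support
/var/sw/lib/hdf5/hdf5-install/bin/h5cc -showconfig | grep -i parallel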

I have problems when writing the statepoint file too, but I’m not sure we are hitting the same issue. I’m using PBS (Portable Batch System) to run my jobs. OpenMC works fine on the host computer, but it breaks when the job is dispatched to the compute nodes: it can’t read the HDF5-format nuclear data, so I have to set HDF5_USE_FILE_LOCKING=FALSE to run the job, and that option in turn causes problems when writing files. As the link above shows, recent HDF5 versions cause these problems on some file systems because the file is locked and never unlocked. One way to fix it is simply to restart the server :rofl: . Anyway, thanks for the reply; it helped a lot because I also ran into the MPICH problem.
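
Roughly what my PBS script does (simplified; names and paths are placeholders):

#PBS -N openmc_job
#PBS -l nodes=2:ppn=16

# work around the HDF5 file-locking hang when reading the nuclear data
export HDF5_USE_FILE_LOCKING=FALSE

cd $PBS_O_WORKDIR
mpirun -np 32 openmc /path/to/model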
