Huge difference in elapsed time between OpenMP and MPI parallel methods

Hello, everybody.
I ran into a strange problem when running OpenMC with different parallel methods on my computer: the elapsed time with OpenMP is much larger than with MPI, even though the only difference between the two cases is the parallel method.

What should I do to improve the parallel efficiency with OpenMP?

When I run OpenMC in parallel with the command openmc -s 40, the total time elapsed is 1.9025e+02 seconds:

=======================> TIMING STATISTICS <=======================
Total time for initialization = 7.4822e+00 seconds
Reading cross sections = 7.3256e+00 seconds
Total time in simulation = 1.8248e+02 seconds
Time in transport only = 1.7741e+02 seconds
Time in inactive batches = 8.9944e+00 seconds
Time in active batches = 1.7349e+02 seconds
Time synchronizing fission bank = 3.4782e-01 seconds
Sampling source sites = 3.0026e-01 seconds
SEND/RECV source sites = 4.6206e-02 seconds
Time accumulating tallies = 4.5518e+00 seconds
Time writing statepoints = 3.8154e-02 seconds
Total time for finalization = 6.9309e-02 seconds
Total time elapsed = 1.9025e+02 seconds
Calculation Rate (inactive) = 55590.3 particles/second
Calculation Rate (active) = 2882.02 particles/second

However, when I run OpenMC in parallel with the command mpiexec -n 40 openmc -s 1, the total time elapsed is only 4.8384e+01 seconds:

=======================> TIMING STATISTICS <=======================
Total time for initialization = 1.1176e+01 seconds
Reading cross sections = 1.0951e+01 seconds
Total time in simulation = 3.6932e+01 seconds
Time in transport only = 3.3904e+01 seconds
Time in inactive batches = 1.4050e+01 seconds
Time in active batches = 2.2883e+01 seconds
Time synchronizing fission bank = 2.6849e+00 seconds
Sampling source sites = 7.4304e-02 seconds
SEND/RECV source sites = 3.2699e-03 seconds
Time accumulating tallies = 2.4282e-01 seconds
Time writing statepoints = 7.1170e-02 seconds
Total time for finalization = 1.1638e-01 seconds
Total time elapsed = 4.8384e+01 seconds
Calculation Rate (inactive) = 35588.1 particles/second
Calculation Rate (active) = 21850.5 particles/second

OpenMC Version

Version: 0.13.0
Git SHA1: 160aeb54a382de0e22ee2da7cb724d7555103f40

The following is my computer information:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping: 7
CPU MHz: 800.061
CPU max MHz: 3900.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0-19
NUMA node1 CPU(s): 20-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

Could be the NUMA nodes: you typically want to keep your OMP job limited to a single NUMA space, otherwise it gets cache-latency effects as it addresses memory across the spaces. I’ve found this to be quite noticeable on AMD Ryzen-type CPUs (OMP performance crashing beyond one chiplet).
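
As a rough illustration (just a sketch, assuming numactl is installed and that NUMA node 0 covers CPUs 0-19 as in the lscpu output above), a pure OpenMP run could be pinned to one NUMA node with something like:

numactl --cpunodebind=0 --membind=0 openmc -s 20

Alternatively, the standard OpenMP binding environment variables can be used, e.g.

OMP_PLACES=cores OMP_PROC_BIND=close openmc -s 20

though whether the 20 threads all land on one socket with this approach depends on how the OpenMP runtime orders its places, so the numactl form is the more explicit of the two.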

Also, it looks like you have Hyper-Threading disabled. I have typically found that MCNP benefits from HT, and I am finding that OpenMC mostly shows the same behaviour with respect to HT.
There are some problems that will not benefit from HT; it is problem dependent.

In addition to the comments from @yrrepy (which I agree with), I would also encourage you to check out our user’s guide section on Maximizing Parallel Performance.

@yrrepy Thanks for your comments. I still have no idea how to keep an OMP job limited to a single NUMA space because I am not familiar with parallel technology; I can only use the simple commands referred to in the user guide :pensive:. Could you give me some examples of how to configure this for my own computer, or some bash commands that do what you described?

Yeah, I have disabled HT on my computer because some other software programs running on the same machine do not benefit from HT. But still, thanks for your suggestion.

All you need to do is pass an argument to mpiexec that tells it to bind the MPI process (and therefore all OpenMP threads within it) to a single NUMA node:

mpiexec --bind-to numa --map-by numa ...

The exact command-line options may vary by MPI implementation, so check the manual for whichever one you’re using (MPICH, Open MPI, etc.).
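
For example (just a sketch, using Open MPI-style options and the 2-socket / 2-NUMA-node layout shown above), launching one MPI rank per NUMA node with 20 OpenMP threads each would look roughly like:

mpiexec -n 2 --bind-to numa --map-by numa openmc -s 20

That way each group of 20 threads stays within its own NUMA node’s cores and memory, which should avoid the cross-socket effects mentioned earlier.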