Does OpenMC need special settings to achieve maximum computing speed when using multi-threaded computing?

Hello, everyone. I have recently purchased a computer with two Intel Xeon Platinum 8352V 36-core, 72-thread CPUs. To get better computing performance, I have also equipped it with four 32 GB memory modules and a 2 TB Samsung solid-state drive as the system disk.

When I run OpenMC, the process manager shows that all 144 threads are performing calculations. However, strangely, when comparing the same calculation example, the computing speed of this new computer is less than twice that of my personal computer. My personal computer only has an Intel Core i5-12400F 6-core, 12-thread CPU, a single 32 GB memory module, and a 500 GB solid-state drive.

There is a significant hardware and price difference between these two computers, but the speed has not improved as expected.

According to the Shared-Memory Parallelism (OpenMP) section of the user's guide, OpenMP is enabled by default and uses the maximum number of hardware threads. I also tried controlling it explicitly with openmc.run(threads=144), hoping it would make a difference, but it has not helped.
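For reference, a minimal sketch of what that call looks like from the Python API (the standard OpenMP environment variable OMP_NUM_THREADS should have the same effect):

```python
import openmc

# Explicitly request all 144 hardware threads; this passes `-s 144` to the
# openmc executable, which is equivalent to letting OpenMP default to every
# available hardware thread.
openmc.run(threads=144)
```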

I would like to ask whether this is reasonable and whether certain settings are required to achieve faster computation.

I have put together the calculation process and results for everyone to view.

Hi Harrison,

You need to split the computation across the NUMA nodes of the Xeon machine.
CPUs these days are split into chiplets and suffer reduced performance when they access memory that is not local to their own chiplet.
You will need an MPI-enabled build of OpenMC for this.

In Linux, use lscpu and count the number of NUMA nodes.
Divide your OpenMP threads by that number and use the NUMA node count as the number of MPI ranks.
So if there are 4 NUMA nodes for the dual Xeon Platinum 8352V, use
mpiexec -n 4 openmc -s 36
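If you launch OpenMC from Python instead of the command line, the same split can be expressed with openmc.run() and its mpi_args parameter. Here is a minimal sketch, assuming Open MPI (whose --bind-to numa option pins each rank to one NUMA node; other MPI implementations use different binding flags) and the 144-thread machine above:

```python
import glob
import openmc

# Count NUMA nodes on Linux (the same number that lscpu reports).
numa_nodes = len(glob.glob("/sys/devices/system/node/node[0-9]*")) or 1

# Split the 144 hardware threads evenly: one MPI rank per NUMA node,
# each running its own OpenMP thread team.
threads_per_rank = 144 // numa_nodes

openmc.run(
    mpi_args=["mpiexec", "-n", str(numa_nodes), "--bind-to", "numa"],
    threads=threads_per_rank,
)
```

With 4 NUMA nodes this amounts to mpiexec -n 4 openmc -s 36, matching the command above.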

See also:
https://docs.openmc.org/en/stable/usersguide/parallel.html?highlight=numa#maximizing-performance

I’d just like to add that I’ve tried running one MPI rank per NUMA node, but the scaling I saw was far from perfect. It has been a bit too challenging to figure out what’s going on at the moment.

Hi,
Bad scaling is also what I found for depletion calculations in Serpent. On the other hand, gamma-only problems in MCNP scale very well.
I also found that consumer CPUs are indeed much more bang for the buck: my 13900K system outruns a modern (2023) cluster node at our university that costs four times as much. Even memory is not that much of a problem anymore, since consumer PCs can handle 192 GB, perfect for memory-hungry depletion.