Hello @paulromano @pshriwise @Shimwell
I am wondering if there have been any publications or background work, other than this ( Allow MOAB k-d tree to be configured by paulromano · Pull Request #2976 · openmc-dev/openmc · GitHub ), on benchmarking tally performance with unstructured meshes. We are currently working with the supercomputers FASTER and LAUNCH at Texas A&M, plus a desktop CPU, to compare OpenMP and MPI (OpenMPI) performance while tallying on unstructured meshes. A couple of the points we are looking at are:
- Figure out which setups work well with OpenMP on an AMD 7950X CPU but not on FASTER’s Intel Xeon 8352Y (a personal desktop CPU vs. an HPC node).
- Compare OpenMP vs. MPI performance for a small-mesh and a large-mesh setup, to show how OpenMP can perform as well as (or even better than) MPI for the small mesh but can’t keep up with MPI for the large mesh.
- Compare conda-provided OpenMC vs. manually built OpenMC with various optimization flags.
- Run the large-mesh setup with the best-performing OpenMC build to estimate how much time is needed to run the large-mesh experiment on LAUNCH.
We would like to see how performance is impacted, produce results that help find the optimal settings for large-mesh runs (MPI tends to run better, but with large meshes the per-process memory use becomes too large), and possibly develop methods to speed up performance.
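For concreteness, here is a minimal sketch (not our full input deck) of the kind of unstructured-mesh tally setup we mean, using the OpenMC Python API. The mesh file, score, and thread/rank counts are placeholders, and the geometry, materials, and settings XML are assumed to already exist.

```python
import openmc

# Unstructured mesh tally (mesh file and backend library are placeholders).
mesh = openmc.UnstructuredMesh('tally_mesh.h5m', library='moab')  # or library='libmesh'
tally = openmc.Tally(name='flux on unstructured mesh')
tally.filters = [openmc.MeshFilter(mesh)]
tally.scores = ['flux']
openmc.Tallies([tally]).export_to_xml()

# Shared-memory (OpenMP) run: thread count is a placeholder.
openmc.run(threads=16)

# Distributed-memory (MPI) run: launcher and rank count are placeholders.
openmc.run(mpi_args=['mpiexec', '-n', '4'])
```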
Sounds like a good study; I would be interested to see the results. I have not seen a public study on this.
I think there are lots of factors to consider:
- libMesh, DAGMC, or XDG for the unstructured mesh.
- Mesh size and element type.
- If you are doing an unstructured mesh with DAGMC surface geometry, whether Embree is included or not.
- The CPU, as some have AVX-512 and others don’t, so this can give different results.
- You can also do an unstructured mesh with CSG geometry instead of DAGMC surface geometry (see the sketch after this list).
- Delta tracking for unstructured meshes would also be interesting to look into.
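To illustrate the CSG vs. DAGMC point, here is a rough sketch (file names and the CSG cell are placeholders): the unstructured mesh tally is defined the same way in both cases; only the geometry changes. Whether Embree is used is decided when DAGMC/double-down is built, not in the Python API.

```python
import openmc

# The unstructured mesh tally is independent of how the geometry is defined.
mesh = openmc.UnstructuredMesh('tally_mesh.h5m', library='moab')
tally = openmc.Tally()
tally.filters = [openmc.MeshFilter(mesh)]
tally.scores = ['flux']

# Option A: DAGMC surface geometry (placeholder .h5m file).
dag_univ = openmc.DAGMCUniverse('dagmc_geometry.h5m')
# bounded_universe() wraps the DAGMC model in a bounding cell (recent OpenMC versions).
geometry_dagmc = openmc.Geometry(dag_univ.bounded_universe())

# Option B: the same problem described with plain CSG instead.
sphere = openmc.Sphere(r=100.0, boundary_type='vacuum')
cell = openmc.Cell(region=-sphere)
geometry_csg = openmc.Geometry([cell])
```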
Hi!
As @Shimwell said, there is quite a bit to consider here, and it can be difficult to make apples-to-apples comparisons across different architectures. I’ve been working on the XDG project and have been able to improve shared-memory parallel performance significantly by bypassing the MOAB API when gathering element connectivity and coordinate information. On the algorithmic front, moving to a method that walks element adjacencies, instead of the current method that performs many point-location operations for track segments, provided a big boost in serial performance as well. GPU implementations of DAGMC and unstructured mesh tallies are also underway in XDG, and the PUMI team at RPI (Jacob Merson and Fuad Hasan) has also been hard at work on some GPU implementations of unstructured mesh capabilities.
XDG repo – everything is a little bleeding edge right now but polishing is underway!
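To give a flavour of the adjacency-walking idea, here is a toy sketch (not XDG code) on a regular 2D grid, where "walking adjacencies" reduces to Amanatides–Woo style stepping between neighboring cells. On an unstructured mesh the step instead goes through the face shared with the current element, but the contrast is the same: one point-location query to find the starting element, then cheap neighbor-to-neighbor steps, rather than a tree search per track segment.

```python
import math

def traverse_grid(origin, direction, nx, ny, dx=1.0, dy=1.0):
    """Yield the (i, j) cells crossed by a ray, by stepping between neighbors.

    Toy stand-in for walking element adjacencies: only the first cell needs a
    point-location query; every subsequent cell is the face-adjacent neighbor.
    Assumes `origin` is inside the [0, nx*dx] x [0, ny*dy] domain and
    `direction` is a unit vector.
    """
    x, y = origin
    ux, uy = direction
    i, j = int(x // dx), int(y // dy)   # the single "point location" call
    step_i = 1 if ux > 0 else -1
    step_j = 1 if uy > 0 else -1
    # Ray distance to the next x-face / y-face crossing, and per-cell increments.
    t_max_x = ((i + (ux > 0)) * dx - x) / ux if ux != 0 else math.inf
    t_max_y = ((j + (uy > 0)) * dy - y) / uy if uy != 0 else math.inf
    t_delta_x = dx / abs(ux) if ux != 0 else math.inf
    t_delta_y = dy / abs(uy) if uy != 0 else math.inf

    while 0 <= i < nx and 0 <= j < ny:
        yield i, j
        if t_max_x < t_max_y:           # next crossing is an x-face: step in i
            i += step_i
            t_max_x += t_delta_x
        else:                           # next crossing is a y-face: step in j
            j += step_j
            t_max_y += t_delta_y

# A track through a 5x5 grid visits a connected chain of neighboring cells.
print(list(traverse_grid((0.1, 0.2), (math.cos(0.5), math.sin(0.5)), 5, 5)))
```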
I’d be happy to chat about where your interests fit in here! The more the merrier, as they say.
> Compare conda-provided OpenMC vs. manually built OpenMC with various optimization flags.
This item is particularly interesting to me. It’s something we often theorize about but haven’t had the bandwidth to look into in more detail. As one (perhaps outdated) example of default conda settings vs. manual builds: I know that something as simple as getting the MOAB CMake settings right (i.e. -DCMAKE_BUILD_TYPE=Unset/Debug/Release) can have a huge (~10x) impact on performance, and historically this variable has not been set by default in the project.
So, it would be really interesting to see what improvements a manual build could provide!
> Run the large-mesh setup with the best-performing OpenMC build to estimate how much time is needed to run the large-mesh experiment on LAUNCH.
I’m currently on the hunt for XDG challenge problems, so if this would be an appropriate mesh to collaborate on, I’d love to try it out!
Cheers,
Patrick