Long runtime with a large number of sources

Following Ben’s post, I built a full-core PWR model with ~5e6 sources placed as a 3D pin-wise distribution in the fuel. I was never able to get this case to run to completion (it appeared to freeze, and the job was killed when it hit the 6-hour walltime limit), so I reduced the number of sources to ~2e4, representing an assembly-wise distribution (i.e., I combined the sources for the 264 fuel rods in each assembly at each axial plane). This case ran successfully in 40 minutes on 2304 processors using 1e10 total particles.

To better quantify the penalty due to the number of sources, I ran a series of test cases using these two models with different numbers of processors and particles. For matching configurations, the model with the larger number of sources took between 172 and 202 times longer to complete than the model with fewer sources. This is on the order of the ratio of the number of sources between the two models (264).
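As a quick sanity check on these numbers, here is a back-of-the-envelope sketch in Python. It uses only the approximate figures quoted above (the ~5e6 count is rounded, and the variable names are just for illustration), so the results are indicative rather than exact.

```python
# Approximate figures taken from the description above.
rods_per_assembly = 264      # fuel rods combined per assembly per axial plane
n_pin_sources = 5e6          # pin-wise source count (rounded estimate)

# Collapsing 264 rods into one source per assembly per axial plane
# reduces the source count by the same factor.
n_assembly_sources = n_pin_sources / rods_per_assembly
print(f"assembly-wise sources: {n_assembly_sources:.3g}")

# Observed slowdown range of the pin-wise model relative to the assembly-wise model.
slowdown_low, slowdown_high = 172, 202

# Slowdown expressed as a fraction of the source-count ratio.
frac_low = slowdown_low / rods_per_assembly
frac_high = slowdown_high / rods_per_assembly
print(f"slowdown / source ratio: {frac_low:.2f} to {frac_high:.2f}")
```

The collapsed source count comes out close to the ~2e4 quoted above, and the observed slowdown is roughly 65% to 77% of the source-count ratio, which is what suggests to me that the runtime scales close to linearly with the number of sources.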

Is there a way to improve the efficiency/runtime with such a large number of sources? I would appreciate any feedback and suggestions.
-Andrew