Sailfish uses default settings that should make most simulations run at a reasonable speed on modern GPUs. There are, however, several tunable parameters that can be used to increase the performance of specific simulations. Finding the optimal values of these parameters requires some experimentation on a case-by-case basis. The guide below suggests some useful strategies. All of the following subsections assume that the CUDA backend is used.
Adjusting the block size is the simplest optimization you should apply when trying to increase the speed of a simulation. Every lattice Boltzmann node in Sailfish is processed in a separate GPU thread. These threads are grouped into 1-dimensional blocks, which allows them to exchange data. The default block size is 64. You can change it using the --block_size=N command line option. Values that are multiples of 32 should be the most effective, because the GPU hardware schedules threads in groups (warps) of 32.
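A block size sweep can be scripted. The following is a minimal sketch that only generates the command lines to try; the script name my_simulation.py and the helper names are hypothetical, and only the --block_size option comes from the text above.

```python
import shlex
import sys

WARP_SIZE = 32  # threads per warp on CUDA hardware


def block_size_candidates(largest, smallest=WARP_SIZE):
    """Return all multiples of the warp size between smallest and largest."""
    # Round the lower bound up to the next multiple of the warp size.
    first = -(-smallest // WARP_SIZE) * WARP_SIZE
    return list(range(first, largest + 1, WARP_SIZE))


def build_command(script, block_size, extra_args=()):
    """Build the argument list for one run of a simulation script."""
    return [sys.executable, script, f"--block_size={block_size}", *extra_args]


if __name__ == "__main__":
    # my_simulation.py is a placeholder for your own Sailfish script.
    for size in block_size_candidates(256):
        print(shlex.join(build_command("my_simulation.py", size)))
```

Each printed command can then be timed separately, for example with the shell's time utility, and the fastest block size kept for production runs.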
As a rough guideline, use lower block sizes for more complex models and higher block sizes for simpler ones (e.g. 128 for single fluid, but 64 for binary fluids). The more complex the LB model used, the more registers the generated GPU code will require. You can check the number of registers used by adding --cuda-kernel-stats to your command line options. The simulation will then output something similar to:
CollideAndPropagate: l:0 s:3072 r:30 occ:(1.000000 tb:4 w:32 l:regs)
LBMUpdateTracerParticles: l:0 s:0 r:17 occ:(0.250000 tb:8 w:8 l:device)
SetInitialConditions: l:0 s:0 r:19 occ:(1.000000 tb:4 w:32 l:warps)
Each line presents, in order: name of the CUDA kernel, number of bytes of local memory, number of bytes of shared memory, number of registers, occupancy, thread blocks per multiprocessor, warps per multiprocessor, name of the factor limiting occupancy. A large number of registers will limit the occupancy, which usually lowers the performance of the kernel. Aim for an occupancy of 0.5 or higher. You only need to optimize the occupancy of the kernels that are executed most often. In the example above, CollideAndPropagate is important because it is executed in every time step, whereas SetInitialConditions is irrelevant, as it is only used to initialize the simulation.
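If you compare many runs, it can help to parse this output automatically. The sketch below assumes the output has exactly the form shown in the example above; the function names are illustrative and not part of Sailfish.

```python
import re

# One kernel entry, in the field order described above:
# name: l:<local bytes> s:<shared bytes> r:<registers>
#       occ:(<occupancy> tb:<blocks/MP> w:<warps/MP> l:<limiting factor>)
KERNEL_STATS = re.compile(
    r"(?P<name>\w+): l:(?P<local>\d+) s:(?P<shared>\d+) r:(?P<regs>\d+) "
    r"occ:\((?P<occupancy>[\d.]+) tb:(?P<blocks>\d+) w:(?P<warps>\d+) "
    r"l:(?P<limit>\w+)\)"
)

SAMPLE = (
    "CollideAndPropagate: l:0 s:3072 r:30 occ:(1.000000 tb:4 w:32 l:regs)\n"
    "LBMUpdateTracerParticles: l:0 s:0 r:17 occ:(0.250000 tb:8 w:8 l:device)\n"
    "SetInitialConditions: l:0 s:0 r:19 occ:(1.000000 tb:4 w:32 l:warps)\n"
)


def parse_kernel_stats(text):
    """Return one dictionary per kernel entry found in text."""
    stats = []
    for match in KERNEL_STATS.finditer(text):
        stats.append({
            "name": match["name"],
            "local_bytes": int(match["local"]),
            "shared_bytes": int(match["shared"]),
            "registers": int(match["regs"]),
            "occupancy": float(match["occupancy"]),
            "blocks_per_mp": int(match["blocks"]),
            "warps_per_mp": int(match["warps"]),
            "limited_by": match["limit"],
        })
    return stats


def low_occupancy_kernels(stats, threshold=0.5):
    """Return the names of kernels whose occupancy is below threshold."""
    return [s["name"] for s in stats if s["occupancy"] < threshold]


if __name__ == "__main__":
    for entry in parse_kernel_stats(SAMPLE):
        print(entry["name"], entry["registers"], entry["occupancy"])
```

Whether a kernel flagged this way is worth tuning still depends on how often it runs, as discussed above.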
In order to increase the occupancy, you can force the CUDA compiler to use a lower number of registers in the GPU code. This can be done via the --cuda-nvcc-opts=--maxrregcount=X option, which will cause the compiler to limit the number of registers to X. If you use a low value of X, some of the variables in the kernel will be moved from registers to local memory (register spilling). Local memory is much slower than registers, however, so the net effect can be a performance degradation despite the higher occupancy. Experimentation is advised.
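To get a feeling for how the register count and the block size interact, the following simplified estimate may be useful. It is a rough sketch, not the calculation Sailfish or the CUDA tools perform: it ignores shared memory and register allocation granularity, and the hardware limits (32768 registers, 48 resident warps and 8 resident blocks per multiprocessor, as for compute capability 2.0) are assumptions that you should replace with the values for your device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceLimits:
    """Per-multiprocessor resource limits (assumed values, not queried)."""
    registers_per_mp: int
    max_warps_per_mp: int
    max_blocks_per_mp: int
    warp_size: int = 32


# Assumed limits for a compute capability 2.0 device.
FERMI_CC20 = DeviceLimits(registers_per_mp=32768,
                          max_warps_per_mp=48,
                          max_blocks_per_mp=8)


def estimate_occupancy(regs_per_thread, block_size, limits=FERMI_CC20):
    """Estimate occupancy from register usage and block size only."""
    # A partially filled warp still occupies a full warp slot.
    warps_per_block = -(-block_size // limits.warp_size)
    regs_per_block = regs_per_thread * warps_per_block * limits.warp_size
    # Number of resident blocks allowed by each resource.
    by_regs = limits.registers_per_mp // regs_per_block
    by_warps = limits.max_warps_per_mp // warps_per_block
    blocks = min(by_regs, by_warps, limits.max_blocks_per_mp)
    return blocks * warps_per_block / limits.max_warps_per_mp


if __name__ == "__main__":
    for regs in (20, 30, 63):
        for block in (64, 128, 192):
            occ = estimate_occupancy(regs, block)
            print(f"regs={regs:2d} block={block:3d} occupancy={occ:.2f}")
```

Under these assumptions, a kernel using 30 registers is limited by the number of resident blocks at a block size of 64, while at 63 registers the register file becomes the limiting factor. The numbers reported by --cuda-kernel-stats remain the authoritative ones.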
CUDA code can use a faster, but less precise version of several common mathematical functions (e.g. transcendental functions such as sine, cosine or the exponential function, as well as square root). These so-called intrinsic functions will be used if the fast math mode is turned on, which can be done using the --cuda-nvcc-opts=--use_fast_math command line option. This might slightly increase the speed of some of the more complex LB models. If you decide to apply this optimization, watch out for degraded precision (always run regression tests of your simulation) and increased register usage.
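A regression test for fast math can be as simple as comparing a field from a reference run with the same field from a fast math run. The sketch below assumes both results are already available as flat sequences of floats; how you export them from your simulation is up to you, and the function name and the tolerance are illustrative.

```python
def max_relative_error(reference, candidate, floor=1e-12):
    """Return the largest relative deviation of candidate from reference.

    floor guards against division by zero for reference values near 0.
    """
    if len(reference) != len(candidate):
        raise ValueError("both results must have the same number of values")
    worst = 0.0
    for ref, cand in zip(reference, candidate):
        scale = max(abs(ref), floor)
        worst = max(worst, abs(ref - cand) / scale)
    return worst


if __name__ == "__main__":
    # Made-up numbers standing in for, e.g., a density profile.
    precise = [1.000, 1.002, 0.998, 1.001]
    fast = [1.000, 1.002, 0.998, 1.0011]
    error = max_relative_error(precise, fast)
    tolerance = 1e-3  # choose a tolerance appropriate for your problem
    print("within tolerance" if error <= tolerance else "precision degraded")
```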
If you care about the CPU time used for running Sailfish simulations, use --cuda-sched-yield and --cuda-minimize-cpu-usage to reduce CPU usage at a slight cost in simulation performance. The first option makes the CUDA threads yield to other threads when waiting for the GPU, which can improve the performance of CPU code in high load situations. The second option instructs CUDA to use blocking synchronization. This will decrease the performance of the GPU simulation by increasing the latency of host synchronization operations. The decrease in CPU usage will be larger for bigger simulation domains and more complex models: the longer a step of the simulation takes, the lower the CPU usage.
Fermi devices are based on a new GPU architecture and can benefit from additional optimizations. The general guidelines presented above still apply.
By default, Sailfish uses a lower precision version of the division operator and square root function (same as in CUDA devices of compute capability 1.3 and lower). This helps with register usage. The lower precision versions can be turned off with the --cuda-fermi-highprec option.
Fermi devices have more multiprocessors than GPUs of the previous generation. The multiprocessors can also handle more threads, and have better scheduling capabilities (which makes it possible for the GPU to execute several different kernels simultaneously).
To fully take advantage of the available computational power, a larger block size will usually be necessary (typically twice as large as for devices of the previous generation, but make sure to check occupancy and register usage as well).
The usage of L1 cache is often detrimental to performance of the simulation, especially in double precision. Use --cuda-disable-l1 to disable the usage of L1 for caching global memory accesses.
The ECC functionality available in Tesla-class cards can decrease performance by 10-30% compared to memory without error correction. If you don’t need the functionality and are sure that your hardware is stable, you can disable ECC using the nvidia-smi tool. Note that this requires root access and a machine reboot, and as such might not be possible in a shared computing environment.