F. Xavier Trias

STG code


The STG code (named after Soria, Trias and Gorobets, its main developers) is a dedicated code for large-scale simulations of incompressible turbulent flows in rectangular domains with one periodic direction. The core of the code is the highly scalable hybrid direct-iterative Krylov–Schur–Fourier Decomposition (KSFD) Poisson solver [5], which has been designed and optimized for this particular domain configuration and is among the world's fastest solvers for this kind of application. The numerical algorithm is based on a fully conservative, staggered, fourth-order symmetry-preserving discretization [1] that provides high accuracy and unconditional stability for time-accurate simulations. Moreover, a fully explicit self-adaptive time-integration scheme [12] is used; compared with the classical CFL condition, it significantly improves the efficiency of the time integration.
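The stability argument behind the symmetry-preserving discretization can be made concrete with a small, self-contained sketch (my own illustration, second-order and 1D for brevity; the actual STG scheme is the staggered fourth-order one of [1]): if the discrete convective operator C is skew-symmetric, then u.(Cu) = 0, so convection does not contribute to the discrete kinetic-energy balance, whatever the time step or the mesh.

    // Illustration only (not STG code): a skew-symmetric convective operator
    // C (C = -C^T) satisfies u.(Cu) = 0, so discrete convection neither
    // creates nor destroys kinetic energy.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 16;                      // 1D periodic grid
        const double h = 1.0 / n;              // mesh spacing
        const double a = 1.3;                  // constant advecting velocity
        const double pi = std::acos(-1.0);

        // Second-order central convection: (C u)_i = a (u_{i+1} - u_{i-1}) / (2h)
        std::vector<std::vector<double>> C(n, std::vector<double>(n, 0.0));
        for (int i = 0; i < n; ++i) {
            C[i][(i + 1) % n]     += a / (2.0 * h);
            C[i][(i - 1 + n) % n] -= a / (2.0 * h);
        }

        // 1) Skew-symmetry check: max |C + C^T| should be ~0
        double skew = 0.0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                skew = std::max(skew, std::fabs(C[i][j] + C[j][i]));

        // 2) Energy check: u.(Cu) should be ~0 for an arbitrary field u
        std::vector<double> u(n);
        for (int i = 0; i < n; ++i)
            u[i] = std::sin(2.0 * pi * i / n) + 0.3 * std::cos(6.0 * pi * i / n);
        double e = 0.0;
        for (int i = 0; i < n; ++i) {
            double Cu_i = 0.0;
            for (int j = 0; j < n; ++j) Cu_i += C[i][j] * u[j];
            e += u[i] * Cu_i;
        }
        std::printf("max|C + C^T| = %.2e   u.(Cu) = %.2e\n", skew, e);
        return 0;
    }

In the full staggered scheme the convective operator is built from the (divergence-free) velocity field itself, but the same skew-symmetry, and hence the same energy argument, is preserved [1].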

The code has a multilevel MPI+OpenMP parallelization [11] that makes it possible to efficiently engage thousands of supercomputer nodes with modern multi-core CPUs. The parallelization consists of three levels:
  1. Specific MPI parallelization in the periodic direction, where the FFT is applied;
  2. Domain-decomposition MPI parallelization in the two other spatial directions;
  3. OpenMP parallelization throughout the code.
Each of these levels has specific bottlenecks that come from the KSFD solver. The MPI parallelization in the periodic direction uses a group data exchange for the FFT, which limits the number of blocks in this direction to 8 if the required efficiency is to be maintained.
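To make the role of the FFT concrete, the following sequential sketch (an illustration of mine, with a second-order operator instead of the fourth-order one used in STG) checks that the Fourier modes are eigenvectors of the periodic difference operator. This is what decouples the 3D Poisson equation into independent 2D problems, one per Fourier mode with eigenvalue lambda_k, and it is the redistribution of data among these per-mode problems that requires the group exchange mentioned above.

    // Why an FFT in the periodic direction decouples the Poisson equation
    // (illustration only): the Fourier modes v_k are eigenvectors of the
    // periodic second-difference operator D2, so each mode k yields an
    // independent 2D problem (L_xy + lambda_k I) u_k = f_k.
    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 8;               // points in the periodic direction
        const double dz = 1.0 / n;     // spacing
        const double pi = std::acos(-1.0);

        for (int k = 0; k < n; ++k) {
            // Fourier mode v_k and its analytical eigenvalue lambda_k
            std::vector<std::complex<double>> v(n);
            for (int j = 0; j < n; ++j)
                v[j] = std::exp(std::complex<double>(0.0, 2.0 * pi * k * j / n));
            const double lambda = (2.0 * std::cos(2.0 * pi * k / n) - 2.0) / (dz * dz);

            // Apply the periodic second-difference operator D2 to v_k
            double err = 0.0;
            for (int j = 0; j < n; ++j) {
                std::complex<double> d2 =
                    (v[(j + 1) % n] - 2.0 * v[j] + v[(j - 1 + n) % n]) / (dz * dz);
                err = std::max(err, std::abs(d2 - lambda * v[j]));
            }
            std::printf("k = %d  lambda_k = %10.2f  |D2 v_k - lambda_k v_k| = %.2e\n",
                        k, lambda, err);
        }
        return 0;
    }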
 
The domain decomposition is limited by a bottleneck of the direct Schur-complement-based solver: it operates with the inverted interface matrix, which grows with both the mesh size and the number of subdomains. For a mesh with a billion nodes, its efficient range is around 200 subdomains. To enlarge this limit, a special symmetry extension has been developed that decomposes the algebraic problem into 4 independent sub-problems solved by four groups of processes, allowing up to 800 subdomains to be used for a mesh with 8 billion nodes (an additional factor of 2 comes from the periodic direction).
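The principle behind the symmetry extension can be sketched with a toy example (my own, not the actual STG implementation): if the discrete operator is invariant under a mirror reflection of the domain, the mirror-symmetric and mirror-antisymmetric parts of the solution never couple, so one solve is replaced by two independent half-size solves; two such reflections yield the four independent sub-problems mentioned above.

    // Toy illustration of the symmetry-extension idea (not STG code):
    // if the matrix A is invariant under the mirror permutation J
    // (J A J = A), then A maps mirror-symmetric vectors to symmetric ones
    // and antisymmetric to antisymmetric, so A x = b splits into two
    // independent half-size problems.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 8;
        // 1D Dirichlet Laplacian: tridiagonal (-1, 2, -1), centrosymmetric,
        // i.e. invariant under mirror reflection of the index.
        std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));
        for (int i = 0; i < n; ++i) {
            A[i][i] = 2.0;
            if (i > 0)     A[i][i - 1] = -1.0;
            if (i < n - 1) A[i][i + 1] = -1.0;
        }

        // Split an arbitrary right-hand side into mirror-symmetric and
        // mirror-antisymmetric parts: bs[i] = bs[n-1-i], ba[i] = -ba[n-1-i].
        std::vector<double> b(n), bs(n), ba(n);
        for (int i = 0; i < n; ++i) b[i] = std::sin(1.0 + 0.7 * i);
        for (int i = 0; i < n; ++i) {
            bs[i] = 0.5 * (b[i] + b[n - 1 - i]);
            ba[i] = 0.5 * (b[i] - b[n - 1 - i]);
        }

        auto apply = [&](const std::vector<double>& x, int i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            return s;
        };

        // A*bs stays symmetric and A*ba stays antisymmetric, hence the two
        // symmetry classes can be solved for independently.
        double es = 0.0, ea = 0.0;
        for (int i = 0; i < n; ++i) {
            es = std::max(es, std::fabs(apply(bs, i) - apply(bs, n - 1 - i)));
            ea = std::max(ea, std::fabs(apply(ba, i) + apply(ba, n - 1 - i)));
        }
        std::printf("symmetry defect of A*bs: %.2e, antisymmetry defect of A*ba: %.2e\n",
                    es, ea);
        return 0;
    }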
 
The number of OpenMP threads is naturally limited by the number of cores per CPU socket, which is 8 in the case of MareNostrum. Except in extreme-scalability cases, we do not engage multiple sockets with one MPI process, in order to avoid losses due to cache coherence and the NUMA factor. Typically we use one or more MPI processes per socket with forced thread affinity.
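As a generic illustration of forced thread affinity (not the actual STG launch configuration), OpenMP threads can be pinned with the standard OMP_PLACES / OMP_PROC_BIND environment variables and the resulting placement verified from inside the program:

    // Report where each OpenMP thread runs (Linux/glibc: sched_getcpu).
    // A typical pinning for one MPI rank per 8-core socket would be, e.g.:
    //   export OMP_NUM_THREADS=8 OMP_PLACES=cores OMP_PROC_BIND=close
    // with the MPI launcher additionally binding ranks to sockets.
    // Build with: g++ -fopenmp affinity_report.cpp
    #include <cstdio>
    #include <omp.h>
    #include <sched.h>

    int main() {
        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            const int nt  = omp_get_num_threads();
            const int cpu = sched_getcpu();   // logical core the thread is on
            #pragma omp critical
            std::printf("thread %d of %d is running on core %d\n", tid, nt, cpu);
        }
        return 0;
    }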
 
Multiplying these three limiting numbers, 8 x 800 x 8, puts the limit of the efficient range at at least 51,200 CPU cores. Performance tests with up to 12,800 CPU cores (see the figure below), which was the maximum available at that time, can be found in [11], where the above-mentioned parallelization limits were studied separately and in detail.

Figure: STG-code speed-up.


The code also has a multilevel MPI+OpenMP+OpenCL parallelization [21]. This heterogeneous extension for GPGPUs is built on a multilevel parallel model that exploits the several layers of parallelism of a modern hybrid supercomputer. Massively parallel computing accelerators are engaged by means of the hardware-independent OpenCL computing standard. The performance of the code has been tested on various AMD and NVIDIA GPUs, and on multi-GPU runs with up to 16 GPUs in one simulation (which is by far not a limit for the code). On average, the speedup on a single GPU compared to a single 6- or 8-core CPU ranges between 2.5 and 4, depending on the particular models of CPU and accelerator.
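The hardware independence provided by OpenCL starts with vendor-neutral platform and device discovery; a minimal host-side sketch (not taken from the STG code) that enumerates the available GPUs, whether AMD or NVIDIA, looks like this:

    // Minimal OpenCL device discovery: the same host code enumerates AMD,
    // NVIDIA or other GPUs through the vendor-neutral OpenCL API.
    // Build e.g. with: g++ list_gpus.cpp -lOpenCL
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, nullptr, &nplat);
        std::vector<cl_platform_id> platforms(nplat);
        clGetPlatformIDs(nplat, platforms.data(), nullptr);

        for (cl_platform_id p : platforms) {
            char pname[256] = {0};
            clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);

            cl_uint ndev = 0;
            if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 0, nullptr, &ndev) != CL_SUCCESS || ndev == 0)
                continue;  // this platform exposes no GPU devices
            std::vector<cl_device_id> devices(ndev);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, ndev, devices.data(), nullptr);

            for (cl_device_id d : devices) {
                char dname[256] = {0};
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
                std::printf("platform \"%s\": GPU device \"%s\"\n", pname, dname);
            }
        }
        return 0;
    }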
 

Figure: Rayleigh-Bénard convection at Ra = 10^10.

The STG code has been used for direct numerical simulations (DNS) of turbulent flows for more than 10 years. Numerous simulations of fundamental problems, published in top-ranked international journals, attest to the reliability and correctness of the code and of our overall simulation technology. The most recent is a DNS of a Rayleigh-Bénard configuration at Rayleigh number Ra = 10^10 (see the figure above). Details about our previous simulations can be found in [1][3][6][7][13][17][22][25][28][31].

Movie gallery of DNSs of natural convection flows
Movie gallery of DNSs of forced convection flows

[1] R. W. C. P. Verstappen and A. E. P. Veldman, "Symmetry-preserving discretization of turbulent flow," Journal of Computational Physics, 187:343-368, 2003.






Contact: xavi@cttc.upc.edu / xavitrias@gmail.com