Parallel Implementation of CMAQ
1. General Structure of Data
There are several approaches to parallelizing an application, one of which is data parallelism. Data parallelism is a paradigm that decomposes the data into "equal" sections and distributes them among the allocated processors. Each processor works on the portion it owns. The CMAQ parallel implementation is based on this methodology.
The CMAQ model operates on a 4D data space (ncols, nrows, nlays, nspcs), and only the spatial domain, i.e. the column and row dimensions, is decomposed. When NPROCS processors are used to run CMAQ, NPCOL processors are assigned to the column dimension and NPROW processors are assigned to the row dimension (NPROCS = NPCOL x NPROW). If the column dimension is not divisible by NPCOL, the remainder is distributed as evenly as possible among the NPCOL processors, so that some processors hold one more column than others. The same applies to the row dimension. For example, given a 100 by 75 (column x row) data grid and six processors, with three processors along the column dimension and two processors along the row dimension, the subdomain sizes (NCOLS x NROWS) are: PE 0 is 34 x 38, PE 1 and PE 2 are 33 x 38, PE 3 is 34 x 37, and PE 4 and PE 5 are 33 x 37 (Fig. 1).
Figure 1. An example of domain decomposition.
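The subdomain sizes in this example can be reproduced with a short sketch. It is written in Python for illustration only and is not CMAQ's Fortran code. The function name subdomain_sizes is hypothetical, and two details are assumptions inferred from the example rather than stated rules: the extra columns or rows go to the lowest-numbered processors along each dimension, and processors are numbered along the column dimension first.

```python
def split(n, parts):
    """Split n cells into `parts` pieces; the first (n % parts) pieces get one extra cell."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def subdomain_sizes(ncols, nrows, npcol, nprow):
    """Return (pe, local_ncols, local_nrows) for every processor.

    Assumption: PE number = processor_row * npcol + processor_column.
    """
    col_sizes = split(ncols, npcol)
    row_sizes = split(nrows, nprow)
    return [
        (r * npcol + c, col_sizes[c], row_sizes[r])
        for r in range(nprow)
        for c in range(npcol)
    ]


# The 100 x 75 grid on 3 x 2 processors from the text.
sizes = subdomain_sizes(100, 75, 3, 2)
for pe, nc, nr in sizes:
    print("PE %d: %d x %d" % (pe, nc, nr))
```

With these assumptions the printed sizes match the values listed for PE 0 through PE 5.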
2. Interprocessor Communication
In some science processes, such as advection, a processor requires data from neighbouring processors (interprocessor communication) when the model runs on a distributed memory system. An interprocessor communication library, STENEX, was developed to provide a simple and robust interface for handling various kinds of near-neighbour communication. The near neighbours of a processor (blue block) are the processors adjacent to it in the eight major geographical directions: N, NE, E, SE, S, SW, W, and NW (Fig. 2).
Figure 2. Near-neighbour configuration.
As an illustration of interprocessor data access, consider the following piece of code executing on processor 2 with a 2x2, 4-processor domain decomposition. The calculation at the grid cell denoted by "X" requires the data denoted by red dots, which reside on near-neighbour processors 0 and 3 (Fig. 3).
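The near-neighbour definition can be sketched as a lookup over the processor grid. The sketch below is in Python for illustration and is not part of STENEX. The function name near_neighbours is hypothetical, and the numbering is an assumption: processors are counted along the column dimension first, starting at the south-west corner, which is consistent with the 2x2 example in this section where processor 2 takes data from processors 0 and 3.

```python
# Offsets (d_col, d_row) for the eight directions; rows are assumed to increase northward.
DIRECTIONS = {
    "N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
    "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1),
}


def near_neighbours(pe, npcol, nprow):
    """Map each direction to the neighbouring PE number, or None at the domain edge."""
    row, col = divmod(pe, npcol)
    result = {}
    for name, (dc, dr) in DIRECTIONS.items():
        c, r = col + dc, row + dr
        inside = 0 <= c < npcol and 0 <= r < nprow
        result[name] = r * npcol + c if inside else None
    return result


# Processor 2 in a 2 x 2 decomposition: neighbours exist only to the E, SE and S.
print(near_neighbours(2, 2, 2))
```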
DIMENSION DATA( NCOLS, NROWS )

DO J = 1, NROWS
   DO I = 1, NCOLS
      DATA( I,J ) = A( I+2,J ) * A( I,J-1 )
   END DO
END DO
Figure 3. An example of a calculation that requires data residing on nearby processors.
To facilitate interprocessor communication, "ghost" regions are used, i.e. DIMENSION DATA (NCOLS+2, NROWS+1), which adds two extra columns and one extra row. The thickness of the ghost region depends on the amount of overlap required by the algorithm.
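A small sketch can show why a ghost region two columns and one row thick is enough for the loop above. It is written in Python with 0-based indices, for illustration only, and does not use STENEX. All function names are hypothetical. Copying from the global array stands in for the actual message exchange, cells outside the domain are assumed to be 0, and the grid is assumed to divide evenly among the processors.

```python
def cell(a, i, j):
    """Global A at column i, row j; 0 outside the domain (assumed boundary value)."""
    if 0 <= j < len(a) and 0 <= i < len(a[0]):
        return a[j][i]
    return 0


def serial_update(a):
    """Reference result: DATA(I,J) = A(I+2,J) * A(I,J-1) on the whole grid."""
    return [[cell(a, i + 2, j) * cell(a, i, j - 1) for i in range(len(a[0]))]
            for j in range(len(a))]


def tile_with_ghost(a, c0, r0, nc, nr):
    """Local copy of A for one subdomain plus ghost cells:
    one extra row to the south and two extra columns to the east."""
    return [[cell(a, c0 + i, r0 - 1 + j) for i in range(nc + 2)]
            for j in range(nr + 1)]


def local_update(tile, nc, nr):
    """Same stencil using only local data; tile row j + 1 holds interior row j."""
    return [[tile[j + 1][i + 2] * tile[j][i] for i in range(nc)]
            for j in range(nr)]


def decomposed_update(a, npcol, nprow):
    """Run the stencil subdomain by subdomain and stitch the pieces together."""
    nrows, ncols = len(a), len(a[0])
    nc, nr = ncols // npcol, nrows // nprow
    out = [[None] * ncols for _ in range(nrows)]
    for pr in range(nprow):
        for pc in range(npcol):
            c0, r0 = pc * nc, pr * nr
            local = local_update(tile_with_ghost(a, c0, r0, nc, nr), nc, nr)
            for j in range(nr):
                out[r0 + j][c0:c0 + nc] = local[j]
    return out


A = [[1 + i + 10 * j for i in range(6)] for j in range(4)]
print(decomposed_update(A, 2, 2) == serial_update(A))
```

Once the ghost cells are filled, each subdomain computes its interior without further reference to its neighbours, and the stitched result agrees with the serial calculation.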
The Stencil Exchange (SE) Library is written in Fortran 90 using object-oriented techniques to handle various types of communication, with the objective of hiding the management of low-level data movement. In this version, the library is extended to handle two grid structures in the application: the coarse grid, which is denoted by the character 'c' or 'C', and the fine grid. In addition, SE currently addresses four types of communication: interior to ghost region (Fig. 4a), interior to interior (Fig. 4b), sub-section data redistribution (Fig. 4c), and selective data collection (Fig. 4d).
Figure 4. Four types of communication supported by the SE library.
Details of all four schemes are available upon request.
3. Parallel I/O
All I/O operations in CMAQ are handled by the IOAPI_3 library. However, the IOAPI_3 library was designed for serial code only and does not work in a distributed memory environment where domain decomposition has been used. The PARIO library was developed to perform file I/O for CMAQ applications running on distributed memory platforms. To maintain efficiency and correctness, we have developed parallel access routines that have equivalent functionality and are effectively layered on top of the standard IOAPI_3. The following IOAPI_3 routines have PARIO equivalents: READ3, INTERP3, WRITE3, CHECK3, OPEN3, CLOSE3, DESC3, M3ERR, M3EXIT, M3WARN. The convention adopted is to use a "P" prefix for the parallel version, hence POPEN3, PINTERP3, etc. Substitution of the PARIO subroutines is done in a precompilation step. Note that the interface (i.e., the argument list) of each IOAPI_3 routine and its PARIO equivalent is identical.
On the output side, all processors are required to send their portion of the data to processor 0, which stitches the sub-parts together and then writes the result to the file. This is considered a "pseudo" parallel I/O approach (Fig. 5).
Figure 5. Pseudo parallel I/O scheme.
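Because the argument lists are identical, the substitution amounts to a rename. The sketch below illustrates the idea with a regular expression in Python. It is not the mechanism CMAQ's build actually uses, the function name to_pario is hypothetical, and only uppercase routine names are handled.

```python
import re

# IOAPI_3 routines that have PARIO equivalents, as listed in the text.
PARIO_ROUTINES = [
    "READ3", "INTERP3", "WRITE3", "CHECK3", "OPEN3",
    "CLOSE3", "DESC3", "M3ERR", "M3EXIT", "M3WARN",
]

# Match whole identifiers only, so names such as FSREAD3 or POPEN3 are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(PARIO_ROUTINES) + r")\b")


def to_pario(source):
    """Prefix every listed IOAPI_3 routine name with 'P'; arguments stay unchanged."""
    return _PATTERN.sub(lambda m: "P" + m.group(1), source)


line = "IF ( .NOT. OPEN3( FNAME, FSREAD3, PNAME ) ) CALL M3EXIT( PNAME, 0, 0, MSG, 1 )"
print(to_pario(line))
```

Applying the function twice gives the same result as applying it once, since already-prefixed names no longer match.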
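The stitching step performed by processor 0 can be sketched as follows. The sketch is in Python for illustration only and is not PARIO code. The function name stitch and the tile layout (starting column, starting row, rows of values) are hypothetical, and the message passing that delivers the tiles to processor 0 is not shown.

```python
def stitch(tiles, ncols, nrows):
    """Assemble the full grid from tiles given as (col_start, row_start, rows)."""
    full = [[None] * ncols for _ in range(nrows)]
    for c0, r0, rows in tiles:
        for j, row in enumerate(rows):
            full[r0 + j][c0:c0 + len(row)] = row
    return full


# Four subdomains of a 4 x 3 (column x row) grid, as they might arrive at processor 0.
tiles = [
    (0, 0, [[1, 2], [5, 6]]),
    (2, 0, [[3, 4], [7, 8]]),
    (0, 2, [[9, 10]]),
    (2, 2, [[11, 12]]),
]
grid = stitch(tiles, 4, 3)
print(grid)
```

Only after the full grid is assembled does a single processor write it out, which is why this scheme is parallel in the computation but serial in the file access.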
Recently we have developed a true parallel I/O approach, which allows each processor to write its portion to the file simultaneously (Fig. 6) (Wong et al., 2015).
Figure 6. True parallel I/O scheme.
This approach has been incorporated into IOAPI version 3.2, and the parallel I/O scheme has been implemented in CMAQ 5.2. Users must turn on this feature at the model build step and link with IOAPI 3.2.
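The idea of each processor writing its own portion can be sketched with a plain binary file in which every subdomain is written at the offsets it owns. This is a Python illustration only. It does not use the IOAPI interface, the file layout and the function names are hypothetical, and the writes are issued one after another in a loop rather than concurrently.

```python
import os
import struct
import tempfile

ITEM = struct.calcsize("<d")  # bytes per value


def write_tile(path, ncols, c0, r0, rows):
    """Write one subdomain straight into the shared file at its own offsets."""
    with open(path, "r+b") as f:
        for j, row in enumerate(rows):
            f.seek(((r0 + j) * ncols + c0) * ITEM)
            f.write(struct.pack("<%dd" % len(row), *row))


def read_grid(path, ncols, nrows):
    """Read the whole file back as a list of rows."""
    with open(path, "rb") as f:
        flat = struct.unpack("<%dd" % (ncols * nrows), f.read())
    return [list(flat[j * ncols:(j + 1) * ncols]) for j in range(nrows)]


def demo():
    ncols, nrows = 4, 3
    tiles = [
        (0, 0, [[1, 2], [5, 6]]),
        (2, 0, [[3, 4], [7, 8]]),
        (0, 2, [[9, 10]]),
        (2, 2, [[11, 12]]),
    ]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"\0" * (ncols * nrows * ITEM))  # preallocate the file
        for c0, r0, rows in reversed(tiles):  # order does not matter
            write_tile(path, ncols, c0, r0, rows)
        return read_grid(path, ncols, nrows)
    finally:
        os.remove(path)


print(demo())
```

Because the regions written by different subdomains do not overlap, no processor has to collect or stitch the data before it reaches the file.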
Wong, D. C., Yang, C. E., Fu, J. S., Wong, K., and Gao, Y., “An approach to enhance pnetCDF performance in environmental modeling applications”, Geosci. Model Dev., 8, 1033-1046, 2015.