Parallel Sweeps for Particle Transport Calculations

Project Personnel: Timmie Smith, Nancy Amato, Lawrence Rauchwerger

Project Collaborators: Marvin Adams (Nuclear Engineering, TAMU)

Introduction

There is a growing need to accurately simulate physical systems whose evolutions depend on the transport of subatomic particles. It has long been recognized that the huge computational demands of the transport problem mean that practical solution times will be obtained only by the efficient utilization of parallel processing. For example, since estimates place the time devoted to particle transport in multi-physics simulations at 50-80% of total execution time, parallelizing deterministic particle transport calculations is an important problem in many applications targeted by the DOE's Accelerated Strategic Computing Initiative (ASCI). One common approach to deterministic particle transport calculations is the discrete-ordinates method, whose most time consuming step is the transport sweep which involves multiple sweeps through the spatial grid, one for each direction of particle travel. The efficient parallel implementation of the transport sweeps is the key to parallelizing the discrete-ordinates method.

Contribution

Our key contribution to this field is a new general model that can be used to compare the running times of transport sweeps on three-dimensional orthogonal grids for various mappings of the grid cells to processors. Our model, which includes machine-dependent parameters such as computation cost and communication latency, can be used to analyze and compare the effects of various spatial decompositions on the running time of the transport sweep. Insight obtained from the model yields two significant contributions to the theory of optimal transport sweeps on orthogonal grids. First, our model provides a theoretical basis which explains why, and under what circumstances, the column decomposition of the current standard KBA algorithm is superior to the `balanced' decomposition obtained by classic domain decomposition techniques. Second, our model enables us to identify a new decomposition, which we call Hybrid, that proves to be almost as good as and sometimes better than the current standard KBA method.

During each iteration, an ordered traversal through the spatial grid is performed for each direction of particle travel. A sweep of the grid begins at one of the eight corners, depending on the direction desired. On an arbitrary grid, there may be as many unique sweep orderings as there are directions. However, for the orthogonal grids considered in this work, there are only eight unique sweep orderings, one for each octant of directions. That is, if all directions in a given octant are processed together, then eight distinct traversals of the spatial grid must be performed in each iteration. The sweep for a given octant progresses from its starting corner across the grid to the opposite corner following a diagonal trajectory.

The constraining factor is that every cell behind a diagonal plane must be calculated before any cell ahead of the diagonal plane can be calculated. To simultaneously process all directions, there is one sweep plane for each octant of directions. It would appear that this inherently sequential nature of the sweep would seriously inhibit any parallel implementation of the basic sweep algorithm. However, while the progress of the diagonal sweep plane is sequential, every cell inside the sweep plane is independent of all other cells in the same plane. Therefore, all cells in a sweep plane may be processed in parallel. Furthermore, the sweeps for each direction are independent and can be processed in parallel.

The key factor determining the parallel efficiency of the sweeps is the mapping of the spatial domain onto the processors, that is, the assignment of the spatial cells to the processors. In this thesis, we develop a model that can be used to analyze transport sweeps on regular orthogonal grids for any `regular' mapping of the grid cells to processors. While the theory is general, our analysis will concentrate on three different methods for performing this decomposition: the column decomposition of the KBA (which is the current standard), a Volumetric method which uses a `balanced' decomposition (obtained by classic domain decomposition techniques), and then a Hybrid approach that combines aspects of the first two methods. We have selected these three decompositions since they represent the two extreme cases (the KBA's columns vs. the Volumetric's balanced sub-domains) and an intermediate case (Hybrid).

Inter-processor communication will be required to move data from an `upstream' cell on one processor to an adjacent `downstream' cell on another processor, where which cells are upstream and downstream depends on the sweep direction. During a sweep, each processor will calculate one or more contiguous planes of cells in its partition before it processes the next set of cells in its partition and before the next processor can process its adjacent set of cells. It is important to note that the time to process one block of cells is highly dependent on the partitioning method being used.

Theoretical Model

The total time required by the sweep algorithm, or completion time, T, is the collective sum of all computation and communication required on the critical path. The computation time is the summation of the computation required for each individual step. The communication time is the summation of all outstanding communication required to perform the transport sweep. This assumption limits communication time to cross-processor communication that cannot be pipelined. Combining these assumptions gives us an expression for the completion time, T (we find it useful to normalize the equation by dividing through by the grind time, omega).

where,

The performance of any method can be tuned using the block size parameter, k. Specifically, we want to choose a k such that the completion time is minimized. The optimal value of k can easily be found by minimizing the model equation with respect to k.

The key contribution of this thesis is a new general model which can be used to predict the running time of parallel sweeps on orthogonal grids for any regular mapping of the grid cells to processors. In particular, our model accounts for machine-dependent parameters such as computation and communication/latency costs (assumed constant for all grid cells) and is parameterized by the number of processors, the dimensions of the underlying spatial transport grid, and the dimensions of the coarse grid processor overlay (which determines the dimensions of the sub-domain assigned to each processor). Thus, our model can be used to analyze and compare the effects of various spatial decompositions on the running time of the transport sweep.

Insight obtained from the model yields the following contributions to the theory of optimal transport sweeps on orthogonal grids.

Our model provides a theoretical basis that explains why, and under what circumstances, the column decomposition of the current standard KBA algorithm is superior to the `balanced' decomposition' obtained by classic domain decomposition techniques.

Our model enables us to identify a new decomposition that proves to be almost as good as and often better than the current standard KBA method. We call this new decomposition the Hybrid method because it incorporates positive aspects of both the KBA and the balanced decomposition.

A more minor (but still potentially valuable) contribution of our work is a theoretical expression for the optimal `block size' parameter in the sweep method (the number of cells a processor should process before communicating).

The methods we have analyzed represent a family of algorithms for three dimensional sweeps. Given an input grid and a parallel computer system (i.e., number of processors and estimates of computation and communication costs for each grid cell), one can select the best method for the given configuration. Furthermore, one can minimize the completion time by selecting an optimal block size. Future work on this subject will involve correctly accounting for restarts in the execution of the transport sweep.

One Sweep Only

All Distinct Sweeps

Provably Optimal Parallel Transport Sweeps on Regular Grids, W Hawkins, Timmie Smith, Michael Adams, Lawrence Rauchwerger, Nancy Amato, Marvin Adams, Teresa Bailey, Robert Falgout, In Proc. Int. Conf. on Math. Meth. and Supercomp. for Nuc. App., Idaho, May 2013.
Proceedings(pdf, abstract)

Efficient Massively Parallel Transport Sweeps, W Hawkins, Timmie Smith, Michael Adams, Lawrence Rauchwerger, Nancy Amato, Marvin Adams, Trans. Amer. Nucl. Soc., 107(1):477-481, Nov 2012.
Journal(pdf)

Finding Strongly Connected Components in Parallel in Particle Transport Sweeps, William McLendon, Bruce Hendrickson, Steve Plimpton, Lawrence Rauchwerger, In Proc. ACM Symp. Par. Alg. Arch. (SPAA), pp. 328-329, Crete, Greece, Jul 2001.
Proceedings(ps, pdf, abstract)

A General Performance Model for Parallel Sweeps on Orthogonal Grids for Particle Transport Calculations, Mark M. Mathis, Masters Thesis, Department of Computer Science and Engineering, Texas A&M University, Dec 2000.
Masters Thesis(ps, pdf, abstract)

A General Performance Model for Parallel Sweeps on Orthogonal Grids for Particle Transport Calculations, Mark M. Mathis, Nancy M. Amato, Marvin Adams, In Proc. ACM Int. Conf. Supercomputing (ICS), pp. 255-263, Santa Fe, NM, May 2000. Also, Technical Report, TR00-004, Parasol Laboratory, Department of Computer Science, Texas A&M University, Dec 1999.
Proceedings(ps, pdf, abstract) Technical Report(ps, pdf, abstract)

Supported by NSF, DOE (ASCI ASAP Level 2 and 3 Programs)

Project Alumni:Michael Adams,W Hawkins,Bruce Hendrickson,Mark M. Mathis,William McLendon,Steve Plimpton