## Related Publications

**Provably Optimal Parallel Transport Sweeps on Semi-Structured Grids**, Michael P. Adams, Marvin L. Adams, W. Daryl Hawkins, Timmie G. Smith, Lawrence Rauchwerger, Nancy M. Amato, Teresa S. Bailey, Robert D. Falgout, Adam Kunen, Peter Brown, Journal of Computational Physics, Vol: 407, Apr 2020. DOI: 10.1016/j.jcp.2020.109234

**Keywords:** Parallel Algorithms, Scientific Computing

Links : [Published]

## BibTex

@article{DBLP:journals/jcphy/AdamsAHSRABFKB20,

author = {Michael P. Adams and

Marvin L. Adams and

W. Daryl Hawkins and

Timmie G. Smith and

Lawrence Rauchwerger and

Nancy M. Amato and

Teresa S. Bailey and

Robert D. Falgout and

Adam Kunen and

Peter Brown},

title = {Provably optimal parallel transport sweeps on semi-structured grids},

journal = {J. Comput. Phys.},

volume = {407},

pages = {109234},

year = {2020},

url = {https://doi.org/10.1016/j.jcp.2020.109234},

doi = {10.1016/j.jcp.2020.109234},

timestamp = {Fri, 27 Mar 2020 14:16:32 +0100},

biburl = {https://dblp.org/rec/journals/jcphy/AdamsAHSRABFKB20.bib},

bibsource = {dblp computer science bibliography, https://dblp.org}

}

## Abstract

We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse blocks. We describe these algorithms and show that they always execute the full eight-octant (or four-quadrant if 2D) sweep in the minimum possible number of stages for a given Px x Py x Pz partitioning. Computational results confirm that our optimal scheduling algorithms execute sweeps in the minimum possible stage count. Observed parallel efficiencies agree well with our performance model. Our PDT transport code has achieved approximately 68% parallel efficiency with > 1.5M parallel threads, relative to 8 threads, on a simple weak-scaling problem with only three energy groups, 10 direction per octant, and 4096 cells/core. We demonstrate similar efficiencies on a much more realistic set of nuclear-reactor test problems, with unstructured meshes that resolve fine geometric details. These results demonstrate that discrete-ordinates transport sweeps can be executed with high efficiency using more than 106 parallel processes.

** Validation of Full-Domain Massively Parallel Transport Sweep Algorithms**, eresa S. Bailey, W. Daryl Hawkins, Marvin L. Adams, Peter N. Brown, Adam J. Kunen, Michael P. Adams, Timmie Smith, Nancy M. Amato, and Lawrence Rauchwerger, Transactions of the American Nuclear Society, Vol: 111, Issue: 1, pp. 699-702, Anaheim, CA, USA, Nov 2014.

**Keywords:** Parallel Algorithms, Scientific Computing, STAPL

Links : [Published]

## BibTex

@article{bailey--tans-2014

, author = "Teresa S. Bailey and W. Daryl Hawkins and Marvin L. Adams and Peter N. Brown and Adam J. Kunen and Michael P. Adams and Timmie Smith and Nancy M. Amato and Lawrence Rauchwerger"

, title = "Validation of Full-Domain Massively Parallel Transport Sweep Algorithms"

, journal = "Trans. American Nuclear Society"

, volume = "111"

, month = "November"

, year = "2014"

, pages = "699-702"

, note = "2014 ANS Annual Winter Meeting; Nov 9-13, 2014, Anaheim, CA, USA"

}

## Abstract

The scalability of sweep algorithms has been a somewhat disputed topic recently. Some papers have predicted that these algorithms will scale if applied appropriately, while others have claimed that sweep algorithms have limited scalability. We compare the parallel performance models for sweep algorithms to actual computational performance using two distinctly different discrete ordinates transport codes. We have run both codes beyond 1Million MPI ranks on Lawrence Livermore National LaboratoryÃ¢ÂÂs (LLNL) BGQ machines (SEQUOIA and VULCAN).

**Efficient Massively Parallel Transport Sweeps**, W. Daryl Hawkins, Timmie Smith, Michael P. Adams, Lawrence Rauchwerger, Nancy Amato, Marvin L. Adams, Transactions of the American Nuclear Society, Vol: 107, Issue: 1, pp. 477-481, San Diego, CA, USA, Nov 2012.

**Keywords:** Parallel Algorithms, Scientific Computing, STAPL

Links : [Published]

## BibTex

@article{hawkins-tans-2012

, author = "W. Daryl Hawkins and Timmie Smith and Michael P. Adams and Lawrence Rauchwerger and Nancy Amato and Marvin L. Adams"

, title = "Efficient Massively Parallel Transport Sweeps"

, journal = "Trans. American Nuclear Society"

, volume = "107"

, number = "1"

, month = "November"

, year = "2012"

, pages = "477-481"

, issn = "0003-018X"

, note = "2012 ANS Annual Winter Meeting ; Conference date: 11-11-2012 Through 15-11-2012",

}

## Abstract

The full-domain ÃÂ¢ÃÂÃÂsweep,ÃÂ¢ÃÂÃÂ in which all angular fluxes in a problem are calculated given previous-iterate values only for the volumetric source, forms the foundation for many iterative methods that have desirable properties. One important property is that iteration counts do not grow with mesh refinement. The sweep solution on parallel machines is complicated by the dependency of a given cell on its upstream neighbors. A sweep algorithm is defined by its partitioning (dividing the domain among processors), aggregation (grouping cells, directions, and energy groups into ÃÂ¢ÃÂÃÂtasksÃÂ¢ÃÂÃÂ), and scheduling (choosing which task to execute if more than one is available). The work presented here follows that of Bailey and Falgout, who theoretically and computationally evaluated the performance of three sweep algorithms. Their ÃÂ¢ÃÂÃÂdata-drivenÃÂ¢ÃÂÃÂ schedule appeared to be optimalÃÂ¢ÃÂÃÂexecuting the sweep in the minimum possible number of stagesÃÂ¢ÃÂÃÂfor tested partitionings and aggregations, but they were unable to prove this mathematically and they did not attempt to optimize across possible partitionings and aggregations. Here we consider 3D Cartesian grids of Nx ÃÂÃÂ Ny ÃÂÃÂ Nz spatial cells and simple Px ÃÂÃÂ Py ÃÂÃÂ Pz partitioning, and we permit general aggregation of cells, directions, and energy groups. We have found a provably optimal
family of scheduling algorithms and we present results from one of these. We exploit our guaranteed minimum stage count to further optimize sweep-execution time by choosing the best possible partitioning and aggregation parameters. Our results show excellent scaling out to 32,768 cores, significantly better than results previously reported for sweeps and in line with the optimistic projections.

**A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations**, Mark M. Mathis, Darren J. Kerbyson, Journal of Supercomputing, Vol: 34, Issue: 2, pp. 181-199, Nov 2005. DOI: 10.1007/s11227-005-2339-8

**Keywords:** High-Performance Computing, Parallel Algorithms

Links : [Published]

## BibTex

@Article{Mathis2005,

author={Mathis, Mark M.

and Kerbyson, Darren J.},

title={A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations},

journal={The Journal of Supercomputing},

year={2005},

month={Nov},

day={01},

volume={34},

number={2},

pages={181-199},

abstract={The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, the handling of unstructured meshes has an increased complexity in order to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics when using large-scale parallel systems. In this work we present a general performance model for the calculation of deterministic SNtransport on unstructured meshes that is also applicable to structured meshes. The model captures the key processing characteristics of the calculation and is parametric using both system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. A single formulation of the model is used to predict the performance of two quite different implementations of the same calculation. It is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high prediction accuracy.},

issn={1573-0484},

doi={10.1007/s11227-005-2339-8},

url={https://doi.org/10.1007/s11227-005-2339-8}

}

## Abstract

The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, the handling of unstructured meshes has an increased complexity in order to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics when using large-scale parallel systems. In this work we present a general performance model for the calculation of deterministic S N transport on unstructured meshes that is also applicable to structured meshes. The model captures the key processing characteristics of the calculation and is parametric using both system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. A single formulation of the model is used to predict the performance of two quite different implementations of the same calculation. It is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high prediction accuracy.

**Performance Modeling of Unstructured Mesh Particle Transport Computations**, Mark M. Mathis, Darren J. Kerbyson, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., pp. 245-, Santa Fe, NM, USA, Apr 2004. DOI: 10.1109/IPDPS.2004.1303300

**Keywords:** High-Performance Computing, Parallel Algorithms

Links : [Published]

## BibTex

@INPROCEEDINGS{1303300,

author={M. M. {Mathis} and D. J. {Kerbyson}},

booktitle={18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.},

title={Performance modeling of unstructured mesh particle transport computations},

year={2004},

volume={},

number={},

pages={245-},

doi={10.1109/IPDPS.2004.1303300}}

## Abstract

Summary form only given. The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, handling an unstructured mesh has an increased complexity to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics. Here we present a performance model for the calculation of deterministic S/sub N/ transport on unstructured meshes. It builds upon earlier work that successfully modeled the same calculation on structured meshes. The model captures the key processing characteristics and is parametric using both the system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. The model is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high accuracy. Importantly it is also shown that a single formulation of the model can be used to predict the performance of two quite different implementations of the same calculation.

**A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems**, Mark M. Mathis, Darren J. Kerbyson, Adolfy Hoisie, Computational Science (ICCS 2003), Lecture Notes in Computer Science, Vol: 2659, Melbourne, Australia, Jun 2003. DOI: 10.1007/3-540-44863-2_89

**Keywords:** High-Performance Computing, Parallel Algorithms

Links : [Published]

## BibTex

@InProceedings{10.1007/3-540-44863-2_89,

author="Mathis, Mark M.

and Kerbyson, Darren J.

and Hoisie, Adolfy",

editor="Sloot, Peter M. A.

and Abramson, David

and Bogdanov, Alexander V.

and Gorbachev, Yuriy E.

and Dongarra, Jack J.

and Zomaya, Albert Y.",

title="A Performance Model of Non-deterministic Particle Transport on Large-Scale Systems",

booktitle="Computational Science --- ICCS 2003",

year="2003",

publisher="Springer Berlin Heidelberg",

address="Berlin, Heidelberg",

pages="905--915",

abstract="In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of a non-deterministic particle transport application, MCNP. Previous studies on the scalability of parallel Monte Carlo eigenvalue calculations have been rather general in nature [[1]]. It can be used for the simulation of neutron, photon, electron, or coupled transport, and has found uses in many problem areas. The performance model is validated against measurements on an AlphaServer ES40 system showing high accuracy across many processor / problem combinations. It is parametric with both application characteristics (e.g. problem size), and system characteristics (e.g. communication latency, bandwidth, achieved processing rate) serving as input. The model is used to provide insight into the achievable performance that should be possible on systems containing thousands of processors and to quantify the impact that possible improvements in sub-system performance may have. In addition, the impact on performance of modifying the communication structure of the code is also quantified.",

isbn="978-3-540-44863-1"

}

## Abstract

In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of a non-deterministic particle transport application, MCNP. Previous studies on the scalability of parallel Monte Carlo eigenvalue calculations have been rather general in nature [[1]]. It can be used for the simulation of neutron, photon, electron, or coupled transport, and has found uses in many problem areas. The performance model is validated against measurements on an AlphaServer ES40 system showing high accuracy across many processor / problem combinations. It is parametric with both application characteristics (e.g. problem size), and system characteristics (e.g. communication latency, bandwidth, achieved processing rate) serving as input. The model is used to provide insight into the achievable performance that should be possible on systems containing thousands of processors and to quantify the impact that possible improvements in sub-system performance may have. In addition, the impact on performance of modifying the communication structure of the code is also quantified.

**Finding Strongly Connected Components in Parallel in Particle Transport Sweeps**, William McLendon, Bruce Hendrickson, Steve Plimpton, Lawrence Rauchwerger, Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 328-329, Crete Island, Greece, Jul 2001. DOI: 10.1145/378580.378751

**Keywords:** High-Performance Computing, Parallel Algorithms

Links : [Published]

## BibTex

@inproceedings{10.1145/378580.378751,

author = {Mclendon, William and Hendrickson, Bruce and Plimpton, Steve and Rauchwerger, Lawrence},

title = {Finding Strongly Connected Components in Parallel in Particle Transport Sweeps},

year = {2001},

isbn = {1581134096},

publisher = {Association for Computing Machinery},

address = {New York, NY, USA},

url = {https://doi.org/10.1145/378580.378751},

doi = {10.1145/378580.378751},

abstract = {Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.},

booktitle = {Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures},

pages = {328–329},

numpages = {2},

location = {Crete Island, Greece},

series = {SPAA '01}

}

## Abstract

Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.

**A General Performance Model for Parallel Sweeps on Orthogonal Grids for Particle Transport Calculations**, Mark M. Mathis, Masters Thesis, Department of Computer Science and Engineering, Texas A&M University, Dec 2000.

**Keywords:** High-Performance Computing, Parallel Algorithms

Links : [Published]

## BibTex

@mastersthesis{CitekeyMastersthesis,

author = "Jian Tang",

title = "Spin structure of the nucleon in the asymptotic limit",

school = "Texas A&M University",

year = 2000,

address = "College Station, TX",

month = Dec

}

## Abstract

There is a growing need to accurately simulate physical systems whose evolution depends on the transport of subatomic particles. It has long been recognized that the huge computational demands of the transport problem mean that practical solution times will be obtained only by the efficient utilization of parallel processing. For example, since estimates place the time devoted to particle transport in multi-physics simulations at 50-80% of total execution time, parallelizing deterministic particle transport calculations is an important problem in many applications targeted by the Accelerated Strategic Computing Initiative of the United States Department of Energy. One common approach to deterministic particle transport calculations is the discrete-ordinates method, whose most time consuming step is the transport sweep which involves multiple sweeps through the spatial grid, one for each direction of particle travel. The efficient parallel implementation of the transport sweeps is the key to parallelizing the discrete-ordinates method. The key contribution of this thesis is a new general model that can be used to compare the running times of transport sweeps on three-dimensional orthogonal grids for various mappings of the grid cells to processors. Our model, which includes machine-dependent parameters such as computation cost and communication latency, can be used to analyze and compare the effects of various spatial decompositions on the running time of the transport sweep. Insight obtained from the model yields two significant contributions to the theory of optimal transport sweeps on orthogonal grids. First, our model provides a theoretical basis that explains why, and under what circumstances, the column decomposition of the current standard KBA algorithm is superior to the 'balanced' decomposition obtained by classic domain decomposition techniques. Second, our model enables us to identify a new decomposition, which we call Hybrid, that proves to be almost as good as and often better than the current standard KBA method. We obtain expressions for the completion time and discuss theoretical results.