Generic Particle Transport Code

Related Projects: STAPL: Standard Template Adaptive Parallel Library STAPL Applications STAPL Graph Library STAPL Components pAlgorithms pContainers pViews

Current Contributors: Yunhuang Zhang, Marvin Adams (Nuclear Engineering, TAMU), Nancy Amato, Jim Morel (Nuclear Engineering, TAMU), Lawrence Rauchwerger
Project Alumni: Daryl Hawkins, Timmie Smith, Nathan Thomas, Anthony Barbu, Tarek Ghaddar, Jonathan Madsen, Carolyn McGraw, Michael Adams, Ping An, Teresa Bailey, Jae Chang, Tao Huang, Josh Jarell, Alin Jula, Alex Maslowski, Mark Mathis, Olga Pearce, Silvius Rus, Armando Solar, Lidia Smith, Gabriel Tanase, Mauro Bianco

The overall goal of the project is the development of strategies that will produce the desired discrete-ordinates transport solutions in the lowest possible wall-clock time on computational platforms of interest to the DOE's Accelerated Strategic Computing Initiative (ASCI). The project began in 1998 with the best known sequential algorithms for solving the discrete-ordinates transport problem as starting points for devising algorithms and coding strategies to yield the best possible performance on modern parallel cache-based systems. To date the project has produced a transport code (PDT) that is capable of solving steady-state discrete-ordinates problems using regular hexahedral grids and multiple transport algorithms and acceleration methods. Work to allow the code to handle arbitrary grids and to solve time-dependent transport problems is in progress.

Problem Statement

The deterministic transport problem solved by PDT can be described briefly as:

Given:

A domain in an N-dimensional space made of known materials.
An initial flow of particles through the domain at a starting time.
An initial set of sources generating particles inside the domain.
Knowledge about the behavior at the domain boundary.

Compute: The flow of particles at a subsequent time at every point in the domain.

Code Overview

PDT is written in C++ using STAPL. The code represents the problem as a set of objects described below. The main function of the application uses the constructors and initialization methods of the objects to setup the problem and then calls the solve method of the problem class.

Generic Problem

We represent a problem as a system containing:

A spatial discretization (Grid).
A material repository.
An energy and angular discretization system.
A solver.

Generic Grid and ElementMap

The Grid object represents the topology of the given spatial discretization. It is made of cells.

In order to solve the problem we introduced a parallel topology based on computational elements rather than spatial ones. That is the ElementMap. It is made of elements.

Parallel Solver

We have already introduced the concept of ElementMap, which was created to ease the abstraction of solving the problem. There are other constructs that ease the development of solvers for parallel machines:

The Chunk
The Executor

Generic Chunk

The Chunk is composed of:

A set of spatial cells
A set of energy groups
A set of angles

It is the computation unit for sweeping algorithms. A chunk is the atomic unit of work in the sweep. For distributed memory systems the chunks are also communications atoms since messages carrying information to cells on different processors are buffered until all cells in the set are processed..

Partitioner and Scheduler

The Partitioner reads the problem size from the input file and determines the assignment of cells to each thread of execution. The Scheduler accepts a set of dependence graphs on Cells as input and aggregates the cells into cell sets. The Scheduler also produces dependence graphs based on cell sets and angle sets.

Generic Executor

The executor is implemented as a pAlgorithm in STAPL named p_for_all. The algorithm takes as input a set of dependency graphs on Chunks (which are represented using the STAPL pRange) and a generic function to be executed on every Chunk. It manages parallel execution by determining the next available Chunk to process on each thread based on the dependence graphs given. On distributed memory systems it manages the communication too by forcing any buffered messages to be sent after each call of the work function given as input.

Related Publications

Provably Optimal Parallel Transport Sweeps on Semi-Structured Grids, Michael P. Adams, Marvin L. Adams, W. Daryl Hawkins, Timmie G. Smith, Lawrence Rauchwerger, Nancy M. Amato, Teresa S. Bailey, Robert D. Falgout, Adam Kunen, Peter Brown, Journal of Computational Physics, Vol: 407, Apr 2020. DOI: 10.1016/j.jcp.2020.109234
Keywords: Parallel Algorithms, Scientific Computing
Links : [Published]

BibTex

@article{DBLP:journals/jcphy/AdamsAHSRABFKB20,
author = {Michael P. Adams and
Marvin L. Adams and
W. Daryl Hawkins and
Timmie G. Smith and
Lawrence Rauchwerger and
Nancy M. Amato and
Teresa S. Bailey and
Robert D. Falgout and
Adam Kunen and
Peter Brown},
title = {Provably optimal parallel transport sweeps on semi-structured grids},
journal = {J. Comput. Phys.},
volume = {407},
pages = {109234},
year = {2020},
url = {https://doi.org/10.1016/j.jcp.2020.109234},
doi = {10.1016/j.jcp.2020.109234},
timestamp = {Fri, 27 Mar 2020 14:16:32 +0100},
biburl = {https://dblp.org/rec/journals/jcphy/AdamsAHSRABFKB20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Abstract

We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse blocks. We describe these algorithms and show that they always execute the full eight-octant (or four-quadrant if 2D) sweep in the minimum possible number of stages for a given Px x Py x Pz partitioning. Computational results confirm that our optimal scheduling algorithms execute sweeps in the minimum possible stage count. Observed parallel efficiencies agree well with our performance model. Our PDT transport code has achieved approximately 68% parallel efficiency with > 1.5M parallel threads, relative to 8 threads, on a simple weak-scaling problem with only three energy groups, 10 direction per octant, and 4096 cells/core. We demonstrate similar efficiencies on a much more realistic set of nuclear-reactor test problems, with unstructured meshes that resolve fine geometric details. These results demonstrate that discrete-ordinates transport sweeps can be executed with high efficiency using more than 106 parallel processes.

Validation of Full-Domain Massively Parallel Transport Sweep Algorithms, eresa S. Bailey, W. Daryl Hawkins, Marvin L. Adams, Peter N. Brown, Adam J. Kunen, Michael P. Adams, Timmie Smith, Nancy M. Amato, and Lawrence Rauchwerger, Transactions of the American Nuclear Society, Vol: 111, Issue: 1, pp. 699-702, Anaheim, CA, USA, Nov 2014.
Keywords: Parallel Algorithms, Scientific Computing, STAPL
Links : [Published]

BibTex

@article{bailey--tans-2014
, author = "Teresa S. Bailey and W. Daryl Hawkins and Marvin L. Adams and Peter N. Brown and Adam J. Kunen and Michael P. Adams and Timmie Smith and Nancy M. Amato and Lawrence Rauchwerger"
, title = "Validation of Full-Domain Massively Parallel Transport Sweep Algorithms"
, journal = "Trans. American Nuclear Society"
, volume = "111"
, month = "November"
, year = "2014"
, pages = "699-702"
, note = "2014 ANS Annual Winter Meeting; Nov 9-13, 2014, Anaheim, CA, USA"
}

Abstract

The scalability of sweep algorithms has been a somewhat disputed topic recently. Some papers have predicted that these algorithms will scale if applied appropriately, while others have claimed that sweep algorithms have limited scalability. We compare the parallel performance models for sweep algorithms to actual computational performance using two distinctly different discrete ordinates transport codes. We have run both codes beyond 1Million MPI ranks on Lawrence Livermore National LaboratoryÃ¢ÂÂs (LLNL) BGQ machines (SEQUOIA and VULCAN).

Efficient Massively Parallel Transport Sweeps, W. Daryl Hawkins, Timmie Smith, Michael P. Adams, Lawrence Rauchwerger, Nancy Amato, Marvin L. Adams, Transactions of the American Nuclear Society, Vol: 107, Issue: 1, pp. 477-481, San Diego, CA, USA, Nov 2012.
Keywords: Parallel Algorithms, Scientific Computing, STAPL
Links : [Published]

BibTex

@article{hawkins-tans-2012
, author = "W. Daryl Hawkins and Timmie Smith and Michael P. Adams and Lawrence Rauchwerger and Nancy Amato and Marvin L. Adams"
, title = "Efficient Massively Parallel Transport Sweeps"
, journal = "Trans. American Nuclear Society"
, volume = "107"
, number = "1"
, month = "November"
, year = "2012"
, pages = "477-481"
, issn = "0003-018X"
, note = "2012 ANS Annual Winter Meeting ; Conference date: 11-11-2012 Through 15-11-2012",
}

Abstract

The full-domain ÃÂ¢ÃÂÃÂsweep,ÃÂ¢ÃÂÃÂ in which all angular fluxes in a problem are calculated given previous-iterate values only for the volumetric source, forms the foundation for many iterative methods that have desirable properties. One important property is that iteration counts do not grow with mesh refinement. The sweep solution on parallel machines is complicated by the dependency of a given cell on its upstream neighbors. A sweep algorithm is defined by its partitioning (dividing the domain among processors), aggregation (grouping cells, directions, and energy groups into ÃÂ¢ÃÂÃÂtasksÃÂ¢ÃÂÃÂ), and scheduling (choosing which task to execute if more than one is available). The work presented here follows that of Bailey and Falgout, who theoretically and computationally evaluated the performance of three sweep algorithms. Their ÃÂ¢ÃÂÃÂdata-drivenÃÂ¢ÃÂÃÂ schedule appeared to be optimalÃÂ¢ÃÂÃÂexecuting the sweep in the minimum possible number of stagesÃÂ¢ÃÂÃÂfor tested partitionings and aggregations, but they were unable to prove this mathematically and they did not attempt to optimize across possible partitionings and aggregations. Here we consider 3D Cartesian grids of Nx ÃÂÃÂ Ny ÃÂÃÂ Nz spatial cells and simple Px ÃÂÃÂ Py ÃÂÃÂ Pz partitioning, and we permit general aggregation of cells, directions, and energy groups. We have found a provably optimal family of scheduling algorithms and we present results from one of these. We exploit our guaranteed minimum stage count to further optimize sweep-execution time by choosing the best possible partitioning and aggregation parameters. Our results show excellent scaling out to 32,768 cores, significantly better than results previously reported for sweeps and in line with the optimistic projections.

A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations, Mark M. Mathis, Darren J. Kerbyson, Journal of Supercomputing, Vol: 34, Issue: 2, pp. 181-199, Nov 2005. DOI: 10.1007/s11227-005-2339-8
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@Article{Mathis2005,
author={Mathis, Mark M.
and Kerbyson, Darren J.},
title={A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations},
journal={The Journal of Supercomputing},
year={2005},
month={Nov},
day={01},
volume={34},
number={2},
pages={181-199},
abstract={The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, the handling of unstructured meshes has an increased complexity in order to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics when using large-scale parallel systems. In this work we present a general performance model for the calculation of deterministic SNtransport on unstructured meshes that is also applicable to structured meshes. The model captures the key processing characteristics of the calculation and is parametric using both system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. A single formulation of the model is used to predict the performance of two quite different implementations of the same calculation. It is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high prediction accuracy.},
issn={1573-0484},
doi={10.1007/s11227-005-2339-8},
url={https://doi.org/10.1007/s11227-005-2339-8}
}

Abstract

The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, the handling of unstructured meshes has an increased complexity in order to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics when using large-scale parallel systems. In this work we present a general performance model for the calculation of deterministic S N transport on unstructured meshes that is also applicable to structured meshes. The model captures the key processing characteristics of the calculation and is parametric using both system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. A single formulation of the model is used to predict the performance of two quite different implementations of the same calculation. It is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high prediction accuracy.

Performance Modeling of Unstructured Mesh Particle Transport Computations, Mark M. Mathis, Darren J. Kerbyson, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., pp. 245-, Santa Fe, NM, USA, Apr 2004. DOI: 10.1109/IPDPS.2004.1303300
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@INPROCEEDINGS{1303300,
author={M. M. {Mathis} and D. J. {Kerbyson}},
booktitle={18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.},
title={Performance modeling of unstructured mesh particle transport computations},
year={2004},
volume={},
number={},
pages={245-},
doi={10.1109/IPDPS.2004.1303300}}

Abstract

Summary form only given. The performance of unstructured mesh applications presents a number of complexities and subtleties that do not arise for dense structured meshes. From a programming point of view, handling an unstructured mesh has an increased complexity to manage the necessary data structures and interactions between mesh-cells. From a performance point of view, there are added difficulties in understanding both the processing time on a single processor and the scaling characteristics. Here we present a performance model for the calculation of deterministic S/sub N/ transport on unstructured meshes. It builds upon earlier work that successfully modeled the same calculation on structured meshes. The model captures the key processing characteristics and is parametric using both the system performance data (latency, bandwidth, processing rate etc.) and application data (mesh size etc.) as input. The model is validated on two clusters (an HP AlphaServer and an Itanium-2 system) showing high accuracy. Importantly it is also shown that a single formulation of the model can be used to predict the performance of two quite different implementations of the same calculation.

A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems, Mark M. Mathis, Darren J. Kerbyson, Adolfy Hoisie, Computational Science (ICCS 2003), Lecture Notes in Computer Science, Vol: 2659, Melbourne, Australia, Jun 2003. DOI: 10.1007/3-540-44863-2_89
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-44863-2_89,
author="Mathis, Mark M.
and Kerbyson, Darren J.
and Hoisie, Adolfy",
editor="Sloot, Peter M. A.
and Abramson, David
and Bogdanov, Alexander V.
and Gorbachev, Yuriy E.
and Dongarra, Jack J.
and Zomaya, Albert Y.",
title="A Performance Model of Non-deterministic Particle Transport on Large-Scale Systems",
booktitle="Computational Science --- ICCS 2003",
year="2003",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="905--915",
abstract="In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of a non-deterministic particle transport application, MCNP. Previous studies on the scalability of parallel Monte Carlo eigenvalue calculations have been rather general in nature [[1]]. It can be used for the simulation of neutron, photon, electron, or coupled transport, and has found uses in many problem areas. The performance model is validated against measurements on an AlphaServer ES40 system showing high accuracy across many processor / problem combinations. It is parametric with both application characteristics (e.g. problem size), and system characteristics (e.g. communication latency, bandwidth, achieved processing rate) serving as input. The model is used to provide insight into the achievable performance that should be possible on systems containing thousands of processors and to quantify the impact that possible improvements in sub-system performance may have. In addition, the impact on performance of modifying the communication structure of the code is also quantified.",
isbn="978-3-540-44863-1"
}

Abstract

In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of a non-deterministic particle transport application, MCNP. Previous studies on the scalability of parallel Monte Carlo eigenvalue calculations have been rather general in nature [[1]]. It can be used for the simulation of neutron, photon, electron, or coupled transport, and has found uses in many problem areas. The performance model is validated against measurements on an AlphaServer ES40 system showing high accuracy across many processor / problem combinations. It is parametric with both application characteristics (e.g. problem size), and system characteristics (e.g. communication latency, bandwidth, achieved processing rate) serving as input. The model is used to provide insight into the achievable performance that should be possible on systems containing thousands of processors and to quantify the impact that possible improvements in sub-system performance may have. In addition, the impact on performance of modifying the communication structure of the code is also quantified.

Finding Strongly Connected Components in Parallel in Particle Transport Sweeps, William McLendon, Bruce Hendrickson, Steve Plimpton, Lawrence Rauchwerger, Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 328-329, Crete Island, Greece, Jul 2001. DOI: 10.1145/378580.378751
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@inproceedings{10.1145/378580.378751,
author = {Mclendon, William and Hendrickson, Bruce and Plimpton, Steve and Rauchwerger, Lawrence},
title = {Finding Strongly Connected Components in Parallel in Particle Transport Sweeps},
year = {2001},
isbn = {1581134096},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/378580.378751},
doi = {10.1145/378580.378751},
abstract = {Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.},
booktitle = {Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures},
pages = {328–329},
numpages = {2},
location = {Crete Island, Greece},
series = {SPAA '01}
}

Abstract

Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.

A General Performance Model for Parallel Sweeps on Orthogonal Grids for Particle Transport Calculations, Mark M. Mathis, Masters Thesis, Department of Computer Science and Engineering, Texas A&M University, Dec 2000.
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@mastersthesis{CitekeyMastersthesis,
author = "Jian Tang",
title = "Spin structure of the nucleon in the asymptotic limit",
school = "Texas A&M University",
year = 2000,
address = "College Station, TX",
month = Dec
}

Abstract

There is a growing need to accurately simulate physical systems whose evolution depends on the transport of subatomic particles. It has long been recognized that the huge computational demands of the transport problem mean that practical solution times will be obtained only by the efficient utilization of parallel processing. For example, since estimates place the time devoted to particle transport in multi-physics simulations at 50-80% of total execution time, parallelizing deterministic particle transport calculations is an important problem in many applications targeted by the Accelerated Strategic Computing Initiative of the United States Department of Energy. One common approach to deterministic particle transport calculations is the discrete-ordinates method, whose most time consuming step is the transport sweep which involves multiple sweeps through the spatial grid, one for each direction of particle travel. The efficient parallel implementation of the transport sweeps is the key to parallelizing the discrete-ordinates method. The key contribution of this thesis is a new general model that can be used to compare the running times of transport sweeps on three-dimensional orthogonal grids for various mappings of the grid cells to processors. Our model, which includes machine-dependent parameters such as computation cost and communication latency, can be used to analyze and compare the effects of various spatial decompositions on the running time of the transport sweep. Insight obtained from the model yields two significant contributions to the theory of optimal transport sweeps on orthogonal grids. First, our model provides a theoretical basis that explains why, and under what circumstances, the column decomposition of the current standard KBA algorithm is superior to the 'balanced' decomposition obtained by classic domain decomposition techniques. Second, our model enables us to identify a new decomposition, which we call Hybrid, that proves to be almost as good as and often better than the current standard KBA method. We obtain expressions for the completion time and discuss theoretical results.