Home Page for Lawrence Rauchwerger | Parasol Laboratory


Lawrence Rauchwerger
Professor

Parasol Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA

url: http://parasollab.web.illinois.edu/~rwerger/
email:
office: 4114 Siebel Center
tel: (217) 244-0968



CV (pdf)

Research Projects
Publications

PACT 2011
PACT 2007
LCPC 2017
LCPC 2003

Publications (incomplete list)

Provably Optimal Parallel Transport Sweeps on Semi-Structured Grids, Michael P. Adams, Marvin L. Adams, W. Daryl Hawkins, Timmie G. Smith, Lawrence Rauchwerger, Nancy M. Amato, Teresa S. Bailey, Robert D. Falgout, Adam Kunen, Peter Brown, Journal of Computational Physics, Vol: 407, Apr 2020. DOI: 10.1016/j.jcp.2020.109234
Keywords: Parallel Algorithms, Scientific Computing
Links : [Published]

BibTex

@article{DBLP:journals/jcphy/AdamsAHSRABFKB20,
author = {Michael P. Adams and
Marvin L. Adams and
W. Daryl Hawkins and
Timmie G. Smith and
Lawrence Rauchwerger and
Nancy M. Amato and
Teresa S. Bailey and
Robert D. Falgout and
Adam Kunen and
Peter Brown},
title = {Provably optimal parallel transport sweeps on semi-structured grids},
journal = {J. Comput. Phys.},
volume = {407},
pages = {109234},
year = {2020},
url = {https://doi.org/10.1016/j.jcp.2020.109234},
doi = {10.1016/j.jcp.2020.109234},
timestamp = {Fri, 27 Mar 2020 14:16:32 +0100},
biburl = {https://dblp.org/rec/journals/jcphy/AdamsAHSRABFKB20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse blocks. We describe these algorithms and show that they always execute the full eight-octant (or four-quadrant if 2D) sweep in the minimum possible number of stages for a given Px × Py × Pz partitioning. Computational results confirm that our optimal scheduling algorithms execute sweeps in the minimum possible stage count. Observed parallel efficiencies agree well with our performance model. Our PDT transport code has achieved approximately 68% parallel efficiency with > 1.5M parallel threads, relative to 8 threads, on a simple weak-scaling problem with only three energy groups, 10 directions per octant, and 4096 cells/core. We demonstrate similar efficiencies on a much more realistic set of nuclear-reactor test problems, with unstructured meshes that resolve fine geometric details. These results demonstrate that discrete-ordinates transport sweeps can be executed with high efficiency using more than 10^6 parallel processes.


Bounded Asynchrony and Nested Parallelism for Scalable Graph Processing, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. Supercomputing (SC), Doctoral Showcase Poster, Denver, CO, Nov 2017.
Keywords: Approximate Algorithms, Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

N/A


Abstract

Processing large-scale graphs has become a critical component in a variety of fields, from scientific computing to social analytics. The irregular access pattern for graph workloads, coupled with complex graph structures and large data sizes makes efficiently executing parallel graph workloads challenging. In this dissertation, we develop two broad techniques for improving the performance of general parallel graph algorithms: bounded asynchrony and nested parallelism. Increasing asynchrony in a bounded manner allows one to avoid costly global synchronization at scale, while still avoiding the penalty of unbounded asynchrony including redundant work. Using this technique, we scale a BFS workload to 98,304 cores where traditional methods stop scaling. Additionally, asynchrony enables a new family of approximate algorithms for applications tolerant to fixed amounts of error. Representing graph algorithms in a nested parallel manner enables the full use of available parallelism inherent in graph algorithms, while efficiently managing communication through nested sections.


Fast Approximate Distance Queries in Unweighted Graphs using Bounded Asynchrony, Adam Fidel, Francisco Coral, Colton Riedel, Nancy M. Amato, Lawrence Rauchwerger, Workshop on Languages and Compilers for Parallel Computing (LCPC 2016). Lecture Notes in Computer Science, vol 10136, Springer, Cham, pp. 40-54, Rochester, NY, USA, Jan 2017. DOI: 10.1007/978-3-319-52709-3_4
Keywords: Approximate Algorithms, Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@inproceedings{fidel-lcpc2016,
author="Adam Fidel and Francisco Coral Sabido and Colton Riedel and Nancy M. Amato and Lawrence Rauchwerger",
title="Fast Approximate Distance Queries in Unweighted Graphs using Bounded Asynchrony",
year=2017,
booktitle="Languages and Compilers for Parallel Computing (LCPC 2016)",
series="Lecture Notes in Computer Science",
volume="10136",
pages="40--54",
publisher="Springer, Cham",
note = "DOI: https://doi.org/10.1007/978-3-319-52709-3_4"
}


Abstract

We introduce a new parallel algorithm for approximate breadth-first ordering of an unweighted graph by using bounded asynchrony to parametrically control both the performance and error of the algorithm. This work is based on the k-level asynchronous (KLA) paradigm that trades expensive global synchronizations in the level-synchronous model for local synchronizations in the asynchronous model, which may result in redundant work. Instead of correcting errors introduced by asynchrony and redoing work as in KLA, in this work we control the amount of work that is redone and thus the amount of error allowed, leading to higher performance at the expense of a loss of precision. Results of an implementation of this algorithm are presented on up to 32,768 cores, showing 2.27x improvement over the exact KLA algorithm and 3.8x improvement over the level-synchronous version with minimal error on several graph inputs.
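To illustrate the control flow described above (this is a sequential toy sketch, not the authors' STAPL implementation; the function name and structure are hypothetical), a k-level-asynchronous BFS lets updates propagate up to k levels within a superstep before synchronizing; the approximate variant never re-relaxes a visited vertex, so under real asynchrony distances may overshoot by a bounded amount:

```python
from collections import deque

def kla_bfs_approx(adj, source, k):
    """Toy k-level-asynchronous BFS (hypothetical sketch).
    Within one superstep, updates propagate up to k levels without a
    global synchronization; 'first writer wins' means visited vertices
    are never corrected, trading bounded error for fewer syncs.
    (In this sequential simulation the FIFO order happens to give exact
    distances; error only arises under true parallel asynchrony.)"""
    dist = {source: 0}
    frontier = [source]
    while frontier:                      # one iteration = one superstep
        next_frontier = []
        local = deque((v, 0) for v in frontier)
        while local:
            v, hops = local.popleft()
            for w in adj.get(v, ()):
                if w not in dist:        # approximate: first writer wins
                    dist[w] = dist[v] + 1
                    if hops + 1 < k:
                        local.append((w, hops + 1))   # stay in superstep
                    else:
                        next_frontier.append(w)       # defer to next one
        frontier = next_frontier
    return dist
```

Setting k = 1 recovers the level-synchronous model; larger k trades synchronization for potential redundant work, which is the knob the paper controls.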


Asynchronous Nested Parallelism for Dynamic Applications in Distributed Memory, Ioannis Papadopoulos, Nathan Thomas, Adam Fidel, Dielli Hoxha, Nancy M. Amato, Lawrence Rauchwerger, In Wkshp. on Lang. and Comp. for Par. Comp. (LCPC), pp. 106-121, Chapel Hill, NC, Feb 2016. DOI: 10.1007/978-3-319-29778-1_7
Keywords: Runtime Systems, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-319-29778-1_7,
author="Papadopoulos, Ioannis
and Thomas, Nathan
and Fidel, Adam
and Hoxha, Dielli
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Shen, Xipeng
and Mueller, Frank
and Tuck, James",
title="Asynchronous Nested Parallelism for Dynamic Applications in Distributed Memory",
booktitle="Languages and Compilers for Parallel Computing",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="106--121",
abstract="Nested parallelism is of increasing interest for both expressivity and performance. Many problems are naturally expressed with this divide-and-conquer software design approach. In addition, programmers with target architecture knowledge employ nested parallelism for performance, imposing a hierarchy in the application to increase locality and resource utilization, often at the cost of implementation complexity.",
isbn="978-3-319-29778-1"
}


Abstract

Nested parallelism is of increasing interest for both expressivity and performance. Many problems are naturally expressed with this divide-and-conquer software design approach. In addition, programmers with target architecture knowledge employ nested parallelism for performance, imposing a hierarchy in the application to increase locality and resource utilization, often at the cost of implementation complexity. While dynamic applications are a natural fit for the approach, support for nested parallelism in distributed systems is generally limited to well-structured applications engineered with distinct phases of intra-node computation and inter-node communication. This model makes expressing irregular applications difficult and also hurts performance by introducing unnecessary latency and synchronizations. In this paper we describe an approach to asynchronous nested parallelism which provides uniform treatment of nested computation across distributed memory. This approach allows efficient execution while supporting dynamic applications which cannot be mapped onto the machine in the rigid manner of regular applications. We use several graph algorithms as examples to demonstrate our library’s expressivity, flexibility, and performance.


An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms (Conference Best Paper Finalist), Harshvardhan, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. IEEE Int.Conf. on Parallel Architectures and Compilation Techniques (PACT), pp. 201-212, San Francisco, CA, Oct 2015. DOI: 10.1109/PACT.2015.34
Keywords: Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@INPROCEEDINGS{7429306,
author={ {Harshvardhan} and A. {Fidel} and N. M. {Amato} and L. {Rauchwerger}},
booktitle={2015 International Conference on Parallel Architecture and Compilation (PACT)},
title={An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms},
year={2015},
pages={201-212}
}


Abstract

Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. This work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to utilize algorithmic communication reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, which we coin algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithm-aware coarsening of messages sent during vertex visitation, reducing both the number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take into consideration the machine's memory hierarchy. We also present an optimization for small-world scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similar hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on 131,000+ cores show improvements of up to a factor of 8 over the non-hierarchical version for various graph mining and graph analytics algorithms.
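The coarsening idea can be sketched in a few lines (hypothetical names; this is not the STAPL-GL API): group per-edge messages by the partition that owns the destination, so an identical payload crosses the network once per partition rather than once per edge.

```python
from collections import defaultdict

def coarsen(updates, owner):
    """Toy sketch of algorithm-aware message coarsening.
    `updates` is a list of (src, dst, payload) fine-grained messages;
    `owner[dst]` gives the partition owning vertex dst.  Identical
    payloads from the same source collapse into one entry per
    destination partition."""
    buckets = defaultdict(lambda: defaultdict(set))
    for src, dst, payload in updates:
        buckets[owner[dst]][(src, payload)].add(dst)
    return buckets

# A hub vertex 0 sends the same value to 4 neighbors on 2 partitions:
owner = {1: 0, 2: 0, 3: 1, 4: 1}
updates = [(0, n, 42) for n in (1, 2, 3, 4)]
msgs = coarsen(updates, owner)
# 2 coarse messages now replace 4 fine-grained ones
```

The hub-vertex optimization in the paper pushes the same idea further by representing high-degree vertices hierarchically; this sketch shows only the per-partition aggregation step.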


Finding schedule-sensitive branches, Jeff Huang, Lawrence Rauchwerger, Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, (ESEC/FSE), pp. 439-449, Bergamo, Italy, Aug 2015. DOI: 10.1145/2786805.2786840
Keywords: Adaptive Algorithm
Links : [Published]

BibTex

@inproceedings{DBLP:conf/sigsoft/HuangR15,
author = {Jeff Huang and
Lawrence Rauchwerger},
editor = {Elisabetta Di Nitto and
Mark Harman and
Patrick Heymans},
title = {Finding schedule-sensitive branches},
booktitle = {Proceedings of the 2015 10th Joint Meeting on Foundations of Software
Engineering, {ESEC/FSE} 2015, Bergamo, Italy, August 30 - September
4, 2015},
pages = {439--449},
publisher = {{ACM}},
year = {2015},
url = {https://doi.org/10.1145/2786805.2786840},
doi = {10.1145/2786805.2786840},
timestamp = {Tue, 06 Nov 2018 16:59:22 +0100},
biburl = {https://dblp.org/rec/conf/sigsoft/HuangR15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

This paper presents an automated, precise technique, TAME, for identifying schedule-sensitive branches (SSBs) in concurrent programs, i.e., branches whose decision may vary depending on the actual scheduling of concurrent threads. The technique consists of 1) tracing events at fine-grained level; 2) deriving the constraints for each branch; and 3) invoking an SMT solver to find possible SSB, by trying to solve the negated branch condition. To handle the infeasibly huge number of computations that would be generated by the fine-grained tracing, TAME leverages concolic execution and implements several sound approximations to delimit the number of traces to analyse, yet without sacrificing precision. In addition, TAME implements a novel distributed trace partition approach distributing the analysis into smaller chunks. Evaluation on both popular benchmarks and real applications shows that TAME is effective in finding SSBs and has good scalability. TAME found a total of 34 SSBs, among which 17 are related to concurrency errors, and 9 are ad hoc synchronizations.


Composing Algorithmic Skeletons to Express High-Performance Scientific Applications, Mani Zandifar, Mustafa Abdujabbar, Alireza Majidi, David Keyes, Nancy M. Amato, Lawrence Rauchwerger, In Proc. ACM Int. Conf. Supercomputing (ICS), pp. 415-424, Newport Beach, CA, USA (Conference Best Paper Award), Jun 2015. DOI: 10.1145/2751205.2751241
Keywords: Algorithmic Skeletons, Data Flow, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/2751205.2751241,
author = {Zandifar, Mani and Abdul Jabbar, Mustafa and Majidi, Alireza and Keyes, David and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {Composing Algorithmic Skeletons to Express High-Performance Scientific Applications},
year = {2015},
isbn = {9781450335591},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2751205.2751241},
doi = {10.1145/2751205.2751241},
abstract = {Algorithmic skeletons are high-level representations for parallel programs that hide the underlying parallelism details from program specification. These skeletons are defined in terms of higher-order functions that can be composed to build larger programs. Many skeleton frameworks support efficient implementations for stand-alone skeletons such as map, reduce, and zip for both shared-memory systems and small clusters. However, in these frameworks, expressing complex skeletons that are constructed through composition of fundamental skeletons either requires complete reimplementation or suffers from limited scalability due to required global synchronization. In the STAPL Skeleton Framework, we represent skeletons as parametric data flow graphs and describe composition of skeletons by point-to-point dependencies of their data flow graph representations. As a result, we eliminate the need for reimplementation and global synchronizations in composed skeletons. In this work, we describe the process of translating skeleton-based programs to data flow graphs and define rules for skeleton composition. To show the expressivity and ease of use of our framework, we show skeleton-based representations of the NAS EP, IS, and FT benchmarks. To show reusability and applicability of our framework on real-world applications we show an N-Body application using the FMM (Fast Multipole Method) hierarchical algorithm. Our results show that expressivity can be achieved without loss of performance even in complex real-world applications.},
booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing},
pages = {415–424},
numpages = {10},
keywords = {high-performance computing, data flow programming, distributed systems, algorithmic skeletons, patterns},
location = {Newport Beach, California, USA},
series = {ICS '15}
}


Abstract

Algorithmic skeletons are high-level representations for parallel programs that hide the underlying parallelism details from program specification. These skeletons are defined in terms of higher-order functions that can be composed to build larger programs. Many skeleton frameworks support efficient implementations for stand-alone skeletons such as map, reduce, and zip for both shared-memory systems and small clusters. However, in these frameworks, expressing complex skeletons that are constructed through composition of fundamental skeletons either requires complete reimplementation or suffers from limited scalability due to required global synchronization. In the STAPL Skeleton Framework, we represent skeletons as parametric data flow graphs and describe composition of skeletons by point-to-point dependencies of their data flow graph representations. As a result, we eliminate the need for reimplementation and global synchronizations in composed skeletons. In this work, we describe the process of translating skeleton-based programs to data flow graphs and define rules for skeleton composition. To show the expressivity and ease of use of our framework, we show skeleton-based representations of the NAS EP, IS, and FT benchmarks. To show reusability and applicability of our framework on real-world applications we show an N-Body application using the FMM (Fast Multipole Method) hierarchical algorithm. Our results show that expressivity can be achieved without loss of performance even in complex real-world applications.
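The "higher-order functions that can be composed" idea can be made concrete with a minimal sketch (hypothetical names; the STAPL Skeleton Framework is C++ and additionally wires composed skeletons as point-to-point data flow graphs, which this sequential toy does not model):

```python
from functools import reduce

def seq_map(f):
    # 'map' skeleton: a higher-order function returning a runnable stage
    return lambda xs: [f(x) for x in xs]

def seq_reduce(op, init):
    # 'reduce' skeleton
    return lambda xs: reduce(op, xs, init)

def compose(*skeletons):
    # Composition feeds each stage's output to the next.  In the STAPL
    # framework this wiring becomes point-to-point dependencies between
    # the stages' data flow graphs rather than a global barrier.
    def composed(x):
        for s in skeletons:
            x = s(x)
        return x
    return composed

sum_of_squares = compose(seq_map(lambda x: x * x),
                         seq_reduce(lambda a, b: a + b, 0))
# sum_of_squares([1, 2, 3]) -> 14
```

The paper's contribution is precisely that such compositions need no reimplementation and no global synchronization between the map and the reduce.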


STAPL-RTS: An Application Driven Runtime System, Ioannis Papadopoulos, Nathan Thomas, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. ACM Int. Conf. Supercomputing (ICS), pp. 425-434, Newport Beach, CA, USA, Jun 2015. DOI: 10.1145/2751205.2751233
Keywords: Communication Library, Runtime Systems, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/2751205.2751233,
author = {Papadopoulos, Ioannis and Thomas, Nathan and Fidel, Adam and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {STAPL-RTS: An Application Driven Runtime System},
year = {2015},
isbn = {9781450335591},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2751205.2751233},
doi = {10.1145/2751205.2751233},
abstract = {Modern HPC systems are growing in complexity, as they move towards deeper memory hierarchies and increasing use of computational heterogeneity via GPUs or other accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage all machine resources, writing programs decorated with low level primitives from multiple APIs (e.g. Hybrid MPI / OpenMP applications). Though seemingly necessary for efficient execution, it is an inherently non-scalable way to write software. Without a separation of concerns, only small programs written by expert developers actually achieve this efficiency. Furthermore, the implementations are rigid, difficult to extend, and not portable. Alternatively, users can adopt higher level programming environments to abstract away these concerns. Extensibility and portability, however, often come at the cost of lost performance. The mapping of a user's application onto the system now occurs without the contextual information that was immediately available in the more coupled approach.In this paper, we describe a framework for the transfer of high level, application semantic knowledge into lower levels of the software stack at an appropriate level of abstraction. Using the STAPL library, we demonstrate how this information guides important decisions in the runtime system (STAPL-RTS), such as multi-protocol communication coordination and request aggregation. Through examples, we show how generic programming idioms already known to C++ programmers are used to annotate calls and increase performance.},
booktitle = {Proceedings of the 29th ACM on International Conference on Supercomputing},
pages = {425–434},
numpages = {10},
keywords = {data flow, shared memory, remote method invocation, distributed memory, parallel programming, runtime systems, application driven optimizations},
location = {Newport Beach, California, USA},
series = {ICS '15}
}


Abstract

Modern HPC systems are growing in complexity, as they move towards deeper memory hierarchies and increasing use of computational heterogeneity via GPUs or other accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage all machine resources, writing programs decorated with low level primitives from multiple APIs (e.g. Hybrid MPI / OpenMP applications). Though seemingly necessary for efficient execution, it is an inherently non-scalable way to write software. Without a separation of concerns, only small programs written by expert developers actually achieve this efficiency. Furthermore, the implementations are rigid, difficult to extend, and not portable. Alternatively, users can adopt higher level programming environments to abstract away these concerns. Extensibility and portability, however, often come at the cost of lost performance. The mapping of a user's application onto the system now occurs without the contextual information that was immediately available in the more coupled approach. In this paper, we describe a framework for the transfer of high level, application semantic knowledge into lower levels of the software stack at an appropriate level of abstraction. Using the STAPL library, we demonstrate how this information guides important decisions in the runtime system (STAPL-RTS), such as multi-protocol communication coordination and request aggregation. Through examples, we show how generic programming idioms already known to C++ programmers are used to annotate calls and increase performance.


A Hybrid Approach To Processing Big Data Graphs on Memory-Restricted Systems, Harshvardhan, Brandon West, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. Int. Par. and Dist. Proc. Symp. (IPDPS), pp. 799-808, Hyderabad, India, May 2015. DOI: 10.1109/IPDPS.2015.28
Keywords: Big Data, Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@INPROCEEDINGS{7161566,
author={ {Harshvardhan} and B. {West} and A. {Fidel} and N. M. {Amato} and L. {Rauchwerger}},
booktitle={2015 IEEE International Parallel and Distributed Processing Symposium},
title={A Hybrid Approach to Processing Big Data Graphs on Memory-Restricted Systems},
year={2015},
pages={799-808}
}


Abstract

With the advent of big-data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques that can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16,000+ cores.
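The paging-like structure can be sketched as follows (a hypothetical toy, not the paper's system): each partition loader stands in for a disk read, so only one subgraph's edges are resident while a global result accumulates.

```python
def out_of_core_degrees(partition_loaders):
    """Toy sketch of RAM-Disk hybrid processing.  Each loader returns
    one subgraph's edge list ("paged in" from disk); the edges fall out
    of scope after processing ("paged out"), so peak memory is one
    subgraph plus the accumulated result."""
    deg = {}
    for load in partition_loaders:
        for u, _v in load():              # subgraph paged in
            deg[u] = deg.get(u, 0) + 1    # accumulate out-degrees
        # edge list is dropped here, i.e. paged out
    return deg

parts = [lambda: [(0, 1), (0, 2)], lambda: [(1, 2), (2, 0)]]
# out_of_core_degrees(parts) -> {0: 2, 1: 1, 2: 1}
```

The key property the paper exploits is that the algorithm body (the inner loop) is unchanged; only the iteration over partitions is added around it.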


A Hierarchical Approach to Reducing Communication in Parallel Graph Algorithms, Harshvardhan, Nancy M. Amato, Lawrence Rauchwerger, In Proc. ACM SIGPLAN Symp. Prin. Prac. Par. Prog. (PPOPP), pp. 285-286, San Francisco, CA, USA, Feb 2015. DOI: 10.1145/2688500.2700994
Keywords: Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/2688500.2700994,
author = {Harshvardhan and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {A Hierarchical Approach to Reducing Communication in Parallel Graph Algorithms},
year = {2015},
isbn = {9781450332057},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2688500.2700994},
doi = {10.1145/2688500.2700994},
abstract = { Large-scale graph computing has become critical due to the ever-increasing size of data. However, distributed graph computations are limited in their scalability and performance due to the heavy communication inherent in such computations. This is exacerbated in scale-free networks, such as social and web graphs, which contain hub vertices that have large degrees and therefore send a large number of messages over the network. Furthermore, many graph algorithms and computations send the same data to each of the neighbors of a vertex. Our proposed approach recognizes this, and reduces communication performed by the algorithm without change to user-code, through a hierarchical machine model imposed upon the input graph. The hierarchical model takes advantage of locale information of the neighboring vertices to reduce communication, both in message volume and total number of bytes sent. It is also able to better exploit the machine hierarchy to further reduce the communication costs, by aggregating traffic between different levels of the machine hierarchy. Results of an implementation in the STAPL GL shows improved scalability and performance over the traditional level-synchronous approach, with 2.5\texttimes{}-8\texttimes{} improvement for a variety of graph algorithms at 12,000+ cores. },
booktitle = {Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {285–286},
numpages = {2},
keywords = {Distributed Computing, Graph Analytics, Parallel Graph Processing, Big Data},
location = {San Francisco, CA, USA},
series = {PPoPP 2015}
}

@article{10.1145/2858788.2700994,
author = {Harshvardhan and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {A Hierarchical Approach to Reducing Communication in Parallel Graph Algorithms},
year = {2015},
issue_date = {August 2015},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {50},
number = {8},
issn = {0362-1340},
url = {https://doi.org/10.1145/2858788.2700994},
doi = {10.1145/2858788.2700994},
abstract = { Large-scale graph computing has become critical due to the ever-increasing size of data. However, distributed graph computations are limited in their scalability and performance due to the heavy communication inherent in such computations. This is exacerbated in scale-free networks, such as social and web graphs, which contain hub vertices that have large degrees and therefore send a large number of messages over the network. Furthermore, many graph algorithms and computations send the same data to each of the neighbors of a vertex. Our proposed approach recognizes this, and reduces communication performed by the algorithm without change to user-code, through a hierarchical machine model imposed upon the input graph. The hierarchical model takes advantage of locale information of the neighboring vertices to reduce communication, both in message volume and total number of bytes sent. It is also able to better exploit the machine hierarchy to further reduce the communication costs, by aggregating traffic between different levels of the machine hierarchy. Results of an implementation in the STAPL GL shows improved scalability and performance over the traditional level-synchronous approach, with 2.5\texttimes{}-8\texttimes{} improvement for a variety of graph algorithms at 12,000+ cores. },
journal = {SIGPLAN Not.},
month = jan,
pages = {285–286},
numpages = {2},
keywords = {Distributed Computing, Parallel Graph Processing, Big Data, Graph Analytics}
}


Abstract

Large-scale graph computing has become critical due to the ever-increasing size of data. However, distributed graph computations are limited in their scalability and performance due to the heavy communication inherent in such computations. This is exacerbated in scale-free networks, such as social and web graphs, which contain hub vertices that have large degrees and therefore send a large number of messages over the network. Furthermore, many graph algorithms and computations send the same data to each of the neighbors of a vertex. Our proposed approach recognizes this, and reduces communication performed by the algorithm without change to user-code, through a hierarchical machine model imposed upon the input graph. The hierarchical model takes advantage of locale information of the neighboring vertices to reduce communication, both in message volume and total number of bytes sent. It is also able to better exploit the machine hierarchy to further reduce the communication costs, by aggregating traffic between different levels of the machine hierarchy. Results of an implementation in the STAPL GL shows improved scalability and performance over the traditional level-synchronous approach, with 2.5×-8× improvement for a variety of graph algorithms at 12,000+ cores.


Scalable conditional induction variables (CIV) analysis, Cosmin E. Oancea, Lawrence Rauchwerger, Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 213-224, San Francisco, California, USA, Feb 2015. DOI: 10.1109/CGO.2015.7054201
Keywords: Parallel Programming
Links : [Published]

BibTex

@inproceedings{DBLP:conf/cgo/OanceaR15,
author = {Cosmin E. Oancea and
Lawrence Rauchwerger},
editor = {Kunle Olukotun and
Aaron Smith and
Robert Hundt and
Jason Mars},
title = {Scalable conditional induction variables {(CIV)} analysis},
booktitle = {Proceedings of the 13th Annual {IEEE/ACM} International Symposium
on Code Generation and Optimization, {CGO} 2015, San Francisco, CA,
USA, February 07 - 11, 2015},
pages = {213--224},
publisher = {{IEEE} Computer Society},
year = {2015},
url = {https://doi.org/10.1109/CGO.2015.7054201},
doi = {10.1109/CGO.2015.7054201},
timestamp = {Wed, 16 Oct 2019 14:14:57 +0200},
biburl = {https://dblp.org/rec/conf/cgo/OanceaR15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as filter, or stack operations and pose significant challenges to automatic parallelization. Because the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow-sensitive technique that summarizes both such CIV-based and affine subscripts to program level, using the same representation. Our technique requires no modifications of our dependence tests, which are agnostic to the original shape of the subscripts, and is more powerful than previously reported dependence tests that rely on the pairwise disambiguation of read-write references. We have implemented the CIV analysis in our parallelizing compiler and evaluated its impact on five Fortran benchmarks. We have found that there are many important loops using CIV subscripts and that our analysis can lead to their scalable parallelization. This in turn has led to the parallelization of the benchmark programs they appear in.
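A hypothetical loop of the kind the abstract describes (illustrative only; the paper analyzes Fortran, and the names here are made up): the output subscript `civ` is incremented only on some iterations, so it has no closed-form expression in the loop index `i`.

```python
def compress(a):
    """Filter-style loop with a conditional induction variable.
    `civ` advances only when the condition holds, so the subscript
    out[civ] cannot be rewritten as an affine function of i, which is
    what defeats classical dependence analysis."""
    out = [None] * len(a)
    civ = 0
    for i in range(len(a)):
        if a[i] > 0:
            out[civ] = a[i]   # CIV-based subscript
            civ += 1          # conditional increment: no closed form in i
    return out[:civ]

# compress([3, -1, 4, -1, 5]) -> [3, 4, 5]
```

The paper's analysis summarizes the range such a `civ` can take across the iteration space, letting the dependence test prove the writes to `out` are disjoint and the loop parallelizable.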


Efficient, Reachability-based, Parallel Algorithms for Finding Strongly Connected Components, Daniel Tomkins, Timmie Smith, Nancy M. Amato, Lawrence Rauchwerger, Technical Report, TR15-002, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, Jan 2015.
Keywords: Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

N/A


Abstract

Large, complex graphs are increasingly used to represent unstructured data in scientific applications. Applications such as discrete ordinates methods for radiation transport require these graphs to be directed with all cycles eliminated, where cycles are identified by finding the graph's strongly connected components (SCCs). Deterministic parallel algorithms identify SCCs in polylogarithmic time, but the work of the algorithms on sparse graphs exceeds the linear sequential complexity bound. Randomized algorithms based on reachability — the ability to get from one vertex to another along a directed path — are generally favorable for sparse graphs, but either do not perform well on all graphs or introduce substantial overheads. This paper introduces two new algorithms, SCCMulti and SCCMulti-PriorityAware, which perform well on all graphs without significant overhead. We also provide a characterization that compares reachability-based SCC algorithms. Experimental results demonstrate scalability for both algorithms to 393,216 cores on several types of graphs.


Validation of Full-Domain Massively Parallel Transport Sweep Algorithms, Teresa S. Bailey, W. Daryl Hawkins, Marvin L. Adams, Peter N. Brown, Adam J. Kunen, Michael P. Adams, Timmie Smith, Nancy M. Amato, and Lawrence Rauchwerger, Transactions of the American Nuclear Society, Vol: 111, Issue: 1, pp. 699-702, Anaheim, CA, USA, Nov 2014.
Keywords: Parallel Algorithms, Scientific Computing, STAPL
Links : [Published]

BibTex

@article{bailey--tans-2014
, author = "Teresa S. Bailey and W. Daryl Hawkins and Marvin L. Adams and Peter N. Brown and Adam J. Kunen and Michael P. Adams and Timmie Smith and Nancy M. Amato and Lawrence Rauchwerger"
, title = "Validation of Full-Domain Massively Parallel Transport Sweep Algorithms"
, journal = "Trans. American Nuclear Society"
, volume = "111"
, month = "November"
, year = "2014"
, pages = "699-702"
, note = "2014 ANS Annual Winter Meeting; Nov 9-13, 2014, Anaheim, CA, USA"
}


Abstract

The scalability of sweep algorithms has been a somewhat disputed topic recently. Some papers have predicted that these algorithms will scale if applied appropriately, while others have claimed that sweep algorithms have limited scalability. We compare the parallel performance models for sweep algorithms to actual computational performance using two distinctly different discrete ordinates transport codes. We have run both codes beyond 1 million MPI ranks on Lawrence Livermore National Laboratory's (LLNL) BGQ machines (SEQUOIA and VULCAN).


The STAPL Skeleton Framework, Mani Zandifar, Nathan Thomas, Nancy M. Amato, Lawrence Rauchwerger, In Wkshp. on Lang. and Comp. for Par. Comp. (LCPC), pp. 176-190, Hillsboro, OR, USA, Sep 2014. DOI: 10.1007/978-3-319-17473-0_12
Keywords: Data Flow, Parallel Libraries, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-319-17473-0_12,
author="Zandifar, Mani
and Thomas, Nathan
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Brodman, James
and Tu, Peng",
title="The stapl Skeleton Framework",
booktitle="Languages and Compilers for Parallel Computing",
year="2015",
publisher="Springer International Publishing",
address="Cham",
pages="176--190",
abstract="This paper describes the stapl Skeleton Framework, a high-level skeletal approach for parallel programming. This framework abstracts the underlying details of data distribution and parallelism from programmers and enables them to express parallel programs as a composition of existing elementary skeletons such as map, map-reduce, scan, zip, butterfly, allreduce, alltoall and user-defined custom skeletons.",
isbn="978-3-319-17473-0"
}


Abstract

This paper describes the stapl Skeleton Framework, a high-level skeletal approach for parallel programming. This framework abstracts the underlying details of data distribution and parallelism from programmers and enables them to express parallel programs as a composition of existing elementary skeletons such as map, map-reduce, scan, zip, butterfly, allreduce, alltoall and user-defined custom skeletons. Skeletons in this framework are defined as parametric data flow graphs, and their compositions are defined in terms of data flow graph compositions. Defining the composition in this manner allows dependencies between skeletons to be defined in terms of point-to-point dependencies, avoiding unnecessary global synchronizations. To show the ease of composability and expressivity, we implemented the NAS Integer Sort (IS) and Embarrassingly Parallel (EP) benchmarks using skeletons and demonstrate comparable performance to the hand-optimized reference implementations. To demonstrate scalable performance, we show a transformation which enables applications written in terms of skeletons to run on more than 100,000 cores.
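The composition idea can be sketched in a few lines. This is an illustrative reduction of the concept to plain higher-order functions (not the STAPL C++ API, which expresses skeletons as parametric data-flow graphs): elementary skeletons compose into larger computations without any global synchronization point appearing in the user's code.

```python
import operator
from functools import reduce
from itertools import accumulate

# Elementary "skeletons" as higher-order functions (illustrative names).
def map_sk(f):    return lambda xs: [f(x) for x in xs]
def zip_sk(f):    return lambda xs, ys: [f(x, y) for x, y in zip(xs, ys)]
def reduce_sk(f): return lambda xs: reduce(f, xs)
def scan_sk(f):   return lambda xs: list(accumulate(xs, f))

# Composition: a dot product is zip(*) followed by reduce(+).
dot = lambda a, b: reduce_sk(operator.add)(zip_sk(operator.mul)(a, b))

assert dot([1, 2, 3], [4, 5, 6]) == 32
assert scan_sk(operator.add)([1, 2, 3, 4]) == [1, 3, 6, 10]
```

In the framework itself each skeleton expands into a data-flow graph, so the composition above would induce only point-to-point dependencies between the zip and reduce stages.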


KLA: A New Algorithmic Paradigm for Parallel Graph Computations (Conference Best Paper), Harshvardhan, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. IEEE Int.Conf. on Parallel Architectures and Compilation Techniques (PACT), pp. 27-38, Edmonton, AB, Canada, Aug 2014. DOI: 10.1145/2628071.2628091
Keywords: Parallel Algorithms, Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/2628071.2628091,
author = {Harshvardhan and Fidel, Adam and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {KLA: A New Algorithmic Paradigm for Parallel Graph Computations},
year = {2014},
isbn = {9781450328098},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2628071.2628091},
doi = {10.1145/2628071.2628091},
abstract = {This paper proposes a new algorithmic paradigm - k-level asynchronous (KLA) - that bridges level-synchronous and asynchronous paradigms for processing graphs. The KLA paradigm enables the level of asynchrony in parallel graph algorithms to be parametrically varied from none (level-synchronous) to full (asynchronous). The motivation is to improve execution times through an appropriate trade-off between the use of fewer, but more expensive global synchronizations, as in level-synchronous algorithms, and more, but less expensive local synchronizations (and perhaps also redundant work), as in asynchronous algorithms. We show how common patterns in graph algorithms can be expressed in the KLA paradigm and provide techniques for determining k, the number of asynchronous steps allowed between global synchronizations. Results of an implementation of KLA in the STAPL Graph Library show excellent scalability on up to 96K cores and improvements of 10x or more over level-synchronous and asynchronous versions for graph algorithms such as breadth-first search, PageRank, k-core decomposition and others on certain classes of real-world graphs.},
booktitle = {Proceedings of the 23rd International Conference on Parallel Architectures and Compilation},
pages = {27–38},
numpages = {12},
keywords = {parallel algorithms, big data, distributed computing, graph analytics, asynchronous graph algorithms},
location = {Edmonton, AB, Canada},
series = {PACT '14}
}


Abstract

This paper proposes a new algorithmic paradigm - k-level asynchronous (KLA) - that bridges level-synchronous and asynchronous paradigms for processing graphs. The KLA paradigm enables the level of asynchrony in parallel graph algorithms to be parametrically varied from none (level-synchronous) to full (asynchronous). The motivation is to improve execution times through an appropriate trade-off between the use of fewer, but more expensive global synchronizations, as in level-synchronous algorithms, and more, but less expensive local synchronizations (and perhaps also redundant work), as in asynchronous algorithms. We show how common patterns in graph algorithms can be expressed in the KLA paradigm and provide techniques for determining k, the number of asynchronous steps allowed between global synchronizations. Results of an implementation of KLA in the STAPL Graph Library show excellent scalability on up to 96K cores and improvements of 10× or more over level-synchronous and asynchronous versions for graph algorithms such as breadth-first search, PageRank, k-core decomposition and others on certain classes of real-world graphs.
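A sequential simulation of the KLA idea for BFS, with illustrative names (not the STAPL Graph Library API): between two global synchronizations, relaxations are allowed to propagate up to k levels beyond the current frontier, possibly redoing work; k = 1 recovers a level-synchronous BFS, and a large k approaches a fully asynchronous traversal.

```python
from collections import deque

def kla_bfs(adj, source, k):
    # k = 1 -> level-synchronous; large k -> (almost) fully asynchronous.
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    frontier = {source}
    while frontier:                           # one pass per global sync
        horizon = min(dist[v] for v in frontier) + k
        next_frontier = set()
        q = deque(frontier)
        while q:                              # asynchronous phase
            u = q.popleft()
            for v in adj[u]:
                if dist[u] + 1 < dist[v]:     # relax (may redo work)
                    dist[v] = dist[u] + 1
                    if dist[v] < horizon:
                        q.append(v)           # keep going without syncing
                    else:
                        next_frontier.add(v)  # defer past the k-level horizon
        frontier = next_frontier
    return dist

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Any k yields correct BFS distances; k only trades sync count for redundancy.
assert kla_bfs(adj, 0, 1) == {0: 0, 1: 1, 2: 2, 3: 3}
assert kla_bfs(adj, 0, 100) == {0: 0, 1: 1, 2: 2, 3: 3}
```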


From Petascale to the Pocket: Adaptively Scaling Parallel Programs for Mobile SoCs, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, In Proc. IEEE Int.Conf. on Parallel Architectures and Compilation Techniques (PACT), SRC Poster, pp. 511-512, Edmonton, AB, Canada, Aug 2014. DOI: 10.1145/2628071.2671426
Keywords: Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/2628071.2671426,
author = {Fidel, Adam and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {From Petascale to the Pocket: Adaptively Scaling Parallel Programs for Mobile SoCs},
year = {2014},
isbn = {9781450328098},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2628071.2671426},
doi = {10.1145/2628071.2671426},
booktitle = {Proceedings of the 23rd International Conference on Parallel Architectures and Compilation},
pages = {511–512},
numpages = {2},
keywords = {energy-efficient computing, parallel libraries, shared memory optimization},
location = {Edmonton, AB, Canada},
series = {PACT '14}
}


Abstract

With resource-constrained mobile and embedded devices being outfitted with multicore processors, there exists a need to allow existing parallel programs to be scaled down to efficiently utilize these devices. We study the marriage of programming models originally designed for distributed-memory supercomputers with smaller scale parallel architectures that are shared-memory and generally resource-constrained.


Processing Big Data Graphs on Memory-Restricted Systems, Harshvardhan, Nancy M. Amato, Lawrence Rauchwerger, In Proc. IEEE Int.Conf. on Parallel Architectures and Compilation Techniques (PACT), pp. 517-518, Edmonton, AB, Canada, Aug 2014. DOI: 10.1145/2628071.2671429
Keywords: Big Data, Out-Of-Core Graph Algorithms, Parallel Graph Algorithms
Links : [Published]

BibTex

@inproceedings{10.1145/2628071.2671429,
author = {Harshvardhan and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {Processing Big Data Graphs on Memory-Restricted Systems},
year = {2014},
isbn = {9781450328098},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2628071.2671429},
doi = {10.1145/2628071.2671429},
abstract = {With the advent of big-data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques, which can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16,000+ cores.},
booktitle = {Proceedings of the 23rd International Conference on Parallel Architectures and Compilation},
pages = {517–518},
numpages = {2},
keywords = {out-of-core graph algorithms, distributed computing, graph analytics, big data, parallel graph processing},
location = {Edmonton, AB, Canada},
series = {PACT '14}
}


Abstract

With the advent of big-data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques, which can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16,000+ cores.
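A toy sketch of the paging idea, with hypothetical names (not the paper's system): the edge list is split into partitions sized to fit in memory, and the algorithm streams over them, "paging in" one subgraph at a time while only small per-vertex state stays resident.

```python
# Illustrative sketch: out-degree computation over edge partitions that are
# loaded one at a time, so the whole graph never needs to fit in RAM.
def partition_edges(edges, parts):
    size = (len(edges) + parts - 1) // parts
    return [edges[i:i + size] for i in range(0, len(edges), size)]

def out_degrees(edge_partitions, n):
    deg = [0] * n                     # only O(V) state stays resident
    for part in edge_partitions:      # "page in" one subgraph at a time
        for u, _ in part:
            deg[u] += 1
    return deg

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
assert out_degrees(partition_edges(edges, 2), 3) == [2, 1, 1]
```

The key property the abstract claims is preserved here: the algorithm body is unchanged regardless of how many partitions the edges are split into.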


Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms, Adam Fidel, Sam Ade Jacobs, Shishir Sharma, Nancy M. Amato, Lawrence Rauchwerger, In Proc. Int. Par. and Dist. Proc. Symp. (IPDPS), pp. 573-582, Phoenix, AZ, USA, May 2014. DOI: 10.1109/IPDPS.2014.66
Keywords: Parallel Algorithms, Sampling-Based Motion Planning
Links : [Published]

BibTex

@INPROCEEDINGS{6877290, author={A. {Fidel} and S. A. {Jacobs} and S. {Sharma} and N. M. {Amato} and L. {Rauchwerger}}, booktitle={2014 IEEE 28th International Parallel and Distributed Processing Symposium}, title={Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms}, year={2014}, volume={}, number={}, pages={573-582}, doi={10.1109/IPDPS.2014.66}}


Abstract

Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores.
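A toy simulation of why dynamic load balancing helps here (illustrative, not the paper's code): planning regions have heterogeneous costs, and letting an idle worker take the next region (as work stealing effectively does) beats a static spatial split that fixes the assignment up front.

```python
import heapq

def makespan_static(costs, workers):
    # Static split: contiguous blocks of regions per worker.
    chunk = (len(costs) + workers - 1) // workers
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))

def makespan_stealing(costs, workers):
    # Greedy dynamic assignment: each region goes to the least-loaded
    # worker, approximating what stealing achieves at run time.
    loads = [0] * workers
    heapq.heapify(loads)
    for c in sorted(costs, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + c)
    return max(loads)

costs = [8, 1, 1, 1, 1, 1, 1, 8]        # two expensive regions at the ends
assert makespan_static(costs, 4) == 9    # static blocks: 9, 2, 2, 9
assert makespan_stealing(costs, 4) == 8  # dynamic: 8, 8, 3, 3
```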


Load Balancing Techniques for Scalable Parallelization of Sampling-Based Motion Planning Algorithms, Adam Fidel, Sam Ade Jacobs, Shishir Sharma, Lawrence Rauchwerger, Nancy M. Amato, Technical Report, TR13-002 , Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Mar 2013.
Keywords: Parallel Algorithms, Sampling-Based Motion Planning, STAPL
Links : [Manuscript]

BibTex

@techreport{afidel-tr13-002-2013,
title = "Load Balancing Techniques for Scalable Parallelization of Sampling-Based Motion Planning Algorithms",
author = "Fidel, Adam and Jacobs, Sam Ade and Sharma, Shishir and Rauchwerger, Lawrence and Amato, Nancy M. ",
institution = "Texas A&M University",
address = "College Station, TX, USA",
number = "TR13-002",
year = 2013,
month = mar
}


Abstract

Motion planning, which is the problem of computing feasible paths through an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, particularly as the number of processors is increased, because planning time depends on region characteristics and, for most problems, the heterogeneity of the set of regions increases as the region size decreases. In this work, we introduce two techniques to address the problems of load imbalance in the parallelization of sampling-based motion planning algorithms: bulk-synchronous redistribution and an adaptive workstealing approach. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on 3,000+ cores.


Efficient Massively Parallel Transport Sweeps, W. Daryl Hawkins, Timmie Smith, Michael P. Adams, Lawrence Rauchwerger, Nancy Amato, Marvin L. Adams, Transactions of the American Nuclear Society, Vol: 107, Issue: 1, pp. 477-481, San Diego, CA, USA, Nov 2012.
Keywords: Parallel Algorithms, Scientific Computing, STAPL
Links : [Published]

BibTex

@article{hawkins-tans-2012
, author = "W. Daryl Hawkins and Timmie Smith and Michael P. Adams and Lawrence Rauchwerger and Nancy Amato and Marvin L. Adams"
, title = "Efficient Massively Parallel Transport Sweeps"
, journal = "Trans. American Nuclear Society"
, volume = "107"
, number = "1"
, month = "November"
, year = "2012"
, pages = "477-481"
, issn = "0003-018X"
, note = "2012 ANS Annual Winter Meeting ; Conference date: 11-11-2012 Through 15-11-2012",
}


Abstract

The full-domain “sweep,” in which all angular fluxes in a problem are calculated given previous-iterate values only for the volumetric source, forms the foundation for many iterative methods that have desirable properties. One important property is that iteration counts do not grow with mesh refinement. The sweep solution on parallel machines is complicated by the dependency of a given cell on its upstream neighbors. A sweep algorithm is defined by its partitioning (dividing the domain among processors), aggregation (grouping cells, directions, and energy groups into “tasks”), and scheduling (choosing which task to execute if more than one is available). The work presented here follows that of Bailey and Falgout, who theoretically and computationally evaluated the performance of three sweep algorithms. Their “data-driven” schedule appeared to be optimal—executing the sweep in the minimum possible number of stages—for tested partitionings and aggregations, but they were unable to prove this mathematically and they did not attempt to optimize across possible partitionings and aggregations. Here we consider 3D Cartesian grids of Nx × Ny × Nz spatial cells and simple Px × Py × Pz partitioning, and we permit general aggregation of cells, directions, and energy groups. We have found a provably optimal family of scheduling algorithms and we present results from one of these. We exploit our guaranteed minimum stage count to further optimize sweep-execution time by choosing the best possible partitioning and aggregation parameters. Our results show excellent scaling out to 32,768 cores, significantly better than results previously reported for sweeps and in line with the optimistic projections.
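The "data-driven" scheduling notion can be sketched as follows. This is a hedged illustration (not the PDT code) for a single octant with one task per processor on a Px × Py grid: each task depends on its upstream neighbors, and executing every task in the first stage after its dependencies complete yields the pipeline-fill stage count Px + Py − 1.

```python
# Illustrative single-octant sweep schedule on a Px x Py processor grid.
# A task's stage is one more than the latest stage of its upstream (west
# and south) neighbors, i.e. a wavefront index.
def sweep_stages(px, py):
    stage = {}
    for i in range(px):
        for j in range(py):
            deps = [stage[(i - 1, j)]] if i else []
            if j:
                deps.append(stage[(i, j - 1)])
            stage[(i, j)] = 1 + max(deps, default=0)
    return max(stage.values())

# Pipeline fill: the far corner completes after Px + Py - 1 stages.
assert sweep_stages(4, 3) == 6
assert all(sweep_stages(p, 1) == p for p in range(1, 5))
```

The paper's contribution is proving a schedule optimal when all eight octants sweep concurrently and tasks aggregate cells, directions, and groups; the wavefront count above is only the simplest special case.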


The STAPL Parallel Graph Library, Harshvardhan, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), pp. 46-60, Tokyo, Japan, Sep 2012. DOI: 10.1007/978-3-642-37658-0_4
Keywords: Parallel Graph Algorithms, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-642-37658-0_4,
author="Harshvardhan
and Fidel, Adam
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Kasahara, Hironori
and Kimura, Keiji",
title="The STAPL Parallel Graph Library",
booktitle="Languages and Compilers for Parallel Computing",
year="2013",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="46--60",
abstract="This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges.",
isbn="978-3-642-37658-0"
}


Abstract

This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges.


Logical inference techniques for loop parallelization, Cosmin E. Oancea, Lawrence Rauchwerger, ACM Conference on Programming Language Design and Implementation (PLDI), pp. 509-520, Beijing, China, Jun 2012. DOI: 10.1145/2254064.2254124
Keywords: Parallel Programming
Links : [Published]

BibTex

@inproceedings{DBLP:conf/pldi/OanceaR12,
author = {Cosmin E. Oancea and
Lawrence Rauchwerger},
editor = {Jan Vitek and
Haibo Lin and
Frank Tip},
title = {Logical inference techniques for loop parallelization},
booktitle = {{ACM} {SIGPLAN} Conference on Programming Language Design and Implementation,
{PLDI} '12, Beijing, China - June 11 - 16, 2012},
pages = {509--520},
publisher = {{ACM}},
year = {2012},
url = {https://doi.org/10.1145/2254064.2254124},
doi = {10.1145/2254064.2254124},
timestamp = {Tue, 06 Nov 2018 16:59:30 +0100},
biburl = {https://dblp.org/rec/conf/pldi/OanceaR12.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop's memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S=0, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ==> S = 0). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-Club and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers.
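A much-simplified sketch of the hybrid static/dynamic idea, with hypothetical subscripts (not the USR language itself): for writes a[i+m] and reads a[i] with i in [0, n), the independence condition "write set ∩ read set = ∅" factorizes into the cheap sufficient predicate m ≥ n, which is evaluated first; only if it fails does the (expensive) explicit set test run.

```python
# Illustrative only: a cheap extracted predicate guards an expensive
# fallback, mirroring the paper's ordering of sufficient conditions by
# estimated complexity.
def independent(m, n):
    # Sufficient condition "extracted statically": disjoint index ranges.
    if m >= n:
        return True
    # Fallback: explicit set-intersection test over the actual subscripts.
    writes = {i + m for i in range(n)}
    reads = {i for i in range(n)}
    return not (writes & reads)

assert independent(10, 4)        # decided by the cheap predicate alone
assert not independent(2, 4)     # overlap found by the fallback test
```

In the real framework the predicate language is far richer and the fallback is a run-time evaluation of the residual USR expression, but the cheap-first ordering is the same.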


A Hybrid Approach to Proving Memory Reference Monotonicity, Cosmin E. Oancea, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing, 24th International Workshop (LCPC), Vol: 7146, pp. 61-75, Fort Collins, Colorado, USA, Sep 2011. DOI: 10.1007/978-3-642-36036-7_5
Keywords: Parallel Programming
Links : [Published]

BibTex

@inproceedings{DBLP:conf/lcpc/OanceaR11,
author = {Cosmin E. Oancea and
Lawrence Rauchwerger},
editor = {Sanjay V. Rajopadhye and
Michelle Mills Strout},
title = {A Hybrid Approach to Proving Memory Reference Monotonicity},
booktitle = {Languages and Compilers for Parallel Computing, 24th International
Workshop, {LCPC} 2011, Fort Collins, CO, USA, September 8-10, 2011.
Revised Selected Papers},
series = {Lecture Notes in Computer Science},
volume = {7146},
pages = {61--75},
publisher = {Springer},
year = {2011},
url = {https://doi.org/10.1007/978-3-642-36036-7\_5},
doi = {10.1007/978-3-642-36036-7\_5},
timestamp = {Tue, 14 May 2019 10:00:47 +0200},
biburl = {https://dblp.org/rec/conf/lcpc/OanceaR11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

Array references indexed by non-linear expressions or subscript arrays represent a major obstacle to compiler analysis and to automatic parallelization. Most previous proposed solutions either enhance the static analysis repertoire to recognize more patterns, to infer array-value properties, and to refine the mathematical support, or apply expensive run time analysis of memory reference traces to disambiguate these accesses. This paper presents an automated solution based on static construction of access summaries, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily-shaped predicates that can (in)validate the reference monotonicity property and thus (dis)prove loop independence. Experiments on six benchmarks show that our general technique for dynamic validation of the monotonicity property can cover a large class of codes, incurs minimal run-time overhead and obtains good speedups.
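The run-time side of the monotonicity test reduces to a very small check. An illustrative sketch (hypothetical names, not the paper's implementation): if a subscript array is strictly increasing, writes through it touch pairwise-distinct locations, so the loop carries no output dependence and may be run in parallel; otherwise execution falls back to the sequential loop.

```python
# Illustrative run-time monotonicity validation for writes a[idx[i]].
def monotonic(idx):
    # Strictly increasing subscripts -> all write locations are distinct.
    return all(x < y for x, y in zip(idx, idx[1:]))

def parallelizable_writes(idx):
    return monotonic(idx)   # distinct subscripts: no output dependence

assert parallelizable_writes([2, 5, 7, 11])
assert not parallelizable_writes([2, 5, 5, 11])   # repeated subscript
```

The check is O(n) over the subscript array, which is the "minimal run-time overhead" the abstract refers to, compared with tracing every memory reference.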


The STAPL Parallel Container Framework, Gabriel Tanase, Antal Buss, Adam Fidel, Harshvardhan, Ioannis Papadopoulos, Olga Pearce, Timmie Smith, Nathan Thomas, Xiabing Xu, Nedhal Mourad, Jeremy Vu, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 235-246, San Antonio, TX, USA, Feb 2011. DOI: 10.1145/1941553.1941586
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@article{10.1145/2038037.1941586,
author = {Tanase, Gabriel and Buss, Antal and Fidel, Adam and Harshvardhan and Papadopoulos, Ioannis and Pearce, Olga and Smith, Timmie and Thomas, Nathan and Xu, Xiabing and Mourad, Nedal and Vu, Jeremy and Bianco, Mauro and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {The STAPL Parallel Container Framework},
year = {2011},
issue_date = {August 2011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {46},
number = {8},
issn = {0362-1340},
url = {https://doi.org/10.1145/2038037.1941586},
doi = {10.1145/2038037.1941586},
abstract = {The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL Parallel Container Framework (PCF), that is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. The PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). The PCF provides a class hierarchy and a composition mechanism that allows users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.},
journal = {SIGPLAN Not.},
month = feb,
pages = {235–246},
numpages = {12},
keywords = {languages, libraries, data, structures, containers, parallel}
}

@inproceedings{10.1145/1941553.1941586,
author = {Tanase, Gabriel and Buss, Antal and Fidel, Adam and Harshvardhan and Papadopoulos, Ioannis and Pearce, Olga and Smith, Timmie and Thomas, Nathan and Xu, Xiabing and Mourad, Nedal and Vu, Jeremy and Bianco, Mauro and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {The STAPL Parallel Container Framework},
year = {2011},
isbn = {9781450301190},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1941553.1941586},
doi = {10.1145/1941553.1941586},
abstract = {The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL Parallel Container Framework (PCF), that is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. The PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). The PCF provides a class hierarchy and a composition mechanism that allows users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.},
booktitle = {Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming},
pages = {235–246},
numpages = {12},
keywords = {containers, libraries, structures, languages, parallel, data},
location = {San Antonio, TX, USA},
series = {PPoPP '11}
}


Abstract

The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL Parallel Container Framework (PCF), that is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. The PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). The PCF provides a class hierarchy and a composition mechanism that allows users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.
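The pContainer idea above can be illustrated with a toy sketch: a container partitioned into sub-containers with per-partition locks, so methods on different partitions can run concurrently without the programmer managing the distribution. All names here are hypothetical and illustrative only; this is not the STAPL API.

```python
# Toy sketch of a distributed, thread-safe container (illustrative only,
# not STAPL code): elements are partitioned into chunks, each guarded by
# its own lock, so concurrent set/get calls on different partitions do
# not serialize against each other.
import threading

class SketchPArray:
    def __init__(self, n, num_partitions=4):
        self.n = n
        self.chunk = -(-n // num_partitions)  # ceil(n / num_partitions)
        self.parts = [[0] * min(self.chunk, n - i * self.chunk)
                      for i in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]

    def _locate(self, i):
        # map a global index to (partition, offset within partition)
        return i // self.chunk, i % self.chunk

    def set(self, i, v):
        part, off = self._locate(i)
        with self.locks[part]:        # only the owning partition is locked
            self.parts[part][off] = v

    def get(self, i):
        part, off = self._locate(i)
        with self.locks[part]:
            return self.parts[part][off]
```

In a real framework the partitions would live on different address spaces and the lock would be replaced by location-local execution; the sketch only shows the partition-plus-concurrent-methods structure.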


Speculative Parallelization of Loops, Lawrence Rauchwerger, Encyclopedia of Parallel Computing, pp. 1901-1912, Jan 2011. DOI: 10.1007/978-0-387-09766-4_35
Keywords: Parallel Programming
Links : [Published]

BibTex

@incollection{DBLP:reference/parallel/Rauchwerger11,
author = {Lawrence Rauchwerger},
editor = {David A. Padua},
title = {Speculative Parallelization of Loops},
booktitle = {Encyclopedia of Parallel Computing},
pages = {1901--1912},
publisher = {Springer},
year = {2011},
url = {https://doi.org/10.1007/978-0-387-09766-4_35},
doi = {10.1007/978-0-387-09766-4_35},
timestamp = {Wed, 12 Jul 2017 09:11:52 +0200},
biburl = {https://dblp.org/rec/reference/parallel/Rauchwerger11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}


Abstract

Speculative loop (thread) level parallelization is a compiler run-time technique that executes optimistically parallelized loops, verifies the correctness of their execution and, when necessary, backtracks to a safe state for possible re-execution. This technique includes a compiler (static) component for the transformation of the loop for speculative parallel execution as well as a run-time component which verifies correctness and re-executes when necessary.
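The checkpoint/execute/verify/backtrack cycle described above can be sketched in a few lines. This is a toy, sequential simulation of the idea (not the actual compiler transformation): iterations log their read and write sets, a post-execution test looks for cross-iteration conflicts, and on failure the loop rolls back to the checkpoint and re-executes sequentially.

```python
# Toy model of speculative loop parallelization (illustrative only):
# each iteration i reads a[indices_read[i]] and writes a[indices_write[i]].
def speculative_loop(a, indices_read, indices_write, body):
    checkpoint = a[:]                       # save a safe state
    n = len(indices_write)
    reads = [{indices_read[i]} for i in range(n)]
    writes = [{indices_write[i]} for i in range(n)]
    for i in range(n):                      # optimistic "parallel" phase
        body(a, i)
    # verification: a write in iteration i that touches any location
    # accessed by a different iteration j invalidates the speculation
    for i in range(n):
        for j in range(n):
            if i != j and writes[i] & (writes[j] | reads[j]):
                a[:] = checkpoint           # backtrack to the safe state
                for k in range(n):
                    body(a, k)              # sequential re-execution
                return False                # speculation failed
    return True                             # speculation succeeded
```

In the real technique the "parallel" phase runs concurrently and the access logging is done in shadow arrays; the sketch keeps only the logical structure of the test.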


The STAPL pView, Antal Buss, Adam Fidel, Harshvardhan, Timmie Smith, Gabriel Tanase, Nathan Thomas, Xiabing Xu, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), pp. 261-275, Houston, Texas, USA, Oct 2010. DOI: 10.1007/978-3-642-19595-2_18
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-642-19595-2_18,
author="Buss, Antal
and Fidel, Adam
and Harshvardhan
and Smith, Timmie
and Tanase, Gabriel
and Thomas, Nathan
and Xu, Xiabing
and Bianco, Mauro
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Cooper, Keith
and Mellor-Crummey, John
and Sarkar, Vivek",
title="The STAPL pView",
booktitle="Languages and Compilers for Parallel Computing",
year="2011",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="261--275",
abstract="The Standard Template Adaptive Parallel Library (STAPL) is a C++ parallel programming library that provides a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. STAPL algorithms are written in terms of pViews, which provide a generic access interface to pContainer data by abstracting common data structure concepts. Briefly, pViews allow the same pContainer to present multiple interfaces, e.g., enabling the same pMatrix to be `viewed' (or used) as a row-major or column-major matrix, or even as a vector. In this paper, we describe the stapl pView concept and its properties. pViews generalize the iterator concept and enable parallelism by providing random access to, and an ADT for, collections of elements. We illustrate how pViews provide support for managing the tradeoff between expressivity and performance and examine the performance overhead incurred when using pViews.",
isbn="978-3-642-19595-2"
}


Abstract

The Standard Template Adaptive Parallel Library (STAPL) is a C++ parallel programming library that provides a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. STAPL algorithms are written in terms of pViews, which provide a generic access interface to pContainer data by abstracting common data structure concepts. Briefly, pViews allow the same pContainer to present multiple interfaces, e.g., enabling the same pMatrix to be ‘viewed’ (or used) as a row-major or column-major matrix, or even as a vector. In this paper, we describe the stapl pView concept and its properties. pViews generalize the iterator concept and enable parallelism by providing random access to, and an ADT for, collections of elements. We illustrate how pViews provide support for managing the tradeoff between expressivity and performance and examine the performance overhead incurred when using pViews.
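The "same data, multiple interfaces" property of pViews can be illustrated with a minimal sketch: several lightweight view objects map logical coordinates onto one shared flat buffer, so the same storage can be used as a row-major matrix, a column-major matrix, or a linear sequence without copying. The class and method names are hypothetical, not the STAPL interface.

```python
# Toy illustration of the view concept (not STAPL code): a view holds no
# data of its own, only a mapping from logical (row, col) coordinates to
# positions in a shared flat buffer.
class MatrixView:
    def __init__(self, data, rows, cols, column_major=False):
        self.data, self.rows, self.cols = data, rows, cols
        self.column_major = column_major

    def __getitem__(self, rc):
        r, c = rc
        # the view decides the traversal order; the buffer never changes
        if self.column_major:
            return self.data[c * self.rows + r]
        return self.data[r * self.cols + c]

    def linear(self):
        # the same data "viewed" as a vector
        return list(self.data)
```

Because a view is only a mapping, creating one is cheap, and passing different views of one container to a generic algorithm changes the access pattern without touching the underlying distribution.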


STAPL: Standard Template Adaptive Parallel Library, Antal Buss, Harshvardhan, Ioannis Papadopoulos, Olga Tkachyshyn, Timmie Smith, Gabriel Tanase, Nathan Thomas, Xiabing Xu, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the 3rd Annual Haifa Experimental Systems Conference, Haifa, Israel, May 2010. DOI: 10.1145/1815695.1815713
Keywords: High-Performance Computing, Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/1815695.1815713,
author = {Buss, Antal and Harshvardhan and Papadopoulos, Ioannis and Pearce, Olga and Smith, Timmie and Tanase, Gabriel and Thomas, Nathan and Xu, Xiabing and Bianco, Mauro and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {STAPL: Standard Template Adaptive Parallel Library},
year = {2010},
isbn = {9781605589084},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1815695.1815713},
doi = {10.1145/1815695.1815713},
abstract = {The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. The stapl runtime system provides the abstraction for communication and program execution. In this paper, we describe the major components of stapl and present performance results for both algorithms and data structures showing scalability up to tens of thousands of processors.},
booktitle = {Proceedings of the 3rd Annual Haifa Experimental Systems Conference},
articleno = {14},
numpages = {10},
keywords = {high productivity parallel programming, parallel data structures, library},
location = {Haifa, Israel},
series = {SYSTOR '10}
}


Abstract

The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. The stapl runtime system provides the abstraction for communication and program execution. In this paper, we describe the major components of stapl and present performance results for both algorithms and data structures showing scalability up to tens of thousands of processors.


The STAPL pList, Gabriel Tanase, Xiabing Xu, Antal Buss, Harshvardhan, Ioannis Papadopoulos, Olga Tkachyshyn, Timmie Smith, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing, pp. 16-30, Newark, DE, USA, Oct 2009. DOI: 10.1007/978-3-642-13374-9_2
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1007/978-3-642-13374-9_2,
author = {Tanase, Gabriel and Xu, Xiabing and Buss, Antal and Harshvardhan and Papadopoulos, Ioannis and Pearce, Olga and Smith, Timmie and Thomas, Nathan and Bianco, Mauro and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {The STAPL pList},
year = {2009},
isbn = {3642133738},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-642-13374-9_2},
doi = {10.1007/978-3-642-13374-9_2},
abstract = {We present the design and implementation of the stapl pList, a parallel container that has the properties of a sequential list, but allows for scalable concurrent access when used in a parallel program. The Standard Template Adaptive Parallel Library (stapl) is a parallel programming library that extends C++ with support for parallelism. stapl provides a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. stapl pContainers are thread-safe, concurrent objects, providing appropriate interfaces (e.g., views) that can be used by generic pAlgorithms. The pList provides stl equivalent methods, such as insert, erase, and splice, additional methods such as split, and efficient asynchronous (non-blocking) variants of some methods for improved parallel performance. We evaluate the performance of the stapl pList on an IBM Power 5 cluster and on a CRAY XT4 massively parallel processing system. Although lists are generally not considered good data structures for parallel processing, we show that pList methods and pAlgorithms (p_generate and p_partial_sum) operating on pLists provide good scalability on more than 1000 processors and that pList compares favorably with other dynamic data structures such as the pVector.},
booktitle = {Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing},
pages = {16–30},
numpages = {15},
location = {Newark, DE},
series = {LCPC '09}
}


Abstract

We present the design and implementation of the staplpList, a parallel container that has the properties of a sequential list, but allows for scalable concurrent access when used in a parallel program. The Standard Template Adaptive Parallel Library (stapl) is a parallel programming library that extends C++ with support for parallelism. stapl provides a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. stapl pContainers are thread-safe, concurrent objects, providing appropriate interfaces (e.g., views) that can be used by generic pAlgorithms. The pList provides stl equivalent methods, such as insert, erase, and splice, additional methods such as split, and efficient asynchronous (non-blocking) variants of some methods for improved parallel performance. We evaluate the performance of the stapl pList on an IBM Power 5 cluster and on a CRAY XT4 massively parallel processing system. Although lists are generally not considered good data structures for parallel processing, we show that pList methods and pAlgorithms (p_generate and p_partial_sum) operating on pLists provide good scalability on more than 1000 processors and that pList compares favorably with other dynamic data structures such as the pVector.


Two Memory Allocators that Use Hints to Improve Locality, Alin Jula, Lawrence Rauchwerger, Proceedings of the 2009 International Symposium on Memory Management (ISMM), pp. 109-118, Dublin, Ireland, Jun 2009. DOI: 10.1145/1542431.1542447
Keywords: Memory Management, Operating Systems, Optimizing Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/1542431.1542447,
author = {Jula, Alin and Rauchwerger, Lawrence},
title = {Two Memory Allocators That Use Hints to Improve Locality},
year = {2009},
isbn = {9781605583471},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1542431.1542447},
doi = {10.1145/1542431.1542447},
abstract = {Dynamic memory allocators affect an application's performance through their data layout quality. They can use an application's allocation hints to improve the spatial locality of this layout. However, a practical approach needs to be automatic, without user intervention. In this paper we present two locality improving allocators, that use allocation hints provided automatically from the C++ STL library to improve an application's spatial locality. When compared to state-of-the-art allocators on seven real world applications, our allocators run on average 7% faster than the Lea allocator, and 17% faster than the FreeBSD's allocator, with the same memory fragmentation as the Lea allocator, one of the best allocators. While considering locality as an important goal, locality improving allocators must not abandon the existing constraints of fast allocation speed and low fragmentation. These constraints further challenge their design and implementation. We experimentally show that within a memory allocator, allocation speed, memory fragmentation, and spatial locality compete with each other in a game of rock, paper, scissors: when one tries to improve one trait, the others suffer. We conclude that our allocators achieve a good balance of these traits, and they can easily be adjusted to optimize the most important trait for each application.},
booktitle = {Proceedings of the 2009 International Symposium on Memory Management},
pages = {109–118},
numpages = {10},
location = {Dublin, Ireland},
series = {ISMM '09}
}


Abstract

Dynamic memory allocators affect an application's performance through their data layout quality. They can use an application's allocation hints to improve the spatial locality of this layout. However, a practical approach needs to be automatic, without user intervention. In this paper we present two locality improving allocators, that use allocation hints provided automatically from the C++ STL library to improve an application's spatial locality. When compared to state-of-the-art allocators on seven real world applications, our allocators run on average 7% faster than the Lea allocator, and 17% faster than the FreeBSD's allocator, with the same memory fragmentation as the Lea allocator, one of the best allocators. While considering locality as an important goal, locality improving allocators must not abandon the existing constraints of fast allocation speed and low fragmentation. These constraints further challenge their design and implementation. We experimentally show that within a memory allocator, allocation speed, memory fragmentation, and spatial locality compete with each other in a game of rock, paper, scissors: when one tries to improve one trait, the others suffer. We conclude that our allocators achieve a good balance of these traits, and they can easily be adjusted to optimize the most important trait for each application.
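The hint mechanism can be modeled with a toy bump-pointer allocator: a hint is the address of a neighboring object (e.g., the previous node of an STL list), and the allocator tries to place the new object in the same region as the hinted address, so logically adjacent objects get nearby addresses. This is an illustrative model only, not the allocators from the paper.

```python
# Toy model of hint-guided allocation (illustrative only): memory is a
# sequence of fixed-size regions; a hinted allocation is placed in the
# region containing the hint address if it still has room, which keeps
# neighbors (e.g., linked-list nodes) spatially close.
class HintAllocator:
    REGION = 1024                      # bytes per region (assumed size)

    def __init__(self):
        self.next_base = 0
        self.bump = {}                 # region base -> bytes used

    def _new_region(self):
        base = self.next_base
        self.next_base += self.REGION
        self.bump[base] = 0
        return base

    def alloc(self, size, hint=None):
        base = None
        if hint is not None:
            base = (hint // self.REGION) * self.REGION
        if base is None or base not in self.bump \
                or self.bump[base] + size > self.REGION:
            base = self._new_region()  # no usable hint region: open a new one
        addr = base + self.bump[base]
        self.bump[base] += size
        return addr                    # simulated "address"
```

A real allocator would also recycle freed space and balance locality against fragmentation, which is exactly the rock-paper-scissors tradeoff the abstract describes.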


Design for Interoperability in STAPL: pMatrices and Linear Algebra Algorithms, Antal Buss, Timmie Smith, Gabriel Tanase, Nathan Thomas, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC 2008). Lecture Notes in Computer Science, vol 5335. Springer, pp. 304-315, Edmonton, Alberta, Canada, Jul 2008. DOI: 10.1007/978-3-540-89740-8_21
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-540-89740-8_21,
author="Buss, Antal A.
and Smith, Timmie G.
and Tanase, Gabriel
and Thomas, Nathan L.
and Bianco, Mauro
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Amaral, Jos{\'e} Nelson",
title="Design for Interoperability in stapl: pMatrices and Linear Algebra Algorithms",
booktitle="Languages and Compilers for Parallel Computing",
year="2008",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="304--315",
abstract="The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. To improve productivity and performance, it is essential for stapl to exploit third party libraries, including those developed in programming languages other than C++. In this paper we describe a methodology that enables third party libraries to be used with stapl. This methodology allows a developer to specify when these specialized libraries can correctly be used, and provides mechanisms to transparently invoke them when appropriate. It also provides support for using stapl pAlgorithms and pContainers in external codes. As a concrete example, we illustrate how third party libraries, namely BLAS and PBLAS, can be transparently embedded into stapl to provide efficient linear algebra algorithms for the stapl pMatrix, with negligible slowdown with respect to the optimized libraries themselves.",
isbn="978-3-540-89740-8"
}


Abstract

The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. To improve productivity and performance, it is essential for stapl to exploit third party libraries, including those developed in programming languages other than C++. In this paper we describe a methodology that enables third party libraries to be used with stapl. This methodology allows a developer to specify when these specialized libraries can correctly be used, and provides mechanisms to transparently invoke them when appropriate. It also provides support for using stapl pAlgorithms and pContainers in external codes. As a concrete example, we illustrate how third party libraries, namely BLAS and PBLAS, can be transparently embedded into stapl to provide efficient linear algebra algorithms for the stapl pMatrix, with negligible slowdown with respect to the optimized libraries themselves.
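The "specify when a specialized library can correctly be used, then invoke it transparently" methodology can be sketched as guarded dispatch: each specialization registers an applicability predicate, and the generic algorithm runs only when no specialization's preconditions hold. All names here are hypothetical; this is a sketch of the dispatch pattern, not the STAPL mechanism.

```python
# Toy sketch of guarded dispatch to specialized back-ends (illustrative
# only): a registry pairs an applicability predicate with an
# implementation; matmul() uses the first applicable specialization
# (e.g., "dense, local, row-major -> BLAS") and otherwise falls back to
# the portable generic algorithm.
def generic_matmul(a, b):
    n, m, k = len(a), len(b[0]), len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

SPECIALIZATIONS = []   # list of (predicate(a, b), implementation(a, b))

def matmul(a, b):
    for applies, impl in SPECIALIZATIONS:
        if applies(a, b):
            return impl(a, b)
    return generic_matmul(a, b)
```

The key property is that correctness never depends on a specialization being present: removing an entry only changes performance, which mirrors how an optimized BLAS path can be swapped in without changing the algorithm's interface.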


Associative Parallel Containers In STAPL, Gabriel Tanase, Chidambareswaran (Chids) Raman, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), pp. 156-171, Urbana, IL, USA, Oct 2007. DOI: 10.1007/978-3-540-85261-2_11
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-540-85261-2_11,
author="Tanase, Gabriel
and Raman, Chidambareswaran
and Bianco, Mauro
and Amato, Nancy M.
and Rauchwerger, Lawrence",
editor="Adve, Vikram
and Garzar{\'a}n, Mar{\'i}a Jes{\'u}s
and Petersen, Paul",
title="Associative Parallel Containers in STAPL",
booktitle="Languages and Compilers for Parallel Computing",
year="2008",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="156--171",
abstract="The Standard Template Adaptive Parallel Library (stapl) is a parallel programming framework that extends C++ and stl with support for parallelism. stapl provides a collection of parallel data structures (pContainers) and algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. stapl pContainers are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. They also provide appropriate interfaces that can be used by generic pAlgorithms. In this work, we present the design and implementation of the stapl associative pContainers: pMap, pSet, pMultiMap, pMultiSet, pHashMap, and pHashSet. These containers provide optimal insert, search, and delete operations for a distributed collection of elements based on keys. Their methods include counterparts of the methods provided by the stl associative containers, and also some asynchronous (non-blocking) variants that can provide improved performance in parallel. We evaluate the performance of the stapl associative pContainers on an IBM Power5 cluster, an IBM Power3 cluster, and on a linux-based Opteron cluster, and show that the new pContainer asynchronous methods, generic pAlgorithms (e.g., pfind) and a sort application based on associative pContainers, all provide good scalability on more than 1000 processors.",
isbn="978-3-540-85261-2"
}


Abstract

The Standard Template Adaptive Parallel Library (stapl) is a parallel programming framework that extends C++ and stl with support for parallelism. stapl provides a collection of parallel data structures (pContainers) and algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. stapl pContainers are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. They also provide appropriate interfaces that can be used by generic pAlgorithms. In this work, we present the design and implementation of the stapl associative pContainers: pMap, pSet, pMultiMap, pMultiSet, pHashMap, and pHashSet. These containers provide optimal insert, search, and delete operations for a distributed collection of elements based on keys. Their methods include counterparts of the methods provided by the stl associative containers, and also some asynchronous (non-blocking) variants that can provide improved performance in parallel. We evaluate the performance of the stapl associative pContainers on an IBM Power5 cluster, an IBM Power3 cluster, and on a linux-based Opteron cluster, and show that the new pContainer asynchronous methods, generic pAlgorithms (e.g., pfind) and a sort application based on associative pContainers, all provide good scalability on more than 1000 processors.


The STAPL pArray, Gabriel Tanase, Mauro Bianco, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the 2007 Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture (MEDEA), pp. 73-80, Brasov, Romania, Sep 2007. DOI: 10.1145/1327171.1327180
Keywords: Parallel Containers, Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/1327171.1327180,
author = {Tanase, Gabriel and Bianco, Mauro and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {The STAPL pArray},
year = {2007},
isbn = {9781595938077},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1327171.1327180},
doi = {10.1145/1327171.1327180},
abstract = {The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming framework that extends C++ and STL with support for parallelism. STAPL provides parallel data structures (pContainers) and generic parallel algorithms (pAlgorithms), and a methodology for extending them to provide customized functionality. STAPL pContainers are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. They provide views as a generic means to access data that can be passed as input to generic pAlgorithms. In this work, we present the STAPL pArray, the parallel equivalent of the sequential STL valarray, a fixed-size data structure optimized for storing and accessing data based on one-dimensional indices. We describe the pArray design and show how it can support a variety of underlying data distribution policies currently available in STAPL, such as blocked or blocked cyclic. We provide experimental results showing that pAlgorithms using the pArray scale well to more than 2,000 processors. We also provide results using different data distributions that illustrate that the performance of pAlgorithms and pArray methods is usually sensitive to the underlying data distribution, and moreover, that there is no one data distribution that performs best for all pAlgorithms, processor counts, or machines.},
booktitle = {Proceedings of the 2007 Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture},
pages = {73–80},
numpages = {8},
location = {Brasov, Romania},
series = {MEDEA '07}
}


Abstract

The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming framework that extends C++ and STL with support for parallelism. STAPL provides parallel data structures (pContainers) and generic parallel algorithms (pAlgorithms), and a methodology for extending them to provide customized functionality. STAPL pContainers are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. They provide views as a generic means to access data that can be passed as input to generic pAlgorithms. In this work, we present the STAPL pArray, the parallel equivalent of the sequential STL valarray, a fixed-size data structure optimized for storing and accessing data based on one-dimensional indices. We describe the pArray design and show how it can support a variety of underlying data distribution policies currently available in STAPL, such as blocked or blocked cyclic. We provide experimental results showing that pAlgorithms using the pArray scale well to more than 2,000 processors. We also provide results using different data distributions that illustrate that the performance of pAlgorithms and pArray methods is usually sensitive to the underlying data distribution, and moreover, that there is no one data distribution that performs best for all pAlgorithms, processor counts, or machines.
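The blocked and blocked-cyclic policies mentioned above reduce to simple index-to-owner formulas. Here is a minimal sketch of the two classic mappings (the generic textbook formulas, not STAPL code):

```python
# index -> owning processor under the two classic 1-D distributions
def blocked_owner(i, n, p):
    """Blocked: n elements split into p contiguous blocks of ceil(n/p)."""
    block = -(-n // p)            # ceil(n / p)
    return i // block

def block_cyclic_owner(i, block, p):
    """Block-cyclic: blocks of `block` elements dealt round-robin to p procs."""
    return (i // block) % p
```

Blocked favors algorithms with contiguous access; block-cyclic spreads hot regions across processors, which is one reason no single distribution wins for every algorithm or machine.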


Sensitivity Analysis for Automatic Parallelization on Multi-Cores, Silvius Rus, Maikel Pennings, Lawrence Rauchwerger, Proceedings of the 21st Annual International Conference on Supercomputing (ICS), pp. 263-273, Seattle, Washington, USA, Jun 2007. DOI: 10.1145/1274971.1275008
Keywords: Optimizing Compilers, Parallelising Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/1274971.1275008,
author = {Rus, Silvius and Pennings, Maikel and Rauchwerger, Lawrence},
title = {Sensitivity Analysis for Automatic Parallelization on Multi-Cores},
year = {2007},
isbn = {9781595937681},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1274971.1275008},
doi = {10.1145/1274971.1275008},
abstract = {Sensitivity Analysis (SA) is a novel compiler technique that complements, and integrates with, static automatic parallelization analysis for the cases when relevant program behavior is input sensitive. In this paper we show how SA can extract all the input dependent, statically unavailable, conditions for which loops can be dynamically parallelized. SA generates a sequence of sufficient conditions which, when evaluated dynamically in order of their complexity, can each validate the dynamic parallel execution of the corresponding loop. For example, SA can first attempt to validate parallelization by checking simple conditions related to loop bounds. If such simple conditions cannot be met, then validating dynamic parallelization may require evaluating conditions related to the entire memory reference trace of a loop, thus decreasing the benefits of parallel execution. We have implemented Sensitivity Analysis in the Polaris compiler and evaluated its performance using 22 industry standard benchmark codes running on two multicore systems. In most cases we have obtained speedups superior to the Intel Ifort compiler because with SA we could complement static analysis with minimum cost dynamic analysis and extract most of the available coarse grained parallelism.},
booktitle = {Proceedings of the 21st Annual International Conference on Supercomputing},
pages = {263–273},
numpages = {11},
keywords = {parallelism, sensitivity analysis, multicore},
location = {Seattle, Washington},
series = {ICS '07}
}


Abstract

Sensitivity Analysis (SA) is a novel compiler technique that complements, and integrates with, static automatic parallelization analysis for the cases when relevant program behavior is input sensitive. In this paper we show how SA can extract all the input dependent, statically unavailable, conditions for which loops can be dynamically parallelized. SA generates a sequence of sufficient conditions which, when evaluated dynamically in order of their complexity, can each validate the dynamic parallel execution of the corresponding loop. For example, SA can first attempt to validate parallelization by checking simple conditions related to loop bounds. If such simple conditions cannot be met, then validating dynamic parallelization may require evaluating conditions related to the entire memory reference trace of a loop, thus decreasing the benefits of parallel execution. We have implemented Sensitivity Analysis in the Polaris compiler and evaluated its performance using 22 industry standard benchmark codes running on two multicore systems. In most cases we have obtained speedups superior to the Intel Ifort compiler because with SA we could complement static analysis with minimum cost dynamic analysis and extract most of the available coarse grained parallelism.
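The cascade of sufficient conditions can be sketched concretely: evaluate predicates from cheapest to most expensive, and the first one that holds validates parallel execution. The predicates below are hypothetical examples of such a cascade, not the ones Polaris actually generates.

```python
# Toy cascade of runtime sufficient conditions for loop parallelization
# (illustrative only): `reads` and `writes` are the loop's cross-iteration
# read and write index sets.
def can_parallelize(reads, writes):
    # predicate 1 (O(1)): nothing is written, or nothing written is read
    if not writes or not reads:
        return True
    # predicate 2 (cheap, bounds-level): accessed ranges are disjoint
    if max(writes) < min(reads) or max(reads) < min(writes):
        return True
    # predicate 3 (expensive, reference-trace level): exact set test,
    # only reached when the cheap conditions fail
    return not (set(writes) & set(reads))
```

Ordering the predicates this way keeps the common case nearly free, which is the point of the paper: the expensive trace-level test runs only when the cheap bounds-level conditions cannot decide.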


Sensitivity Analysis for Migrating Programs to Multi-Cores, Silvius Rus, Marinus Pennings, Lawrence Rauchwerger, Technical Report, TR06-015, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Dec 2006.
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr06-015-2006,
title = "Sensitivity Analysis for Migrating Programs to Multi-Cores",
author = "Rus, Silvius and Pennings, Marinus and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX, USA",
number = "TR06-015",
year = 2006,
month = dec
}


Abstract

Recently, a technique known as Hybrid Analysis has demonstrated the possibility of automatic coarse grain program parallelization through an integrated combination of static and dynamic analysis techniques. The recent introduction of multi-cores for the mass market has on the one hand exacerbated the need for such a technology and, on the other hand, changed the cost/benefit ratio of parallelization. Multi-cores have low communication costs but the degree of parallelism is relatively small (<100 processors for the next 5 years). Therefore hybrid parallelization techniques (static/dynamic analysis) need to keep their dynamic overhead very low in order to benefit multi-core systems. Sensitivity Analysis (SA), presented in this paper, is a novel technique that can reduce the dynamic overhead of previous Hybrid Analysis technology. Instead of dynamically evaluating the aggregated memory reference sets representing the potential data dependences, SA can frequently extract lightweight, sufficient conditions for which these dependence sets are empty. The cost of the run-time evaluation of these conditions, named predicates, is low enough to make parallelization profitable on multi-cores. We have implemented the SA technology in the Polaris compiler and evaluated its performance on 22 benchmark SPEC and PERFECT codes. We have obtained speedups superior to the Intel Ifort compiler in most cases because we could extract most of the available coarse grained parallelism.


Custom Memory Allocation for Free: Improving Data Locality with Container-Centric Memory Allocation, Alin Jula, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 4382, New Orleans, Louisiana, USA, Nov 2006. DOI: 10.1007/978-3-540-72521-3_22
Keywords: Memory Management, Runtime Systems, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-540-72521-3_22,
author="Jula, Alin
and Rauchwerger, Lawrence",
editor="Alm{\'a}si, George
and Ca{\c{s}}caval, C{\u{a}}lin
and Wu, Peng",
title="Custom Memory Allocation for Free",
booktitle="Languages and Compilers for Parallel Computing",
year="2007",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="299--313",
abstract="We present a novel and efficient container-centric memory allocator, named Defero, which allows a container to guide the allocation of its elements. The guidance is supported by the semantic-rich context of containers in which a new element is inserted. Defero allocates based on two attributes: size and location. Our policy of allocating a new object close to a related object often results in significantly increased memory reference locality. Defero has been integrated to work seamlessly with the C++ Standard Template Library (STL) containers. The communication between containers and the memory allocator is very simple and insures portability. STL container modification is the only needed code change to achieve custom memory allocation. We present experimental results that show the performance improvements that can be obtained by using Defero as a custom allocator for STL applications. We have applied our memory allocator to the molecular dynamics and compiler applications and obtained significant performance improvements over using the standard GNU STL allocator. With our approach custom memory allocation has been achieved without any modification of the actual applications, i.e., without additional programming efforts.",
isbn="978-3-540-72521-3"
}


Abstract

We present a novel and efficient container-centric memory allocator, named Defero, which allows a container to guide the allocation of its elements. The guidance is supported by the semantic-rich context of containers in which a new element is inserted. Defero allocates based on two attributes: size and location. Our policy of allocating a new object close to a related object often results in significantly increased memory reference locality. Defero has been integrated to work seamlessly with the C++ Standard Template Library (STL) containers. The communication between containers and the memory allocator is very simple and insures portability. STL container modification is the only needed code change to achieve custom memory allocation. We present experimental results that show the performance improvements that can be obtained by using Defero as a custom allocator for STL applications. We have applied our memory allocator to the molecular dynamics and compiler applications and obtained significant performance improvements over using the standard GNU STL allocator. With our approach custom memory allocation has been achieved without any modification of the actual applications, i.e., without additional programming efforts.
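The allocation policy described above (the container guides the allocator by passing a related element as a location hint) can be modeled in a few lines. This is a hypothetical Python simulation of the policy, not Defero's C++ interface; the page size, class, and method names are all invented.

```python
# Toy model of container-guided allocation: the allocator tries to place a
# new element in the same simulated "page" as a hint element supplied by
# the container, so neighbors in the container become neighbors in memory.

PAGE_SIZE = 4  # slots per simulated page (invented)

class HintAllocator:
    def __init__(self):
        self.pages = []   # each page holds at most PAGE_SIZE objects
        self.where = {}   # id(object) -> index of the page holding it

    def allocate(self, obj, hint=None):
        page_idx = self.where.get(id(hint)) if hint is not None else None
        if page_idx is None or len(self.pages[page_idx]) >= PAGE_SIZE:
            # no usable hint (or hint's page is full): open a fresh page
            self.pages.append([])
            page_idx = len(self.pages) - 1
        self.pages[page_idx].append(obj)
        self.where[id(obj)] = page_idx
        return page_idx

# A container inserting next to an existing element passes it as the hint,
# so the two land on the same page.
alloc = HintAllocator()
first, second = object(), object()
page_a = alloc.allocate(first)
page_b = alloc.allocate(second, hint=first)
```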


Design and use of HTAlib - A library for hierarchically tiled arrays, Ganesh Bikshandi, Jia Guo, Christoph von Praun, Gabriel Tanase, Basilio Fraguela, Maria Jesus Garzaran, David Padua, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 4382, pp. 17-32, New Orleans, LA, USA, Nov 2006. DOI: 10.1007/978-3-540-72521-3_3
Keywords: Memory Management, Parallel Algorithms, Parallel Containers
Links : [Published]

BibTex

@InProceedings{10.1007/978-3-540-72521-3_3,
author="Bikshandi, Ganesh
and Guo, Jia
and von Praun, Christoph
and Tanase, Gabriel
and Fraguela, Basilio B.
and Garzar{\'a}n, Mar{\'i}a J.
and Padua, David
and Rauchwerger, Lawrence",
editor="Alm{\'a}si, George
and Ca{\c{s}}caval, C{\u{a}}lin
and Wu, Peng",
title="Design and Use of htalib -- A Library for Hierarchically Tiled Arrays",
booktitle="Languages and Compilers for Parallel Computing",
year="2007",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="17--32",
abstract="Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array intensive computations with block-recursive nature. The model underlying HTAs provides programmers with a global view of distributed data as well as a single-threaded view of the execution. In this paper we present htalib, a C++ implementation of HTAs. This library provides several novel constructs: (i) A map-reduce operator framework that facilitates the implementation of distributed operations with HTAs. (ii) Overlapped tiling in support of tiling in stencil codes. (iii) Data layering, facilitating the use of HTAs in adaptive mesh refinement applications. We describe the interface and design of htalib and our experience with the new programming constructs.",
isbn="978-3-540-72521-3"
}


Abstract

Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array intensive computations with block-recursive nature. The model underlying HTAs provides programmers with a global view of distributed data as well as a single-threaded view of the execution. In this paper we present htalib, a C++ implementation of HTAs. This library provides several novel constructs: (i) A map-reduce operator framework that facilitates the implementation of distributed operations with HTAs. (ii) Overlapped tiling in support of tiling in stencil codes. (iii) Data layering, facilitating the use of HTAs in adaptive mesh refinement applications. We describe the interface and design of htalib and our experience with the new programming constructs.
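The map-reduce operator framework over tiles can be sketched compactly. The following Python toy is not htalib's C++ API; `tile` and `hta_map_reduce` are illustrative names, and the tiles here are flat rather than hierarchical or distributed.

```python
# Minimal sketch of the HTA map-reduce idea: partition an array into tiles,
# apply a map operator per tile (the per-tile work is what would run
# distributed), then combine per-tile results with a reduce operator.

from functools import reduce

def tile(flat, tile_size):
    """Split a flat array into consecutive tiles of tile_size elements."""
    return [flat[i:i + tile_size] for i in range(0, len(flat), tile_size)]

def hta_map_reduce(tiles, map_op, reduce_op):
    per_tile = [map_op(t) for t in tiles]   # independent per-tile work
    return reduce(reduce_op, per_tile)      # combine partial results

tiles = tile(list(range(1, 9)), 2)          # [[1, 2], [3, 4], [5, 6], [7, 8]]
total = hta_map_reduce(tiles, sum, lambda x, y: x + y)  # 36
```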


ARMI: A High Level Communication Library for STAPL, Nathan Thomas, Steven Saunders, Tim Smith, Gabriel Tanase, Lawrence Rauchwerger, Parallel Processing Letters, Vol: 16, Issue: 2, pp. 261-280, Jun 2006. DOI: 10.1142/S0129626406002617
Keywords: Communication Library, Remote Method Invocation, Runtime Systems
Links : [Published]

BibTex

@article{10.1142/S0129626406002617,
author = "Nathan Thomas and Steven Saunders and Tim Smith and Gabriel Tanase and Lawrence Rauchwerger",
title = "ARMI: A High Level Communication Library for STAPL",
journal = "Parallel Processing Letters",
year = 2006,
volume = "16",
number = "2",
pages = "261-280",
}


Abstract

ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP-V2200, Origin 3800, IBM Regatta and IBM RS/6000 SP cluster.
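The aggregation mechanism mentioned above (buffering outgoing RMIs and sending them in batches to coarsen communication) can be sketched as follows. This is a conceptual Python model, not ARMI's interface; the transport is a plain callable standing in for MPI or Pthreads messaging, and all names are invented.

```python
# Conceptual sketch of RMI aggregation: outgoing remote method invocations
# are buffered and shipped in batches, trading per-call latency for fewer,
# larger messages. The batch size is the tunable knob mentioned above.

class AggregatingChannel:
    def __init__(self, transport, batch_size=4):
        self.transport = transport    # callable that receives a batch (list)
        self.batch_size = batch_size
        self.pending = []

    def async_rmi(self, dest, method, *args):
        self.pending.append((dest, method, args))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.transport(self.pending)
            self.pending = []

sent = []  # stand-in transport: record each batch that would go on the wire
chan = AggregatingChannel(sent.append, batch_size=3)
for i in range(7):
    chan.async_rmi(0, "push", i)
chan.flush()   # 7 calls become 3 messages: batches of 3, 3, and 1
```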


Region Array SSA, Silvius Rus, Guobin He, Lawrence Rauchwerger, Technical Report, TR06-007, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, May 2006.
Keywords: Data Flow, Optimizing Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr06-007-2006,
title = "Region Array SSA",
author = "Rus, Silvius and He, Guobin and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR06-007",
year = 2006,
month = may
}


Abstract

Static Single Assignment (SSA) has become the intermediate program representation of choice in most modern compilers because it enables efficient data flow analysis of scalars and thus leads to better scalar optimizations. Unfortunately not much progress has been achieved in applying the same techniques to array data flow analysis, a very important and potentially powerful technology. In this paper we propose to improve the applicability of previous efforts in array SSA through the use of a symbolic memory access descriptor that can aggregate the accesses to the elements of an array over large, interprocedural program contexts. We then show the power of our new representation by using it to implement a basic data flow algorithm, reaching definitions. Finally we apply this analysis to array constant propagation and array privatization and show performance improvement (speedups) for benchmark codes.
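A miniature of the reaching-definitions idea: summarize each array write as a symbolic region and answer queries by intersecting regions, with later definitions shadowing earlier ones. The Python sketch below uses simple half-open intervals where the paper's descriptors are far richer (predicates, strides, interprocedural aggregation); all names are invented.

```python
# Region-based reaching definitions in miniature: each definition covers an
# index interval [lo, hi); a use's interval is matched against definitions
# newest-first, and each definition claims whatever part of the use region
# no later definition already covered.

def reaching_defs(defs, use):
    """defs: list of (def_id, (lo, hi)) in program order; use: (lo, hi).
    Returns the def_ids that may reach the use, newest first."""
    remaining = [use]          # parts of the use region not yet explained
    reached = []
    for def_id, (dlo, dhi) in reversed(defs):
        new_remaining = []
        covered = False
        for rlo, rhi in remaining:
            ilo, ihi = max(rlo, dlo), min(rhi, dhi)
            if ilo < ihi:      # this definition covers [ilo, ihi)
                covered = True
                if rlo < ilo:
                    new_remaining.append((rlo, ilo))
                if ihi < rhi:
                    new_remaining.append((ihi, rhi))
            else:
                new_remaining.append((rlo, rhi))
        if covered:
            reached.append(def_id)
        remaining = new_remaining
    return reached
```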


SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms, Lawrence Rauchwerger, Nancy Amato, ACM SIGOPS Operating Systems Review, Vol: 40, Issue: 2, pp. 73-82, Apr 2006. DOI: 10.1145/1131322.1131338
Keywords: High-Performance Computing, Memory Management, Runtime Systems
Links : [Published]

BibTex

@article{10.1145/1131322.1131338,
author = {Rauchwerger, Lawrence and Amato, Nancy M.},
title = {SmartApps: Middle-Ware for Adaptive Applications on Reconfigurable Platforms},
year = {2006},
issue_date = {April 2006},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {40},
number = {2},
issn = {0163-5980},
url = {https://doi.org/10.1145/1131322.1131338},
doi = {10.1145/1131322.1131338},
abstract = {One general avenue to obtain optimized performance on large and complex systems is to approach optimization from a global perspective of the complete system in a customized manner for each application, i.e., application-centric optimization. Lately, there have been encouraging developments in reconfigurable operating systems and hardware that will enable customized optimization. For example, machines built with PIM's and FPGA's can be quickly reconfigured to better fit a certain application and operating systems, such as IBM's K42, can have their services customized to fit the needs and characteristics of an application. While progress in operating systems and hardware has made re-configuration possible, we still need strategies and techniques to exploit them for improved application performance. In this paper, we describe the approach we are using in our smart application (SMARTAPPS) project. In the SMARTAPP executable, the compiler embeds most run-time system services and a feedback loop to monitor performance and trigger run-time adaptations. At run-time, after incorporating the code's input and determining the system's state, the SMARTAPP performs an instance specific optimization. During execution, the application continually monitors its performance and the available resources to determine if restructuring should occur. The framework includes mechanisms for performing the actual restructuring at various levels including: algorithmic adaptation, tuning reconfigurable OS services (scheduling policy, page size, etc.), and system configuration (e.g., number of processors). This paper concentrates on the techniques for providing customized system services for communication, thread scheduling, memory management, and performance monitoring and modeling.},
journal = {SIGOPS Oper. Syst. Rev.},
month = apr,
pages = {73–82},
numpages = {10}
}


Abstract

One general avenue to obtain optimized performance on large and complex systems is to approach optimization from a global perspective of the complete system in a customized manner for each application, i.e., application-centric optimization. Lately, there have been encouraging developments in reconfigurable operating systems and hardware that will enable customized optimization. For example, machines built with PIM's and FPGA's can be quickly reconfigured to better fit a certain application and operating systems, such as IBM's K42, can have their services customized to fit the needs and characteristics of an application. While progress in operating systems and hardware has made re-configuration possible, we still need strategies and techniques to exploit them for improved application performance. In this paper, we describe the approach we are using in our smart application (SMARTAPPS) project. In the SMARTAPP executable, the compiler embeds most run-time system services and a feedback loop to monitor performance and trigger run-time adaptations. At run-time, after incorporating the code's input and determining the system's state, the SMARTAPP performs an instance specific optimization. During execution, the application continually monitors its performance and the available resources to determine if restructuring should occur. The framework includes mechanisms for performing the actual restructuring at various levels including: algorithmic adaptation, tuning reconfigurable OS services (scheduling policy, page size, etc.), and system configuration (e.g., number of processors). This paper concentrates on the techniques for providing customized system services for communication, thread scheduling, memory management, and performance monitoring and modeling.
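The monitor-and-restructure feedback loop described above can be outlined schematically. This Python sketch illustrates only the control flow; the real SmartApps adaptations (scheduling policy, page size, processor count) are stood in for by an opaque `config` value, and the tolerance threshold is invented.

```python
# Schematic of an adaptive feedback loop: run a phase, compare its observed
# throughput against a baseline, and trigger a restructuring action when
# performance degrades past a tolerance.

def run_adaptive(phases, tolerance=0.8):
    """phases: callables mapping the current config to an observed
    throughput. Returns the final config and a keep/adapt decision log."""
    log = []
    baseline = None
    config = "initial"
    for phase in phases:
        perf = phase(config)
        if baseline is None:
            baseline = perf                 # first phase sets the baseline
        elif perf < tolerance * baseline:
            config = "retuned"              # e.g. new schedule, page size, #procs
            log.append("adapt")
            baseline = perf                 # re-baseline after restructuring
        else:
            log.append("keep")
    return config, log
```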


Hybrid Dependence Analysis for Automatic Parallelization, Silvius Rus, Lawrence Rauchwerger, Technical Report, TR05-013, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Nov 2005.
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr05-013-2005,
title = "Hybrid Dependence Analysis for Automatic Parallelization",
author = "Rus, Silvius and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR05-013",
year = 2005,
month = nov
}


Abstract

Automatic program parallelization has been an elusive goal for many years. It has recently become more important due to the widespread introduction of multi-cores in PCs. Automatic parallelization could not be achieved because classic compiler analysis was not powerful enough and because program behavior was found to be, in many cases, input dependent. Run-time thread-level parallelization, introduced in 1995, was a welcome but somewhat different avenue for advancing parallelization coverage. In this paper we introduce a novel analysis, Hybrid Analysis (HA), which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts almost all of the available parallelism from scientific codes and generates minimum run-time overhead. We present how we can extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods and generate sufficient conditions which, when evaluated dynamically, can validate optimizations. A large number of experiments confirm the viability of our techniques, which have been implemented in the Polaris compiler.


An Experimental Evaluation of the HP V-Class and SGI Origin 2000 Multiprocessors using Microbenchmarks and Scientific Applications, Ravi Iyer, Jack Perdue, Lawrence Rauchwerger, Nancy M. Amato and Laxmi Bhuyan, International Journal of Parallel Programming, Vol: 33, pp. 307-350, Aug 2005. DOI: 10.1007/s10766-004-1187-0
Keywords: High-Performance Computing, Parallel Processing, Scientific Computing
Links : [Published]

BibTex

@article{iyer-ijpp-2005
, author = "Ravi Iyer and Jack Perdue and Lawrence Rauchwerger and Nancy M. Amato and Laxmi Bhuyan"
, title = "An Experimental Evaluation of the HP V-Class and SGI Origin 2000 Multiprocessors using Microbenchmarks and Scientific Applications"
, journal = "International Journal of Parallel Programming"
, volume = "33"
, pages = "307--350"
, year = "2005"
, doi = "10.1007/s10766-004-1187-0"
}


Abstract

As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing multiprocessors. In this study, we present a detailed comparison of two architectures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors. We present the impact of processor design, cache/memory hierarchies and coherence protocol optimizations on the memory system performance of these multiprocessors. We also study the effect of parallelism overheads such as process creation and synchronization on the user-level performance of these multiprocessors. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance. Our microbenchmark results show the impact of L1/L2 cache size and TLB size on uniprocessor load/store latencies, the effect of coherence protocol design/optimizations and data sharing patterns on multiprocessor memory access latencies and finally the overhead of parallelism. Our application-based evaluation shows the impact of problem size, dominant sharing patterns and number of processors used on speedup and raw execution time. Finally, we use hardware counter measurements to study the correlation of system-level performance metrics and the application’s execution time performance.
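The load/store latency measurements described above are typically obtained with a pointer-chasing microbenchmark: a randomly permuted chain of indices defeats hardware prefetching, so each dependent access pays the full latency of whichever cache or memory level holds the working set. The Python sketch below illustrates the method only; the study's numbers come from native code on the machines under test.

```python
# Pointer-chasing latency microbenchmark in miniature: build one random
# cycle through n slots, then time dependent loads chasing the chain.
# Varying n sweeps the working set across L1, L2, and main memory.

import random
import time

def build_chain(n, seed=0):
    """Link n slots into a single random cycle starting at slot 0."""
    rng = random.Random(seed)
    order = list(range(1, n))
    rng.shuffle(order)
    order.append(0)           # close the cycle back at slot 0
    chain = [0] * n
    idx = 0
    for nxt in order:
        chain[idx] = nxt
        idx = nxt
    return chain

def chase(chain, steps):
    """Average seconds per dependent load over `steps` chain hops."""
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]      # each load depends on the previous one
    return (time.perf_counter() - t0) / steps
```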


A Framework for Adaptive Algorithm Selection in STAPL, Nathan Thomas, Gabriel Tanase, Olga Tkachyshyn, Jack Perdue, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 277-288, Chicago, IL, USA, Jun 2005. DOI: 10.1145/1065944.1065981
Keywords: Parallel Algorithms, Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/1065944.1065981,
author = {Thomas, Nathan and Tanase, Gabriel and Tkachyshyn, Olga and Perdue, Jack and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {A Framework for Adaptive Algorithm Selection in STAPL},
year = {2005},
isbn = {1595930809},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1065944.1065981},
doi = {10.1145/1065944.1065981},
abstract = {Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.},
booktitle = {Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {277–288},
numpages = {12},
keywords = {machine learning, parallel algorithms, matrix multiplication, sorting, adaptive algorithms},
location = {Chicago, IL, USA},
series = {PPoPP '05}
}


Abstract

Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.
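The offline-benchmark-then-runtime-test flow can be reduced to a toy example: benchmark the algorithmic options on varied inputs, learn a decision test from the results, and consult the test when the operation is invoked. The one-feature threshold below stands in for the framework's machine-learned models; the algorithm names, feature, and sample data are invented.

```python
# Toy adaptive algorithm selection: installation benchmarks record which
# option won on each input, a threshold test is learned offline, and the
# run-time selector just evaluates that cheap test.

def learn_threshold(samples):
    """samples: (feature, best_algorithm) pairs from offline benchmarking.
    Picks the threshold minimizing misclassifications for a 2-way choice."""
    best = None
    for t in sorted({f for f, _ in samples}):
        errs = sum((f <= t) != (algo == "insertion_sort")
                   for f, algo in samples)
        if best is None or errs < best[1]:
            best = (t, errs)
    return best[0]

def make_selector(threshold):
    return lambda size: "insertion_sort" if size <= threshold else "merge_sort"

# Invented "installation benchmark" results: small inputs favored the
# low-overhead algorithm, large inputs the asymptotically better one.
samples = [(8, "insertion_sort"), (16, "insertion_sort"),
           (64, "merge_sort"), (256, "merge_sort")]
select = make_selector(learn_threshold(samples))
```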


The Value Evolution Graph and its Use in Memory Reference Analysis, Silvius Rus, Dongmin Zhang, Lawrence Rauchwerger, 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 243-254, Antibes, Juan-les-Pins, France, Sep 2004. DOI: 10.1109/PACT.2004.1342558
Keywords: Data Flow, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{1342558,
author={S. {Rus} and D. {Zhang} and L. {Rauchwerger}},
booktitle={Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.},
title={The value evolution graph and its use in memory reference analysis},
year={2004},
pages={243-254},
doi={10.1109/PACT.2004.1342558}
}


Abstract

We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the value evolution graph (VEG), which models the global flow of values taken by induction variables with and without closed forms. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays addressed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.
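A toy stand-in for what the VEG enables: even when an induction variable has no closed form (e.g., it is bumped by different amounts on different branches), enumerating its possible evolutions bounds the values it can take, which feeds dependence and privatization tests. This Python sketch is illustrative only; the function name and example increments are invented, and the real VEG is a graph over program points rather than a brute-force enumeration.

```python
# Bounding an induction variable without a closed form: the variable starts
# at `start` and on each iteration is incremented by one of `increments`
# (one per CFG branch). Enumerating reachable values bounds its range; if
# all increments are positive, the variable is strictly increasing, so the
# addresses it generates are pairwise distinct (no output dependence).

def reachable(start, steps, increments):
    """Returns (min, max) value reachable after `steps` iterations."""
    vals = {start}
    for _ in range(steps):
        vals = {v + d for v in vals for d in increments}
    return (min(vals), max(vals))
```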


Automatic Parallelization Using the Value Evolution Graph, Silvius Rus, Dongmin Zhang, Lawrence Rauchwerger, Languages and Compilers for High Performance Computing (LCPC), pp. 379-393, West Lafayette, Indiana, USA, Sep 2004. DOI: 10.1007/11532378_27
Keywords: Data Flow, Optimizing Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/11532378_27,
author="Rus, Silvius
and Zhang, Dongmin
and Rauchwerger, Lawrence",
editor="Eigenmann, Rudolf
and Li, Zhiyuan
and Midkiff, Samuel P.",
title="Automatic Parallelization Using the Value Evolution Graph",
booktitle="Languages and Compilers for High Performance Computing",
year="2005",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="379--393",
abstract="We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the Value Evolution Graph (VEG), which models the global flow of scalar and array values within a program. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays indexed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.",
isbn="978-3-540-31813-2"
}


Abstract

We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the Value Evolution Graph (VEG), which models the global flow of scalar and array values within a program. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays indexed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.


An Adaptive Algorithm Selection Framework, Hao Yu, Dongmin Zhang, Lawrence Rauchwerger, 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 278-289, Antibes, Juan-les-Pins, France, Sep 2004. DOI: 10.1109/PACT.2004.1342561
Keywords: Adaptive Algorithm, Machine Learning
Links : [Published]

BibTex

@INPROCEEDINGS{1342561,
author={H. {Yu} and D. {Zhang} and L. {Rauchwerger}},
booktitle={Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.},
title={An adaptive algorithm selection framework},
year={2004},
pages={278-289},
doi={10.1109/PACT.2004.1342561}
}


Abstract

Irregular and dynamic memory reference patterns can cause performance variations for low level algorithms in general and for parallel algorithms in particular. We present an adaptive algorithm selection framework which can collect and interpret the inputs of a particular instance of a parallel algorithm and select the best performing one from an existing library. We present the dynamic selection of parallel reduction algorithms. First we introduce a set of high-level parameters that can characterize different parallel reduction algorithms. Then we describe an offline, systematic process to generate predictive models, which can be used for run-time algorithm selection. Our experiments show that our framework: (a) selects the most appropriate algorithms in 85% of the cases studied, (b) overall delivers 98% of the optimal performance, (c) adaptively selects the best algorithms for dynamic phases of a running program (resulting in performance improvements otherwise not possible), and (d) adapts to the underlying machine architecture (tested on IBM Regatta and HP V-Class systems).


An Adaptive Algorithm Selection Framework, Hao Yu, Dongmin Zhang, Francis Dang, Lawrence Rauchwerger, Technical Report, TR04-002, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Mar 2004.
Keywords: Adaptive Algorithm, Machine Learning
Links : [Manuscript]

BibTex

@techreport{hyu-tr04-002-2004,
title = "An Adaptive Algorithm Selection Framework",
author = " Yu, Hao and Zhang, Dongmin and Dang, Francis and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR04-002",
year = 2004,
month = mar
}


ARMI: An Adaptive, Platform Independent Communication Library, Steven Saunders, Lawrence Rauchwerger, Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 230-241, San Diego, California, USA, Jun 2003. DOI: 10.1145/781498.781534
Keywords: Communication Library, Remote Method Invocation, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/781498.781534,
author = {Saunders, Steven and Rauchwerger, Lawrence},
title = {ARMI: An Adaptive, Platform Independent Communication Library},
year = {2003},
isbn = {1581135882},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/781498.781534},
doi = {10.1145/781498.781534},
abstract = {ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC and IBM RS6000 SP cluster.},
booktitle = {Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {230–241},
numpages = {12},
keywords = {MPI, parallel programming, RPC, Pthreads, communication library, OpenMP, run-time system, RMI},
location = {San Diego, California, USA},
series = {PPoPP '03}
}

@article{10.1145/966049.781534,
author = {Saunders, Steven and Rauchwerger, Lawrence},
title = {ARMI: An Adaptive, Platform Independent Communication Library},
year = {2003},
issue_date = {October 2003},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {38},
number = {10},
issn = {0362-1340},
url = {https://doi.org/10.1145/966049.781534},
doi = {10.1145/966049.781534},
abstract = {ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC and IBM RS6000 SP cluster.},
journal = {SIGPLAN Not.},
month = jun,
pages = {230–241},
numpages = {12},
keywords = {RPC, RMI, Pthreads, run-time system, OpenMP, parallel programming, MPI, communication library}
}


Abstract

ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC and IBM RS6000 SP cluster.
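The aggregation idea described above can be sketched in a few lines: outgoing RMI requests are buffered and delivered in batches, coarsening communication. This is only an illustrative sketch, not ARMI's actual interface; the class and constant names (`RmiQueue`, `kAggregationFactor`-style limit) are invented for the example, and real ARMI tunes the aggregation factor per platform.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of ARMI-style aggregation: outgoing RMI requests are
// buffered and flushed in batches, trading latency for fewer, larger
// transfers. Names are illustrative, not ARMI's real API.
class RmiQueue {
public:
    explicit RmiQueue(std::size_t aggregation_factor)
        : limit_(aggregation_factor) {}

    // Queue a request; a full buffer triggers one combined "send".
    void async_rmi(std::function<void()> request) {
        buffer_.push_back(std::move(request));
        if (buffer_.size() >= limit_) flush();
    }

    // rmi_fence-like drain: deliver everything still buffered.
    void flush() {
        for (auto& r : buffer_) r();   // stands in for one bulk transfer
        ++flushes_;
        buffer_.clear();
    }

    std::size_t flushes() const { return flushes_; }

private:
    std::vector<std::function<void()>> buffer_;
    std::size_t limit_;
    std::size_t flushes_ = 0;
};
```

With an aggregation factor of 4, ten queued requests cost three transfers instead of ten, which is the coarsening effect the abstract refers to.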


A parallel communication infrastructure for STAPL, Steven Saunders, Lawrence Rauchwerger, Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL), New York, NY, USA, Jun 2002.
Keywords: Communication Library, Runtime Systems, STAPL
Links : [Published] [Talk]

BibTex

N/A


Abstract

Communication is an important but difficult aspect of parallel programming. This paper describes a parallel communication infrastructure, based on remote method invocation, to simplify parallel programming by abstracting low-level shared-memory or message passing details while maintaining high performance and portability. STAPL, the Standard Template Adaptive Parallel Library, builds upon this infrastructure to make communication transparent to the user. The basic design is discussed, as well as the mechanisms used in the current Pthreads and MPI implementations. Performance comparisons between STAPL and explicit Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800 and a Linux cluster.


The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, Francis Dang, Hao Yu, Lawrence Rauchwerger, Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS), 10 pp., Ft. Lauderdale, FL, USA, Apr 2002. DOI: 10.1109/IPDPS.2002.1015493
Keywords: Parallel Processing, Parallel Programming, Parallelising Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{1015493,
author={F. {Dang} and {Hao Yu} and L. {Rauchwerger}},
booktitle={Proceedings 16th International Parallel and Distributed Processing Symposium},
title={The R-LRPD test: speculative parallelization of partially parallel loops},
year={2002},
pages={10 pp.},
doi={10.1109/IPDPS.2002.1015493}
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. In our previously proposed framework we have speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. We now propose a generalization of our speculative doall parallelization technique, called the Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the runtime dependence test itself. We present the base algorithm and an analysis of the different heuristics for its practical application and a few experimental results on loops from Track, Spice, and FMA3D.


SmartApps, An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations, Francis Dang, Maria Jesus Garzaran, Milos Prvulovic, Ye Zhang, Alin Jula, Hao Yu, Nancy Amato, Lawrence Rauchwerger, Josep Torrellas, Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS), pp. 172-181, Fort Lauderdale, Florida, USA, Apr 2002. DOI: 10.1109/IPDPS.2002.1016572
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{1016572,
author={F. {Dang} and M. {Jesus Garzaran} and M. {Prvulovic} and {Ye Zhang} and A. {Jula} and {Hao Yu} and N. {Amato} and L. {Rauchwerger} and J. {Torrellas}},
booktitle={Proceedings 16th International Parallel and Distributed Processing Symposium},
title={Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations},
year={2002},
pages={10 pp.},
doi={10.1109/IPDPS.2002.1016572}
}


Abstract

State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code's input and the system's resources and state, the SMARTAPP performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of SMARTAPPS and then present some achievements to date, focusing on compiler-assisted software and hardware techniques for parallelizing reduction operations. These illustrate SMARTAPPS use of adaptive algorithm selection and moderately reconfigurable hardware.


Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors, Maria Jesus Garzaran, Milos Prvulovic, Ye Zhang, Alin Jula, Hao Yu, Lawrence Rauchwerger, Josep Torrellas, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 243-254, Barcelona, Spain, Sep 2001. DOI: 10.1109/PACT.2001.953304
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{953304,
author={M. J. {Garzaran} and M. {Prvulovic} and {Ye Zhang} and A. {Jula} and {Hao Yu} and L. {Rauchwerger} and J. {Torrellas}},
booktitle={Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques},
title={Architectural support for parallel reductions in scalable shared-memory multiprocessors},
year={2001},
pages={243-254},
doi={10.1109/PACT.2001.953304}
}


Abstract

Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6.
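The "conventional software-only" baseline the abstract compares against can be sketched as follows: each thread accumulates into a private copy of the reduction array, and the copies are merged sequentially at the end. This is a generic illustration of that baseline (the function name and cyclic work distribution are invented for the example), not code from the paper; the sequential merge phase is exactly the part whose poor scalability motivates the proposed hardware support.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of conventional software reduction parallelization: per-thread
// private histograms, merged after all workers join. The merge cost grows
// with thread count, which is what limits scalability.
std::vector<long> parallel_histogram(const std::vector<int>& keys,
                                     std::size_t bins, unsigned nthreads) {
    std::vector<std::vector<long>> priv(nthreads, std::vector<long>(bins, 0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            // cyclic distribution of the iteration space across threads;
            // each thread updates only its own private copy (no locking)
            for (std::size_t i = t; i < keys.size(); i += nthreads)
                ++priv[t][static_cast<std::size_t>(keys[i]) % bins];
        });
    for (auto& w : workers) w.join();

    std::vector<long> result(bins, 0);   // sequential cross-thread merge
    for (const auto& p : priv)
        for (std::size_t b = 0; b < bins; ++b) result[b] += p[b];
    return result;
}
```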


STAPL: An Adaptive, Generic Parallel C++ Library, Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 2624, pp. 193-208, Cumberland Falls, KY, USA, Aug 2001. DOI: 10.1007/3-540-35767-X_13
Keywords: High-Performance Computing, Parallel Programming, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-35767-X_13,
author="An, Ping
and Jula, Alin
and Rus, Silvius
and Saunders, Steven
and Smith, Tim
and Tanase, Gabriel
and Thomas, Nathan
and Amato, Nancy
and Rauchwerger, Lawrence",
editor="Dietz, Henry G.",
title="STAPL: An Adaptive, Generic Parallel C++ Library",
booktitle="Languages and Compilers for Parallel Computing",
year="2003",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="193--208",
abstract="The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications which use dynamically linked data structures such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5{\%} of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.",
isbn="978-3-540-35767-4"
}


Abstract

The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications which use dynamically linked data structures such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5% of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.
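The adaptive selection the abstract describes, several implementations behind one interface with the choice made at run time, can be illustrated with a toy dispatcher. The cutoff value and the two-way choice here are invented for the example; STAPL's actual selection is driven by machine and input profiling, not a fixed constant.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch of run-time adaptive algorithm selection in the spirit
// of STAPL: one entry point, multiple algorithms, chosen from cheap input
// properties. The cutoff of 32 is hypothetical; a real library tunes it.
void adaptive_sort(std::vector<int>& v) {
    const std::size_t kSmallCutoff = 32;   // hypothetical, machine-dependent
    if (v.size() <= kSmallCutoff) {
        // insertion sort: low constant factors win on tiny inputs
        for (std::size_t i = 1; i < v.size(); ++i)
            for (std::size_t j = i; j > 0 && v[j - 1] > v[j]; --j)
                std::swap(v[j - 1], v[j]);
    } else {
        std::sort(v.begin(), v.end());     // introsort for the general case
    }
}
```

Both branches produce the same result; the point is that the caller never chooses, which is what lets the library swap algorithms per platform or input without touching user code.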


Finding Strongly Connected Components in Parallel in Particle Transport Sweeps, William McLendon, Bruce Hendrickson, Steve Plimpton, Lawrence Rauchwerger, Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 328-329, Crete Island, Greece, Jul 2001. DOI: 10.1145/378580.378751
Keywords: High-Performance Computing, Parallel Algorithms
Links : [Published]

BibTex

@inproceedings{10.1145/378580.378751,
author = {Mclendon, William and Hendrickson, Bruce and Plimpton, Steve and Rauchwerger, Lawrence},
title = {Finding Strongly Connected Components in Parallel in Particle Transport Sweeps},
year = {2001},
isbn = {1581134096},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/378580.378751},
doi = {10.1145/378580.378751},
abstract = {Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.},
booktitle = {Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures},
pages = {328–329},
numpages = {2},
location = {Crete Island, Greece},
series = {SPAA '01}
}


Abstract

Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.
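The serial core of the problem above is standard: a sweep deadlocks exactly when the cell dependence graph contains a cycle, which Kahn's topological sort detects (any cells left unprocessed lie on or downwind of a cycle). The paper's contribution is doing this detection in parallel on distributed grids; the sketch below is only the sequential version, with invented names, for illustration.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Serial cycle check via Kahn's algorithm: repeatedly "sweep" cells whose
// upwind neighbors are all done; if some cells are never reached, the
// dependence graph has a cycle and a sweep would deadlock.
bool sweep_would_deadlock(std::size_t ncells,
                          const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> downwind(ncells);
    std::vector<int> upwind_count(ncells, 0);
    for (auto [from, to] : edges) {
        downwind[from].push_back(to);
        ++upwind_count[to];
    }
    std::queue<int> ready;                 // cells with no unfinished upwind cell
    for (std::size_t c = 0; c < ncells; ++c)
        if (upwind_count[c] == 0) ready.push(static_cast<int>(c));

    std::size_t swept = 0;
    while (!ready.empty()) {
        int c = ready.front(); ready.pop();
        ++swept;
        for (int d : downwind[c])
            if (--upwind_count[d] == 0) ready.push(d);
    }
    return swept < ncells;                 // unswept cells imply a cycle
}
```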


STAPL: A standard template adaptive parallel C++ library, Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, Lawrence Rauchwerger, International Workshop on Advanced Compiler Technology for High Performance and Embedded Processors, 10 pp., Bucharest, Romania, Jul 2001.
Keywords: C++, Parallel Libraries, STAPL
Links : [Published]

BibTex

N/A


Abstract

The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism in areas such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms, which use dynamically linked data structures. STAPL provides several different algorithms for some library routines, and selects among them adaptively at run-time. STAPL can replace STL automatically by invoking a preprocessing translation phase. The performance of translated code is close to the results obtained using STAPL directly (less than 5% performance deterioration). However, STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.


SmartApps: An Application Centric Approach to High Performance Computing, Lawrence Rauchwerger, Nancy Amato, Josep Torrellas, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 2017, pp. 82-96, Yorktown Heights, New York, USA, Aug 2000. DOI: 10.1007/3-540-45574-4_6
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-45574-4_6,
author="Rauchwerger, Lawrence
and Amato, Nancy M.
and Torrellas, Josep",
editor="Midkiff, Samuel P.
and Moreira, Jos{\'e} E.
and Gupta, Manish
and Chatterjee, Siddhartha
and Ferrante, Jeanne
and Prins, Jan
and Pugh, William
and Tseng, Chau-Wen",
title="SmartApps: An Application Centric Approach to High Performance Computing",
booktitle="Languages and Compilers for Parallel Computing",
year="2001",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="82--96",
abstract="State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code's input and the system's resources and state, the SmartApp performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of Smartapps and then present the achievements to date: Run-time optimizations, performance modeling, and moderately reconfigurable hardware.",
isbn="978-3-540-45574-5"
}


Abstract

State-of-the-art run-time systems are a poor match to diverse, dynamic distributed applications because they are designed to provide support to a wide variety of applications, without much customization to individual specific requirements. Little or no guiding information flows directly from the application to the run-time system to allow the latter to fully tailor its services to the application. As a result, the performance is disappointing. To address this problem, we propose application-centric computing, or SMART APPLICATIONS. In the executable of smart applications, the compiler embeds most run-time system services, and a performance-optimizing feedback loop that monitors the application’s performance and adaptively reconfigures the application and the OS/hardware platform. At run-time, after incorporating the code’s input and the system’s resources and state, the SmartApp performs a global optimization. This optimization is instance specific and thus much more tractable than a global generic optimization between application, OS and hardware. The resulting code and resource customization should lead to major speedups. In this paper, we first describe the overall architecture of Smartapps and then present the achievements to date: Run-time optimizations, performance modeling, and moderately reconfigurable hardware.


Speculative Parallelization of Partially Parallel Loops, Francis H. Dang, Lawrence Rauchwerger, Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR), Lecture Notes in Computer Science, Vol: 1915, pp. 285-299, Rochester, New York, USA, May 2000. DOI: 10.1007/3-540-40889-4_22
Keywords: Optimizing Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-40889-4_22,
author="Dang, Francis H.
and Rauchwerger, Lawrence",
editor="Dwarkadas, Sandhya",
title="Speculative Parallelization of Partially Parallel Loops",
booktitle="Languages, Compilers, and Run-Time Systems for Scalable Computers",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="285--299",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We have previously proposed a framework for their identification. We speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. In this paper we propose a generalization of our speculative doall parallelization technique, named Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the run-time dependence test itself, i.e., removes the time lost due to incorrect parallel execution. The asymptotic time-complexity is, for fully serial loops, equal to the sequential execution time. We present the base algorithm and an analysis of the different heuristics for its practical application. Some preliminary experimental results on loops from Track will show the performance of this new technique.",
isbn="978-3-540-40889-5"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We have previously proposed a framework for their identification. We speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. In this paper we propose a generalization of our speculative doall parallelization technique, named Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the run-time dependence test itself, i.e., removes the time lost due to incorrect parallel execution. The asymptotic time-complexity is, for fully serial loops, equal to the sequential execution time. We present the base algorithm and an analysis of the different heuristics for its practical application. Some preliminary experimental results on loops from Track will show the performance of this new technique.


Techniques for Reducing the Overhead of Run-time Parallelization, Hao Yu, Lawrence Rauchwerger, Compiler Construction (CC), Lecture Notes in Computer Science, Vol: 1781, pp. 232-248, Berlin, Germany, Mar 2000. DOI: 10.1007/3-540-46423-9_16
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-46423-9_16,
author="Yu, Hao
and Rauchwerger, Lawrence",
editor="Watt, David A.",
title="Techniques for Reducing the Overhead of Run-Time Parallelization",
booktitle="Compiler Construction",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="232--248",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable its practical success depends on the fraction of ideal speedup that can be obtained on modest to moderately large parallel machines. Maximum parallelism can be obtained only through a minimization of the run-time overhead of the method, which in turn depends on its level of integration within a classic restructuring compiler and on its adaptation to characteristics of the parallelized application. We present several compiler and run-time techniques designed specifically for optimizing the run-time parallelization of sparse applications. We show how we minimize the run-time overhead associated with the speculative parallelization of sparse applications by using static control flow information to reduce the number of memory references that have to be collected at run-time. We then present heuristics to speculate on the type and data structures used by the program and thus reduce the memory requirements needed for tracing the sparse access patterns. We present an implementation in the Polaris infrastructure and experimental results.",
isbn="978-3-540-46423-5"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable its practical success depends on the fraction of ideal speedup that can be obtained on modest to moderately large parallel machines. Maximum parallelism can be obtained only through a minimization of the run-time overhead of the method, which in turn depends on its level of integration within a classic restructuring compiler and on its adaptation to characteristics of the parallelized application. We present several compiler and run-time techniques designed specifically for optimizing the run-time parallelization of sparse applications. We show how we minimize the run-time overhead associated with the speculative parallelization of sparse applications by using static control flow information to reduce the number of memory references that have to be collected at run-time. We then present heuristics to speculate on the type and data structures used by the program and thus reduce the memory requirements needed for tracing the sparse access patterns. We present an implementation in the Polaris infrastructure and experimental results.


Run-time Parallelization Optimization Techniques, Hao Yu, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 1863, pp. 481-484, La Jolla, California, USA, Aug 1999. DOI: 10.1007/3-540-44905-1_36
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-44905-1_36,
author="Yu, Hao
and Rauchwerger, Lawrence",
editor="Carter, Larry
and Ferrante, Jeanne",
title="Run-Time Parallelization Optimization Techniques",
booktitle="Languages and Compilers for Parallel Computing",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="481--484",
abstract="In this paper we first present several compiler techniques to reduce the overhead of run-time parallelization. We show how to use static control flow information to reduce the number of memory references that need to be traced at run-time. Then we introduce several methods designed specifically for the parallelization of sparse applications. We detail some heuristics on how to speculate on the type and data structures used by the original code and thus reduce the memory requirements for tracing the sparse access patterns without performing any additional work. Optimization techniques for the sparse reduction parallelization and speculative loop distribution conclude the paper.",
isbn="978-3-540-44905-8"
}


Abstract

In this paper we first present several compiler techniques to reduce the overhead of run-time parallelization. We show how to use static control flow information to reduce the number of memory references that need to be traced at run-time. Then we introduce several methods designed specifically for the parallelization of sparse applications. We detail some heuristics on how to speculate on the type and data structures used by the original code and thus reduce the memory requirements for tracing the sparse access patterns without performing any additional work. Optimization techniques for the sparse reduction parallelization and speculative loop distribution conclude the paper.


The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Lawrence Rauchwerger, David Padua, IEEE Transactions on Parallel and Distributed Systems, Vol: 10, Issue: 2, pp. 160-180, Feb 1999. DOI: 10.1109/71.752782
Keywords: Data Dependence, Parallel Processing, Parallelising Compilers
Links : [Published]

BibTex

@ARTICLE{752782,
author={L. {Rauchwerger} and D. A. {Padua}},
journal={IEEE Transactions on Parallel and Distributed Systems},
title={The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization},
year={1999},
volume={10},
number={2},
pages={160-180},
doi={10.1109/71.752782}
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks, which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods.
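The marking machinery behind the LRPD test can be sketched compactly: while the loop runs speculatively as a doall, each iteration records the elements it touches in shadow structures; any element accessed by two different iterations, at least one access being a write, signals a cross-iteration dependence and forces serial re-execution. This is a simplified illustration only; class and method names are invented, and the paper's privatization and reduction validation, a central part of the real test, is omitted here.

```cpp
#include <cstddef>
#include <vector>

// Simplified sketch of LRPD-style shadow marking (privatization and
// reduction checks omitted). Conservative: flags any element shared by two
// iterations when at least one access is a write.
class LrpdShadow {
public:
    explicit LrpdShadow(std::size_t n) : marks_(n) {}

    // Called for every speculative access A[elem] in iteration `iter`.
    void mark(std::size_t elem, int iter, bool is_write) {
        Mark& m = marks_[elem];
        if (m.last_iter != -1 && m.last_iter != iter && (is_write || m.written))
            m.conflict = true;             // cross-iteration dependence
        m.last_iter = iter;
        m.written = m.written || is_write;
    }

    // true if the speculative parallel execution was valid (a doall)
    bool passes() const {
        for (const Mark& m : marks_)
            if (m.conflict) return false;
        return true;
    }

private:
    struct Mark { int last_iter = -1; bool written = false; bool conflict = false; };
    std::vector<Mark> marks_;
};
```

A loop where each iteration only touches its own elements passes; a write in iteration 0 followed by a read of the same element in iteration 1 fails, triggering rollback and serial re-execution as the abstract describes.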


Hardware for Speculative Parallelization of Partially-Parallel Loops in DSM Multiprocessors, Ye Zhang, Lawrence Rauchwerger, Josep Torrellas, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp. 135-139, Orlando, FL, USA, Jan 1999. DOI: 10.1109/HPCA.1999.744351
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{744351, author={ {Ye Zhang} and L. {Rauchwerger} and J. {Torrellas}}, booktitle={Proceedings Fifth International Symposium on High-Performance Computer Architecture}, title={Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors}, year={1999}, volume={}, number={}, pages={135-139}, doi={10.1109/HPCA.1999.744351}}


Abstract

Recently, we introduced a novel framework for speculative parallelization in hardware (Y. Zhang et al., 1998). The scheme is based on a software-based run-time parallelization scheme that we proposed earlier (L. Rauchwerger and D. Padua, 1995). The idea is to execute the code (loops) speculatively in parallel. As parallel execution proceeds, extra hardware added to the directory-based cache coherence of the DSM machine detects if there is a dependence violation. If such a violation occurs, execution is interrupted, the state is rolled back in software to the most recent safe state, and the code is re-executed serially from that point. The safe state is typically established at the beginning of the loop. Such a scheme is somewhat related to speculative parallelization inside a multiprocessor chip, which also relies on extending the cache coherence protocol to detect dependence violations. Our scheme, however, is targeted to large-scale DSM parallelism. In addition, it does not have some of the limitations of the proposed chip-multiprocessor schemes. Such limitations include the need to bound the size of the speculative state to fit in a buffer or L1 cache, and a strict in-order task commit policy that may result in load imbalance among processors. Unfortunately, our scheme has higher recovery costs if a dependence violation is detected, because execution has to backtrack to a safe state that is usually the beginning of the loop. Therefore, the aim of the paper is to extend our previous hardware scheme to effectively handle codes (loops) with a modest number of cross-iteration dependences.


Principles of Speculative Run-time Parallelization, Devang Patel, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, Vol: 1656, Chapel Hill, NC, USA, Aug 1998. DOI: 10.1007/3-540-48319-5_21
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-48319-5_21,
author="Patel, Devang
and Rauchwerger, Lawrence",
editor="Chatterjee, Siddhartha
and Prins, Jan F.
and Carter, Larry
and Ferrante, Jeanne
and Li, Zhiyuan
and Sehr, David
and Yew, Pen-Chung",
title="Principles of Speculative Run-Time Parallelization",
booktitle="Languages and Compilers for Parallel Computing",
year="1999",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="323--337",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependencies; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the PERFECT, SPEC and NCSA Benchmark suites show that these techniques yield speedups not obtainable by any other known method.",
isbn="978-3-540-48319-9"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependencies; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the PERFECT, SPEC and NCSA Benchmark suites show that these techniques yield speedups not obtainable by any other known method.


Standard Templates Adaptive Parallel Library (STAPL), Lawrence Rauchwerger, Francisco Arzu, Koji Ouchi, Languages, Compilers, and Run-Time Systems for Scalable Computers, pp. 402-409, Pittsburgh, PA, USA, May 1998. DOI: 10.1007/3-540-49530-4_32
Keywords: High-Performance Computing, Parallel Programming, STAPL
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-49530-4_32,
author="Rauchwerger, Lawrence
and Arzu, Francisco
and Ouchi, Koji",
editor="O'Hallaron, David R.",
title="Standard Templates Adaptive Parallel Library (STAPL)",
booktitle="Languages, Compilers, and Run-Time Systems for Scalable Computers",
year="1998",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="402--409",
abstract="STAPL (Standard Adaptive Parallel Library) is a parallel C++ library designed as a superset of the STL, sequentially consistent for functions with the same name, and executes on uni- or multiprocessors. STAPL is implemented using simple parallel extensions of C++ which provide a SPMD model of parallelism supporting recursive parallelism. The library is intended to be of generic use but emphasizes irregular, non-numeric programs to allow the exploitation of parallelism in areas such as geometric modeling or graph algorithms which use dynamic linked data structures. Each library routine has several different algorithmic options, and the choice among them will be made adaptively based on a performance model, statistical feedback, and current run-time conditions. Built-in performance monitors can measure actual performance and, using an extension of the BSP model, predict the relative performance of the algorithmic choices for each library routine. STAPL is intended to possibly replace STL in a user transparent manner and run on small to medium scale shared memory multiprocessors which support OpenMP.",
isbn="978-3-540-49530-7"
}


Abstract

STAPL (Standard Adaptive Parallel Library) is a parallel C++ library designed as a superset of the STL, sequentially consistent for functions with the same name, and executes on uni- or multiprocessors. STAPL is implemented using simple parallel extensions of C++ which provide a SPMD model of parallelism supporting recursive parallelism. The library is intended to be of generic use but emphasizes irregular, non-numeric programs to allow the exploitation of parallelism in areas such as geometric modeling or graph algorithms which use dynamic linked data structures. Each library routine has several different algorithmic options, and the choice among them will be made adaptively based on a performance model, statistical feedback, and current run-time conditions. Built-in performance monitors can measure actual performance and, using an extension of the BSP model, predict the relative performance of the algorithmic choices for each library routine. STAPL is intended to possibly replace STL in a user transparent manner and run on small to medium scale shared memory multiprocessors which support OpenMP.
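The adaptive-selection idea can be sketched in a few lines (illustrative only; `adaptive_sort` and its size-class bucketing are hypothetical, not STAPL's C++ interface): measure each candidate implementation once per input-size class, then keep dispatching to the one with the best observed mean time.

```python
import time

def insertion_sort(xs):
    """Simple O(n^2) sort; competitive only on small inputs."""
    out = list(xs)
    for i in range(1, len(out)):
        j, key = i, out[i]
        while j > 0 and out[j - 1] > key:
            out[j] = out[j - 1]
            j -= 1
        out[j] = key
    return out

# Candidate implementations of the same abstract operation.
ALGORITHMS = {'builtin': lambda xs: sorted(xs), 'insertion': insertion_sort}

def adaptive_sort(data, history):
    """Adaptive algorithm selection sketch: bucket inputs by size,
    sample each candidate once per bucket, then choose by measured
    mean running time (statistical feedback)."""
    bucket = len(data).bit_length()  # coarse size class
    stats = history.setdefault(bucket, {name: [] for name in ALGORITHMS})
    untried = [name for name, runs in stats.items() if not runs]
    if untried:
        choice = untried[0]          # gather a first timing sample
    else:
        choice = min(stats, key=lambda n: sum(stats[n]) / len(stats[n]))
    start = time.perf_counter()
    result = ALGORITHMS[choice](data)
    stats[choice].append(time.perf_counter() - start)
    return result
```

Repeated calls with a shared `history` dictionary converge on the faster candidate for each size class, mirroring the statistical-feedback component of the library's adaptive dispatch.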


Restructuring programs for high-speed computers with Polaris, Blume and Eigenmann and Faigin and Grout and Jaejin Lee and Lawrence and Hoeflinger and Padua and Yunheung Paek and Petersen and Pottenger and Rauchwerger and Peng Tu and Weatherford, In Proc. ICPP Workshop on Challenges for Parallel Processing, Ithaca, NY, USA, Aug 1996. DOI: 10.1109/ICPPW.1996.538601
Keywords: Optimizing Compilers, Parallel Programming
Links : [Published]

BibTex

@INPROCEEDINGS {538601,
author = {Blume and Eigenmann and Faigin and Grout and Jaejin Lee and Lawrence and Hoeflinger and Padua and Yunheung Paek and Petersen and Pottenger and Rauchwerger and Peng Tu and Weatherford},
booktitle = {Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing},
title = {Restructuring programs for high-speed computers with Polaris},
year = {1996},
volume = {},
issn = {1530-2016},
pages = {149-161},
keywords = {program compilers;parallel programming},
doi = {10.1109/ICPPW.1996.538601},
url = {https://doi.ieeecomputersociety.org/10.1109/ICPPW.1996.538601},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {aug}
}


A Scalable Method for Run-Time Loop Parallelization, Lawrence Rauchwerger, Nancy M. Amato, David A. Padua, International Journal of Parallel Programming, Vol: 23, Issue: 6, pp. 537-576, Dec 1995. DOI: 10.1007/BF02577866
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@article{cite-key,
Abstract = {Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and},
Author = {Rauchwerger, Lawrence and Amato, Nancy M. and Padua, David A.},
Da = {1995/12/01},
Date-Added = {2020-11-02 22:20:42 -0600},
Date-Modified = {2020-11-02 22:20:42 -0600},
Doi = {10.1007/BF02577866},
Id = {Rauchwerger1995},
Isbn = {1573-7640},
Journal = {International Journal of Parallel Programming},
Number = {6},
Pages = {537--576},
Title = {A scalable method for run-time loop parallelization},
Ty = {JOUR},
Url = {https://doi.org/10.1007/BF02577866},
Volume = {23},
Year = {1995},
Bdsk-Url-1 = {https://doi.org/10.1007/BF02577866}}


Abstract

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and
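The inspector/scheduler division of labor can be sketched as follows (an illustrative serial simplification; the paper's inspector runs fully in parallel without synchronization): given each iteration's recorded accesses, place it in the earliest wavefront that respects flow, anti, and output dependences, so all iterations within one wavefront may execute concurrently.

```python
def build_wavefronts(accesses):
    """Inspector-style scheduling sketch (not the paper's algorithm).

    `accesses` is a list, one entry per iteration, of sets of
    (index, 'r'|'w') pairs on a shared array. Returns the wavefront
    number assigned to each iteration.
    """
    last_write_wf = {}  # index -> latest wavefront containing a write
    last_read_wf = {}   # index -> latest wavefront containing a read
    schedule = []
    for acc in accesses:
        wf = 0
        for idx, op in acc:
            if op == 'r':   # flow dependence: read after earlier write
                wf = max(wf, last_write_wf.get(idx, -1) + 1)
            else:           # output and anti dependences for a write
                wf = max(wf, last_write_wf.get(idx, -1) + 1,
                             last_read_wf.get(idx, -1) + 1)
        for idx, op in acc:
            if op == 'r':
                last_read_wf[idx] = max(last_read_wf.get(idx, -1), wf)
            else:
                last_write_wf[idx] = max(last_write_wf.get(idx, -1), wf)
        schedule.append(wf)
    return schedule
```

Fully independent iterations all land in wavefront 0 (a doall), while a chain of accesses to one element serializes into successive wavefronts.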


Run-Time Methods for Parallelizing Partially Parallel Loops, Lawrence Rauchwerger, Nancy M. Amato, David A. Padua, Proceedings of the 9th International Conference on Supercomputing, pp. 137-146, Barcelona, Spain, Jul 1995. DOI: 10.1145/224538.224553
Keywords: Data Dependence, Parallel Programming, Parallelising Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/224538.224553,
author = {Rauchwerger, Lawrence and Amato, Nancy M. and Padua, David A.},
title = {Run-Time Methods for Parallelizing Partially Parallel Loops},
year = {1995},
isbn = {0897917286},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/224538.224553},
doi = {10.1145/224538.224553},
booktitle = {Proceedings of the 9th International Conference on Supercomputing},
pages = {137–146},
numpages = {10},
location = {Barcelona, Spain},
series = {ICS '95}
}


Abstract

In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop’s access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.


Parallelizing While Loops for Multiprocessor Systems, Lawrence Rauchwerger, David Padua, Proceedings of 9th International Parallel Processing Symposium (IPPS), pp. 347-356, Santa Barbara, CA, USA, Apr 1995. DOI: 10.1109/IPPS.1995.395955
Keywords: Data Dependence, Parallelising Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{395955, author={L. {Rauchwerger} and D. {Padua}}, booktitle={Proceedings of 9th International Parallel Processing Symposium}, title={Parallelizing while loops for multiprocessor systems}, year={1995}, volume={}, number={}, pages={347-356}, doi={10.1109/IPPS.1995.395955}}


Abstract

Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Because these types of loops arise frequently in practice, we have developed techniques that can automatically transform them for parallel execution. We succeed in parallelizing loops involving linked-list traversals, something that has not been done before. This is an important problem since linked-list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile-time. Experimental results on loops from the PERFECT Benchmarks and sparse matrix packages show that these techniques can yield significant speedups.
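One way such a loop can be transformed is sketched below (an illustrative Python simplification rather than the compiler's Fortran setting; `parallelize_list_loop` is a hypothetical name): a cheap sequential pass materializes the unknown iteration space into an array of nodes, after which the loop body executes as a doall over that array.

```python
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

@dataclass
class Node:
    value: int
    next: 'Node | None' = None

def parallelize_list_loop(head, work):
    """Sketch of parallelizing a linked-list traversal loop: the
    traversal (pure pointer chasing) stays sequential, but the loop
    body `work` runs in parallel over the collected nodes."""
    nodes = []
    cur = head
    while cur is not None:   # sequential pass: materialize the iteration space
        nodes.append(cur)
        cur = cur.next
    with ThreadPoolExecutor() as pool:  # doall over the node array
        return list(pool.map(work, nodes))
```

This pays off when the loop body dominates the pointer chasing; handling bodies whose data dependences are unknown at compile time would additionally require a run-time test of the kind described above.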


The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization, Lawrence Rauchwerger, David Padua, Proceedings of the 8th International Conference on Supercomputing (ICS), pp. 33-43, Manchester, England, Jul 1994. DOI: 10.1145/181181.181254
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/181181.181254,
author = {Rauchwerger, Lawrence and Padua, David},
title = {The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization},
year = {1994},
isbn = {0897916654},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/181181.181254},
doi = {10.1145/181181.181254},
abstract = {Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test—a technique for identifying fully parallel loops at run-time, and dynamically privatizing scalars and arrays. The test itself is fully parallel, and can be applied to any loop, regardless of the structure of its data and/or control flow. The technique can be utilized in two modes: (i) the test is performed before executing the loop and indicates whether the loop can be executed as a DOALL; (ii) speculatively—the loop and the test are executed simultaneously, and it is determined later if the loop was in fact parallel. The test can also be used for debugging parallel programs. We discuss how the test can be inserted automatically by the compiler and outline a cost/performance analysis that can be performed to decide when to use the test. Our conclusion is that the test should almost always be applied—because, as we show, the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop), is minimal. We present some experimental results on loops from the PERFECT Benchmarks which confirm our conclusion that this test can lead to significant speedups.},
booktitle = {Proceedings of the 8th International Conference on Supercomputing},
pages = {33–43},
numpages = {11},
location = {Manchester, England},
series = {ICS '94}
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test—a technique for identifying fully parallel loops at run-time, and dynamically privatizing scalars and arrays. The test itself is fully parallel, and can be applied to any loop, regardless of the structure of its data and/or control flow. The technique can be utilized in two modes: (i) the test is performed before executing the loop and indicates whether the loop can be executed as a DOALL; (ii) speculatively—the loop and the test are executed simultaneously, and it is determined later if the loop was in fact parallel. The test can also be used for debugging parallel programs. We discuss how the test can be inserted automatically by the compiler and outline a cost/performance analysis that can be performed to decide when to use the test. Our conclusion is that the test should almost always be applied—because, as we show, the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop), is minimal. We present some experimental results on loops from the PERFECT Benchmarks which confirm our conclusion that this test can lead to significant speedups.