Compiler Optimization
Current Contributors: Lawrence Rauchwerger
Project Alumni: Marinus Pennings, Silvius Rus, Hao Yu, Francis Dang, Julio Antonio Carvallo De Ochoa, Devang Patel
Motivation

Current parallelizing compilers do a reasonable job of analyzing programs with regular, static access patterns. However, several problems can make static analysis impossible or impractical: dynamic applications depend on input or computed data; the complexity of the required symbolic calculus is often too large; and system resources or configuration may change dynamically.

Since irregular programs represent a large and important fraction of all applications, an automatable framework for run-time optimization is needed to complement existing and future static compiler techniques.




Related Publications

Sensitivity Analysis for Migrating Programs to Multi-Cores, Silvius Rus, Marinus Pennings, Lawrence Rauchwerger, Technical Report, TR06-015, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Dec 2006.
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr06-015-2006,
title = "Sensitivity Analysis for Migrating Programs to Multi-Cores",
author = "Rus, Silvius Rus and Pennings, Marinus and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX, USA",
number = "TR06-015",
year = 2006,
month = dec
}


Abstract

Recently, a technique known as Hybrid Analysis has demonstrated the possibility of automatic coarse grain program parallelization through an integrated combination of static and dynamic analysis techniques. The recent introduction of multi-cores for the mass market has on the one hand exacerbated the need for such a technology and, on the other hand, changed the cost/benefit ratio of parallelization. Multi-cores have low communication costs but the degree of parallelism is relatively small (<100 processors for the next 5 years). Therefore hybrid parallelization techniques (static/dynamic analysis) need to keep their dynamic overhead very low in order to benefit multi-core systems. Sensitivity Analysis (SA), presented in this paper, is a novel technique that can reduce the dynamic overhead of previous Hybrid Analysis technology. Instead of dynamically evaluating the aggregated memory reference sets representing the potential data dependences, SA can frequently extract lightweight, sufficient conditions for which these dependence sets are empty. The cost of the run-time evaluation of these conditions, named predicates, is low enough to make parallelization profitable on multi-cores. We have implemented the SA technology in the Polaris compiler and evaluated its performance on 22 benchmark codes from the SPEC and PERFECT suites. We have obtained speedups superior to the Intel Ifort compiler in most cases because we could extract most of the available coarse-grained parallelism.
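
To make the predicate idea concrete, here is a deliberately small sketch (not the paper's implementation; the loop, the input-dependent offset m, and the extracted condition are all invented for illustration). Instead of evaluating aggregated reference sets at run time, the generated code checks a cheap sufficient condition and falls back to the serial loop only when independence cannot be guaranteed:

// Sketch: predicate-based run-time selection between a serial and a
// parallel version of a loop whose dependences depend on an input value.
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

// Serial version of the loop:  a[i] = a[i + m] + 1  for i in [0, n).
void loop_serial(std::vector<int>& a, int n, int m) {
  for (int i = 0; i < n; ++i) a[i] = a[i + m] + 1;
}

// Parallel version; valid only when no iteration reads an element
// written by a different iteration.
void loop_parallel(std::vector<int>& a, int n, int m, int num_threads) {
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t)
    workers.emplace_back([&a, n, m, t, num_threads] {
      for (int i = t; i < n; i += num_threads) a[i] = a[i + m] + 1;
    });
  for (auto& w : workers) w.join();
}

int main(int argc, char** argv) {
  const int n = 1000;
  // Offset known only at run time, e.g. read from the input data set.
  const int m = (argc > 1) ? std::max(0, std::atoi(argv[1])) : n;

  std::vector<int> a(static_cast<std::size_t>(n) + m, 1);

  // Sufficient condition ("predicate") evaluated at run time: either every
  // read and its matching write fall in the same iteration (m == 0), or the
  // write set [0, n) and the read set [m, n + m) are disjoint (m >= n).
  const bool independent = (m == 0) || (m >= n);

  if (independent)
    loop_parallel(a, n, m, 4);
  else
    loop_serial(a, n, m);          // dependence cannot be ruled out

  std::cout << (independent ? "parallel" : "serial")
            << " execution, a[0] = " << a[0] << "\n";
  return 0;
}

Here the predicate costs two comparisons, while a reference-set based run-time test would have to inspect on the order of n addresses; that difference is what makes run-time validation profitable at the modest parallelism of a multi-core.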


Region Array SSA, Silvius Rus, Guobin He, Lawrence Rauchwerger, Technical Report, TR06-007, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, May 2006.
Keywords: Data Flow, Optimizing Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr06-007-2006,
title = "Region Array SSA",
author = "Rus, Silvius and He, Guobin and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR06-007",
year = 2006,
month = may
}


Abstract

Static Single Assignment (SSA) has become the intermediate program representation of choice in most modern compilers because it enables efficient data flow analysis of scalars and thus leads to better scalar optimizations. Unfortunately not much progress has been achieved in applying the same techniques to array data flow analysis, a very important and potentially powerful technology. In this paper we propose to improve the applicability of previous efforts in array SSA through the use of a symbolic memory access descriptor that can aggregate the accesses to the elements of an array over large, interprocedural program contexts. We then show the power of our new representation by using it to implement a basic data flow algorithm, reaching definitions. Finally we apply this analysis to array constant propagation and array privatization and show performance improvement (speedups) for benchmark codes.
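
As a rough illustration of what an aggregated access descriptor enables (a simplified sketch only: the RegionSet type below is invented and much cruder than the paper's symbolic descriptors), array accesses in a region can be summarized as a union of index intervals, and containment between the write and read summaries answers the privatization question for that region:

// Sketch: interval-based summaries of the array elements written and read
// in a loop iteration, and a containment test used to decide privatization.
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

struct RegionSet {                               // union of half-open intervals [lo, hi)
  std::vector<std::pair<int, int>> ivs;

  void add(int lo, int hi) { if (lo < hi) ivs.emplace_back(lo, hi); }

  void normalize() {                             // sort and merge overlapping intervals
    std::sort(ivs.begin(), ivs.end());
    std::vector<std::pair<int, int>> merged;
    for (auto [lo, hi] : ivs) {
      if (!merged.empty() && lo <= merged.back().second)
        merged.back().second = std::max(merged.back().second, hi);
      else
        merged.emplace_back(lo, hi);
    }
    ivs = std::move(merged);
  }

  bool covers(const RegionSet& other) const {    // every index of other is in *this
    for (auto [lo, hi] : other.ivs) {
      bool contained = false;
      for (auto [a, b] : ivs)
        if (a <= lo && hi <= b) { contained = true; break; }
      if (!contained) return false;
    }
    return true;
  }
};

int main() {
  // Per-iteration summary of a hypothetical loop body that first writes
  // tmp[0 .. k) and then reads the two halves of the same range.
  const int k = 64;
  RegionSet writes, reads;
  writes.add(0, k);
  reads.add(0, k / 2);
  reads.add(k / 2, k);
  writes.normalize();
  reads.normalize();

  // If the write summary covers the read summary, every read is preceded by
  // a write in the same iteration, so tmp can be privatized and carries no
  // loop dependence.
  std::cout << (writes.covers(reads) ? "tmp is privatizable\n"
                                     : "cannot privatize tmp\n");
  return 0;
}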


Hybrid Dependence Analysis for Automatic Parallelization, Silvius Rus, Lawrence Rauchwerger, Technical Report, TR05-013, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Nov 2005.
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Manuscript]

BibTex

@techreport{srus-tr05-013-2005,
title = "Hybrid Dependence Analysis for Automatic Parallelization",
author = "Rus, Silvius and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR05-013",
year = 2005,
month = nov
}


Abstract

Automatic program parallelization has been an elusive goal for many years. It has recently become more important due to the widespread introduction of multi-cores in PCs. Automatic parallelization could not be achieved because classic compiler analysis was not powerful enough and program behavior was found, in many cases, to be input dependent. Run-time thread level parallelization, introduced in 1995, was a welcome but somewhat different avenue for advancing parallelization coverage. In this paper we introduce a novel analysis, Hybrid Analysis (HA), which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts almost all of the available parallelism from scientific codes and generates minimal run-time overhead. We present how we can extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods and generate sufficient conditions which, when evaluated dynamically, can validate optimizations. A large number of experiments confirm the viability of our techniques, which have been implemented in the Polaris compiler.


A Framework for Adaptive Algorithm Selection in STAPL, Nathan Thomas, Gabriel Tanase, Olga Tkachyshyn, Jack Perdue, Nancy M. Amato, Lawrence Rauchwerger, Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 277-288, Chicago, IL, USA, Jun 2005. DOI: 10.1145/1065944.1065981
Keywords: Parallel Algorithms, Parallel Programming, STAPL
Links : [Published]

BibTex

@inproceedings{10.1145/1065944.1065981,
author = {Thomas, Nathan and Tanase, Gabriel and Tkachyshyn, Olga and Perdue, Jack and Amato, Nancy M. and Rauchwerger, Lawrence},
title = {A Framework for Adaptive Algorithm Selection in STAPL},
year = {2005},
isbn = {1595930809},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1065944.1065981},
doi = {10.1145/1065944.1065981},
abstract = {Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.},
booktitle = {Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {277–288},
numpages = {12},
keywords = {machine learning, parallel algorithms, matrix multiplication, sorting, adaptive algorithms},
location = {Chicago, IL, USA},
series = {PPoPP '05}
}


Abstract

Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.
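
A toy rendition of the install-time benchmarking / run-time selection flow follows (the function names are illustrative, not STAPL's API, and a realistic model would consider input characteristics well beyond size):

// Sketch: benchmark two functionally equivalent algorithms at installation
// time, derive a selection test, and consult it on every call at run time.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

using Clock = std::chrono::steady_clock;

void insertion_sort(std::vector<int>& v) {
  for (std::size_t i = 1; i < v.size(); ++i) {
    int key = v[i];
    std::size_t j = i;
    while (j > 0 && v[j - 1] > key) { v[j] = v[j - 1]; --j; }
    v[j] = key;
  }
}

void introsort(std::vector<int>& v) { std::sort(v.begin(), v.end()); }

double time_sort(void (*sort_fn)(std::vector<int>&), std::size_t n, std::mt19937& rng) {
  std::vector<int> v(n);
  for (auto& x : v) x = static_cast<int>(rng());
  auto t0 = Clock::now();
  sort_fn(v);
  return std::chrono::duration<double>(Clock::now() - t0).count();
}

// "Installation benchmark": find the input size at which introsort starts
// to win; larger inputs will be routed to it at run time.
std::size_t learn_threshold() {
  std::mt19937 rng(42);
  for (std::size_t n = 8; n <= 4096; n *= 2)
    if (time_sort(introsort, n, rng) < time_sort(insertion_sort, n, rng))
      return n;
  return 4096;
}

int main() {
  const std::size_t threshold = learn_threshold();   // model produced offline
  std::cout << "selection threshold: " << threshold << " elements\n";

  // Run-time selection: the test is evaluated per call on the actual input.
  std::mt19937 rng(7);
  std::vector<int> input(10000);
  for (auto& x : input) x = static_cast<int>(rng());
  if (input.size() < threshold) insertion_sort(input);
  else                          introsort(input);
  std::cout << "sorted " << input.size()
            << " elements, smallest = " << input.front() << "\n";
  return 0;
}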


The Value Evolution Graph and its Use in Memory Reference Analysis, Silvius Rus, Dongmin Zhang, Lawrence Rauchwerger, 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 243-254, Antibes, Juan-les-Pins, France, Sep 2004. DOI: 10.1109/PACT.2004.1342558
Keywords: Data Flow, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{1342558, author={S. {Rus} and D. {Zhang} and L. {Rauchwerger}}, booktitle={Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.}, title={The value evolution graph and its use in memory reference analysis}, year={2004}, volume={}, number={}, pages={243-254}, doi={10.1109/PACT.2004.1342558}}


Abstract

We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the value evolution graph (VEG), which models the global flow of values taken by induction variable with and without closed forms. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays addressed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.
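
A small example of the kind of fact such an analysis makes available (a sketch only: the evolution bounds are stated by hand here rather than derived from the graph). The index k below has no closed form because it is incremented only on some iterations, but its evolution still bounds the array section it addresses, which is enough to prove the two loops independent:

// Sketch: bounding an induction variable without a closed form and using
// the bound to prove two loops access disjoint array sections.
#include <iostream>
#include <thread>
#include <vector>

int main() {
  const int n = 1000;
  std::vector<int> a(2 * n + 1, 0);

  // Evolution bounds of k over loop 1, derivable without a closed form:
  // start value 0, per-iteration step in {0, 1}, at most n iterations.
  const int k_min = 0, k_max = n;

  // Loop 1 writes a[k_min .. k_max]; loop 2 reads a[n + 1 .. 2n].
  const bool disjoint = (k_max < n + 1) || (2 * n < k_min);

  auto loop1 = [&] {                   // irregular writes, no closed form for k
    int k = 0;
    for (int i = 0; i < n; ++i) {
      if (i % 3 == 0) ++k;             // conditional increment
      a[k] += i;
    }
  };
  auto loop2 = [&] {                   // reads a strictly higher section
    long sum = 0;
    for (int j = n + 1; j <= 2 * n; ++j) sum += a[j];
    std::cout << "loop2 sum = " << sum << "\n";
  };

  if (disjoint) {                      // proven independent: overlap execution
    std::thread t1(loop1), t2(loop2);
    t1.join();
    t2.join();
  } else {                             // otherwise preserve the original order
    loop1();
    loop2();
  }
  std::cout << (disjoint ? "loops ran concurrently\n" : "loops ran in order\n");
  return 0;
}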


Automatic Parallelization Using the Value Evolution Graph, Silvius Rus, Dongmin Zhang, Lawrence Rauchwerger, Languages and Compilers for High Performance Computing (LCPC), pp. 379-393, West Lafayette, Indiana, USA, Sep 2004. DOI: 10.1007/11532378_27
Keywords: Data Flow, Optimizing Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/11532378_27,
author="Rus, Silvius
and Zhang, Dongmin
and Rauchwerger, Lawrence",
editor="Eigenmann, Rudolf
and Li, Zhiyuan
and Midkiff, Samuel P.",
title="Automatic Parallelization Using the Value Evolution Graph",
booktitle="Languages and Compilers for High Performance Computing",
year="2005",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="379--393",
abstract="We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the Value Evolution Graph (VEG), which models the global flow of scalar and array values within a program. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays indexed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.",
isbn="978-3-540-31813-2"
}


Abstract

We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the Value Evolution Graph (VEG), which models the global flow of scalar and array values within a program. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays indexed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.


An Adaptive Algorithm Selection Framework, Hao Yu, Dongmin Zhang, Lawrence Rauchwerger, 13th International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 278-289, Antibes, Juan-les-Pins, France, Sep 2004. DOI: 10.1109/PACT.2004.1342561
Keywords: Adaptive Algorithm, Machine Learning
Links : [Published]

BibTex

@INPROCEEDINGS{1342561, author={H. {Yu} and D. {Zhang} and L. {Rauchwerger}}, booktitle={Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.}, title={An adaptive algorithm selection framework}, year={2004}, volume={}, number={}, pages={278-289}, doi={10.1109/PACT.2004.1342561}}


Abstract

Irregular and dynamic memory reference patterns can cause performance variations for low level algorithms in general and for parallel algorithms in particular. We present an adaptive algorithm selection framework which can collect and interpret the inputs of a particular instance of a parallel algorithm and select the best performing one from an existing library. We present the dynamic selection of parallel reduction algorithms. First we introduce a set of high-level parameters that can characterize different parallel reduction algorithms. Then we describe an offline, systematic process to generate predictive models, which can be used for run-time algorithm selection. Our experiments show that our framework: (a) selects the most appropriate algorithms in 85% of the cases studied, (b) overall delivers 98% of the optimal performance, (c) adaptively selects the best algorithms for dynamic phases of a running program (resulting in performance improvements otherwise not possible), and (d) adapts to the underlying machine architecture (tested on IBM Regatta and HP V-Class systems).
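
As a much-simplified illustration (the selection test below is a hand-written density threshold rather than a learned model, and the two reduction algorithms are generic stand-ins for the library versions the paper characterizes), the run-time inspects the update pattern of a histogram-style reduction and picks between a replicated-buffer scheme and a shared atomic scheme:

// Sketch: run-time selection between two parallel reduction algorithms
// based on a cheap characterization of the access pattern.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Replicated reduction: private copies per thread, merged at the end.
void reduce_replicated(std::vector<long>& hist, const std::vector<int>& idx, int nthreads) {
  std::vector<std::vector<long>> priv(nthreads, std::vector<long>(hist.size(), 0));
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t)
    ts.emplace_back([&, t] {
      for (std::size_t i = t; i < idx.size(); i += nthreads) ++priv[t][idx[i]];
    });
  for (auto& th : ts) th.join();
  for (int t = 0; t < nthreads; ++t)                 // cross-processor merge
    for (std::size_t e = 0; e < hist.size(); ++e) hist[e] += priv[t][e];
}

// Shared reduction: one array of atomics, no merge phase.
void reduce_atomic(std::vector<long>& hist, const std::vector<int>& idx, int nthreads) {
  std::vector<std::atomic<long>> shared(hist.size());
  for (auto& x : shared) x = 0;
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t)
    ts.emplace_back([&, t] {
      for (std::size_t i = t; i < idx.size(); i += nthreads) ++shared[idx[i]];
    });
  for (auto& th : ts) th.join();
  for (std::size_t e = 0; e < hist.size(); ++e) hist[e] += shared[e];
}

int main() {
  const int nthreads = 4, elems = 1 << 20;
  std::vector<int> idx(1 << 16);                     // sparse update pattern
  for (std::size_t i = 0; i < idx.size(); ++i)
    idx[i] = static_cast<int>((i * 2654435761u) % elems);
  std::vector<long> hist(elems, 0);

  // Run-time characterization: average updates per reduction element.
  const double density = double(idx.size()) / double(hist.size());
  if (density > 1.0) reduce_replicated(hist, idx, nthreads);
  else               reduce_atomic(hist, idx, nthreads);

  long total = 0;
  for (long v : hist) total += v;
  std::cout << "density " << density << ", total updates " << total << "\n";
  return 0;
}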


An Adaptive Algorithm Selection Framework, Hao Yu, Dongmin Zhang, Francis Dang, Lawrence Rauchwerger, Technical Report, TR04-002, Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX, USA, Mar 2004.
Keywords: Adaptive Algorithm, Machine Learning
Links : [Manuscript]

BibTex

@techreport{hyu-tr04-002-2004,
title = "An Adaptive Algorithm Selection Framework",
author = " Yu, Hao and Zhang, Dongmin and Dang, Francis and Rauchwerger, Lawrence",
institution = "Parasol Laboratory, Department of Computer Science, Texas A&M University",
address = "College Station, TX",
number = "TR04-002",
year = 2004,
month = mar
}


The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, Francis Dang, Hao Yu, Lawrence Rauchwerger, Proceedings 16th International Parallel and Distributed Processing Symposium (IPDPS), pp. 10-, Ft. Lauderdale, FL, USA, Apr 2002. DOI: 10.1109/IPDPS.2002.1015493
Keywords: Parallel Processing, Parallel Programming, Parallelising Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{1015493, author={F. {Dang} and {Hao Yu} and L. {Rauchwerger}}, booktitle={Proceedings 16th International Parallel and Distributed Processing Symposium}, title={The R-LRPD test: speculative parallelization of partially parallel loops}, year={2002}, volume={}, number={}, pages={10 pp-}, doi={10.1109/IPDPS.2002.1015493}}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. In our previously proposed framework we have speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. We now propose a generalization of our speculative doall parallelization technique, called the Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the runtime dependence test itself. We present the base algorithm and an analysis of the different heuristics for its practical application and a few experimental results on loops from Track, Spice, and FMA3D.
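
A compact sketch of the recursive re-execution idea follows (heavily simplified: the speculative doall of each pass is only modeled here by letting every iteration read a pre-pass snapshot, and the shadow bookkeeping is sequential, whereas the real test runs both in parallel):

// Sketch: re-execute only the iterations after the first detected
// dependence violation, recursively, instead of the whole loop.
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  const int n = 16, m = 8;
  // Indirection arrays known only at run time:  a[w[i]] = a[r[i]] + 1.
  std::vector<int> w(n), r(n);
  for (int i = 0; i < n; ++i) { w[i] = i % m; r[i] = (i + 3) % m; }
  std::vector<int> a(m, 0), a_ref(m, 0);

  for (int i = 0; i < n; ++i) a_ref[w[i]] = a_ref[r[i]] + 1;   // sequential reference

  int start = 0, passes = 0;
  while (start < n) {
    ++passes;
    std::vector<int> snapshot = a;                 // checkpoint before the pass
    std::vector<int> first_writer(m, n);           // shadow: earliest writer per element
    std::vector<int> result(n, 0);

    // Speculative pass over [start, n): in the real scheme these iterations
    // run as a doall; reading the snapshot models that here.
    for (int i = start; i < n; ++i) {
      result[i] = snapshot[r[i]] + 1;
      first_writer[w[i]] = std::min(first_writer[w[i]], i);
    }

    // First iteration whose read was overwritten by an earlier iteration of
    // this pass: its speculative value is stale, so it must be redone.
    int fail = n;
    for (int i = start; i < n && fail == n; ++i)
      if (first_writer[r[i]] < i) fail = i;

    // Commit the iterations proven correct, then recurse on the remainder.
    for (int i = start; i < fail; ++i) a[w[i]] = result[i];
    start = fail;
  }

  std::cout << "passes: " << passes << ", matches sequential result: "
            << std::boolalpha << (a == a_ref) << "\n";
  return 0;
}

Each pass commits a non-empty prefix of the remaining iterations, so a fully serial loop degenerates to sequential execution plus the bookkeeping overhead, which is the bound the abstract refers to.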


Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors, Maria Jesus Garzaran, Milos Prvulovic, Ye Zhang, Alin Jula, Hao Yu, Lawrence Rauchwerger, Josep Torrellas, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 243-254, Barcelona, Spain, Sep 2001. DOI: 10.1109/PACT.2001.953304
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{953304, author={M. J. {Garzaran} and M. {Prvulovic} and {Ye Zhang} and A. {Jula} and {Hao Yu} and L. {Rauchwerger} and J. {Torrellas}}, booktitle={Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques}, title={Architectural support for parallel reductions in scalable shared-memory multiprocessors}, year={2001}, volume={}, number={}, pages={243-254}, doi={10.1109/PACT.2001.953304}}


Abstract

Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6.


Speculative Parallelization of Partially Parallel Loops, Francis H. Dang, Lawrence Rauchwerger, Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR.) Lecture Notes in Computer Science, Vol: 1915, pp. 285-299, Rochester, New York, USA, May 2000. DOI: 10.1007/3-540-40889-4_22
Keywords: Optimizing Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-40889-4_22,
author="Dang, Francis H.
and Rauchwerger, Lawrence",
editor="Dwarkadas, Sandhya",
title="Speculative Parallelization of Partially Parallel Loops",
booktitle="Languages, Compilers, and Run-Time Systems for Scalable Computers",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="285--299",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently de- fined access patterns. We have previously proposed a framework for their identifi- cation. We speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method ex- ploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. In this paper we propose a generalization of our speculative doall parallelization technique, named Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the run-time dependence test itself, i.e., removes the time lost due to incorrect parallel execution. The asymptotic time-complexity is, for fully serial loops, equal to the sequential execution time. We present the base algorithm and an analysis of the different heuristics for its practical application. Some preliminary experimental results on loops from Track will show the performance of this new technique.",
isbn="978-3-540-40889-5"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We have previously proposed a framework for their identification. We speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. In this paper we propose a generalization of our speculative doall parallelization technique, named Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the run-time dependence test itself, i.e., removes the time lost due to incorrect parallel execution. The asymptotic time-complexity is, for fully serial loops, equal to the sequential execution time. We present the base algorithm and an analysis of the different heuristics for its practical application. Some preliminary experimental results on loops from Track will show the performance of this new technique.


Techniques for Reducing the Overhead of Run-time Parallelization, Hao Yu, Lawrence Rauchwerger, Compiler Construction (CC.) Lecture Notes in Computer Science, Vol: 1781, pp. 232-248, Berlin, Germany, Mar 2000. DOI: 10.1007/3-540-46423-9_16
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-46423-9_16,
author="Yu, Hao
and Rauchwerger, Lawrence",
editor="Watt, David A.",
title="Techniques for Reducing the Overhead of Run-Time Parallelization",
booktitle="Compiler Construction",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="232--248",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable its practical success depends on the fraction of ideal speedup that can be obtained on modest to moderately large parallel machines. Maximum parallelism can be obtained only through a minimization of the run-time overhead of the method, which in turn depends on its level of integration within a classic restructuring compiler and on its adaptation to characteristics of the parallelized application. We present several compiler and run-time techniques designed specifically for optimizing the run-time parallelization of sparse applications. We show how we minimize the run-time overhead associated with the speculative parallelization of sparse applications by using static control flow information to reduce the number of memory references that have to be collected at run-time. We then present heuristics to speculate on the type and data structures used by the program and thus reduce the memory requirements needed for tracing the sparse access patterns. We present an implementation in the Polaris infrastructure and experimental results.",
isbn="978-3-540-46423-5"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable its practical success depends on the fraction of ideal speedup that can be obtained on modest to moderately large parallel machines. Maximum parallelism can be obtained only through a minimization of the run-time overhead of the method, which in turn depends on its level of integration within a classic restructuring compiler and on its adaptation to characteristics of the parallelized application. We present several compiler and run-time techniques designed specifically for optimizing the run-time parallelization of sparse applications. We show how we minimize the run-time overhead associated with the speculative parallelization of sparse applications by using static control flow information to reduce the number of memory references that have to be collected at run-time. We then present heuristics to speculate on the type and data structures used by the program and thus reduce the memory requirements needed for tracing the sparse access patterns. We present an implementation in the Polaris infrastructure and experimental results.


Run-time Parallelization Optimization Techniques, Hao Yu, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC.) Lecture Notes in Computer Science, Vol: 1863, La Jolla, California, USA, Aug 1999. DOI: 10.1007/3-540-44905-1_36
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-44905-1_36,
author="Yu, Hao
and Rauchwerger, Lawrence",
editor="Carter, Larry
and Ferrante, Jeanne",
title="Run-Time Parallelization Optimization Techniques",
booktitle="Languages and Compilers for Parallel Computing",
year="2000",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="481--484",
abstract="In this paper we first present several compiler techniques to reduce the overhead of run-time parallelization. We show how to use static control flow information to reduce the number of memory references that need to be traced at run-time. Then we introduce several methods designed specifically for the parallelization of sparse applications. We detail some heuristics on how to speculate on the type and data structures used by the original code and thus reduce the memory requirements for tracing the sparse access patterns without performing any additional work. Optimization techniques for the sparse reduction parallelization and speculative loop distribution conclude the paper.",
isbn="978-3-540-44905-8"
}


Abstract

In this paper we first present several compiler techniques to reduce the overhead of run-time parallelization. We show how to use static control flow information to reduce the number of memory references that need to be traced at run-time. Then we introduce several methods designed specifically for the parallelization of sparse applications. We detail some heuristics on how to speculate on the type and data structures used by the original code and thus reduce the memory requirements for tracing the sparse access patterns without performing any additional work. Optimization techniques for the sparse reduction parallelization and speculative loop distribution conclude the paper.


The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Lawrence Rauchwerger, David Padua, IEEE Transactions on Parallel and Distributed Systems, Vol: 10, Issue: 2, pp. 160-180, Feb 1999. DOI: 10.1109/71.752782
Keywords: Data Dependence, Parallel Processing, Parallelising Compilers
Links : [Published]

BibTex

@ARTICLE{752782, author={L. {Rauchwerger} and D. A. {Padua}}, journal={IEEE Transactions on Parallel and Distributed Systems}, title={The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization}, year={1999}, volume={10}, number={2}, pages={160-180}, doi={10.1109/71.752782}}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks, which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods.
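
The following is a toy, sequential rendition of the shadow-array marking that underlies this kind of test (the actual LRPD test performs the marking during the speculative parallel execution, merges per-processor shadow structures in parallel, and also validates privatization and reductions; none of that machinery is shown here):

// Sketch: mark each element's writer and exposed readers, then declare the
// loop a doall only if no cross-iteration dependence was recorded.
#include <iostream>
#include <vector>

// One recorded memory access: which iteration made it, which array element
// it touched, and whether it was a write.
struct Access { int iter; int index; bool is_write; };

// Returns true if the traced loop can be executed as a doall.
// The trace must list accesses in sequential program order.
bool shadow_test(int num_elems, const std::vector<Access>& trace) {
  const int NONE = -1, MULTI = -2;
  std::vector<int> writer(num_elems, NONE);          // iteration that writes the element
  std::vector<int> exposed(num_elems, NONE);         // iteration(s) with an exposed read
  std::vector<int> written_here(num_elems, NONE);    // last iteration to write it so far

  for (const Access& a : trace) {
    if (a.is_write) {
      if (writer[a.index] != NONE && writer[a.index] != a.iter)
        return false;                                // two iterations write the element
      writer[a.index] = a.iter;
      written_here[a.index] = a.iter;
    } else if (written_here[a.index] != a.iter) {    // read not covered by a write in this iteration
      if (exposed[a.index] == NONE) exposed[a.index] = a.iter;
      else if (exposed[a.index] != a.iter) exposed[a.index] = MULTI;
    }
  }
  // A cross-iteration dependence exists if an element is written by one
  // iteration and has an exposed read in any other iteration.
  for (int e = 0; e < num_elems; ++e)
    if (writer[e] != NONE && exposed[e] != NONE && exposed[e] != writer[e])
      return false;
  return true;
}

int main() {
  std::vector<Access> loop1, loop2;
  for (int i = 0; i < 4; ++i) {                      // iteration i: reads a[i+4], writes a[i]
    loop1.push_back({i, i + 4, false});
    loop1.push_back({i, i, true});
  }
  for (int i = 0; i < 4; ++i) {                      // iteration i: reads a[i], writes a[i+1]
    loop2.push_back({i, i, false});
    loop2.push_back({i, i + 1, true});
  }
  std::cout << std::boolalpha
            << "loop 1 is a doall: " << shadow_test(8, loop1) << "\n"
            << "loop 2 is a doall: " << shadow_test(8, loop2) << "\n";
  return 0;
}

Loop 1 passes because every element is either only read or written by a single iteration; loop 2 fails because element i is written by iteration i-1 and read, unguarded by a prior write, by iteration i.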


Hardware for Speculative Parallelization of Partially-Parallel Loops in DSM Multiprocessors, Ye Zhang, Lawrence Rauchwerger, Josep Torrellas, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp. 135-139, Orlando, FL, USA, Jan 1999. DOI: 10.1109/HPCA.1999.744351
Keywords: Computer Architecture, High-Performance Computing, Optimizing Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{744351, author={ {Ye Zhang} and L. {Rauchwerger} and J. {Torrellas}}, booktitle={Proceedings Fifth International Symposium on High-Performance Computer Architecture}, title={Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors}, year={1999}, volume={}, number={}, pages={135-139}, doi={10.1109/HPCA.1999.744351}}


Abstract

Recently, we introduced a novel framework for speculative parallelization in hardware (Y. Zhang et al., 1998). The scheme is based on a software-based run-time parallelization scheme that we proposed earlier (L. Rauchwerger and D. Padua, 1995). The idea is to execute the code (loops) speculatively in parallel. As parallel execution proceeds, extra hardware added to the directory-based cache coherence of the DSM machine detects if there is a dependence violation. If such a violation occurs, execution is interrupted, the state is rolled back in software to the most recent safe state, and the code is re-executed serially from that point. The safe state is typically established at the beginning of the loop. Such a scheme is somewhat related to speculative parallelization inside a multiprocessor chip, which also relies on extending the cache coherence protocol to detect dependence violations. Our scheme, however, is targeted to large-scale DSM parallelism. In addition, it does not have some of the limitations of the proposed chip-multiprocessor schemes. Such limitations include the need to bound the size of the speculative state to fit in a buffer or L1 cache, and a strict in-order task commit policy that may result in load imbalance among processors. Unfortunately, our scheme has higher recovery costs if a dependence violation is detected, because execution has to backtrack to a safe state that is usually the beginning of the loop. Therefore, the aim of the paper is to extend our previous hardware scheme to effectively handle codes (loops) with a modest number of cross-iteration dependences.


Principles of Speculative Run-time Parallelization, Devang Patel, Lawrence Rauchwerger, Languages and Compilers for Parallel Computing (LCPC.) Lecture Notes in Computer Science, Vol: 1656, Chapel Hill, NC, USA, Aug 1998. DOI: 10.1007/3-540-48319-5_21
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@InProceedings{10.1007/3-540-48319-5_21,
author="Patel, Devang
and Rauchwerger, Lawrence",
editor="Chatterjee, Siddhartha
and Prins, Jan F.
and Carter, Larry
and Ferrante, Jeanne
and Li, Zhiyuan
and Sehr, David
and Yew, Pen-Chung",
title="Principles of Speculative Run---Time Parallelization",
booktitle="Languages and Compilers for Parallel Computing",
year="1999",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="323--337",
abstract="Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependencies; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the PERFECT, SPEC and NCSA Benchmark suites show that these techniques yield speedups not obtainable by any other known method.",
isbn="978-3-540-48319-9"
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependencies; if the test fails, then the loop is re-executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the PERFECT, SPEC and NCSA Benchmark suites show that these techniques yield speedups not obtainable by any other known method.


Parallel programming with Polaris, W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, IEEE Computer, Vol: 29, Issue: 12, pp. 78-82, Dec 1996. DOI: 10.1109/2.546612
Keywords: Parallel Programming, Parallelising Compilers
Links : [Published]

BibTex

@ARTICLE{546612,
author={W. {Blume} and R. {Doallo} and R. {Eigenmann} and J. {Grout} and J. {Hoeflinger} and T. {Lawrence}},
journal={Computer},
title={Parallel programming with Polaris},
year={1996},
volume={29},
number={12},
pages={78-82},
doi={10.1109/2.546612}}


A Scalable Method for Run-Time Loop Parallelization, Lawrence Rauchwerger, Nancy M. Amato, David A. Padua, International Journal of Parallel Programming, Vol: 23, Issue: 6, pp. 537-576, Dec 1995. DOI: 10.1007/BF02577866
Keywords: Parallelising Compilers
Links : [Published]

BibTex

@article{rauchwerger-ijpp-1995,
author = {Rauchwerger, Lawrence and Amato, Nancy M. and Padua, David A.},
title = {A Scalable Method for Run-Time Loop Parallelization},
journal = {International Journal of Parallel Programming},
year = {1995},
volume = {23},
number = {6},
pages = {537--576},
doi = {10.1007/BF02577866},
url = {https://doi.org/10.1007/BF02577866}
}


Abstract

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and
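
A toy version of the inspector/scheduler pair is sketched below (the inspector here is sequential and conservative, placing an iteration one wavefront after the last iteration that touched any element it accesses, whereas the paper's inspector is itself fully parallel and distinguishes reads from writes):

// Sketch: inspector computes a wavefront schedule from the access pattern;
// the scheduler then runs each wavefront's iterations concurrently.
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  const int n = 12, m = 6;
  // Loop body, known only through index arrays:  a[w[i]] = a[r[i]] + 1.
  std::vector<int> w(n), r(n), a(m, 0);
  for (int i = 0; i < n; ++i) { w[i] = i % m; r[i] = (i * 5) % m; }

  // Inspector: assign every iteration the earliest wavefront that respects
  // its dependences on previously touched elements.
  std::vector<int> depth(m, 0), wavefront(n);
  int max_wf = 0;
  for (int i = 0; i < n; ++i) {
    wavefront[i] = std::max(depth[w[i]], depth[r[i]]) + 1;
    depth[w[i]] = depth[r[i]] = wavefront[i];
    max_wf = std::max(max_wf, wavefront[i]);
  }

  // Scheduler/executor: by construction, iterations in the same wavefront
  // touch disjoint elements, so each wavefront is executed as a doall.
  for (int wf = 1; wf <= max_wf; ++wf) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
      if (wavefront[i] == wf)
        workers.emplace_back([&a, &w, &r, i] { a[w[i]] = a[r[i]] + 1; });
    for (auto& t : workers) t.join();
  }

  std::cout << "wavefronts: " << max_wf << ", a[0] = " << a[0] << "\n";
  return 0;
}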


Run-Time Methods for Parallelizing Partially Parallel Loops, Lawrence Rauchwerger, Nancy M. Amato, David A. Padua, Proceedings of the 9th International Conference on Supercomputing, pp. 137-146, Barcelona, Spain, Jul 1995. DOI: 10.1145/224538.224553
Keywords: Data Dependence, Parallel Programming, Parallelising Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/224538.224553,
author = {Rauchwerger, Lawrence and Amato, Nancy M. and Padua, David A.},
title = {Run-Time Methods for Parallelizing Partially Parallel Loops},
year = {1995},
isbn = {0897917286},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/224538.224553},
doi = {10.1145/224538.224553},
booktitle = {Proceedings of the 9th International Conference on Supercomputing},
pages = {137–146},
numpages = {10},
location = {Barcelona, Spain},
series = {ICS '95}
}


Abstract

In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.


Parallelizing While Loops for Multiprocessor Systems, Lawrence Rauchwerger, David Padua, Proceedings of 9th International Parallel Processing Symposium (IPPS), pp. 347-356, Santa Barbara, CA, USA, Apr 1995. DOI: 10.1109/IPPS.1995.395955
Keywords: Data Dependence, Parallelising Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{395955, author={L. {Rauchwerger} and D. {Padua}}, booktitle={Proceedings of 9th International Parallel Processing Symposium}, title={Parallelizing while loops for multiprocessor systems}, year={1995}, volume={}, number={}, pages={347-356}, doi={10.1109/IPPS.1995.395955}}


Abstract

Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Because these types of loops arise frequently in practice, we have developed techniques that can automatically transform them for parallel execution. We succeed in parallelizing loops involving linked-list traversals, something that had not been done before. This is an important problem since linked-list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile-time. Experimental results on loops from the PERFECT Benchmarks and sparse matrix packages show that these techniques can yield significant speedups.
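
One of the simplest transformations in this space can be sketched as follows (illustrative only; the techniques in the paper additionally handle the cases where even this split has to be validated speculatively at run time). The pointer-chasing traversal is materialized into an array by a cheap sequential pass, after which the per-node work runs as a fully parallel loop:

// Sketch: split a while loop over a linked list into a sequential pointer
// collection pass and a parallel pass over the collected nodes.
#include <iostream>
#include <thread>
#include <vector>

struct Node { double value; Node* next; };

double heavy_work(double v) {            // stand-in for an expensive loop body
  for (int k = 0; k < 1000; ++k) v = v * 0.999 + 0.001;
  return v;
}

int main() {
  // Build a small list (backed by a vector only to keep the example short).
  const int n = 64;
  std::vector<Node> storage(n);
  for (int i = 0; i < n; ++i) {
    storage[i].value = i;
    storage[i].next = (i + 1 < n) ? &storage[i + 1] : nullptr;
  }
  Node* head = &storage[0];

  // Pass 1 (sequential): traverse the list, recording the visited nodes.
  std::vector<Node*> nodes;
  for (Node* p = head; p != nullptr; p = p->next) nodes.push_back(p);

  // Pass 2 (parallel doall): once the traversal order is materialized, the
  // per-node work is independent and can be distributed over threads.
  const int num_threads = 4;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t)
    workers.emplace_back([&nodes, t, num_threads] {
      for (std::size_t i = t; i < nodes.size(); i += num_threads)
        nodes[i]->value = heavy_work(nodes[i]->value);
    });
  for (auto& wk : workers) wk.join();

  std::cout << "first node value = " << nodes.front()->value << "\n";
  return 0;
}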


The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization, Lawrence Rauchwerger, David Padua, Proceedings of the 8th International Conference on Supercomputing (ICS), pp. 33-43, Manchester, England, Jul 1994. DOI: 10.1145/181181.181254
Keywords: Data Dependence, Optimizing Compilers, Parallelising Compilers
Links : [Published]

BibTex

@inproceedings{10.1145/181181.181254,
author = {Rauchwerger, Lawrence and Padua, David},
title = {The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization},
year = {1994},
isbn = {0897916654},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/181181.181254},
doi = {10.1145/181181.181254},
abstract = {Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test—a technique for identifying fully parallel loops at run-time, and dynamically privatizing scalars and arrays. The test itself is fully parallel, and can be applied to any loop, regardless of the structure of its data and/or control flow. The technique can be utilized in two modes: (i) the test is performed before executing the loop and indicates whether the loop can be executed as a DOALL; (ii) speculatively—the loop and the test are executed simultaneously, and it is determined later if the loop was in fact parallel. The test can also be used for debugging parallel programs. We discuss how the test can be inserted automatically by the compiler and outline a cost/performance analysis that can be performed to decide when to use the test. Our conclusion is that the test should almost always be applied—because, as we show, the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop), is minimal. We present some experimental results on loops from the PERFECT Benchmarks which confirm our conclusion that this test can lead to significant speedups.},
booktitle = {Proceedings of the 8th International Conference on Supercomputing},
pages = {33–43},
numpages = {11},
location = {Manchester, England},
series = {ICS '94}
}


Abstract

Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test—a technique for identifying fully parallel loops at run-time, and dynamically privatizing scalars and arrays. The test itself is fully parallel, and can be applied to any loop, regardless of the structure of its data and/or control flow. The technique can be utilized in two modes: (i) the test is performed before executing the loop and indicates whether the loop can be executed as a DOALL; (ii) speculatively—the loop and the test are executed simultaneously, and it is determined later if the loop was in fact parallel. The test can also be used for debugging parallel programs. We discuss how the test can be inserted automatically by the compiler and outline a cost/performance analysis that can be performed to decide when to use the test. Our conclusion is that the test should almost always be applied—because, as we show, the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop), is minimal. We present some experimental results on loops from the PERFECT Benchmarks which confirm our conclusion that this test can lead to significant speedups.


Measuring limits of parallelism and characterizing its vulnerability to resource constraints, Lawrence Rauchwerger, Pradeep K. Dubey, Ravi Nair, Proceedings of the 26th Annual International Symposium on Microarchitecture, pp. 105-117, Austin, Texas, USA, Dec 1993. DOI: 10.1109/MICRO.1993.282747
Keywords: Parallel Processing, Parallelising Compilers
Links : [Published]

BibTex

@INPROCEEDINGS{282747,
author={Rauchwerger, Lawrence and Dubey, Pradeep K. and Nair, Ravi},
booktitle={Proceedings of the 26th Annual International Symposium on Microarchitecture},
title={Measuring limits of parallelism and characterizing its vulnerability to resource constraints},
year={1993},
volume={},
number={},
pages={105-117},
doi={10.1109/MICRO.1993.282747}}


Abstract

Introduces a technique to enhance the ability of dynamic ILP processors to exploit (speculatively executed) parallelism. Existing branch prediction mechanisms used to establish a dynamic window from which ILP can be extracted are limited in their abilities to: (i) create a large, accurate dynamic window, (ii) initiate a large number of instructions into this window in every cycle, and (iii) traverse multiple branches of the control flow graph per prediction. The authors introduce control flow prediction which uses information in the control flow graph of a program to overcome these limitations. They discuss how information present in the control flow graph can be represented using multiblocks, and conveyed to the hardware using Control Flow Tables and Control Flow Prediction Buffers. They evaluate the potential of control flow prediction on an abstract machine and on a dynamic ILP processing model. The results indicate that control flow prediction is a powerful and effective assist to the hardware in making more informed run time decisions about program control flow.