As graphics processors become more powerful, ubiquitous, and easier to program, they have also become more amenable to general-purpose high-performance computing, including the computationally expensive task of drawing large graphs. This paper describes a new parallel analysis of the multipole method of graph drawing to support its efficient GPU implementation. We use a variation of the Fast Multipole Method to estimate the long-distance repulsive forces in force-directed layout. We support these multipole computations efficiently with a k-d tree constructed and traversed on the GPU. The algorithm achieves impressive speedup over previous CPU and GPU methods, drawing graphs with hundreds of thousands of vertices within a few seconds via CUDA on an NVIDIA GeForce 8800 GTX.

**Apeksha Godiyal, Jared Hoberock,** University of Illinois

**Michael Garland,** NVIDIA Corporation

**John C Hart,** University of Illinois
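The tree-based force approximation the abstract describes can be illustrated with a small host-side sketch (the code and names are ours, not the paper's CUDA implementation): a k-d tree built over vertex positions, where sufficiently distant cells are approximated by a single monopole (center-of-mass) term instead of the paper's full multipole expansion.

```python
import math

class Node:
    """A k-d tree node over 2D points, caching the cell's center of mass."""
    def __init__(self, pts, depth=0):
        self.n = len(pts)
        self.com = (sum(p[0] for p in pts) / self.n,
                    sum(p[1] for p in pts) / self.n)   # monopole term
        xs = [p[0] for p in pts]; ys = [p[1] for p in pts]
        self.radius = max(max(xs) - min(xs), max(ys) - min(ys))
        self.left = self.right = None
        if self.n > 1:
            axis = depth % 2                           # alternate split axis
            pts = sorted(pts, key=lambda p: p[axis])
            mid = self.n // 2
            self.left = Node(pts[:mid], depth + 1)
            self.right = Node(pts[mid:], depth + 1)

def repulsion(q, node, theta=0.7, k=1.0):
    """Approximate repulsive force on point q: open a cell only if it is
    large relative to its distance, otherwise use its center of mass."""
    dx = q[0] - node.com[0]; dy = q[1] - node.com[1]
    d = math.hypot(dx, dy)
    if node.left is None or (d > 0 and node.radius / d < theta):
        if d == 0:
            return (0.0, 0.0)                          # skip self-interaction
        f = k * node.n / (d * d)                       # 1/d repulsion, scaled by cell mass
        return (f * dx / d, f * dy / d)
    fl = repulsion(q, node.left, theta, k)
    fr = repulsion(q, node.right, theta, k)
    return (fl[0] + fr[0], fl[1] + fr[1])
```

A far-away cluster of four vertices thus contributes a single force term of magnitude `4/d²` rather than four separate pairwise terms, which is the source of the asymptotic savings.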

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity.

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

**Nathan Bell, Michael Garland,** NVIDIA Corporation
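As context for what an SpMV kernel computes, here is a minimal scalar sketch of the compressed sparse row (CSR) product, one of the formats the paper considers; one outer-loop iteration per row mirrors the natural row-per-thread GPU decomposition. The code is ours, not the paper's CUDA kernels.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A*x for a CSR matrix: row_ptr[r]..row_ptr[r+1] bounds the
    nonzeros of row r, col_idx gives their columns, vals their values."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 5, 6]] in CSR form
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])  # [3.0, 3.0, 15.0]
```

The imbalance problem the abstract mentions is visible even here: rows with many nonzeros do more inner-loop work than rows with few, which is why the paper's GPU kernels pair such formats with alternatives for irregular matrices.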

Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.

**Shubhabrata Sengupta,** University of California, Davis

**Mark Harris, Michael Garland,** NVIDIA Corporation
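For reference, the sequential semantics that the paper's parallel primitives must reproduce can be stated in a few lines of Python (ours, not the CUDPP routines): an inclusive scan accumulates a running operator over the array, and a segmented scan restarts the accumulation at every flagged segment head.

```python
def inclusive_scan(a, op=lambda x, y: x + y):
    """Inclusive scan: out[i] = a[0] op a[1] op ... op a[i]."""
    out, acc = [], None
    for v in a:
        acc = v if acc is None else op(acc, v)
        out.append(acc)
    return out

def segmented_scan(a, flags, op=lambda x, y: x + y):
    """flags[i] == 1 marks the start of a new segment; the running
    accumulation restarts there instead of carrying across segments."""
    out, acc = [], None
    for v, f in zip(a, flags):
        acc = v if (f or acc is None) else op(acc, v)
        out.append(acc)
    return out

inclusive_scan([1, 2, 3, 4])                 # [1, 3, 6, 10]
segmented_scan([1, 2, 3, 4], [1, 0, 1, 0])   # [1, 3, 3, 7]
```

The paper's contribution is realizing exactly this contract with bank-conflict-free parallel routines built from intra-warp scans, rather than the sequential loop shown here.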


We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. We combine the two into a hybrid algorithm that removes existing bottlenecks in GPU construction performance and scalability, significantly decreasing build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal costs are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.


**C. Lauterbach,** University of North Carolina at Chapel Hill

**M. Garland,** NVIDIA Corporation

**S. Sengupta,** University of California Davis

**D. Luebke,** NVIDIA Corporation

**D. Manocha,** University of North Carolina at Chapel Hill
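The Morton-code ordering at the heart of the first algorithm can be sketched in Python (this is the standard 30-bit 3D interleave widely used for this purpose; the paper's actual construction runs on the GPU): each coordinate is quantized to 10 bits and the bits of x, y, and z are interleaved, so sorting primitives by their codes groups spatially nearby primitives together.

```python
def expand_bits(v):
    """Spread the low 10 bits of v so there are two zero bits between
    consecutive bits (standard magic-number bit interleave)."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x, y, z):
    """30-bit Morton code for a point with coordinates in [0, 1)."""
    xi = min(max(int(x * 1024.0), 0), 1023)
    yi = min(max(int(y * 1024.0), 0), 1023)
    zi = min(max(int(z * 1024.0), 0), 1023)
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi)
```

Because the sort is the dominant cost and sorts parallelize well, this linearization is what gives the first algorithm its high parallel scalability.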

Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly data-parallel applications for these platforms while lacking the substantial experience and knowledge needed to maximize application performance. In addition, hand-optimization even by motivated and informed developers takes a significant amount of time and generally still underutilizes the performance of the hardware by double-digit percentages. This creates a need for structured and automatable optimization techniques that are capable of finding a near-optimal program configuration for this new class of architecture. My work discusses various strategies for optimizing programs on a highly data-parallel architecture with fine-grained sharing of resources. I first investigate useful strategies in optimizing a suite of applications. I then introduce program optimization carving, an approach that discovers high-performance application configurations for data-parallel, many-core architectures. Instead of applying a particular phase ordering of optimizations, it starts with an optimization space of major transformations and then reduces the space by examining the static code and pruning configurations that do not maximize desirable qualities in isolation or combination. Careful selection of pruning criteria for applications running on the NVIDIA GeForce 8800 GTX reduces the optimization space by as much as 98% while finding configurations within 1% of the best performance. Random sampling, in contrast, can require nearly five times as many configurations to find performance within 10% of the best. I also examine the technique’s effectiveness when varying pruning criteria.

The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low compute efficiency, a newer strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing the work-efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition.

**Christopher I. Rodrigues,** Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

**David J. Hardy, John E. Stone, Klaus Schulten,** Beckman Institute, University of Illinois at Urbana-Champaign

**Wen-Mei W. Hwu,** Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
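The uniform-grid spatial decomposition the abstract draws its lessons from can be sketched serially (our Python illustration, not the paper's CUDA code): atoms are binned into cells of cutoff width, so each grid point needs to examine only the 27 neighboring bins rather than every atom.

```python
import math

def potential_map(atoms, grid_dims, spacing, cutoff):
    """Electrostatic potential (sum of q/r, truncated at `cutoff`) sampled
    on a regular grid. atoms: list of (x, y, z, charge)."""
    nx, ny, nz = grid_dims
    # Bin atoms into a uniform grid of cutoff-sized cells.
    bins = {}
    for a in atoms:
        key = (int(a[0] // cutoff), int(a[1] // cutoff), int(a[2] // cutoff))
        bins.setdefault(key, []).append(a)
    phi = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                gx, gy, gz = i * spacing, j * spacing, k * spacing
                cx, cy, cz = int(gx // cutoff), int(gy // cutoff), int(gz // cutoff)
                # Only the 3x3x3 neighborhood of bins can hold in-range atoms.
                for key in ((cx + dx, cy + dy, cz + dz)
                            for dx in (-1, 0, 1)
                            for dy in (-1, 0, 1)
                            for dz in (-1, 0, 1)):
                    for (ax, ay, az, q) in bins.get(key, ()):
                        r = math.sqrt((ax - gx)**2 + (ay - gy)**2 + (az - gz)**2)
                        if 0 < r < cutoff:
                            phi[k][j][i] += q / r
    return phi
```

The paper's contribution is in how this decomposition is arranged to suit the C870's memory system; the cutoff truncation and neighbor-bin traversal shown here are the part that generalizes.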

Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. At present, MR imaging is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition (scan) times. Advanced image reconstruction algorithms can mitigate these limitations and improve image quality by simultaneously operating on scan data acquired with arbitrary trajectories and incorporating additional information such as anatomical constraints. However, the improvements in image quality come at the expense of a considerable increase in computation. This paper describes the acceleration of an advanced reconstruction algorithm on NVIDIA's Quadro FX 5600. Optimizations such as register-allocating the voxel data, tiling the scan data, and storing the scan data in the Quadro's constant memory dramatically reduce the reconstruction's required bandwidth to off-chip memory. The Quadro's special functional units provide substantial acceleration of the trigonometric computations in the algorithm's inner loops, and experimentally tuned code transformations increase the reconstruction's performance by an additional 20%.

**Sam S. Stone,** Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign

**Justin P. Haldar,** Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

**Stephanie C. Tsao, Wen-mei W. Hwu,** Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign

**Zhi-Pei Liang,** Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

**Bradley P. Sutton,** Bioengineering Department, Biomedical Imaging Center, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign

Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations that is likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.

**Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, John A. Stratton, Sain-Zee Ueng, Sara S. Baghsorkhi, Wen-mei W. Hwu,** Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
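The carving idea can be illustrated with a toy sketch (entirely our illustration: the optimization space, the criteria, and all names are hypothetical, and real carving derives its criteria from static code analysis): enumerate a configuration space, discard configurations that fail cheap static checks, and reserve hardware evaluation for the survivors.

```python
from itertools import product

def carve(space, criteria):
    """Keep only configurations satisfying every static pruning criterion;
    only the survivors would be compiled and timed on the hardware."""
    return [cfg for cfg in product(*space.values())
            if all(c(dict(zip(space, cfg))) for c in criteria)]

# Hypothetical optimization space for a tiled kernel.
space = {"tile": [4, 8, 16, 32], "unroll": [1, 2, 4], "prefetch": [False, True]}

# Example static criteria (invented for illustration):
criteria = [
    lambda c: c["tile"] * c["unroll"] <= 64,            # crude register-pressure proxy
    lambda c: not (c["prefetch"] and c["unroll"] == 1), # prefetch needs unrolling
]

survivors = carve(space, criteria)   # 18 of the original 24 configurations remain
```

The paper's result is that well-chosen criteria of this kind can prune up to 98% of a real optimization space while retaining a configuration within 1% of the best.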

The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.

**Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, Wen-mei W. Hwu,** Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

**Alexandros Papakonstantinou,** Electrical & Computer Eng. Dept., University of Illinois, Urbana-Champaign

**Karthik Gururaj,** Computer Science Dept., University of California, Los Angeles

**John A. Stratton, Deming Chen,** Electrical & Computer Eng. Dept., University of Illinois, Urbana-Champaign

**Jason Cong,** Computer Science Dept., University of California, Los Angeles

**Wen-Mei W. Hwu,** Electrical & Computer Eng. Dept., University of Illinois, Urbana-Champaign