CUDA Papers

A collection of research papers and projects utilizing CUDA technology

Rapid Multipole Graph Drawing on the GPU Abstract As graphics processors become powerful, ubiquitous and easier to program, they have also become more amenable to general purpose high-performance computing, including the computationally expensive task of drawing large graphs. This paper describes a new parallel analysis of the multipole method of graph drawing to support its efficient GPU implementation. We use a […]

Efficient Sparse Matrix-Vector Multiplication on CUDA Abstract The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) […]
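The core operation this paper optimizes, SpMV over a compressed sparse row (CSR) matrix, can be sketched serially; this is a minimal Python illustration of the CSR layout only, not the paper's CUDA kernels, which parallelize the outer loop across threads or warps:

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits row i's entries in
    col_idx/values. A GPU kernel assigns rows (or warps per
    row) to threads instead of this serial loop.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A = [[1, 2], [0, 3]] stored as CSR, x = [1, 1]:
# spmv_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0])
# -> [3.0, 3.0]
```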

Efficient Parallel Scan Algorithms for GPUs Abstract Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan […]
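The semantics of scan and segmented scan are easy to state serially; the paper's contribution is computing them work-efficiently in parallel on the GPU. A minimal Python sketch of the definitions (not the authors' implementation):

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    """Serial inclusive scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def segmented_scan(xs, flags, op=lambda a, b: a + b):
    """Segmented inclusive scan: a flag of 1 starts a new segment,
    resetting the running accumulation."""
    out, acc = [], None
    for x, f in zip(xs, flags):
        acc = x if (f or acc is None) else op(acc, x)
        out.append(acc)
    return out

# inclusive_scan([1, 2, 3, 4])                -> [1, 3, 6, 10]
# segmented_scan([1, 2, 3, 4], [1, 0, 1, 0]) -> [1, 3, 3, 7]
```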

Fast BVH construction on GPUs Abstract We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized […]
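The linear ordering behind the first algorithm comes from Morton (Z-order) codes, which interleave the bits of quantized coordinates so that sorting primitives by code groups spatially nearby ones together. A minimal Python sketch of 30-bit Morton codes from 10-bit coordinates (a common convention; the bit-expansion constants are standard bit-twiddling, not taken from the paper):

```python
def expand_bits(v):
    """Spread the low 10 bits of v so that two zero bits separate
    each original bit (bit k moves to position 3k)."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3d(x, y, z):
    """30-bit Morton code from three 10-bit integer coordinates,
    interleaved as x y z x y z ... from the high bits down."""
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z)

# morton3d(1, 1, 1) -> 0b111 == 7
```

Sorting by these codes turns BVH construction into a sort followed by hierarchy emission, which is what gives the first algorithm its high parallel scalability.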

Program Optimization Strategies for Data-Parallel Many-Core Processors Abstract Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly data-parallel applications for these platforms while lacking the substantial experience and knowledge needed to maximize application performance. In addition, hand-optimization even […]

GPU Acceleration of Cutoff Pair Potential for Molecular Modeling Applications Abstract The advent of systems biology requires the simulation of ever larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present […]
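A cutoff pair potential accumulates a pairwise term only from atoms within a fixed cutoff distance of each evaluation point. A minimal serial Python sketch of the idea, using a hypothetical inverse-distance term rather than the paper's potential or its GPU spatial-binning scheme:

```python
import math

def cutoff_potential(point, atoms, cutoff):
    """Sum q / r over all atoms within `cutoff` of `point`.

    atoms: iterable of (x, y, z, charge) tuples. A GPU version
    bins atoms spatially so each thread inspects only the bins
    near its evaluation point instead of every atom.
    """
    px, py, pz = point
    total = 0.0
    for ax, ay, az, q in atoms:
        r = math.sqrt((ax - px) ** 2 + (ay - py) ** 2 + (az - pz) ** 2)
        if 0.0 < r <= cutoff:
            total += q / r
    return total
```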

Accelerating Advanced MRI Reconstructions on GPUs Abstract Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. At present, MR imaging is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition (scan) times. Advanced image reconstruction algorithms can mitigate these […]

Program Optimization Carving for GPU Computing Abstract Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique […]

CUDA-lite: Reducing GPU Programming Complexity Abstract The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage […]

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs Abstract As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to […]