CUDA Papers

A collection of research papers and projects utilizing CUDA technology

Category Archives: GPU Optimization

Efficient Sparse Matrix-Vector Multiplication on CUDA Abstract The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) […]

Efficient Parallel Scan Algorithms for GPUs Abstract Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan […]
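
As a rough illustration of what a segmented scan computes (not the GPU algorithm designed in the paper), a sequential reference in C++ might look like the sketch below. The function name `segmented_inclusive_scan` and the head-flag convention (a nonzero flag marks the first element of a segment) are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for an inclusive segmented sum-scan.
// Assumption for this sketch: flags[i] != 0 marks the start of a new segment.
// With all flags zero, this reduces to an ordinary inclusive scan.
std::vector<int> segmented_inclusive_scan(const std::vector<int>& values,
                                          const std::vector<int>& flags) {
    assert(values.size() == flags.size());
    std::vector<int> out(values.size());
    int running = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (flags[i] != 0) {
            running = 0;  // a new segment starts here, so restart the sum
        }
        running += values[i];
        out[i] = running;
    }
    return out;
}
```

Each segment is scanned independently, which is why the primitive can stand in for a nested loop over variable-length sub-arrays stored in one flat array.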

Program Optimization Strategies for Data-Parallel Many-Core Processors Abstract Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly data-parallel applications for these platforms while lacking the substantial experience and knowledge needed to maximize application performance. In addition, hand-optimization even […]

Program Optimization Carving for GPU Computing Abstract Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique […]

CUDA-lite: Reducing GPU Programming Complexity Abstract The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage […]

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs Abstract As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to […]

Data Layout Transformation for Structured-Grid Codes on GPU Abstract We present data layout transformation as an effective performance optimization for memory-bound structured-grid applications for GPUs. Structured grid applications are a class of applications that compute grid cell values on a regular 2D, 3D or higher dimensional regular grid. Each output point is computed as a function of itself and its nearest neighbors. Stencil code is an instance of […]
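
To make "a function of itself and its nearest neighbors" concrete, here is a minimal sequential sketch of one 2D stencil step in C++. It is not taken from the paper: the function name `stencil_step`, the 5-point averaging rule, the row-major layout, and the choice to copy boundary cells unchanged are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One step of a hypothetical 5-point averaging stencil on an nx-by-ny grid
// stored in row-major order (index = y * nx + x).
// Interior cells become the mean of themselves and their four neighbors;
// boundary cells are copied unchanged (an assumption of this sketch).
std::vector<double> stencil_step(const std::vector<double>& in,
                                 std::size_t nx, std::size_t ny) {
    assert(in.size() == nx * ny);
    std::vector<double> out(in);  // start from a copy so boundaries carry over
    for (std::size_t y = 1; y + 1 < ny; ++y) {
        for (std::size_t x = 1; x + 1 < nx; ++x) {
            const std::size_t i = y * nx + x;
            out[i] = (in[i] + in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx]) / 5.0;
        }
    }
    return out;
}
```

The neighbor accesses `i - nx` and `i + nx` are a full row apart in memory, which hints at why the in-memory layout of the grid matters for memory-bound codes of this kind.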

An Adaptive Performance Modeling Tool for GPU Architectures Abstract This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks […]

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architecture Abstract Reduction is a common component of many applications, but can often be the limiting factor for parallelization. Previous reduction work has focused on detecting reduction idioms and parallelizing the reduction operation by minimizing data communications or exploiting more data locality. While these techniques can be useful, they are mostly limited to simple code structures. In this paper, we propose a […]