CUDA Papers

A collection of research papers and projects utilizing CUDA technology

Data Layout Transformation for Structured-Grid Codes on GPU

http://impact.crhc.illinois.edu/ftp/workshop/sung.pdf

Abstract

We present data layout transformation as an effective performance optimization for memory-bound structured-grid applications on GPUs. Structured grid applications are a class of applications that compute grid cell values on a regular 2D, 3D, or higher-dimensional grid, where each output point is computed as a function of itself and its nearest neighbors; stencil codes are an instance of this class. Examples of structured grid applications include fluid dynamics and heat distribution codes that solve partial differential equations with an iterative solver on a dense multidimensional array. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamic array sizes. We first present a formulation that enables automatic data layout transformations for structured grid code in CUDA. We then model the DRAM banking and interleaving scheme of the GTX 280 GPU through microbenchmarking. We developed a layout transformation methodology that statically chooses a good layout given a model of the memory system. The transformation, which distributes concurrent memory requests evenly across DRAM channels and banks, provides substantial speedup for structured grid applications by improving their memory-level parallelism.
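As a rough illustration of the idea, the sketch below contrasts a row-major layout with a transformed layout that rotates tiles across rows. The grid size, tile width, and toy banking model are illustrative assumptions, not the paper's actual GTX 280 parameters; a CPU-side Python sketch stands in for the CUDA index rewriting the paper automates.

```python
# Hypothetical parameters for illustration only (not the paper's values).
NX, NY = 16, 8
TILE = 4          # assumed bank-interleave granularity, in words
NBANKS = 4        # assumed number of DRAM banks/channels

def rowmajor(x, y):
    return y * NX + x

def tiled(x, y):
    # Split x into (tile, offset); rotate tiles across rows so that
    # threads walking down a column hit different banks.
    t, o = divmod(x, TILE)
    ntiles = NX // TILE
    t = (t + y) % ntiles          # per-row tile rotation
    return (y * ntiles + t) * TILE + o

# Both layouts must be bijections over the grid: every cell gets a unique
# slot, so a kernel can be rewritten to use either index function.
assert sorted(rowmajor(x, y) for y in range(NY) for x in range(NX)) == \
       list(range(NX * NY))
assert sorted(tiled(x, y) for y in range(NY) for x in range(NX)) == \
       list(range(NX * NY))

def bank(addr):
    # Toy banking model: addresses interleave across banks in TILE-word units.
    return (addr // TILE) % NBANKS

# Walking down column 0: the row-major layout hammers a single bank, while
# the transformed layout spreads the same accesses over all four.
col0_rm = {bank(rowmajor(0, y)) for y in range(NY)}
col0_tl = {bank(tiled(0, y)) for y in range(NY)}
print(len(col0_rm), len(col0_tl))   # 1 4
```

The point of the sketch is that a layout change is just a different index function; the paper's contribution is choosing that function automatically from a measured memory-system model.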

Authors

I-Jui Sung, Wen-Mei Hwu, University of Illinois at Urbana-Champaign

An Adaptive Performance Modeling Tool for GPU Architectures

http://impact.crhc.illinois.edu/ftp/conference/sara.pdf

Abstract

This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but present challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of differently optimized kernel implementations. We also describe our approach to extracting the performance model automatically from kernel code.
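The paper's work-flow-graph model is far richer than this, but a deliberately simplified, roofline-style sketch conveys the basic idea of bottleneck-driven prediction: estimate kernel time as the maximum of the compute and memory pipelines, and penalize uncoalesced access. All hardware numbers and the 4x uncoalescing penalty below are illustrative assumptions, not measured figures from the paper.

```python
PEAK_FLOPS = 900e9        # hypothetical peak FLOP/s
PEAK_BW = 140e9           # hypothetical DRAM bandwidth, bytes/s

def predict_time(flops, bytes_moved, coalesced=True):
    """Estimate kernel time as the max of compute and memory pipelines.
    Uncoalesced accesses are modeled (crudely) as wasting 3/4 of the bus."""
    eff_bw = PEAK_BW if coalesced else PEAK_BW / 4
    return max(flops / PEAK_FLOPS, bytes_moved / eff_bw)

# A memory-bound kernel: 2 FLOPs per 12 bytes moved (e.g. a*x + y on floats).
n = 1 << 24
t_good = predict_time(2 * n, 12 * n, coalesced=True)
t_bad = predict_time(2 * n, 12 * n, coalesced=False)
print(round(t_bad / t_good, 1))   # the uncoalesced variant is ~4x slower
```

Even this toy model ranks the two implementations correctly, which is all an auto-tuning compiler needs from it; the paper's abstract interpretation additionally accounts for latency hiding, bank conflicts, and divergence.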

Authors

Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

Accelerating Iterative Field-Compensated MR Image Reconstruction on GPUs

http://impact.crhc.illinois.edu/ftp/conference/isbi2010.pdf

Abstract

We propose a fast implementation for iterative MR image reconstruction using graphics processing units (GPUs). In MRI, iterative reconstruction with conjugate gradient algorithms allows for accurately modeling the physics of the imaging system. Specifically, methods have been reported to compensate for the magnetic field inhomogeneity induced by the susceptibility differences near the air/tissue interface in the human brain (such as the orbitofrontal cortex). Our group has previously presented an algorithm for field inhomogeneity compensation using a magnetic field map and its gradients. However, classical iterative reconstruction algorithms are computationally costly and thus significantly increase the computation time. To remedy this problem, one can utilize the fact that these iterative MR image reconstruction algorithms are highly parallelizable; parallel computational hardware, such as a GPU, can dramatically improve their performance. In this work, we present an implementation of our field inhomogeneity compensation technique using an NVIDIA CUDA (Compute Unified Device Architecture)-enabled GPU. We show that the proposed implementation reduces the computation time by around two orders of magnitude (compared with a non-GPU implementation) while accurately compensating for field inhomogeneity.
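The conjugate gradient (CG) iteration at the heart of such reconstructions is built from matrix-vector products, dot products, and axpy updates, which is exactly why it maps well onto a GPU. As a generic illustration (not the paper's field-corrected system matrix), here is a minimal pure-Python CG on a toy symmetric positive-definite system:

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, iters=50, tol=1e-12):
    """Textbook conjugate gradient for SPD A; every step is a matvec,
    a dot product, or a vector update -- all embarrassingly parallel."""
    x = [0.0] * len(b)
    r = b[:]                       # residual b - A x, with x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]       # toy SPD system, not an MRI operator
b = [1.0, 2.0]
x = cg(A, b)
print([round(v, 6) for v in x])    # [0.090909, 0.636364]
```

In the actual reconstruction, the matvec applies the (non-uniform Fourier plus field-map) forward model, which dominates the runtime and is the part offloaded to the GPU.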

Authors

Yue Zhuo, Department of Bioengineering, University of Illinois at Urbana-Champaign

Xiao-Long Wu, Justin P. Haldar, Wen-mei Hwu, Zhi-pei Liang, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

Bradley P. Sutton, Department of Bioengineering, University of Illinois at Urbana-Champaign

Multi-GPU Implementation for Iterative MR Image Reconstruction with Field Correction

http://impact.crhc.illinois.edu/ftp/conference/ismrm2010.pdf

Abstract

Many advanced MRI acquisition and reconstruction methods see limited application due to their high computational cost. For instance, iterative reconstruction algorithms (e.g., for non-Cartesian k-space trajectories or magnetic field inhomogeneity compensation) can improve image quality but suffer from low reconstruction speed. General-purpose computing on graphics processing units (GPUs) has demonstrated significant performance speedups and cost reductions in science and engineering applications. In fact, GPUs can offer significant speedup due to MRI's naturally parallel data structure, e.g., multi-shot, multi-coil, multi-slice, multi-time-point, etc. We propose an implementation of iterative MR image reconstruction with magnetic field inhomogeneity compensation on multiple GPUs. The MR image model is based on a non-Cartesian (i.e., spiral) trajectory in k-space and can compensate for both geometric distortion and some signal loss induced by susceptibility gradients.

Authors

Y. Zhuo, Bioengineering, University of Illinois at Urbana-Champaign

X-L. Wu, J. P. Haldar, W-M. W. Hwu, Z-P. Liang, Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

B. P. Sutton, Bioengineering, University of Illinois at Urbana-Champaign

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architecture

http://impact.crhc.illinois.edu/ftp/conference/fgc2010.pdf

Abstract

Reduction is a common component of many applications but can often be the limiting factor for parallelization. Previous reduction work has focused on detecting reduction idioms and parallelizing the reduction operation by minimizing data communication or exploiting more data locality. While these techniques can be useful, they are mostly limited to simple code structures. In this paper, we propose a method for exploiting more parallelism by isolating the reduction from users of the intermediate results. The other main contribution of our work is enabling the parallelization of more complex reduction codes, including those that involve the use of intermediate reduction results. The proposed transformations are often implemented by programmers in an ad hoc manner, but to the best of our knowledge no previous work has automated these transformations for many-core architectures. We show that the automatic transformations can result in significant speedup compared to the original code using two benchmark applications.
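The basic parallel-reduction pattern being generalized here is a two-phase decomposition: independent per-block partial reductions with no communication, then a small combine step. A CPU sketch, with list slices standing in for thread blocks (a simplification of the GPU version, not the paper's transformation itself):

```python
def blocked_reduce(data, nblocks, op, identity):
    """Two-phase reduction: per-block partials, then a combine pass.
    `op` must be associative for the result to match a serial reduction."""
    # Phase 1: each block reduces its slice independently (no communication).
    size = (len(data) + nblocks - 1) // nblocks
    partials = []
    for b in range(nblocks):
        acc = identity
        for v in data[b * size:(b + 1) * size]:
            acc = op(acc, v)
        partials.append(acc)
    # Phase 2: one small reduction over the partial results.
    acc = identity
    for v in partials:
        acc = op(acc, v)
    return acc

data = list(range(1, 101))
total = blocked_reduce(data, nblocks=8, op=lambda a, b: a + b, identity=0)
print(total)   # 5050, independent of the number of blocks
```

The paper's transformations go beyond this simple case: they restructure code where intermediate reduction results feed other computation, so the two phases can still be separated and parallelized.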

Authors

Xiao-Long Wu, Nady Obeid, Wen-Mei Hwu, University of Illinois at Urbana-Champaign

Sparse regularization in MRI iterative reconstruction using GPUs

http://impact.crhc.illinois.edu/ftp/conference/xiaolong-2010.pdf

Abstract

Regularization is a common technique used to improve image quality in inverse problems such as MR image reconstruction. In this work, we extend our previous Graphics Processing Unit (GPU) implementation of MR image reconstruction with compensation for susceptibility-induced field inhomogeneity effects by incorporating an additional quadratic regularization term. Regularization techniques commonly impose the prior information that MR images are relatively smooth by penalizing large changes in intensity between neighboring voxels. However, the associated computations often increase data access and the overall computational load, which can lead to slower image reconstruction. This motivates us to adopt a GPU-enabled implementation of spatial regularization using sparse matrices. This implementation enables the computations for the entire reconstruction procedure to be done on the GPU, which avoids the memory bandwidth bottlenecks associated with frequent communication between the GPU and CPU. Both the CPU and GPU code of this implementation will be available for release at the time of the conference.
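A quadratic roughness penalty of this kind is typically written as ||Dx||^2 for a sparse finite-difference operator D, so both the penalty and its gradient reduce to sparse matrix-vector products. A minimal 1-D sketch in pure Python, with D stored in coordinate (COO) form (an illustrative stand-in for the paper's sparse-matrix implementation, not its actual code):

```python
def first_diff_coo(n):
    """Sparse first-difference operator D as (row, col, value) triples:
    (D x)[i] = x[i+1] - x[i]."""
    entries = []
    for i in range(n - 1):
        entries.append((i, i, -1.0))
        entries.append((i, i + 1, 1.0))
    return entries

def spmv(entries, nrows, x):
    # Sparse matrix-vector product over COO entries.
    y = [0.0] * nrows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y

x = [0.0, 0.0, 5.0, 0.0, 0.0]      # a spiky, "rough" image row
D = first_diff_coo(len(x))
Dx = spmv(D, len(x) - 1, x)
penalty = sum(d * d for d in Dx)   # ||D x||^2: large for rough x
print(penalty)                     # 50.0
```

A smooth signal (e.g., a constant row) has zero penalty, which is exactly the prior the abstract describes; in 2-D or 3-D the operator simply gains a difference per spatial direction.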

Authors

Yue Zhuo, Xiao-Long Wu, Justin Haldar, Wen-mei Hwu, Zhi-Pei Liang, Bradley P. Sutton, University of Illinois at Urbana-Champaign

Benchmarking GPUs to Tune Dense Linear Algebra

http://portal.acm.org/ft_gateway.cfm?id=1413402&type=pdf&doid2=1413370.1413402

http://www2.computer.org/portal/c/document_library/get_file?folderId=97697&name=DLFE-3337.pdf

Abstract

We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. Our GEMM routine runs 60% faster than the vendor implementation and approaches the peak of hardware capabilities. Our LU, QR, and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~300 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit register blocking to optimize GEMM and the heterogeneity of the system by computing on both the GPU and the CPU. This study includes detailed benchmarking of the GPU memory system that reveals the sizes and latencies of its caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
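Register blocking means keeping a small tile of the output matrix in registers for the entire inner (k) loop, so each element of A and B loaded is reused across the whole tile. A pure-Python sketch of that loop structure (the tile accumulator `c` plays the role of the register block; sizes are illustrative and this is not the paper's GEMM):

```python
def blocked_matmul(A, B, BS=2):
    """C = A @ B with a BSxBS output tile accumulated per (i0, j0) block."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, BS):
        for j0 in range(0, m, BS):
            # Accumulator tile: stays "in registers" for the whole k loop.
            c = [[0.0] * BS for _ in range(BS)]
            for p in range(k):
                for i in range(min(BS, n - i0)):
                    for j in range(min(BS, m - j0)):
                        c[i][j] += A[i0 + i][p] * B[p][j0 + j]
            # Write the finished tile back to C once.
            for i in range(min(BS, n - i0)):
                for j in range(min(BS, m - j0)):
                    C[i0 + i][j0 + j] = c[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(blocked_matmul(A, B))        # [[19.0, 22.0], [43.0, 50.0]]
```

On the GPU, each thread owns such a tile; the paper's insight is that treating the GPU as a vector machine and sizing this register block aggressively is what closes the gap to peak.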

Authors

Vasily Volkov, James W. Demmel, University of California, Berkeley

High Performance Discrete Fourier Transforms on Graphics Processors

http://portal.acm.org/ft_gateway.cfm?id=1413373&type=pdf&doid2=1413370.1413373

http://www2.computer.org/portal/c/document_library/get_file?folderId=97697&name=DLFE-3346.pdf

Abstract

We present novel algorithms for computing Fourier transforms with high performance on GPUs. We present hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein’s algorithm. We use modular arithmetic in Bluestein’s algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA’s CUFFT library and an optimized CPU implementation (Intel’s MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained up to 250 GFlops, a 3x performance improvement over CUFFT, and a 5-25x improvement over MKL on large sizes.
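The paper's kernels use a Stockham-formulated, shared-memory FFT; as a minimal illustration of the underlying radix-2 divide-and-conquer (not the Stockham variant, and omitting the mixed-radix and Bluestein machinery), here is a recursive Cooley-Tukey FFT in pure Python, checked against a direct DFT:

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle * odd half
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def dft(x):
    """Direct O(n^2) DFT, used here only as a reference for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

x = [complex(i) for i in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```

The Stockham formulation computes the same butterflies but reorders data between passes so no separate bit-reversal step is needed, which is what makes it a good fit for GPU shared memory.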

Authors

Naga Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli, Microsoft Corporation

Bandwidth Intensive 3-D FFT kernel for GPUs using CUDA

http://portal.acm.org/ft_gateway.cfm?id=1413376&type=pdf&doid2=1413370.1413376

http://www2.computer.org/portal/c/document_library/get_file?folderId=97697&name=DLFE-3317.pdf

Abstract

Most GPU performance “hype” has focused on tightly coupled applications with small memory bandwidth requirements, e.g., N-body simulation, but GPUs are also commodity vector machines sporting substantial memory bandwidth; effective programming methodologies for exploiting it, however, have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed strided memory accesses. Applied to real applications, our kernel achieves an orders-of-magnitude boost in power and cost versus performance metrics. The off-card bandwidth limitation is still an issue; it could be alleviated somewhat by confining application kernels within the card, while the ideal solution would be faster GPU interfaces.

Authors

Akira Nukada, Yasuhiko Ogata, Toshio Endo, Satoshi Matsuoka

Tokyo Institute of Technology

Designing Efficient Sorting Algorithms for Manycore GPUs

http://mgarland.org/files/papers/nvr-2008-001.pdf

Abstract

We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and 2 times faster than other CUDA-based radix sorts. It is also highly competitive with even the most carefully optimized multi-core CPU sorting routines.

To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA’s Tesla architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
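The least-significant-digit radix sort described above alternates, per digit, a histogram, a prefix sum (the parallel scan primitive the authors highlight), and a stable scatter. A pure-Python sketch of that pass structure (the digit width is illustrative; the GPU version tunes it and performs each phase in parallel per thread block):

```python
RADIX_BITS = 4                     # assumed digit width, for illustration
RADIX = 1 << RADIX_BITS

def radix_sort(keys, key_bits=32):
    """LSD radix sort of non-negative ints: histogram, scan, stable scatter."""
    for shift in range(0, key_bits, RADIX_BITS):
        # Histogram of the current digit (done per-block on the GPU).
        counts = [0] * RADIX
        for k in keys:
            counts[(k >> shift) & (RADIX - 1)] += 1
        # Exclusive prefix sum over the histogram -> scatter base offsets.
        offsets, total = [0] * RADIX, 0
        for d in range(RADIX):
            offsets[d], total = total, total + counts[d]
        # Stable scatter into the output buffer for this pass.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & (RADIX - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

data = [170, 45, 75, 90, 802, 24, 2, 66]
print(radix_sort(data))            # [2, 24, 45, 66, 75, 90, 170, 802]
```

Stability of the scatter is what lets each pass preserve the ordering established by the lower digits; on the GPU, the prefix sum is exactly the data-parallel scan primitive the paper builds on.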

Authors

Nadathur Satish, University of California, Berkeley

Mark Harris, Michael Garland, NVIDIA Corporation