CUDA Papers

A collection of research papers and projects utilizing CUDA technology

Data Layout Transformation for Structured-Grid Codes on GPU Abstract We present data layout transformation as an effective performance optimization for memory-bound structured-grid applications for GPUs. Structured grid applications are a class of applications that compute grid cell values on a regular 2D, 3D or higher dimensional grid. Each output point is computed as a function of itself and its nearest neighbors. Stencil code is an instance of […]
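The structured-grid pattern the paper targets can be sketched as a minimal 5-point 2D stencil kernel. This is a generic illustration, not the paper's code; names, the averaging weights, and the row-major layout are all assumptions for the example:

```cuda
// Minimal 5-point 2D stencil: each interior cell becomes a weighted
// average of itself and its four nearest neighbors. The grid is stored
// row-major; boundary cells are copied through unchanged.
__global__ void stencil2d(const float *in, float *out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int i = y * nx + x;
    if (x == 0 || x == nx - 1 || y == 0 || y == ny - 1) {
        out[i] = in[i];  // boundary: pass through
    } else {
        out[i] = 0.2f * (in[i] + in[i - 1] + in[i + 1]
                         + in[i - nx] + in[i + nx]);
    }
}
```

Each output performs five reads, one write, and only a handful of flops, which is why such codes are memory-bound and why the data layout in device memory dominates performance.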

An Adaptive Performance Modeling Tool for GPU Architectures Abstract This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks […]

Accelerating Iterative Field-Compensated MR Image Reconstruction on GPUs Abstract We propose a fast implementation for iterative MR image reconstruction using Graphics Processing Units (GPU). In MRI, iterative reconstruction with conjugate gradient algorithms allows for accurately modeling the physics of the imaging system. Specifically, methods have been reported to compensate for the magnetic field inhomogeneity induced by the susceptibility differences near the air/tissue […]

Multi-GPU Implementation for Iterative MR Image Reconstruction with Field Correction Abstract Many advanced MRI acquisition and reconstruction methods see limited application due to their high computational cost. For instance, iterative reconstruction algorithms (e.g. non-Cartesian k-space trajectory, or magnetic field inhomogeneity compensation) can improve image quality but suffer from low reconstruction speed. General-purpose computing on graphics processing units (GPU) has demonstrated significant performance speedups and cost […]

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architecture Abstract Reduction is a common component of many applications, but can often be the limiting factor for parallelization. Previous reduction work has focused on detecting reduction idioms and parallelizing the reduction operation by minimizing data communications or exploiting more data locality. While these techniques can be useful, they are mostly limited to simple code structures. In this paper, we propose a […]
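For context, the simple reduction pattern such work builds on is the classic tree-based shared-memory sum. This sketch is the textbook idiom, not the generalized scheme the paper proposes; the names and the power-of-two block size assumption are illustrative:

```cuda
// Tree-based block reduction: each block sums blockDim.x inputs into
// one partial result in shared memory (blockDim.x assumed a power of
// two). A second kernel launch or atomicAdd combines the partials.
__global__ void reduceSum(const float *in, float *partial, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}
```

The limitation the paper addresses is visible here: this idiom assumes a single associative operator over a flat array, whereas generalized reductions appear inside more complex loop structures.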

Sparse regularization in MRI iterative reconstruction using GPUs Abstract Regularization is a common technique used to improve image quality in inverse problems such as MR image reconstruction. In this work, we extend our previous Graphics Processing Unit (GPU) implementation of MR image reconstruction with compensation for susceptibility-induced field inhomogeneity effects by incorporating an additional quadratic regularization term. Regularization techniques commonly impose the prior information that MR images are relatively […]

Benchmarking GPUs to Tune Dense Linear Algebra Abstract We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. Our GEMM routine runs 60% faster than the vendor implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs […]
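The core idea behind a fast GEMM on these GPUs is blocking through shared memory. The sketch below shows only that blocking idea under simplifying assumptions (square row-major matrices, N divisible by the tile size); the paper's tuned kernels add register blocking and wider memory accesses, and nothing here is their actual code:

```cuda
#define TILE 16

// Shared-memory tiled SGEMM sketch: C = A * B for N x N row-major
// matrices, N assumed divisible by TILE. Each block computes one
// TILE x TILE tile of C, staging tiles of A and B in shared memory.
__global__ void sgemmTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Staging tiles in shared memory means each element of A and B is read from device memory N/TILE times fewer, which is what lets GEMM approach the hardware's compute peak rather than its bandwidth limit.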

High Performance Discrete Fourier Transforms on Graphics Processors Abstract We present novel algorithms for computing Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the […]

Bandwidth Intensive 3-D FFT kernel for GPUs using CUDA Abstract Most GPU performance “hypes” have focused on tightly-coupled applications with small memory bandwidth requirements, e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVidia CUDA, achieves nearly 80 GFLOPS on a top-end […]

Designing Efficient Sorting Algorithms for Manycore GPUs Abstract We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 […]
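One building block of a GPU radix sort is the counting pass over a small digit. This is a generic illustration of that pass using global atomics, not the paper's (far more optimized, scan-based) implementation; the 4-bit digit width and all names are assumptions:

```cuda
#define RADIX 16  // 4-bit digits -> 16 buckets

// One counting pass of an LSD radix sort: histogram the d-th 4-bit
// digit of each key. A full sort prefix-sums these counts to get
// scatter offsets, permutes the keys, and repeats for each digit.
__global__ void digitHistogram(const unsigned *keys, unsigned *count,
                               int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned digit = (keys[i] >> (4 * d)) & (RADIX - 1);
        atomicAdd(&count[digit], 1u);
    }
}
```

Because each pass is a stable permutation on a few bits, eight such passes fully sort 32-bit keys; the performance work is in replacing the global atomics with per-block histograms and scans.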