CUDA Papers

A collection of research papers and projects utilizing CUDA technology

Category Archives: Algorithms

Rapid Multipole Graph Drawing on the GPU
Abstract: As graphics processors become powerful, ubiquitous and easier to program, they have also become more amenable to general purpose high-performance computing, including the computationally expensive task of drawing large graphs. This paper describes a new parallel analysis of the multipole method of graph drawing to support its efficient GPU implementation. We use a […]

Efficient Sparse Matrix-Vector Multiplication on CUDA
Abstract: The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) […]

Efficient Parallel Scan Algorithms for GPUs
Abstract: Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan […]

Accelerating Advanced MRI Reconstructions on GPUs
Abstract: Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. At present, MR imaging is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition (scan) times. Advanced image reconstruction algorithms can mitigate these […]

Accelerating Iterative Field-Compensated MR Image Reconstruction on GPUs
Abstract: We propose a fast implementation for iterative MR image reconstruction using Graphics Processing Units (GPUs). In MRI, iterative reconstruction with conjugate gradient algorithms allows for accurate modeling of the physics of the imaging system. Specifically, methods have been reported to compensate for the magnetic field inhomogeneity induced by the susceptibility differences near the air/tissue […]

Multi-GPU Implementation for Iterative MR Image Reconstruction with Field Correction
Abstract: Many advanced MRI image acquisition and reconstruction methods see limited application due to high computational cost in MRI. For instance, iterative reconstruction algorithms (e.g. non-Cartesian k-space trajectory, or magnetic field inhomogeneity compensation) can improve image quality but suffer from low reconstruction speed. General-purpose computing on graphics processing units (GPUs) has demonstrated significant performance speedups and cost […]

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architecture
Abstract: Reduction is a common component of many applications, but can often be the limiting factor for parallelization. Previous reduction work has focused on detecting reduction idioms and parallelizing the reduction operation by minimizing data communications or exploiting more data locality. While these techniques can be useful, they are mostly limited to simple code structures. In this paper, we propose a […]

Sparse regularization in MRI iterative reconstruction using GPUs
Abstract: Regularization is a common technique used to improve image quality in inverse problems such as MR image reconstruction. In this work, we extend our previous Graphics Processing Unit (GPU) implementation of MR image reconstruction with compensation for susceptibility-induced field inhomogeneity effects by incorporating an additional quadratic regularization term. Regularization techniques commonly impose the prior information that MR images are relatively […]

Benchmarking GPUs to Tune Dense Linear Algebra
Abstract: We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. Our GEMM routine runs 60% faster than the vendor implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs […]

High Performance Discrete Fourier Transforms on Graphics Processors
Abstract: We present novel algorithms for computing Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the […]