General-purpose computing on graphics processing units

Article Id: WHEBN0001268939
Title: General-purpose computing on graphics processing units  
Author: World Heritage Encyclopedia
Language: English
Subject: AMD FireStream, Direct Rendering Infrastructure, GPGPU, Graphics hardware, Graphics Core Next
Collection: Computational Science, Emerging Technologies, Gpgpu, Graphics Hardware, Instruction Processing, Parallel Computing, Video Cards, Video Game Development
Publisher: World Heritage Encyclopedia

General-purpose computing on graphics processing units (GPGPU, rarely GPGP or GP²U) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).[1][2][3] The use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.[4] In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer due to the specialization in each chip.[5]

Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and CPUs that analyzes data as if it were in image or other graphic form. While GPUs generally operate at lower clock frequencies than CPUs, they usually have many times more cores (hundreds or more) and can thus operate on pictures and graphical data much faster than a traditional CPU, often by a factor of dozens or even hundreds. Migrating data into graphical form and then using the GPU to "look" at it and analyze it can therefore result in a profound speedup.

GPGPU pipelines developed out of scientific computing.


  • 1 History
  • 2 Implementations
    • 2.1 Mobile computers
  • 3 Hardware support
    • 3.1 Integer numbers
    • 3.2 Floating-point numbers
    • 3.3 Vectorization
  • 4 GPU vs. CPU
    • 4.1 Caches
    • 4.2 Energy efficiency
  • 5 Stream processing
    • 5.1 GPU programming concepts
      • 5.1.1 Computational resources
      • 5.1.2 Textures as stream
      • 5.1.3 Kernels
      • 5.1.4 Flow control
    • 5.2 GPU methods
      • 5.2.1 Map
      • 5.2.2 Reduce
      • 5.2.3 Stream filtering
      • 5.2.4 Scan
      • 5.2.5 Scatter
      • 5.2.6 Gather
      • 5.2.7 Sort
      • 5.2.8 Search
      • 5.2.9 Data structures
  • 6 Applications
    • 6.1 Bioinformatics
      • 6.1.1 Molecular dynamics
  • 7 See also
  • 8 References
  • 9 External links


History

General-purpose computing on GPUs only became practical and popular after ca. 2001, with the advent of both programmable shaders and floating point support on graphics processors. In particular, problems involving matrices and/or vectors, especially two-, three-, or four-dimensional vectors, were easy to translate to a GPU, which supports those types natively and operates on them at full speed. The scientific computing community's experiments with the new hardware started with a matrix multiplication routine (2001); one of the first common scientific programs to run faster on GPUs than CPUs was an implementation of LU factorization (2005).[6]

These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, OpenGL and DirectX. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook and Accelerator.[7][8]

These were followed by Nvidia's CUDA, which allowed programmers to ignore the underlying graphical concepts in favor of more common high-performance computing concepts.[6] Newer, hardware vendor-independent offerings include Microsoft's DirectCompute and Apple/Khronos Group's OpenCL.[6] This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.


Implementations

Any language that allows the code running on the CPU to poll a GPU shader for return values can be used to create a GPGPU framework.

OpenCL is the currently dominant open general-purpose GPU computing language, and is an open standard defined by the Khronos Group.[9] OpenCL provides a cross-platform GPGPU platform that additionally supports data parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia and ARM platforms.

The dominant proprietary framework is NVIDIA CUDA.[10] NVIDIA launched CUDA in 2006 as an SDK and API that allows algorithms to be written in the C programming language for execution on GeForce 8 series GPUs.

Programming standards for parallel computing include

OpenVIDIA was developed at University of Toronto during 2003-2005,[11] in collaboration with NVIDIA.

Microsoft introduced the "GPU Computing" language F#, and the DirectCompute GPU computing API, released with the DirectX 11 API.

MATLAB supports GPGPU acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server,[12] as well as third-party packages like Jacket.

GPGPU processing is also used by physics engines to simulate Newtonian physics; commercial implementations include Havok Physics FX and PhysX, both of which are typically used for computer and video games.

Close to Metal, now called Stream, is AMD/ATI's GPGPU technology for ATI Radeon-based GPUs.

C++ Accelerated Massive Parallelism is a library that accelerates execution of C++ code by taking advantage of the data-parallel hardware on GPUs.

Mobile computers

Due to the increasing power of mobile GPUs, general-purpose programming has also become available on mobile devices running the major mobile operating systems.

Google Android 4.2 enabled running Renderscript code on the mobile device GPU.[13] Apple introduced a proprietary Metal API for iOS applications, capable of executing arbitrary code through Apple's GPU compute shaders.

Hardware support

Computer graphics cards are produced by various vendors, such as NVIDIA and AMD/ATI. Cards from such vendors differ in their implementation of data-format support, such as integer and floating-point formats (32-bit and 64-bit). Microsoft introduced a "Shader Model" standard to help rank the various features of graphics cards into a simple Shader Model version number (1.0, 2.0, 3.0, etc.).

Integer numbers

Pre-DirectX 9 graphics cards only supported paletted or integer color types. Various formats are available, each containing a red element, a green element, and a blue element. Sometimes an additional alpha value is added, to be used for transparency. Common formats are:

  • 8 bits per pixel – sometimes palette mode, where each value is an index in a table with the real color value specified in one of the other formats. Sometimes three bits for red, three bits for green, and two bits for blue.
  • 16 bits per pixel – usually allocated as five bits for red, six bits for green, and five bits for blue.
  • 24 bits per pixel – eight bits for each of red, green, and blue
  • 32 bits per pixel – eight bits for each of red, green, blue, and alpha
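
As an illustration of the 16-bit layout listed above, the following C sketch packs three 8-bit channels into a 5-6-5 pixel and unpacks them again. The function names are hypothetical, and placing red in the top bits is an assumption; some hardware uses the opposite channel order.

```c
#include <stdint.h>

/* Pack 8-bit red, green and blue into one 16-bit pixel:
   5 bits red, 6 bits green, 5 bits blue.
   Assumption: red occupies the most significant bits. */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Unpack a 5-6-5 pixel. The low bits dropped while packing
   cannot be recovered and come back as zero. */
static void unpack_rgb565(uint16_t p, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)(((p >> 11) & 0x1F) << 3);
    *g = (uint8_t)(((p >> 5) & 0x3F) << 2);
    *b = (uint8_t)((p & 0x1F) << 3);
}
```

The round trip is lossy: a red value of 255 comes back as 248, which is one reason integer color formats are a poor fit for numerical work.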

Floating-point numbers

For early fixed-function or limited programmability graphics (i.e. up to and including DirectX 8.1-compliant GPUs) this was sufficient because this is also the representation used in displays. This representation does have certain limitations, however. Given sufficient graphics processing power even graphics programmers would like to use better formats, such as floating point data formats, to obtain effects such as high dynamic range imaging. Many GPGPU applications require floating point accuracy, which came with graphics cards conforming to the DirectX 9 specification.

DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. ATI's R300 series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while Nvidia's NV30 series supported both FP16 and FP32; other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.

Shader Model 3.0 altered the specification, increasing full precision requirements to a minimum of FP32 support in the fragment pipeline. ATI's Shader Model 3.0 compliant R5xx generation (Radeon X1000 series) supports just FP32 throughout the pipeline while Nvidia's NV4x and G7x series continued to support both FP32 full precision and FP16 partial precisions. Although not stipulated by Shader Model 3.0, both ATI and Nvidia's Shader Model 3.0 GPUs introduced support for blendable FP16 render targets, more easily facilitating the support for High Dynamic Range Rendering.

The implementations of floating point on Nvidia GPUs are mostly IEEE compliant; however, this is not true across all vendors.[14] This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs; some GPU architectures sacrifice IEEE compliance while others lack double-precision altogether. There have been efforts to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit to offloading the computation onto the GPU in the first place.[15]


Vectorization

Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once. For instance, if one color <R1, G1, B1> is to be modulated by another color <R2, G2, B2>, the GPU can produce the resulting color <R1*R2, G1*G2, B1*B2> in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional). Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions (SIMD) have long been available on CPUs.
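
Modulating one color by another is a component-wise product. A minimal C sketch follows; the `vec4` type and the `modulate` function are hypothetical names, and the scalar code below only spells out what a GPU would typically issue as a single vector operation.

```c
/* A four-component vector, for example an RGBA color.
   Hypothetical type used for illustration only. */
typedef struct { float x, y, z, w; } vec4;

/* Component-wise product of two vectors. Written out as four
   scalar multiplies here; vector hardware can do all four at once. */
static vec4 modulate(vec4 a, vec4 b)
{
    vec4 r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}
```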


GPU vs. CPU

Originally, data was simply passed one-way from a [...] image processing and computer vision, among other fields, as well as parallel processing generally. Some particularly heavily optimized pipelines have yielded speed increases of several hundred times the original CPU-based pipeline on one high-use task.

A simple example would be a GPU program that collects data about average lighting values as it renders a particular view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use edge detection to return both numerical information and a processed image representing outlines to a computer vision program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every pixel or other picture element in an image, it can analyze and average it (for the first example) or apply a Sobel edge filter or other convolution filter (for the second) with much greater speed than a CPU, which typically must access slower random access memory copies of the graphic in question.

GPGPU is fundamentally a software concept, not a hardware concept; it is a type of algorithm, not a piece of equipment. However, specialized equipment designs may even further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks thus may be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a "rack"), which adds a third layer - many computing units each using many CPUs to correspond to many GPUs. Some Bitcoin "miners" used such setups for high-quantity processing.


Caches

Historically, CPUs have used hardware-managed caches, but the earlier GPUs only provided software-managed local memories. However, as GPUs are increasingly used for general-purpose applications, state-of-the-art GPUs are being designed with hardware-managed multi-level caches,[16] which have helped GPUs move towards mainstream computing. For example, GT200 architecture GPUs did not feature an L2 cache, whereas the Fermi GPU has a 768 KB last-level cache, the Kepler GPU a 1536 KB last-level cache,[16][17] and the Maxwell GPU a 2048 KB last-level cache.

Energy efficiency

Several research projects have compared the energy efficiency of GPU with that of CPU and FPGA.[18]

Stream processing

GPUs are designed specifically for graphics and thus are very restrictive in operations and programming. Due to their design, GPUs are only effective for problems that can be solved using stream processing and the hardware can only be used in certain ways.

The following discussion referring to vertices, fragments and textures concerns mainly the legacy model of GPGPU programming, where graphics APIs (OpenGL or DirectX) were used to perform general-purpose computation. With the introduction of the CUDA (NVIDIA, 2007) and OpenCL (vendor-independent, 2008) general-purpose computing APIs, in new GPGPU codes it is no longer necessary to map the computation to graphics primitives. The stream processing nature of GPUs remains valid regardless of the APIs used. (See e.g.,[19])

GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors – processors that can operate in parallel by running one kernel on many records in a stream at once.

A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. In the GPUs, vertices and fragments are the elements in streams and vertex and fragment shaders are the kernels to be run on them. Since GPUs process elements independently there is no way to have shared or static data. For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.

Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise, memory access latency will limit the computational speedup.[20]
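
To make the definition concrete, the C sketch below computes the arithmetic intensity of two textbook workloads. The function names are hypothetical, and the operation and word counts are idealized assumptions: each input is read from memory once, each output is written once, and all reuse happens on-chip.

```c
#include <assert.h>

/* Operations performed per word of memory transferred. */
static double arithmetic_intensity(double operations, double words)
{
    assert(words > 0.0);
    return operations / words;
}

/* y[i] = a * x[i] + y[i] for n elements (assumed counts):
   2n operations, 3n words moved (read x, read y, write y). */
static double saxpy_intensity(double n)
{
    return arithmetic_intensity(2.0 * n, 3.0 * n);
}

/* n x n matrix product with ideal on-chip reuse (assumed counts):
   2n^3 operations, 3n^2 words moved (read A, read B, write C). */
static double matmul_intensity(double n)
{
    return arithmetic_intensity(2.0 * n * n * n, 3.0 * n * n);
}
```

Under these assumptions the vector update stays below one operation per word regardless of size, while the matrix product's intensity grows linearly with n, which is why the latter tends to be the better GPGPU candidate.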

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.

GPU programming concepts

Computational resources

There are a variety of computational resources available on the GPU:

  • Programmable processors – vertex, primitive, fragment and, mainly, compute pipelines allow the programmer to run kernels on streams of data
  • Rasterizer – creates fragments and interpolates per-vertex constants such as texture coordinates and color
  • Texture unit – read-only memory interface
  • Framebuffer – write-only memory interface

In fact, the programmer can substitute a write only texture for output instead of the framebuffer. This is accomplished either through Render to Texture (RTT), Render-To-Backbuffer-Copy-To-Texture (RTBCTT), or the more recent stream-out.

Textures as stream

The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.

Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.


Kernels

Kernels can be thought of as the body of loops. For example, a programmer operating on a grid on the CPU might have code that looks like this:

// Input and output grids have 10000 x 10000 or 100 million elements.

// Per-element computation, assumed to be defined elsewhere.
float do_some_hard_work(float value);

void transform_10k_by_10k_grid(float in[10000][10000], float out[10000][10000])
{
    for (int x = 0; x < 10000; x++) {
        for (int y = 0; y < 10000; y++) {
            // The next line is executed 100 million times
            out[x][y] = do_some_hard_work(in[x][y]);
        }
    }
}

On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.
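
The split between loop body and loop can be mimicked on the CPU. In the C sketch below, `kernel_fn`, `launch_2d` and `double_cell` are hypothetical names: the kernel holds only the per-element work, and `launch_2d` stands in for the hardware that invokes it once per grid cell. On a real GPU those invocations may run in parallel rather than in this serial order.

```c
#include <stddef.h>

/* Hypothetical kernel signature: called once per grid cell with
   the cell's coordinates, returns that cell's output value. */
typedef float (*kernel_fn)(const float *in, size_t x, size_t y, size_t width);

/* Stand-in for the GPU launch. The loops are serial here; no
   invocation depends on another, so they could run concurrently. */
static void launch_2d(kernel_fn kernel, const float *in, float *out,
                      size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            out[y * width + x] = kernel(in, x, y, width);
}

/* Example kernel: just the loop body, nothing about iteration. */
static float double_cell(const float *in, size_t x, size_t y, size_t width)
{
    return 2.0f * in[y * width + x];
}
```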

Flow control

In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.[21] Conditional writes could be accomplished using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.

Recent GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such as static branch resolution, pre-computation, predication, loop splitting,[22] and Z-cull[23] can be used to achieve branching when hardware support does not exist.
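
Predication, one of the methods listed above, can be sketched in C as follows. Both function names are hypothetical. The predicated version evaluates both outcomes and blends them with a 0/1 mask, so every element follows the same instruction sequence. This sketch assumes finite inputs, since multiplying an infinity or NaN by zero does not give zero.

```c
/* Branching version: control flow depends on the data. */
static float clamp_branch(float v, float limit)
{
    if (v > limit)
        return limit;
    return v;
}

/* Predicated version: no data-dependent jump. The comparison
   yields 1 or 0, which selects one of the two candidate results.
   Assumes v and limit are finite. */
static float clamp_predicated(float v, float limit)
{
    float take_limit = (float)(v > limit);   /* 1.0f or 0.0f */
    return take_limit * limit + (1.0f - take_limit) * v;
}
```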

GPU methods


Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.
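
A CPU-side sketch of the brightness example, in C. The function name `map_scale` is hypothetical, and the serial loop stands in for the per-fragment invocations a GPU would perform.

```c
#include <stddef.h>

/* Map: multiply every stream element by the same constant.
   Each iteration touches only its own element, so all of them
   are independent of one another. */
static void map_scale(const float *in, float *out, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```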


Reduce

Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream. This is called a reduction of the stream. Generally a reduction can be accomplished in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
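
The multi-step scheme can be sketched in C for a sum reduction. The name `reduce_sum` is hypothetical. Each pass roughly halves the stream, and within a pass every addition reads and writes different elements, so a GPU could perform them all at once; here they run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-pass sum reduction, done in place: buf is overwritten.
   Requires n >= 1. Returns the sum of the original n elements. */
static float reduce_sum(float *buf, size_t n)
{
    assert(n > 0);
    while (n > 1) {
        size_t half = (n + 1) / 2;   /* size of the next, smaller stream */
        /* Writes go to [0, half), reads come from [half, n):
           the additions in one pass never interfere. */
        for (size_t i = 0; i + half < n; i++)
            buf[i] += buf[i + half];
        n = half;
    }
    return buf[0];
}
```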

Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.
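
One common way to implement filtering is to build it from a scan followed by a scatter; this construction is an assumption here and is not taken from the text above. In the C sketch below, the hypothetical `filter_positive` keeps the positive values: an exclusive scan over the keep flags gives each surviving element its output slot, and a second, element-independent pass writes the survivors.

```c
#include <stddef.h>

/* Keep only the positive elements of in, preserving their order.
   out and slot are caller-provided arrays of at least n entries.
   Returns the number of elements kept. */
static size_t filter_positive(const float *in, float *out,
                              size_t *slot, size_t n)
{
    size_t kept = 0;

    /* Pass 1: exclusive scan of the keep flags (serial here). */
    for (size_t i = 0; i < n; i++) {
        slot[i] = kept;
        kept += (in[i] > 0.0f);
    }

    /* Pass 2: each kept element writes to its own slot,
       so the iterations are independent. */
    for (size_t i = 0; i < n; i++)
        if (in[i] > 0.0f)
            out[slot[i]] = in[i];

    return kept;
}
```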


Scan

The scan operation, also known as parallel prefix sum, takes in a vector (stream) of data elements and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is [a0, a1, a2, a3, ...], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...]. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has applications in e.g. quicksort and sparse matrix-vector multiplication.[19][24][25][26]
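
The two C functions below sketch the definitions for integer addition; both names are hypothetical. `exclusive_scan` is a plain serial reference. `inclusive_scan_steps` follows the step-doubling (Hillis-Steele) scheme, chosen here only because it is short: it is simple but not work-efficient, and the cited GPU implementations use more elaborate algorithms. Within each step, every element is computed independently from the previous step's values.

```c
#include <stddef.h>

/* Serial reference: out[k] = identity + in[0] + ... + in[k-1]. */
static void exclusive_scan(const int *in, int *out, size_t n, int identity)
{
    int running = identity;
    for (size_t i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

/* Inclusive scan in about log2(n) steps, result left in a.
   tmp is a caller-provided scratch array of n entries. Each step
   reads only the previous step's values, so the inner loop over i
   could run in parallel. */
static void inclusive_scan_steps(int *a, int *tmp, size_t n)
{
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i < n; i++)
            tmp[i] = (i >= stride) ? a[i] + a[i - stride] : a[i];
        for (size_t i = 0; i < n; i++)
            a[i] = tmp[i];
    }
}
```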


Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.

The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with an additional gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.
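
Both forms can be sketched in C; the function names are hypothetical. `scatter` writes each value to a computed address. `scatter_via_gather` produces the same result from the output side: each output slot, like a fragment with a fixed position, compares its own index against the emitted addresses and takes the matching value. This sketch assumes the addresses are distinct and in range.

```c
#include <stddef.h>

/* Direct scatter: element i is written to position addr[i]. */
static void scatter(const float *val, const size_t *addr,
                    float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[addr[i]] = val[i];
}

/* Scatter recast as gather: each of the m output slots searches
   the n (address, value) pairs for its own index. Slots that no
   address maps to are left unchanged. Costs n comparisons per slot. */
static void scatter_via_gather(const float *val, const size_t *addr,
                               size_t n, float *out, size_t m)
{
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            if (addr[i] == j)
                out[j] = val[i];
}
```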


Gather

Gather is the reverse of scatter: after scatter reorders elements according to a map, gather can restore the original order of the elements by using the same map.
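
A minimal C sketch of gather, with a hypothetical function name. Each output element reads from a computed address, which on the legacy graphics pipeline corresponds to a texture lookup at a computed coordinate.

```c
#include <stddef.h>

/* Gather: output element i is read from position index[i]. */
static void gather(const float *in, const size_t *index,
                   float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[index[i]];
}
```

For example, if the values {10, 20, 30} were scattered with the map {2, 0, 1} to give {20, 30, 10}, gathering from that result with the same map returns {10, 20, 30}.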


Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common GPU implementations use radix sort for integer and floating point data, and coarse-grained merge sort and fine-grained sorting networks for general comparable data.[27][28]
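
As an example of a sorting network, the C sketch below implements bitonic sort, picked here as a well-known network and not as the specific method of the cited works. The function name is hypothetical and the length must be a power of two. The sequence of compare-and-swap pairs is fixed in advance and does not depend on the data, and the pairs within one inner pass are disjoint, which is what makes such networks suit parallel hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Bitonic sorting network, ascending order, in place.
   n must be a power of two. */
static void bitonic_sort(int *a, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t k = 2; k <= n; k *= 2) {          /* size of merged runs */
        for (size_t j = k / 2; j > 0; j /= 2) {   /* compare distance */
            /* All pairs (i, i ^ j) in this pass are disjoint,
               so they could be processed in parallel. */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    int ascending = ((i & k) == 0);
                    if ((ascending && a[i] > a[partner]) ||
                        (!ascending && a[i] < a[partner])) {
                        int t = a[i];
                        a[i] = a[partner];
                        a[partner] = t;
                    }
                }
            }
        }
    }
}
```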


Search

The search operation allows the programmer to find a particular element within the stream, or possibly find neighbors of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to run multiple searches in parallel. The search technique used is mostly binary search on sorted elements.
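
The idea of running many independent searches can be sketched in C; `find_one` and `find_many` are hypothetical names. Each query is an ordinary binary search, and because no query depends on another, a GPU could assign one query per thread. The outer loop here is serial.

```c
#include <stddef.h>

/* Binary search in an ascending array.
   Returns the index of key, or n if key is absent. */
static size_t find_one(const int *sorted, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && sorted[lo] == key) ? lo : n;
}

/* Run m independent searches; found[q] receives the result for keys[q]. */
static void find_many(const int *sorted, size_t n,
                      const int *keys, size_t *found, size_t m)
{
    for (size_t q = 0; q < m; q++)
        found[q] = find_one(sorted, n, keys[q]);
}
```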

Data structures

A variety of data structures can be represented on the GPU:


Applications

The following are some of the areas where GPUs have been used for general purpose computing:


Bioinformatics

GPGPU usage in Bioinformatics:[35][55]

Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status
BarraCUDA | Sequence mapping software | Alignment of short sequencing reads | 6–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.7.105a
CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | Parallel search of Smith-Waterman database | 10–50x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.0.8
CUSHAW | Parallelized short read aligner | Parallel, accurate long read aligner – gapped alignments to large genomes | 10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.0.40
GPU-BLAST | Local search with fast k-tuple heuristic | Protein alignment according to blastp, multi CPU threads | 3–4x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 2.2.26
GPU-HMMER | Parallelized local and global search with profile hidden Markov models | Parallel local and global search of hidden Markov models | 60–100x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.3.2
mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | Scalable motif discovery algorithm based on MEME | 4–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 3.0.12
SeqNFind | A GPU accelerated sequence analysis toolset | Reference assembly, blast, Smith–Waterman, hmm, de novo assembly | 400x | T 2075, 2090, K10, K20, K20X | Yes | Available now
UGENE | Opensource Smith–Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | Fast short read alignment | 6–8x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.11
WideLM | Fits numerous linear models to a fixed design and response | Parallel linear regression on multiple similarly-shaped models | 150x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.1-1

Molecular dynamics

Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status
Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | Explicit and implicit solvent, hybrid Monte Carlo | 4–120x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 1.8.88
ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | Written for use on GPUs | 160 ns/day GPU version only | T 2075, 2090, K10, K20, K20X | Yes | Available now
AMBER | Suite of programs to simulate molecular dynamics on biomolecules | PMEMD: explicit and implicit solvent | 89.44 ns/day JAC NVE | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 12 + bugfix9
DL-POLY | Simulates macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.0 source only
CHARMM | MD package to simulate molecular dynamics on biomolecules | Implicit (5x), explicit (2x) solvent via OpenMM | TBD | T 2075, 2090, K10, K20, K20X | Yes | In development Q4/12
GROMACS | Simulation of biochemical molecules with complicated bond interactions | Implicit (5x), explicit (2x) solvent | 165 ns/day DHFR | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 4.6 in Q4/12
HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | Written for GPUs | 2x | T 2075, 2090, K10, K20, K20X | Yes | Available now
LAMMPS | Classical molecular dynamics package | Lennard-Jones, Morse, Buckingham, CHARMM, tabulated, coarse grain SDK, anisotropic Gay-Bern, RE-squared, "hybrid" combinations | 3–18x | T 2075, 2090, K10, K20, K20X | Yes | Available now
NAMD | Designed for high-performance simulation of large molecular systems | 100M atom capable | 6.44 ns/day STMV 585x 2050s | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.9
OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit and explicit solvent, custom forces | Implicit: 127–213 ns/day; Explicit: 18–55 ns/day DHFR | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.1.1

† Expected speedups are highly dependent on system configuration. GPU performance compared against multi-core x86 CPU socket. GPU performance benchmarked on GPU supported features and may be a kernel to kernel performance comparison. For details on configuration used, view application website. Speedups as per Nvidia in-house testing or ISV's documentation.

‡ Q=Quadro GPU, T=Tesla GPU. Nvidia recommended GPUs for this application. Please check with developer / ISV to obtain certification information.

See also


References

  1. ^ Fung, et al., "Mediated Reality Using Computer Graphics Hardware for Computer Vision", Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002), Seattle, Washington, USA, 7–10 October 2002, pp. 83–89.
  2. ^ An EyeTap video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality, ACM Personal and Ubiquitous Computing published by Springer Verlag, Vol.7, Iss. 3, 2003.
  3. ^ "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004): Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96
  4. ^ "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004), Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.
  5. ^ S. Mittal and J. Vetter (2015), A Survey of CPU-GPU Heterogeneous Computing Techniques, ACM Computing Surveys.
  6. ^ a b c
  7. ^
  8. ^
  9. ^ OpenCL at the Khronos Group
  10. ^ "As the two major programming frameworks for GPU computing, OpenCL and CUDA have been competing for mindshare in the developer community for the past few years."
  11. ^ James Fung, Steve Mann, Chris Aimone, "OpenVIDIA: Parallel GPU Computer Vision", Proceedings of the ACM Multimedia 2005, Singapore, 6–11 November 2005, pages 849–852
  12. ^
  13. ^
  14. ^ Mapping computational concepts to GPUs: Mark Harris. Mapping computational concepts to GPUs. In ACM SIGGRAPH 2005 Courses (Los Angeles, California, 31 July – 4 August 2005). J. Fujii, Ed. SIGGRAPH '05. ACM Press, New York, NY, 50.
  15. ^ Double precision on GPUs (Proceedings of ASIM 2005): Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating Double Precision (FEM) Simulations with (GPUs). Proceedings of ASIM 2005 – 18th Symposium on Simulation Technique, 2005.
  16. ^ a b "A Survey of Techniques for Managing and Leveraging Caches in GPUs", S. Mittal, JCSC, 23(8), 2014.
  17. ^
  18. ^ "A Survey of Methods for Analyzing and Improving GPU Energy Efficiency", Mittal et al., ACM Computing Surveys, 2014.
  19. ^ a b D. Göddeke, 2010. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Ph.D. dissertation, Technischen Universität Dortmund.
  20. ^ Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10) (2009) 56–67
  21. ^ GPU Gems – Chapter 34, GPU Flow-Control Idioms
  22. ^ Future Chips. "Tutorial on removing branches", 2011
  23. ^ GPGPU survey paper: John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Tim Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, volume 26, number 1, 2007, pp. 80–113.
  24. ^ S. Sengupta, M. Harris, Y. Zhang, J. D. Owens, 2007. Scan primitives for GPU computing. In T. Aila and M. Segal (eds.): Graphics Hardware (2007).
  25. ^ G. E. Blelloch, 1989. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11), pp. 1526-1538.
  26. ^ M. Harris, S. Sengupta, J. D. Owens. Parallel Prefix Sum (Scan) with CUDA. In NVIDIA: GPU Gems 3, Chapter 39.
  27. ^ Merrill, Duane. Allocation-oriented Algorithm Design with Application to GPU Computing. Ph.D. dissertation, Department of Computer Science, University of Virginia. Dec. 2011.
  28. ^ Sean Baxter. Modern gpu. 2013.
  29. ^ K. Crane, I. Llamas, S. Tariq, 2008. Real-Time Simulation and Rendering of 3D Fluids. In NVIDIA: GPU Gems 3, Chapter 30.
  30. ^ M. Harris, 2004. Fast Fluid Dynamics Simulation on the GPU. In NVIDIA: GPU Gems, Chapter 38.
  31. ^ Fast k-nearest neighbor search using GPU. In Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008. V. Garcia and E. Debreuve and M. Barlaud.
  32. ^ M. Cococcioni, R. Grasso, M. Rixen, Rapid prototyping of high performance fuzzy computing applications using high level GPU programming for maritime operations support, in Proceedings of the 2011 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), Paris, 11–15 April 2011
  33. ^
  34. ^
  35. ^ a b c Hasan Khondker S., Chatterjee Amlan , Radhakrishnan, Sridhar, and Antonio John K., "Performance Prediction Model and Analysis for Compute-Intensive Tasks on GPUs.", The 11th IFIP International Conference on Network and Parallel Computing (NPC-2014), Ilan, Taiwan, Sept. 2014, Lecture Notes in Computer Science (LNCS), pp 612-17, ISBN 978-3-662-44917-2.
  36. ^
  37. ^ Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474.
  38. ^
  39. ^ GPU computing in OR Vincent Boyer, Didier El Baz. "Recent Advances on GPU Computing in Operations Research". Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, On page(s): 1778 - 1787
  40. ^ RCPSP on GPU Libor Bukata, Premysl Sucha, Zdenek Hanzalek. "Solving the Resource Constrained Project Scheduling Problem using the parallel Tabu Search designed for the CUDA platform". Journal of Parallel and Distributed Computing (2014).
  41. ^ CTU-IIG Czech Technical University in Prague, Industrial Informatics Group (2015).
  42. ^ GPU-based Sorting in PostgreSQL Naju Mancheril, School of Computer Science – Carnegie Mellon University
  43. ^ SQream DB
  44. ^ AES on SM3.0 compliant GPUs. Owen Harrison, John Waldron, AES Encryption Implementation and Analysis on Commodity Graphics Processing Units. In proceedings of CHES 2007.
  45. ^ AES and modes of operations on SM4.0 compliant GPUs. Owen Harrison, John Waldron, Practical Symmetric Key Cryptography on Modern Graphics Hardware. In proceedings of USENIX Security 2008.
  46. ^ RSA on SM4.0 compliant GPUs. Owen Harrison, John Waldron, Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware. In proceedings of AfricaCrypt 2009.
  47. ^
  48. ^
  49. ^
  50. ^
  51. ^ GrAVity: A Massively Parallel Antivirus Engine. Giorgos Vasiliadis and Sotiris Ioannidis, GrAVity: A Massively Parallel Antivirus Engine. In proceedings of RAID 2010.
  52. ^
  53. ^ Gnort: High Performance Network Intrusion Detection Using Graphics Processors. Giorgos Vasiliadis et al., Gnort: High Performance Network Intrusion Detection Using Graphics Processors. In proceedings of RAID 2008.
  54. ^ Regular Expression Matching on Graphics Hardware for Intrusion Detection. Giorgos Vasiliadis et al., Regular Expression Matching on Graphics Hardware for Intrusion Detection. In proceedings of RAID 2009.
  55. ^
  56. ^ Open HMPP

External links

  • New Open Standard for Many-Core
  • OCLTools Open Source OpenCL Compiler and Linker
  • General-Purpose Computation Using Graphics Hardware
  • GPGPU Wiki
  • SIGGRAPH 2005 GPGPU Course Notes
  • IEEE VIS 2005 GPGPU Course Notes
  • NVIDIA Developer Zone
  • AMD GPU Tools
  • CPU vs. GPGPU
  • What is GPU Computing?
  • Tech Report article: "ATI stakes claims on physics, GPGPU ground" by Scott Wasson
  • GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model – porting a standard model to GPU hardware
  • GPGPU Computing @ Duke Statistical Science
  • GPGPU Programming in F# using the Microsoft Research Accelerator system
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply.