A GPU is a graphics card with a programmable interface. Insert it into a slot on
your motherboard, install the drivers, and you gain access to upwards of 100
processing units sharing 1GB or more of random access memory. What used to take
seconds now takes milliseconds. You go from noticeable delay to no delay at all.
This is what happens even when GPU processing is an afterthought: for an average
PC system, spend $400 on a mid-range GPU card and you are in a new world. It is
also possible to assemble a system that drives a stack of high-end GPU units.
For an investment of under $100K one can assemble a supercomputer capable of
providing real-time quantitative information for industrial-size portfolios.
CUDA is a software interface for the GPU. Not every graphics card supports CUDA.
CUDA allows GPU coding in a version of C++. It is possible to build
applications that run concurrently on the CPU and one or more GPUs. There is a
straightforward interface for memory exchanges between the CPU and the GPUs. For
later generations of GPUs, there is an interface for memory mapping between the
CPU and the GPUs. There are barrier, event and stream-based synchronizations,
atomic arithmetic and constructs for seamless scalability. Threads are extremely
lightweight. For example, it makes sense to create 256 threads to add two
vectors in 256 dimensions, as in the sketch below.
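The following is a minimal sketch of that example: one thread per coordinate, a
single block of 256 threads. The kernel and variable names are illustrative, not
taken from the text.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void addVectors(const float* a, const float* b, float* c)
    {
        int i = threadIdx.x;          // one thread per vector coordinate
        c[i] = a[i] + b[i];
    }

    int main()
    {
        const int N = 256;
        float hostA[N], hostB[N], hostC[N];
        for (int i = 0; i < N; ++i) { hostA[i] = float(i); hostB[i] = 2.0f * i; }

        float *devA, *devB, *devC;
        cudaMalloc(&devA, N * sizeof(float));
        cudaMalloc(&devB, N * sizeof(float));
        cudaMalloc(&devC, N * sizeof(float));
        cudaMemcpy(devA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);

        addVectors<<<1, N>>>(devA, devB, devC);   // 256 threads in one block

        cudaMemcpy(hostC, devC, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("c[100] = %f\n", hostC[100]);

        cudaFree(devA); cudaFree(devB); cudaFree(devC);
        return 0;
    }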
One crucial feature is the presence of very fast cache memory of significant
size on every processing core. For example, we no longer do matrix
multiplication by straightforward application of the definition. Instead, we
copy matrix blocks into cache memory in parallel and then do block-matrix
multiplication in parallel. For numerical techniques this means that methods
based on triangular factorization are no longer a good way to solve linear
equations, because they are adapted to sequential calculation. Matrix
multiplication, on the other hand, is ideally suited to this technology.
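A sketch of such block multiplication is shown below. It assumes square
matrices in row-major layout whose dimension n is a multiple of the tile size;
the kernel name and TILE constant are illustrative. Each block of threads
copies one tile of each factor into on-chip shared memory and accumulates the
block product there.

    #define TILE 16

    // launch as matMulTiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n)
    __global__ void matMulTiled(const float* A, const float* B, float* C, int n)
    {
        __shared__ float tileA[TILE][TILE];   // block of A cached on-chip
        __shared__ float tileB[TILE][TILE];   // block of B cached on-chip

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        // Walk along the shared dimension one tile at a time.
        for (int t = 0; t < n / TILE; ++t) {
            tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                  // wait until both tiles are loaded

            for (int k = 0; k < TILE; ++k)
                sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
            __syncthreads();                  // wait before overwriting the tiles
        }
        C[row * n + col] = sum;
    }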
To understand the limitations of the technology one needs to understand the
notion of a warp. The device (GPU) code is executed in groups of 32 threads
(called a "warp") controlled by the same command sequence. Thus, every flow
control operation (if, while, for) has the potential to split the warp and
introduce a substantial performance penalty. If too much of such divergence is
encountered, the CUDA runtime throws a "global stack overflow" exception. Such
an exception requires restarting the CUDA runtime. Even though flow control
commands are available in device code, the programmer is expected to shift most
flow control to the host (CPU) and submit code with a minimal amount of flow
control to the device. An elaborate example of such separation is presented in
the section (Scalar product in N dimensions). Naturally, there is no exception
throwing or handling in the device code.
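The toy kernel below, not taken from the text, illustrates the divergence
penalty. The branch depends on the thread index, so the two halves of each warp
take different paths and execute one after the other rather than in parallel; a
branch on a value that is uniform across the warp would not split it.

    __global__ void divergentKernel(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Threads 0-15 and 16-31 of each warp follow different paths,
        // so the warp is split and the branches are serialized.
        if (threadIdx.x % 32 < 16)
            data[i] = data[i] * 2.0f;   // first half-warp
        else
            data[i] = data[i] + 1.0f;   // second half-warp

        // Branching on a warp-uniform value (e.g. blockIdx.x) keeps the
        // whole warp on one path and causes no divergence.
    }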