CUDA Toolkit 3.2 Math Library Performance
January 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
- cuFFT – Fast Fourier Transforms library
- cuBLAS – complete BLAS library
- cuSPARSE – sparse matrix library
- cuRAND – random number generation (RNG) library
For more information on CUDA libraries:
http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
- Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7
- Higher performance and accuracy for 1D transform sizes that contain large prime factors
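As a sketch of how these transforms are invoked through cuFFT's host API, here is a minimal single-precision 1D complex-to-complex example (error checking omitted; the transform size N is an arbitrary illustration):

```c
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int N = 1 << 20;          /* 2^20-point transform, an arbitrary example size */
    cufftComplex *d_signal;
    cufftHandle plan;

    /* Allocate device memory for the signal (assumed filled, e.g. via cudaMemcpy) */
    cudaMalloc((void **)&d_signal, sizeof(cufftComplex) * N);

    /* Plan a single-precision complex-to-complex 1D FFT of size N */
    cufftPlan1d(&plan, N, CUFFT_C2C, 1 /* batch */);

    /* Execute the forward transform in place */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}
```

Compile with nvcc and link against -lcufft; running it requires a CUDA-capable GPU.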
cuFFT 3.2 up to 8.8x Faster than MKL
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs.
[Charts: single- and double-precision 1D radix-2 performance in GFLOPS for cuFFT, MKL, and FFTW, over transform sizes 2^6 through 2^23]
Performance may vary based on OS version and motherboard configuration
* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU
cuFFT 1D Radix-3 up to 18x Faster than MKL
- 18x for single precision, 15x for double precision
- Similar acceleration for radix-5 and radix-7
[Charts: single- and double-precision 1D radix-3 performance in GFLOPS for cuFFT (ECC off), cuFFT (ECC on), and MKL, over transform sizes 3^2 through 3^16]
* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
cuFFT 3.2 up to 100x Faster than cuFFT 3.1
Speedup for C2C single precision, cuFFT 3.2 vs. cuFFT 3.1, on Tesla C2050 (Fermi).
[Chart: speedup on a log scale (1x to 100x) over 1D transform sizes from 0 to 16384]
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation:
- Supports all 152 routines for single, double, complex, and double-complex precisions
New in CUDA 3.2:
- 7x faster GEMM (matrix multiply) on Fermi GPUs
- Higher SGEMM and DGEMM performance for all matrix sizes and all transpose cases (NN, TT, TN, NT)
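A minimal sketch of an SGEMM call through the cuBLAS host API of this era (the legacy, handle-free interface that shipped with CUDA 3.2; error checking omitted, matrices assumed column-major as BLAS requires):

```c
#include <cublas.h>

/* Compute C = alpha * A * B + beta * C for n x n single-precision
 * matrices, using the legacy (pre-handle) cuBLAS interface. */
void sgemm_example(int n, const float *A, const float *B, float *C)
{
    float *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    /* Copy column-major input matrices to the GPU */
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    /* 'N', 'N' selects the NN (no-transpose) case; the TT, TN, and NT
     * cases mentioned above are selected by passing 'T' instead */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
}
```

Link with -lcublas. Note that later toolkits (CUDA 4.0 onward) introduced the handle-based cublas_v2 interface with a different signature.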
Up to 8x Speedup for all GEMM Types
GEMM performance on 4K by 4K matrices (GFLOPS):

  Routine   cuBLAS 3.2   MKL
  SGEMM     775          78
  CGEMM     636          80
  DGEMM     301          39
  ZGEMM     295          40

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuBLAS 3.2: Consistently 7x Faster DGEMM Performance
[Chart: DGEMM GFLOPS vs. matrix dimension (m = n = k) for cuBLAS 3.2, cuBLAS 3.1, and MKL 10.2.3 with 4 threads; cuBLAS 3.2 runs ~8% above cuBLAS 3.1 and 7x faster than MKL, while cuBLAS 3.1 shows large performance variance]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuSPARSE
New library for sparse linear algebra:
- Conversion routines for dense, COO, CSR, and CSC formats
- Optimized sparse matrix-vector multiplication: y = alpha * A * x + beta * y
[Figure: the SpMV operation illustrated with a 4x4 sparse matrix (nonzeros 1.0 through 7.0) and a dense vector x = (1.0, 2.0, 3.0, 4.0)]
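A sketch of that SpMV operation through the cuSPARSE host API of this era (the CUDA 3.2-vintage interface, in which scalars are passed by value; later toolkits pass them by pointer and add an nnz argument). All device pointers are assumed to be already allocated and populated in CSR format:

```c
#include <cusparse.h>

/* y = alpha * A * x + beta * y for an m x n sparse matrix in CSR format */
void spmv_example(int m, int n,
                  const float *d_csrVal, const int *d_csrRowPtr,
                  const int *d_csrColInd,
                  const float *d_x, float *d_y)
{
    cusparseHandle_t handle;
    cusparseMatDescr_t descr;

    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);   /* defaults: general matrix, zero-based indexing */

    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, 1.0f /* alpha */, descr,
                   d_csrVal, d_csrRowPtr, d_csrColInd,
                   d_x, 0.0f /* beta */, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
}
```

With beta = 0 and alpha = 1 this reduces to a plain y = A * x, matching the figure above.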
32x Faster: 1 Sparse Matrix x 6 Dense Vectors
Speedup of cuSPARSE vs. MKL
[Chart: speedup over MKL on a log scale (1x to 64x) for single, double, complex, and double-complex precisions]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz
Performance of 1 Sparse Matrix x 6 Dense Vectors
[Chart: GFLOPS for single, double, complex, and double-complex precisions; test cases roughly in order of increasing sparseness]
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
cuRAND
Library for generating random numbers
Features supported in CUDA 3.2:
- XORWOW pseudo-random generator
- Sobol' quasi-random number generators
- Host API for generating random numbers in bulk
- Inline implementation allows use inside GPU functions/kernels
- Single- and double-precision, uniform and normal distributions
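A sketch of the bulk host API listed above, using the XORWOW generator (error checking omitted; the sample count and seed are arbitrary illustrations):

```c
#include <cuda_runtime.h>
#include <curand.h>

int main(void)
{
    const size_t n = 1 << 24;       /* arbitrary example: 16M samples */
    float *d_samples;
    curandGenerator_t gen;

    cudaMalloc((void **)&d_samples, n * sizeof(float));

    /* Host API: create an XORWOW pseudo-random generator, seed it,
     * and fill device memory with uniform single-precision samples in bulk */
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_samples, n);

    /* Normally distributed samples: mean 0.0, standard deviation 1.0 */
    curandGenerateNormal(gen, d_samples, n, 0.0f, 1.0f);

    curandDestroyGenerator(gen);
    cudaFree(d_samples);
    return 0;
}
```

Link with -lcurand. The inline device API (curand_init / curand_uniform inside kernels) covers the in-kernel use case mentioned above.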
cuRAND Performance
[Charts: GigaSamples/second for the XORWOW pseudo-RNG and the Sobol' quasi-RNG (1 dimension), each measured for uniform int, uniform float, uniform double, normal float, and normal double generation]
* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on