NVIDIA's CUDA Libraries Performance Report


CUDA Toolkit 3.2 Math Library Performance, January 2011


CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
- cuFFT – Fast Fourier Transforms library
- cuBLAS – complete BLAS library
- cuSPARSE – sparse matrix library
- cuRAND – random number generation (RNG) library

For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216


cuFFT: Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
- Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7
- Higher performance and accuracy for 1D transform sizes that contain large prime factors
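As a sketch of how a cuFFT transform is invoked from host code (assuming a CUDA-capable system; the function name and the device pointer `d_signal` are illustrative, and error checking is omitted for brevity):

```c
#include <cufft.h>

/* Run a single-precision complex-to-complex 1D FFT of size N, in place,
 * on data already resident in device memory. */
void run_fft_1d(cufftComplex *d_signal, int N)
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    /* one size-N C2C transform */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  /* in-place forward FFT */
    cufftDestroy(plan);
}
```

The plan object caches the factorization of N, which is why the radix-2/3/5/7 improvements above apply at plan-creation time rather than per call.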



cuFFT 3.2: up to 8.8x Faster than MKL
1D transforms are used in audio processing and as a foundation for 2D and 3D FFTs.

[Figure: Single-precision and double-precision 1D radix-2 FFT performance in GFLOPS for cuFFT, MKL, and FFTW, plotted against N (transform size = 2^N) for N = 6 to 23]

Performance may vary based on OS version and motherboard configuration
* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU


cuFFT 1D Radix-3: up to 18x Faster than MKL
18x for single precision, 15x for double precision; similar acceleration for radix-5 and radix-7.

[Figure: Single-precision and double-precision 1D radix-3 FFT performance in GFLOPS for cuFFT (ECC off), cuFFT (ECC on), and MKL, plotted against N (transform size = 3^N) for N = 2 to 16]

Performance may vary based on OS version and motherboard configuration
* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz


cuFFT 3.2: up to 100x Faster than cuFFT 3.1
Speedup for C2C single-precision transforms, cuFFT 3.2 vs. cuFFT 3.1 on Tesla C2050 (Fermi).

[Figure: Speedup on a log scale (1x to 100x) plotted against 1D transform size from 0 to 16384]

Performance may vary based on OS version and motherboard configuration


cuBLAS: Dense Linear Algebra on GPUs
A complete BLAS implementation: supports all 152 routines for single, double, complex, and double-complex precision.

New in CUDA 3.2:
- 7x faster GEMM (matrix multiply) on Fermi GPUs
- Higher performance on SGEMM and DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
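A minimal sketch of a GEMM call through the CUDA 3.2-era cuBLAS API (the legacy interface, where `cublasInit()` is assumed to have been called elsewhere; the function name and the device pointers `d_A`, `d_B`, `d_C` are illustrative):

```c
#include <cublas.h>

/* Single-precision GEMM: C = alpha*A*B + beta*C, with an m-by-k A and a
 * k-by-n B in column-major (BLAS) layout, all resident in device memory. */
void run_sgemm(const float *d_A, const float *d_B, float *d_C,
               int m, int n, int k)
{
    cublasSgemm('N', 'N',          /* no transpose on A or B (the "NN" case) */
                m, n, k,
                1.0f, d_A, m,      /* alpha = 1, leading dimension of A = m */
                      d_B, k,      /* leading dimension of B = k */
                0.0f, d_C, m);     /* beta = 0, leading dimension of C = m */
}
```

The four transpose cases mentioned above (NN, TT, TN, NT) correspond to the `'N'`/`'T'` values of the first two arguments.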


Up to 8x Speedup for all GEMM Types
GEMM performance on 4K-by-4K matrices.

[Bar chart: GFLOPS (0 to 900) for cuBLAS 3.2 vs. MKL across SGEMM, CGEMM, DGEMM, and ZGEMM; the cuBLAS bars reach 775 and 636 GFLOPS for the single-precision variants and 301 and 295 GFLOPS for the double-precision variants, against 78, 80, 39, and 40 GFLOPS for MKL]

Performance may vary based on OS version and motherboard configuration
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz


cuBLAS 3.2: Consistently 7x Faster
DGEMM performance: cuBLAS 3.2 vs. cuBLAS 3.1 vs. MKL (4 threads).

[Line chart: DGEMM GFLOPS (0 to 350) plotted against matrix dimension (m = n = k); cuBLAS 3.2 runs about 8% above cuBLAS 3.1 at peak and roughly 7x faster than MKL 10.2.3, while cuBLAS 3.1 shows large performance variance across sizes]

Performance may vary based on OS version and motherboard configuration
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz

cuSPARSE
New library for sparse linear algebra:
- Conversion routines for dense, COO, CSR, and CSC formats
- Optimized sparse matrix-vector multiplication (SpMV): y = α A x + β y, where A is a sparse matrix and x, y are dense vectors
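The SpMV operation above can be sketched with the cuSPARSE CSR matrix-vector routine (a sketch only: the exact `csrmv` signature changed across toolkit versions, and the handle, matrix descriptor, and device arrays `d_val`, `d_rowPtr`, `d_colInd`, `d_x`, `d_y` are assumed to be created elsewhere):

```c
#include <cusparse.h>

/* y = alpha * A * x + beta * y, with the m-by-n matrix A stored on the
 * device in CSR format (values, row pointers, column indices). */
void run_spmv(cusparseHandle_t handle, cusparseMatDescr_t descrA,
              int m, int n,
              const float *d_val, const int *d_rowPtr, const int *d_colInd,
              const float *d_x, float *d_y)
{
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, 1.0f, descrA,          /* alpha = 1 */
                   d_val, d_rowPtr, d_colInd,   /* CSR storage of A */
                   d_x, 0.0f, d_y);             /* beta = 0 */
}
```

The "1 sparse matrix x 6 dense vectors" benchmarks on the following slides amount to repeating this operation across several right-hand-side vectors.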


cuSPARSE: up to 32x Faster than MKL
Speedup of cuSPARSE vs. MKL for 1 sparse matrix x 6 dense vectors.

[Log-scale chart: speedup over MKL (1x to 64x) for single, double, complex, and double-complex precision]

Performance may vary based on OS version and motherboard configuration
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz

Performance of 1 Sparse Matrix x 6 Dense Vectors

[Chart: GFLOPS (0 to 160) for single, double, complex, and double-complex precision, with test cases roughly in order of increasing sparseness]

Performance may vary based on OS version and motherboard configuration
* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on


cuRAND
Library for generating random numbers. Features supported in CUDA 3.2:
- XORWOW pseudo-random generator
- Sobol' quasi-random number generators
- Host API for generating random numbers in bulk
- Inline implementation allows use inside GPU functions/kernels
- Single- and double-precision, uniform and normal distributions
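The bulk host API listed above can be sketched as follows (assuming a CUDA-capable system; the function name and the device pointer `d_out` are illustrative, and error checking is omitted):

```c
#include <curand.h>

/* Fill a device buffer with n uniformly distributed single-precision
 * values using the XORWOW pseudo-random generator via the host API. */
void fill_uniform(float *d_out, size_t n)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);  /* fixed seed for reproducibility */
    curandGenerateUniform(gen, d_out, n);              /* uniform floats in (0, 1] */
    curandDestroyGenerator(gen);
}
```

Switching the generator constant to a Sobol' type, or the generate call to a normal-distribution variant, selects the other feature combinations benchmarked on the next slide.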


cuRAND Performance

[Charts: GigaSamples/second (0 to 18) for the XORWOW pseudo-RNG and the 1-dimensional Sobol' quasi-RNG, each measured for uniform int, uniform float, uniform double, normal float, and normal double output]

Performance may vary based on OS version and motherboard configuration
* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on

