38,99 €
incl. VAT
Available immediately via download
- Format: ePub
- Devices: eReader
- No copy protection
- eBook help
- Size: 15.67 MB
- Upload possible
Other customers were also interested in
- Manika Post: Génesis NFT (eBook, ePUB), 6,99 €
- Bernabé Dorronsoro: Evolutionary Algorithms for Mobile Ad Hoc Networks (eBook, ePUB), 97,99 €
- Kevin D. Mitnick: The Art of Deception (eBook, ePUB), 12,99 €
- Dafydd Stuttard: Attack and Defend Computer Security Set (eBook, ePUB), 57,99 €
- Bruce Dang: Practical Reverse Engineering (eBook, ePUB), 42,99 €
- Cyberbedrohungen. Eine Analyse von Kriterien zur Beschreibung von Advanced Persistent Threats (eBook, ePUB), 36,99 €
- Thomas Gengler: Data for the Tiger (eBook, ePUB), 9,99 €
Break into the powerful world of parallel GPU programming with this down-to-earth, practical guide
Designed for professionals across multiple industrial sectors, Professional CUDA C Programming presents the fundamentals of CUDA, a parallel computing platform and programming model designed to ease GPU development, in an easy-to-follow format, and teaches readers how to think in parallel and implement parallel algorithms on GPUs. Each chapter covers a specific topic and includes workable examples that demonstrate the development process, allowing readers to explore both the "hard" and "soft" aspects of GPU programming.
Computing architectures are experiencing a fundamental shift toward scalable parallel computing, motivated by application requirements in industry and science. This book explains the challenges of utilizing compute resources efficiently at peak performance and presents modern techniques for tackling those challenges, while remaining accessible to professionals who are not necessarily parallel programming experts. The CUDA programming model and tools empower developers to write high-performance applications on a scalable, parallel computing platform: the GPU. However, CUDA itself can be difficult to learn without extensive programming experience. Recognized CUDA authorities John Cheng, Max Grossman, and Ty McKercher guide readers through essential GPU programming skills and best practices in Professional CUDA C Programming, including:
- CUDA Programming Model
- GPU Execution Model
- GPU Memory Model
- Streams, Events, and Concurrency
- Multi-GPU Programming
- CUDA Domain-Specific Libraries
- Profiling and Performance Tuning
The book makes complex CUDA concepts easy to understand for anyone with knowledge of basic software development, with exercises and code examples designed to be both readable and high-performance. For the professional seeking an entrance to parallel computing and the high-performance computing community, Professional CUDA C Programming is an invaluable resource, with the most current information available on the market.
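As a taste of where the book begins, the "Hello World from GPU" section listed in Chapter 1 can be sketched roughly as follows. This is a minimal illustrative sketch, not the book's verbatim listing, and it assumes the CUDA toolkit's nvcc compiler and a CUDA-capable GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU; each thread prints its own index.
__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main(void)
{
    printf("Hello World from CPU!\n");

    // Launch the kernel with 1 block of 10 threads.
    helloFromGPU<<<1, 10>>>();

    // Block until the GPU has finished before the process exits.
    cudaDeviceSynchronize();
    return 0;
}
```

Compiled with `nvcc hello.cu -o hello` and run on a machine with a CUDA GPU, this would print one line from the CPU followed by one line from each of the ten GPU threads.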
Product details
- Publisher: John Wiley & Sons
- Publication date: September 8, 2014
- Language: English
- ISBN-13: 9781118739310
- Item no.: 41740106
John Cheng, PhD, is a Research Scientist at BGP International in Houston. He has developed seismic imaging products with GPU technology and many high-performance parallel production applications on heterogeneous computing platforms.
Max Grossman is an expert in GPU computing with experience applying CUDA to problems in medical imaging, machine learning, geophysics, and more.
Ty McKercher has been helping customers adopt GPU acceleration technologies at NVIDIA, where he has worked since 2008.
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Kernels in Non-NULL Streams 279 False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283 Adjusting Stream Behavior Using
Environment Variables 284 Concurrency-Limiting GPU Resources 286 Blocking
Behavior of the Default Stream 287 Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289 Overlap Using
Depth-First Scheduling 289 Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294 Stream Callbacks 295 Summary 297
CHAPTER 7: TUNING INSTRUCTION-LEVEL PRIMITIVES 299 Introducing CUDA
Instructions 300 Floating-Point Instructions 301 Intrinsic and Standard
Functions 303 Atomic Instructions 304 Optimizing Instructions for Your
Application 306 Single-Precision vs. Double-Precision 306 Standard vs.
Intrinsic Functions 309 Understanding Atomic Instructions 315 Bringing It
All Together 322 Summary 324 CHAPTER 8: GPU-ACCELERATED CUDA LIBRARIES AND
OPENACC 327 Introducing the CUDA Libraries 328 Supported Domains for CUDA
Libraries 329 A Common Library Workflow 330 The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333 Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338 Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341 The cuBLAS Library 341 Managing cuBLAS Data 342
Demonstrating cuBLAS 343 Important Topics in cuBLAS Development 345 cuBLAS
Summary 346 The cuFFT Library 346 Using the cuFFT API 347 Demonstrating
cuFFT 348 cuFFT Summary 349 The cuRAND Library 349 Choosing Pseudo-or
Quasi- Random Numbers 349 Overview of the cuRAND Library 350 Demonstrating
cuRAND 354 Important Topics in cuRAND Development 357 CUDA Library Features
Introduced in CUDA 6 358 Drop-In CUDA Libraries 358 Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361 cuSPARSE versus MKL 361 cuBLAS
versus MKL BLAS 362 cuFFT versus FFTW versus MKL 363 CUDA Library
Performance Summary 364 Using OpenACC 365 Using OpenACC Compute Directives
367 Using OpenACC Data Directives 375 The OpenACC Runtime API 380 Combining
OpenACC and the CUDA Libraries 382 Summary of OpenACC 384 Summary 384
CHAPTER 9: MULTI-GPU PROGRAMMING 387 Moving to Multiple GPUs 388 Executing
on Multiple GPUs 389 Peer-to-Peer Communication 391 Synchronizing across
Multi-GPUs 392 Subdividing Computation across Multiple GPUs 393 Allocating
Memory on Multiple Devices 393 Distributing Work from a Single Host Thread
394 Compiling and Executing 395 Peer-to-Peer Communication on Multiple GPUs
396 Enabling Peer-to-Peer Access 396 Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unifi ed Virtual Addressing 398 Finite
Difference on Multi-GPU 400 Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401 2D Stencil Computation with
Multiple GPUs 403 Overlapping Computation and Communication 405 Compiling
and Executing 406 Scaling Applications across GPU Clusters 409 CPU-to-CPU
Data Transfer 410 GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416 Intra-Node GPU-to-GPU Data
Transfer with CUDA-Aware MPI 417 Adjusting Message Chunk Size 418 GPU to
GPU Data Transfer with GPUDirect RDMA 419 Summary 422 CHAPTER 10:
IMPLEMENTATION CONSIDERATIONS 425 The CUDA C Development Process 426 APOD
Development Cycle 426 Optimization Opportunities 429 CUDA Code Compilation
432 CUDA Error Handling 437 Profi le-Driven Optimization 438 Finding
Optimization Opportunities Using nvprof 439 Guiding Optimization Using nvvp
443 NVIDIA Tools Extension 446 CUDA Debugging 448 Kernel Debugging 448
Memory Debugging 456 Debugging Summary 462 A Case Study in Porting C
Programs to CUDA C 462 Assessing crypt 463 Parallelizing crypt 464
Optimizing crypt 465 Deploying Crypt 472 Summary of Porting crypt 475
Summary 476 APPENDIX: SUGGESTED READINGS 477 INDEX 481
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-Aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
OPENACC 327 Introducing the CUDA Libraries 328 Supported Domains for CUDA
Libraries 329 A Common Library Workflow 330 The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333 Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338 Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341 The cuBLAS Library 341 Managing cuBLAS Data 342
Demonstrating cuBLAS 343 Important Topics in cuBLAS Development 345 cuBLAS
Summary 346 The cuFFT Library 346 Using the cuFFT API 347 Demonstrating
cuFFT 348 cuFFT Summary 349 The cuRAND Library 349 Choosing Pseudo-or
Quasi- Random Numbers 349 Overview of the cuRAND Library 350 Demonstrating
cuRAND 354 Important Topics in cuRAND Development 357 CUDA Library Features
Introduced in CUDA 6 358 Drop-In CUDA Libraries 358 Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361 cuSPARSE versus MKL 361 cuBLAS
versus MKL BLAS 362 cuFFT versus FFTW versus MKL 363 CUDA Library
Performance Summary 364 Using OpenACC 365 Using OpenACC Compute Directives
367 Using OpenACC Data Directives 375 The OpenACC Runtime API 380 Combining
OpenACC and the CUDA Libraries 382 Summary of OpenACC 384 Summary 384
CHAPTER 9: MULTI-GPU PROGRAMMING 387 Moving to Multiple GPUs 388 Executing
on Multiple GPUs 389 Peer-to-Peer Communication 391 Synchronizing across
Multi-GPUs 392 Subdividing Computation across Multiple GPUs 393 Allocating
Memory on Multiple Devices 393 Distributing Work from a Single Host Thread
394 Compiling and Executing 395 Peer-to-Peer Communication on Multiple GPUs
396 Enabling Peer-to-Peer Access 396 Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unifi ed Virtual Addressing 398 Finite
Difference on Multi-GPU 400 Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401 2D Stencil Computation with
Multiple GPUs 403 Overlapping Computation and Communication 405 Compiling
and Executing 406 Scaling Applications across GPU Clusters 409 CPU-to-CPU
Data Transfer 410 GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416 Intra-Node GPU-to-GPU Data
Transfer with CUDA-Aware MPI 417 Adjusting Message Chunk Size 418 GPU to
GPU Data Transfer with GPUDirect RDMA 419 Summary 422 CHAPTER 10:
IMPLEMENTATION CONSIDERATIONS 425 The CUDA C Development Process 426 APOD
Development Cycle 426 Optimization Opportunities 429 CUDA Code Compilation
432 CUDA Error Handling 437 Profi le-Driven Optimization 438 Finding
Optimization Opportunities Using nvprof 439 Guiding Optimization Using nvvp
443 NVIDIA Tools Extension 446 CUDA Debugging 448 Kernel Debugging 448
Memory Debugging 456 Debugging Summary 462 A Case Study in Porting C
Programs to CUDA C 462 Assessing crypt 463 Parallelizing crypt 464
Optimizing crypt 465 Deploying Crypt 472 Summary of Porting crypt 475
Summary 476 APPENDIX: SUGGESTED READINGS 477 INDEX 481