Shanghai Jiao Tong University
CS427 Multicore Architecture and Parallel Computing
Lecture 7: CUDA
Prof. Xiaoyao Liang
2012/10/15
⑨CUDA "Compute Unified Device Architecture General purpose programming model > User kicks off batches of threads on the gel Targeted software stack Compute oriented drivers, language, and tools Driver for loading computation programs into GPU Standalone Driver -Optimized for computation Interface designed for compute- graphics-free API Data sharing with OpengL buffer objects Guaranteed maximum download &z readback speeds Explicit eU memory management
CUDA 2 • “Compute Unified Device Architecture” ➢General purpose programming model ➢User kicks off batches of threads on the GPU •Targeted software stack ➢Compute oriented drivers, language, and tools •Driver for loading computation programs into GPU ➢Standalone Driver -Optimized for computation ➢Interface designed for compute –graphics-free API ➢Data sharing with OpenGL buffer objects ➢Guaranteed maximum download & readback speeds ➢Explicit GPU memory management
GPU Location
[Figure: system block diagram. The CPU connects over the FSB to the Northbridge, which hosts the RAM; the GPU attaches to the Northbridge over AGP; the Southbridge sits below.]
GPU Vs. CPU
[Figure: die-area comparison. The CPU devotes its area to Control, Cache, and a few ALUs over DRAM; the GPU devotes nearly all of its area to many ALUs over DRAM.]
CUDA Execution Model
[Figure: host code launches kernels on the device; each kernel executes as a grid of thread blocks, Block (0,0) through Block (1,1).]
CUDA Device and Threads
• A compute device
➢ Is a coprocessor to the CPU or host
➢ Has its own DRAM (device memory)
➢ Runs many threads in parallel
➢ Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
➢ GPU threads are extremely lightweight
➢ Very little creation overhead
➢ GPU needs 1000s of threads for full efficiency
➢ Multi-core CPU needs only a few
C Extension
• Declspecs: __global__, __device__, __shared__, __local__, __constant__
• Keywords: threadIdx, blockIdx
• Intrinsics: __syncthreads()
• Runtime API: memory, symbol, execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);
Compilation Flow
• Integrated source (foo.cu)
• cudacc: EDG C/C++ frontend, Open64 Global Optimizer
➢ GPU assembly (foo.s) → OCG → G80 SASS
➢ CPU host code (foo.cpp) → gcc / cl
(Mark Murphy, "NVIDIA's Experience with Open64")
Compilation Flow
• C/C++ CUDA application, e.g.:
    float4 me = gx[gtid];
    me.x += me.y * me.z;
• NVCC splits it into CPU code and virtual PTX code, e.g.:
    ld.global.v4.f32 ...
    mad.f32 ...
• A PTX-to-target compiler then translates the virtual PTX into target code for the physical GPU (e.g. G80)
Matrix Multiplication

    void MatrixMultiplication(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

1000x1000 = 1,000,000 independent dot products
1000 multiply + 1000 accumulate per dot product