viii Contents

    4.4 Thread Assignment .......... 70
    4.5 Thread Scheduling and Latency Tolerance .......... 71
    4.6 Summary .......... 74
    4.7 Exercises .......... 74

CHAPTER 5 CUDA™ MEMORIES .......... 77
    5.1 Importance of Memory Access Efficiency .......... 78
    5.2 CUDA Device Memory Types .......... 79
    5.3 A Strategy for Reducing Global Memory Traffic .......... 83
    5.4 Memory as a Limiting Factor to Parallelism .......... 90
    5.5 Summary .......... 92
    5.6 Exercises .......... 93

CHAPTER 6 PERFORMANCE CONSIDERATIONS .......... 95
    6.1 More on Thread Execution .......... 96
    6.2 Global Memory Bandwidth .......... 103
    6.3 Dynamic Partitioning of SM Resources .......... 111
    6.4 Data Prefetching .......... 113
    6.5 Instruction Mix .......... 115
    6.6 Thread Granularity .......... 116
    6.7 Measured Performance and Summary .......... 118
    6.8 Exercises .......... 120

CHAPTER 7 FLOATING POINT CONSIDERATIONS .......... 125
    7.1 Floating-Point Format .......... 126
        7.1.1 Normalized Representation of M .......... 126
        7.1.2 Excess Encoding of E .......... 127
    7.2 Representable Numbers .......... 129
    7.3 Special Bit Patterns and Precision .......... 134
    7.4 Arithmetic Accuracy and Rounding .......... 135
    7.5 Algorithm Considerations .......... 136
    7.6 Summary .......... 138
    7.7 Exercises .......... 138

CHAPTER 8 APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION .......... 141
    8.1 Application Background .......... 142
    8.2 Iterative Reconstruction .......... 144
    8.3 Computing FHd .......... 148
        Step 1. Determine the Kernel Parallelism Structure .......... 149
        Step 2. Getting Around the Memory Bandwidth Limitation .......... 156