
Reference text for the course "Parallel and Distributed Programming" (CUDA): Programming Massively Parallel Processors (A Hands-on Approach)

Document format: PDF · 277 pages · 4.74 MB

David B. Kirk and Wen-mei W. Hwu
Programming Massively Parallel Processors: A Hands-on Approach
NVIDIA · Morgan Kaufmann


In Praise of Programming Massively Parallel Processors: A Hands-on Approach

Parallel programming is about performance, for otherwise you'd write a sequential program. For those interested in learning or teaching the topic, a problem is where to find truly parallel hardware that can be dedicated to the task, for it is difficult to see interesting speedups if it's shared or only modestly parallel. One answer is graphical processing units (GPUs), which can have hundreds of cores and are found in millions of desktop and laptop computers. For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA, a C-like data parallel language, and Tesla, the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on heterogeneous CPU-GPU hardware. More concretely, two detailed case studies demonstrate speedups over CPU-only C programs of 10X to 15X for naïve CUDA code and 45X to 105X for expertly tuned versions. They conclude with a glimpse of the future by describing the next generation of data parallel languages and architectures: OpenCL and the NVIDIA Fermi GPU. This book is a valuable addition to the recently reinvigorated parallel computing literature.

David Patterson
Director, The Parallel Computing Research Laboratory
Pardee Professor of Computer Science, U.C. Berkeley
Co-author of Computer Architecture: A Quantitative Approach

Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors: a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems.

Nicolas Pinto
MIT, NVIDIA Fellow 2009

I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book.


This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massively parallel programming and CUDA.

Mateo Valero
Director, Barcelona Supercomputing Center

The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors.

Mike Giles
Professor of Scientific Computing, University of Oxford

This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come.

Hanspeter Pfister
Harvard University

This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcome and highly useful across interdisciplinary fields.

Shannon Steinfadt
Kent State University
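The endorsements above describe CUDA as a "C-like data parallel language" in which a kernel function runs across thousands of GPU threads while the CPU manages memory and launches work. As a concrete illustration of that style (a minimal sketch written for this page, not an excerpt from the book; the name vecAdd and the sizes chosen are arbitrary), the following complete vector-addition program follows the explicit allocate/copy/launch/copy pattern:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each CUDA thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the tail block
}

int main() {
    const int n = 1 << 20;                  // 1M elements
    const size_t bytes = n * sizeof(float);

    // Host buffers
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers and host-to-device transfers
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back (cudaMemcpy synchronizes with the kernel)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);          // expect 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}

This allocate/copy/launch/copy structure is the one the book develops in Chapter 3 (Device Memories and Data Transfer; Kernel Functions and Threading), and the case studies in Chapters 8 and 9 refine it for real workloads.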


Programming Massively Parallel Processors: A Hands-on Approach
David B. Kirk and Wen-mei W. Hwu

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier


Morgan Kaufmann Publishers is an imprint of Elsevier.
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

This book is printed on acid-free paper.

© 2010 David B. Kirk/NVIDIA Corporation and Wen-mei Hwu. Published by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

NVIDIA, the NVIDIA logo, CUDA, GeForce, Quadro, and Tesla are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. OpenCL is a trademark of Apple Inc.

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Application Submitted

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-381472-2

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com

Printed in United States of America
10 11 12 13 14    5 4 3 2 1


Contents

Preface ... xi
Acknowledgments ... xvii
Dedication ... xix

CHAPTER 1 INTRODUCTION ... 1
1.1 GPUs as Parallel Computers ... 2
1.2 Architecture of a Modern GPU ... 8
1.3 Why More Speed or Parallelism? ... 10
1.4 Parallel Programming Languages and Models ... 13
1.5 Overarching Goals ... 15
1.6 Organization of the Book ... 16

CHAPTER 2 HISTORY OF GPU COMPUTING ... 21
2.1 Evolution of Graphics Pipelines ... 21
2.1.1 The Era of Fixed-Function Graphics Pipelines ... 22
2.1.2 Evolution of Programmable Real-Time Graphics ... 26
2.1.3 Unified Graphics and Computing Processors ... 29
2.1.4 GPGPU: An Intermediate Step ... 31
2.2 GPU Computing ... 32
2.2.1 Scalable GPUs ... 33
2.2.2 Recent Developments ... 34
2.3 Future Trends ... 34

CHAPTER 3 INTRODUCTION TO CUDA ... 39
3.1 Data Parallelism ... 39
3.2 CUDA Program Structure ... 41
3.3 A Matrix–Matrix Multiplication Example ... 42
3.4 Device Memories and Data Transfer ... 46
3.5 Kernel Functions and Threading ... 51
3.6 Summary ... 56
3.6.1 Function declarations ... 56
3.6.2 Kernel launch ... 56
3.6.3 Predefined variables ... 56
3.6.4 Runtime API ... 57

CHAPTER 4 CUDA THREADS ... 59
4.1 CUDA Thread Organization ... 59
4.2 Using blockIdx and threadIdx ... 64
4.3 Synchronization and Transparent Scalability ... 68


4.4 Thread Assignment ... 70
4.5 Thread Scheduling and Latency Tolerance ... 71
4.6 Summary ... 74
4.7 Exercises ... 74

CHAPTER 5 CUDA MEMORIES ... 77
5.1 Importance of Memory Access Efficiency ... 78
5.2 CUDA Device Memory Types ... 79
5.3 A Strategy for Reducing Global Memory Traffic ... 83
5.4 Memory as a Limiting Factor to Parallelism ... 90
5.5 Summary ... 92
5.6 Exercises ... 93

CHAPTER 6 PERFORMANCE CONSIDERATIONS ... 95
6.1 More on Thread Execution ... 96
6.2 Global Memory Bandwidth ... 103
6.3 Dynamic Partitioning of SM Resources ... 111
6.4 Data Prefetching ... 113
6.5 Instruction Mix ... 115
6.6 Thread Granularity ... 116
6.7 Measured Performance and Summary ... 118
6.8 Exercises ... 120

CHAPTER 7 FLOATING POINT CONSIDERATIONS ... 125
7.1 Floating-Point Format ... 126
7.1.1 Normalized Representation of M ... 126
7.1.2 Excess Encoding of E ... 127
7.2 Representable Numbers ... 129
7.3 Special Bit Patterns and Precision ... 134
7.4 Arithmetic Accuracy and Rounding ... 135
7.5 Algorithm Considerations ... 136
7.6 Summary ... 138
7.7 Exercises ... 138

CHAPTER 8 APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION ... 141
8.1 Application Background ... 142
8.2 Iterative Reconstruction ... 144
8.3 Computing FHd ... 148
Step 1. Determine the Kernel Parallelism Structure ... 149
Step 2. Getting Around the Memory Bandwidth Limitation ... 156


Step 3. Using Hardware Trigonometry Functions ... 163
Step 4. Experimental Performance Tuning ... 166
8.4 Final Evaluation ... 167
8.5 Exercises ... 170

CHAPTER 9 APPLICATION CASE STUDY: MOLECULAR VISUALIZATION AND ANALYSIS ... 173
9.1 Application Background ... 174
9.2 A Simple Kernel Implementation ... 176
9.3 Instruction Execution Efficiency ... 180
9.4 Memory Coalescing ... 182
9.5 Additional Performance Comparisons ... 185
9.6 Using Multiple GPUs ... 187
9.7 Exercises ... 188

CHAPTER 10 PARALLEL PROGRAMMING AND COMPUTATIONAL THINKING ... 191
10.1 Goals of Parallel Programming ... 192
10.2 Problem Decomposition ... 193
10.3 Algorithm Selection ... 196
10.4 Computational Thinking ... 202
10.5 Exercises ... 204

CHAPTER 11 A BRIEF INTRODUCTION TO OPENCL ... 205
11.1 Background ... 205
11.2 Data Parallelism Model ... 207
11.3 Device Architecture ... 209
11.4 Kernel Functions ... 211
11.5 Device Management and Kernel Launch ... 212
11.6 Electrostatic Potential Map in OpenCL ... 214
11.7 Summary ... 219
11.8 Exercises ... 220

CHAPTER 12 CONCLUSION AND FUTURE OUTLOOK ... 221
12.1 Goals Revisited ... 221
12.2 Memory Architecture Evolution ... 223
12.2.1 Large Virtual and Physical Address Spaces ... 223
12.2.2 Unified Device Memory Space ... 224
12.2.3 Configurable Caching and Scratch Pad ... 225
12.2.4 Enhanced Atomic Operations ... 226
12.2.5 Enhanced Global Memory Access ... 226


12.3 Kernel Execution Control Evolution ... 227
12.3.1 Function Calls within Kernel Functions ... 227
12.3.2 Exception Handling in Kernel Functions ... 227
12.3.3 Simultaneous Execution of Multiple Kernels ... 228
12.3.4 Interruptible Kernels ... 228
12.4 Core Performance ... 229
12.4.1 Double-Precision Speed ... 229
12.4.2 Better Control Flow Efficiency ... 229
12.5 Programming Environment ... 230
12.6 A Bright Outlook ... 230

APPENDIX A MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE ... 233
A.1 matrixmul.cu ... 233
A.2 matrixmul_gold.cpp ... 237
A.3 matrixmul.h ... 238
A.4 assist.h ... 239
A.5 Expected Output ... 243

APPENDIX B GPU COMPUTE CAPABILITIES ... 245
B.1 GPU Compute Capability Tables ... 245
B.2 Memory Coalescing Variations ... 246

Index ... 251


Preface

WHY WE WROTE THIS BOOK

Mass-market computing systems that combine multicore CPUs and many-core GPUs have brought terascale computing to the laptop and petascale computing to clusters. Armed with such computing power, we are at the dawn of pervasive use of computational experiments for science, engineering, health, and business disciplines. Many will be able to achieve breakthroughs in their disciplines using computational experiments of an unprecedented level of scale, controllability, and observability. This book provides a critical ingredient for the vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will be as pervasive as calculus.

We started with a course now known as ECE498AL. During the Christmas holiday of 2006, we were frantically working on the lecture slides and lab assignments. David was working the system, trying to pull the early GeForce 8800 GTX GPU cards from customer shipments to Illinois, which would not succeed until a few weeks after the semester began. It also became clear that CUDA would not become public until a few weeks after the start of the semester. We had to work out the legal agreements so that we could offer the course to students under NDA for the first few weeks. We also needed to get the word out so that students would sign up, since the course was not announced until after the preenrollment period.

We gave our first lecture on January 16, 2007. Everything fell into place. David commuted weekly to Urbana for the class. We had 52 students, a couple more than our capacity. We had draft slides for most of the first 10 lectures. Wen-mei's graduate student, John Stratton, graciously volunteered as the teaching assistant and set up the lab. All students signed NDAs so that we could proceed with the first several lectures until CUDA became public. We recorded the lectures but did not release them on the Web until February. We had graduate students from physics, astronomy, chemistry, electrical engineering, and mechanical engineering, as well as computer science and computer engineering. The enthusiasm in the room made it all worthwhile.

Since then, we have taught the course three times in one-semester format and twice in one-week intensive format. The ECE498AL course has become a permanent course known as ECE408 at the University of Illinois, Urbana-Champaign. We started to write up some early chapters of this book when we offered ECE498AL the second time. We tested these
