
3rd EditionFROMIOPORTSTOPROCESSMANAGEMENT?Understanding theaLINUXKERNELO'REILLYDANIELPBOVET&MARCOCESAT
1

UnderstandingtheLinuxKernel,3rdEditionLINUXByDaniel P.Bovet, Marco CesatiKERNELPublisher: o'ReillyPubDate:November2005ISBN:0-596-00565-2Pages:942OverviewInordertothoroughlyunderstandwhatmakesLinuxtick andwhyitworkssowell onawidevarietyof systems,youneedtodelvedeepintotheheartof thekernel.Thekernel handlesall interactionsbetweentheCPUandtheexternal world,anddetermineswhichprogramswill shareprocessortime,in what order. It manages limited memory so well that hundreds of processes can share thesystem efficiently,and expertly organizes data transfers so that the Cpu isn't kept waiting anylongerthannecessaryfortherelativelyslowdisks.The third edition of Understanding the Linux Kernel takes you on a guided tour of the mostsignificantdatastructures,algorithms,andprogrammingtricksusedinthekernel.Probingbeyondsuperficial features,the authors offervaluable insights topeople who want toknow how thingsreally work inside their machine. Important Intel-specific features are discussed. Relevantsegmentsof codearedissected lineby line.Butthebookcoversmorethanjustthefunctioningofthe code; it explains the theoretical underpinnings of why Linux does things the way it doesThis edition of the book covers Version 2.6, which has seen significant changes to nearly everykernel subsystem,particularly in the areas ofmemory managementand block devices.The bookfocuses onthefollowingtopics:Memory management,including file buffering,process swapping,and Direct memoryAccess (DMA)TheVirtualFilesystemlayerandtheSecondandThirdExtendedFilesystems·ProcesscreationandschedulingSignals, interrupts, andthe essential interfaces todevicedriversTimingSynchronizationwithinthekernelInterprocessCommunication(IPC)2
2 Understanding the Linux Kernel, 3rd Edition By Daniel P. Bovet, Marco Cesati . Publisher: O'Reilly Pub Date: November 2005 ISBN: 0-596-00565-2 Pages: 942 In order to thoroughly understand what makes Linux tick and why it works so well on a wide variety of systems, you need to delve deep into the heart of the kernel. The kernel handles all interactions between the CPU and the external world, and determines which programs will share processor time, in what order. It manages limited memory so well that hundreds of processes can share the system efficiently, and expertly organizes data transfers so that the CPU isn't kept waiting any longer than necessary for the relatively slow disks. The third edition of Understanding the Linux Kernel takes you on a guided tour of the most significant data structures, algorithms, and programming tricks used in the kernel. Probing beyond superficial features, the authors offer valuable insights to people who want to know how things really work inside their machine. Important Intel-specific features are discussed. Relevant segments of code are dissected line by line. But the book covers more than just the functioning of the code; it explains the theoretical underpinnings of why Linux does things the way it does. This edition of the book covers Version 2.6, which has seen significant changes to nearly every kernel subsystem, particularly in the areas of memory management and block devices. The book focuses on the following topics: Memory management, including file buffering, process swapping, and Direct memory Access (DMA) The Virtual Filesystem layer and the Second and Third Extended Filesystems Process creation and scheduling Signals, interrupts, and the essential interfaces to device drivers Timing Synchronization within the kernel Interprocess Communication (IPC)

Programexecution?UnderstandingtheLinuxKernelwill acquaintyouwithall theinnerworkingsof Linux,but it'smorethan justan academic exercise.You'll learn what conditions bring out Linux'sbest performance,andyou'll see how itmeets thechallenge of providinggood systemresponse during processscheduling,fileaccess,andmemorymanagementinawidevarietyofenvironments.Thisbookwillhelp youmakethemostofyourLinuxsystem.3
3 Program execution Understanding the Linux Kernel will acquaint you with all the inner workings of Linux, but it's more than just an academic exercise. You'll learn what conditions bring out Linux's best performance, and you'll see how it meets the challenge of providing good system response during process scheduling, file access, and memory management in a wide variety of environments. This book will help you make the most of your Linux system

Table of ContentsIndexCopyrightPrefaceThe Audiencefor This BookOrganization of the MaterialLevelof DescriptionOverviewof theBookBackground InformationConventionsinThisBookHow to Contact UsSafari? EnabledAcknowledgmentsChapter1.IntroductionSection1.1.LinuxVersusOtherUnix-LikeKernelsSection 1.2.Hardware DependencySection 1.3. Linux VersionsSection1.4.BasicOperatingSystemConceptsSection1.5.AnOverviewoftheUnixFilesystemSection1.6.AnOverviewofUnixKernelsChapter2.MemoryAddressingSection2.1.MemoryAddressesSection 2.2. Segmentation in HardwareSection2.3.Segmentation in LinuxSection 2.4.Paging in HardwareSection 2.5.Paging in LinuxChapter3.ProcessesSection 3.1. Processes, Lightweight Processes,and ThreadsSection3.2.ProcessDescriptorSection 3.3. Process SwitchSection 3.4.Creating ProcessesSection3.5.Destroying ProcessesChapter4.Interrupts and ExceptionsSection4.1.TheRoleof Interrupt SignalsSection4.2.InterruptsandExceptionsSection 4.3.Nested Execution of Exception and Interrupt HandlersSection4.4.InitializingtheInterruptDescriptorTableSection4.5.ExceptionHandling4
4 Table of Contents | Index Copyright Preface The Audience for This Book Organization of the Material Level of Description Overview of the Book Background Information Conventions in This Book How to Contact Us Safari® Enabled Acknowledgments Chapter 1. Introduction Section 1.1. Linux Versus Other Unix-Like Kernels Section 1.2. Hardware Dependency Section 1.3. Linux Versions Section 1.4. Basic Operating System Concepts Section 1.5. An Overview of the Unix Filesystem Section 1.6. An Overview of Unix Kernels Chapter 2. Memory Addressing Section 2.1. Memory Addresses Section 2.2. Segmentation in Hardware Section 2.3. Segmentation in Linux Section 2.4. Paging in Hardware Section 2.5. Paging in Linux Chapter 3. Processes Section 3.1. Processes, Lightweight Processes, and Threads Section 3.2. Process Descriptor Section 3.3. Process Switch Section 3.4. Creating Processes Section 3.5. Destroying Processes Chapter 4. Interrupts and Exceptions Section 4.1. The Role of Interrupt Signals Section 4.2. Interrupts and Exceptions Section 4.3. Nested Execution of Exception and Interrupt Handlers Section 4.4. Initializing the Interrupt Descriptor Table Section 4.5. Exception Handling

Section4.6.InterruptHandlingSection 4.7.Sofirgs and TaskletsSection4.8.WorkQueuesSection 4.9.Returning from Interrupts and ExceptionsChapter5.Kernel SynchronizationSection5.1.HowtheKernel Services RequestsSection5.2..SynchronizationPrimitivesSection 5.3.SynchronizingAccessestoKernel Data StructuresSection5.4.Examplesof RaceConditionPreventionChapter6.Timing MeasurementsSection 6.1. Clock and Timer CircuitsSection6.2..TheLinuxTimekeepingArchitectureSection 6.3.Updatingthe TimeandDateSection6.4.Updating SystemStatisticsSection6.5.SofwareTimersand DelayFunctionsSection 6.6.SystemCalls Related to Timing MeasurementsChapter7.Process SchedulingSection 7.1. Scheduling PolicySection 7.2.The Scheduling AlgorithmSection 7.3.Data Structures Used by the SchedulerSection7.4.FunctionsUsedbytheSchedulerSection7.5.RunqueueBalancing inMultiprocessorSystemsSection 7.6.System Calls Related to SchedulingChapter8.Memory ManagementSection 8.1.PageFrame ManagementSection 8.2. Memory Area ManagementSection8.3.NoncontiguousMemoryAreaManagementChapter9.ProcessAddressSpaceSection9.1.TheProcess'sAddress SpaceSection 9.2.The Memory DescriptorSection 9.3.Memory RegionsSection9.4.PageFault Exception HandlerSection9.5.Creating and Deleting a ProcessAddress SpaceSection 9.6.Managing the HeapChapter10.SystemCallsSection 10.1.POSIXAPls and System CallsSection10.2.SystemCall Handlerand ServiceRoutinesSection10.3.EnteringandExitingaSystemCall5
5 Section 4.6. Interrupt Handling Section 4.7. Softirqs and Tasklets Section 4.8. Work Queues Section 4.9. Returning from Interrupts and Exceptions Chapter 5. Kernel Synchronization Section 5.1. How the Kernel Services Requests Section 5.2. Synchronization Primitives Section 5.3. Synchronizing Accesses to Kernel Data Structures Section 5.4. Examples of Race Condition Prevention Chapter 6. Timing Measurements Section 6.1. Clock and Timer Circuits Section 6.2. The Linux Timekeeping Architecture Section 6.3. Updating the Time and Date Section 6.4. Updating System Statistics Section 6.5. Software Timers and Delay Functions Section 6.6. System Calls Related to Timing Measurements Chapter 7. Process Scheduling Section 7.1. Scheduling Policy Section 7.2. The Scheduling Algorithm Section 7.3. Data Structures Used by the Scheduler Section 7.4. Functions Used by the Scheduler Section 7.5. Runqueue Balancing in Multiprocessor Systems Section 7.6. System Calls Related to Scheduling Chapter 8. Memory Management Section 8.1. Page Frame Management Section 8.2. Memory Area Management Section 8.3. Noncontiguous Memory Area Management Chapter 9. Process Address Space Section 9.1. The Process's Address Space Section 9.2. The Memory Descriptor Section 9.3. Memory Regions Section 9.4. Page Fault Exception Handler Section 9.5. Creating and Deleting a Process Address Space Section 9.6. Managing the Heap Chapter 10. System Calls Section 10.1. POSIX APIs and System Calls Section 10.2. System Call Handler and Service Routines Section 10.3. Entering and Exiting a System Call

Section 10.4.Parameter PassingSection10.5.KernelWrapperRoutinesChapter11.SignalsSection 11.1.The Role of SignalsSection 11.2.Generating a SignalSection 11.3. Delivering a SignalSection11.4.SystemCalls Related to Signal HandlingChapter 12.The Virtual FilesystemSection 12.1.The Role of the Virtual Filesystem (VFS)Section12.2.VFSData StructuresSection 12.3.Filesystem TypesSection12.4.Filesystem HandlingSection12.5.PathnameLookupSection12.6.ImplementationsofVFSSystemCallsSection 12.7.File LockingChapter13.I/OArchitectureand DeviceDriversSection13.1.I/OArchitectureSection 13.2. The Device Driver ModeSection 13.3.Device FilesSection 13.4.Device DriversSection13.5.CharacterDeviceDriversChapter14.BlockDeviceDriversSection 14.1.Block Devices HandlingSection 14.2.TheGeneric Block LayerSection 14.3.The I/O SchedulerSection 14.4.Block DeviceDriversSection14.5.OpeningaBlockDeviceFileChapter15.ThePageCacheSection15.1.ThePageCacheSection 15.2.Storing Blocks in the Page CacheSection 15.3.Writing Dirty Pages to DiskSection 15.4. The sync(),fsync(),and fdatasync()System CallsChapter16.Accessing FilesSection 16.1.Reading and Writing a FileSection16.2.Memory MappingSection16.3.DirectVOTransfersSection16.4.AsynchronousV/OChapter17.PageFrameReclaiming6
6 Section 10.4. Parameter Passing Section 10.5. Kernel Wrapper Routines Chapter 11. Signals Section 11.1. The Role of Signals Section 11.2. Generating a Signal Section 11.3. Delivering a Signal Section 11.4. System Calls Related to Signal Handling Chapter 12. The Virtual Filesystem Section 12.1. The Role of the Virtual Filesystem (VFS) Section 12.2. VFS Data Structures Section 12.3. Filesystem Types Section 12.4. Filesystem Handling Section 12.5. Pathname Lookup Section 12.6. Implementations of VFS System Calls Section 12.7. File Locking Chapter 13. I/O Architecture and Device Drivers Section 13.1. I/O Architecture Section 13.2. The Device Driver Model Section 13.3. Device Files Section 13.4. Device Drivers Section 13.5. Character Device Drivers Chapter 14. Block Device Drivers Section 14.1. Block Devices Handling Section 14.2. The Generic Block Layer Section 14.3. The I/O Scheduler Section 14.4. Block Device Drivers Section 14.5. Opening a Block Device File Chapter 15. The Page Cache Section 15.1. The Page Cache Section 15.2. Storing Blocks in the Page Cache Section 15.3. Writing Dirty Pages to Disk Section 15.4. The sync( ), fsync( ), and fdatasync( ) System Calls Chapter 16. Accessing Files Section 16.1. Reading and Writing a File Section 16.2. Memory Mapping Section 16.3. Direct I/O Transfers Section 16.4. Asynchronous I/O Chapter 17. Page Frame Reclaiming

Section17.1.ThePageFrameReclaiming AlgorithmSection17.2.ReverseMappingSection17.3.ImplementingthePFRASection17.4.SwappingChapter18.TheExt2andExt3FilesystemsSection 18.1.General Characteristics of Ext2Section18.2.Ext2DiskDataStructuresSection18.3.Ext2MemoryDataStructuresSection 18.4.Creating theExt2 FilesystemSection 18.5. Ext2 MethodsSection18.6.ManagingExt2DiskSpaceSection18.7.TheExt3FilesystemChapter19.ProcessCommunicationSection 19.1.PipesSection 19.2. FIFOsSection19.3.SystemVIPCSection19.4.POSIXMessageQueuesChapter20.ProgramExZecutionSection 20.1. Executable FilesSection 20.2.ExecutableFormatsSection 20.3.Execution DomainsSection20.4.TheexecFunctionsAppendixA.SystemStartupSection A.1. Prehistoric Age:the BIOSSection A2. Ancient Age: the Boot LoaderSection A.3. Middle Ages: the setup()FunctionSectionA.4.Renaissance:thestartup 32()FunctionsSection A.5.ModernAge:the start kernel()Function_Appendix B. ModulesSectionB.1.ToBe(aModule)orNottoBe?Section B.2.Module ImplementationSection B.3.Linking and Unlinking ModulesSection B.4.Linking Modules on DemandBibliographyBookson UnixKernelsBooksontheLinuxKernelBooksonPCArchitectureandTechnical Manualson Intel MicroprocessorsOtherOnlineDocumentation Sources7
7 Section 17.1. The Page Frame Reclaiming Algorithm Section 17.2. Reverse Mapping Section 17.3. Implementing the PFRA Section 17.4. Swapping Chapter 18. The Ext2 and Ext3 Filesystems Section 18.1. General Characteristics of Ext2 Section 18.2. Ext2 Disk Data Structures Section 18.3. Ext2 Memory Data Structures Section 18.4. Creating the Ext2 Filesystem Section 18.5. Ext2 Methods Section 18.6. Managing Ext2 Disk Space Section 18.7. The Ext3 Filesystem Chapter 19. Process Communication Section 19.1. Pipes Section 19.2. FIFOs Section 19.3. System V IPC Section 19.4. POSIX Message Queues Chapter 20. Program ExZecution Section 20.1. Executable Files Section 20.2. Executable Formats Section 20.3. Execution Domains Section 20.4. The exec Functions Appendix A. System Startup Section A.1. Prehistoric Age: the BIOS Section A.2. Ancient Age: the Boot Loader Section A.3. Middle Ages: the setup( ) Function Section A.4. Renaissance: the startup_32( ) Functions Section A.5. Modern Age: the start_kernel( ) Function Appendix B. Modules Section B.1. To Be (a Module) or Not to Be? Section B.2. Module Implementation Section B.3. Linking and Unlinking Modules Section B.4. Linking Modules on Demand Bibliography Books on Unix Kernels Books on the Linux Kernel Books on PC Architecture and Technical Manuals on Intel Microprocessors Other Online Documentation Sources

ResearchPapersRelatedtoLinuxDevelopmentAbout the AuthorsColophonIndexUnderstanding the Linux Kernel,Third EditionbyDanielP.BovetandMarcoCesatiCopyright2006O'ReillyMedia,Inc.All rightsreserved.PrintedintheUnitedStatesofAmericaPublishedbyO'ReillyMedia,Inc.,1005GravensteinHighwayNorth,Sebastopol,CA95472O'Reilly booksmaybe purchased for educational,business,orsales promotional use.Onlineeditionsarealsoavailableformosttitles(safari.oreilly.com).Formoreinformation,contactourcorporate/institutional salesdepartment:(800)998-9938orcorporate@oreilly.com.Editor:Andy OramDarren KellyProductionEditor:ProductionServices:Amy ParkerEdie FreedmanCover Designer:InteriorDesigner:David FutatoPrinting History:November2000:First Edition.December 2002:Second Edition.November2005:Third Edition.NutshellHandbook,theNutshell Handbooklogo,andtheO'ReillylogoareregisteredtrademarksofO'ReillyMedia, Inc.The Linux series designations,Understanding the Linux Kernel,Third Edition,the image of a man with a bubble, and related trade dress are trademarks of O'Reilly Media, Inc.Manyofthedesignationsusedbymanufacturersandsellersto distinguishtheirproductsareclaimedastrademarks.Wherethosedesignationsappearinthisbook,andO'ReillyMedia,Inc.was8
8 Research Papers Related to Linux Development About the Authors Colophon Index Understanding the Linux Kernel, Third Edition by Daniel P. Bovet and Marco Cesati Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com. Editor: Andy Oram Production Editor: Darren Kelly Production Services: Amy Parker Cover Designer: Edie Freedman Interior Designer: David Futato Printing History: November 2000: First Edition. December 2002: Second Edition. November 2005: Third Edition. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. The Linux series designations, Understanding the Linux Kernel, Third Edition, the image of a man with a bubble, and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was

aware of atrademark claim,the designationshavebeenprinted in caps or initial capsWhileeveryprecautionhas beentaken inthepreparationof this book,thepublisherand authorsassume no responsibilityfor errors oromissions,orfordamages resulting from the useof theinformationcontainedherein.ISBN:0-596-00565-2[M]6
9 aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. ISBN: 0-596-00565-2 [M]

PrefaceIn the spring semester of 1997,we taught a course on operating systems based on Linux 2.0.Theidea was to encourage students to read the source code.To achieve this, we assigned termprojects consisting of making changes to the kernel and performing tests on the modified version.We also wrote course notes for our students about a few critical features of Linux such as taskswitching andtask scheduling.Outofthisworkandwithalotof supportfromourO'ReillyeditorAndyOramcamethefirsteditionof Understanding the Linux Kernel at the end of 2000,which covered Linux 2.2 with a fewanticipations on Linux 2.4. The success encountered by this book encouraged us to continue alongthis line.At the end of 2o02, we came out witha second edition covering Linux2.4.Youare nowlookingatthethirdedition,whichcoversLinux2.6.As in our previous experiences,wereadthousands of lines of code, trying to make sense of them.After all this work,we can say that it was worth the effort.Welearned a lot of things you don't findin books,andwehopewehavesucceededinconveying someofthisinformationinthefollowingpages.TheAudienceforThisBookAll people curious about how Linux works and why it is so efficient will find answers here. Afterreading the book, you will find your way through the many thousands of lines of code,distinguishing between crucial data structuresand secondary onesin short,becoming a true Linuxhacker.Our work might be considered a guided tour of the Linux kernel:most of the significant datastructuresandmanyalgorithmsandprogrammingtricksusedinthekernelarediscussed.Inmanycases,therelevantfragmentsof codearediscussed linebyline.Ofcourse,youshouldhavetheLinux source code on hand and should be willing to expend some effort deciphering some of thefunctionsthatarenot,forsakeofbrevity,fullydescribed.Onanotherlevel,thebookprovidesvaluableinsighttopeoplewhowanttoknowmoreaboutthecritical design issues in a modern operating system.It is not specifically addressed to systemadministratorsorprogrammers;itismostlyforpeoplewhowantto understandhowthingsreallywork insidethemachine!Aswithanygoodguide,wetrytogobeyondsuperficialfeatures.Weofferabackground,suchasthehistoryofmajorfeaturesandthereasonswhytheywereused.10
10 Preface In the spring semester of 1997, we taught a course on operating systems based on Linux 2.0. The idea was to encourage students to read the source code. To achieve this, we assigned term projects consisting of making changes to the kernel and performing tests on the modified version. We also wrote course notes for our students about a few critical features of Linux such as task switching and task scheduling. Out of this work and with a lot of support from our O'Reilly editor Andy Oram came the first edition of Understanding the Linux Kernel at the end of 2000, which covered Linux 2.2 with a few anticipations on Linux 2.4. The success encountered by this book encouraged us to continue along this line. At the end of 2002, we came out with a second edition covering Linux 2.4. You are now looking at the third edition, which covers Linux 2.6. As in our previous experiences, we read thousands of lines of code, trying to make sense of them. After all this work, we can say that it was worth the effort. We learned a lot of things you don't find in books, and we hope we have succeeded in conveying some of this information in the following pages. The Audience for This Book All people curious about how Linux works and why it is so efficient will find answers here. After reading the book, you will find your way through the many thousands of lines of code, distinguishing between crucial data structures and secondary onesin short, becoming a true Linux hacker. Our work might be considered a guided tour of the Linux kernel: most of the significant data structures and many algorithms and programming tricks used in the kernel are discussed. In many cases, the relevant fragments of code are discussed line by line. Of course, you should have the Linux source code on hand and should be willing to expend some effort deciphering some of the functions that are not, for sake of brevity, fully described. On another level, the book provides valuable insight to people who want to know more about the critical design issues in a modern operating system. It is not specifically addressed to system administrators or programmers; it is mostly for people who want to understand how things really work inside the machine! As with any good guide, we try to go beyond superficial features. We offer a background, such as the history of major features and the reasons why they were used