
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iii Professional Linux® Kernel Architecture Wolfgang Mauerer Wiley Publishing, Inc

Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page i Professional Linux® Kernel Architecture Introduction . xxvii Chapter 1: Introduction and Overview . 1 Chapter 2: Process Management and Scheduling . 35 Chapter 3: Memory Management . 133 Chapter 4: Virtual Process Memory . 289 Chapter 5: Locking and Interprocess Communication . 347 Chapter 6: Device Drivers . 391 Chapter 7: Modules . 473 Chapter 8: The Virtual Filesystem. 519 Chapter 9: The Extended Filesystem Family . 583 Chapter 10: Filesystems without Persistent Storage . 643 Chapter 11: Extended Attributes and Access Control Lists . 707 Chapter 12: Networks . 733 Chapter 13: System Calls . 819 Chapter 14: Kernel Activities . 847 Chapter 15: Time management . 893 Chapter 16: Page and Buffer Cache . 949 Chapter 17: Data Synchronization . 989 Chapter 18: Page Reclaim and Swapping. 1023 Chapter 19: Auditing . 1097 Appendix A: Architecture Specifics . 1117 Appendix B: Working with the Source Code . 1141 Appendix C: Notes on C . 1175 Appendix D: System Startup . 1223 Appendix E: The ELF Binary Format . 1241 Appendix F: The Kernel Development Process . 1267 Bibliography . 1289 Index . 1293

Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iii Professional Linux® Kernel Architecture Wolfgang Mauerer Wiley Publishing, Inc

Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iv Professional Linux® Kernel Architecture Published by Wiley Publishing, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright © 2008 by Wolfgang Mauerer Published by Wiley Publishing, Inc., Indianapolis, Indiana Published simultaneously in Canada ISBN: 978-0-470-34343-2 Manufactured in the United States of America 10 9 8 7 6 5 4 3 2 1 Library of Congress Cataloging-in-Publication Data: Mauerer, Wolfgang, 1978- Professional Linux kernel architecture / Wolfgang Mauerer. p. cm. Includes index. ISBN 978-0-470-34343-2 (pbk.) 1. Linux. 2. Computer architecture. 3. Application software. I. Title. QA76.9.A73M38 2008 005.4’32-dc22 2008028067 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions. Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Wrox Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books

Mauerer fauth.tex V2 - 08/22/2008 4:52am Page v About the Author Wolfgang Mauerer is a quantum physicist whose professional interests are centered around quantum cryptography, quantum electrodynamics, and compilers for — you guessed it — quantum architectures. With the confirmed capacity of being the worst experimentalist in the known universe, he sticks to the theoretical side of his profession, which is especially reassuring considering his constant fear of accidentally destroying the universe. Outside his research work, he is fascinated by operating systems, and for more than a decade — starting with an article series about the kernel in 1997 — he has found great pleasure in documenting and explaining Linux kernel internals. He is also the author of a book about typesetting with LaTeX and has written numerous articles that have been translated into seven languages in total. When he’s not submerged in vast Hilbert spaces or large quantities of source code, he tries to take the opposite direction, namely, upward — be this with model planes, a paraglider, or on foot with an ice axe in his hands: Mountains especially have the power to outrival even the Linux kernel. Consequently, he considers planning and accomplishing a first-ascent expedition to the vast arctic glaciers of east Greenland to be the really unique achievement in his life. Being interested in everything that is fundamental, he is also the author of the first compiler for Plankalkul, the world’s earliest high-level language devised in 1942–1946 by Konrad Zuse, the father of ¨ the computer. As an avid reader, he is proud that despite the two-digit number of computers present in his living room, the volume required for books still occupies a larger share

Mauerer fcredit.tex V2 - 08/22/2008 4:53am Page vii Credits Executive Editor Carol Long Senior Development Editor Tom Dinse Production Editor Debra Banninger Copy Editors Cate Caffrey Kathryn Duggan Editorial Manager Mary Beth Wakefield Production Manager Tim Tate Vice President and Executive Group Publisher Richard Swadley Vice President and Executive Publisher Joseph B. Wikert Project Coordinator, Cover Lynsey Stanford Proofreader Publication Services, Inc. Indexer Jack Lewis

Mauerer fack.tex V4 - 09/04/2008 3:36pm Page ix Acknowledgments First and foremost, I have to thank the thousands of programmers who have created the Linux kernel over the years — most of them commercially based, but some also just for their own private or academic joy. Without them, there would be no kernel, and I would have had nothing to write about. Please accept my apologies that I cannot list all several hundred names here, but in true UNIX style, you can easily generate the list by: for file in $ALL_FILES_COVERED_IN_THIS_BOOK; do git log -pretty="format:%an" $file; done | sort -u -k 2,2 It goes without saying that I admire your work very much — you are all the true heroes in this story! What you are reading right now is the result of an evolution over more than seven years: After two years of writing, the first edition was published in German by Carl Hanser Verlag in 2003. It then described kernel 2.6.0. The text was used as a basis for the low-level design documentation for the EAL4+ security evaluation of Red Hat Enterprise Linux 5, requiring to update it to kernel 2.6.18 (if the EAL acronym does not mean anything to you, then Wikipedia is once more your friend). Hewlett-Packard sponsored the translation into English and has, thankfully, granted the rights to publish the result. Updates to kernel 2.6.24 were then performed specifically for this book. Several people were involved in this evolution, and my appreciation goes to all of them: Leslie MackayPoulton, with support from David Jacobs, did a tremendous job at translating a huge pile of text into English. I’m also indebted to Sal La Pietra of atsec information security for pulling the strings to get the translation project rolling, and especially to Stephan Muller for close cooperation during the evaluation. ¨ My cordial thanks also go to all other HP and Red Hat people involved in this evaluation, and also to Claudio Kopper and Hans Lohr for our very enjoyable cooperation during this project. Many thanks also ¨ go to the people at Wiley — both visible and invisible to me — who helped to shape the book into its current form. The German edition was well received by readers and reviewers, but nevertheless comments about inaccuracies and suggestions for improvements were provided. I’m glad for all of them, and would also like to mention the instructors who answered the publisher’s survey for the original edition. Some of their suggestions were very valuable for improving the current publication. The same goes for the referees for this edition, especially to Dr. Xiaodong Zhang for providing numerous suggestions for Appendix F.4. Furthermore, I express my gratitude to Dr. Christine Silberhorn for granting me the opportunity to suspend my regular research work at the Max Planck Research Group for four weeks to work on this project. I hope you enjoyed the peace during this time when nobody was trying to install Linux on your MacBook! As with every book, I owe my deepest gratitude to my family for supporting me in every aspect of life — I more than appreciate this indispensable aid. Finally, I have to thank Hariet Fabritius for infinite

Mauerer fack.tex V4 - 09/04/2008 3:36pm Page x Acknowledgments patience with an author whose work cycle not only perfectly matched the most alarming forms of sleep dyssomnias, but who was always right on the brink of confusing his native tongue with ‘‘C,’’ and whom she consequently had to rescue from numerous situations where he seemingly had lost his mind (see below.). Now that I have more free time again, I’m not only looking forward to our well-deserved holiday, but can finally embark upon the project of giving your laptop all joys of a proper operating system! (Writing these acknowledgments, I all of a sudden realize why people always hasten to lock away their laptops when they see me approaching. .) x

Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xi Contents Introduction xxvii Chapter 1: Introduction and Overview 1 Tasks of the Kernel 2 Implementation Strategies 3 Elements of the Kernel 3 Processes, Task Switching, and Scheduling 4 Unix Processes 4 Address Spaces and Privilege Levels 7 Page Tables 11 Allocation of Physical Memory 13 Timing 16 System Calls 17 Device Drivers, Block and Character Devices 17 Networks 18 Filesystems 18 Modules and Hotplugging 18 Caching 20 List Handling 20 Object Management and Reference Counting 22 Data Types 25 . and Beyond the Infinite 27 Why the Kernel Is Special 28 Some Notes on Presentation 29 Summary 33 Chapter 2: Process Management and Scheduling 35 Process Priorities 36 Process Life Cycle 38 Preemptive Multitasking 40 Process Representation 41 Process Types 47 Namespaces 47

Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xii Contents Process Identification Numbers 54 Task Relationships 62 Process Management System Calls 63 Process Duplication 63 Kernel Threads 77 Starting New Programs 79 Exiting Processes 83 Implementation of the Scheduler 83 Overview 84 Data Structures 86 Dealing with Priorities 93 Core Scheduler 99 The Completely Fair Scheduling Class 106 Data Structures 106 CFS Operations 107 Queue Manipulation 112 Selecting the Next Task 113 Handling the Periodic Tick 114 Wake-up Preemption 115 Handling New Tasks 116 The Real-Time Scheduling Class 117 Properties 118 Data Structures 118 Scheduler Operations 119 Scheduler Enhancements 121 SMP Scheduling 121 Scheduling Domains and Control Groups 126 Kernel Preemption and Low Latency Efforts 127 Summary 132 Chapter 3: Memory Management 133 Overview 133 Organization in the (N)UMA Model 136 Overview 136 Data Structures 138 Page Tables 153 Data Structures 154 Creating and Manipulating Entries 161 Initialization of Memory Management 161 Data Structure Setup 162 Architecture-Specific Setup 169 Memory Management during the Boot Process 191 xii