Linux Kernel Internals

Tigran Aivazian tigran@veritas.com

22 August 2000

Introduction to the Linux 2.4 kernel. The latest copy of this document can always be downloaded from: http://www.moses.uklinux.net/patches/lki.sgml

This documentation is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

The author is working as senior Linux kernel engineer at VERITAS Software Ltd and wrote this book to support the short training course/lectures he gave on this subject internally at VERITAS.

1. Booting

• 1.1 Building the Linux Kernel Image
• 1.2 Booting: Overview
• 1.3 Booting: BIOS POST
• 1.4 Booting: bootsector and setup
• 1.5 Using LILO as a bootloader
• 1.6 High level initialisation
• 1.7 SMP Bootup on x86
• 1.8 Freeing initialisation data and code
• 1.9 Processing kernel command line

2. Process and Interrupt Management

• 2.1 Task Structure and Process Table
• 2.2 Creation and termination of tasks and kernel threads
• 2.3 Linux Scheduler
• 2.4 Linux linked list implementation
• 2.5 Wait Queues
• 2.6 Kernel Timers
• 2.7 Bottom Halves
• 2.8 Task Queues
• 2.9 Tasklets
• 2.10 Softirqs
• 2.11 How System Calls Are Implemented on i386 Architecture?
• 2.12 Atomic Operations
• 2.13 Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks
• 2.14 Semaphores and read/write Semaphores
• 2.15 Kernel Support for Loading Modules

3. Virtual Filesystem (VFS)

• 3.1 Inode Caches and Interaction with Dcache
• 3.2 Filesystem Registration/Unregistration
• 3.3 File Descriptor Management
• 3.4 File Structure Management
• 3.5 Superblock and Mountpoint Management
• 3.6 Example Virtual Filesystem: pipefs
• 3.7 Example Disk Filesystem: BFS
• 3.8 Execution Domains and Binary Formats

1. Booting

1.1 Building the Linux Kernel Image

This section explains the steps taken during compilation of the Linux kernel and the output produced at each stage. The build process depends on the architecture, so I would like to emphasize that we only consider building a Linux/x86 kernel.

When the user types 'make zImage' or 'make bzImage' the resulting bootable kernel image is stored as arch/i386/boot/zImage or arch/i386/boot/bzImage respectively. Here is how the image is built:

1. C and assembly source files are compiled into ELF relocatable object format (.o) and some of them are grouped logically into archives (.a) using ar(1).
2. Using ld(1), the above .o and .a are linked into 'vmlinux', which is a statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
3. System.map is produced by 'nm vmlinux'; irrelevant or uninteresting symbols are grepped out.
4. Enter directory arch/i386/boot.
5. Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__, depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s respectively.
6. bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or bootsect.s is assembled and raw-converted into bootsect for zImage).
7. Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for bzImage or setup.s for zImage. In the same way as the bootsector code, the difference is marked by -D__BIG_KERNEL__ present for bzImage. The result is then converted into 'raw binary' form called bsetup.
8. Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format, removing .note and .comment ELF sections.
9. gzip -9 < $tmppiggy > $tmppiggy.gz
10. Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
11. Compile compression routines head.S and misc.c (still in arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.
12. Link together head.o misc.o piggy.o into bvmlinux (or vmlinux for zImage; don't mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage the compression loader is high-loaded.
13. Convert bvmlinux to 'raw binary' bvmlinux.out, removing .note and .comment ELF sections.
14. Go back to arch/i386/boot directory and, using the program tools/build, cat together bbootsect + bsetup + compressed/bvmlinux.out into bzImage (delete extra 'b' above for zImage). This writes important variables like setup_sects and root_dev at the end of the bootsector.

The size of the bootsector is always 512 bytes. The size of the setup must be greater than 4 sectors but is limited above by about 12K. The rule is:

0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running bootsector/setup

We will see later where this limitation comes from.

The upper limit on the bzImage size produced at this step is about 2.5M for booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for booting a raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).

Note that tools/build validates the size of the boot sector, the size of the kernel image and the lower bound on the size of setup, but not the upper bound of setup, so it is easy to build a broken kernel by adding some large ".space" at the end of setup.S.

1.2 Booting: Overview

The boot process details are architecture-specific, so we shall focus our attention on the IBM PC/IA32 architecture. Due to old design and backward compatibility, the PC firmware boots the operating system in an old-fashioned manner. This process can be separated into the following six logical stages:

1. BIOS selects the boot device
2. BIOS loads the bootsector from the boot device
3. Bootsector loads setup, decompression routines and compressed kernel image
4. The kernel is uncompressed in protected mode
5. Low-level initialisation performed by asm code
6. High-level C initialisation

1.3 Booting: BIOS POST

1. The power supply starts the clock generator and asserts #POWERGOOD signal on the bus
2. CPU #RESET line is asserted (CPU now in real 8086 mode)
3. %ds=%es=%fs=%gs=%ss=0, %cs:%eip = 0xFFFF:0000 (ROM BIOS POST code)
4. All the checks performed by POST with interrupts disabled
5. IVT initialised at address 0
6. The BIOS Bootstrap Loader function is invoked via int 0x19 with %dl containing the boot device 'drive number'. This loads track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).

1.4 Booting: bootsector and setup

The bootsector used to boot Linux kernel could be either:

• Linux bootsector, arch/i386/boot/bootsect.S
• LILO (or other bootloader's) bootsector
• No bootsector (loadlin etc)

We consider here the Linux bootsector in detail. The first few lines initialize the convenience macros to be used for segment values:

    29 SETUPSECS = 4                /* default nr of setup-sectors */
    30 BOOTSEG   = 0x07C0           /* original address of boot-sector */
    31 INITSEG   = DEF_INITSEG      /* we move boot here - out of the way */
    32 SETUPSEG  = DEF_SETUPSEG     /* setup starts here */
    33 SYSSEG    = DEF_SYSSEG       /* system loaded at 0x10000 (65536) */
    34 SYSSIZE   = DEF_SYSSIZE      /* system size: # of 16-byte clicks */

(the numbers on the left are the line numbers of bootsect.S file)

The values of DEF_INITSEG, DEF_SETUPSEG, DEF_SYSSEG, DEF_SYSSIZE are taken from include/asm/boot.h:

    /* Don't touch these, unless you really know what you're doing. */
    #define DEF_INITSEG     0x9000
    #define DEF_SYSSEG      0x1000
    #define DEF_SETUPSEG    0x9020
    #define DEF_SYSSIZE     0x7F00

Now, let us consider the actual code of bootsect.S:

    54          movw    $BOOTSEG, %ax
    55          movw    %ax, %ds
    56          movw    $INITSEG, %ax
    57          movw    %ax, %es
    58          movw    $256, %cx
    59          subw    %si, %si
    60          subw    %di, %di
    61          cld
    62          rep
    63          movsw
    64          ljmp    $INITSEG, $go
    65  # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
    66  # wouldn't have to worry about this if we checked the top of memory. Also
    67  # my BIOS can be configured to put the wini drive tables in high memory
    68  # instead of in the vector table. The old stack might have clobbered the
    69  # drive table.
    70  go:     movw    $0x4000-12, %di         # 0x4000 is an arbitrary value >=
    71                                          # length of bootsect + length of
    72                                          # setup + room for stack;
    73                                          # 12 is disk parm size.
    74          movw    %ax, %ds                # ax and es already contain INITSEG
    75          movw    %ax, %ss
    76          movw    %di, %sp                # put stack at INITSEG:0x4000-12.

The lines 54-63 move the bootsector code from address 0x7C00 to 0x90000. This is achieved by:

1. set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
2. set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
3. set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
4. clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
5. go ahead and copy 512 bytes (rep movsw)

The use of "rep movsw" rather than "rep movsd" in this code is intentional (hint: .code16).

The line 64 jumps to the label "go:" in the newly made copy of the bootsector, i.e. in the segment 0x9000. This and the following three instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-12, i.e. %ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-12, where the 12 is decimal). This is where the limit on setup size that we mentioned earlier comes from (see Building the Linux Kernel Image).

The lines 77-103 patch the disk parameter table for the first disk to allow multi-sector reads:

    77  # Many BIOS's default disk parameter tables will not recognize
    78  # multi-sector reads beyond the maximum sector number specified
    79  # in the default diskette parameter tables - this may mean 7
    80  # sectors in some cases.
    81  #
    82  # Since single sector reads are slow and out of the question,
    83  # we must take care of this by creating new parameter tables
    84  # (for the first disk) in RAM. We will set the maximum sector
    85  # count to 36 - the most we will encounter on an ED 2.88.
    86  #
    87  # High doesn't hurt. Low does.
    88  #
    89  # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
    90  # and gs is unused.
    91          movw    %cx, %fs                # set fs to 0
    92          movw    $0x78, %bx              # fs:bx is parameter table address
    93          pushw   %ds
    94          ldsw    %fs:(%bx), %si          # ds:si is source
    95          movb    $6, %cl                 # copy 12 bytes
    96          pushw   %di                     # di = 0x4000-12.
    97          rep                             # don't need cld -> done on line 66
    98          movsw
    99          popw    %di
    100         popw    %ds
    101         movb    $36, 0x4(%di)           # patch sector count
    102         movw    %di, %fs:(%bx)
    103         movw    %es, %fs:2(%bx)

The floppy disk controller is reset using BIOS service int 0x13 function 0 "reset FDC" and setup sectors are loaded immediately after the bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using BIOS service int 0x13, function 2 "read sector(s)". This happens during lines 107-124:

    107 load_setup:
    108         xorb    %ah, %ah                # reset FDC
    109         xorb    %dl, %dl
    110         int     $0x13
    111         xorw    %dx, %dx                # drive 0, head 0
    112         movb    $0x02, %cl              # sector 2, track 0
    113         movw    $0x0200, %bx            # address = 512, in INITSEG
    114         movb    $0x02, %ah              # service 2, "read sector(s)"
    115         movb    setup_sects, %al        # (assume all on head 0, track 0)
    116         int     $0x13                   # read it
    117         jnc     ok_load_setup           # ok - continue
    118         pushw   %ax                     # dump error code
    119         call    print_nl
    120         movw    %sp, %bp
    121         call    print_hex
    122         popw    %ax
    123         jmp     load_setup
    124 ok_load_setup:

If loading failed for some reason (bad floppy or someone pulled the diskette out during the operation) then we dump the error code and retry in an endless loop. The only way to get out of it is to reboot the machine, unless a retry succeeds, but usually it doesn't (if something is wrong it will only get worse).

If loading setup_sects sectors of setup code succeeded we jump to label "ok_load_setup:".

Then we proceed to load the compressed kernel image at physical address 0x10000. This is done to preserve the firmware data areas in low memory (0-64K). After the kernel is loaded we jump to $SETUPSEG:0 (arch/i386/boot/setup.S). Once the firmware data is no longer needed (e.g. no more calls to BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000 (physical addresses, of course). This is done by setup.S, which sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c}. This sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it.
Note that the old bootloaders (old versions of LILO) could only load the first 4 sectors of setup, so there is code in setup to load the rest of itself if needed. Also, the code in setup has to take care of various combinations of loader type/version vs zImage/bzImage and is therefore highly complex.

Let us examine the kludge in the bootsector code that allows loading a big kernel, known also as "bzImage". The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls BIOS to move data from low to high memory. This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the bootsector can use the lcall instruction to jump to it (inter-segment jump). The reason why it is in setup.S is simply because there is no more space left in bootsect.S (which is strictly not true: there are approx 4 spare bytes and at least 1 spare byte in bootsect.S, but that is not enough, obviously). This routine uses BIOS service int 0x15 (ax=0x8700) to move to high memory and resets %es to always point to 0x10000, so that the code in bootsect.S doesn't run out of low memory when copying data from disk.

1.5 Using LILO as a bootloader

There are several advantages in using a specialized bootloader (LILO) over a bare bones Linux bootsector:

1. Ability to choose between multiple Linux kernels or even multiple OSes.
2. Ability to pass kernel command line parameters (there is a patch called BCP that adds this ability to bare-bones bootsector+setup).
3. Ability to load much larger bzImage kernels: up to 2.5M vs 1M.

Old versions of LILO (v17 and earlier) could not load bzImage kernels. The newer versions (as of a couple of years ago or earlier) use the same technique as bootsect+setup of moving data from low into high memory by means of BIOS services. Some people (Peter Anvin notably) argue that zImage support should be removed. The main reason (according to Alan Cox) it stays is that there are apparently some broken BIOSes that make it impossible to boot bzImage kernels while loading zImage ones fine.

The last thing LILO does is to jump to setup.S and things proceed as normal.

1.6 High level initialisation

By "high-level initialisation" we mean anything which is not directly related to bootstrap, even though parts of the code to perform this are written in asm, namely arch/i386/kernel/head.S, which is the head of the uncompressed kernel. The following steps are performed:

1. initialises segment values (%ds=%es=%fs=%gs=__KERNEL_DS = 0x18)
2. initialises page tables
3. enables paging by setting PG bit in %cr0
4. zero-cleans BSS (on SMP, only first CPU does this)
5. copies the first 2k of bootup parameters (kernel command line)
6. checks CPU type using EFLAGS and, if possible, cpuid, able to detect 386 and higher
7. the first CPU calls start_kernel(), all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.

The init/main.c:start_kernel() is written in C and does the following:

1. takes a global kernel lock (it is needed so that only one CPU goes through initialisation)
2. performs arch-specific setup (memory layout analysis, copying boot command line again, etc.)
3. prints Linux kernel "banner" containing the version, compiler used to build it etc. to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by "cat /proc/version".
4. initialises traps
5. initialises irqs
6. initialises data required for scheduler
7. initialises time keeping data
8. initialises softirq subsystem
9. parses boot commandline options
10. initialises console
11. if module support was compiled into the kernel, initialises dynamic module loading facility
12. if the "profile=" command line option was supplied, initialises profiling buffers
13. kmem_cache_init(), initialises most of slab allocator
14. enables interrupts
15. calculates BogoMips value for this CPU
16. calls mem_init() which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line
17. kmem_cache_sizes_init(), finishes slab allocator initialisation
18. initialises data structures used by procfs
19. fork_init(), creates uid_cache, initialises max_threads based on the amount of memory available and configures RLIMIT_NPROC for init_task to be max_threads/2
20. creates various slab caches needed for VFS, VM, buffer cache etc.
21. if System V IPC support is compiled in, initialises IPC subsystem. Note that for System V shm this includes mounting an internal (in-kernel) instance of shmfs filesystem
22. if quota support is compiled into the kernel, creates and initialises a special slab cache for it
23. performs arch-specific "check for bugs" and, whenever possible, activates workarounds for processor/bus/etc bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly
24. sets a flag to indicate that a schedule should be invoked at "next opportunity" and creates a kernel thread init() which execs execute_command if supplied via the "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panics with a suggestion to use the "init=" parameter
25. goes into the idle loop; this is an idle thread with pid=0

An important thing to note here is that the init() kernel thread calls do_basic_setup() which in turn calls do_initcalls() which goes through the list of functions registered by means of the __initcall or module_init() macros and invokes them. These functions either do not depend on each other or their dependencies have been manually fixed by the link order in the Makefiles. This means that, depending on the position of directories in the trees and the structure of the Makefiles, the order in which initialisation functions are invoked can change. Sometimes this is important, because you can imagine two subsystems A and B with B depending on some initialisation done by A. If A is compiled statically and B is a module then B's entry point is guaranteed to be invoked after A prepared all the necessary environment. If A is a module, then B is also necessarily a