
Computer Architecture
Pipelining

Can We Do Better?

• What limitations do you see with the multi-cycle design?
• Limited concurrency
  – Some hardware resources are idle during different phases of the instruction processing cycle
  – “Fetch” logic is idle when an instruction is being “decoded” or “executed”
  – Most of the datapath is idle when a memory access is happening

Can We Use the Idle Hardware to Improve Concurrency?

• Goal: Concurrency → throughput (more “work” completed in one cycle)
• Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction
  – E.g., when an instruction is being decoded, fetch the next instruction
  – E.g., when an instruction is being executed, decode another instruction
  – E.g., when an instruction is accessing data memory (ld/st), execute the next instruction
  – E.g., when an instruction is writing its result into the register file, access data memory for the next instruction

Pipelining: Basic Idea

• More systematically:
  – Pipeline the execution of multiple instructions
  – Analogy: “Assembly line processing” of instructions
• Idea:
  – Divide the instruction processing cycle into distinct “stages” of processing
  – Ensure there are enough hardware resources to process one instruction in each stage
  – Process a different instruction in each stage
    • Instructions consecutive in program order are processed in consecutive stages
• Benefit: Increases instruction processing throughput (1/CPI)
• Downside: Start thinking about this.
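The throughput claim above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes an idealized 4-stage (F/D/E/W) pipeline with no stalls; the instruction count is an arbitrary illustrative choice, not from the slides.

```python
# Illustrative sketch: CPI for a multi-cycle machine vs. an ideal
# 4-stage pipeline, ignoring all hazards and stalls.

stages = 4             # F, D, E, W
n_instructions = 1000  # arbitrary illustrative count

# Multi-cycle: each instruction takes all 4 cycles before the next starts.
multi_cycle_total = n_instructions * stages      # 4000 cycles

# Ideal pipeline: after (stages - 1) fill cycles, one instruction
# completes every cycle.
pipelined_total = (stages - 1) + n_instructions  # 1003 cycles

print(multi_cycle_total / n_instructions)  # CPI = 4.0
print(pipelined_total / n_instructions)    # CPI = 1.003, -> 1.0 for large n
```

Since throughput is 1/CPI, the pipelined machine approaches 4x the multi-cycle throughput as the instruction stream grows.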

Example: Execution of Four Independent ADDs

• Multi-cycle: 4 cycles per instruction
• Pipelined: 4 cycles per 4 instructions (steady state)

[Timing diagrams: F/D/E/W stages of the four ADDs; the multi-cycle design runs them back-to-back, the pipelined design overlaps them one cycle apart]
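The pipelined diagram on this slide can be reproduced with a small script. This is a sketch assuming an ideal 4-stage pipeline where instruction i simply enters fetch at cycle i; the function name and layout are mine, not from the slides.

```python
# Sketch: print which stage each of four independent ADDs occupies in
# each cycle, assuming an ideal 4-stage F/D/E/W pipeline with no stalls.

STAGES = ["F", "D", "E", "W"]

def pipeline_diagram(n_instructions, n_cycles):
    rows = []
    for i in range(n_instructions):
        row = []
        for cycle in range(n_cycles):
            stage = cycle - i  # instruction i enters the pipeline at cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        rows.append(" ".join(row))
    return rows

for line in pipeline_diagram(4, 7):
    print(line)
# F D E W . . .
# . F D E W . .
# . . F D E W .
# . . . F D E W
```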

The Laundry Analogy

• “place one dirty load of clothes in the washer”
• “when the washer is finished, place the wet load in the dryer”
• “when the dryer is finished, take out the dry load and fold”
• “when folding is finished, ask your roommate (??) to put the clothes away”

– steps to do a load are sequentially dependent
– no dependence between different loads
– different steps do not share resources

[Timing diagram: loads A–D processed one after another from 6 PM to 2 AM]

Pipelining Multiple Loads of Laundry

[Timing diagram: loads A–D overlapped, each starting one step after the previous]

– 4 loads of laundry in parallel
– no additional resources
– throughput increased by 4
– latency per load is the same

Pipelining Multiple Loads of Laundry: In Practice

[Timing diagram: overlapped loads with a dryer step slower than the other steps]

the slowest step decides throughput
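The "slowest step decides throughput" observation can be made concrete with numbers; the stage latencies below are illustrative assumptions, not values from the slides.

```python
# Sketch: the slowest stage sets the pipeline clock, and therefore the
# steady-state throughput. Latencies (minutes) are made up for illustration.

stage_minutes = {"wash": 30, "dry": 60, "fold": 15, "put_away": 15}

cycle = max(stage_minutes.values())  # 60 min: the dryer dominates
loads_per_hour = 60 / cycle          # 1.0 load per hour

# If every stage were balanced at 30 minutes, the same pipeline would
# finish 60 / 30 = 2 loads per hour.
print(loads_per_hour)
```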

Pipelining Multiple Loads of Laundry: In Practice

[Timing diagram: loads A and B alternate between two dryers]

Throughput restored (2 loads per hour) using 2 dryers
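Replicating the bottleneck stage, as the two-dryer example shows, can be sketched the same way; the latencies and replication counts here are illustrative assumptions.

```python
# Sketch: two dryers alternate, halving the effective dry latency, so
# the bottleneck moves back to the 30-minute washer.

stage_minutes = {"wash": 30, "dry": 60, "fold": 15, "put_away": 15}
copies        = {"wash": 1,  "dry": 2,  "fold": 1,  "put_away": 1}

# Effective latency of a replicated stage is its latency / copy count.
effective = {s: t / copies[s] for s, t in stage_minutes.items()}

cycle = max(effective.values())  # back to 30.0 minutes
print(60 / cycle)                # 2.0 loads per hour, as on the slide
```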

An Ideal Pipeline

• Goal: Increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
• Repetition of identical operations
  – The same operation is repeated on a large number of different inputs
• Repetition of independent operations
  – No dependencies between repeated operations
• Uniformly partitionable suboperations
  – Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
• Fitting examples: automobile assembly line, doing laundry
  – What about the instruction processing “cycle”?
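For a pipeline that meets these ideal conditions, the speedup over an unpipelined design has a simple closed form; the sketch below assumes k uniform-latency stages and no hazards.

```python
# Sketch: n operations through k uniform stages take (k + n - 1) cycles
# pipelined vs. n * k cycles unpipelined, so speedup -> k as n grows.

def speedup(n, k):
    return (n * k) / (k + n - 1)

print(speedup(4, 4))       # ~2.29 for the four-ADD example
print(speedup(10_000, 4))  # ~4.0: approaches the stage count
```

Note that a short burst of instructions sees much less than the full factor-of-k benefit, because the pipeline fill time is not amortized.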