Processor

Overview
General Principles of Pipelining
◼ Goal
◼ Difficulties
Creating a Pipelined Y86 Processor
◼ Rearranging SEQ
◼ Inserting pipeline registers
◼ Problems with data and control hazards

Suggested Reading
◼ Chap. 4.3.5, 4.4, 4.5

SEQ Hardware (Review)
◼ Stages occur in sequence
◼ One operation in process at a time
[Figure 4.21, p. 293: SEQ hardware structure, with stages labeled Fetch, Decode, Execute, Memory, Write back, and PC]

SEQ+ Hardware
◼ Still sequential implementation
◼ Reorder PC stage to put at beginning
PC Stage
◼ Task is to select PC for current instruction
◼ Based on results computed by previous instruction
Processor State
◼ PC is no longer stored in register
◼ But, can determine PC based on other stored information
[Figure: SEQ+ hardware structure, with the PC stage placed at the beginning; results of the previous instruction are held in pIcode, pBch, pValM, pValC, pValP]

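The PC selection described above can be sketched in Python. This is a minimal illustration, not the actual SEQ+ control logic: the signal names pIcode, pBch, pValM, pValC, and pValP come from the figure, while the function name `select_pc`, the selection rules, and the instruction code constants (taken from the usual Y86 encoding) are assumptions made for the example.

```python
# Assumed Y86 instruction codes (standard encoding; not stated on the slide).
IJXX = 0x7   # conditional jump
ICALL = 0x8  # call
IRET = 0x9   # ret


def select_pc(p_icode, p_bch, p_val_m, p_val_c, p_val_p):
    """Pick the PC for the current instruction from the previous one's results.

    p_icode: instruction code of the previous instruction
    p_bch:   True if the previous instruction was a taken branch
    p_val_m: value read from memory (return address for ret)
    p_val_c: constant word of the previous instruction (jump/call target)
    p_val_p: address of the instruction that follows the previous one
    """
    if p_icode == ICALL:
        return p_val_c            # call: continue at the call target
    if p_icode == IJXX and p_bch:
        return p_val_c            # taken jump: continue at the jump target
    if p_icode == IRET:
        return p_val_m            # ret: continue at the popped return address
    return p_val_p                # everything else: fall through


# Hypothetical values: a taken jump to 0x100 versus the same jump not taken.
print(hex(select_pc(IJXX, True, 0x0, 0x100, 0x20)))
print(hex(select_pc(IJXX, False, 0x0, 0x100, 0x20)))
```

No PC register is needed in this sketch: the next PC is recomputed each cycle from the saved results of the previous instruction.
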
Problem of SEQ and SEQ+
Too slow
◼ Too many tasks must finish in one clock cycle
◼ Signals need a long time to propagate through all of the stages
◼ The clock must run slowly enough to allow for this
Does not make good use of hardware units
◼ Each unit is active for only part of the total clock cycle

Real-World Pipelines: Car Washes
Idea
◼ Divide process into independent stages
◼ Move objects through stages in sequence
◼ At any given time, multiple objects are being processed
[Figure: three car wash arrangements: Sequential, Parallel, Pipelined]

Computational Example
System
◼ Computation requires total of 300 picoseconds
◼ Additional 20 picoseconds to save result in register
◼ Must have clock cycle of at least 320 ps
[Figure 4.32, p. 310: combinational logic (300 ps) followed by a clocked register (20 ps); Delay = 320 ps, Throughput = 3.12 GOPS]

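The throughput figure follows from the clock cycle: one operation completes per cycle, so throughput is the reciprocal of the cycle time. A minimal sketch of that arithmetic, using the slide's numbers (the function and variable names are made up for this example):

```python
def throughput_gops(cycle_ps):
    """Throughput in GOPS (10^9 operations/second) at one operation per cycle.

    One cycle of cycle_ps picoseconds lasts cycle_ps * 1e-12 seconds, so the
    rate is 1e12 / cycle_ps operations per second, i.e. 1000 / cycle_ps GOPS.
    """
    return 1000.0 / cycle_ps


logic_ps = 300                   # combinational logic delay
register_ps = 20                 # time to save the result in the register
cycle_ps = logic_ps + register_ps

print(cycle_ps)                  # minimum clock cycle in ps
print(throughput_gops(cycle_ps)) # 3.125, shown on the slide as 3.12 GOPS
```
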
3-Way Pipelined Version
System
◼ Divide combinational logic into 3 blocks of 100 ps each
◼ Can begin new operation as soon as previous one passes through stage A
⚫ Begin new operation every 120 ps
◼ Overall latency increases
⚫ 360 ps from start to finish
[Figure 4.33 A), p. 310: combinational logic blocks A, B, C (100 ps each), each followed by a clocked register (20 ps); Delay = 360 ps, Throughput = 8.33 GOPS]

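The same arithmetic extends to the pipelined version. The sketch below assumes the clock period is set by the slowest stage plus the register delay, and that an operation needs one cycle per stage. The function name `pipeline_timing` is made up for this example.

```python
def pipeline_timing(stage_logic_ps, register_ps):
    """Return (cycle_ps, latency_ps, throughput_gops) for a pipeline.

    stage_logic_ps: combinational delay of each stage, in picoseconds
    register_ps:    delay of the pipeline register after each stage
    """
    cycle_ps = max(stage_logic_ps) + register_ps  # slowest stage sets the clock
    latency_ps = cycle_ps * len(stage_logic_ps)   # one cycle per stage
    throughput_gops = 1000.0 / cycle_ps           # one operation per cycle
    return cycle_ps, latency_ps, throughput_gops


cycle, latency, gops = pipeline_timing([100, 100, 100], 20)
print(cycle, latency, round(gops, 2))
```

With the slide's numbers this gives a 120 ps cycle, 360 ps latency, and about 8.33 GOPS. Compared with the unpipelined 320 ps cycle, throughput improves by a factor of roughly 2.67 rather than 3, because each added register contributes its own 20 ps.
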
Pipeline Diagrams
Unpipelined
◼ Cannot start new operation until previous one completes
3-Way Pipelined
◼ Up to 3 operations in process simultaneously
[Figure 4.33 B), p. 310: pipeline diagrams over time. Unpipelined: OP1, OP2, OP3 run back to back. 3-way pipelined: each of OP1, OP2, OP3 passes through stages A, B, C, with each operation starting one stage after the previous one]

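A pipeline diagram like the one in the figure can be generated with a few lines of Python. This sketch assumes an ideal pipeline in which a new operation enters stage A on every cycle. The function names and the use of "." for idle slots are choices made for this example.

```python
def pipeline_diagram(num_ops, stages="ABC"):
    """Return one text row per operation; columns are clock cycles."""
    total_cycles = num_ops + len(stages) - 1
    rows = []
    for op in range(num_ops):
        cells = []
        for cycle in range(total_cycles):
            idx = cycle - op  # operation `op` enters the first stage at cycle `op`
            cells.append(stages[idx] if 0 <= idx < len(stages) else ".")
        rows.append("OP%d " % (op + 1) + " ".join(cells))
    return rows


def ops_in_flight(num_ops, num_stages, cycle):
    """Count the operations that occupy some stage during the given cycle."""
    return sum(1 for op in range(num_ops) if 0 <= cycle - op < num_stages)


for row in pipeline_diagram(3):
    print(row)
print(ops_in_flight(3, 3, 2))  # all three operations overlap in the third cycle
```

Three operations finish in 5 cycles of 120 ps (600 ps) here, versus 3 cycles of 320 ps (960 ps) unpipelined.
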