ECE 4750 Computer Architecture, Fall 2014 T03 Pipelined Processors
Transcription
ECE 4750 Computer Architecture, Fall 2014 T03 Pipelined Processors
ECE 4750 Computer Architecture, Fall 2014 T03 Pipelined Processors School of Electrical and Computer Engineering Cornell University revision: 2014-09-24-12-26 1 Pipelined Processors 2 RAW Data Hazards 3 10 2.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3. Hardware Bypassing . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4. RAW Data Hazards Through Memory . . . . . . . . . . . . . . . . . 20 3 Control Hazards 22 3.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3. Hardware Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4. Interrupts and Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Structual Hazards 34 4.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.3. Hardware Duplication . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5 WAW and WAR Name Hazards 38 5.1. Software Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6 Analyzing Processor Performance 41 7 Case Study: MIPS R2K 44 2 1 Pipelined Processors 1. Pipelined Processors Instructions Cycles Time Time = × × Program Program Instruction Cycles • Instructions / program depends on source code, compiler, ISA • Cycles / instruction (CPI) depends on ISA, microarchitecture • Time / cycle depends upon microarchitecture and implementation Microarchitecture Topic 01 Single-Cycle Processor Topic 02 FSM Processor this topic → Pipelined Processor CPI Cycle Time 1 >1 ≈1 long short short Technology Constraints Control Unit • Assume modern technology where logic is cheap and fast (e.g., fast integer ALU) Control Status Datapath • Assume multi-ported register files with a reasonable number of ports are feasible • Assume small amount of very fast memory (caches) backed by large, slower memory regfile imreq Memory 3 imresp dmreq <1 cycle combinational dmresp 1 Pipelined Processors Pipelining laundry analogy • Anne, Brian, Cathy, and Dave each have one load of clothes • Washing, drying, folding, and storing each take 30 minutes Sequential Laundry 7pm 8pm 9pm 10pm 11pm 12am 1am 2am 3am Anne's Load Ben's Load Cathy's Load Dave's Load Pipelined Laundry 7pm 8pm 9pm Pipelined Laundry with Slow Dryers 10pm 7pm Anne's Load Anne's Load Ben's Load Ben's Load Cathy's Load Cathy's Load Dave's Load Dave's Load 8pm 9pm 10pm 11pm Pipelining lessons • • • • • • Multiple transactions operate simultaneously using different resources Pipelining does not help the transaction latency Pipelining does help the transaction throughput Potential speedup is proportional to the number of pipeline stages Potential speedup is limited by the slowest pipeline stage Potential speedup is reduced by time to fill the pipeline 4 12am 1 Pipelined Processors Visualizing space, time, and transactions with pipeline diagrams 5 1 Pipelined Processors Comparing single-cycle, FSM, and pipelined processors addu addiu mul lw sw j jal jr bne Fetch Instruction 3 3 3 3 3 3 3 3 3 Decode Instruction 3 3 3 3 3 3 3 3 3 Read Registers 3 3 3 3 3 3 3 Register Arithmetic 3 3 3 3 3 Read Memory 3 3 Write Memory 3 Write Registers 3 3 3 3 Update PC 3 3 3 3 3 3 3 3 Single-Cycle Processor Fetch Inst Decode Inst Reg Arith lw Read Reg Update PC Read Mem Write Reg Fetch Inst Decode Inst Reg Arith addu Read Reg Update PC Write Reg Fetch Inst Decode Inst j FSM Processor Fetch Inst Decode Inst Reg Arith lw Read Reg Update PC Read Mem Write Reg Fetch Inst Decode Inst Reg Arith addu Read Reg Update PC Write Reg j Pipelined Processor Fetch Inst Decode Inst Reg Arith Read Mem Write Reg lw Read Reg Update PC Fetch Inst Decode Inst Reg Arith Write Reg addu Read Reg Update PC Fetch Inst Decode Inst j 6 Fetch Inst Update PC Decode Inst Update PC Update PC 3 3 1 Pipelined Processors Pipelining a PARCv1 processor • • • • • • • Incrementally develop an unpipelined datapath Keep data flowing from left to right Position control signal table early in the diagram Divided datapath/control into stages by inserting pipeline registers Keep the pipeline stages roughly balanced Forward arrows should avoid “skipping” pipeline registers Backward arrows will need careful consideration Control Signal Table br_targ jr j_targ pc_plus4 pc_F pc _sel_F ir[25:0] j_tgen ir[15:0] wb_sel_X br_tgen +4 wb_sel_M mul rf _waddr_W rf _wen_W ir[25:21] ir[20:16] ir[15:0] regfile (read) regfile (write) alu_ func_X sext alu eq_X op1 _sel_D imreq addr imresp data dmreq data 7 dmreq dmresp addr addr 1 Pipelined Processors Adding a new auto-incrementing load instruction Draw on the above datapath diagram what paths we need to use as well as any new paths we will need to add in order to implement the following auto-incrementing load instruction. lw.ai rt, imm(rs) R[rt] ← M[ R[rs] + sext(imm) ]; R[rs] ← R[rs] + 4 Control Signal Table F Stage always pc pc_plus4 _sel_F cs_XM D Stage br_targ jr j_targ pc_plus4 pc_F cs_DX X Stage M Stage W Stage btarg_DX ir[25:0] j_tgen pc_plus4 _FD ir[15:0] br_tgen +4 wb_sel_X op0_DX mul ir[20:16] ir[15:0] regfile (read) op1_DX sext rf _waddr_W rf _wen_W result _MW wb_sel_M result _XM ir[25:21] ir_FD regfile (write) alu_ func_X alu sd_DX op1 _sel_D imreq addr cs_MW imresp data eq_X sd_XM dmreq data 8 dmreq dmresp addr addr 1 Pipelined Processors Pipeline diagrams addiu r1, r2, 1 addiu r3, r4, 1 F D X M W F D X M W F D X M addiu r5, r6, 1 W addiu r1, r2, 1 addiu r3, r4, 1 addiu r5, r6, 1 What would be the total execution time if these three instructions were repeated 10 times? Hazards occur when instructions interact with each other in pipeline • RAW Data Hazards: An instruction depends on a data value produced by an earlier instruction • Control Hazards: Whether or not an instruction should be executed depends on a control decision made by an earlier instruction • Structural Hazards: An instruction in the pipeline needs a resource being used by another instruction in the pipeline • WAW and WAR Name Hazards: An instruction in the pipeline is writing a register that an earlier instruction in the pipeline is either writing or reading 9 2 RAW Data Hazards 2. RAW Data Hazards RAW data hazards occur when one instruction depends on a data value produced by a preceding instruction still in the pipeline. We use architectural dependency arrows to illustrate RAW dependencies in assembly code sequences. addiu r1, r2, 1 addiu r3, r1, 1 addiu r4, r3, 1 Using pipeline diagrams to illustrate RAW hazards We use microarchitectural dependency arrows to illustrate RAW hazards on pipeline diagrams. addiu r1, r2, 1 addiu r3, r1, 1 F D X M W F D X M W F D X M addiu r4, r3, 1 addiu r1, r2, 1 addiu r3, r1, 1 addiu r4, r3, 1 10 W 2 RAW Data Hazards Approaches to resolving data hazards • Software Scheduling: Programmer or compiler explicitly avoids scheduling instructions that would create data hazards • Hardware Scheduling: Hardware dynamically schedules instructions to avoid RAW hazards, potentially allowing instructions to execute out of order • Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished producing data value • Hardware Bypassing: Hardware allows values to be sent from an earlier instruction to a later instruction before the earlier instruction has left the pipeline • Hardware Speculation: Hardware guesses that there is no hazard and allows later instructions to potentially read invalid data; detects when there is a problem and re-executes instructions that operated on invalid data 11 2 RAW Data Hazards 2.1. Software Scheduling 2.1. Software Scheduling Insert nops to delay read of earlier write. These nops count as real instructions increasing instructions per program. Insert independent instructions to delay read of earlier write, and only use nops if there is not enough useful work addiu addiu addiu nop addiu nop nop nop addiu addiu r1, r2, 1 nop nop nop addiu r3, r1, 1 nop nop nop addiu r4, r3, 1 r1, r2, 1 r6, r7, 1 r8, r9, 1 r3, r1, 1 r4, r3, 1 Pipeline diagram showing software scheduling for RAW data hazards addiu r1, r2, 1 addiu r6, r7, 1 addiu r8, r9, 1 nop addiu r3, r1, 1 nop nop nop addiu r4, r3, 1 12 2 RAW Data Hazards 2.2. Hardware Stalling 2.2. Hardware Stalling Hardware includes control logic that freezes later instructions (in front of pipeline) until earlier instruction (in back of pipeline) has finished producing data value. Pipeline diagram showing hardware stalling for RAW data hazards addiu r1, r2, 1 F D X M W F addiu r3, r1, 1 addiu r1, r2, 1 addiu r3, r1, 1 addiu r4, r3, 1 13 D 2 RAW Data Hazards 2.2. Hardware Stalling Modifications to datapath/control to support hardware stalling Stall Signal cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F always pc pc_plus4 _sel_F X Stage M Stage W Stage btarg_DX j_tgen pc_plus4 _FD ir[15:0] ir_FD reg_ en_D wb_sel_X op0_DX br_tgen mul ir[15:0] regfile (read) op1_DX sext rf _waddr_W rf _wen_W result _MW wb_sel_M result _XM ir[25:21] ir[20:16] nop insert _nop_D imreq addr cs_MW ir[25:0] +4 reg_ en_F cs_XM regfile (write) alu_ func_X alu sd_DX op1 _sel_D eq_X imresp data sd_XM dmreq data dmreq dmresp addr addr Deriving the stall signal ren0 raddr0 ren1 addu addiu mul lw sw j jal jr bne 14 raddr1 wen waddr 2 RAW Data Hazards 2.2. Hardware Stalling stall_waddr_X_raddr0_D = ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0) stall_waddr_M_raddr0_D = ren0_D && wen_M && (raddr0_D == waddr_M) && (waddr_M != 0) stall_waddr_W_raddr0_D = ren0_D && wen_W && (raddr0_D == waddr_W) && (waddr_W != 0) stall_waddr_X_raddr1_D = ren1_D && wen_X && (raddr1_D == waddr_X) && (waddr_X != 0) stall_waddr_M_raddr1_D = ren1_D && wen_M && (raddr1_D == waddr_M) && (waddr_M != 0) stall_waddr_W_raddr1_D = ren1_D && wen_W && (raddr1_D == waddr_W) && (waddr_W != 0) stall = stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D || stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D || stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D reg_en_F = !stall reg_en_D = !stall insert_nop_D = stall 15 2 RAW Data Hazards 2.3. Hardware Bypassing 2.3. Hardware Bypassing Hardware allows values to be sent from an earlier instruction (in back of pipeline) to a later instruction (in front of pipeline) before the earlier instruction has left the pipeline. Sometimes called “forwarding”. Pipeline diagram showing hardware bypassing for RAW data hazards addiu r1, r2, 1 addiu r3, r1, 1 F D X M W F D X M W F D X M addiu r4, r3, 1 addiu r1, r2, 1 addiu r3, r1, 1 addiu r4, r3, 1 16 W 2 RAW Data Hazards 2.3. Hardware Bypassing Adding single bypass path to support limited hardware bypassing Stall & Stall Bypass Signals Signal cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F always pc pc_plus4 _sel_F cs_MW X Stage M Stage W Stage btarg_DX ir[25:0] j_tgen pc_plus4 _FD ir[15:0] br_tgen +4 reg_ en_F cs_XM ir_FD reg_ en_D mul ir[20:16] ir[15:0] nop insert _nop_D wb_sel_X op0_DX op1_DX sext wb_sel_M result _XM ir[25:21] regfile (read) rf _waddr_W rf _wen_W result _MW op0_byp _sel_D regfile (write) alu_ func_X alu sd_DX op1 _sel_D eq_X sd_XM bypass_from_X imreq addr imresp data dmreq data dmreq dmresp addr addr Deriving the bypass and stall signals stall_waddr_X_raddr0_D = 0 bypass_waddr_X_raddr0_D = ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0) stall = stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D || stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D || stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D reg_en_F = !stall reg_en_D = !stall insert_nop_D = stall 17 2 RAW Data Hazards 2.3. Hardware Bypassing Handling load-use RAW dependencies ALU-use latency is only one cycle, but load-use latency is two cycles. lw r1, 0(r2) F addiu r3, r1, 1 D X M W F D X M W lw r1, 0(r2) addiu r3, r1, 1 lw r1, 0(r2) addiu r3, r1, 1 stall_waddr_X_raddr0_D = ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0) && (op_X == lw) bypass_waddr_X_raddr0_D = ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0) && (op_X != lw) stall = stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D || stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D || stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D reg_en_F = !stall reg_en_D = !stall insert_nop_D = stall 18 2 RAW Data Hazards 2.3. Hardware Bypassing Pipeline diagram showing multiple hardware bypass paths addiu r1, r2, 1 addiu r3, r4, 1 addiu r5, r3, 1 addu r6, r1, r3 sw r5, 0(r7) jr r6 Adding all bypass path to support full hardware bypassing Stall & Stall Bypass Signals Signal cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F always pc pc_plus4 _sel_F cs_MW X Stage M Stage ir[25:0] j_tgen pc_plus4 _FD ir[15:0] br_tgen ir_FD reg_ en_D wb_sel_X op0_DX mul ir[15:0] regfile (read) rf _waddr_W rf _wen_W result _MW op0_byp _sel_D op1_DX sext wb_sel_M result _XM ir[25:21] ir[20:16] nop insert _nop_D regfile (write) alu_ func_X alu sd_DX op1 op1_ byp_ _sel_D sel_D eq_X sd_XM bypass_from_X bypass_from_M bypass_from_W imreq addr W Stage btarg_DX +4 reg_ en_F cs_XM imresp data dmreq data 19 dmreq dmresp addr addr 2 RAW Data Hazards 2.4. RAW Data Hazards Through Memory 2.4. RAW Data Hazards Through Memory So far we have only studied RAW data hazards through registers, but we must also carefully consider RAW data hazards through memory. sw r1, 0(r2) lw r3, 0(r4) # RAW dependency occurs if R[r2] == R[r4] Pipeline diagrams showing RAW dependency through memory sw r1, 0(r2) lw r3, 0(r4) F D X M W F D X M sw r1, 0(r2) lw r3, 0(r4) 20 W 2 RAW Data Hazards 2.4. RAW Data Hazards Through Memory Pipeline diagram for simple assembly sequence Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined PARCv1 processor. Include microarchitectural dependency arrows to illustrate how data is transferred along various bypass paths. lw r1, 0(r2) lw r3, 0(r4) addu r5, r1, r3 sw r5, 0(r6) addiu r2, r2, 4 addiu r4, r4, 4 addiu r6, r6, 4 addiu r7, r7, -1 bne r7, r0, loop 21 3 Control Hazards 3. Control Hazards Control hazards occur when whether or not an instruction should be executed depends on a control decision made by an earlier instruction We use architectural dependency arrows to illustrate control dependencies in assembly code sequences. foo: bar: addiu j addiu addiu bne addiu addiu r1, foo r3, r5, r0, r7, r9, r2, 1 # Jumps are always taken r4, 1 r6, 1 r0, bar r8, 1 r10, 1 # This branch is not taken Using pipeline diagrams to illustrate control hazards We use microarchitectural dependency arrows to illustrate control hazards on pipeline diagrams. Jump resolution latency is two cycles, and branch resolution latency is three cycles. j foo F addiu r5, r6, 1 bne r0, r0, bar addiu r7, r8, 1 F D X M W F D X M D X M W F D X M 22 W W 3 Control Hazards addiu r1, r2, 1 j foo addiu r5, r6, 1 bne r0, r0, bar addiu r7, r8, 1 addiu r9, r10, 1 Approaches to resolving control hazards • Software Scheduling: Programmer or compiler explicitly avoids scheduling instructions that would create control hazards • Software Predication: Programmer or compiler converts control flow into data flow by using instructions that conditionally execute based on a data value • Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction have finished determining the correct control flow • Hardware Speculation: Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; detects when there is a problem and re-executes instructions the instructions that are along the correct control flow • Software Hints: Programmer or compiler provides hints about whether a conditional branch will be taken or not taken, and hardware can use these hints for more efficient hardware speculation 23 3 Control Hazards 3.1. Software Scheduling 3.1. Software Scheduling Expose branch delay slots as part of the instruction set. Branch delay slots are instructions that follow a jump or branch and are always executed regardless of whether a jump or branch is taken or not taken. Compiler tries to insert useful instructions, otherwise inserts nops. foo: bar: addiu j nop addiu addiu bne nop nop addiu addiu r1, r2, 1 foo r3, r4, 1 r5, r6, 1 r0, r0, bar r7, r8, 1 r9, r10, 1 Assume we modify the PARCv1 instruction set to specify that J, JAL, and JR instructions have a single-instruction branch delay slot (i.e., one instruction after a J, JAL, and JR is always executed) and the BNE instruction has a two-instruction branch delay slot (i.e., two instructions after a BNE are always executed). Pipeline diagram showing software scheduling for control hazards addiu r1, r2, 1 j foo nop addiu r5, r6, 1 bne r0, r0, bar nop nop addiu r7, r8, 1 addiu r9, r10, 1 24 3 Control Hazards 3.2. Hardware Stalling 3.2. Hardware Stalling Although we could stall to resolve control hazards, the issue is that every instruction essentially creates a control hazard until we know if that instruction is a jump, branch, or other instruction. Stalling will seriously degrade the performance of the common case when we have a sequence of instructions that are not control flow instructions. addiu r1, r2, 1 j foo addiu r5, r6, 1 bne r0, r0, bar addiu r7, r8, 1 addiu r9, r10, 1 25 3 Control Hazards 3.3. Hardware Speculation 3.3. Hardware Speculation Hardware guesses which way the control flow will go and potentially fetches incorrect instructions; detects when there is a problem and re-executes instructions the instructions that are along the correct control flow. For now, we will only consider a simple branch prediction scheme where the hardware always predicts not taken. Pipeline diagram when branch is not taken addiu r1, r2, 1 j foo addiu r3, r4, 1 addiu r5, r6, 1 bne r0, r0, bar addiu r7, r8, 1 addiu r9, r10, 1 Pipeline diagram when branch is taken addiu r1, r2, 1 j foo addiu r3, r4, 1 addiu r5, r6, 1 bne r0, r0, bar addiu r7, r8, 1 addiu r9, r10, 1 addiu r9, r10, 1 26 3 Control Hazards 3.3. Hardware Speculation Modifications to datapath/control to support hardware speculation Stall, Stall Bypass, & Stall Bypass & Squash Signals Signal Signals cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F X Stage j_tgen ir[15:0] W Stage reg_ en_D nop mul ir[15:0] insert nop _nop_F insert _nop_D wb_sel_X op0_DX regfile (read) op1_DX sext wb_sel_M result _XM ir[25:21] ir[20:16] rf _waddr_W rf _wen_W result _MW op0_byp _sel_D br_tgen ir_FD M Stage btarg_DX +4 reg_ en_F cs_MW ir[25:0] pc_plus4 _FD always pc pc_plus4 _sel_F cs_XM regfile (write) alu_ func_X alu sd_DX op1 op1_ byp_ _sel_D sel_D eq_X sd_XM bypass_from_X bypass_from_M bypass_from_W imreq addr imresp data dmreq data dmreq dmresp addr addr Deriving the squash signals squash_F = (op_D == j) || (op_D == jal) || (op_D == jr) || br_taken_X squash_D = br_taken_X reg_en_F reg_en_D insert_nop_D insert_nop_F = !stall || squash_F = !stall || squash_D = stall || squash_D = squash_F 27 3 Control Hazards 3.3. Hardware Speculation Pipeline diagram for simple assembly sequence Draw a pipeline diagram illustrating how the following assembly sequence would execute on a fully bypassed pipelined PARCv1 processor that uses hardware speculation which always predicts not-taken. Unlike the “standard” PARCv1 processor, you should also assume that we add a single-instruction branch delay slot to the instruction set. So this processor will leverage both software scheduling and hardware speculation. Include microarchitectural dependency arrows to illustrate both data and control flow. addiu bne addiu addiu ... foo: addu addiu r1, r0, r4, r6, r2, r3, r5, r7, 1 foo 1 1 # assume R[rs] != 0 # instruction in branch delay slot r8, r1, r4 r9, r1, 1 28 3 Control Hazards 3.4. Interrupts and Exceptions 3.4. Interrupts and Exceptions Interrupts and exceptions alter the normal control flow of the program. They are caused by an external or internal event that needs to be processed by the system, and these events are usually unexpected or rare from the program’s point of view. • Asynchronous Interrupts – Input/output device needs to be serviced – Timer has expired – Power distruption or hardware failure • Synchronous Exceptions – – – – – – Undefined opcode, privileged instruction Arithmetic overflow, floating-point exception Misaligned memory access for instruction fetch or data access Memory protection violation Virtual memory page faults System calls (traps) to jump into the operating system kernel Interrupts and Exception Semantics • Interrupts are asynchronous with respect to the program, so the microarchitecture can decide when to service the interrupt • Exceptions are synchronous with respect to the program, so they must be handled immediately 29 3 Control Hazards 3.4. Interrupts and Exceptions • To handle an interrupt or exception the hardware/software must: – – – – – – – – Stop program at current instruction (I), ensure previous insts finished Save cause of interrupt or exception in privileged arch state Save the PC of the instruction I in a special register (EPC) Switch to privileged mode Set the PC to the address of either the interrupt or the exception handler Disable interrupts Save the user architectural state Check the type of interrupt or exception – Handle the interrupt or exception – – – – Enable interrupts Switch to user mode Set the PC to EPC if I should be restarted Potentially set PC to EPC+4 if we should skip I Handling a misaligned data address and syscall exceptions Static code sequence Dynamic code sequence addiu r1, r0, 0x2001 lw r2, 0(r1) syscall opB opC ... exception_hander: opD # disable interrupts opE # save user registers opF # check exception type opG # handle exception opH # enable interrupts addiu EPC, EPC, 4 eret addiu r1, r0, 0x2001 lw r2, 0(r1) (excep) opD opE opF opG opH addiu EPC, EPC, 4 eret syscall (excep) opD opE opF 30 3 Control Hazards 3.4. Interrupts and Exceptions Interrupts and Exceptions in a PARCv4 Pipelined Processor F Inst Address Exceptions D X Illegal Instruction Arithmetic Overflow M W Data Address Exceptions • How should we handle a single instruction which generates multiple exceptions in different stages as it goes down the pipeline? – Exceptions in earlier pipeline stages override later exceptions for a given instruction • How should we handle multiple instructions generating exceptions in different stages at the same or different times? – We always want the execution to appear as if we have completed executed one instruction before going onto the next instruction – So we want to process the exception corresponding to the earliest instruction in program order first – Hold exception flags in pipeline until commit point – Commit point is after all exceptions could be generated but before any architectural state has been updated – To handle an exception at the commit point: update cause and EPC, squash all stages before the commit point, and set PC to exception handler • How and where to handle external asynchronous interrupts? – Inject asynchronous interrupts at the commit point – Asynchronous interrupts will then naturally override exceptions caused by instructions earlier in the pipeline 31 3 Control Hazards 3.4. Interrupts and Exceptions Modifications to datapath/control to support exceptions exception_FD exception_DX exception_XM Exception Handling Logic Stall & Stall Bypass Signals Signal cs_XM cs_DX Control Signal Table F Stage D Stage ehandler br_targ jr j_targ pc_plus4 pc_F pc _sel_F insert _nop_X X Stage cs_MW nop M Stage insert _nop_M btarg_DX j_tgen ir[15:0] ir_FD wb_sel_X op0_DX mul ir[15:0] regfile (read) op1_DX sext wb_sel_M result _XM ir[25:21] ir[20:16] rf _waddr_W rf _wen_W result _MW op0_byp _sel_D br_tgen reg_ en_D nop insert _nop_D regfile (write) alu_ func_X alu sd_DX op1 op1_ byp_ _sel_D sel_D eq_X sd_XM bypass_from_X bypass_from_M bypass_from_W imreq addr W Stage ir[25:0] pc_plus4 _FD +4 reg_ en_F nop imresp data dmreq data dmreq dmresp addr addr Deriving the squash signals squash_F = (op_D == j) || (op_D == jal) || (op_D == jr) || br_taken_X || exception_M squash_D = br_taken_X || exception_M squash_X = exception_M squash_M = exception_M 32 3 Control Hazards 3.4. Interrupts and Exceptions Pipeline diagram of exception handling 33 4 Structual Hazards 4. Structual Hazards Structural hazards occur when an instruction in the pipeline needs a resource being used by another instruction in the pipeline. The PARCv1 processor pipeline is specifically designed to avoid any structural hazards. Let’s introduce a structural hazard by allowing ADDU, ADDIU, MUL, and JAL instructions to write to the register file in the M stage instead of waiting until the W stage. We would need to add another writeback mux in the W stage and carefully handle bypassing. Using pipeline diagrams to illustrate structural hazards We use structual dependency arrows to illustrate structual hazards on pipeline diagrams. addiu r1, r2, 1 addiu r3, r4, 1 lw r5, 0(r6) addiu r7, r8, 1 Approaches to resolving structural hazards • Software Scheduling: Programmer or compiler explicitly avoids scheduling instructions that would create structural hazards • Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished using the shared resource • Hardware Duplication: Add more hardware so that each instruction can access separate resources at the same time 34 4 Structual Hazards 4.1. Software Scheduling 4.1. Software Scheduling Insert independent instructions or nops to delay an ADDU, ADDIU, MUL, or JAL instructions if they follow a LW instruction. addiu addiu lw nop addiu r1, r2, 1 r3, r4, 1 r5, 0(r6) r7, r8, 1 Pipeline diagram showing software scheduling for structural hazards addiu r1, r2, 1 addiu r3, r4, 1 lw r5, 0(r6) nop addiu r7, r8, 1 35 4 Structual Hazards 4.2. Hardware Stalling 4.2. Hardware Stalling Hardware includes control logic that freezes an ADDU, ADDIU, MUL, or JAL instruction if a LW instruction is ahead in the pipeline. Pipeline diagram showing hardware stalling for structural hazards addiu r1, r2, 1 addiu r3, r4, 1 lw r5, 0(r6) addiu r7, r8, 1 Deriving the stall signal stall_wport_hazard_D = (op_X == lw) && ( (op_D == ADDU) || (op_D == ADDIU) || (op_D == MUL) || (op_D == JAL) ) stall = stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D || stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D || stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D || stall_wport_hazard_D Notice that we stall far before the point when the structural hazard actually occurs, because we know exactly how instructions move down the pipeline. Also possible to use dynamic arbitration in the back of the pipeline. 36 4 Structual Hazards 4.3. Hardware Duplication 4.3. Hardware Duplication Add a second write port so that an ADDU, ADDIU, MUL, or JAL instruction can writeback to the register file at the same time as a LW. Stall, Stall Bypass, & Stall Bypass & Squash Signals Signal Signals cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F ir[15:0] br_tgen ir_FD X Stage M Stage W Stage btarg_DX j_tgen +4 reg_ en_F cs_MW ir[25:0] pc_plus4 _FD always pc pc_plus4 _sel_F cs_XM reg_ en_D nop op0_DX mul ir[20:16] ir[15:0] insert nop _nop_F insert _nop_D wb_sel_X op1_DX sext wb_sel_M result _XM ir[25:21] regfile (read) rf _waddr_W rf _wen_W result _MW op0_byp _sel_D regfile (write) alu_ func_X rf _wen1_W rf _waddr1_W alu sd_DX op1 op1_ byp_ _sel_D sel_D eq_X sd_XM bypass_from_X bypass_from_M bypass_from_W imreq addr imresp data dmreq data dmreq dmresp addr addr Does allowing early writeback help performance in the first place? addiu r1, r2, 1 addiu r3, r1, 1 addiu r4, r3, 1 addiu r5, r4, 1 addiu r6, r5, 1 addiu r7, r6, 1 37 5 WAW and WAR Name Hazards 5. WAW and WAR Name Hazards WAW dependencies occur when an instruction overwrites a register than an earlier instruction has already written. WAR dependencies occur when an instruction writes a register than an earlier instruction needs to read. We use architectural dependency arrows to illustrate WAW and WAR dependencies in assembly code sequences. mul r1, r2, r3 addiu r4, r1, 1 addiu r1, r5, 1 WAW name hazards occur when an instruction in the pipeline writes a register before an earlier instruction (in back of the pipeline) has had a chance to write that same register. WAR name hazards occur when an instruction in the pipeline writes a register before an earlier instuction (in back of pipeline) has had a chance to read that same register. The PARCv1 processor pipeline is specifically designed to avoid any WAW or WAR name hazards. Instructions always write the registerfile in-order in the same stage, and instructions always read registers in the front of the pipeline and write registers in the back of the pipeline. Let’s introduce a WAW name hazard by using an iterative variable latency multiplier, and allowing other instructions to continue executing while the multiplier is working. 38 5 WAW and WAR Name Hazards 5.1. Software Renaming Using pipeline diagrams to illustrate WAW name hazards We use microarchitectural dependency arrows to illustrate WAW hazards on pipeline diagrams. mul r1, r2, r3 addiu r4, r1, 1 addiu r1, r5, 1 Approaches to resolving structural hazards • Software Renaming: Programmer or compiler changes the register names to avoid creating name hazards • Hardware Renaming: Hardware dynamically changes the register names to avoid creating name hazards • Hardware Stalling: Hardware includes control logic that freezes later instructions until earlier instruction has finished either writing or reading the problematic register name 5.1. Software Renaming As long as we have enough architectural registers, renaming registers in software is easy. WAW and WAR dependencies occur because we have a finite number of architectural registers. mul r1, r2, r3 addiu r4, r1, 1 addiu r6, r5, 1 39 5 WAW and WAR Name Hazards 5.2. Hardware Stalling 5.2. Hardware Stalling Simplest approach is to add stall logic in the decode stage similar to what the approach used to resolve other hazards. mul r1, r2, r3 addiu r4, r1, 1 addiu r1, r5, 1 Deriving the stall signal stall_struct_hazard_D = (op_D == MUL) && !imul_rdy_D stall_waw_hazard_D = wen_D && wen_Z && (waddr_D == waddr_Z) && (waddr_Z != 0) stall = || || || || stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D stall_struct_hazard_D stall_waw_hazard_D 40 6 Analyzing Processor Performance 6. Analyzing Processor Performance Time Instructions Cycles Time = × × Program Program Instruction Cycles Estimating cycle time Stall, Stall Bypass, & Stall Bypass & Squash Signals Signal Signals cs_DX Control Signal Table F Stage D Stage br_targ jr j_targ pc_plus4 pc_F X Stage j_tgen reg_ en_D nop wb_sel_X op0_DX br_tgen mul ir[20:16] ir[15:0] regfile (read) op1_DX sext wb_sel_M result _XM ir[25:21] insert nop _nop_F insert _nop_D regfile (write) alu_ func_X alu sd_DX eq_X op1 op1_ byp_ _sel_D sel_D sd_XM bypass_from_X bypass_from_M bypass_from_W imreq addr • • • • • • W Stage rf _waddr_W rf _wen_W result _MW op0_byp _sel_D ir[15:0] ir_FD M Stage btarg_DX +4 reg_ en_F cs_MW ir[25:0] pc_plus4 _FD always pc pc_plus4 _sel_F cs_XM imresp data register read register write regfile read regfile write memory read memory write dmreq data = 1τ = 1τ = 10τ = 10τ = 20τ = 20τ • • • • • • • +4 unit sext unit br_tgen j_tgen mux multiplier alu 41 = 4τ = 1τ = 8τ = 1τ = 3τ = 20τ = 10τ dmreq dmresp addr addr 6 Analyzing Processor Performance Estimating execution time Using our first-order equation for processor performance, how long in τ will it take to execute the vvadd example assuming n is 64? loop: lw lw addu sw addiu addiu addiu addiu bne jr r12, r13, r14, r14, r4, r5, r6, r7, r7, r31 0(r4) 0(r5) r12, r13 0(r6) r4, 4 r5, 4 r6, 4 r7, -1 r0, loop lw lw addu sw addiu addiu addiu addiu bne opA opB lw 42 6 Analyzing Processor Performance Using our first-order equation for processor performance, how long in τ will it take to execute the mystery program assuming n is 64 and that we find a match on the last element. addiu loop: lw bne addiu jr foo: addiu addiu bne addiu jr r12, r0, 0 r13, 0(r4) r13, r6, foo r2, r12, 0 r31 r4, r12, r12, r2, r31 r4, 4 r12, 1 r5, loop r0, -1 43 7 Case Study: MIPS R2K 7. Case Study: MIPS R2K MIPSI MIPSII • MIPS R2K is one of the first popular pipelined RISC processors MIPSIII • MIPS R2K implements the MIPS I instruction set MIPSIV • MIPS = Microprocessor without Interlocked Pipeline Stages MIPS64 MIPS32 MIPS16 • MIPS I used software scheduling to avoid some RAW hazards by including a single-instruction load-use delay slot • MIPS I used software scheduling to avoid some control hazards by including a single-instruction branch delay slot One-Instr Load-Use Delay Slot One-Instr Branch Delay Slot addiu j addiu ... foo: addiu bne addiu ... bar: r1, r2, 1 foo r3, r4, 1 lw lw addiu addu # BDS r5, r6, 1 r7, r8, bar r9, r10, 1 # BDS r1, r3, r2, r5, 0(r2) 0(r4) r2, 4 # LDS r1, r3 Deprecated in MIPS II instruction set; legacy code can still execute on new microarchitectures, but code using the MIPS II instruction set can rely in hardware stalling Present in all MIPS instruction sets; not possible to depricate and still enable legacy code to execute on new microarchitectures 44 7 Case Study: MIPS R2K MIPS R2K Microarchitecture The pipelined datapath and control were located on a single die. Cache control and memory management unit were also integrated on-die, but the actual tag and data storage for the cache was located off-chip. Control Unit Datapath I$ Controller D$ Controller I$ (4-32KB) D$ (4-32KB) Used two-phase clocking to enable five pipeline stages to fit into four clock cycles. This avoided the need for explicit bypassing from the W 1.2 The MIPS Five-Stage Pipeline 5 stage to the end of the D stage. IF from I-cache Instruction 1 Instruction sequence Instruction 2 RD from register file IF Instruction 3 MEM from D-cache ALU RD ALU IF WB to register file MEM RD ALU WB MEM WB Time 1.2 MIPS five-stage pipeline. Two-phaseFIGURE clocking enabled a single-cycle branch resolution latency since register read, branch generation, and branch comparison fetch dataaddress from the cache; is a relatively rare event and we can just27 1.5 a cache MIPSmiss Compared with CISC Architectures stop the CPU when it happens (though cleverer CPUs find more useful things can fit in a single cycle.to do). Branch instruction IF Branch delay Branch target 1.2 The MIPS architecture was planned with separate instruction and data caches, so it can fetch an instruction and read or write a memory variable simultaneously.Branch RD MEM CISC address architectures have caches too, butWB they’re most often afterthoughts, fitted in as a feature of the memory system. A RISC architecture makes more sense if you regard the caches as very much part of the CPU and tied firmly into the pipeline. IF RD ALU MEM WB The MIPS Five-Stage Pipeline IF RD ALU MEM WB The MIPS architecture is made for pipelining, and Figure 1.2 is close to the earliest MIPS CPUs and typical of many. So long as the CPU runs from the cache, the execution of every MIPS instruction is divided into five phases, called pipestages, with each pipestage taking a fixed amount of time. The fixed amount FIGURE 1.3 The pipeline and branch delays. of time is usually a processor clock cycle (though some actions take only half a clock, so the MIPS five-stage pipeline actually occupies only four clock cycles).on a MIPS CPU without using some registers). For a program running any systemare thatrigidly takes defined interrupts traps, values these registers Allin instructions so or they can the follow the of same sequence may change any time, so you’d better not use at them. of pipestages, even at where the instruction does nothing some stage. The net 45 7 Case Study: MIPS R2K MIPS R2K VLSI Design Process: 2 µm, two metal layers Clock Frequency: 8–15 MHz Size: 110K transistors, 80 mm2 46