ECE 4750 Computer Architecture, Fall 2014 T03 Pipelined Processors

Transcription

ECE 4750 Computer Architecture, Fall 2014 T03 Pipelined Processors
ECE 4750 Computer Architecture, Fall 2014
T03 Pipelined Processors
School of Electrical and Computer Engineering
Cornell University
revision: 2014-09-24-12-26
1
Pipelined Processors
2
RAW Data Hazards
3
10
2.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3. Hardware Bypassing . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4. RAW Data Hazards Through Memory . . . . . . . . . . . . . . . . . 20
3
Control Hazards
22
3.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3. Hardware Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4. Interrupts and Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 29
4
Structual Hazards
34
4.1. Software Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3. Hardware Duplication . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5
WAW and WAR Name Hazards
38
5.1. Software Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6
Analyzing Processor Performance
41
7
Case Study: MIPS R2K
44
2
1
Pipelined Processors
1. Pipelined Processors
Instructions
Cycles
Time
Time
=
×
×
Program
Program
Instruction Cycles
• Instructions / program depends on source code, compiler, ISA
• Cycles / instruction (CPI) depends on ISA, microarchitecture
• Time / cycle depends upon microarchitecture and implementation
Microarchitecture
Topic 01 Single-Cycle Processor
Topic 02 FSM Processor
this topic → Pipelined Processor
CPI
Cycle Time
1
>1
≈1
long
short
short
Technology Constraints
Control Unit
• Assume modern technology
where logic is cheap and fast
(e.g., fast integer ALU)
Control
Status
Datapath
• Assume multi-ported register
files with a reasonable
number of ports are feasible
• Assume small amount of very
fast memory (caches) backed
by large, slower memory
regfile
imreq
Memory
3
imresp
dmreq
<1 cycle
combinational
dmresp
1
Pipelined Processors
Pipelining laundry analogy
• Anne, Brian, Cathy, and Dave each have one load of clothes
• Washing, drying, folding, and storing each take 30 minutes
Sequential Laundry
7pm
8pm
9pm
10pm
11pm
12am
1am
2am
3am
Anne's
Load
Ben's
Load
Cathy's
Load
Dave's
Load
Pipelined Laundry
7pm
8pm
9pm
Pipelined Laundry with Slow Dryers
10pm
7pm
Anne's
Load
Anne's
Load
Ben's
Load
Ben's
Load
Cathy's
Load
Cathy's
Load
Dave's
Load
Dave's
Load
8pm
9pm
10pm
11pm
Pipelining lessons
•
•
•
•
•
•
Multiple transactions operate simultaneously using different resources
Pipelining does not help the transaction latency
Pipelining does help the transaction throughput
Potential speedup is proportional to the number of pipeline stages
Potential speedup is limited by the slowest pipeline stage
Potential speedup is reduced by time to fill the pipeline
4
12am
1
Pipelined Processors
Visualizing space, time, and transactions with pipeline diagrams
5
1
Pipelined Processors
Comparing single-cycle, FSM, and pipelined processors
addu
addiu
mul
lw
sw
j
jal
jr
bne
Fetch Instruction
3
3
3
3
3
3
3
3
3
Decode Instruction
3
3
3
3
3
3
3
3
3
Read Registers
3
3
3
3
3
3
3
Register Arithmetic
3
3
3
3
3
Read Memory
3
3
Write Memory
3
Write Registers
3
3
3
3
Update PC
3
3
3
3
3
3
3
3
Single-Cycle Processor
Fetch
Inst
Decode
Inst
Reg
Arith
lw
Read
Reg
Update
PC
Read
Mem
Write
Reg
Fetch
Inst
Decode
Inst
Reg
Arith
addu
Read
Reg
Update
PC
Write
Reg
Fetch
Inst
Decode
Inst
j
FSM Processor
Fetch
Inst
Decode
Inst
Reg
Arith
lw
Read
Reg
Update
PC
Read
Mem
Write
Reg
Fetch
Inst
Decode
Inst
Reg
Arith
addu
Read
Reg
Update
PC
Write
Reg
j
Pipelined Processor
Fetch
Inst
Decode
Inst
Reg
Arith
Read
Mem
Write
Reg
lw
Read
Reg
Update
PC
Fetch
Inst
Decode
Inst
Reg
Arith
Write
Reg
addu
Read
Reg
Update
PC
Fetch
Inst
Decode
Inst
j
6
Fetch
Inst
Update
PC
Decode
Inst
Update
PC
Update
PC
3
3
1
Pipelined Processors
Pipelining a PARCv1 processor
•
•
•
•
•
•
•
Incrementally develop an unpipelined datapath
Keep data flowing from left to right
Position control signal table early in the diagram
Divided datapath/control into stages by inserting pipeline registers
Keep the pipeline stages roughly balanced
Forward arrows should avoid “skipping” pipeline registers
Backward arrows will need careful consideration
Control Signal
Table
br_targ
jr
j_targ
pc_plus4
pc_F
pc
_sel_F
ir[25:0]
j_tgen
ir[15:0]
wb_sel_X
br_tgen
+4
wb_sel_M
mul
rf
_waddr_W
rf
_wen_W
ir[25:21]
ir[20:16]
ir[15:0]
regfile
(read)
regfile
(write)
alu_
func_X
sext
alu
eq_X
op1
_sel_D
imreq
addr
imresp
data
dmreq
data
7
dmreq dmresp
addr
addr
1
Pipelined Processors
Adding a new auto-incrementing load instruction
Draw on the above datapath diagram what paths we need to use as well
as any new paths we will need to add in order to implement the
following auto-incrementing load instruction.
lw.ai rt, imm(rs)
R[rt] ← M[ R[rs] + sext(imm) ]; R[rs] ← R[rs] + 4
Control Signal
Table
F Stage
always
pc
pc_plus4
_sel_F
cs_XM
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
cs_DX
X Stage
M Stage
W Stage
btarg_DX
ir[25:0]
j_tgen
pc_plus4
_FD
ir[15:0]
br_tgen
+4
wb_sel_X
op0_DX
mul
ir[20:16]
ir[15:0]
regfile
(read)
op1_DX
sext
rf
_waddr_W
rf
_wen_W
result
_MW
wb_sel_M
result
_XM
ir[25:21]
ir_FD
regfile
(write)
alu_
func_X
alu
sd_DX
op1
_sel_D
imreq
addr
cs_MW
imresp
data
eq_X
sd_XM
dmreq
data
8
dmreq dmresp
addr
addr
1
Pipelined Processors
Pipeline diagrams
addiu r1, r2, 1
addiu r3, r4, 1
F
D
X
M
W
F
D
X
M
W
F
D
X
M
addiu r5, r6, 1
W
addiu r1, r2, 1
addiu r3, r4, 1
addiu r5, r6, 1
What would be the total execution time if these three instructions were
repeated 10 times?
Hazards occur when instructions interact with each other in pipeline
• RAW Data Hazards: An instruction depends on a data value
produced by an earlier instruction
• Control Hazards: Whether or not an instruction should be executed
depends on a control decision made by an earlier instruction
• Structural Hazards: An instruction in the pipeline needs a resource
being used by another instruction in the pipeline
• WAW and WAR Name Hazards: An instruction in the pipeline is
writing a register that an earlier instruction in the pipeline is either
writing or reading
9
2
RAW Data Hazards
2. RAW Data Hazards
RAW data hazards occur when one instruction depends on a data value
produced by a preceding instruction still in the pipeline. We use
architectural dependency arrows to illustrate RAW dependencies in
assembly code sequences.
addiu r1, r2, 1
addiu r3, r1, 1
addiu r4, r3, 1
Using pipeline diagrams to illustrate RAW hazards
We use microarchitectural dependency arrows to illustrate RAW
hazards on pipeline diagrams.
addiu r1, r2, 1
addiu r3, r1, 1
F
D
X
M
W
F
D
X
M
W
F
D
X
M
addiu r4, r3, 1
addiu r1, r2, 1
addiu r3, r1, 1
addiu r4, r3, 1
10
W
2
RAW Data Hazards
Approaches to resolving data hazards
• Software Scheduling: Programmer or compiler explicitly avoids
scheduling instructions that would create data hazards
• Hardware Scheduling: Hardware dynamically schedules
instructions to avoid RAW hazards, potentially allowing
instructions to execute out of order
• Hardware Stalling: Hardware includes control logic that freezes
later instructions until earlier instruction has finished producing
data value
• Hardware Bypassing: Hardware allows values to be sent from an
earlier instruction to a later instruction before the earlier instruction
has left the pipeline
• Hardware Speculation: Hardware guesses that there is no hazard
and allows later instructions to potentially read invalid data; detects
when there is a problem and re-executes instructions that operated
on invalid data
11
2
RAW Data Hazards
2.1. Software Scheduling
2.1. Software Scheduling
Insert nops to delay read of earlier
write. These nops count as real
instructions increasing
instructions per program.
Insert independent instructions to
delay read of earlier write, and
only use nops if there is not
enough useful work
addiu
addiu
addiu
nop
addiu
nop
nop
nop
addiu
addiu r1, r2, 1
nop
nop
nop
addiu r3, r1, 1
nop
nop
nop
addiu r4, r3, 1
r1, r2, 1
r6, r7, 1
r8, r9, 1
r3, r1, 1
r4, r3, 1
Pipeline diagram showing software scheduling for RAW data hazards
addiu r1, r2, 1
addiu r6, r7, 1
addiu r8, r9, 1
nop
addiu r3, r1, 1
nop
nop
nop
addiu r4, r3, 1
12
2
RAW Data Hazards
2.2. Hardware Stalling
2.2. Hardware Stalling
Hardware includes control logic that freezes later instructions (in front
of pipeline) until earlier instruction (in back of pipeline) has finished
producing data value.
Pipeline diagram showing hardware stalling for RAW data hazards
addiu r1, r2, 1
F
D
X
M
W
F
addiu r3, r1, 1
addiu r1, r2, 1
addiu r3, r1, 1
addiu r4, r3, 1
13
D
2
RAW Data Hazards
2.2. Hardware Stalling
Modifications to datapath/control to support hardware stalling
Stall
Signal
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
always
pc
pc_plus4
_sel_F
X Stage
M Stage
W Stage
btarg_DX
j_tgen
pc_plus4
_FD
ir[15:0]
ir_FD
reg_
en_D
wb_sel_X
op0_DX
br_tgen
mul
ir[15:0]
regfile
(read)
op1_DX
sext
rf
_waddr_W
rf
_wen_W
result
_MW
wb_sel_M
result
_XM
ir[25:21]
ir[20:16]
nop
insert
_nop_D
imreq
addr
cs_MW
ir[25:0]
+4
reg_
en_F
cs_XM
regfile
(write)
alu_
func_X
alu
sd_DX
op1
_sel_D
eq_X
imresp
data
sd_XM
dmreq
data
dmreq dmresp
addr
addr
Deriving the stall signal
ren0
raddr0
ren1
addu
addiu
mul
lw
sw
j
jal
jr
bne
14
raddr1
wen
waddr
2
RAW Data Hazards
2.2. Hardware Stalling
stall_waddr_X_raddr0_D =
ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0)
stall_waddr_M_raddr0_D =
ren0_D && wen_M && (raddr0_D == waddr_M) && (waddr_M != 0)
stall_waddr_W_raddr0_D =
ren0_D && wen_W && (raddr0_D == waddr_W) && (waddr_W != 0)
stall_waddr_X_raddr1_D =
ren1_D && wen_X && (raddr1_D == waddr_X) && (waddr_X != 0)
stall_waddr_M_raddr1_D =
ren1_D && wen_M && (raddr1_D == waddr_M) && (waddr_M != 0)
stall_waddr_W_raddr1_D =
ren1_D && wen_W && (raddr1_D == waddr_W) && (waddr_W != 0)
stall =
stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D
|| stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D
|| stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D
reg_en_F
= !stall
reg_en_D
= !stall
insert_nop_D = stall
15
2
RAW Data Hazards
2.3. Hardware Bypassing
2.3. Hardware Bypassing
Hardware allows values to be sent from an earlier instruction (in back
of pipeline) to a later instruction (in front of pipeline) before the earlier
instruction has left the pipeline. Sometimes called “forwarding”.
Pipeline diagram showing hardware bypassing for RAW data hazards
addiu r1, r2, 1
addiu r3, r1, 1
F
D
X
M
W
F
D
X
M
W
F
D
X
M
addiu r4, r3, 1
addiu r1, r2, 1
addiu r3, r1, 1
addiu r4, r3, 1
16
W
2
RAW Data Hazards
2.3. Hardware Bypassing
Adding single bypass path to support limited hardware bypassing
Stall &
Stall
Bypass
Signals
Signal
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
always
pc
pc_plus4
_sel_F
cs_MW
X Stage
M Stage
W Stage
btarg_DX
ir[25:0]
j_tgen
pc_plus4
_FD
ir[15:0]
br_tgen
+4
reg_
en_F
cs_XM
ir_FD
reg_
en_D
mul
ir[20:16]
ir[15:0]
nop
insert
_nop_D
wb_sel_X
op0_DX
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
regfile
(read)
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
regfile
(write)
alu_
func_X
alu
sd_DX
op1
_sel_D
eq_X
sd_XM
bypass_from_X
imreq
addr
imresp
data
dmreq
data
dmreq dmresp
addr
addr
Deriving the bypass and stall signals
stall_waddr_X_raddr0_D = 0
bypass_waddr_X_raddr0_D =
ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0)
stall =
stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D
|| stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D
|| stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D
reg_en_F
= !stall
reg_en_D
= !stall
insert_nop_D = stall
17
2
RAW Data Hazards
2.3. Hardware Bypassing
Handling load-use RAW dependencies
ALU-use latency is only one cycle, but load-use latency is two cycles.
lw r1, 0(r2)
F
addiu r3, r1, 1
D
X
M
W
F
D
X
M
W
lw r1, 0(r2)
addiu r3, r1, 1
lw r1, 0(r2)
addiu r3, r1, 1
stall_waddr_X_raddr0_D =
ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0)
&& (op_X == lw)
bypass_waddr_X_raddr0_D =
ren0_D && wen_X && (raddr0_D == waddr_X) && (waddr_X != 0)
&& (op_X != lw)
stall =
stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D
|| stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D
|| stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D
reg_en_F
= !stall
reg_en_D
= !stall
insert_nop_D = stall
18
2
RAW Data Hazards
2.3. Hardware Bypassing
Pipeline diagram showing multiple hardware bypass paths
addiu r1, r2, 1
addiu r3, r4, 1
addiu r5, r3, 1
addu
r6, r1, r3
sw
r5, 0(r7)
jr
r6
Adding all bypass path to support full hardware bypassing
Stall &
Stall
Bypass
Signals
Signal
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
always
pc
pc_plus4
_sel_F
cs_MW
X Stage
M Stage
ir[25:0]
j_tgen
pc_plus4
_FD
ir[15:0]
br_tgen
ir_FD
reg_
en_D
wb_sel_X
op0_DX
mul
ir[15:0]
regfile
(read)
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
ir[20:16]
nop
insert
_nop_D
regfile
(write)
alu_
func_X
alu
sd_DX
op1
op1_
byp_ _sel_D
sel_D
eq_X
sd_XM
bypass_from_X
bypass_from_M
bypass_from_W
imreq
addr
W Stage
btarg_DX
+4
reg_
en_F
cs_XM
imresp
data
dmreq
data
19
dmreq dmresp
addr
addr
2
RAW Data Hazards
2.4. RAW Data Hazards Through Memory
2.4. RAW Data Hazards Through Memory
So far we have only studied RAW data hazards through registers, but
we must also carefully consider RAW data hazards through memory.
sw r1, 0(r2)
lw r3, 0(r4) # RAW dependency occurs if R[r2] == R[r4]
Pipeline diagrams showing RAW dependency through memory
sw r1, 0(r2)
lw r3, 0(r4)
F
D
X
M
W
F
D
X
M
sw r1, 0(r2)
lw r3, 0(r4)
20
W
2
RAW Data Hazards
2.4. RAW Data Hazards Through Memory
Pipeline diagram for simple assembly sequence
Draw a pipeline diagram illustrating how the following assembly
sequence would execute on a fully bypassed pipelined PARCv1
processor. Include microarchitectural dependency arrows to illustrate
how data is transferred along various bypass paths.
lw r1, 0(r2)
lw r3, 0(r4)
addu r5, r1, r3
sw r5, 0(r6)
addiu r2, r2, 4
addiu r4, r4, 4
addiu r6, r6, 4
addiu r7, r7, -1
bne r7, r0, loop
21
3
Control Hazards
3. Control Hazards
Control hazards occur when whether or not an instruction should be
executed depends on a control decision made by an earlier instruction
We use architectural dependency arrows to illustrate control
dependencies in assembly code sequences.
foo:
bar:
addiu
j
addiu
addiu
bne
addiu
addiu
r1,
foo
r3,
r5,
r0,
r7,
r9,
r2, 1
# Jumps are always taken
r4, 1
r6, 1
r0, bar
r8, 1
r10, 1
# This branch is not taken
Using pipeline diagrams to illustrate control hazards
We use microarchitectural dependency arrows to illustrate control
hazards on pipeline diagrams. Jump resolution latency is two cycles,
and branch resolution latency is three cycles.
j foo
F
addiu r5, r6, 1
bne r0, r0, bar
addiu r7, r8, 1
F
D
X
M
W
F
D
X
M
D
X
M
W
F
D
X
M
22
W
W
3
Control Hazards
addiu r1, r2, 1
j foo
addiu r5, r6, 1
bne r0, r0, bar
addiu r7, r8, 1
addiu r9, r10, 1
Approaches to resolving control hazards
• Software Scheduling: Programmer or compiler explicitly avoids
scheduling instructions that would create control hazards
• Software Predication: Programmer or compiler converts control
flow into data flow by using instructions that conditionally execute
based on a data value
• Hardware Stalling: Hardware includes control logic that freezes
later instructions until earlier instruction have finished determining
the correct control flow
• Hardware Speculation: Hardware guesses which way the control
flow will go and potentially fetches incorrect instructions; detects
when there is a problem and re-executes instructions the instructions
that are along the correct control flow
• Software Hints: Programmer or compiler provides hints about
whether a conditional branch will be taken or not taken, and
hardware can use these hints for more efficient hardware speculation
23
3
Control Hazards
3.1. Software Scheduling
3.1. Software Scheduling
Expose branch delay slots as part of the instruction set. Branch delay
slots are instructions that follow a jump or branch and are always
executed regardless of whether a jump or branch is taken or not taken.
Compiler tries to insert useful instructions, otherwise inserts nops.
foo:
bar:
addiu
j
nop
addiu
addiu
bne
nop
nop
addiu
addiu
r1, r2, 1
foo
r3, r4, 1
r5, r6, 1
r0, r0, bar
r7, r8, 1
r9, r10, 1
Assume we modify the PARCv1
instruction set to specify that J,
JAL, and JR instructions have a
single-instruction branch delay
slot (i.e., one instruction after a J,
JAL, and JR is always executed)
and the BNE instruction has a
two-instruction branch delay slot
(i.e., two instructions after a BNE
are always executed).
Pipeline diagram showing software scheduling for control hazards
addiu r1, r2, 1
j foo
nop
addiu r5, r6, 1
bne r0, r0, bar
nop
nop
addiu r7, r8, 1
addiu r9, r10, 1
24
3
Control Hazards
3.2. Hardware Stalling
3.2. Hardware Stalling
Although we could stall to resolve control hazards, the issue is that
every instruction essentially creates a control hazard until we know if
that instruction is a jump, branch, or other instruction. Stalling will
seriously degrade the performance of the common case when we have a
sequence of instructions that are not control flow instructions.
addiu r1, r2, 1
j foo
addiu r5, r6, 1
bne r0, r0, bar
addiu r7, r8, 1
addiu r9, r10, 1
25
3
Control Hazards
3.3. Hardware Speculation
3.3. Hardware Speculation
Hardware guesses which way the control flow will go and potentially
fetches incorrect instructions; detects when there is a problem and
re-executes instructions the instructions that are along the correct
control flow. For now, we will only consider a simple branch prediction
scheme where the hardware always predicts not taken.
Pipeline diagram when branch is not taken
addiu r1, r2, 1
j foo
addiu r3, r4, 1
addiu r5, r6, 1
bne r0, r0, bar
addiu r7, r8, 1
addiu r9, r10, 1
Pipeline diagram when branch is taken
addiu r1, r2, 1
j foo
addiu r3, r4, 1
addiu r5, r6, 1
bne r0, r0, bar
addiu r7, r8, 1
addiu r9, r10, 1
addiu r9, r10, 1
26
3
Control Hazards
3.3. Hardware Speculation
Modifications to datapath/control to support hardware speculation
Stall,
Stall Bypass,
&
Stall
Bypass
&
Squash
Signals
Signal
Signals
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
X Stage
j_tgen
ir[15:0]
W Stage
reg_
en_D
nop
mul
ir[15:0]
insert nop
_nop_F
insert
_nop_D
wb_sel_X
op0_DX
regfile
(read)
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
ir[20:16]
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
br_tgen
ir_FD
M Stage
btarg_DX
+4
reg_
en_F
cs_MW
ir[25:0]
pc_plus4
_FD
always
pc
pc_plus4
_sel_F
cs_XM
regfile
(write)
alu_
func_X
alu
sd_DX
op1
op1_
byp_ _sel_D
sel_D
eq_X
sd_XM
bypass_from_X
bypass_from_M
bypass_from_W
imreq
addr
imresp
data
dmreq
data
dmreq dmresp
addr
addr
Deriving the squash signals
squash_F =
(op_D == j) || (op_D == jal) || (op_D == jr)
|| br_taken_X
squash_D = br_taken_X
reg_en_F
reg_en_D
insert_nop_D
insert_nop_F
= !stall || squash_F
= !stall || squash_D
= stall || squash_D
= squash_F
27
3
Control Hazards
3.3. Hardware Speculation
Pipeline diagram for simple assembly sequence
Draw a pipeline diagram illustrating how the following assembly
sequence would execute on a fully bypassed pipelined PARCv1
processor that uses hardware speculation which always predicts
not-taken. Unlike the “standard” PARCv1 processor, you should also
assume that we add a single-instruction branch delay slot to the
instruction set. So this processor will leverage both software
scheduling and hardware speculation. Include microarchitectural
dependency arrows to illustrate both data and control flow.
addiu
bne
addiu
addiu
...
foo:
addu
addiu
r1,
r0,
r4,
r6,
r2,
r3,
r5,
r7,
1
foo
1
1
# assume R[rs] != 0
# instruction in branch delay slot
r8, r1, r4
r9, r1, 1
28
3
Control Hazards
3.4. Interrupts and Exceptions
3.4. Interrupts and Exceptions
Interrupts and exceptions alter the normal control flow of the program.
They are caused by an external or internal event that needs to be
processed by the system, and these events are usually unexpected or
rare from the program’s point of view.
• Asynchronous Interrupts
– Input/output device needs to be serviced
– Timer has expired
– Power distruption or hardware failure
• Synchronous Exceptions
–
–
–
–
–
–
Undefined opcode, privileged instruction
Arithmetic overflow, floating-point exception
Misaligned memory access for instruction fetch or data access
Memory protection violation
Virtual memory page faults
System calls (traps) to jump into the operating system kernel
Interrupts and Exception Semantics
• Interrupts are asynchronous with respect to the program, so the
microarchitecture can decide when to service the interrupt
• Exceptions are synchronous with respect to the program, so they
must be handled immediately
29
3
Control Hazards
3.4. Interrupts and Exceptions
• To handle an interrupt or exception the hardware/software must:
–
–
–
–
–
–
–
–
Stop program at current instruction (I), ensure previous insts finished
Save cause of interrupt or exception in privileged arch state
Save the PC of the instruction I in a special register (EPC)
Switch to privileged mode
Set the PC to the address of either the interrupt or the exception handler
Disable interrupts
Save the user architectural state
Check the type of interrupt or exception
– Handle the interrupt or exception
–
–
–
–
Enable interrupts
Switch to user mode
Set the PC to EPC if I should be restarted
Potentially set PC to EPC+4 if we should skip I
Handling a misaligned data address and syscall exceptions
Static code sequence
Dynamic code sequence
addiu r1, r0, 0x2001
lw
r2, 0(r1)
syscall
opB
opC
...
exception_hander:
opD
# disable interrupts
opE
# save user registers
opF
# check exception type
opG
# handle exception
opH
# enable interrupts
addiu EPC, EPC, 4
eret
addiu r1, r0, 0x2001
lw
r2, 0(r1) (excep)
opD
opE
opF
opG
opH
addiu EPC, EPC, 4
eret
syscall
(excep)
opD
opE
opF
30
3
Control Hazards
3.4. Interrupts and Exceptions
Interrupts and Exceptions in a PARCv4 Pipelined Processor
F
Inst Address
Exceptions
D
X
Illegal
Instruction
Arithmetic
Overflow
M
W
Data Address
Exceptions
• How should we handle a single instruction which generates
multiple exceptions in different stages as it goes down the pipeline?
– Exceptions in earlier pipeline stages override later exceptions for a given
instruction
• How should we handle multiple instructions generating exceptions
in different stages at the same or different times?
– We always want the execution to appear as if we have completed
executed one instruction before going onto the next instruction
– So we want to process the exception corresponding to the earliest
instruction in program order first
– Hold exception flags in pipeline until commit point
– Commit point is after all exceptions could be generated but before any
architectural state has been updated
– To handle an exception at the commit point: update cause and EPC,
squash all stages before the commit point, and set PC to exception handler
• How and where to handle external asynchronous interrupts?
– Inject asynchronous interrupts at the commit point
– Asynchronous interrupts will then naturally override exceptions caused
by instructions earlier in the pipeline
31
3
Control Hazards
3.4. Interrupts and Exceptions
Modifications to datapath/control to support exceptions
exception_FD
exception_DX
exception_XM
Exception
Handling Logic
Stall &
Stall
Bypass
Signals
Signal
cs_XM
cs_DX
Control Signal
Table
F Stage
D Stage
ehandler
br_targ
jr
j_targ
pc_plus4
pc_F
pc
_sel_F
insert
_nop_X
X Stage
cs_MW
nop
M Stage
insert
_nop_M
btarg_DX
j_tgen
ir[15:0]
ir_FD
wb_sel_X
op0_DX
mul
ir[15:0]
regfile
(read)
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
ir[20:16]
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
br_tgen
reg_
en_D
nop
insert
_nop_D
regfile
(write)
alu_
func_X
alu
sd_DX
op1
op1_
byp_ _sel_D
sel_D
eq_X
sd_XM
bypass_from_X
bypass_from_M
bypass_from_W
imreq
addr
W Stage
ir[25:0]
pc_plus4
_FD
+4
reg_
en_F
nop
imresp
data
dmreq
data
dmreq dmresp
addr
addr
Deriving the squash signals
squash_F =
(op_D == j) || (op_D == jal) || (op_D == jr)
|| br_taken_X || exception_M
squash_D = br_taken_X || exception_M
squash_X = exception_M
squash_M = exception_M
32
3
Control Hazards
3.4. Interrupts and Exceptions
Pipeline diagram of exception handling
33
4
Structual Hazards
4. Structual Hazards
Structural hazards occur when an instruction in the pipeline needs a
resource being used by another instruction in the pipeline. The PARCv1
processor pipeline is specifically designed to avoid any structural
hazards.
Let’s introduce a structural hazard by allowing ADDU, ADDIU, MUL,
and JAL instructions to write to the register file in the M stage instead of
waiting until the W stage. We would need to add another writeback
mux in the W stage and carefully handle bypassing.
Using pipeline diagrams to illustrate structural hazards
We use structual dependency arrows to illustrate structual hazards on
pipeline diagrams.
addiu r1, r2, 1
addiu r3, r4, 1
lw
r5, 0(r6)
addiu r7, r8, 1
Approaches to resolving structural hazards
• Software Scheduling: Programmer or compiler explicitly avoids
scheduling instructions that would create structural hazards
• Hardware Stalling: Hardware includes control logic that freezes
later instructions until earlier instruction has finished using the
shared resource
• Hardware Duplication: Add more hardware so that each instruction
can access separate resources at the same time
34
4
Structual Hazards
4.1. Software Scheduling
4.1. Software Scheduling
Insert independent instructions or nops to delay an ADDU, ADDIU,
MUL, or JAL instructions if they follow a LW instruction.
addiu
addiu
lw
nop
addiu
r1, r2, 1
r3, r4, 1
r5, 0(r6)
r7, r8, 1
Pipeline diagram showing software scheduling for structural hazards
addiu r1, r2, 1
addiu r3, r4, 1
lw r5, 0(r6)
nop
addiu r7, r8, 1
35
4
Structual Hazards
4.2. Hardware Stalling
4.2. Hardware Stalling
Hardware includes control logic that freezes an ADDU, ADDIU, MUL,
or JAL instruction if a LW instruction is ahead in the pipeline.
Pipeline diagram showing hardware stalling for structural hazards
addiu r1, r2, 1
addiu r3, r4, 1
lw r5, 0(r6)
addiu r7, r8, 1
Deriving the stall signal
stall_wport_hazard_D = (op_X == lw)
&& (
(op_D == ADDU) || (op_D == ADDIU)
|| (op_D == MUL) || (op_D == JAL) )
stall =
stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D
|| stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D
|| stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D
|| stall_wport_hazard_D
Notice that we stall far before the point when the structural hazard
actually occurs, because we know exactly how instructions move down
the pipeline. Also possible to use dynamic arbitration in the back of the
pipeline.
36
4
Structual Hazards
4.3. Hardware Duplication
4.3. Hardware Duplication
Add a second write port so that an ADDU, ADDIU, MUL, or JAL
instruction can writeback to the register file at the same time as a LW.
Stall,
Stall Bypass,
&
Stall
Bypass
&
Squash
Signals
Signal
Signals
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
ir[15:0]
br_tgen
ir_FD
X Stage
M Stage
W Stage
btarg_DX
j_tgen
+4
reg_
en_F
cs_MW
ir[25:0]
pc_plus4
_FD
always
pc
pc_plus4
_sel_F
cs_XM
reg_
en_D
nop
op0_DX
mul
ir[20:16]
ir[15:0]
insert nop
_nop_F
insert
_nop_D
wb_sel_X
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
regfile
(read)
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
regfile
(write)
alu_
func_X
rf
_wen1_W
rf
_waddr1_W
alu
sd_DX
op1
op1_
byp_ _sel_D
sel_D
eq_X
sd_XM
bypass_from_X
bypass_from_M
bypass_from_W
imreq
addr
imresp
data
dmreq
data
dmreq dmresp
addr
addr
Does allowing early writeback help performance in the first place?
addiu r1, r2, 1
addiu r3, r1, 1
addiu r4, r3, 1
addiu r5, r4, 1
addiu r6, r5, 1
addiu r7, r6, 1
37
5
WAW and WAR Name Hazards
5. WAW and WAR Name Hazards
WAW dependencies occur when an instruction overwrites a register
than an earlier instruction has already written. WAR dependencies
occur when an instruction writes a register than an earlier instruction
needs to read. We use architectural dependency arrows to illustrate
WAW and WAR dependencies in assembly code sequences.
mul
r1, r2, r3
addiu r4, r1, 1
addiu r1, r5, 1
WAW name hazards occur when an instruction in the pipeline writes a
register before an earlier instruction (in back of the pipeline) has had a
chance to write that same register.
WAR name hazards occur when an instruction in the pipeline writes a
register before an earlier instuction (in back of pipeline) has had a
chance to read that same register.
The PARCv1 processor pipeline is specifically designed to avoid any
WAW or WAR name hazards. Instructions always write the registerfile
in-order in the same stage, and instructions always read registers in the
front of the pipeline and write registers in the back of the pipeline.
Let’s introduce a WAW name hazard by using an iterative variable
latency multiplier, and allowing other instructions to continue
executing while the multiplier is working.
38
5
WAW and WAR Name Hazards
5.1. Software Renaming
Using pipeline diagrams to illustrate WAW name hazards
We use microarchitectural dependency arrows to illustrate WAW
hazards on pipeline diagrams.
mul
r1, r2, r3
addiu r4, r1, 1
addiu r1, r5, 1
Approaches to resolving structural hazards
• Software Renaming: Programmer or compiler changes the register
names to avoid creating name hazards
• Hardware Renaming: Hardware dynamically changes the register
names to avoid creating name hazards
• Hardware Stalling: Hardware includes control logic that freezes
later instructions until earlier instruction has finished either writing
or reading the problematic register name
5.1. Software Renaming
As long as we have enough architectural registers, renaming registers in
software is easy. WAW and WAR dependencies occur because we have
a finite number of architectural registers.
mul
r1, r2, r3
addiu r4, r1, 1
addiu r6, r5, 1
39
5
WAW and WAR Name Hazards
5.2. Hardware Stalling
5.2. Hardware Stalling
Simplest approach is to add stall logic in the decode stage similar to
what the approach used to resolve other hazards.
mul
r1, r2, r3
addiu r4, r1, 1
addiu r1, r5, 1
Deriving the stall signal
stall_struct_hazard_D =
(op_D == MUL) && !imul_rdy_D
stall_waw_hazard_D =
wen_D && wen_Z && (waddr_D == waddr_Z) && (waddr_Z != 0)
stall =
||
||
||
||
stall_waddr_X_raddr0_D || stall_waddr_X_raddr1_D
stall_waddr_M_raddr0_D || stall_waddr_M_raddr1_D
stall_waddr_W_raddr0_D || stall_waddr_W_raddr1_D
stall_struct_hazard_D
stall_waw_hazard_D
40
6
Analyzing Processor Performance
6. Analyzing Processor Performance
Time
Instructions
Cycles
Time
=
×
×
Program
Program
Instruction Cycles
Estimating cycle time
Stall,
Stall Bypass,
&
Stall
Bypass
&
Squash
Signals
Signal
Signals
cs_DX
Control Signal
Table
F Stage
D Stage
br_targ
jr
j_targ
pc_plus4
pc_F
X Stage
j_tgen
reg_
en_D
nop
wb_sel_X
op0_DX
br_tgen
mul
ir[20:16]
ir[15:0]
regfile
(read)
op1_DX
sext
wb_sel_M
result
_XM
ir[25:21]
insert nop
_nop_F
insert
_nop_D
regfile
(write)
alu_
func_X
alu
sd_DX
eq_X
op1
op1_
byp_ _sel_D
sel_D
sd_XM
bypass_from_X
bypass_from_M
bypass_from_W
imreq
addr
•
•
•
•
•
•
W Stage
rf
_waddr_W
rf
_wen_W
result
_MW
op0_byp
_sel_D
ir[15:0]
ir_FD
M Stage
btarg_DX
+4
reg_
en_F
cs_MW
ir[25:0]
pc_plus4
_FD
always
pc
pc_plus4
_sel_F
cs_XM
imresp
data
register read
register write
regfile read
regfile write
memory read
memory write
dmreq
data
= 1τ
= 1τ
= 10τ
= 10τ
= 20τ
= 20τ
•
•
•
•
•
•
•
+4 unit
sext unit
br_tgen
j_tgen
mux
multiplier
alu
41
= 4τ
= 1τ
= 8τ
= 1τ
= 3τ
= 20τ
= 10τ
dmreq dmresp
addr
addr
6
Analyzing Processor Performance
Estimating execution time
Using our first-order equation for processor performance, how long in τ
will it take to execute the vvadd example assuming n is 64?
loop:
lw
lw
addu
sw
addiu
addiu
addiu
addiu
bne
jr
r12,
r13,
r14,
r14,
r4,
r5,
r6,
r7,
r7,
r31
0(r4)
0(r5)
r12, r13
0(r6)
r4, 4
r5, 4
r6, 4
r7, -1
r0, loop
lw
lw
addu
sw
addiu
addiu
addiu
addiu
bne
opA
opB
lw
42
6
Analyzing Processor Performance
Using our first-order equation for processor performance, how long in τ
will it take to execute the mystery program assuming n is 64 and that
we find a match on the last element.
addiu
loop:
lw
bne
addiu
jr
foo:
addiu
addiu
bne
addiu
jr
r12, r0, 0
r13, 0(r4)
r13, r6, foo
r2, r12, 0
r31
r4,
r12,
r12,
r2,
r31
r4, 4
r12, 1
r5, loop
r0, -1
43
7
Case Study: MIPS R2K
7. Case Study: MIPS R2K
MIPSI
MIPSII
• MIPS R2K is one of the first popular
pipelined RISC processors
MIPSIII
• MIPS R2K implements the MIPS I
instruction set
MIPSIV
• MIPS = Microprocessor without
Interlocked Pipeline Stages
MIPS64
MIPS32
MIPS16
• MIPS I used software scheduling to avoid some RAW hazards by
including a single-instruction load-use delay slot
• MIPS I used software scheduling to avoid some control hazards by
including a single-instruction branch delay slot
One-Instr Load-Use Delay Slot
One-Instr Branch Delay Slot
addiu
j
addiu
...
foo:
addiu
bne
addiu
...
bar:
r1, r2, 1
foo
r3, r4, 1
lw
lw
addiu
addu
# BDS
r5, r6, 1
r7, r8, bar
r9, r10, 1 # BDS
r1,
r3,
r2,
r5,
0(r2)
0(r4)
r2, 4 # LDS
r1, r3
Deprecated in MIPS II instruction
set; legacy code can still execute
on new microarchitectures, but
code using the MIPS II instruction
set can rely in hardware stalling
Present in all MIPS instruction
sets; not possible to depricate and
still enable legacy code to execute
on new microarchitectures
44
7
Case Study: MIPS R2K
MIPS R2K Microarchitecture
The pipelined datapath and control
were located on a single die. Cache
control and memory management unit
were also integrated on-die, but the
actual tag and data storage for the
cache was located off-chip.
Control Unit
Datapath
I$ Controller
D$ Controller
I$ (4-32KB)
D$ (4-32KB)
Used two-phase clocking to enable five pipeline stages to fit into four
clock cycles. This avoided the need for explicit bypassing from the W
1.2 The MIPS Five-Stage Pipeline
5
stage to the end of the D stage.
IF
from
I-cache
Instruction 1
Instruction sequence
Instruction 2
RD
from
register
file
IF
Instruction 3
MEM
from
D-cache
ALU
RD
ALU
IF
WB
to
register
file
MEM
RD
ALU
WB
MEM
WB
Time
1.2 MIPS five-stage pipeline.
Two-phaseFIGURE
clocking
enabled a single-cycle branch resolution latency
since register read, branch
generation,
and
branch
comparison
fetch dataaddress
from the cache;
is a relatively
rare event
and we can
just27
1.5 a cache
MIPSmiss
Compared
with CISC
Architectures
stop the CPU when it happens (though cleverer CPUs find more useful things
can fit in a single cycle.to do).
Branch
instruction
IF
Branch
delay
Branch
target
1.2
The MIPS architecture was planned with separate instruction and data
caches, so it can fetch an instruction and read or write a memory variable simultaneously.Branch
RD
MEM
CISC address
architectures have caches
too, butWB
they’re most often afterthoughts,
fitted in as a feature of the memory system. A RISC architecture makes more
sense if you regard the caches as very much part of the CPU and tied firmly into
the pipeline.
IF
RD
ALU
MEM
WB
The MIPS Five-Stage
Pipeline
IF
RD
ALU
MEM
WB
The MIPS architecture is made for pipelining, and Figure 1.2 is close to the
earliest MIPS CPUs and typical of many. So long as the CPU runs from the
cache, the execution of every MIPS instruction is divided into five phases, called
pipestages, with each pipestage taking a fixed amount of time. The fixed amount
FIGURE 1.3 The pipeline and branch delays.
of time is usually a processor clock cycle (though some actions take only half
a clock, so the MIPS five-stage pipeline actually occupies only four clock
cycles).on a MIPS CPU without using some registers). For a program running
any systemare
thatrigidly
takes defined
interrupts
traps,
values
these
registers
Allin
instructions
so or
they
can the
follow
the of
same
sequence
may change
any time,
so you’d better
not use at
them.
of pipestages,
even at
where
the instruction
does nothing
some stage. The net
45
7
Case Study: MIPS R2K
MIPS R2K VLSI Design
Process: 2 µm, two metal layers
Clock Frequency: 8–15 MHz
Size: 110K transistors, 80 mm2
46

Documents pareils