University of North Texas Data Hazards Computer Science Questions
Description
Problems C1, C2, C5, Textbook (modified)
1 (C.1.). Use the following code fragment:
Loop: ld x1,0(x2) ;load x1 from address 0+x2
ld x3, 0(x4);load x3 from address 0+x4
add x1,x1,x3 ;x1=x1+x3
sd x1,0,(x2) ;store x1 at address 0+x2
addi x2,x2,4 ;x2=x2+4
addi x4, x4, 4;x4=x4+4
sub x6,x5,x2 ;x6=x5-x2
bgt x6, Loop ;branch to Loop if x6 > 0
Assume that the initial value of x5 is x2 + 396 (loop is repeated 99 times)
a. Data hazards are caused by data dependences in the code. Whether a dependency causes a hazard depends on the machine implementation (i.e., number of pipeline stages). List all of the data dependences in the code above. Record the register, source instruction, and destination instruction; for example, there is a data dependency for register x1 from the ld to the add.
b. Show the timing of this instruction sequence for the 5-stage RISC pipeline without any forwarding or bypassing hardware but assuming that a register read and a write in the same clock cycle (for example, when an instruction writes back result to a register in cycle n, another instruction read the register in the same cycle n) as shown in Figure C.5. Use a pipeline timing chart like that in Figure C.8. Assume that the branch is handled by flushing the pipeline (that is, you lose 2 cycles regardless of the branch decision). If all memory references take 1 cycle, how many cycles does this loop take to execute?
c. Show the timing of this instruction sequence for the 5-stage RISC pipeline with full forwarding and bypassing hardware. Use a pipeline timing chart like that shown in Figure C.8. Assume that the branch is handled by predicting it as not taken (now if branch is not taken, we do not lose any cycles but if branch is taken, we lose 2 cycles -this is similar what I have shown in class). If all memory references take 1 cycle, how many cycles does this loop take to execute?
d. High-performance processors have very deep pipelines—more than 15 stages. Imagine that you have a 10-stage pipeline in which every stage of the 5-stage pipeline has been split in two. The only catch is that, for data forwarding, data are forwarded from the end of a pair of stages to the beginning of the two stages where they are needed. For example, data are forwarded from the output of the second execute stage to the input of the first execute stage, still causing a 1-cycle delay. Show the timing of this instruction sequence for the 10-stage RISC pipeline with full forwarding and bypassing hardware. Use a pipeline timing chart like that shown in Figure C.8 (but with stages labeled IF1, IF2, ID1, etc.). Assume that the branch is handled by predicting it as not taken ( that is, if branch is not taken, we do not lose any cycles but if branch is taken, we lose 4 cycles). How many cycles does this loop take to execute?
e. Assume that in the 5-stage pipeline, the longest stage requires 0.8 ns, and the pipeline register delay is 0.1 ns. What is the clock cycle time of the 5-stage pipeline? If the 10-stage pipeline splits all stages in half, what is the cycle time of the 10-stage machine?
Figure C.5 A set of instructions that depends on the add result uses forwarding paths to avoid the data hazard. The inputs for the sub and and instructions forward from the pipeline registers to the first ALU input. The or receives its result by forwarding through the register file, which is easily accomplished by reading the registers in the second half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline register or from different pipeline registers. This would occur, for example, if the and instruction was and x6,x1,x4.
Figure C.6 Forwarding of operand required by stores during MEM. The result of the load is forwarded from the memory output to the memory input to be stored. In addition, the ALU output is forwarded to the ALU input for the address calculation of both the load and the store (this is no different than forwarding to another ALU operation). If the store depended on an immediately preceding ALU operation (not shown herein), the result would need to be forwarded to prevent a stall.
Figure C.8 In the top half, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the sub, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom half.
2 (C-2). Suppose the branch frequencies (as percentages of all instructions) are as follows:
Conditional branches 20%
Jumps and calls 5%
Taken conditional branches 75% are taken
a. We are examining a five-stage pipeline where the branch is resolved at the end of second cycle for unconditional branches and at the end of third cycle (Execute) for conditional branches. Show how many cycles are lost for unconditional branches; when a conditional branch is taken and when a conditional branch is not taken. Given the data in the table, how many cycles are lost due all branch instructions?
b. Now assume a high-performance processor in which we have a 15-deep pipeline where the branch is resolved at the end of the fifth cycle for unconditional branches and at the end of the tenth cycle for conditional branches. Show how many cycles are lost for unconditional branches, when a conditional branch is taken and when a conditional branch is not taken. Given the data in the table, how many cycles are lost due all branch instructions?
3 (C-5). For these problems, we will explore a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. There is a single-memory addressing mode (offset + base register). The ALU operations can of two types:
ALUop Rd, Rs1, Rs2 this is similar to RISC V instructions
or
ALUop Rd, Rs1, disp(R2) here we add the contents at the memory location to Rs1
the result is stored in Rd
Memory address is computed using (Rs2) + disp
Branches compare of two registers and are PC relative (add the constant in the instruction to PC). We will still have RISC V like load and store instructions.
Assume that this machine is pipelined so that a new instruction is started every clock cycle. The pipeline structure, similar to that used in the VAX 8700(Clark, 1987), is shown below
IF RF ALU1 MEM ALU2 WB
IF RF ALU1 MEM ALU2 WB
IF RF ALU1 MEM ALU2 WB
IF RF ALU1 MEM ALU2 WB
IF RF ALU1 MEM ALU2 WB
IF RF ALU1 MEM ALU2 WB
The first ALU stage (ALU1) is used for effective address calculation for memory references and branches. The second ALU stage (ALU2) is used for ALU operations and branch comparison. RF is both a decode and register-fetch cycle. Assume that when a register read and a register write of the same register occur in the same clock, the write data are forwarded (That is, the new data written to the register is the value that is read).
a. Find the number of adders needed, counting any adder or incrementor; show a combination of instructions and pipe stages that justify this answer. You need only give one combination that maximizes the adder count.
b. Find the number of register read and write ports and memory read and write ports required. Show that your answer is correct by showing a combination of instructions and pipeline stage indicating the instruction and the number of read ports and write ports required for that instruction. Note the number of register read ports depends on how many register sources are accessed in a given cycle. Likewise the number of register write ports depend on how many different registers are written in a given cycle. The same applies to memory read and write ports.
c. Determine any data forwarding for any ALUs that will be needed. Assume that there are separate ALUs for the ALU1 and ALU2 pipe stages. Put in all forwarding among ALUs necessary to avoid or reduce stalls. Show the relationship between the two instructions involved in forwarding using the format of the table in Figure C.23 but ignoring the last two columns. Be careful to consider forwarding across an intervening instruction—for example,
add x1, …
any instruction
add …, x1, …
d. Show all of the data forwarding requirements necessary to avoid or reduce stalls when either the source or destination unit is not an ALU. Use the same format as in Figure C.23, again ignoring the last two columns. Remember to forward to and from memory references.
e. Show all the remaining hazards that involve at least one unit other than an ALU as the source or destination unit. Use a table like that shown in Figure C.25, but replace the last column with the lengths of the hazards.
f. Show all control hazards by example and state the length of the stall. Use a format like that
shown in Figure C.11, labeling each example.
Figure C.11 The behavior of a delayed branch is the same whether or not the branch is taken. The instructions in the delay slot (there was only one delay slot for most RISC architectures that incorporated them) are executed. If the branch is untaken, execution continues with the instruction after the branch delay instruction; if the branch is taken, execution continues at the branch target. When the instruction in the branch delay slot is also a branch, the meaning is unclear: if the branch is not taken, what should happen to the branch in the branch delay slot? Because of this confusion, architectures with delay branches often disallow putting a branch in the delay slot.
Figure C.23 Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from the ALU result (in EX/MEM or in MEM/WB) or from the load result in MEM/WB. There are 10 separate comparisons needed to tell whether a forwarding operation should occur. The top and bottom ALU inputs refer to the inputs corresponding to the first and second ALU source operands, respectively, and are shown explicitly in Figure C.18 on page C.30 and in Figure C.24 on page C.36. Remember that the pipeline latch for destination instruction in EX is ID/EX, while the source values come from the ALUOutput portion of EX/MEM or MEM/WB or the LMD portion of MEM/WB. There is one complication not addressed by this logic: dealing with multiple instructions that write the same register. For example, during the code sequence add x1, x2, x3; addi x1, x1, 2; sub x4, x3, x1, the logic must ensure that the sub instruction uses the result of the addi instruction rather than the result of the add instruction. The logic shown here can be extended to handle this case by simply testing that forwarding from MEM/WB is enabled only when forwarding from EX/MEM is not enabled for the same input. Because the addi result will be in EX/MEM, it will be forwarded, rather than the add result in MEM/WB.
Figure C.25 To minimize the impact of deciding whether a conditional branch is taken, we compute the branch target address in ID while doing the conditional test and final selection of next PC in EX. As mentioned in Figure C.19, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle.
Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."