

SEN361 Computer Organization

Prof. Dr. Hasan Hüseyin BALIK

(9<sup>th</sup> Week)

Outline

3. The Central Processing Unit

- 3.1 Instruction Sets: Characteristics and Functions
- 3.2 Instruction Sets: Addressing Modes and Formats
- 3.3 Processor Structure and Function
- 3.4 Reduced Instruction Set Computers
- 3.5 Instruction-Level Parallelism and Superscalar Processors

3.4 Reduced Instruction Set Computers (RISC)


# 3.4 Outline

- Instruction Execution Characteristics
- The Use of a Large Register File
- Compiler-Based Register Optimization
- Reduced Instruction Set Architecture
- RISC Pipelining
- MIPS R4000
- Sparc
- RISC Versus CISC Controversy

# Major advances since the birth of the computer

- The family concept
- Microprogrammed control unit
- Cache memory
- Pipelining
- Multiple processors
- Reduced instruction set computer (RISC) architecture

# Characteristics of Some CISCs, RISCs, and Superscalar Processors

| Characteristic                     | IBM 370/168 | VAX 11/780 | Intel 80486 | SPARC  | MIPS R4000 | PowerPC     | Ultra SPARC | MIPS R10000 |
|------------------------------------|-------------|------------|-------------|--------|------------|-------------|-------------|-------------|
| Class                              | CISC        | CISC       | CISC        | RISC   | RISC       | Superscalar | Superscalar | Superscalar |
| Year developed                     | 1973        | 1978       | 1989        | 1987   | 1991       | 1993        | 1996        | 1996        |
| Number of instructions             | 208         | 303        | 235         | 69     | 94         | 225         |             |             |
| Instruction size (bytes)           | 2–6         | 2–57       | 1–11        | 4      | 4          | 4           | 4           | 4           |
| Addressing modes                   | 4           | 22         | 11          | 1      | 1          | 2           | 1           | 1           |
| Number of general-purpose registers | 16         | 16         | 8           | 40–520 | 32         | 32          | 40–520      | 32          |
| Control memory size (Kbits)        | 420         | 480        | 246         | —      | —          | —           | —           | —           |
| Cache size (KBytes)                | 64          | 64         | 8           | 32     | 128        | 16–32       | 32          | 64          |

Characteristics of Some CISCs, RISCs, and Superscalar Processors

# Instruction Execution Characteristics

### **High-level languages (HLLs)**

- Allow the programmer to express algorithms more concisely
- Allow the compiler to take care of details that are not important in the programmer's expression of algorithms
- Often naturally support the use of structured programming and/or object-oriented design

#### **Execution sequencing**

- Determines the control and pipeline organization

#### **Operands used**

- The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them

#### **Semantic gap**

- The difference between the operations provided in HLLs and those provided in computer architecture

#### **Operations performed**

- Determine the functions to be performed by the processor and its interaction with memory

# Weighted Relative Dynamic Frequency of HLL Operations

| Statement | Dynamic Occurrence (Pascal) | Dynamic Occurrence (C) | Machine-Instruction Weighted (Pascal) | Machine-Instruction Weighted (C) | Memory-Reference Weighted (Pascal) | Memory-Reference Weighted (C) |
|-----------|-----------------------------|------------------------|---------------------------------------|----------------------------------|------------------------------------|-------------------------------|
| ASSIGN    | 45%                         | 38%                    | 13%                                   | 13%                              | 14%                                | 15%                           |
| LOOP      | 5%                          | 3%                     | 42%                                   | 32%                              | 33%                                | 26%                           |
| CALL      | 15%                         | 12%                    | 31%                                   | 33%                              | 44%                                | 45%                           |
| IF        | 29%                         | 43%                    | 11%                                   | 21%                              | 7%                                 | 13%                           |
| GOTO      | —                           | 3%                     | —                                     | —                                | —                                  | —                             |
| OTHER     | 6%                          | 1%                     | 3%                                    | 1%                               | 2%                                 | 1%                            |

Weighted Relative Dynamic Frequency of HLL Operations [PATT82a]



|                  | Pascal | C   | Average |
|------------------|--------|-----|---------|
| Integer Constant | 16%    | 23% | 20%     |
| Scalar Variable  | 58%    | 53% | 55%     |
| Array/Structure  | 26%    | 24% | 25%     |

Dynamic Percentage of Operands

# Procedure Arguments and Local Scalar Variables

| Percentage of Executed<br>Procedure Calls With | Compiler, Interpreter, and<br>Typesetter | Small Nonnumeric<br>Programs |
|------------------------------------------------|------------------------------------------|------------------------------|
| >3 arguments                                   | 0-7%                                     | 0-5%                         |
| >5 arguments                                   | 0-3%                                     | 0%                           |
| >8 words of arguments and<br>local scalars     | 1-20%                                    | 0-6%                         |
| >12 words of arguments and<br>local scalars    | 1-6%                                     | 0–3%                         |

**Procedure Arguments and Local Scalar Variables** 

# Implications

HLLs can best be supported by optimizing performance of the most time-consuming features of typical HLL programs

Three elements characterize RISC architectures:

- Use a large number of registers or use a compiler to optimize register usage
- Careful attention needs to be paid to the design of instruction pipelines
- Instructions should have predictable costs and be consistent with a high-performance implementation

# The Use of a Large Register File

### Software Solution

- Requires the compiler to allocate registers
- Allocates registers to the variables that are used most in a given period of time
- Requires sophisticated program analysis

### Hardware Solution

- More registers
- Thus more variables will be in registers

# **Overlapping Register Windows**

Figure 15.1 Overlapping Register Windows

# Circular Buffer Organization of Overlapped Windows

Figure 15.2 Circular-Buffer Organization of Overlapped Windows
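The circular-buffer behaviour named in Figure 15.2 can be sketched as a small counting model. This is only an illustration under stated assumptions: the class name `RegisterWindows` and its fields are hypothetical, and the model assumes the usual textbook rule that a file with N windows can hold N − 1 procedure activations, with the oldest window saved to memory on overflow and brought back on underflow.

```python
class RegisterWindows:
    """Toy model of a circular buffer of overlapping register windows.

    Assumption: n_windows windows hold at most n_windows - 1 activations.
    """

    def __init__(self, n_windows):
        self.capacity = n_windows - 1  # activations that fit in the register file
        self.depth = 1                 # current call depth (outermost procedure = 1)
        self.resident = 1              # activations currently held in registers
        self.overflows = 0             # windows saved to memory
        self.underflows = 0            # windows restored from memory

    def call(self):
        self.depth += 1
        if self.resident == self.capacity:
            # Buffer full: the oldest window is saved and its slot is reused.
            self.overflows += 1
        else:
            self.resident += 1

    def ret(self):
        if self.depth == 1:
            raise RuntimeError("return from the outermost procedure")
        self.depth -= 1
        self.resident -= 1
        if self.resident == 0:
            # The caller's window was saved earlier: restore it from memory.
            self.underflows += 1
            self.resident = 1


windows = RegisterWindows(n_windows=4)
for _ in range(5):
    windows.call()
for _ in range(5):
    windows.ret()
print(windows.overflows, windows.underflows)
```

With shallow call nesting no saves or restores occur at all, which is the case the large register file is designed for.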

# **Global Variables**

- Variables declared as global in an HLL can be assigned memory locations by the compiler and all machine instructions that reference these variables will use memory reference operands
  - However, for frequently accessed global variables this scheme is inefficient
- Alternative is to incorporate a set of global registers in the processor
  - These registers would be fixed in number and available to all procedures
  - A unified numbering scheme can be used to simplify the instruction format
- There is an increased hardware burden to accommodate the split in register addressing
- In addition, the linker must decide which global variables should be assigned to registers

## Characteristics of Large-Register-File and Cache Organizations

| Large Register File                                   | Cache                                             |
|-------------------------------------------------------|---------------------------------------------------|
| All local scalars                                     | Recently-used local scalars                       |
| Individual variables                                  | Blocks of memory                                  |
| Compiler-assigned global variables                    | Recently-used global variables                    |
| Save/Restore based on procedure nesting depth         | Save/Restore based on cache replacement algorithm |
| Register addressing                                   | Memory addressing                                 |
| Multiple operands addressed and accessed in one cycle | One operand addressed and accessed per cycle      |

Characteristics of Large-Register-File and Cache Organizations

### Referencing a Scalar



(a) Windows-based register file



Figure 15.3 Referencing a Scalar

# **Graph Coloring Approach**



**Actual Registers** 

(a) Time sequence of active use of registers



(b) Register interference graph

Figure 15.4 Graph Coloring Approach
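A minimal sketch of the graph coloring idea, assuming a simple greedy heuristic (highest-degree node first). The interference graph below is a made-up example, not the one in Figure 15.4, and the function names are hypothetical. Nodes are symbolic registers, an edge means two symbolic registers are live at the same time, and colors are actual registers; a node that cannot be colored is left in memory.

```python
def build_graph(edges):
    """Build a symmetric adjacency map from a list of interference edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def color_registers(interference, n_registers):
    """Greedily assign actual registers (colors) to symbolic registers (nodes)."""
    assignment = {}
    spilled = []
    # Heuristic: handle the most constrained (highest-degree) nodes first.
    order = sorted(interference, key=lambda n: (-len(interference[n]), n))
    for node in order:
        used = {assignment[nb] for nb in interference[node] if nb in assignment}
        free = [r for r in range(n_registers) if r not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # no register left: keep this value in memory
    return assignment, spilled


# Hypothetical interference graph over six symbolic registers A..F.
graph = build_graph([("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"),
                     ("D", "E"), ("D", "F"), ("E", "F")])
print(color_registers(graph, 3))
print(color_registers(graph, 2))
```

Production compilers use more careful orderings and spill-cost estimates; the greedy pass only shows how interference edges constrain the assignment.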

# Why CISC ?

### (Complex Instruction Set Computer)

There is a trend toward richer instruction sets, which include both a larger number of instructions and more complex instructions

Two principal reasons for this trend:

- A desire to simplify compilers
- A desire to improve performance

Smaller programs are expected to bring two advantages:

- The program takes up less memory
- Performance should improve
  - Fewer instructions mean fewer instruction bytes to be fetched
  - In a paging environment, smaller programs occupy fewer pages, reducing page faults
  - More instructions fit in cache(s)



|            | [PATT82a]     | [KATE83]      | [HEAT84]     |
|------------|---------------|---------------|--------------|
|            | 11 C Programs | 12 C Programs | 5 C Programs |
| RISC I     | 1.0           | 1.0           | 1.0          |
| VAX-11/780 | 0.8           | 0.67          |              |
| M68000     | 0.9           |               | 0.9          |
| Z8002      | 1.2           |               | 1.12         |
| PDP-11/70  | 0.9           | 0.71          |              |

Code Size Relative to RISC I

# Characteristics of Reduced Instruction Set Architectures

- One machine instruction per machine cycle
  - Machine cycle: the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register
- Register-to-register operations
  - Only simple LOAD and STORE operations access memory
  - This simplifies the instruction set and therefore the control unit
- Simple addressing modes
  - Simplifies the instruction set and the control unit
- Simple instruction formats
  - Generally only one or a few formats are used
  - Instruction length is fixed and aligned on word boundaries
  - Opcode decoding and register operand accessing can occur simultaneously

### **Comparison of Register-to-Register and Memory-to-Memory Approaches**

(a) $A \leftarrow B + C$

Register to register (field widths in bits: opcode 8, register 4, memory address 16)

| Operation | Field 1 | Field 2 | Field 3 |
|-----------|---------|---------|---------|
| Load      | RB      | B       |         |
| Load      | RC      | C       |         |
| Add       | RA      | RB      | RC      |
| Store     | RA      | A       |         |

I = 104, D = 96, M = 200

Memory to memory (field widths in bits: opcode 8, memory address 16)

| Operation | Field 1 | Field 2 | Field 3 |
|-----------|---------|---------|---------|
| Add       | B       | C       | A       |

I = 56, D = 96, M = 152

(b) $A \leftarrow B + C$; $B \leftarrow A + C$; $D \leftarrow D - B$

Register to register (field widths in bits: opcode 8, register 4)

| Operation | Field 1 | Field 2 | Field 3 |
|-----------|---------|---------|---------|
| Add       | RA      | RB      | RC      |
| Add       | RB      | RA      | RC      |
| Sub       | RD      | RD      | RB      |

I = 60, D = 0, M = 60

Memory to memory (field widths in bits: opcode 8, memory address 16)

| Operation | Field 1 | Field 2 | Field 3 |
|-----------|---------|---------|---------|
| Add       | B       | C       | A       |
| Add       | A       | C       | B       |
| Sub       | B       | D       | D       |

I = 168, D = 288, M = 456

- I = number of bits occupied by executed instructions (the sum of the field widths)
- D = number of bits occupied by data
- M = total memory traffic = I + D

Figure 15.5 Two Comparisons of Register-to-Register and Memory-to-Memory Approaches
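The I, D, and M values in Figure 15.5 can be reproduced with a few lines of arithmetic. The sketch below assumes an 8-bit opcode, 4-bit register fields, and 16-bit memory address fields, and further assumes that each memory operand is 32 bits wide (inferred from D = 96 for three operands; the figure does not state it). The helper name `traffic` and the tuple encoding are hypothetical.

```python
OPCODE_BITS = 8    # opcode field width
REG_BITS = 4       # register specifier width
ADDR_BITS = 16     # memory address field width
OPERAND_BITS = 32  # assumed size of one memory operand


def traffic(instructions):
    """Return (I, D, M) for a list of (register_fields, memory_fields) pairs."""
    i = sum(OPCODE_BITS + REG_BITS * regs + ADDR_BITS * mems
            for regs, mems in instructions)
    d = sum(OPERAND_BITS * mems for _, mems in instructions)
    return i, d, i + d


LOAD = STORE = (1, 1)  # one register field, one memory address
ADD_REG = (3, 0)       # three register fields
ADD_MEM = (0, 3)       # three memory addresses

# (a) A <- B + C
print(traffic([LOAD, LOAD, ADD_REG, STORE]))  # register to register
print(traffic([ADD_MEM]))                     # memory to memory

# (b) three dependent operations
print(traffic([ADD_REG] * 3))                 # register to register
print(traffic([ADD_MEM] * 3))                 # memory to memory
```

For a single statement the memory-to-memory form moves less data, but once values are reused across statements the register-to-register form avoids the repeated operand traffic.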

# **Characteristics of Some Processors**

| Processor   | Number of instruction sizes | Max instruction size in bytes | Number of addressing modes | Indirect addressing | Load/store combined with arithmetic | Max number of memory operands | Unaligned addressing allowed | Max number of MMU uses | Number of bits for integer register specifier | Number of bits for FP register specifier |
|-------------|-----------------------------|-------------------------------|----------------------------|---------------------|-------------------------------------|-------------------------------|------------------------------|------------------------|-----------------------------------------------|------------------------------------------|
| AMD29000    | 1                           | 4                             | 1                          | no                  | no                                  | 1                             | no                           | 1                      | 8                                             | 3<sup>a</sup>                            |
| MIPS R2000  | 1                           | 4                             | 1                          | no                  | no                                  | 1                             | no                           | 1                      | 5                                             | 4                                        |
| SPARC       | 1                           | 4                             | 2                          | no                  | no                                  | 1                             | no                           | 1                      | 5                                             | 4                                        |
| MC88000     | 1                           | 4                             | 3                          | no                  | no                                  | 1                             | no                           | 1                      | 5                                             | 4                                        |
| HP PA       | 1                           | 4                             | 10<sup>a</sup>             | no                  | no                                  | 1                             | no                           | 1                      | 5                                             | 4                                        |
| IBM RT/PC   | 2<sup>a</sup>               | 4                             | 1                          | no                  | no                                  | 1                             | no                           | 1                      | 4<sup>a</sup>                                 | 3<sup>a</sup>                            |
| IBM RS/6000 | 1                           | 4                             | 4                          | no                  | no                                  | 1                             | yes                          | 1                      | 5                                             | 5                                        |
| Intel i860  | 1                           | 4                             | 4                          | no                  | no                                  | 1                             | no                           | 1                      | 5                                             | 4                                        |
| IBM 3090    | 4                           | 8                             | 2<sup>b</sup>              | no<sup>b</sup>      | yes                                 | 2                             | yes                          | 4                      | 4                                             | 2                                        |
| Intel 80486 | 12                          | 12                            | 15                         | no<sup>b</sup>      | yes                                 | 2                             | yes                          | 4                      | 3                                             | 3                                        |
| NSC 32016   | 21                          | 21                            | 23                         | yes                 | yes                                 | 2                             | yes                          | 4                      | 3                                             | 3                                        |
| MC68040     | 11                          | 22                            | 44                         | yes                 | yes                                 | 2                             | yes                          | 8                      | 4                                             | 3                                        |
| VAX         | 56                          | 56                            | 22                         | yes                 | yes                                 | 6                             | yes                          | 24                     | 4                                             | 0                                        |
| Clipper     | 4<sup>a</sup>               | 8<sup>a</sup>                 | 9<sup>a</sup>              | no                  | no                                  | 1                             | 0                            | 2                      | 4<sup>a</sup>                                 | 3<sup>a</sup>                            |
| Intel 80960 | 2<sup>a</sup>               | 8<sup>a</sup>                 | 9<sup>a</sup>              | no                  | no                                  | 1                             | yes<sup>a</sup>              | —                      | 5                                             | 3<sup>a</sup>                            |

a RISC that does not conform to this characteristic.

b CISC that does not conform to this characteristic.

# **The Effects of Pipelining**

Example program: Load $rA \leftarrow M$; Load $rB \leftarrow M$; Add $rC \leftarrow rA + rB$; Store $M \leftarrow rC$; Branch X

| Instruction      | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|------------------|---|---|---|---|---|---|---|---|---|----|----|----|----|
| Load rA ← M      | I | E | D |   |   |   |   |   |   |    |    |    |    |
| Load rB ← M      |   |   |   | I | E | D |   |   |   |    |    |    |    |
| Add rC ← rA + rB |   |   |   |   |   |   | I | E |   |    |    |    |    |
| Store M ← rC     |   |   |   |   |   |   |   |   | I | E  | D  |    |    |
| Branch X         |   |   |   |   |   |   |   |   |   |    |    | I  | E  |

(a) Sequential execution

Instruction sequence: Load $rA \leftarrow M$; Load $rB \leftarrow M$; Add $rC \leftarrow rA + rB$; Store $M \leftarrow rC$; Branch X; NOOP

(b) Two-stage pipelined timing

| Instruction      | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|------------------|---|---|---|---|---|---|---|---|
| Load rA ← M      | I | E | D |   |   |   |   |   |
| Load rB ← M      |   | I | E | D |   |   |   |   |
| NOOP             |   |   | I | E |   |   |   |   |
| Add rC ← rA + rB |   |   |   | I | E |   |   |   |
| Store M ← rC     |   |   |   |   | I | E | D |   |
| Branch X         |   |   |   |   |   | I | E |   |
| NOOP             |   |   |   |   |   |   | I | E |

(c) Three-stage pipelined timing

Instruction sequence: Load $rA \leftarrow M$; Load $rB \leftarrow M$; NOOP; Add $rC \leftarrow rA + rB$; Store $M \leftarrow rC$; Branch X; NOOP; NOOP (stages I, $E_1$, $E_2$, D)

(d) Four-stage pipelined timing

Figure 15.6 The Effects of Pipelining
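The time totals for the sequential and three-stage panels of Figure 15.6 can be checked with a small model. This is a sketch under simplifying assumptions: each stage (I, E, D) takes one time unit, one instruction enters the pipeline per time unit, the NOOPs needed for load and branch delays are already in the sequence, and resource conflicts are ignored. The function names are hypothetical.

```python
def sequential_time(program):
    """Each instruction runs all of its stages before the next one starts."""
    return sum(len(stages) for _, stages in program)


def pipelined_time(program):
    """One instruction enters per time unit; return when the last stage ends."""
    return max(start + len(stages)
               for start, (_, stages) in enumerate(program))


original = [
    ("Load rA <- M", "IED"),
    ("Load rB <- M", "IED"),
    ("Add rC <- rA + rB", "IE"),
    ("Store M <- rC", "IED"),
    ("Branch X", "IE"),
]

# Same program with the NOOPs shown in the three-stage panel.
three_stage = [
    ("Load rA <- M", "IED"),
    ("Load rB <- M", "IED"),
    ("NOOP", "IE"),
    ("Add rC <- rA + rB", "IE"),
    ("Store M <- rC", "IED"),
    ("Branch X", "IE"),
    ("NOOP", "IE"),
]

print(sequential_time(original))    # 13 time units
print(pipelined_time(three_stage))  # 8 time units
```

Even with two NOOPs added, the overlapped schedule finishes in 8 time units instead of 13.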

# **Optimization of Pipelining**

### Delayed branch

- Does not take effect until after execution of following instruction
- This following instruction is the delay slot

### Delayed Load

- The register that is the target of the load is locked by the processor
- Execution of the instruction stream continues until that register is required
- The processor then idles until the load is complete
- Rearranging instructions can allow useful work to be done while loading

### Loop Unrolling

- Replicate body of loop a number of times
- Iterate loop fewer times
- Reduces loop overhead
- Increases instruction parallelism
- Improves register, data cache, or TLB locality
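The loop unrolling transformation can be illustrated at source level. The sketch below unrolls a summation loop by a factor of 4 (an arbitrary choice for illustration) and handles leftover elements in a cleanup loop. In practice a compiler performs this on machine code; the Python version only shows the shape of the transformation and makes no claim about speed.

```python
def sum_rolled(values):
    """Original loop: one add and one loop test per element."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled4(values):
    """Same loop with the body replicated four times per iteration."""
    total = 0
    n = len(values)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        # Four copies of the body share a single loop test and branch.
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    for i in range(limit, n):
        # Cleanup loop for the remaining 0 to 3 elements.
        total += values[i]
    return total


data = list(range(10))
print(sum_rolled(data), sum_unrolled4(data))
```

The unrolled version executes roughly a quarter of the loop-control instructions, and the four independent loads in the body give a pipeline more work to overlap.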



# Normal and Delayed Branch

| Address | Normal Branch | Delayed Branch | Optimized Delayed Branch |
|---------|---------------|----------------|--------------------------|
| 100     | LOAD X, rA    | LOAD X, rA     | LOAD X, rA               |
| 101     | ADD 1, rA     | ADD 1, rA      | JUMP 105                 |
| 102     | JUMP 105      | JUMP 106       | ADD 1, rA                |
| 103     | ADD rA, rB    | NOOP           | ADD rA, rB               |
| 104     | SUB rC, rB    | ADD rA, rB     | SUB rC, rB               |
| 105     | STORE rA, Z   | SUB rC, rB     | STORE rA, Z              |
| 106     |               | STORE rA, Z    |                          |
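The three columns of the table can be checked with a toy interpreter. This is a sketch, not a model of any real machine: the tuple encoding of instructions, the function `run`, and the initial memory contents are all assumptions made for illustration. With `delayed=True`, a JUMP takes effect only after the instruction in the delay slot has executed.

```python
def run(program, memory, delayed):
    """Execute a {address: (op, a, b)} program; return (memory, instructions executed)."""
    regs = {"rA": 0, "rB": 0, "rC": 0}
    mem = dict(memory)
    pc = min(program)
    pending = None  # jump target waiting for the delay slot to finish
    executed = 0
    while pc in program:
        op, a, b = program[pc]
        executed += 1
        target_after_this = pending  # set if this instruction is a delay slot
        pending = None
        next_pc = pc + 1
        if op == "LOAD":
            regs[b] = mem[a]
        elif op == "ADD":
            regs[b] += a if isinstance(a, int) else regs[a]
        elif op == "SUB":
            regs[b] -= regs[a]
        elif op == "STORE":
            mem[b] = regs[a]
        elif op == "JUMP":
            if delayed:
                pending = a
            else:
                next_pc = a
        if target_after_this is not None:
            next_pc = target_after_this
        pc = next_pc
    return mem, executed


LOAD, ADD1, NOOP = ("LOAD", "X", "rA"), ("ADD", 1, "rA"), ("NOOP", None, None)
ADDB, SUBB, STORE = ("ADD", "rA", "rB"), ("SUB", "rC", "rB"), ("STORE", "rA", "Z")

normal = {100: LOAD, 101: ADD1, 102: ("JUMP", 105, None),
          103: ADDB, 104: SUBB, 105: STORE}
delayed_branch = {100: LOAD, 101: ADD1, 102: ("JUMP", 106, None), 103: NOOP,
                  104: ADDB, 105: SUBB, 106: STORE}
optimized = {100: LOAD, 101: ("JUMP", 105, None), 102: ADD1,
             103: ADDB, 104: SUBB, 105: STORE}

start = {"X": 7}  # hypothetical initial memory contents
print(run(normal, start, delayed=False))
print(run(delayed_branch, start, delayed=True))
print(run(optimized, start, delayed=True))
```

All three versions store the same value to Z; the optimized version does so without executing a NOOP, because the ADD that originally preceded the jump now fills the delay slot.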

### Use of the Delayed Branch

| Time            | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----------------|---|---|---|---|---|---|---|
| 100 LOAD X, rA  | I | E | D |   |   |   |   |
| 101 ADD 1, rA   |   | I | E |   |   |   |   |
| 102 JUMP 105    |   |   | I | E |   |   |   |
| 103 ADD rA, rB  |   |   |   | I | E |   |   |
| 105 STORE rA, Z |   |   |   |   | I | E | D |

(a) Traditional Pipeline

| Time            | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----------------|---|---|---|---|---|---|---|
| 100 LOAD X, rA  | I | E | D |   |   |   |   |
| 101 ADD 1, rA   |   | I | E |   |   |   |   |
| 102 JUMP 106    |   |   | I | E |   |   |   |
| 103 NOOP        |   |   |   | I | E |   |   |
| 106 STORE rA, Z |   |   |   |   | I | E | D |

(b) RISC Pipeline with Inserted NOOP

| Time            | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------------|---|---|---|---|---|---|
| 100 LOAD X, rA  | I | E | D |   |   |   |
| 101 JUMP 105    |   | I | E |   |   |   |
| 102 ADD 1, rA   |   |   | I | E |   |   |
| 105 STORE rA, Z |   |   |   | I | E | D |

(c) Reversed Instructions

Figure 15.7 Use of the Delayed Branch

# **MIPS R4000**

- One of the first commercially available RISC chip sets was developed by MIPS Technology Inc.
- Inspired by an experimental system developed at Stanford
- Uses 64 bits for all internal and external data paths and for addresses, registers, and the ALU
- Is partitioned into two sections, one containing the CPU and the other containing a coprocessor for memory management
- Provides for up to 128 Kbytes of high-speed cache, half each for instructions and data
- Has substantially the same architecture and instruction set as the earlier MIPS designs (R2000 and R3000)
- Supports thirty-two 64-bit registers

| OP    | Description                                     | OP      | Description                                      |
|-------|-------------------------------------------------|---------|--------------------------------------------------|
|       | **Load/Store Instructions**                     |         | **Multiply/Divide Instructions**                 |
| LB    | Load Byte                                       | MULT    | Multiply                                         |
| LBU   | Load Byte Unsigned                              | MULTU   | Multiply Unsigned                                |
| LH    | Load Halfword                                   | DIV     | Divide                                           |
| LHU   | Load Halfword Unsigned                          | DIVU    | Divide Unsigned                                  |
| LW    | Load Word                                       | MFHI    | Move From HI                                     |
| LWL   | Load Word Left                                  | MTHI    | Move To HI                                       |
| LWR   | Load Word Right                                 | MFLO    | Move From LO                                     |
| SB    | Store Byte                                      | MTLO    | Move To LO                                       |
| SH    | Store Halfword                                  |         | **Jump and Branch Instructions**                 |
| SW    | Store Word                                      | J       | Jump                                             |
| SWL   | Store Word Left                                 | JAL     | Jump and Link                                    |
| SWR   | Store Word Right                                | JR      | Jump to Register                                 |
|       | **Arithmetic Instructions (ALU Immediate)**     | JALR    | Jump and Link Register                           |
| ADDI  | Add Immediate                                   | BEQ     | Branch on Equal                                  |
| ADDIU | Add Immediate Unsigned                          | BNE     | Branch on Not Equal                              |
| SLTI  | Set on Less Than Immediate                      | BLEZ    | Branch on Less Than or Equal to Zero             |
| SLTIU | Set on Less Than Immediate Unsigned             | BGTZ    | Branch on Greater Than Zero                      |
| ANDI  | AND Immediate                                   | BLTZ    | Branch on Less Than Zero                         |
| ORI   | OR Immediate                                    | BGEZ    | Branch on Greater Than or Equal to Zero          |
| XORI  | Exclusive-OR Immediate                          | BLTZAL  | Branch on Less Than Zero And Link                |
| LUI   | Load Upper Immediate                            | BGEZAL  | Branch on Greater Than or Equal to Zero And Link |
|       | **Arithmetic Instructions (3-operand, R-type)** |         | **Coprocessor Instructions**                     |
| ADD   | Add                                             | LWCz    | Load Word to Coprocessor                         |
| ADDU  | Add Unsigned                                    | SWCz    | Store Word to Coprocessor                        |
| SUB   | Subtract                                        | MTCz    | Move To Coprocessor                              |
| SUBU  | Subtract Unsigned                               | MFCz    | Move From Coprocessor                            |
| SLT   | Set on Less Than                                | CTCz    | Move Control To Coprocessor                      |
| SLTU  | Set on Less Than Unsigned                       | CFCz    | Move Control From Coprocessor                    |
| AND   | AND                                             | COPz    | Coprocessor Operation                            |
| OR    | OR                                              | BCzT    | Branch on Coprocessor z True                     |
| XOR   | Exclusive-OR                                    | BCzF    | Branch on Coprocessor z False                    |
| NOR   | NOR                                             |         | **Special Instructions**                         |
|       | **Shift Instructions**                          | SYSCALL | System Call                                      |
| SLL   | Shift Left Logical                              | BREAK   | Break                                            |
| SRL   | Shift Right Logical                             |         |                                                  |
| SRA   | Shift Right Arithmetic                          |         |                                                  |
| SLLV  | Shift Left Logical Variable                     |         |                                                  |
| SRLV  | Shift Right Logical Variable                    |         |                                                  |
| SRAV  | Shift Right Arithmetic Variable                 |         |                                                  |

# MIPS R-Series Instruction Set

# **MIPS Instruction Formats**



| Field     | Meaning                                    |
|-----------|--------------------------------------------|
| Operation | Operation code                             |
| rs        | Source register specifier                  |
| rt        | Source/destination register specifier      |
| Immediate | Immediate, branch, or address displacement |
| Target    | Jump target address                        |
| rd        | Destination register specifier             |
| Shift     | Shift amount                               |
| Function  | ALU/shift function specifier               |

### Figure 15.9 MIPS Instruction Formats
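The fields listed for Figure 15.9 can be extracted from an instruction word with shifts and masks. The sketch below assumes the standard 32-bit MIPS layout, which the text above does not spell out: a 6-bit operation code, 5-bit rs, rt, rd, and shift fields, a 6-bit function field, a 16-bit immediate, and a 26-bit jump target. The function name `decode_mips` is hypothetical, and the decoder returns the fields of all three formats without deciding which format applies.

```python
def decode_mips(word):
    """Split a 32-bit instruction word into the fields of the I, J, and R formats."""
    return {
        "operation": (word >> 26) & 0x3F,  # bits 31..26, common to all formats
        "rs": (word >> 21) & 0x1F,         # bits 25..21 (I- and R-type)
        "rt": (word >> 16) & 0x1F,         # bits 20..16 (I- and R-type)
        "rd": (word >> 11) & 0x1F,         # bits 15..11 (R-type)
        "shift": (word >> 6) & 0x1F,       # bits 10..6  (R-type)
        "function": word & 0x3F,           # bits 5..0   (R-type)
        "immediate": word & 0xFFFF,        # bits 15..0  (I-type)
        "target": word & 0x3FFFFFF,        # bits 25..0  (J-type)
    }


# R-type example: ADD with rd = 8, rs = 9, rt = 10 (function code 0x20).
fields = decode_mips(0x012A4020)
print(fields["operation"], fields["rs"], fields["rt"],
      fields["rd"], fields["shift"], fields["function"])
```

Because every format is the same length and the operation code is always in the same position, the register specifiers can be read while the operation code is still being decoded.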

# **Enhancing the R3000 Pipeline**



(a) Detailed R3000 pipeline

| ITLB | I-Cache | RF | ALU | DTLB | D-Cache | WB |
|------|---------|----|-----|------|---------|----|

(b) Modified R3000 pipeline with reduced latencies

| ITLB | RF | ALU | D-Cache | TC | WB |
|------|----|-----|---------|----|----|

| Abbreviation | Meaning                         |
|--------------|---------------------------------|
| IF           | Instruction fetch               |
| RD           | Read                            |
| MEM          | Memory access                   |
| WB           | Write back to register file     |
| I-Cache      | Instruction cache access        |
| RF           | Fetch operand from register     |
| D-Cache      | Data cache access               |
| ITLB         | Instruction address translation |
| IDEC         | Instruction decode              |
| IA           | Compute instruction address     |
| DA           | Calculate data virtual address  |
| DTLB         | Data address translation        |
| TC           | Data cache tag check            |

(c) Optimized R3000 pipeline with parallel TLB and cache accesses

#### Figure 15.10 Enhancing the R3000 Pipeline

# **R3000 Pipeline Stages**

| Pipeline<br>Stage | Phase             | Function                                                                                                                                |
|-------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| IF                | φ1                | Using the TLB, translate an instruction virtual address to a physical<br>address (after a branching decision).                          |
| IF                | φ2                | Send the physical address to the instruction cache.                                                                                     |
| RD                | φ1                | Return instruction from instruction cache.<br>Compare tags and validity of fetched instruction.                                         |
| RD                | φ2                | Decode instruction.<br>Read register file.<br>If branch, calculate branch target address.                                               |
| ALU               | φ1 + φ2           | If register-to-register operation, the arithmetic or logical operation is<br>performed.                                                 |
| ALU               | φ1                | If a branch, decide whether the branch is to be taken or not.<br>If a memory reference (load or store), calculate data virtual address. |
| ALU               | φ2                | If a memory reference, translate data virtual address to physical using TLB.                                                            |
| MEM               | φ1                | If a memory reference, send physical address to data cache.                                                                             |
| MEM               | φ2                | If a memory reference, return data from data cache, and check tags.                                                                     |
| WB                | φ1                | Write to register file.                                                                                                                 |

# Theoretical R3000 and Actual R4000 Superpipelines

|     | Clock cycle |     |     |     |     |     |     |     |     |     |     |
|-----|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | IC1         | IC2 | RF  | ALU | ALU | DC1 | DC2 | TC1 | TC2 | WB  |     |
|     |             | IC1 | IC2 | RF  | ALU | ALU | DC1 | DC2 | TC1 | TC2 | WB  |

(a) Superpipelined implementation of the optimized R3000 pipeline



(b) R4000 pipeline

- IF = Instruction fetch first half
- IS = Instruction fetch second half
- RF = Fetch operands from register
- EX = Instruction execute
- IC = Instruction cache
- DC = Data cache
- DF = Data cache first half
- DS = Data cache second half
- TC = Tag check
- WB = Write back to register file

### Figure 15.11 Theoretical R3000 and Actual R4000 Superpipelines
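
The staggered layout in part (a) of the figure can be reproduced with a short Python sketch. It assumes an idealized pipeline in which a new instruction enters on every clock cycle and nothing stalls, so `n` instructions on a `k`-stage pipeline occupy `k + n - 1` cycles. The function name `pipeline_diagram` is hypothetical; the stage list is the one shown for the superpipelined R3000.

```python
# Stage sequence of the superpipelined, optimized R3000 (Figure 15.11a)
SUPERPIPELINED_R3000 = [
    "IC1", "IC2", "RF", "ALU", "ALU", "DC1", "DC2", "TC1", "TC2", "WB",
]


def pipeline_diagram(stages, n_instructions):
    """Return one row per instruction; row i is shifted right by i cycles.

    Assumption: one instruction is issued per cycle and there are no stalls.
    Empty strings mark cycles in which the instruction is not in the pipeline.
    """
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        trailing = total_cycles - i - len(stages)
        rows.append([""] * i + list(stages) + [""] * trailing)
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(SUPERPIPELINED_R3000, 2):
        print(" ".join(f"{stage:>4}" for stage in row))
```

With two instructions and ten stages, the sketch produces eleven columns, matching the two rows of the figure.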

# **R4000 Pipeline Stages**

- Instruction fetch first half
  - Virtual address is presented to the instruction cache and the translation lookaside buffer (TLB)
- Instruction fetch second half
  - Instruction cache outputs the instruction and the TLB generates the physical address
- Register file
  - Three activities occur in parallel:
    - Instruction is decoded and a check is made for interlock conditions
    - Instruction cache tag check is made
    - Operands are fetched from the register file
- Instruction execute
  - One of three activities can occur:
    - If a register-to-register operation, the ALU performs the operation
    - If a load or store, the data virtual address is calculated
    - If a branch, the branch target virtual address is calculated and branch conditions are checked
- Data cache first half
  - Virtual address is presented to the data cache and TLB
- Data cache second half
  - The TLB generates the physical address and the data cache outputs the data
- Tag check
  - Cache tag checks are performed for loads and stores
- Write back
  - Instruction result is written back to the register file



# **SPARC: Scalable Processor Architecture**

- Architecture defined by Sun Microsystems
- Sun licenses the architecture to other vendors to produce SPARC-compatible machines
- Inspired by the Berkeley RISC 1 machine; its instruction set and register organization are based closely on the Berkeley RISC model

# SPARC Register Window Layout with Three Procedures



Figure 15.12 SPARC Register Window Layout with Three Procedures

# Eight Register Windows Forming a Circular Stack in SPARC



Figure 15.13 Eight Register Windows Forming a Circular Stack in SPARC
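
The overlap between adjacent windows can be made concrete with a small Python model. This is a sketch under stated assumptions, not a description of any particular SPARC chip: it assumes 8 global registers plus 8 windows of 8 ins, 8 locals, and 8 outs, with logical registers r0 to r7 as globals (r0 always reads zero), r8 to r15 as outs, r16 to r23 as locals, and r24 to r31 as ins. The class name `RegisterWindows`, the physical layout, and the use of Python exceptions in place of window overflow and underflow traps are all hypothetical choices made for illustration.

```python
class RegisterWindows:
    """Toy model of overlapping register windows arranged as a circular stack."""

    def __init__(self, n_windows=8):
        self.n = n_windows
        self.globals = [0] * 8
        # Each window contributes 16 unique registers (outs + locals);
        # its ins are physically the outs of the window that called it.
        self.windowed = [0] * (16 * n_windows)
        self.cwp = 0      # current window pointer
        self.active = 1   # number of procedure activations holding a window

    def _physical(self, r):
        """Map logical register r (8..31) to an index into self.windowed."""
        if 8 <= r <= 15:                       # outs
            index = 16 * self.cwp + (r - 8)
        elif 16 <= r <= 23:                    # locals
            index = 16 * self.cwp + 8 + (r - 16)
        else:                                  # ins
            index = 16 * (self.cwp + 1) + (r - 24)
        return index % (16 * self.n)           # wrap around the circle

    def read(self, r):
        if not 0 <= r <= 31:
            raise ValueError("register number must be 0..31")
        if r < 8:
            return 0 if r == 0 else self.globals[r]
        return self.windowed[self._physical(r)]

    def write(self, r, value):
        if not 0 <= r <= 31:
            raise ValueError("register number must be 0..31")
        if r == 0:
            return                             # r0 is hardwired to zero
        if r < 8:
            self.globals[r] = value
        else:
            self.windowed[self._physical(r)] = value

    def save(self):
        """Procedure call: advance to a new window."""
        # Because of the overlap, N windows hold only N - 1 activations.
        if self.active == self.n - 1:
            raise OverflowError("window overflow (a trap on real hardware)")
        self.cwp = (self.cwp - 1) % self.n     # direction is a detail of this sketch
        self.active += 1

    def restore(self):
        """Procedure return: move back to the caller's window."""
        if self.active == 1:
            raise IndexError("window underflow (a trap on real hardware)")
        self.cwp = (self.cwp + 1) % self.n
        self.active -= 1
```

A caller that writes an argument to r8 (its first out register) and then executes `save()` will find the same value in r24 (the callee's first in register), so parameters are passed without any memory traffic. On real hardware the overflow and underflow cases are handled by the operating system, which spills windows to memory and reloads them.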

| OP                          | Description             | OP                             | Description                        |
|-----------------------------|-------------------------|--------------------------------|------------------------------------|
| **Load/Store Instructions** |                         | **Arithmetic Instructions**    |                                    |
| LDSB                        | Load signed byte        | ADD                            | Add                                |
| LDSH                        | Load signed halfword    | ADDCC                          | Add, set icc                       |
| LDUB                        | Load unsigned byte      | ADDX                           | Add with carry                     |
| LDUH                        | Load unsigned halfword  | ADDXCC                         | Add with carry, set icc            |
| LD                          | Load word               | SUB                            | Subtract                           |
| LDD                         | Load doubleword         | SUBCC                          | Subtract, set icc                  |
| STB                         | Store byte              | SUBX                           | Subtract with carry                |
| STH                         | Store halfword          | SUBXCC                         | Subtract with carry, set icc       |
| ST                          | Store word              | MULSCC                         | Multiply step, set icc             |
| STD                         | Store doubleword        | **Jump/Branch Instructions**   |                                    |
| **Shift Instructions**      |                         | BCC                            | Branch on condition                |
| SLL                         | Shift left logical      | FBCC                           | Branch on floating-point condition |
| SRL                         | Shift right logical     | CBCC                           | Branch on coprocessor condition    |
| SRA                         | Shift right arithmetic  | CALL                           | Call procedure                     |
| **Boolean Instructions**    |                         | JMPL                           | Jump and link                      |
| AND                         | AND                     | TCC                            | Trap on condition                  |
| ANDCC                       | AND, set icc            | SAVE                           | Advance register window            |
| ANDN                        | AND NOT                 | RESTORE                        | Move windows backward              |
| ANDNCC                      | AND NOT, set icc        | RETT                           | Return from trap                   |
| OR                          | OR                      | **Miscellaneous Instructions** |                                    |
| ORCC                        | OR, set icc             | SETHI                          | Set high 22 bits                   |
| ORN                         | OR NOT                  | UNIMP                          | Unimplemented instruction (trap)   |
| ORNCC                       | OR NOT, set icc         | RD                             | Read a special register            |
| XOR                         | XOR                     | WR                             | Write a special register           |
| XORCC                       | XOR, set icc            | IFLUSH                         | Instruction cache flush            |
| XNOR                        | Exclusive NOR           |                                |                                    |
| XNORCC                      | Exclusive NOR, set icc  |                                |                                    |

# SPARC Instruction Set

Table 15.11 SPARC Instruction Set

# Synthesizing Other Addressing Modes with SPARC Addressing Modes

| Instruction Type     | Addressing Mode   | Algorithm    | SPARC Equivalent                  |
|----------------------|-------------------|--------------|-----------------------------------|
| Register-to-register | Immediate         | operand = A  | S2                                |
| Load, store          | Direct            | EA = A       | R<sub>0</sub> + S2                |
| Register-to-register | Register          | EA = R       | R<sub>S1</sub>, R<sub>S2</sub>    |
| Load, store          | Register Indirect | EA = (R)     | R<sub>S1</sub> + 0                |
| Load, store          | Displacement      | EA = (R) + A | R<sub>S1</sub> + S2               |

S2 = either a register operand or a 13-bit immediate operand
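
The table relies on a single address computation for loads and stores, EA = R<sub>S1</sub> + S2, with R<sub>0</sub> fixed at zero. The Python sketch below shows how direct, register indirect, and displacement addressing all fall out of that one rule. The function names are hypothetical, and two details are assumptions not stated on the slide: the 13-bit immediate is sign-extended, and addresses wrap at 32 bits.

```python
def sign_extend_13(value):
    """Interpret the low 13 bits of value as a signed two's-complement number."""
    value &= 0x1FFF
    return value - 0x2000 if value & 0x1000 else value


def effective_address(regs, rs1, s2, s2_is_immediate):
    """Compute EA = R[rs1] + S2 for a load or store.

    regs            -- list of 32 register values (regs[0] is ignored: R0 reads as 0)
    rs1             -- number of the base register
    s2              -- a register number, or a 13-bit immediate
    s2_is_immediate -- True if s2 is an immediate, False if it names a register
    """
    base = 0 if rs1 == 0 else regs[rs1]
    if s2_is_immediate:
        offset = sign_extend_13(s2)
    else:
        offset = 0 if s2 == 0 else regs[s2]
    return (base + offset) & 0xFFFFFFFF  # assumed 32-bit address space


if __name__ == "__main__":
    regs = [0] * 32
    regs[5] = 0x1000
    print(hex(effective_address(regs, 0, 0x40, True)))  # direct: R0 + S2
    print(hex(effective_address(regs, 5, 0, True)))     # register indirect: RS1 + 0
    print(hex(effective_address(regs, 5, 8, True)))     # displacement: RS1 + S2
```

Keeping one hardware addressing mode and letting the assembler or compiler synthesize the others is consistent with the RISC goal of a small, uniform instruction set.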

### SPARC Instruction Formats



Figure 15.14 SPARC Instruction Formats

# **RISC versus CISC Controversy**

### Quantitative

- Compare program sizes and execution speeds of programs on RISC and CISC machines that use comparable technology

### Qualitative

- Examine issues of high-level language support and use of VLSI real estate

### Problems with comparisons:

- No pair of RISC and CISC machines is comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, etc.
- No definitive set of test programs exists
- Difficult to separate hardware effects from compiler effects
- Most comparisons have been done on "toy" machines rather than commercial products
- Most commercial devices advertised as RISC possess a mixture of RISC and CISC characteristics