# CS184b: Computer Architecture (Abstractions and Optimizations)

Day 2: April 2, 2003
Instruction Set Architecture



Caltech CS184 Spring2003 -- DeHon

# Today

- · Datapath review
- H&P view on ISA
- Questions
- Themes
- Compilers
- RISC

# RISC?

What does RISC mean?

Caltech CS184 Spring2003 -- DeHon

3

# Terminology

- Primitive Instruction (pinst)
  - Collection of bits which tell a single bitprocessing element what to do

- Includes:

- select compute operation
- input sources in space
  - (interconnect)
- input sources in time
  - (retiming)

0000 00 010 11 0110 net0 add mem slqt#6

# Instructions

- Distinguishing feature of programmable architectures?
  - Instructions -- bits which tell the device how to behave



Caltech CS184 Spring2003 -- DeHon

5

# Single ALU Datapath



6



#### ISA

- Model based around Sequential Instruction Execution
- Visible-Machine-State × Instruction → New Visible-Machine-State
- New Visible-Machine-State identifies next instruction
  - PC←PC+1 or PC←branch-target

C5184a Day 5

# Machine State: Initial

- · Counter: 0
- Instruction Memory:

Caltech CS184 Spring2003 -- DeHon

· Data Memory:

000: A

001: B

010: ?

011: ?

100: 00001111

101: ?

110: ?

111: ?

9

# ISA

- Visible Machine State
  - Registers (including PC)
  - Memory
- Instructions
  - ADD R1, R2, R3
  - LD R3, R4
  - BNE R4, R2, loop



#### Instructions

- Primitive operations for constructing (describing) a computation
- · Need to do?
  - Interconnect (space and time)
  - Compute (intersect bits)
  - Control (select operations to run)



#### uCoded / Decoded

- uCoded
  - Bits directly control datapath
  - Horizontal vs. Vertical
  - Not abstract from implementation
- Decoded
  - more compressed
  - only support most common operations
  - abstract from implementation
  - time/area to decode to datapath control signals



#### **H&P View**

- ISA design done?
- Not many opportunities to completely redefine
- Many things mostly settled
  - at least until big technology perturbations arrive
- Implementation (uArch) is where most of the action is
- Andre: maybe we've found a nice local minima...

# **H&P** Issues

- Registers/stack/accumulator
  - # operands, memory ops in instruction
- Addressing Modes
- Operations
- · Control flow
- Primitive Data types
- Encoding

Caltech CS184 Spring2003 -- DeHon

17

# Register/stack/accumulator

- Driven largely by cost model
  - ports into memory
  - latency of register versus memory
  - instruction encoding (bits to specify)
- Recall Assignment 3: Instructions #4

# Register/stack/accumulator

- Today: Load-Store, General Register arch.
- Registers more freedom of addressing than stack
- Load into register, then operate
  - Separate op, not much slower than memory addressing mode
  - usually use more than once (net reduction)

Caltech CS184 Spring2003 -- DeHon

19

# **Addressing Modes**

Minimal:

Immediate #3

RegisterR1

register indirect (R1)

• Others:

- displacement

indirect (double derference)

– auto increment/decrement ( p[x++]=y)

scaled

20

# **Addressing Modes**

- More capable
  - less instructions
  - potentially longer instructions
    - bits and cycle time
  - many operations (complicate atomicity of instructions)
    - Add (R2)+,(R3)+,(R4)+

21

Caltech CS184 Spring2003 -- DeHon

# Address Space Quote

- The Virtual Address eXtension of the PDP-11 architecture . . . provides a virtual address of about 4.3 gigabytes which, even given the rapid improvement of memory technology, should be adequate far into the future.
- William Strecker, "VAX-11/780—A Virtual address Extension to the PDP-11 Family," AFIPS Proc., National Computer Conference, 1978

# **Operations**

- ALU/Arithmetic
  - add, sub, or, and, xor
  - compare
- Interconnect
  - move registers
  - load, store
- Control
  - jump
  - conditional branch
- procedure call/return

23

# **Operations: ALU**

- Small set of SIMD operations
- · Covers very small fraction of the space of all w×w→w

# Operations: Branching

- · Models:
  - ops set condition codes, branch on condition codes
    - · Extra state
  - compare result placed in register; branch on register zero or one
  - comparison part of branch
    - · May affect critical path for branch resolution

Caltech CS184 Spring2003 -- DeHon

25

# Operations: Procedure call/return

- ? Save registers?
- Update PC
  - call target
  - return address
- Change stack and frame pointers
  - store old
  - install new

# Operations: Procedure call/return

- Question: How much should instruction do?
- Lesson: High variance in work needs to be done
  - which registers need to save
  - best way to transfer arguments to procedures
  - better to expose primitives to the compiler and let it specialize the set of operations to the particular call

Caltech CS184 Spring2003 -- DeHon

# Data Types

- Powers of two from bytes to double words?
  - -8, 16, 32, 64
  - (very implementation driven decision)
- Floating Point types
- Are pointers integers?
- Alignment requirements

# **Encoding**

- Variable vs. Fixed
- How complex is the decoding?
  - Fields in the same place…or have to be routed/muxed?
  - Sequential requirements in decode?
    - E.g. must decode previous byte to know what to do with next byte?

29



# Encoding: RISC/Modern



[DLX Instruction Format from HP2nd ed. (Fig. 2.21)]

Caltech CS184 Spring2003 -- DeHon

31

# **Operation Complexity**

- · Contradiction?
  - Providing primitives
  - including floating point ops

#### **Local Minima?**

- Self-Fulfilling?
  - How would we quantitatively validate need for a new operation?
  - -[cue: bridge story]
  - This is what we use as primitives
  - Funny, we don't find a need for other primitives...

Caltech CS184 Spring2003 -- DeHon

33

#### **Themes**

- · Common case fast
- Provide primitives (building blocks)
- Let compiler specialize to particular operation
- Make decode/operation simple so implementation is fast

# Compilers

- 1960→1990 shift
  - increasing capability and sophistication of compilers
  - e.g.
    - · inter-procedural optimization
    - register assignment (register usage)
    - · strength reduction
    - · dataflow analysis and instruction reordering
    - (some progress) alias analysis

Caltech CS184 Spring2003 -- DeHon

35

# Compilers

- Gap between programmer and Architecture
- Increasingly bridged by compiler
- Less need to make assembly language human programmable
- More opportunity for compiler to specialize, partial evaluate
  - (do stuff at compile time to reduce runtime)
- RISC: "Relegate Interesting Stuff to Compiler"

# Implementation Significance

- André Agree: Implementation issues are significant in the design of ISA
- Many of these issues are more interesting when we discuss in light of implementation issues

Caltech CS184 Spring2003 -- DeHon

37

# ISA Driven by

- 1. Implementation costs
- 2. Compiler technology
- 3. Application structure
- Can't do good architecture in isolation from any of these issues.

#### RISC?

Caltech CS184 Spring2003 -- DeHon

39

# **VAX Instructions**

- Vary in length 1 to 53 bytes
- Some very complex
  - Powerful call routines
  - Polynomial evaluate (polyf)
  - Calculate CRC (crc)

# VAX / MIPS procedure

| MIPS versus VAX             |          |                                                            |                                                                                                                                                                               |           |                                                      |                                                                        |  |
|-----------------------------|----------|------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|------------------------------------------------------|------------------------------------------------------------------------|--|
|                             |          |                                                            | Saving                                                                                                                                                                        | egisters  |                                                      |                                                                        |  |
|                             |          | add1<br>SM<br>SM<br>SM<br>SM<br>SM<br>SM<br>SM<br>SM       | \$29,\$29, -36<br>\$15, 0(\$29)<br>\$16, 4(\$29)<br>\$17, 8(\$29)<br>\$18, 12(\$29)<br>\$19,16(\$29)<br>\$20,20(\$29)<br>\$24,24(\$29)<br>\$25,28(\$29)<br>\$31,32(\$29)      |           | ord ^m <i< th=""><th>r2,r3,r4,r5,r6,r7&gt;</th></i<> | r2,r3,r4,r5,r6,r7>                                                     |  |
|                             |          |                                                            | Procedi                                                                                                                                                                       | re body   |                                                      |                                                                        |  |
| Move<br>parameters          |          | nove                                                       | \$18, \$4<br>\$20, \$5                                                                                                                                                        |           | moval<br>moval                                       | r7.8(ap)<br>r5.4(ap)                                                   |  |
| Outer<br>loop               | for1tst: | add<br>s1t<br>beq                                          | \$19, \$0, \$0<br>\$8, \$19, \$20<br>\$8, \$0, exit1                                                                                                                          | for1tst:  | c1r1<br>cmp1<br>bgeq                                 | r6<br>r6,(r7)<br>exit1                                                 |  |
| lnner<br>loop               | for2tst: | s1t1<br>bne                                                | \$17. \$191<br>\$8. \$17. 0<br>\$8. \$17. 0<br>\$8. \$0. extt2<br>\$15. \$17. 4<br>\$16. \$18. \$15<br>\$24. 0 (\$16)<br>\$25. 4(\$16)<br>\$8. \$25. \$24<br>\$8. \$25. extt2 | for2tst:  | blss<br>movl<br>addl3<br>cmpl<br>bleg                | r4,r6,#1<br>extt2<br>r3,(r5)<br>r2,r4,#1<br>(r3)[r4],(r3)[r2]<br>extt2 |  |
| Pass parameters<br>and call |          |                                                            | \$4, \$18<br>\$5, \$17<br>Swap                                                                                                                                                |           | push1<br>push1<br>calls                              | (r5)<br>r4<br>#2.swap                                                  |  |
| Innter<br>Icop              |          | add1<br>J                                                  | \$17, \$17, -1<br>for2tst                                                                                                                                                     |           | dec1<br>brb                                          | r4<br>for2tst                                                          |  |
| Outer<br>loop               |          | add1<br>J                                                  | \$19, \$19, 1<br>for1tst                                                                                                                                                      | ex1t2:    | incl<br>brb                                          | r6<br>forltst                                                          |  |
|                             |          |                                                            | Restoring                                                                                                                                                                     | registers |                                                      |                                                                        |  |
|                             |          | lu<br>lu<br>lu<br>lu<br>lu<br>lu<br>lu<br>lu<br>lu<br>addi | \$15. D(\$29)<br>\$16. 4(\$29)<br>\$17. B(\$29)<br>\$18.12(\$29)<br>\$19.16(\$29)<br>\$20.20(\$29)<br>\$24.24(\$29)<br>\$25.2B(\$29)<br>\$31.32(\$29)<br>\$29,\$29,\$36       |           |                                                      |                                                                        |  |
| Procedure return            |          |                                                            |                                                                                                                                                                               |           |                                                      |                                                                        |  |
|                             |          | jr.                                                        | \$31                                                                                                                                                                          | exit1:    | ret                                                  |                                                                        |  |

http://jbsim.cs.pku.edu.cn/users/chengxu/Org\_web\_ext/PDF\_FILES/webext3\_vax.pdf 41

Caltech CS184 Spring2003 -- DeHon

# **RISC**

- Reduced Instruction Set Computers
- · Idea:
  - Provide/expose minimal primitives
  - Make sure primitives fast
  - Compose primitives to build functionality
  - Provide orthogonal instructions

# **RISC Equation**

- Time= CPI × Instructions × CycleTime
- · CISC:
  - Minimize: Instructions
  - Result in High CPI
  - Maybe High CycleTime
- · RISC:
  - Target single-cycle primitives (CPI~1)
  - Instruction Count increases
  - Simple encoding, ops → reduced Cycle Time

43

Caltech CS184 Spring2003 -- DeHon

#### **VAX** Data

#### Average VAX Instruction Timing (Cycles per Instruction)

|            | Compute        | Read  | R-Stall | Write | W-Stall | IB-Stall | Total  |
|------------|----------------|-------|---------|-------|---------|----------|--------|
| Decode     | 1.000          |       |         |       |         | 0.613    | 1.613  |
| Spec1      | 0.895          | 0.306 | 0.364   |       |         | *******  | 1.565  |
| Spec2-6    | 1.052          | 0.148 | 0.116   | 0.161 | 0.192   | 0.102    | 1.771  |
| B-Disp     | 0.221          |       |         |       | 0.102   | 0.005    | 0.226  |
| Simple     | 0.870          | 0.029 | 0.017   | 0.033 | 0.027   |          | 0.977  |
| Field      | 0.482          | 0.049 | 0.058   | 0.007 | 0.002   |          | 0.600  |
| Float      | 0.292          | 0.000 | 0.000   | 0.008 | 0.001   |          | 0.302  |
| Call/Ret   | 0.937          | 0.133 | 0.074   | 0.130 | 0.184   |          | 1.458  |
| System     | 0.434          | 0.015 | 0.031   | 0.014 | 0.028   |          |        |
| Character  | 0.318          | 0.039 | 0.099   | 0.046 | 0.004   |          | 0.522  |
| Decimal    | 0.026          | 0.002 | 0.000   | 0.001 | 0.002   |          | 0.506  |
|            |                | 0.002 | 0.000   | 0.001 | 0.002   |          | 0.031  |
| Int/Except | 0.055          | 0.002 | 0.005   | 0.004 | 0.006   |          | 0.054  |
| Mem Mngmt  | 0.555          | 0.061 | 0.200   | 0.004 |         | •        | 0.071  |
| Abort      | 0.127          | 0.001 | 0.200   | 0.004 | 0.003   |          | 0.824  |
|            | ~,_ <b>~</b> , |       |         |       |         |          | 0.127  |
| TOTAL      | 7.267          | 0.783 | 0.964   | 0.409 | 0.450   | 0.720    | 10.593 |

17

Caltech CS184 Spring2003 -- DeHon

[Emer/Clark, ISCA 1984]

#### RISC Enabler 1

- "large", fast On-Chip SRAM
  - Large enough to hold kernel exploded in RISC Ops ~ 1--10K 32b words?
- Previous machines:
  - Off-chip memory bandwidth bottleneck
  - Fetch single instruction from off chip
  - Execute large number of microinstructions from onchip ROM
    - · ROM smaller than SRAM
- Small/minimal machine → make room for cache

Caltech CS184 Spring2003 -- DeHon

45

#### RISC Enable 2

- High Level Programming
  - Bridge semantic gap by compiler
  - As opposed to providing powerful building blocks to assembly language programmer

#### Fit Problem

- "A great dal depends on being able to fit an entire CPU design on a single chip."
- "RISC computers benefit from being realizable at an earlier date."

47

Caltech CS184 Spring2003 -- DeHon

#### Common Case

 "wherever there is a system function that is expensive or slow in all its generality, but where software can recognize a frequently occurring degenerate case (or can move the entire function from runtime to compile time) that function is moved from hardware to software, resulting in lower cost and improved performance." – 801 paper

#### Measurement Good

- Don't assume you know what's going on measure
- Tune your intuition
- "Boy, you ruin all our fun -- you have data." - DEC designers in response to a detailed quantitative study [Emer/Clark Retrospective on 11/780 performance characterization]

49

Caltech CS184 Spring2003 -- DeHon

# VAX/RISC Compare

Table 1: Machine Implementation Parameters

|                    | VAX 4000/300            | MIPS M/2000               | VAX 8700                 |  |
|--------------------|-------------------------|---------------------------|--------------------------|--|
| Chip First Silicon | 1989                    | 1988                      | n/a                      |  |
| System Ship        | 1990                    | 1989                      | 1986                     |  |
| CPU                | REX520                  | R3000                     | n/a                      |  |
| Technology         | Custom CMOS             | Custom CMOS               | ECL gate array           |  |
| Component counts   |                         |                           |                          |  |
| CPU                | 140K transistors,       | 115K transistors          | approx. 100 gate arrays, |  |
|                    | 180Kbits mem            |                           | 1200 gates each          |  |
| FPU                | 134K transistors        | 105K transistors          | (included above)         |  |
| Feature size       | 1.5 micron              | 1.2 micron                |                          |  |
| Die size           |                         |                           |                          |  |
| CPU                | 12x12 mm <sup>2</sup>   | 7.6x8.7 mm <sup>2</sup>   | n/a                      |  |
| FPU                | 12.7x11 mm <sup>2</sup> | 12.6x12.6 mm <sup>2</sup> | ·                        |  |
| Cycle time         | 28 ns.                  | 40 ns.                    | 45 ns.                   |  |
| On-chip cache      | 2 KB                    | none                      | n/a                      |  |
| Board cache        | 128 KB I+D              | 64 KB I, 64 KB D          | 64 KB I+D                |  |
| TLB                | 64 entries              | 64 entries                | 1024 entries             |  |
| Page size          | 512 bytes               | 4 Kbytes                  | 512 bytes                |  |
| Memory access time | 13 cycles               | 12 cycles                 | 16 cycles                |  |
| FP multiply        | 15 cycles               | 5 cycles                  | 15 cycles                |  |
| FP Add             | 14 cycles               | 2 cycles                  | 11 cycles                |  |
| List price         | \$100K                  | \$80K                     | \$492K                   |  |
| Performance        |                         |                           |                          |  |
| Overall SPECmark   | 7.9                     | 17.6                      | 5.6                      |  |
| Integer SPECmark   | 7.7                     | 19.7                      | 5.0                      |  |
| FP SPECmark        | 8.1                     | 16.3                      | 6.0                      |  |

[Bhandarkar/Clark ASPLOS 1991] 50

# VAX/RISC Compare

Table 2: RISC factors instruc. | CPI MIPS benchmark ratio VAX ratio factor spice2g6 2.48 1.80 8.02 4.44 1.79 matrix300 2.37 3.06 1.90 13.81 4.51nasa7 2.10 3.01 14.95 4.97 2.37 2.70 3.88 1.45 15.16 10.45 fpppp tomcatv 2.86 2.13 17.458.18 2.86 doduc 2.65 1.67 13.16 7.85 2.96 espresso 1.70 1.06 5.405.09 2.99 equtott 1.08 1.254.38 3.51 3.251.62 1.10 6.53 5.97 3.69 5.77 2.17 1.71 9.87

|                        | min | geo. mean | max  |
|------------------------|-----|-----------|------|
| VAX CPI                | 5.4 | 9.9       | 17.4 |
| MIPS CPI               | 1.1 | 1.7       | 3.1  |
| CPI ratio (VAX/MIPS)   | 3.5 | 5.8       | 10.4 |
| Inst. ratio (MIPS/VAX) | 1.1 | 2.2       | 3.9  |
| RISC factor            | 1.8 | 2.7       | 3.7  |
|                        |     |           |      |

[Bhandarkar/Clark ASPLOS 1991] 51

Caltech CS184 Spring2003 -- DeHon

#### VAX

- Smoking gun?:
  - 3-operand instructions
  - One cycle per operand field
  - If field a mem-op, wash with Id/st in RISC
  - If register-op, costs more
- …long way to supporting gap…

# ISA Growth

- Can add instructions (upward compatibility)
- Do we ever get to remove any?

53

Caltech CS184 Spring2003 -- DeHon

# **RISC**

- Everyone believe RISC
  - X86 only one left
  - ...and it's a RISC core...
- · ...but do they understand it?
  - Today's processors pretty complicated
  - Who's managing instruction scheduling?
  - What mean to FPGAs?

# Big Ideas

- Common Case
  - Measure to verify/know what it is!
- Primitives
- Highly specialized instructions brittle
- Design pulls
  - simplify processor implementation
  - simplify coding
- Orthogonallity (limit special cases)
- Compiler: fill in gap between user and hardware architecture

Caltech CS184 Spring2003 -- DeHor