Contents

List of Figures .......................................................................................................................... 9
List of Tables ............................................................................................................................ 11
Preface ...................................................................................................................................... 13

1. Introduction .......................................................................................................................... 21
   1.1 Rationale for SPU Architecture .......................................................................................... 21

2. SPU Architectural Overview ............................................................................................... 23
   2.1 Data Representation ........................................................................................................... 23
       2.1.1 Byte Ordering ............................................................................................................. 23
   2.2 Data Layout in Registers .................................................................................................. 26
   2.3 Instruction Formats .......................................................................................................... 26

3. Memory - Load/Store Instructions ....................................................................................... 28
   3.1 Local Store ....................................................................................................................... 28
       Load Quadword (d-form) ....................................................................................................... 29
       Load Quadword (x-form) ..................................................................................................... 30
       Load Quadword (a-form) ..................................................................................................... 31
       Load Quadword Instruction Relative (a-form) .................................................................... 32
       Store Quadword (d-form) ................................................................................................... 33
       Store Quadword (x-form) ................................................................................................... 34
       Store Quadword (a-form) ................................................................................................... 35
       Store Quadword Instruction Relative (a-form) .................................................................... 36
       Generate Controls for Byte Insertion (d-form) ................................................................. 37
       Generate Controls for Byte Insertion (x-form) ................................................................. 38
       Generate Controls for Halfword Insertion (d-form) ......................................................... 39
       Generate Controls for Halfword Insertion (x-form) ......................................................... 40
       Generate Controls for Word Insertion (d-form) ............................................................... 41
       Generate Controls for Word Insertion (x-form) ............................................................... 42
       Generate Controls for Doubleword Insertion (d-form) .................................................... 43
       Generate Controls for Doubleword Insertion (x-form) .................................................... 44

4. Constant-Formation Instructions ......................................................................................... 45
   Immediate Load Halfword ...................................................................................................... 46
   Immediate Load Halfword Upper ........................................................................................... 47
   Immediate Load Word ............................................................................................................ 48
   Immediate Load Address ....................................................................................................... 49
   Immediate Or Halfword Lower .............................................................................................. 50
   Form Select Mask for Bytes Immediate ............................................................................... 51

5. Integer and Logical Instructions ......................................................................................... 52
   Add Halfword ......................................................................................................................... 53
   Add Halfword Immediate ...................................................................................................... 54
Synergistic Processor Unit

   Add Word ................................................................. 55
   Add Word Immediate .................................................. 56
   Subtract From Halfword ............................................. 57
   Subtract From Halfword Immediate ............................. 58
   Subtract From Word .................................................. 59
   Subtract From Word Immediate ................................. 60
   Add Extended ............................................................ 61
   Carry Generate .......................................................... 62
   Carry Generate Extended .......................................... 63
   Subtract From Extended ............................................ 64
   Borrow Generate ....................................................... 65
   Borrow Generate Extended ....................................... 66
   Multiply ................................................................. 67
   Multiply Unsigned ...................................................... 68
   Multiply Immediate ................................................... 69
   Multiply Unsigned Immediate ....................................... 70
   Multiply and Add ....................................................... 71
   Multiply High .......................................................... 72
   Multiply and Shift Right ............................................ 73
   Multiply High High .................................................... 74
   Multiply High High and Add ....................................... 75
   Multiply High High Unsigned ..................................... 76
   Multiply High High Unsigned and Add ......................... 77
   Count Leading Zeros ................................................ 78
   Count Ones in Bytes ................................................ 79
   Form Select Mask for Bytes ....................................... 80
   Form Select Mask for Halfwords ................................. 81
   Form Select Mask for Words ....................................... 82
   Gather Bits from Bytes ............................................. 83
   Gather Bits from Halfwords ....................................... 84
   Gather Bits from Words ............................................ 85
   Average Bytes ......................................................... 86
   Absolute Differences of Bytes ................................. 87
   Sum Bytes into Halfwords ......................................... 88
   Extend Sign Byte to Halfword .................................... 89
   Extend Sign Halfword to Word ................................. 90
   Extend Sign Word to Doubleword ............................. 91
   And ................................................................. 92
   And with Complement ............................................. 93
   And Byte Immediate ............................................... 94
   And Halfword Immediate ......................................... 95
   And Word Immediate ............................................. 96
   Or ................................................................. 97
   Or with Complement ............................................... 98
   Or Byte Immediate .................................................. 99
   Or Halfword Immediate .......................................... 100
   Or Word Immediate ............................................... 101
   Or Across ............................................................ 102
   Exclusive Or ......................................................... 103
   Exclusive Or Byte Immediate ................................. 104

6. Shift and Rotate Instructions ................................................................. 112
    Shift Left Halfword ................................................................. 113
    Shift Left Halfword Immediate ........................................ 114
    Shift Left Word ................................................................. 115
    Shift Left Word Immediate ................................................ 116
    Shift Left Quadword by Bits ........................................... 117
    Shift Left Quadword by Bits Immediate ......................... 118
    Shift Left Quadword by Bytes ........................................ 119
    Shift Left Quadword by Bytes Immediate .................. 120
    Shift Left Quadword by Bytes from Bit Shift Count .... 121
    Rotate Halfword .............................................................. 122
    Rotate Halfword Immediate .............................................. 123
    Rotate Word ................................................................. 124
    Rotate Word Immediate .................................................... 125
    Rotate Quadword by Bytes ............................................. 126
    Rotate Quadword by Bytes Immediate ......................... 127
    Rotate Quadword by Bytes from Bit Shift Count .... 128
    Rotate Quadword by Bits .................................................. 129
    Rotate Quadword by Bits Immediate ............................. 130
    Rotate and Mask Halfword .................................................. 131
    Rotate and Mask Halfword Immediate ............................ 132
    Rotate and Mask Word .......................................................... 133
    Rotate and Mask Word Immediate ..................................... 134
    Rotate and Mask Quadword by Bytes ......................... 135
    Rotate and Mask Quadword by Bytes Immediate .... 136
    Rotate and Mask Quadword Bytes from Bit Shift Count ... 137
    Rotate and Mask Quadword by Bits .............................. 138
    Rotate and Mask Quadword by Bits Immediate ........ 139
    Rotate and Mask Algebraic Halfword ......................... 140
    Rotate and Mask Algebraic Halfword Immediate ........ 141
    Rotate and Mask Algebraic Word ...................................... 142
    Rotate and Mask Algebraic Word Immediate ................ 143

7. Compare, Branch, and Halt Instructions ............................................... 144
    Halt If Equal ................................................................. 145
    Halt If Equal Immediate ................................................ 146
    Halt If Greater Than ...................................................... 147
    Halt If Greater Than Immediate .................................... 148
    Halt If Logically Greater Than ...................................... 149
    Halt If Logically Greater Than Immediate ................ 150
    Compare Equal Byte .......................................................... 151
    Compare Equal Byte Immediate .................................... 152

8. Hint-for-Branch Instructions ................................................................. 185
   Hint for Branch (r-form) ................................................................. 186
   Hint for Branch (e-form) ................................................................. 187
   Hint for Branch Relative ............................................................... 188

9. Floating-Point Instructions ................................................................. 189
   9.1 Single Precision (Extended-Range Mode) .................................. 189
   9.2 Double Precision ................................................................. 190
      9.2.1 Conversions Between Single and Double-Precision Format ................ 191
      9.2.2 Exception Conditions ....................................................... 191
   9.3 Floating-Point Status and Control Register (FPSCR) .................. 193
      Floating Add ............................................................................ 195
      Double Floating Add ............................................................... 196
      Floating Subtract ................................................................. 197
      Double Floating Subtract ....................................................... 198
      Floating Multiply ..................................................................... 199
      Double Floating Multiply .................................................... 200

10. Control Instructions ................................................................. 225
   Stop and Signal .......................................................... 226
   Stop and Signal with Dependencies .......................... 227
   No Operation (Load) .................................................. 228
   No Operation (Execute) ............................................. 229
   Synchronize ......................................................... 230
   Synchronize Data .................................................. 231
   Move from Special-Purpose Register .................. 232
   Move to Special-Purpose Register ...................... 233

11. Channel Instructions ............................................................... 234
   Read Channel .......................................................... 235
   Read Channel Count ............................................... 236
   Write Channel .......................................................... 237

12. SPU Interrupt Facility ............................................................... 238
   12.1 SPU Interrupt Handler .................................. 238
   12.2 SPU Interrupt Facility Channels .................. 239

13. Synchronization and Ordering ..................................................... 240
   13.1 Speculation, Reordering, and Caching SPU Local Store Access .......................... 241
   13.2 Internal Execution State .................................. 241
   13.3 Synchronization Primitives .................................. 241
   13.4 Caching SPU Local Store Access .................. 242
   13.5 Self-Modifying Code ........................................ 243
   13.6 External Local Store Access .......................... 243
   13.7 Speculation and Reordering of Channel Reads and Channel Writes .......................... 244
   13.8 Channel Interface with External Device ........................................................................... 244
   13.9 Execution State Set by an SPU Program through the Channel Interface .................. 244
   13.10 Execution State Set by an External Device ................................................................. 245

Appendix A. Programming Examples ................................................................................... 247
  A.1 Conversion from Single Precision to Double Precision .................................................. 247
  A.2 Conversion from Double Precision to Single Precision .................................................. 248

Appendix B. Instruction Table Sorted by Instruction Mnemonic ............................................ 249

Appendix C. Details of the Compute-Mask Instructions ....................................................... 255

Revision Log ........................................................................................................................ 257
List of Figures

Figure i. Format of an Instruction Description .......................................................... 15
Figure 2-1. Bit and Byte Numbering of Halfwords .................................................. 24
Figure 2-2. Bit and Byte Numbering of Words ....................................................... 24
Figure 2-3. Bit and Byte Numbering of Doublewords .......................................... 24
Figure 2-4. Bit and Byte Numbering of Quadwords ............................................ 25
Figure 2-5. Register Layout of Data Types ......................................................... 26
Figure 2-6. RR Instruction Format ................................................................. 26
Figure 2-7. RRR Instruction Format ............................................................... 26
Figure 2-8. RI7 Instruction Format ................................................................. 26
Figure 2-9. RI10 Instruction Format ................................................................. 27
Figure 2-10. RI16 Instruction Format ................................................................. 27
Figure 2-11. RI18 Instruction Format ................................................................. 27
Figure 13-1. Systems with Multiple Accesses to Local Store ......................... 240
List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>i</td>
<td>Temporary Names Used in the RTL and Their Widths</td>
<td>18</td>
</tr>
<tr>
<td>ii</td>
<td>Instruction Fields</td>
<td>19</td>
</tr>
<tr>
<td>iii</td>
<td>Instruction Operation Notations</td>
<td>20</td>
</tr>
<tr>
<td>1-1</td>
<td>Key Features of the SPU ISA Architecture and Implementation</td>
<td>21</td>
</tr>
<tr>
<td>2-1</td>
<td>Bit and Byte Numbering Figures</td>
<td>24</td>
</tr>
<tr>
<td>3-1</td>
<td>Example LSLR Values and Corresponding Local Store Sizes</td>
<td>28</td>
</tr>
<tr>
<td>5-1</td>
<td>Binary Values in Register RC and Byte Results</td>
<td>111</td>
</tr>
<tr>
<td>9-1</td>
<td>Single-Precision (Extended-Range Mode) Minimum and Maximum Values</td>
<td>189</td>
</tr>
<tr>
<td>9-2</td>
<td>Instructions and Exception Settings</td>
<td>190</td>
</tr>
<tr>
<td>9-3</td>
<td>Double-Precision (IEEE Mode) Minimum and Maximum Values</td>
<td>190</td>
</tr>
<tr>
<td>9-4</td>
<td>Single-Precision (IEEE Mode) Minimum and Maximum Values</td>
<td>191</td>
</tr>
<tr>
<td>9-5</td>
<td>Instructions and Exception Settings</td>
<td>193</td>
</tr>
<tr>
<td>12-1</td>
<td>Feature Bits [D] and [E] Settings and Results</td>
<td>238</td>
</tr>
<tr>
<td>13-1</td>
<td>Local Store Accesses</td>
<td>240</td>
</tr>
<tr>
<td>13-2</td>
<td>Synchronization Instructions</td>
<td>242</td>
</tr>
<tr>
<td>13-3</td>
<td>Synchronizing Multiple Accesses to Local Store</td>
<td>242</td>
</tr>
<tr>
<td>13-4</td>
<td>Synchronizing through Local Store</td>
<td>243</td>
</tr>
<tr>
<td>13-5</td>
<td>Synchronizing through Channel Interface</td>
<td>244</td>
</tr>
<tr>
<td>B-1</td>
<td>Instructions Sorted by Mnemonic</td>
<td>249</td>
</tr>
<tr>
<td>C-1</td>
<td>Byte Insertion: Rightmost 4 Bits of the Effective Address and Created Mask</td>
<td>255</td>
</tr>
<tr>
<td>C-2</td>
<td>Halfword Insertion: Rightmost 4 Bits of the Effective Address and Created Mask</td>
<td>255</td>
</tr>
<tr>
<td>C-3</td>
<td>Word Insertion: Rightmost 4 Bits of the Effective Address and Created Mask</td>
<td>256</td>
</tr>
<tr>
<td>C-4</td>
<td>Doubleword Insertion: Rightmost 4 Bits of Effective Address and Created Mask</td>
<td>256</td>
</tr>
</tbody>
</table>
Preface

This document describes the Synergistic Processor Unit (SPU) Instruction Set Architecture (ISA) as it relates to the Cell Broadband Engine Architecture (CBEA).

Who Should Read This Document

This document is intended for designers who plan to develop products using the SPU ISA. Readers of this document should be familiar with the documents listed in Related Publications on page 14.

Document Organization

<table>
<thead>
<tr>
<th>Document Section</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Front Matter</td>
<td>Title Page: Document classification, version number, release date, and copyright and disclaimer information.<br>
Contents<br>
List of Figures<br>
List of Tables<br>
Preface: Describes this document, lists related publications, outlines conventions and notations, explains how to use the instruction descriptions, and provides other general information.</td>
</tr>
<tr>
<td>Section 1 Introduction on page 21</td>
<td>Provides a high-level description of the SPU architecture and its purpose.</td>
</tr>
<tr>
<td>Section 2 SPU Architectural Overview on page 23</td>
<td>Provides an overview of the SPU architecture.</td>
</tr>
<tr>
<td>Section 3 Memory - Load/Store Instructions on page 28</td>
<td>Lists and describes the SPU load/store instructions.</td>
</tr>
<tr>
<td>Section 4 Constant-Formation Instructions on page 45</td>
<td>Lists and describes the SPU constant-formation instructions.</td>
</tr>
<tr>
<td>Section 5 Integer and Logical Instructions on page 52</td>
<td>Lists and describes the SPU integer and logical instructions.</td>
</tr>
<tr>
<td>Section 6 Shift and Rotate Instructions on page 112</td>
<td>Lists and describes the SPU shift and rotate instructions.</td>
</tr>
<tr>
<td>Section 7 Compare, Branch, and Halt Instructions on page 144</td>
<td>Lists and describes the SPU compare, branch, and halt instructions.</td>
</tr>
<tr>
<td>Section 8 Hint-for-Branch Instructions on page 185</td>
<td>Lists and describes the SPU hint-for-branch instructions.</td>
</tr>
<tr>
<td>Section 9 Floating-Point Instructions on page 189</td>
<td>Lists and describes the SPU floating-point instructions.</td>
</tr>
<tr>
<td>Section 10 Control Instructions on page 225</td>
<td>Lists and describes the SPU control instructions.</td>
</tr>
<tr>
<td>Section 11 Channel Instructions on page 234</td>
<td>Describes the instructions used to communicate between the SPU and external devices through the channel interfaces.</td>
</tr>
<tr>
<td>Section 12 SPU Interrupt Facility on page 238</td>
<td>Describes the SPU interrupt facility.</td>
</tr>
<tr>
<td>Section 13 Synchronization and Ordering on page 240</td>
<td>Describes the SPU sequentially ordered programming model.</td>
</tr>
<tr>
<td>Appendix A Programming Examples on page 247</td>
<td>Contains several SPU programming examples.</td>
</tr>
<tr>
<td>Appendix B Instruction Table Sorted by Instruction Mnemonic on page 249</td>
<td>Lists the SPU instructions sorted by their mnemonics.</td>
</tr>
<tr>
<td>Appendix C Details of the Compute-Mask Instructions on page 255</td>
<td>Provides the details of the masks that are generated by the compute-mask instructions.</td>
</tr>
<tr>
<td>Revision Log on page 257</td>
<td>Lists revisions made to this document.</td>
</tr>
</tbody>
</table>
Related Publications

The following is a list of reference materials for the SPU ISA.

<table>
<thead>
<tr>
<th>Title</th>
<th>Version</th>
<th>Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cell Broadband Engine Architecture</td>
<td>1.0</td>
<td>August 2005</td>
</tr>
<tr>
<td>PowerPC® User Instruction Set Architecture, Book I</td>
<td>2.02</td>
<td>January 26, 2005</td>
</tr>
<tr>
<td>PowerPC Virtual Environment Architecture, Book II</td>
<td>2.02</td>
<td>January 26, 2005</td>
</tr>
<tr>
<td>PowerPC Operating Environment Architecture, Book III</td>
<td>2.02</td>
<td>January 26, 2005</td>
</tr>
</tbody>
</table>
How to Use the Instruction Descriptions

Figure i illustrates how to use the instruction descriptions provided in this document.

Figure i. Format of an Instruction Description

Instruction Name: Load Quadword (d-form)
Instruction Mnemonic: LQD
Instruction Operands: rt, symbol(ra)
Instruction Format:

<table>
<thead>
<tr>
<th>0 0 1 1 0 1 0 0</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0:7</td>
<td>8:17</td>
<td>18:24</td>
<td>25:31</td>
</tr>
</tbody>
</table>

Instruction OpCode (Binary): 0 0 1 1 0 1 0 0 (bits 0 through 7 of the instruction format)

Instruction Description:
The effective address is computed by adding 16 times the signed value in the I10 field to the preferred slot of register RA and forcing the rightmost 4 bits of the sum to 0. The 16 bytes of data at this address are placed into register RT. This instruction is computed using the following:

- \( I \leftarrow \text{RepLeftBit}(I10, 32) \)
- \( EA \leftarrow RA^{0:3} + 16 * I \)
- \( LSA \leftarrow EA \& LSLR \& \text{0xFFFFFFF0} \)
- \( RT \leftarrow \text{LocStor}(LSA, 16) \)
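
The address computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the architected definition: the 256 KB local store size, the name `LSLR_MASK` (standing in for the Local Store Limit Register), and the function name `lqd` are all hypothetical choices made for the example.

```python
# Minimal sketch of the Load Quadword (d-form) address computation.
# Assumptions (not from the manual): a 256 KB local store and a limit
# mask equal to its size minus one.
LOCAL_STORE_SIZE = 256 * 1024
LSLR_MASK = LOCAL_STORE_SIZE - 1


def lqd(local_store, ra_preferred, i10):
    """Return (address, 16 bytes) selected by the I10 field and RA."""
    i10 &= 0x3FF                                   # keep the 10-bit field
    signed = i10 - 0x400 if i10 & 0x200 else i10   # replicate the left bit
    ea = (ra_preferred + 16 * signed) & 0xFFFFFFFF # 32-bit wraparound
    lsa = ea & LSLR_MASK & 0xFFFFFFF0              # force rightmost 4 bits to 0
    return lsa, bytes(local_store[lsa:lsa + 16])


store = bytearray(LOCAL_STORE_SIZE)
store[0x100:0x110] = bytes(range(16))
address, quadword = lqd(store, 0x11F, 0x3FF)       # I10 = -1
print(hex(address), quadword.hex())
```

Because the rightmost 4 bits are forced to 0, an unaligned value in RA still selects the enclosing aligned quadword.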
Conventions and Notations Used in This Manual

Byte Ordering

Throughout this document, standard IBM big-endian notation is used, meaning that bytes are numbered in ascending order from left to right. Big-endian and little-endian byte ordering are described in the Cell Broadband Engine Architecture.
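
As a small illustration of this numbering, the following Python sketch returns byte `p` of a value, counting from the left. The helper name `byte_of` and the default 4-byte width are assumptions made for the example.

```python
def byte_of(value, p, width_bytes=4):
    """Return byte p of value, where byte 0 is the leftmost (most-significant)."""
    if not 0 <= p < width_bytes:
        raise IndexError("byte index out of range")
    return value.to_bytes(width_bytes, "big")[p]


word = 0x0A0B0C0D
for p in range(4):
    print(p, hex(byte_of(word, p)))
```

With this convention, byte 0 of `0x0A0B0C0D` is `0x0A`, whereas a little-endian view would place `0x0D` first.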

Bit Ordering

Bits are numbered in ascending order from left to right with bit 0 representing the most-significant bit (MSb) and bit 31 the least-significant bit (LSb).

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```
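
Because bit 0 is the most-significant bit, extracting a bit or a bit range means shifting by the distance from the right end. The sketch below shows one way to do this in Python; the helper names `bit` and `bits` and the default 32-bit width are assumptions for illustration.

```python
def bit(value, p, width=32):
    """Return bit p of value, where bit 0 is the most-significant bit."""
    return (value >> (width - 1 - p)) & 1


def bits(value, p, q, width=32):
    """Return bits p through q inclusive (p <= q), MSb-0 numbering."""
    count = q - p + 1
    return (value >> (width - 1 - q)) & ((1 << count) - 1)


word = 0x12345678
print(bit(word, 3), hex(bits(word, 0, 7)), hex(bits(word, 24, 31)))
```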

Bit Encoding

The notation for bit encoding is as follows:

- Hexadecimal values are preceded by an “x” and enclosed in single quotation marks.
  For example: x'0A00'.
- Binary values in sentences appear in single quotation marks.
  For example: ‘1010’.
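
For readers who process values written in this notation, the following Python sketch converts both forms to integers. The function name `parse_literal` is hypothetical, and treating curly and straight quotation marks alike is an assumption made for convenience.

```python
def parse_literal(text):
    """Convert x'0A00' (hexadecimal) or '1010' (binary) notation to an int."""
    s = text.strip().replace("\u2018", "'").replace("\u2019", "'")
    if s.startswith("x'") and s.endswith("'"):
        return int(s[2:-1], 16)   # hexadecimal: preceded by x
    if s.startswith("'") and s.endswith("'"):
        return int(s[1:-1], 2)    # binary: quotation marks only
    raise ValueError(f"not a recognized bit-encoding literal: {text!r}")


print(parse_literal("x'0A00'"), parse_literal("'1010'"))
```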

Instructions, Mnemonics, and Operands

Instruction mnemonics are written in bold type. For example, sync for the synchronize instruction.

As shown in Figure i on page 15, the description of each instruction in this document includes the mnemonic and a formatted list of operands. In addition, it provides a sample assembler language statement showing the format supported by the assembler.
Notations, Encoding, and Referencing

Referencing Registers or Channels, Fields, and Bit Ranges

Registers and channels are referenced by their full name or by their mnemonic, which is also called the short name. Fields are referenced by their field name or by their bit position.

Usually, the register mnemonic is followed by the field name or bit position enclosed in brackets. For example: MSR[R]. An equal sign followed by a value indicates the value to which the field is set. For example: MSR[R] = 0. When referencing a range of bit numbers, the starting and ending bit numbers are enclosed in brackets and separated by a colon. For example: [0:34].

<table>
<thead>
<tr>
<th>Type of Reference</th>
<th>Format</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reference to a specific register and a specific field using the register short name and the field name</td>
<td>Register_Short_Name[Field_Name]</td>
<td>MSR[R]</td>
</tr>
<tr>
<td>Reference to a field using the field name</td>
<td>[Field_Name]</td>
<td>[R]</td>
</tr>
<tr>
<td>Reference to a specific register and to multiple fields using the register short name and the field names</td>
<td>Register_Short_Name[Field_Name1, Field_Name2]</td>
<td>MSR[FE0, FE1]</td>
</tr>
<tr>
<td>Reference to a specific register and to multiple fields using the register short name and the bit positions</td>
<td>Register_Short_Name[Bit_Number, Bit_Number]</td>
<td>MSR[52, 55]</td>
</tr>
<tr>
<td rowspan="2">Reference to a specific register and to a field using the register short name and the bit position or the bit range</td>
<td>Register_Short_Name[Bit_Number]</td>
<td>MSR[52]</td>
</tr>
<tr>
<td>Register_Short_Name[Starting_Bit_Number:Ending_Bit_Number]</td>
<td>MSR[39:44]</td>
</tr>
<tr>
<td rowspan="3">A field name followed by an equal sign (=) and a value indicates the value for that field.</td>
<td>Register_Short_Name[Field_Name]=n <sup>1</sup></td>
<td>MSR[FE0]=1<br>MSR[FE]=x'1'</td>
</tr>
<tr>
<td>Register_Short_Name[Bit_Number]=n <sup>1</sup></td>
<td>MSR[52]=0<br>MSR[52]=x'0'</td>
</tr>
<tr>
<td>Register_Short_Name[Starting_Bit_Number:Ending_Bit_Number]=n <sup>1</sup></td>
<td>MSR[39:43]=‘10010’<br>MSR[39:43]=x'11'</td>
</tr>
</tbody>
</table>

1. Where \( n \) is the binary or hex value for the field or bits specified in the brackets.
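
To make the `Register[Start:End]=n` convention concrete, the sketch below writes a value into a bit range of a register image. It assumes a 64-bit register (the bit numbers in the examples exceed 31) and uses a hypothetical helper name, `set_field`; neither is taken from this document.

```python
def set_field(reg, start, end, n, width=64):
    """Return reg with bits start:end (MSb-0 numbering) set to n."""
    count = end - start + 1
    if n < 0 or n >> count:
        raise ValueError("value does not fit in the field")
    shift = width - 1 - end               # distance of the field from the right end
    mask = ((1 << count) - 1) << shift
    return (reg & ~mask) | (n << shift)


msr = 0
msr = set_field(msr, 52, 52, 1)           # MSR[52]=1
msr = set_field(msr, 39, 43, 0b10010)     # MSR[39:43]='10010'
print(hex(msr))
```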
Register Transfer Language (RTL) Instruction Definitions

This document generally follows the terminology and notation in the PowerPC Architecture™. The following terms and notations are used in this document.

- Quadwords are 128 bits.
- Doublewords are 64 bits.
- Words are 32 bits.
- Halfwords are 16 bits.
- Bytes are 8 bits.
- Numbers are generally shown in decimal format.
- The binary point for fixed-point format data is at the right end of the field or value.
  - Operations are performed with the binary points aligned, even if the fields are of different widths.
- RTL descriptions are provided for most instructions and are intended to clarify the verbal description, which is the primary definition. The following conventions apply to the RTL:
  - \(\text{LocStor}(x,y)\) refers to the \(y\) bytes starting at local storage location \(x\).
  - \(\text{RepLeftBit}(x,y)\) returns the value \(x\) with its leftmost bit replicated enough times to produce a total length of \(y\) bits.
  - The program counter (PC) contains the address of the instruction being executed when used as an operand, or the address of the next instruction when used as a target.
  - Temporary names used in the RTL descriptions have the widths shown in Table i.

Table i. Temporary Names Used in the RTL and Their Widths

<table>
<thead>
<tr>
<th>Temporary Name</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td>b, byte, byte1, byte2, c</td>
<td>8 bits</td>
</tr>
<tr>
<td>r, s</td>
<td>16 bits</td>
</tr>
<tr>
<td>bbbb, EA, QA, t, t0, t1, t2, t3, u, v</td>
<td>32 bits</td>
</tr>
<tr>
<td>Q, R, Memdata</td>
<td>128 bits</td>
</tr>
<tr>
<td>Rconcat</td>
<td>256 bits</td>
</tr>
<tr>
<td>i, j, k, m</td>
<td>Meta (for description only)</td>
</tr>
</tbody>
</table>
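
The two RTL helper functions described above can be modeled in a few lines of Python. This is a sketch only: in the RTL the width of \(x\) is implicit, so the explicit `x_bits` parameter, the function names, and the use of a `bytes` object for local storage are assumptions made for the example.

```python
def rep_left_bit(x, x_bits, y_bits):
    """Replicate the leftmost bit of an x_bits-wide value out to y_bits."""
    if y_bits < x_bits:
        raise ValueError("target width must not be smaller than source width")
    x &= (1 << x_bits) - 1
    if x >> (x_bits - 1):                          # leftmost bit is 1
        x |= ((1 << (y_bits - x_bits)) - 1) << x_bits
    return x


def loc_stor(local_store, x, y):
    """Return the y bytes starting at local storage location x."""
    return bytes(local_store[x:x + y])


print(hex(rep_left_bit(0x3FF, 10, 32)))            # all ones after extension
print(hex(rep_left_bit(0x1FF, 10, 32)))            # positive value is unchanged
print(loc_stor(bytes(range(32)), 16, 4).hex())
```

Extending a narrower field this way before an addition is one way to keep binary points aligned when operands have different widths.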
Instruction Fields

The instructions in this document can contain one or more of the fields described in Table ii.

Table ii. Instruction Fields

<table>
<thead>
<tr>
<th>Field</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>/, //, ///</td>
<td>Reserved field in an instruction. Reserved fields are presently unused and should contain zeros, even where this is not checked by the architecture, to allow for future use without causing incompatibility.</td>
</tr>
<tr>
<td>I7</td>
<td>7-bit immediate</td>
</tr>
<tr>
<td>I8</td>
<td>8-bit immediate</td>
</tr>
<tr>
<td>I10</td>
<td>10-bit immediate</td>
</tr>
<tr>
<td>I16</td>
<td>16-bit immediate</td>
</tr>
<tr>
<td>OP or OPCD</td>
<td>Opcode</td>
</tr>
<tr>
<td>RA[18-24]</td>
<td>Field used to specify a general-purpose register (GPR) to be used as a source or as a target.</td>
</tr>
<tr>
<td>RB[11-17]</td>
<td>Field used to specify a GPR to be used as a source or as a target.</td>
</tr>
<tr>
<td>RC[4-10]</td>
<td>Field used to specify a GPR to be used as a source or as a target.</td>
</tr>
<tr>
<td>RT[25-31]</td>
<td>Field used to specify a GPR to be used as a target.</td>
</tr>
</tbody>
</table>
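
The register fields in Table ii can be extracted from a 32-bit instruction word with shifts and masks. The Python sketch below uses the RB, RA, and RT positions from the table; treating bits 0 through 10 as the opcode, the opcode value used in the example, and the function names are assumptions for illustration only.

```python
def field(insn, first, last):
    """Return bits first through last of a 32-bit word (bit 0 is the MSb)."""
    return (insn >> (31 - last)) & ((1 << (last - first + 1)) - 1)


def encode(op, rb, ra, rt):
    """Pack an assumed 11-bit opcode and three 7-bit register fields."""
    return (op << 21) | (rb << 14) | (ra << 7) | rt


def decode(insn):
    """Split a word into the opcode and the RB, RA, and RT fields."""
    return {
        "op": field(insn, 0, 10),    # assumed opcode position
        "rb": field(insn, 11, 17),   # RB[11-17]
        "ra": field(insn, 18, 24),   # RA[18-24]
        "rt": field(insn, 25, 31),   # RT[25-31]
    }


word = encode(0b00011000000, rb=5, ra=4, rt=3)   # hypothetical opcode value
print(hex(word), decode(word))
```

Each register field is 7 bits wide, which is enough to name any of 128 registers.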
Instruction Operation Notations

The instructions in this document use the notations described in *Table iii*. This table is ordered with respect to the order of precedence, where the first operator in the table binds most tightly.

*Table iii. Instruction Operation Notations*

<table>
<thead>
<tr>
<th>Notation</th>
<th>Description</th>
<th>See Note</th>
</tr>
</thead>
<tbody>
<tr>
<td>(X_p)</td>
<td>Means bit (p) of register or value field (X)</td>
<td></td>
</tr>
<tr>
<td>(X_{p:q})</td>
<td>Means bits (p) through (q) inclusive of register or value (X)</td>
<td></td>
</tr>
<tr>
<td>(X^p)</td>
<td>Means byte (p) of register or value (X)</td>
<td></td>
</tr>
<tr>
<td>(X^{p:q})</td>
<td>Means bytes (p) through (q) inclusive of register or value (X)</td>
<td></td>
</tr>
<tr>
<td>(X_{p::q})</td>
<td>Means bits (p) and the bits that follow for a total of (q) bits</td>
<td></td>
</tr>
<tr>
<td>(X^{p::q})</td>
<td>Means bytes (p) and the bytes that follow for a total of (q) bytes</td>
<td></td>
</tr>
<tr>
<td>(p^0) and (p^1)</td>
<td>Mean a string of (p) 0 bits and of (p) 1 bits.</td>
<td>1</td>
</tr>
<tr>
<td>(\neg)</td>
<td>unary NOT operator</td>
<td>2</td>
</tr>
<tr>
<td>(*)</td>
<td>Signed multiplication</td>
<td>3</td>
</tr>
<tr>
<td>(|*|)</td>
<td>Unsigned multiplication</td>
<td></td>
</tr>
<tr>
<td>(+)</td>
<td>Twos complement addition</td>
<td>2</td>
</tr>
<tr>
<td>(-)</td>
<td>Twos complement subtraction, unary minus</td>
<td>2</td>
</tr>
<tr>
<td>(=, \neq)</td>
<td>Equals and Not Equals relations</td>
<td>2</td>
</tr>
<tr>
<td>(&lt;, \leq, &gt;, \geq)</td>
<td>Signed comparison relations</td>
<td></td>
</tr>
<tr>
<td>(&lt;u &gt;u)</td>
<td>Unsigned comparison relations</td>
<td>2</td>
</tr>
<tr>
<td>&amp;</td>
<td>AND</td>
<td>2</td>
</tr>
<tr>
<td>|</td>
<td>OR</td>
<td></td>
</tr>
<tr>
<td>(\oplus)</td>
<td>Exclusive-Or ((a &amp; \neg b \mid \neg a &amp; b))</td>
<td>2</td>
</tr>
<tr>
<td>(\leftarrow)</td>
<td>Assignment</td>
<td></td>
</tr>
<tr>
<td>LSA</td>
<td>Local Store Address</td>
<td></td>
</tr>
<tr>
<td>LSLR</td>
<td>Local Store Limit Register</td>
<td></td>
</tr>
<tr>
<td>LocStor(LSA, width)</td>
<td>Contents of width bytes of the local store at address LSA</td>
<td></td>
</tr>
<tr>
<td>if... then... else...</td>
<td>Conditional execution. Indenting shows range. Else is optional.</td>
<td></td>
</tr>
<tr>
<td>for, do</td>
<td>Do loop. Indenting shows range. To or by clauses specify incrementing an iteration variable, and a while clause provides termination conditions.</td>
<td></td>
</tr>
<tr>
<td>/, //, ///</td>
<td>Reserved field in an instruction. Reserved fields are presently unused and should contain zeros, even where this is not checked by the architecture, to allow for future use without causing incompatibility.</td>
<td></td>
</tr>
</tbody>
</table>

1. This is different from the PowerPC notation, which uses a leading superscript rather than a subscript.
2. The result of this operator is a bit vector of the same width as the input operands.
3. The result of this operator is a bit vector of the width of the sum of the operand widths.
1. Introduction

The purpose of the Synergistic Processor Unit (SPU) Instruction Set Architecture (ISA) document is to describe a processor architecture that can fill a void between general-purpose processors and special-purpose hardware. Whereas the objective of general-purpose processor architectures is to achieve the best average performance on a broad set of applications, and the objective of special-purpose hardware is to achieve the best performance on a single application, the purpose of the architecture described in this document is to achieve leadership performance on critical workloads for game, media, and broadband systems. The purpose of the SPU ISA and the Cell Broadband Engine Architecture (CBEA) is to provide information that allows a high degree of control by expert (real-time) programmers while still maintaining ease of programming.

1.1 Rationale for SPU Architecture

Key workloads for the SPU are:

- The graphics pipeline, which includes surface subdivision and rendering
- Stream processing, which includes encoding, decoding, encryption, and decryption
- Modeling, which includes game physics

The implementations of the SPU ISA achieve better performance to cost ratios than general-purpose processors because the SPU ISA implementations require approximately half the power and approximately half the chip area for equivalent performance. This is made possible by the key features of the architecture and implementation listed in Table 1-1.

Table 1-1. Key Features of the SPU ISA Architecture and Implementation

<table>
<thead>
<tr>
<th>Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>128-bit SIMD execution unit organization</td>
<td>Many of the applications mentioned above allow for single-instruction multiple-data (SIMD) concurrency. In an SIMD architecture, the cost (area, power) of fetching and decoding instructions is amortized over the multiple data elements processed. A 128-bit (most commonly 4-way 32-bit) SIMD was chosen for commonality with SIMD processing units in other general-purpose processor architectures and hence the existing code base to support it.</td>
</tr>
<tr>
<td>Software-managed memory</td>
<td>Whereas most processors reduce latency to memory by employing caches, the SPU in the broadband architecture implements a small local memory rather than a cache. This approach requires approximately half the area per byte, and significantly less power per access, as compared to a cache hierarchy. In addition, it provides a high degree of control for real-time programming. Because the latency and instruction overhead of DMA transfers exceed those of servicing a cache miss, this approach achieves an advantage only if the DMA transfer size is sufficiently large and the transfer is sufficiently predictable (that is, the DMA can be issued before the data is needed).</td>
</tr>
<tr>
<td>Load/store architecture to support efficient SRAM design</td>
<td>The SPU ISA microarchitecture is organized to enable efficient implementations that use single-ported (local store) memory.</td>
</tr>
<tr>
<td>Large unified register file</td>
<td>The 128-entry register file in the SPU architecture allows for deeply pipelined high-frequency implementations without requiring register renaming to avoid register starvation. This is especially important when latencies are covered by software loop unrolling or other interleaving techniques. Rename hardware typically consumes a significant fraction of the area and power in modern high-frequency general-purpose processors.</td>
</tr>
<tr>
<td>ISA support to eliminate branches</td>
<td>The SPU ISA defines compare instructions to set masks that can be used in three operand select instructions to create efficient conditional assignments. Such conditional assignments can be used to avoid difficult-to-predict branches.</td>
</tr>
</tbody>
</table>
The SPU “hint for branch” instructions allow programs to avoid a penalty on taken branches when the branch can be predicted sufficiently early. This mechanism achieves an advantage over common branch prediction schemes in that it does not require storing history associated with previous branches and thus saves area and power. The ISA solves the problem associated with hint bits in the branch instructions themselves, where considerable look-ahead (branch scan) in the instruction stream is necessary to process branches early enough that their targets are available when needed.

Much of the code base for game applications assumes a single-precision floating-point format that is distinct from the IEEE 754 format commonly implemented on general-purpose processors. For details on the single-precision format, see Section 9 Floating-Point Instructions on page 189.

Blocking channels for communication with the Synergistic Memory Flow Controller (MFC) or other parts of the system external to the SPU provide an efficient mechanism to wait for the completion of external events without polling or interrupts/wait loops, both of which burn power needlessly.

The SPU does not include certain features common in general-purpose processors. Specifically, the processor does not support a supervisor mode.
2. SPU Architectural Overview

This section provides an overview of the SPU architecture.

The SPU architecture defines a set of 128 general-purpose registers (GPRs), each of which contains 128 data bits. Registers are used to hold fixed-point and floating-point data. Instructions operate on the full width of the register, treating it as multiple operands of the same format.

The SPU supports halfword (16-bit) and word (32-bit) integers in signed format, and provides limited support for 8-bit unsigned integers. The number representation is two's complement.

The SPU supports single-precision (32-bit) and double-precision (64-bit) floating-point data in IEEE 754 format. However, full single-precision IEEE 754 arithmetic is not implemented.

The architecture does not use a condition register. Instead, comparison operations set results that are either 0 (false) or -1 (true), and that are the same width as the operands compared. These results can be used for bitwise masking, the select instruction, or conditional branches.
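
As a rough illustration of how a 0 or -1 comparison result can replace a branch, the following Python sketch models a single 32-bit word slot. The names `compare_gt`, `select_bits`, and `branchless_max` are hypothetical helpers invented for this example, not SPU instructions, and the operands are assumed to be non-negative 32-bit values.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit word slot


def compare_gt(a: int, b: int) -> int:
    """Return all ones (-1 as a 32-bit pattern) if a > b, otherwise 0."""
    return MASK32 if a > b else 0


def select_bits(a: int, b: int, mask: int) -> int:
    """Take each bit from b where mask is 1 and from a where mask is 0."""
    return ((a & ~mask) | (b & mask)) & MASK32


def branchless_max(a: int, b: int) -> int:
    """Conditional assignment built from a compare mask and a select."""
    return select_bits(b, a, compare_gt(a, b))
```

Because the comparison result is as wide as its operands, it can be used directly as the bitwise mask with no conversion step.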

The SPU loads and stores access a private memory called local store. The SPU loads and stores transfer quadwords between GPRs and local store. Implementations can feature varying local store sizes; however, the local store address space is limited to 4 GB.

The SPU can send and receive data to external devices through the channel interface. SPU channel instructions transfer quadwords between GPRs and the channel interface. Up to 128 channels are supported. Two channels are defined to access Save-and-Restore Register 0 (SRR0), which holds the address used by the Interrupt Return instruction (\texttt{iret}). The SPU also supports up to 128 special-purpose registers (SPRs). The Move To Special Purpose Register (\texttt{mtspr}) and Move From Special Purpose Register (\texttt{mfspr}) instructions move 128-bit data between GPRs and SPRs.

The SPU also monitors a status signal called the external condition. The Branch Indirect and Set Link if External Data (\texttt{bisled}) instruction conditionally branches based upon the status of the external condition. The SPU interrupt facility can be configured to branch to an interrupt handler at address 0 when the external condition is true.

2.1 Data Representation

2.1.1 Byte Ordering

The architecture defines:

- An 8-bit byte
- A 16-bit halfword
- A 32-bit word
- A 64-bit doubleword
- A 128-bit quadword

Byte ordering defines how the bytes that make up halfwords, words, doublewords, and quadwords are ordered in memory. The SPU supports most-significant byte (MSB) ordering. With MSB ordering, also called \textit{big endian}, the most-significant byte is located in the lowest addressed byte position in a storage unit (byte 0). Instructions are described in this document as they appear in memory, with successively higher addressed bytes appearing toward the right.
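
A minimal Python sketch of MSB ordering, assuming nothing beyond the description above: the most-significant byte of a word lands at byte 0. The helper names `store_msb_first` and `load_msb_first` are made up for illustration.

```python
def store_msb_first(value: int, size: int) -> bytes:
    """Lay out an unsigned integer with its most-significant byte at byte 0."""
    return value.to_bytes(size, byteorder="big")


def load_msb_first(data: bytes) -> int:
    """Rebuild the integer from bytes stored most-significant byte first."""
    return int.from_bytes(data, byteorder="big")


word = 0x0A0B0C0D
stored = store_msb_first(word, 4)
# Byte 0 holds 0x0A (most significant); byte 3 holds 0x0D (least significant).
print(stored.hex())
```
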
The conventions for bit and byte numbering within the various width storage units are shown in the figures listed in Table 2-1.

**Table 2-1. Bit and Byte Numbering Figures**

<table>
<thead>
<tr>
<th>For a figure that shows…</th>
<th>See…</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bit and Byte Numbering of Halfwords</td>
<td>Figure 2-1 on page 24</td>
</tr>
<tr>
<td>Bit and Byte Numbering of Words</td>
<td>Figure 2-2 on page 24</td>
</tr>
<tr>
<td>Bit and Byte Numbering of Doublewords</td>
<td>Figure 2-3 on page 24</td>
</tr>
<tr>
<td>Bit and Byte Numbering of Quadwords</td>
<td>Figure 2-4 on page 25</td>
</tr>
<tr>
<td>Register Layout of Data Types</td>
<td>Figure 2-5 on page 26</td>
</tr>
</tbody>
</table>

These conventions apply to integer and floating-point data (where the most-significant byte holds the sign and at a minimum the start of the exponent). The figures show byte numbers on the top and bit numbers below.

**Figure 2-1. Bit and Byte Numbering of Halfwords**

MSb          LSb
0            1
\[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\]

**Figure 2-2. Bit and Byte Numbering of Words**

MSb          LSb
0            1            2            3
\[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31\]

**Figure 2-3. Bit and Byte Numbering of Doublewords**

MSb          LSb
0            1            2            3            4            5            6            7
\[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63\]

Figure 2-4. Bit and Byte Numbering of Quadwords


\[Figure: a quadword consists of bytes 0 through 15, numbered from left to right, and bits 0 (MSb) through 127 (LSb).\]

2.2 Data Layout in Registers

All GPRs are 128 bits wide. The leftmost word (bytes 0, 1, 2, and 3) of a register is called the *preferred slot*. When instructions use or produce scalar operands or addresses, the values are in the preferred slot. A set of store assist instructions is available to help store bytes, halfwords, words, and doublewords. *Figure 2-5* illustrates how these data types are laid out in a GPR.

*Figure 2-5. Register Layout of Data Types*

<table>
<thead>
<tr>
<th>Data Type</th>
<th>Bytes Occupied in the Register</th>
</tr>
</thead>
<tbody>
<tr>
<td>BYTE</td>
<td>3</td>
</tr>
<tr>
<td>HALFWORD</td>
<td>2-3</td>
</tr>
<tr>
<td>ADDRESS</td>
<td>0-3 (preferred slot)</td>
</tr>
<tr>
<td>WORD</td>
<td>0-3 (preferred slot)</td>
</tr>
<tr>
<td>DOUBLEWORD</td>
<td>0-7</td>
</tr>
<tr>
<td>QUADWORD</td>
<td>0-15</td>
</tr>
</tbody>
</table>

2.3 Instruction Formats

There are six basic instruction formats. These instructions are all 32 bits long. Minor variations of these formats are also used. Instructions in memory must be aligned on word boundaries. The instruction formats are shown in *Figures 2-6 through 2-11*.

**Note:** The OP code field is presented throughout this document in binary format.

*Figure 2-6. RR Instruction Format*

<table>
<thead>
<tr>
<th>OP</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-10</td>
<td>bits 11-17</td>
<td>bits 18-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>

*Figure 2-7. RRR Instruction Format*

<table>
<thead>
<tr>
<th>OP</th>
<th>RT</th>
<th>RB</th>
<th>RA</th>
<th>RC</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-3</td>
<td>bits 4-10</td>
<td>bits 11-17</td>
<td>bits 18-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>

*Figure 2-8. RI7 Instruction Format*

<table>
<thead>
<tr>
<th>OP</th>
<th>I7</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-10</td>
<td>bits 11-17</td>
<td>bits 18-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>

**Figure 2-9. RI10 Instruction Format**

<table>
<thead>
<tr>
<th>OP</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-7</td>
<td>bits 8-17</td>
<td>bits 18-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>

**Figure 2-10. RI16 Instruction Format**

<table>
<thead>
<tr>
<th>OP</th>
<th>I16</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-8</td>
<td>bits 9-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>

**Figure 2-11. RI18 Instruction Format**

<table>
<thead>
<tr>
<th>OP</th>
<th>I18</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0-6</td>
<td>bits 7-24</td>
<td>bits 25-31</td>
</tr>
</tbody>
</table>
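
The field layouts can be exercised with a small decoder. The following Python sketch assumes that bit 0 is the most-significant bit of the 32-bit instruction word and that the register fields sit where Table ii places them (RB in bits 11-17, RA in bits 18-24, RT in bits 25-31). The function names are hypothetical, and the opcode values used with it are arbitrary bit patterns rather than real SPU opcodes.

```python
def bits(word: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive) of a 32-bit word; bit 0 is the MSb."""
    width = last - first + 1
    shift = 31 - last
    return (word >> shift) & ((1 << width) - 1)


def decode_rr(word: int) -> dict:
    """Split an RR-format word into its OP, RB, RA, and RT fields."""
    return {
        "OP": bits(word, 0, 10),
        "RB": bits(word, 11, 17),
        "RA": bits(word, 18, 24),
        "RT": bits(word, 25, 31),
    }


def decode_ri10(word: int) -> dict:
    """Split an RI10-format word into its OP, I10, RA, and RT fields."""
    return {
        "OP": bits(word, 0, 7),
        "I10": bits(word, 8, 17),
        "RA": bits(word, 18, 24),
        "RT": bits(word, 25, 31),
    }
```

The immediate is returned as a raw unsigned field; sign extension is left to the instruction that interprets it.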
3. Memory - Load/Store Instructions

This section lists and describes the SPU load/store instructions.

3.1 Local Store

The SPU architecture defines a private memory, also called the local store, which is byte-addressed. Load and store instructions combine operands from one or two registers and an immediate value to form the effective address of the memory operand. Only aligned 16-byte-long quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and are assumed to be zero.

The size of the SPU local store address space is $2^{32}$ bytes. However, an implementation generally has a smaller actual memory size. The effective size of the memory is specified by the Local Store Limit Register (LSLR). Implementations can provide methods for accessing the LSLR; however, these methods are outside the scope of the SPU instruction set architecture. Implementations can allow modifications to the LSLR value; however, the LSLR must not change while the SPU is running. Every effective address is ANDed with the LSLR before it is used to reference memory. The LSLR can be used to make the memory appear to be smaller than it is, thus providing compatibility for programs compiled for a smaller memory size. The LSLR value is a mask that controls the effective memory size. This value must have the following properties:

- Limit the effective memory size to be less than or equal to the actual memory size
- Be monotonic, so that the least-significant 4 mask bits are ones and so that there is at most a single transition from ‘1’ to ‘0’ and no transitions from ‘0’ to ‘1’ as the bits are read from the least-significant to the most-significant bit. That is, the value must be $2^n - 1$, where $n = \log_2$ (effective memory size).

The effect of this is that references to memory beyond the last byte of the effective size are wrapped—that is, interpreted modulo the effective size. This definition allows an address to be used for a load before it has been checked for validity, and makes it possible to overlap memory latency with other operations more easily.
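
The wrapping behavior can be sketched in Python as follows. This is only a model of the masking described above; `lslr_for_size` and `effective_address` are hypothetical names, and the effective size is assumed to be a power of two of at least 16 bytes.

```python
def lslr_for_size(size_bytes: int) -> int:
    """Return the LSLR mask (2**n - 1) for a power-of-two effective size."""
    if size_bytes < 16 or size_bytes & (size_bytes - 1):
        raise ValueError("effective size must be a power of two of at least 16")
    return size_bytes - 1


def effective_address(address: int, lslr: int) -> int:
    """AND the address with the LSLR and ignore the rightmost 4 bits."""
    return address & lslr & 0xFFFFFFF0


lslr = lslr_for_size(256 * 1024)                # 0x0003FFFF
wrapped = effective_address(0x00040010, lslr)   # one past the top wraps to 0x10
```
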

Stores of less than a quadword are performed by a load-modify-store sequence. A group of assist instructions is provided for this type of sequence. The names of these assist instructions begin with Generate Controls, and the instructions are described in this section. For example, see Generate Controls for Byte Insertion (d-form) on page 37.

In a typical system configuration, the SPU local store is externally accessible. The possibility therefore exists of SPU memory being modified asynchronously during the course of execution of an SPU program. All references (loads, stores) to local store by an SPU program, and aligned external references to SPU memory, are atomic. Unaligned references are not atomic, and portions of such operations can be observed by a program executing in the SPU. Table 3-1 shows sample LSLR values and the corresponding local store sizes.

<table>
<thead>
<tr>
<th>LSLR</th>
<th>Local Store Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>x'0003 FFFF'</td>
<td>256 KB</td>
</tr>
<tr>
<td>x'0001 FFFF'</td>
<td>128 KB</td>
</tr>
<tr>
<td>x'0000 FFFF'</td>
<td>64 KB</td>
</tr>
<tr>
<td>x'0000 7FFF'</td>
<td>32 KB</td>
</tr>
</tbody>
</table>
Load Quadword (d-form)

\[ \text{lqd } rt, \text{symbol}(ra) \]

The local store address is computed by adding the signed value in the I10 field, with 4 zero bits appended, to the value in the preferred slot of register RA and forcing the rightmost 4 bits of the sum to zero. The 16 bytes at the local store address are placed into register RT. This instruction is computed using the following:

\[
\begin{align*}
\text{LSA} & \leftarrow (\text{RepLeftBit}(I10 \| 0b0000, 32) + \text{RA}^{0:3}) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{RT} & \leftarrow \text{LocStor}(\text{LSA}, 16)
\end{align*}
\]
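
The d-form address computation can be traced with a short Python sketch. It assumes `RepLeftBit(x, 32)` means sign extension of the field to 32 bits by replicating its leftmost bit; `rep_left_bit` and `lqd_address` are hypothetical helper names, and the LSLR value in the test is the 256 KB sample from Table 3-1.

```python
def rep_left_bit(value: int, width: int, out_bits: int = 32) -> int:
    """Extend a width-bit field to out_bits by replicating its leftmost bit."""
    value &= (1 << width) - 1
    if value >> (width - 1):  # leftmost bit of the field is 1
        value |= ((1 << out_bits) - 1) ^ ((1 << width) - 1)
    return value


def lqd_address(i10: int, ra_preferred: int, lslr: int) -> int:
    """Model of LSA for the d-form: (I10 || 0b0000, sign-extended) + RA word 0."""
    offset = rep_left_bit((i10 & 0x3FF) << 4, 14)  # I10 with 4 zero bits appended
    return (offset + (ra_preferred & 0xFFFFFFFF)) & lslr & 0xFFFFFFF0
```

An I10 value of 0x3FF encodes -1, so it addresses the quadword 16 bytes below the base in RA.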
Load Quadword (x-form)

\[ \text{lqx \ rt,ra,rb} \]

The local store address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB and forcing the rightmost 4 bits of the sum to zero. The 16 bytes at the local store address are placed into register RT. This instruction is computed using the following:

\[
\begin{align*}
\text{LSA} & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{RT} & \leftarrow \text{LocStor}(\text{LSA}, 16)
\end{align*}
\]
Load Quadword (a-form)

The value in the I16 field, with 2 zero bits appended and extended on the left with copies of the most-significant bit, is used as the local store address. The 16 bytes at the local store address are loaded into register RT.

\[
\begin{align*}
\text{LSA} & \leftarrow \text{RepLeftBit}(I16 \| 0b00, 32) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{RT} & \leftarrow \text{LocStor}(\text{LSA}, 16)
\end{align*}
\]
Load Quadword Instruction Relative (a-form)

The value in the I16 field, with 2 zero bits appended and extended on the left with copies of the most-significant bit, is added to the program counter (PC) to form the local store address. The 16 bytes at the local store address are loaded into register RT.

\[
\begin{align*}
\text{LSA} & \leftarrow (\text{RepLeftBit}(I16 \| 0b00, 32) + \text{PC}) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{RT} & \leftarrow \text{LocStor}(\text{LSA}, 16)
\end{align*}
\]
Store Quadword (d-form)

\texttt{stqd \ rt,symbol(ra)}

\begin{verbatim}
0 0 1 0 1 0 0
\end{verbatim}

The local store address is computed by adding the signed value in the I10 field, with 4 zero bits appended, to the value in the preferred slot of register RA and forcing the rightmost 4 bits of the sum to zero. The contents of register RT are stored at the local store address.

\begin{verbatim}
LSA ← (RepLeftBit(I10 || 0b0000,32) + RA^{0:3}) & LSLR & 0xFFFFFFF0
LocStor(LSA,16) ← RT
\end{verbatim}
Store Quadword (x-form)

\[
\text{stqx rt,ra,rb}
\]

The local store address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB and forcing the rightmost 4 bits of the sum to zero. The contents of register RT are stored at the local store address.

\[
\begin{align*}
\text{LSA} & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{LocStor}(\text{LSA}, 16) & \leftarrow \text{RT}
\end{align*}
\]
Store Quadword (a-form)

```
stqa rt,symbol
```

The value in the I16 field, with 2 zero bits appended and extended on the left with copies of the most-significant bit, is used as the local store address. The contents of register RT are stored at the location given by the local store address.

```
LSA ← RepLeftBit(I16 || 0b00,32) & LSLR & 0xFFFFFFF0
LocStor(LSA,16) ← RT
```
Store Quadword Instruction Relative (a-form)

The value in the I16 field, with two zero bits appended and extended on the left with copies of the most-significant bit, is added to the program counter (PC) to form the local store address. The contents of register RT are stored at the location given by the local store address.

\[
\begin{align*}
\text{LSA} & \leftarrow (\text{RepLeftBit}(I16 \| 0b00, 32) + \text{PC}) \& \text{LSLR} \& 0xFFFFFFF0 \\
\text{LocStor}(\text{LSA}, 16) & \leftarrow \text{RT}
\end{align*}
\]
Generate Controls for Byte Insertion (d-form)

cbd rt,symbol(ra)

A 4-bit address is computed by adding the value in the signed I7 field to the value in the preferred slot of register RA. The address is used to determine the position of the addressed byte within a quadword. Based on the position, a mask is generated that can be used with the Shuffle Bytes (shufb) instruction to insert a byte at the indicated position within a (previously loaded) quadword. The byte is taken from the rightmost byte position of the preferred slot of the RA operand of the shufb instruction. See Appendix C Details of the Compute-Mask Instructions on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RepLeftBit}(I7, 32)) \& 0x0000000F \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t} & \leftarrow 0x03
\end{align*}
\]
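
To show how the generated mask is meant to be used, the following Python sketch builds the byte-insertion controls and applies them with a deliberately simplified shuffle. The function names are hypothetical. `shufb_simplified` handles only control bytes 0x00 through 0x1F (0x00-0x0F select from the first operand, 0x10-0x1F from the second) and ignores any other control values the real Shuffle Bytes instruction may define.

```python
def cbd(i7: int, ra_preferred: int) -> bytes:
    """Model of the byte-insertion controls: identity mask with 0x03 at byte t."""
    offset = i7 & 0x7F
    if offset & 0x40:          # sign-extend the 7-bit immediate
        offset -= 0x80
    t = (ra_preferred + offset) & 0xF
    mask = bytearray(range(0x10, 0x20))
    mask[t] = 0x03             # byte 3 = rightmost byte of the preferred slot
    return bytes(mask)


def shufb_simplified(ra: bytes, rb: bytes, rc: bytes) -> bytes:
    """Pick each result byte from ra || rb using the low 5 bits of the control."""
    source = ra + rb
    return bytes(source[control & 0x1F] for control in rc)


old_quadword = bytes(range(0xA0, 0xB0))          # previously loaded quadword
new_value = bytes([0, 0, 0, 0x7F]) + bytes(12)   # scalar byte in the preferred slot
merged = shufb_simplified(new_value, old_quadword, cbd(0, 5))
```

Storing `merged` back completes the load-modify-store sequence for a single byte at offset 5.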
Generate Controls for Byte Insertion (x-form)

cbx rt,ra,rb

A 4-bit address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB. The address is used to determine the position of the addressed byte within a quadword. Based on the position, a mask is generated that can be used with the `shufb` instruction to insert a byte at the indicated position within a (previously loaded) quadword. The byte is taken from the rightmost byte position of the preferred slot of the RA operand of the `shufb` instruction. See Appendix C Details of the Compute-Mask Instructions on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& 0x0000000F \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t} & \leftarrow 0x03
\end{align*}
\]
Generate Controls for Halfword Insertion (d-form)

chd rt,symbol(ra)

A 4-bit address is computed by adding the value in the signed I7 field to the value in the preferred slot of register RA and forcing the least-significant bit to zero. The address is used to determine the position of an aligned halfword within a quadword. Based on the position, a mask is generated that can be used with the \texttt{shufb} instruction to insert a halfword at the indicated position within a quadword. The halfword is taken from the rightmost 2 bytes of the preferred slot of the RA operand of the \texttt{shufb} instruction. See Appendix C Details of the Compute-Mask Instructions on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RepLeftBit}(I7, 32)) \& 0x0000000E \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::2} & \leftarrow 0x0203
\end{align*}
\]
Generate Controls for Halfword Insertion (x-form)

chx rt,ra,rb

A 4-bit address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB and forcing the least-significant bit to zero. The address is used to determine the position of an aligned halfword within a quadword. Based on the position, a mask is generated that can be used with the \texttt{shufb} instruction to insert a halfword at the indicated position within a quadword. The halfword is taken from the rightmost 2 bytes of the preferred slot of the RA operand of the \texttt{shufb} instruction. See \textit{Appendix C Details of the Compute-Mask Instructions} on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& 0x0000000E \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::2} & \leftarrow 0x0203
\end{align*}
\]
Generate Controls for Word Insertion (d-form)

cwd rt,symbol(ra)

<table>
<thead>
<tr>
<th>OP (bits 0-10)</th>
<th>I7 (bits 11-17)</th>
<th>RA (bits 18-24)</th>
<th>RT (bits 25-31)</th>
</tr>
</thead>
<tbody>
<tr>
<td>00111110110</td>
<td>I7</td>
<td>RA</td>
<td>RT</td>
</tr>
</tbody>
</table>

A 4-bit address is computed by adding the value in the signed I7 field to the value in the preferred slot of register RA and forcing the least-significant 2 bits to zero. The address is used to determine the position of an aligned word within a quadword. Based on the position, a mask is generated that can be used with the `shufb` instruction to insert a word at the indicated position within a quadword. The word is taken from the preferred slot of the RA operand of the `shufb` instruction. See Appendix C Details of the Compute-Mask Instructions on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RepLeftBit}(I7, 32)) \& 0x0000000C \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::4} & \leftarrow 0x00010203
\end{align*}
\]
Generate Controls for Word Insertion (x-form)

cwx  rt,ra,rb

A 4-bit address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB and forcing the least-significant 2 bits to zero. The address is used to determine the position of an aligned word within a quadword. Based on the position, a mask is generated that can be used with the \texttt{shufb} instruction to insert a word at the indicated position within a quadword. The word is taken from the preferred slot of the RA operand of the \texttt{shufb} instruction. See \textit{Appendix C Details of the Compute-Mask Instructions} on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& 0x0000000C \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::4} & \leftarrow 0x00010203
\end{align*}
\]
Generate Controls for Doubleword Insertion (d-form)

A 4-bit address is computed by adding the value in the signed I7 field to the value in the preferred slot of register RA and forcing the least-significant 3 bits to zero. The address is used to determine the position of an aligned doubleword within a quadword. Based on the position, a mask is generated that can be used with the \texttt{shufb} instruction to insert a doubleword at the indicated position within a quadword. The doubleword is taken from the leftmost 8 bytes of the RA operand of the \texttt{shufb} instruction. See \textit{Appendix C Details of the Compute-Mask Instructions} on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RepLeftBit}(I7, 32)) \& 0x00000008 \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::8} & \leftarrow 0x0001020304050607
\end{align*}
\]
Generate Controls for Doubleword Insertion (x-form)

cdx rt,ra,rb

A 4-bit address is computed by adding the value in the preferred slot of register RA to the value in the preferred slot of register RB and forcing the least-significant 3 bits to zero. The address is used to determine the position of an aligned doubleword within a quadword. Based on the position, a mask is generated that can be used with the `shufb` instruction to insert a doubleword at the indicated position within a quadword. The doubleword is taken from the leftmost 8 bytes of the RA operand of the `shufb` instruction. See Appendix C Details of the Compute-Mask Instructions on page 255 for the details of the created mask.

\[
\begin{align*}
t & \leftarrow (\text{RA}^{0:3} + \text{RB}^{0:3}) \& 0x00000008 \\
\text{RT} & \leftarrow 0x101112131415161718191A1B1C1D1E1F \\
\text{RT}^{t::8} & \leftarrow 0x0001020304050607
\end{align*}
\]
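
The byte, halfword, word, and doubleword variants follow one pattern, which the Python sketch below generalizes. `generate_controls` is a hypothetical helper that takes the already-summed address and the element width in bytes. It is a model of the pseudocode in this section, not a definitive description of the hardware.

```python
def generate_controls(address: int, width: int) -> bytes:
    """Build the insertion mask for an aligned element of 1, 2, 4, or 8 bytes."""
    if width not in (1, 2, 4, 8):
        raise ValueError("width must be 1, 2, 4, or 8 bytes")
    t = address & 0xF & ~(width - 1)      # force alignment within the quadword
    mask = bytearray(range(0x10, 0x20))   # 0x10..0x1F keeps the old quadword
    if width == 8:
        source = range(0, 8)              # leftmost 8 bytes of the RA operand
    else:
        source = range(4 - width, 4)      # rightmost bytes of the preferred slot
    mask[t:t + width] = bytes(source)
    return bytes(mask)
```

For example, an address of 9 with a width of 8 yields t = 8, so the right half of the mask becomes 0x00 through 0x07.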
4. Constant-Formation Instructions

This section lists and describes the SPU constant-formation instructions.
Immediate Load Halfword

\textbf{ilh} \quad rt,\text{symbol}

<table>
<thead>
<tr>
<th>OP (bits 0-8)</th>
<th>I16 (bits 9-24)</th>
<th>RT (bits 25-31)</th>
</tr>
</thead>
<tbody>
<tr>
<td>010000011</td>
<td>I16</td>
<td>RT</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The rightmost 16 bits of the value in the I16 field are placed in register RT.

\textbf{Programming Note:} There is no Immediate Load Byte instruction. However, that function can be performed by the \textbf{ilh} instruction with a suitable value in the I16 field.

\begin{align*}
\text{s} & \leftarrow \text{I16} \& 0xFFFF \\
\text{RT}^{0:1} & \leftarrow \text{s} \\
\text{RT}^{2:3} & \leftarrow \text{s} \\
\text{RT}^{4:5} & \leftarrow \text{s} \\
\text{RT}^{6:7} & \leftarrow \text{s} \\
\text{RT}^{8:9} & \leftarrow \text{s} \\
\text{RT}^{10:11} & \leftarrow \text{s} \\
\text{RT}^{12:13} & \leftarrow \text{s} \\
\text{RT}^{14:15} & \leftarrow \text{s}
\end{align*}
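
A minimal Python model of this replication, with the register represented as 16 bytes in MSB order. The function name `ilh` is reused here only as a label for the sketch.

```python
def ilh(i16: int) -> bytes:
    """Place the 16-bit immediate in each of the eight halfword slots."""
    s = i16 & 0xFFFF
    return s.to_bytes(2, byteorder="big") * 8


# Per the programming note, a byte splat is obtained by repeating
# the byte in both halves of the immediate.
all_bytes_0x41 = ilh(0x4141)
```
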
Immediate Load Halfword Upper

\( \text{ilhu} \quad \text{rt}, \text{symbol} \)

```
OP (bits 0-8) = 0 1 0 0 0 0 0 1 0    I16 (bits 9-24)    RT (bits 25-31)
```

For each of four word slots:
- The value in the I16 field is placed in the leftmost 16 bits of the word.
- The remaining bits of the word are set to zero.

**Programming Note:** This instruction, when used in conjunction with Immediate Or Halfword Lower (iohl), can be used to form an arbitrary 32-bit value in each word slot of a register. It can also be used alone to load an immediate floating-point constant with up to 7 bits of significance in its fraction.

```
t ← I16 || 0x0000

RT0:3 ← t
RT4:7 ← t
RT8:11 ← t
RT12:15 ← t
```
Immediate Load Word

\textbf{il} \quad \textbf{rt},\textit{symbol}

\begin{verbatim}
0 0 0 0 0 0 1
  ↓  ↓  ↓  ↓  ↓  ↓
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
\end{verbatim}

For each of four word slots:
\begin{itemize}
  \item The value in the I16 field is expanded to 32 bits by replicating the leftmost bit.
  \item The resulting value is placed in register RT.
\end{itemize}

\begin{verbatim}
t ← RepLeftBit(I16,32)
RT^0:3 ← t
RT^4:7 ← t
RT^8:11 ← t
RT^12:15 ← t
\end{verbatim}
Immediate Load Address

`ila rt,symbol`

For each of four word slots:

- The value in the I18 field is placed unchanged in the rightmost 18 bits of register RT.
- The remaining bits of register RT are set to zero.

**Programming Note:** Immediate Load Address can be used to load an immediate value, such as an address or a small constant, without sign extension.

| Target | Assignment |
|--------|------------|
| `t` | ← `I18`, zero-extended to 32 bits |
| `RT0:3` | ← `t` |
| `RT4:7` | ← `t` |
| `RT8:11` | ← `t` |
| `RT12:15` | ← `t` |
Immediate Or Halfword Lower

iohl rt,symbol

For each of four word slots:

- The value in the I16 field is prefaced with zeros and ORed with the value in register RT.
- The result is placed into register RT.

Programming Note: Immediate Or Halfword Lower can be used in conjunction with Immediate Load Halfword Upper to load a 32-bit immediate value.

| Target | Assignment |
|--------|------------|
| t | ← 0x0000 \|\| I16 |
| RT^{0:3} | ← RT^{0:3} \| t |
| RT^{4:7} | ← RT^{4:7} \| t |
| RT^{8:11} | ← RT^{8:11} \| t |
| RT^{12:15} | ← RT^{12:15} \| t |
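
The pairing of Immediate Load Halfword Upper with Immediate Or Halfword Lower can be modelled per word slot. The following Python sketch is illustrative only: the function names are hypothetical helpers, and a register is assumed to be a list of four 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def ilhu(i16):
    """Model of Immediate Load Halfword Upper: I16 in the upper half of each word."""
    t = ((i16 & 0xFFFF) << 16) & MASK32
    return [t] * 4


def iohl(rt, i16):
    """Model of Immediate Or Halfword Lower: OR the zero-extended I16 into each word."""
    t = i16 & 0xFFFF
    return [(w | t) & MASK32 for w in rt]


def load_word_constant(value):
    """Form an arbitrary 32-bit constant in all four word slots (hypothetical helper)."""
    rt = ilhu(value >> 16)   # upper halfword first
    return iohl(rt, value)   # then OR in the lower halfword
```

The order matters: `ilhu` clears the lower halfword of each slot, so the `iohl` step must follow it.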
Form Select Mask for Bytes Immediate

fsmbi rt,symbol

The I16 field is used to create a mask in register RT by making eight copies of each bit. Bits in the operand are related to bytes in the result in a left-to-right correspondence.

**Programming Note:** This instruction can be used to create a mask for use with the Select Bits instruction. It can also be used to create masks for halfwords, words, and doublewords.

```plaintext
s ← I16
For j = 0 to 15
    If s_j = 0 then r_j ← 0x00 else r_j ← 0xFF
End
RT ← r
```
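
The bit-to-byte expansion can be sketched in Python as follows. The function name and the use of a `bytes` value for the 16-byte register are assumptions made for illustration; bit 0 of the I16 field is taken to be its leftmost bit, matching the left-to-right correspondence described above.

```python
def fsmbi(i16):
    """Model of Form Select Mask for Bytes Immediate: returns 16 mask bytes."""
    s = i16 & 0xFFFF
    # Bit j of s (counting from the left) controls byte j of the result.
    return bytes(0xFF if (s >> (15 - j)) & 1 else 0x00 for j in range(16))
```

A word mask is obtained by setting four adjacent bits, for example `fsmbi(0xF000)` selects the first word slot only.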
5. Integer and Logical Instructions

This section lists and describes the SPU integer and logical instructions.
Add Halfword

```
ah        rt,ra,rb
```

For each of eight halfword slots:

- The operand from register RA is added to the operand from register RB.
- The 16-bit result is placed in RT.
- Overflows and carries are not detected.

<table>
<thead>
<tr>
<th>RT&lt;0:1&gt;</th>
<th>← RA&lt;0:1&gt; + RB&lt;0:1&gt;</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT&lt;2:3&gt;</td>
<td>← RA&lt;2:3&gt; + RB&lt;2:3&gt;</td>
</tr>
<tr>
<td>RT&lt;4:5&gt;</td>
<td>← RA&lt;4:5&gt; + RB&lt;4:5&gt;</td>
</tr>
<tr>
<td>RT&lt;6:7&gt;</td>
<td>← RA&lt;6:7&gt; + RB&lt;6:7&gt;</td>
</tr>
<tr>
<td>RT&lt;8:9&gt;</td>
<td>← RA&lt;8:9&gt; + RB&lt;8:9&gt;</td>
</tr>
<tr>
<td>RT&lt;10:11&gt;</td>
<td>← RA&lt;10:11&gt; + RB&lt;10:11&gt;</td>
</tr>
<tr>
<td>RT&lt;12:13&gt;</td>
<td>← RA&lt;12:13&gt; + RB&lt;12:13&gt;</td>
</tr>
<tr>
<td>RT&lt;14:15&gt;</td>
<td>← RA&lt;14:15&gt; + RB&lt;14:15&gt;</td>
</tr>
</tbody>
</table>
Add Halfword Immediate

ahi rt, ra, value

For each of eight halfword slots:
- The signed value in the I10 field is added to the value in register RA.
- The 16-bit result is placed in RT.
- Overflows and carries are not detected.

\[
\begin{array}{c|c|c}
\text{s} & \leftarrow \text{RepLeftBit}(I10, 16) \\
\text{RT}^{0:1} & \leftarrow \text{RA}^{0:1} + s \\
\text{RT}^{2:3} & \leftarrow \text{RA}^{2:3} + s \\
\text{RT}^{4:5} & \leftarrow \text{RA}^{4:5} + s \\
\text{RT}^{6:7} & \leftarrow \text{RA}^{6:7} + s \\
\text{RT}^{8:9} & \leftarrow \text{RA}^{8:9} + s \\
\text{RT}^{10:11} & \leftarrow \text{RA}^{10:11} + s \\
\text{RT}^{12:13} & \leftarrow \text{RA}^{12:13} + s \\
\text{RT}^{14:15} & \leftarrow \text{RA}^{14:15} + s \\
\end{array}
\]
Add Word

\( a \) \( rt, ra, rb \)

For each of four word slots:

- The operand from register RA is added to the operand from register RB.
- The 32-bit result is placed in register RT.
- Overflows and carries are not detected.

<table>
<thead>
<tr>
<th>RT&lt;0:3&gt;</th>
<th>← RA&lt;0:3&gt; + RB&lt;0:3&gt;</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT&lt;4:7&gt;</td>
<td>← RA&lt;4:7&gt; + RB&lt;4:7&gt;</td>
</tr>
<tr>
<td>RT&lt;8:11&gt;</td>
<td>← RA&lt;8:11&gt; + RB&lt;8:11&gt;</td>
</tr>
<tr>
<td>RT&lt;12:15&gt;</td>
<td>← RA&lt;12:15&gt; + RB&lt;12:15&gt;</td>
</tr>
</tbody>
</table>
Add Word Immediate

\[ ai \quad rt, ra, value \]

For each of four word slots:
- The signed value in the I10 field is added to the operand in register RA.
- The 32-bit result is placed in register RT.
- Overflows and carries are not detected.

\[
\begin{array}{l}
\text{t} \quad \leftarrow \text{RepLeftBit(I10, 32)} \\
\text{RT}^{0:3} \quad \leftarrow \text{RA}^{0:3} + t \\
\text{RT}^{4:7} \quad \leftarrow \text{RA}^{4:7} + t \\
\text{RT}^{8:11} \quad \leftarrow \text{RA}^{8:11} + t \\
\text{RT}^{12:15} \quad \leftarrow \text{RA}^{12:15} + t
\end{array}
\]
Subtract From Halfword

```
sfh       rt, ra, rb
```

For each of eight halfword slots:

- The value in register RA is subtracted from the value in RB.
- The 16-bit result is placed in register RT.
- Overflows and carries are not detected.

| RT<0:1> | ← RB<0:1> + (¬RA<0:1>) + 1 |
| RT<2:3> | ← RB<2:3> + (¬RA<2:3>) + 1 |
| RT<4:5> | ← RB<4:5> + (¬RA<4:5>) + 1 |
| RT<6:7> | ← RB<6:7> + (¬RA<6:7>) + 1 |
| RT<8:9> | ← RB<8:9> + (¬RA<8:9>) + 1 |
| RT<10:11> | ← RB<10:11> + (¬RA<10:11>) + 1 |
| RT<12:13> | ← RB<12:13> + (¬RA<12:13>) + 1 |
| RT<14:15> | ← RB<14:15> + (¬RA<14:15>) + 1 |
Subtract From Halfword Immediate

\[ \text{sfhi } rt, ra, \text{value} \]

For each of eight halfword slots:
- The value in register RA is subtracted from the signed value in the I10 field.
- The 16-bit result is placed in register RT.
- Overflows are not detected.

**Programming Note**: Although there is no Subtract Halfword Immediate instruction, its effect can be achieved by using the Add Halfword Immediate instruction with a negative immediate field.

\[
\begin{array}{l}
\text{t} \leftarrow \text{RepLeftBit}(I10,16) \\
\text{RT}^{0:1} \leftarrow t + (\lnot \text{RA}^{0:1}) + 1 \\
\text{RT}^{2:3} \leftarrow t + (\lnot \text{RA}^{2:3}) + 1 \\
\text{RT}^{4:5} \leftarrow t + (\lnot \text{RA}^{4:5}) + 1 \\
\text{RT}^{6:7} \leftarrow t + (\lnot \text{RA}^{6:7}) + 1 \\
\text{RT}^{8:9} \leftarrow t + (\lnot \text{RA}^{8:9}) + 1 \\
\text{RT}^{10:11} \leftarrow t + (\lnot \text{RA}^{10:11}) + 1 \\
\text{RT}^{12:13} \leftarrow t + (\lnot \text{RA}^{12:13}) + 1 \\
\text{RT}^{14:15} \leftarrow t + (\lnot \text{RA}^{14:15}) + 1 \\
\end{array}
\]
Subtract From Word

sf rt, ra, rb

For each of four word slots:
- The value in register RA is subtracted from the value in register RB.
- The result is placed in register RT.
- Overflows and carries are not detected.

<table>
<thead>
<tr>
<th>Bytes</th>
<th>Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT&lt;0:3&gt;</td>
<td>← RB&lt;0:3&gt; + (¬RA&lt;0:3&gt;) + 1</td>
</tr>
<tr>
<td>RT&lt;4:7&gt;</td>
<td>← RB&lt;4:7&gt; + (¬RA&lt;4:7&gt;) + 1</td>
</tr>
<tr>
<td>RT&lt;8:11&gt;</td>
<td>← RB&lt;8:11&gt; + (¬RA&lt;8:11&gt;) + 1</td>
</tr>
<tr>
<td>RT&lt;12:15&gt;</td>
<td>← RB&lt;12:15&gt; + (¬RA&lt;12:15&gt;) + 1</td>
</tr>
</tbody>
</table>
Subtract From Word Immediate

sfi rt,ra,value

For each of four word slots:

- The value in register RA is subtracted from the signed value in the I10 field.
- The result is placed in register RT.
- Overflows and carries are not detected.

**Programming Note:** Although there is no Subtract Immediate instruction, its effect can be achieved by using the Add Immediate with a negative immediate field.

\[
\begin{array}{l}
\text{t} \leftarrow \text{RepLeftBit(I10,32)} \\
\text{RT}^{0:3} \leftarrow t + (\lnot \text{RA}^{0:3}) + 1 \\
\text{RT}^{4:7} \leftarrow t + (\lnot \text{RA}^{4:7}) + 1 \\
\text{RT}^{8:11} \leftarrow t + (\lnot \text{RA}^{8:11}) + 1 \\
\text{RT}^{12:15} \leftarrow t + (\lnot \text{RA}^{12:15}) + 1 \\
\end{array}
\]
Add Extended

\texttt{addx rt,ra,rb}

For each of four word slots:

- The operand from register RA is added to the operand from register RB and the least-significant bit of the operand from register RT.
- The 32-bit result is placed in register RT. Bits 0 to 30 of the RT input are reserved and should be zero.

\begin{align*}
\text{RT}^{0:3} & \leftarrow \text{RA}^{0:3} + \text{RB}^{0:3} + \text{RT}_{31} \\
\text{RT}^{4:7} & \leftarrow \text{RA}^{4:7} + \text{RB}^{4:7} + \text{RT}_{63} \\
\text{RT}^{8:11} & \leftarrow \text{RA}^{8:11} + \text{RB}^{8:11} + \text{RT}_{95} \\
\text{RT}^{12:15} & \leftarrow \text{RA}^{12:15} + \text{RB}^{12:15} + \text{RT}_{127}
\end{align*}
Carry Generate

cg               rt,ra,rb

For each of four word slots:

- The operand from register RA is added to the operand from register RB.
- The carry-out is placed in the least-significant bit of register RT.
- The remaining bits of RT are set to zero.

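For \( j = 0 \) to 15 by 4

\[
t_{0:32} = (0 \,\|\, RA^{j::4}) + (0 \,\|\, RB^{j::4})
\]

\[
RT^{j::4} \leftarrow {}^{31}0 \,\|\, t_0
\]

End

Here \( t \) is the 33-bit sum, \( t_0 \) is its carry-out bit, and \( {}^{31}0 \) denotes 31 zero bits.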
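
Carry Generate and Add Extended together allow additions wider than a word slot. The following Python sketch models a 64-bit addition under stated assumptions: each function models a single word slot, the names `cg`, `addx`, and `add64` are illustrative, and the carry is passed directly between the two steps. On the SPU itself the carry produced in one slot would first have to be moved into the slot that holds the more-significant word.

```python
MASK32 = 0xFFFFFFFF


def cg(ra, rb):
    """Carry Generate for one word slot: 1 if ra + rb carries out of 32 bits."""
    return ((ra & MASK32) + (rb & MASK32)) >> 32


def addx(ra, rb, rt):
    """Add Extended for one word slot: ra + rb + least-significant bit of rt."""
    return (ra + rb + (rt & 1)) & MASK32


def add64(a, b):
    """Add two 64-bit values, each split into a high and a low 32-bit word."""
    a_hi, a_lo = (a >> 32) & MASK32, a & MASK32
    b_hi, b_lo = (b >> 32) & MASK32, b & MASK32
    lo = (a_lo + b_lo) & MASK32     # Add Word on the low words
    carry = cg(a_lo, b_lo)          # carry out of the low words
    hi = addx(a_hi, b_hi, carry)    # carry consumed by the high words
    return (hi << 32) | lo
```

The same pattern extends to longer operands by substituting Carry Generate Extended for Carry Generate in the middle words.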
Carry Generate Extended

cgx rt, ra, rb

For each of four word slots:
- The operand from register RA is added to the operand from register RB and the least-significant bit of register RT.
- The carry-out is placed in the least-significant bit of register RT.
- The remaining bits of RT are set to zero. Bits 0 to 30 of the RT input are reserved and should be zero.

For \( j = 0 \) to 15 by 4
\[
\begin{array}{l}
t_{0:32} = (0 \,\|\, RA^{j::4}) + (0 \,\|\, RB^{j::4}) + ({}^{32}0 \,\|\, RT_{j \times 8 + 31}) \\
RT^{j::4} \leftarrow {}^{31}0 \,\|\, t_0
\end{array}
\]
End
Subtract From Extended

`sfx rt, ra, rb`

For each of four word slots:

- The operand from register RA is subtracted from the operand from register RB. An additional ‘1’ is subtracted from the result if the least-significant bit of RT is ‘0’.
- The 32-bit result is placed in register RT. Bits 0 to 30 of the RT input are reserved and should be zero.

\begin{align*}
RT^{0:3} & \leftarrow RB^{0:3} + (\neg RA^{0:3}) + RT_{31} \\
RT^{4:7} & \leftarrow RB^{4:7} + (\neg RA^{4:7}) + RT_{63} \\
RT^{8:11} & \leftarrow RB^{8:11} + (\neg RA^{8:11}) + RT_{95} \\
RT^{12:15} & \leftarrow RB^{12:15} + (\neg RA^{12:15}) + RT_{127}
\end{align*}
Borrow Generate

bg rt, ra, rb

For each of four word slots:

- If the unsigned value of RA is greater than the unsigned value of RB, then ‘0’ is placed in register RT. Otherwise, ‘1’ is placed in register RT.

```plaintext
For j = 0 to 15 by 4
  if (RBj:4 ≥ RAj:4) then RTj:4 ← 1
  else RTj:4 ← 0
End
```
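
Borrow Generate pairs with Subtract From Word and Subtract From Extended in the same way that Carry Generate pairs with the add instructions. The Python sketch below models a 64-bit subtraction; it is an illustration under assumptions, not SPU code. Each function models one word slot, the names are hypothetical helpers, and the borrow indication is handed directly from the low-word step to the high-word step.

```python
MASK32 = 0xFFFFFFFF


def sf(ra, rb):
    """Subtract From Word for one slot: rb - ra, computed as rb + (~ra) + 1."""
    return (rb + (~ra & MASK32) + 1) & MASK32


def bg(ra, rb):
    """Borrow Generate for one slot: 0 if ra > rb (a borrow occurs), otherwise 1."""
    return 0 if (ra & MASK32) > (rb & MASK32) else 1


def sfx(ra, rb, rt):
    """Subtract From Extended for one slot: rb + (~ra) + least-significant bit of rt."""
    return (rb + (~ra & MASK32) + (rt & 1)) & MASK32


def sub64(a, b):
    """Compute b - a modulo 2**64 from 32-bit word operations."""
    a_hi, a_lo = (a >> 32) & MASK32, a & MASK32
    b_hi, b_lo = (b >> 32) & MASK32, b & MASK32
    lo = sf(a_lo, b_lo)
    no_borrow = bg(a_lo, b_lo)       # 1 means no borrow is needed
    hi = sfx(a_hi, b_hi, no_borrow)
    return (hi << 32) | lo
```

Note the inverted sense of the flag: a value of 1 means that no borrow occurred, which is why it can be fed to Subtract From Extended unchanged.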
Borrow Generate Extended

bgx rt,ra,rb

For each of four word slots:

- The operand from register RA is subtracted from the operand from register RB. An additional ‘1’ is subtracted from the result if the least-significant bit of RT is ‘0’. If the result is less than zero, a ‘0’ is placed in register RT. Otherwise, register RT is set to ‘1’. Bits 0 to 30 of the RT input are reserved and should be zero.

For \( j = 0 \) to 15 by 4

\[
\begin{array}{l}
\text{if } (RT_{j \times 8 + 31}) \text{ then} \\
\quad \text{if } (RB^{j::4} \geq RA^{j::4}) \text{ then } RT^{j::4} \leftarrow 1 \\
\quad \text{else } RT^{j::4} \leftarrow 0 \\
\text{else} \\
\quad \text{if } (RB^{j::4} > RA^{j::4}) \text{ then } RT^{j::4} \leftarrow 1 \\
\quad \text{else } RT^{j::4} \leftarrow 0
\end{array}
\]

End

Both comparisons are unsigned.
Multiply

\texttt{mpy rt,ra,rb}

\[
\begin{array}{|c|c|c|c|}
\hline
0\,1\,1\,1\,1\,0\,0\,0\,1\,0\,0 & RB & RA & RT \\
\hline
\end{array}
\]

For each of four word slots:

- The value in the rightmost 16 bits of register RA is multiplied by the value in the rightmost 16 bits of register RB.
- The 32-bit product is placed in register RT.
- The leftmost 16 bits of each operand are ignored.

\begin{align*}
\text{RT}^{0:3} & \gets \text{RA}^{2:3} \times \text{RB}^{2:3} \\
\text{RT}^{4:7} & \gets \text{RA}^{6:7} \times \text{RB}^{6:7} \\
\text{RT}^{8:11} & \gets \text{RA}^{10:11} \times \text{RB}^{10:11} \\
\text{RT}^{12:15} & \gets \text{RA}^{14:15} \times \text{RB}^{14:15}
\end{align*}
Multiply Unsigned

\[
\text{mpyu } \text{rt,ra,rb}
\]

For each of four word slots:

- The rightmost 16 bits of register RA are multiplied by the rightmost 16 bits of register RB, treating both operands as unsigned.
- The 32-bit product is placed in register RT.

\[
\begin{array}{lcl}
\text{RT}^{0:3} & \leftarrow & \text{RA}^{2:3} \mathbin{|{*}|} \text{RB}^{2:3} \\
\text{RT}^{4:7} & \leftarrow & \text{RA}^{6:7} \mathbin{|{*}|} \text{RB}^{6:7} \\
\text{RT}^{8:11} & \leftarrow & \text{RA}^{10:11} \mathbin{|{*}|} \text{RB}^{10:11} \\
\text{RT}^{12:15} & \leftarrow & \text{RA}^{14:15} \mathbin{|{*}|} \text{RB}^{14:15}
\end{array}
\]
Multiply Immediate

\textbf{mpyi} \hspace{1cm} rt,ra,value

\begin{verbatim}
0 1 1 1 0 1 0 0 | I10 | RA | RT
\end{verbatim}

For each of four word slots:

\begin{itemize}
  \item The signed value in the I10 field is multiplied by the value in the rightmost 16 bits of register RA.
  \item The resulting product is placed in register RT.
\end{itemize}

\begin{table}[h]
\begin{tabular}{|l|l|}
\hline
$t$ & $\leftarrow \text{RepLeftBit}(\text{I10},16)$ \\
\hline
RT$^{0:3}$ & $\leftarrow \text{RA}^{2:3} \times t$ \\
\hline
RT$^{4:7}$ & $\leftarrow \text{RA}^{6:7} \times t$ \\
\hline
RT$^{8:11}$ & $\leftarrow \text{RA}^{10:11} \times t$ \\
\hline
RT$^{12:15}$ & $\leftarrow \text{RA}^{14:15} \times t$ \\
\hline
\end{tabular}
\end{table}
Multiply Unsigned Immediate

mpyui rt, ra, value

For each of four word slots:

- The signed value in the I10 field is extended to 16 bits by replicating the leftmost bit. The resulting value is multiplied by the rightmost 16 bits of register RA, treating both operands as unsigned.
- The resulting product is placed in register RT.

<table>
<thead>
<tr>
<th>t</th>
<th>← RepLeftBit(I10, 16)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT&lt;0:3&gt;</td>
<td>← RA&lt;2:3&gt; |*| t</td>
</tr>
<tr>
<td>RT&lt;4:7&gt;</td>
<td>← RA&lt;6:7&gt; |*| t</td>
</tr>
<tr>
<td>RT&lt;8:11&gt;</td>
<td>← RA&lt;10:11&gt; |*| t</td>
</tr>
<tr>
<td>RT&lt;12:15&gt;</td>
<td>← RA&lt;14:15&gt; |*| t</td>
</tr>
</tbody>
</table>
Multiply and Add

```plaintext
mpya rt,ra,rb,rc
```

For each of four word slots:

- The value in register RA is treated as a 16-bit signed integer and multiplied by the 16-bit signed value in register RB. The resulting product is added to the value in register RC.
- The result is placed in register RT.
- Overflows and carries are not detected.

**Programming Note:** The operands are right-aligned within the 32-bit field.

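| Target | Assignment |
|--------|------------|
| t0 | ← RA[2:3] × RB[2:3] |
| t1 | ← RA[6:7] × RB[6:7] |
| t2 | ← RA[10:11] × RB[10:11] |
| t3 | ← RA[14:15] × RB[14:15] |
| RT[0:3] | ← t0 + RC[0:3] |
| RT[4:7] | ← t1 + RC[4:7] |
| RT[8:11] | ← t2 + RC[8:11] |
| RT[12:15] | ← t3 + RC[12:15] |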
Multiply High

`mpyh rt,ra,rb`

For each of four word slots:

- The leftmost 16 bits of the value in register RA are shifted right by 16 bits and multiplied by the 16-bit value in register RB.
- The product is shifted left by 16 bits and placed in register RT. Bits shifted out at the left are discarded. Zeros are shifted in at the right.

**Programming Note:** This instruction can be used in conjunction with `mpyu` and Add Word (`a`) to perform a 32-bit multiply.

<table>
<thead>
<tr>
<th>Target</th>
<th>Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>t0</code></td>
<td>← RA^0:1 * RB^2:3</td>
</tr>
<tr>
<td><code>t1</code></td>
<td>← RA^4:5 * RB^6:7</td>
</tr>
<tr>
<td><code>t2</code></td>
<td>← RA^8:9 * RB^10:11</td>
</tr>
<tr>
<td><code>t3</code></td>
<td>← RA^12:13 * RB^14:15</td>
</tr>
<tr>
<td><code>RT^0:3</code></td>
<td>← <code>t0^2:3</code> || 0x0000</td>
</tr>
<tr>
<td><code>RT^4:7</code></td>
<td>← <code>t1^2:3</code> || 0x0000</td>
</tr>
<tr>
<td><code>RT^8:11</code></td>
<td>← <code>t2^2:3</code> || 0x0000</td>
</tr>
<tr>
<td><code>RT^12:15</code></td>
<td>← <code>t3^2:3</code> || 0x0000</td>
</tr>
</tbody>
</table>
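
The 32-bit multiply mentioned in the programming note can be sketched in Python. This is a model of a single word slot under assumptions, not SPU code: the function names are illustrative, and the operands are treated as unsigned because only the low 16 bits of each partial product survive the left shift in Multiply High, so signedness does not affect that result.

```python
MASK32 = 0xFFFFFFFF


def mpyh(ra, rb):
    """Multiply High for one slot: (upper half of ra * lower half of rb) << 16."""
    product = ((ra >> 16) & 0xFFFF) * (rb & 0xFFFF)
    return (product << 16) & MASK32


def mpyu(ra, rb):
    """Multiply Unsigned for one slot: product of the lower halves of ra and rb."""
    return ((ra & 0xFFFF) * (rb & 0xFFFF)) & MASK32


def mul32(a, b):
    """Low 32 bits of a * b from two Multiply High results and one Multiply Unsigned."""
    t1 = mpyh(a, b)                  # a_hi * b_lo, shifted left by 16
    t2 = mpyh(b, a)                  # b_hi * a_lo, shifted left by 16
    t3 = mpyu(a, b)                  # a_lo * b_lo
    return (t1 + t2 + t3) & MASK32   # two Add Word steps
```

The term a_hi * b_hi is omitted because it only contributes to bits above the low 32 bits of the product.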
Multiply and Shift Right

mpys rt,ra,rb

For each of four word slots:

- The value in the rightmost 16 bits of register RA is multiplied by the value in the rightmost 16 bits of register RB.
- The leftmost 16 bits of the 32-bit product are placed in the rightmost 16 bits of register RT, with the sign bit replicated into the left 16 bits of the register.

| t0 | ← RA^{2:3} • RB^{2:3} |
| t1 | ← RA^{6:7} • RB^{6:7} |
| t2 | ← RA^{10:11} • RB^{10:11} |
| t3 | ← RA^{14:15} • RB^{14:15} |
| RT^{0:3} | ← RepLeftBit(t0^{0:1}, 32) |
| RT^{4:7} | ← RepLeftBit(t1^{0:1}, 32) |
| RT^{8:11} | ← RepLeftBit(t2^{0:1}, 32) |
| RT^{12:15} | ← RepLeftBit(t3^{0:1}, 32) |
Multiply High High

mpyhh rt,ra,rb

Instruction format fields: opcode, RB, RA, RT.

For each of four word slots:

- The leftmost 16 bits in register RA are multiplied by the leftmost 16 bits in register RB.
- The 32-bit product is placed in register RT.

<table>
<thead>
<tr>
<th>Target</th>
<th>Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT^{0:3}</td>
<td>← RA^{0:1} * RB^{0:1}</td>
</tr>
<tr>
<td>RT^{4:7}</td>
<td>← RA^{4:5} * RB^{4:5}</td>
</tr>
<tr>
<td>RT^{8:11}</td>
<td>← RA^{8:9} * RB^{8:9}</td>
</tr>
<tr>
<td>RT^{12:15}</td>
<td>← RA^{12:13} * RB^{12:13}</td>
</tr>
</tbody>
</table>
Multiply High High and Add

\texttt{mpyhha rt,ra,rb}

For each of four word slots:

- The leftmost 16 bits in register \( RA \) are multiplied by the leftmost 16 bits in register \( RB \). The product is added to the value in register \( RT \).
- The sum is placed in register \( RT \).

\begin{align*}
RT^{0:3} & \leftarrow RA^{0:1} \times RB^{0:1} + RT^{0:3} \\
RT^{4:7} & \leftarrow RA^{4:5} \times RB^{4:5} + RT^{4:7} \\
RT^{8:11} & \leftarrow RA^{8:9} \times RB^{8:9} + RT^{8:11} \\
RT^{12:15} & \leftarrow RA^{12:13} \times RB^{12:13} + RT^{12:15}
\end{align*}
Multiply High High Unsigned

\[ \text{mpyhhu} \quad \text{rt}, \text{ra}, \text{rb} \]

For each of four word slots:

- The leftmost 16 bits in register RA are multiplied by the leftmost 16 bits in register RB, treating both operands as unsigned.
- The 32-bit product is placed in register RT.

\[
\begin{array}{lcl}
\text{RT}^{0:3} & \leftarrow & \text{RA}^{0:1} \mathbin{|{*}|} \text{RB}^{0:1} \\
\text{RT}^{4:7} & \leftarrow & \text{RA}^{4:5} \mathbin{|{*}|} \text{RB}^{4:5} \\
\text{RT}^{8:11} & \leftarrow & \text{RA}^{8:9} \mathbin{|{*}|} \text{RB}^{8:9} \\
\text{RT}^{12:15} & \leftarrow & \text{RA}^{12:13} \mathbin{|{*}|} \text{RB}^{12:13}
\end{array}
\]
Multiply High High Unsigned and Add

mpyhhau rt,ra,rb

For each of four word slots:

- The leftmost 16 bits in register RA are multiplied by the leftmost 16 bits in register RB, treating both operands as unsigned. The product is added to the value in register RT.
- The sum is placed in register RT.

\begin{align*}
RT^{0:3} &\leftarrow RA^{0:1} \times RB^{0:1} + RT^{0:3} \\
RT^{4:7} &\leftarrow RA^{4:5} \times RB^{4:5} + RT^{4:7} \\
RT^{8:11} &\leftarrow RA^{8:9} \times RB^{8:9} + RT^{8:11} \\
RT^{12:15} &\leftarrow RA^{12:13} \times RB^{12:13} + RT^{12:15}
\end{align*}
Count Leading Zeros

\texttt{clz \ rt,ra}

\begin{verbatim}
0 1 0 1 0 1 0 1 | /// | RA | RT
\end{verbatim}

For each of four word slots:

- The number of zero bits to the left of the first ‘1’ bit in the operand in register RA is computed.
- The result is placed in register RT. If register RA is zero, the result is 32.

\textbf{Programming Note:} The result placed in register RT satisfies $0 \leq RT \leq 32$. The value in register RT is zero, for example, if the corresponding slot in RA is a negative integer. The value in register RT is 32 if the corresponding slot in register RA is zero.

\begin{verbatim}
For i = 0 to 3
    t ← 0; j ← i * 4
    u ← RA^{j::4}
    For m = 0 to 31
        If u_m = 1 then leave
        t ← t + 1
    End
    RT^{j::4} ← t
End
\end{verbatim}
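
The loop above translates directly into Python. The sketch below is illustrative: the names `clz32` and `clz` are hypothetical, and a register is assumed to be a list of four 32-bit integers with bit 0 as the most-significant bit of each word.

```python
def clz32(word):
    """Count leading zeros of one 32-bit word slot; returns 32 when the word is zero."""
    word &= 0xFFFFFFFF
    count = 0
    for m in range(32):                  # m = 0 is the leftmost bit
        if (word >> (31 - m)) & 1:
            break
        count += 1
    return count


def clz(ra):
    """Apply clz32 independently to each of the four word slots."""
    return [clz32(w) for w in ra]
```

A slot holding a negative integer has its leftmost bit set, so its count is zero, as stated in the programming note.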
Count Ones in Bytes

cntb rt,ra

For each of 16 byte slots:

- The number of bits in register RA whose value is ‘1’ is computed.
- The result is placed in register RT.

**Programming Note:** The result placed in register RT satisfies $0 \leq RT \leq 8$. The value in register RT is zero, for example, if the value in RA is zero. The value in RT is 8 if the value in RA is -1.

```plaintext
For j = 0 to 15
    c = 0
    b ← RAj
    For m = 0 to 7
        If bm = 1 then c ← c + 1
    End
    RTj ← c
End
```

(See also the *Form Select Mask for Bytes* instruction on page 80.)
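
The per-byte population count can be modelled in a few lines of Python. This is a sketch for illustration; the function name and the representation of a register as a 16-byte `bytes` value are assumptions.

```python
def cntb(ra):
    """Model of Count Ones in Bytes: number of '1' bits in each of the 16 bytes."""
    if len(ra) != 16:
        raise ValueError("a register is modelled as exactly 16 bytes")
    return bytes(bin(b).count("1") for b in ra)
```

Each result byte lies between 0 and 8, matching the range given in the programming note.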
Form Select Mask for Bytes

fsmb rt, ra

The rightmost 16 bits of the preferred slot of register RA are used to create a mask in register RT by replicating each bit eight times. Bits in the operand are related to bytes in the result in a left-to-right correspondence.

\[ s \leftarrow RA^{2:3} \& 0xFFFF \]

For \( j = 0 \) to 15
  If \( s_j = 0 \) then \( r_j \leftarrow 0x00 \) else \( r_j \leftarrow 0xFF \)
End

\( RT \leftarrow r \)
Form Select Mask for Halfwords

\textbf{fsmh rt,ra}

\[
\begin{array}{|c|}
\hline
0\,0\,1\,1\,0\,1\,1\,0\,1\,0\,1 \\
\hline
\end{array}
\]

The rightmost 8 bits of the preferred slot of register RA are used to create a mask in register RT by replicating each bit 16 times. Bits in the operand are related to halfwords in the result in a left-to-right correspondence.

\( s \leftarrow RA_{24:31} \)

\( k = 0 \)

For \( j = 0 \) to 7
  If \( s_j = 0 \) then \( r^{k::2} \leftarrow 0x0000 \) else \( r^{k::2} \leftarrow 0xFFFF \)
  \( k = k + 2 \)
End

\( RT \leftarrow r \)
Form Select Mask for Words

\[ \text{fsm} \quad \text{rt,ra} \]

\[
\begin{array}{cccccccccccccccccccccccccccc}
0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & /// & & & & & & & & & & \\
\end{array}
\]

The rightmost 4 bits of the preferred slot of register RA are used to create a mask in register RT by replicating each bit 32 times. Bits in the operand are related to words in the result in a left-to-right correspondence.

```plaintext
s ← RA_{28:31}
k = 0
For j = 0 to 3
    If s_j = 0 then r_{k:4} ← 0x00000000 else
    r_{k:4} ← 0xFFFFFFFF
    k = k + 4
End
RT ← r
```
Gather Bits from Bytes

gbb rt,ra

A 16-bit quantity is formed in the right half of the preferred slot of register RT by concatenating the rightmost bit in each byte of register RA. The leftmost 16 bits of register RT are set to zero, as are the remaining slots of register RT.

```
k = 0
s = 0
For j = 7 to 128 by 8
    s_k ← RA_j
    k = k + 1
End
RT0:3 ← 0x0000 || s
RT4:7 ← 0
RT8:11 ← 0
RT12:15 ← 0
```
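
Gather Bits from Bytes is the inverse of the select-mask expansion. The Python sketch below models it under assumptions: the function name is illustrative, the source register is a 16-byte `bytes` value, and the result is returned as a list of four 32-bit word slots with the preferred slot first.

```python
def gbb(ra):
    """Model of Gather Bits from Bytes: collect the rightmost bit of each of 16 bytes."""
    s = 0
    for byte in ra:                      # byte 0 supplies the leftmost bit of s
        s = (s << 1) | (byte & 1)
    # The preferred slot holds 0x0000 || s; the remaining word slots are zero.
    return [s & 0xFFFF, 0, 0, 0]
```

Applied to a byte mask of 0x00 and 0xFF values, this recovers the 16-bit pattern from which the mask was formed.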
Gather Bits from Halfwords

\[ \text{gbh} \quad \text{rt,ra} \]

An 8-bit quantity is formed in the rightmost byte of the preferred slot of register RT by concatenating the rightmost bit in each halfword of register RA. The leftmost 24 bits of the preferred slot of register RT are set to zero, as are the remaining slots of register RT.

\[
\begin{array}{l}
k = 8 \\
s = 0 \\
\text{For } j = 15 \text{ to } 128 \text{ by } 16 \\
\quad s_k \leftarrow RA_j \\
\quad k = k + 1 \\
\text{End} \\
RT^{0:3} \leftarrow 0x0000 \,\|\, s \\
RT^{4:7} \leftarrow 0 \\
RT^{8:11} \leftarrow 0 \\
RT^{12:15} \leftarrow 0
\end{array}
\]
Gather Bits from Words

\[
gb \quad rt,ra
\]

\[
\begin{array}{|c|c|c|c|}
\hline
0\,0\,1\,1\,0\,1\,1\,0\,0\,0\,0 & /// & RA & RT \\
\hline
\end{array}
\]

A 4-bit quantity is formed in the rightmost 4 bits of the preferred slot of register RT by concatenating the rightmost bit in each word of register RA. The leftmost 28 bits of the preferred slot of register RT are set to zero, as are the remaining slots of register RT.

\[
\begin{array}{l}
k = 12 \\
s = 0 \\
\text{For } j = 31 \text{ to } 128 \text{ by } 32 \\
\quad s_k \leftarrow RA_j \\
\quad k = k + 1 \\
\text{End} \\
RT^{0:3} \leftarrow 0x0000 \,\|\, s \\
RT^{4:7} \leftarrow 0 \\
RT^{8:11} \leftarrow 0 \\
RT^{12:15} \leftarrow 0
\end{array}
\]
Average Bytes

\[
\text{avgb rt,ra,rb}
\]

For each of 16 byte slots:

- The operand from register RA is added to the operand from register RB, and ‘1’ is added to the result. These additions are done without loss of precision.
- That result is shifted to the right by 1 bit and placed in register RT.

\[
\begin{array}{l}
\text{For } j = 0 \text{ to } 15 \\
\quad RT^{j} \leftarrow ((0x00 \,\|\, RA^{j}) + (0x00 \,\|\, RB^{j}) + 1)_{7:14} \\
\text{End}
\end{array}
\]
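
Selecting bits 7 to 14 of the 16-bit sum is equivalent to a right shift by one bit that keeps the carry out of the byte addition. A minimal Python sketch, assuming registers are modelled as `bytes` values and using an illustrative function name:

```python
def avgb(ra, rb):
    """Model of Average Bytes: rounded-up mean of each pair of unsigned bytes."""
    # The intermediate sum may need 9 bits; Python integers keep it in full
    # before the shift, so no precision is lost.
    return bytes((a + b + 1) >> 1 for a, b in zip(ra, rb))
```

Because 1 is added before the shift, an odd sum rounds upward: averaging 1 and 2 gives 2, not 1.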
Absolute Differences of Bytes

absdb rt,ra,rb

For each of 16 byte slots:

- The operand in register RA is subtracted from the operand in register RB.
- The absolute value of the result is placed in register RT.

Programming Note: The operands are unsigned.

For j = 0 to 15
   if (RB^j > RA^j) then
      RT^j ← RB^j - RA^j
   else
      RT^j ← RA^j - RB^j
End
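
Because the operands are unsigned, the result always fits in a byte. A short Python sketch of the operation, with an illustrative function name and registers modelled as `bytes` values:

```python
def absdb(ra, rb):
    """Model of Absolute Differences of Bytes on unsigned byte slots."""
    return bytes(b - a if b > a else a - b for a, b in zip(ra, rb))
```

Combined with a byte-summing step, this gives the sum of absolute differences commonly used to compare blocks of pixel data.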
Sum Bytes into Halfwords

sumb rt,ra,rb

```
0 1 0 0 1 0 0 1 1
```

```
RB RA RT
```

For each of four word slots:

- The 4 bytes in register RB are added, and the 16-bit result is placed in bytes 0 and 1 of register RT.
- The 4 bytes in register RA are added, and the 16-bit result is placed in bytes 2 and 3 of register RT.

**Programming Note:** The operands are unsigned.

| RT<0:1> | ← RB<0> + RB<1> + RB<2> + RB<3> |
| RT<2:3> | ← RA<0> + RA<1> + RA<2> + RA<3> |
| RT<4:5> | ← RB<4> + RB<5> + RB<6> + RB<7> |
| RT<6:7> | ← RA<4> + RA<5> + RA<6> + RA<7> |
| RT<8:9> | ← RB<8> + RB<9> + RB<10> + RB<11> |
| RT<10:11> | ← RA<8> + RA<9> + RA<10> + RA<11> |
| RT<12:13> | ← RB<12> + RB<13> + RB<14> + RB<15> |
| RT<14:15> | ← RA<12> + RA<13> + RA<14> + RA<15> |
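
The interleaving of RB and RA sums can be sketched in Python. The function name is illustrative, the inputs are assumed to be 16-byte `bytes` values, and the result is returned as a list of eight halfword values in left-to-right order.

```python
def sumb(ra, rb):
    """Model of Sum Bytes into Halfwords: eight 16-bit sums from two 16-byte registers."""
    out = []
    for w in range(0, 16, 4):
        out.append(sum(rb[w:w + 4]))   # left halfword of the word: bytes of RB
        out.append(sum(ra[w:w + 4]))   # right halfword of the word: bytes of RA
    return out
```

The largest possible sum is 4 × 255 = 1020, so a 16-bit result can never overflow.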
Extend Sign Byte to Halfword

```
xsbh rt,ra
```

For each of eight halfword slots:
- The sign of the byte in the right byte of the operand in register RA is propagated to the left byte.
- The resulting 16-bit integer is stored in register RT.

**Programming Note:** This is the only instruction that treats bytes as signed.

<table>
<thead>
<tr>
<th>RT0:1</th>
<th>← RepLeftBit(RA1,16)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT2:3</td>
<td>← RepLeftBit(RA3,16)</td>
</tr>
<tr>
<td>RT4:5</td>
<td>← RepLeftBit(RA5,16)</td>
</tr>
<tr>
<td>RT6:7</td>
<td>← RepLeftBit(RA7,16)</td>
</tr>
<tr>
<td>RT8:9</td>
<td>← RepLeftBit(RA9,16)</td>
</tr>
<tr>
<td>RT10:11</td>
<td>← RepLeftBit(RA11,16)</td>
</tr>
<tr>
<td>RT12:13</td>
<td>← RepLeftBit(RA13,16)</td>
</tr>
<tr>
<td>RT14:15</td>
<td>← RepLeftBit(RA15,16)</td>
</tr>
</tbody>
</table>
Extend Sign Halfword to Word

\[
xshw \quad rt, ra
\]

\[
\begin{array}{ccccccccc}
0 & 1 & 0 & 1 & 0 & 1 & 1 & 1 & 0
\end{array}
\]

For each of four word slots:

- The sign of the halfword in the right half of the operand in register RA is propagated to the left halfword.
- The resulting 32-bit integer is placed in register RT.

\[
\begin{array}{lcl}
RT^{0:3} & \leftarrow & \text{RepLeftBit}(RA^{2:3}, 32) \\
RT^{4:7} & \leftarrow & \text{RepLeftBit}(RA^{6:7}, 32) \\
RT^{8:11} & \leftarrow & \text{RepLeftBit}(RA^{10:11}, 32) \\
RT^{12:15} & \leftarrow & \text{RepLeftBit}(RA^{14:15}, 32)
\end{array}
\]
Extend Sign Word to Doubleword

xswd rt,ra

For each of two doubleword slots:

- The sign of the word in the right slot is propagated to the left word.
- The resulting 64-bit integer is stored in register RT.

<table>
<thead>
<tr>
<th>RT^0:7</th>
<th>← RepLeftBit(RA^4:7,64)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT^8:15</td>
<td>← RepLeftBit(RA^12:15,64)</td>
</tr>
</tbody>
</table>
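
The three sign-extension instructions share one pattern and can be chained to widen a signed byte to a doubleword. The Python sketch below models a single slot of each instruction. The three-argument helper `rep_left_bit` is a hypothetical stand-in for the manual's RepLeftBit, with the input width made explicit.

```python
def rep_left_bit(value, in_bits, out_bits):
    """Sign-extend an in_bits-wide value to out_bits by replicating its leftmost bit."""
    value &= (1 << in_bits) - 1
    if value >> (in_bits - 1):                       # leftmost bit is set
        value |= ((1 << out_bits) - 1) ^ ((1 << in_bits) - 1)
    return value


def xsbh(halfword):
    """Extend Sign Byte to Halfword for one halfword slot."""
    return rep_left_bit(halfword & 0xFF, 8, 16)


def xshw(word):
    """Extend Sign Halfword to Word for one word slot."""
    return rep_left_bit(word & 0xFFFF, 16, 32)


def xswd(doubleword):
    """Extend Sign Word to Doubleword for one doubleword slot."""
    return rep_left_bit(doubleword & 0xFFFFFFFF, 32, 64)
```

In each case the left part of the input slot is ignored and overwritten, so `xswd(xshw(xsbh(0x80)))` widens the byte 0x80 (decimal -128) through all three sizes.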
And

and rt, ra, rb

\[
\begin{array}{cccccccccccccccc}
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & R_B & R_A & R_T \\
\downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow \\
\end{array}
\]

The values in register RA and register RB are logically ANDed. The result is placed in register RT.

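| RT[0:3] | ← RA[0:3] & RB[0:3] |
| RT[4:7] | ← RA[4:7] & RB[4:7] |
| RT[8:11] | ← RA[8:11] & RB[8:11] |
| RT[12:15] | ← RA[12:15] & RB[12:15] |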
And with Complement

\[ \text{andc } \quad rt, ra, rb \]

Instruction format fields: opcode, RB, RA, RT.

The value in register RA is logically ANDed with the complement of the value in register RB. The result is placed in register RT.

\begin{align*}
\text{RT}^{0:3} & \leftarrow \text{RA}^{0:3} \land (\neg \text{RB}^{0:3}) \\
\text{RT}^{4:7} & \leftarrow \text{RA}^{4:7} \land (\neg \text{RB}^{4:7}) \\
\text{RT}^{8:11} & \leftarrow \text{RA}^{8:11} \land (\neg \text{RB}^{8:11}) \\
\text{RT}^{12:15} & \leftarrow \text{RA}^{12:15} \land (\neg \text{RB}^{12:15})
\end{align*}
And Byte Immediate

\texttt{andbi} \hspace{1cm} \texttt{rt,ra,value}

\[
\begin{array}{|c|c|c|c|}
\hline
0\,0\,0\,1\,0\,1\,1\,0 & I10 & RA & RT \\
\hline
\end{array}
\]

For each of 16 byte slots, the rightmost 8 bits of the I10 field are ANDed with the value in register RA. The result is placed in register RT.

| Target | Assignment |
|--------|------------|
| b | ← I10 & 0x00FF |
| bbbb | ← b \|\| b \|\| b \|\| b |
| RT^{0:3} | ← RA^{0:3} & bbbb |
| RT^{4:7} | ← RA^{4:7} & bbbb |
| RT^{8:11} | ← RA^{8:11} & bbbb |
| RT^{12:15} | ← RA^{12:15} & bbbb |
And Halfword Immediate

\[
\text{andhi } \quad rt, ra, value
\]

For each of eight halfword slots:

- The I10 field is extended to 16 bits by replicating its leftmost bit. The result is ANDed with the value in register RA.
- The 16-bit result is placed in register RT.

\[
\begin{array}{rl}
t & \leftarrow \text{RepLeftBit}(\text{I10},16) \\
\text{RT}^{0:1} & \leftarrow \text{RA}^{0:1} \;\&\; t \\
\text{RT}^{2:3} & \leftarrow \text{RA}^{2:3} \;\&\; t \\
\text{RT}^{4:5} & \leftarrow \text{RA}^{4:5} \;\&\; t \\
\text{RT}^{6:7} & \leftarrow \text{RA}^{6:7} \;\&\; t \\
\text{RT}^{8:9} & \leftarrow \text{RA}^{8:9} \;\&\; t \\
\text{RT}^{10:11} & \leftarrow \text{RA}^{10:11} \;\&\; t \\
\text{RT}^{12:13} & \leftarrow \text{RA}^{12:13} \;\&\; t \\
\text{RT}^{14:15} & \leftarrow \text{RA}^{14:15} \;\&\; t
\end{array}
\]
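The sign extension and per-slot AND described above can be sketched in Python. This is only an illustration: the helper name `rep_left_bit` and its three-argument form (field width passed explicitly) are assumptions made here, and a register is modelled as a list of eight 16-bit integers.

```python
def rep_left_bit(value, in_bits, out_bits):
    """Extend an in_bits-wide field to out_bits by replicating its leftmost bit.

    Hypothetical helper modelled on the RepLeftBit pseudo-function; the
    explicit in_bits argument is an assumption of this sketch.
    """
    value &= (1 << in_bits) - 1
    if value >> (in_bits - 1):  # leftmost bit of the field is set
        value |= ((1 << out_bits) - 1) & ~((1 << in_bits) - 1)
    return value


def andhi(ra_halfwords, i10):
    """AND each of the eight halfword slots of RA with the extended I10 field."""
    t = rep_left_bit(i10, 10, 16)
    return [h & t for h in ra_halfwords]
```

For example, an I10 value of `0x0FF` extends to `0x00FF` and keeps only the low byte of each halfword, while `0x3FF` extends to `0xFFFF` and leaves RA unchanged.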
And Word Immediate

\[ \text{andi} \quad \text{rt},\text{ra},\text{value} \]

For each of four word slots:

- The value of the \( I_{10} \) field is extended to 32 bits by replicating its leftmost bit. The result is ANDed with the contents of register RA.
- The result is placed in register RT.

[Instruction format diagram for andi; recoverable opcode bits: 0 0 0 1 0 1.]

\[
\begin{array}{cccccc}
\text{t} & \leftarrow & \text{RepLeftBit}(I_{10},32) \\
\text{RT}^{0:3} & \leftarrow & \text{RA}^{0:3} \& t \\
\text{RT}^{4:7} & \leftarrow & \text{RA}^{4:7} \& t \\
\text{RT}^{8:11} & \leftarrow & \text{RA}^{8:11} \& t \\
\text{RT}^{12:15} & \leftarrow & \text{RA}^{12:15} \& t \\
\end{array}
\]
Or

\texttt{or rt, ra, rb}

The values in register RA and register RB are logically ORed. The result is placed in register RT.

<table>
<tbody>
<tr>
<td>RT&lt;0:3&gt;</td>
<td>← RA&lt;0:3&gt; | RB&lt;0:3&gt;</td>
</tr>
<tr>
<td>RT&lt;4:7&gt;</td>
<td>← RA&lt;4:7&gt; | RB&lt;4:7&gt;</td>
</tr>
<tr>
<td>RT&lt;8:11&gt;</td>
<td>← RA&lt;8:11&gt; | RB&lt;8:11&gt;</td>
</tr>
<tr>
<td>RT&lt;12:15&gt;</td>
<td>← RA&lt;12:15&gt; | RB&lt;12:15&gt;</td>
</tr>
</tbody>
</table>
Or with Complement

The value in register RA is ORed with the complement of the value in register RB. The result is placed in register RT.

\begin{align*}
\text{RT}^{0:3} &\leftarrow \text{RA}^{0:3} \lor (\neg \text{RB}^{0:3}) \\
\text{RT}^{4:7} &\leftarrow \text{RA}^{4:7} \lor (\neg \text{RB}^{4:7}) \\
\text{RT}^{8:11} &\leftarrow \text{RA}^{8:11} \lor (\neg \text{RB}^{8:11}) \\
\text{RT}^{12:15} &\leftarrow \text{RA}^{12:15} \lor (\neg \text{RB}^{12:15})
\end{align*}
Or Byte Immediate

\[ \text{orbi } \text{rt},\text{ra},\text{value} \]

For each of 16 byte slots:
- The rightmost 8 bits of the I10 field are ORed with the value in register RA.
- The result is placed in register RT.

<table>
<tbody>
<tr>
<td>b</td>
<td>← I10 &amp; 0x00FF</td>
</tr>
<tr>
<td>bbbb</td>
<td>← b || b || b || b</td>
</tr>
<tr>
<td>RT<sup>0:3</sup></td>
<td>← RA<sup>0:3</sup> | bbbb</td>
</tr>
<tr>
<td>RT<sup>4:7</sup></td>
<td>← RA<sup>4:7</sup> | bbbb</td>
</tr>
<tr>
<td>RT<sup>8:11</sup></td>
<td>← RA<sup>8:11</sup> | bbbb</td>
</tr>
<tr>
<td>RT<sup>12:15</sup></td>
<td>← RA<sup>12:15</sup> | bbbb</td>
</tr>
</tbody>
</table>
Or Halfword Immediate

```
orhi rt,ra,value
```

For each of eight halfword slots:

- The I10 field is extended to 16 bits by replicating its leftmost bit. The result is ORed with the value in register RA.
- The result is placed in register RT.

<table>
<thead>
<tr>
<th>t</th>
<th>← RepLeftBit(I10,16)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT0:1</td>
<td>← RA0:1 ∨ t</td>
</tr>
<tr>
<td>RT2:3</td>
<td>← RA2:3 ∨ t</td>
</tr>
<tr>
<td>RT4:5</td>
<td>← RA4:5 ∨ t</td>
</tr>
<tr>
<td>RT6:7</td>
<td>← RA6:7 ∨ t</td>
</tr>
<tr>
<td>RT8:9</td>
<td>← RA8:9 ∨ t</td>
</tr>
<tr>
<td>RT10:11</td>
<td>← RA10:11 ∨ t</td>
</tr>
<tr>
<td>RT12:13</td>
<td>← RA12:13 ∨ t</td>
</tr>
<tr>
<td>RT14:15</td>
<td>← RA14:15 ∨ t</td>
</tr>
</tbody>
</table>
Or Word Immediate

\textbf{ori} \hspace{0.5cm} rt,ra,value

For each of four word slots:

- The I10 field is sign-extended to 32 bits and ORed with the contents of register RA.
- The result is placed in register RT.

\begin{align*}
t &\leftarrow \text{RepLeftBit}(\text{I10},32) \\
\text{RT}^{0:3} &\leftarrow \text{RA}^{0:3} \mid t \\
\text{RT}^{4:7} &\leftarrow \text{RA}^{4:7} \mid t \\
\text{RT}^{8:11} &\leftarrow \text{RA}^{8:11} \mid t \\
\text{RT}^{12:15} &\leftarrow \text{RA}^{12:15} \mid t
\end{align*}
Or Across

orx rt,ra

The four words of RA are logically ORed. The result is placed in the preferred slot of register RT. The other three slots of the register are written with zeros.

<table>
<tbody>
<tr>
<td>RT[0:3]</td>
<td>← RA[0:3] | RA[4:7] | RA[8:11] | RA[12:15]</td>
</tr>
<tr>
<td>RT[4:15]</td>
<td>← 0</td>
</tr>
</tbody>
</table>
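The word-wise OR reduction described above can be sketched as follows. This is a minimal illustration in which a register is modelled as a list of four 32-bit integers, with index 0 standing for the preferred slot; the function name `or_across` is an assumption of this sketch.

```python
def or_across(ra_words):
    """OR the four words of RA into slot 0 and zero the other three slots."""
    acc = 0
    for w in ra_words:
        acc |= w & 0xFFFFFFFF
    return [acc, 0, 0, 0]
```

A common use of such a reduction is testing whether any word in a register is nonzero with a single scalar comparison on the preferred slot.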
Exclusive Or

**xor**

<table>
<thead>
<tr>
<th>xor rt,ra,rb</th>
</tr>
</thead>
</table>

The values in register RA and register RB are logically XORed. The result is placed in register RT.

| RT<0:3>   | ← RA<0:3> ⊕ RB<0:3> |
| RT<4:7>   | ← RA<4:7> ⊕ RB<4:7> |
| RT<8:11>  | ← RA<8:11> ⊕ RB<8:11> |
| RT<12:15> | ← RA<12:15> ⊕ RB<12:15> |
### Exclusive Or Byte Immediate

**xorbi** \( rt, ra, value \)

<table>
<thead>
<tr>
<th>0</th>
<th>1</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
</tr>
</tbody>
</table>

For each of 16 byte slots:

- The rightmost 8 bits of the I10 field are XORed with the value in register RA.
- The result is placed in register RT.

<table>
<tbody>
<tr>
<td>b</td>
<td>← I10 &amp; 0x00FF</td>
</tr>
<tr>
<td>bbbb</td>
<td>← b || b || b || b</td>
</tr>
<tr>
<td>\(RT^{0:3}\)</td>
<td>← \(RA^{0:3}\) ⊕ bbbb</td>
</tr>
<tr>
<td>\(RT^{4:7}\)</td>
<td>← \(RA^{4:7}\) ⊕ bbbb</td>
</tr>
<tr>
<td>\(RT^{8:11}\)</td>
<td>← \(RA^{8:11}\) ⊕ bbbb</td>
</tr>
<tr>
<td>\(RT^{12:15}\)</td>
<td>← \(RA^{12:15}\) ⊕ bbbb</td>
</tr>
</tbody>
</table>
Exclusive Or Halfword Immediate

`xorhi rt,ra,value`

<table>
<thead>
<tr>
<th>0 1 0 0 0 1 0 1</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 1 2 3 4 5 6 7</td>
<td colspan="3">8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The I10 field is extended to 16 bits by replicating the leftmost bit. The resulting value is XORed with the value in register RA.
- The 16-bit result is placed in register RT.

<table>
<tbody>
<tr>
<td>\(t\)</td>
<td>\(\leftarrow \text{RepLeftBit}(I10, 16)\)</td>
</tr>
<tr>
<td>\(RT^{0:1}\)</td>
<td>\(\leftarrow RA^{0:1} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{2:3}\)</td>
<td>\(\leftarrow RA^{2:3} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{4:5}\)</td>
<td>\(\leftarrow RA^{4:5} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{6:7}\)</td>
<td>\(\leftarrow RA^{6:7} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{8:9}\)</td>
<td>\(\leftarrow RA^{8:9} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{10:11}\)</td>
<td>\(\leftarrow RA^{10:11} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{12:13}\)</td>
<td>\(\leftarrow RA^{12:13} \oplus t\)</td>
</tr>
<tr>
<td>\(RT^{14:15}\)</td>
<td>\(\leftarrow RA^{14:15} \oplus t\)</td>
</tr>
</tbody>
</table>
Exclusive Or Word Immediate

xori rt, ra, value

For each of four word slots:
- The I10 field is sign-extended to 32 bits and XORed with the contents of register RA.
- The 32-bit result is placed in register RT.

\[
\begin{array}{rl}
t & \leftarrow \text{RepLeftBit}(\text{I10},32) \\
\text{RT}^{0:3} & \leftarrow \text{RA}^{0:3} \oplus t \\
\text{RT}^{4:7} & \leftarrow \text{RA}^{4:7} \oplus t \\
\text{RT}^{8:11} & \leftarrow \text{RA}^{8:11} \oplus t \\
\text{RT}^{12:15} & \leftarrow \text{RA}^{12:15} \oplus t
\end{array}
\]
Nand

\[\text{nand rt,ra,rb}\]

\begin{verbatim}
Opcode (bits 0 to 10): 0 0 0 1 1 0 0 1 0 0 1
\end{verbatim}

For each of four word slots:

- The complement of the AND of the bit in register RA and the bit in register RB is placed in register RT.

\begin{align*}
\text{RT}^{0:3} &\leftarrow \neg (\text{RA}^{0:3} \;\&\; \text{RB}^{0:3}) \\
\text{RT}^{4:7} &\leftarrow \neg (\text{RA}^{4:7} \;\&\; \text{RB}^{4:7}) \\
\text{RT}^{8:11} &\leftarrow \neg (\text{RA}^{8:11} \;\&\; \text{RB}^{8:11}) \\
\text{RT}^{12:15} &\leftarrow \neg (\text{RA}^{12:15} \;\&\; \text{RB}^{12:15})
\end{align*}
Nor

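For each of four word slots:

- The values in register RA and register RB are logically ORed.
- The result is complemented and placed in register RT.

\begin{align*}
\text{RT}^{0:3} &\leftarrow \neg (\text{RA}^{0:3} \lor \text{RB}^{0:3}) \\
\text{RT}^{4:7} &\leftarrow \neg (\text{RA}^{4:7} \lor \text{RB}^{4:7}) \\
\text{RT}^{8:11} &\leftarrow \neg (\text{RA}^{8:11} \lor \text{RB}^{8:11}) \\
\text{RT}^{12:15} &\leftarrow \neg (\text{RA}^{12:15} \lor \text{RB}^{12:15})
\end{align*}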
Equivalent

eqv rt,ra,rb

For each of four word slots:

- If the corresponding bits in register RA and register RB are the same, the result bit is ‘1’; otherwise, the result bit is ‘0’.
- The result is placed in register RT.

<table>
<tbody>
<tr>
<td>\(RT^{0:3}\)</td>
<td>← \(RA^{0:3}\) ⊕ (¬\(RB^{0:3}\))</td>
</tr>
<tr>
<td>\(RT^{4:7}\)</td>
<td>← \(RA^{4:7}\) ⊕ (¬\(RB^{4:7}\))</td>
</tr>
<tr>
<td>\(RT^{8:11}\)</td>
<td>← \(RA^{8:11}\) ⊕ (¬\(RB^{8:11}\))</td>
</tr>
<tr>
<td>\(RT^{12:15}\)</td>
<td>← \(RA^{12:15}\) ⊕ (¬\(RB^{12:15}\))</td>
</tr>
</tbody>
</table>
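The bitwise equivalence described above (RA XORed with the complement of RB) can be checked with a short Python sketch. Registers are modelled here as lists of four 32-bit integers, and the function name `equivalent` is an assumption of this sketch.

```python
MASK32 = 0xFFFFFFFF


def equivalent(ra_words, rb_words):
    """Per word: result bit is 1 where the RA and RB bits agree, else 0."""
    return [(a ^ (~b & MASK32)) & MASK32 for a, b in zip(ra_words, rb_words)]
```

Because RA ⊕ (¬RB) equals ¬(RA ⊕ RB), the same result could be computed by complementing an ordinary XOR.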
Select Bits

```
selb rt,ra,rb,rc
```

For each of four word slots:
- If the bit in register RC is ‘0’, then select the bit from register RA; otherwise, select the bit from register RB.
- The selected bits are placed in register RT.

\[
RT^{0:15} \leftarrow RC^{0:15} \& RB^{0:15} \mid \neg RC^{0:15} \& RA^{0:15}
\]
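The bit-select formula above maps directly onto integer bitwise operators. The sketch below models each 128-bit register as a Python integer; the name `select_bits` and the integer register model are assumptions made for illustration.

```python
MASK128 = (1 << 128) - 1


def select_bits(ra, rb, rc):
    """Take each bit from RB where the RC bit is 1, and from RA where it is 0."""
    return ((rc & rb) | (~rc & ra)) & MASK128
```

With `rc = 0` the result is RA, and with all 128 bits of `rc` set the result is RB, which is why this operation is often paired with compare instructions that produce all-ones or all-zeros masks.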
Shuffle Bytes

shufb rt,ra,rb,rc

\[
\begin{array}{|c|c|c|c|c|}
\hline
1\;0\;1\;1 & \text{RT} & \text{RB} & \text{RA} & \text{RC} \\
\hline
\end{array}
\]

Registers RA and RB are logically concatenated with the least-significant bit of RA adjacent to the most-significant bit of RB. The bytes of the resulting value are considered to be numbered from 0 to 31.

For each byte slot in registers RC and RT:
- The value in register RC is examined, and a result byte is produced as shown in Table 5-1.
- The result byte is inserted into register RT.

**Table 5-1. Binary Values in Register RC and Byte Results**

<table>
<thead>
<tr>
<th>Value in Register RC (Expressed in Binary)</th>
<th>Result Byte</th>
</tr>
</thead>
<tbody>
<tr>
<td>10xxxxxx</td>
<td>x'00'</td>
</tr>
<tr>
<td>110xxxxx</td>
<td>x'FF'</td>
</tr>
<tr>
<td>111xxxxx</td>
<td>x'80'</td>
</tr>
<tr>
<td>Otherwise</td>
<td>The byte of the concatenated register addressed by the rightmost 5 bits of register RC</td>
</tr>
</tbody>
</table>

Rconcat ← RA || RB
For j = 0 to 15
  b ← RC^j
  If b_{0:1} = 0b10 then c ← 0x00
  else If b_{0:2} = 0b110 then c ← 0xFF
  else If b_{0:2} = 0b111 then c ← 0x80
  else Do
    b ← b & 0x1F
    c ← Rconcat^b
  End
  RT^j ← c
End
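The byte-selection rules in Table 5-1 can be sketched in Python as follows. Registers are modelled as 16-byte `bytes` objects with byte 0 leftmost; the function name `shuffle_bytes` is an assumption of this sketch, not a name from the architecture.

```python
def shuffle_bytes(ra, rb, rc):
    """Build RT byte by byte from RA || RB under control of the bytes of RC."""
    rconcat = ra + rb  # bytes numbered 0 to 31
    out = bytearray(16)
    for j in range(16):
        b = rc[j]
        if b >> 6 == 0b10:      # pattern 10xxxxxx
            c = 0x00
        elif b >> 5 == 0b110:   # pattern 110xxxxx
            c = 0xFF
        elif b >> 5 == 0b111:   # pattern 111xxxxx
            c = 0x80
        else:                   # select by the rightmost 5 bits
            c = rconcat[b & 0x1F]
        out[j] = c
    return bytes(out)
```

As a usage example, a control register holding the byte values 15 down to 0 reverses the byte order of RA, since each control byte simply names the source byte to copy.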
6. Shift and Rotate Instructions

This section describes the SPU shift and rotate instructions.
Shift Left Halfword

```
shlh rt,ra,rb
```

For each of eight halfword slots:

- The contents of register RA are shifted to the left according to the count in bits 11 to 15 of register RB.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 15, the result is zero.
- Bits shifted out of the left end of the halfword are discarded; zeros are shifted in at the right.

**Note:** Each halfword slot has its own independent shift amount.

```
For j = 0 to 15 by 2
    s ← RB^{j::2} & 0x001F
    t ← RA^{j::2}
    for b = 0 to 15
        if b + s < 16 then r_b ← t_{b+s}
        else r_b ← 0
    end
    RT^{j::2} ← r
end
```
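The per-slot behaviour above, including the zero result for counts above 15, can be sketched in Python. Registers are modelled as lists of eight 16-bit integers, and the function name is an assumption of this sketch.

```python
def shift_left_halfword(ra_halfwords, rb_halfwords):
    """Shift each halfword of RA left by the low 5 bits of the matching RB halfword."""
    result = []
    for t, count in zip(ra_halfwords, rb_halfwords):
        s = count & 0x1F  # only the rightmost 5 bits of the slot are used
        result.append((t << s) & 0xFFFF if s < 16 else 0)
    return result
```

Note that a count of 32 behaves like a count of 0, because only five bits of the count are examined.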
Shift Left Halfword Immediate

shlhi rt,ra,value

For each of eight halfword slots:
- The contents of register RA are shifted to the left according to the count in bits 13 to 17 of the I7 field.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 15, the result is zero.
- Bits shifted out of the left end of the halfword are discarded; zeros are shifted in at the right.

s ← RepLeftBit(I7,16) & 0x001F
For j = 0 to 15 by 2
  t ← RA^{j::2}
  for b = 0 to 15
    if b + s < 16 then r_b ← t_{b+s}
    else r_b ← 0
  end
  RT^{j::2} ← r
end
Shift Left Word

\texttt{shl rt,ra,rb}

[Instruction format diagram for shl: opcode followed by the RB, RA, and RT fields; bits numbered 0 to 31.]

For each of four word slots:

- The contents of register RA are shifted to the left according to the count in bits 26 to 31 of register RB.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 31, the result is zero.
- Bits shifted out of the left end of the word are discarded; zeros are shifted in at the right.

\textbf{Note:} Each word slot has its own independent shift amount.

\begin{verbatim}
For j = 0 to 15 by 4
  s ← RB^{j::4} & 0x0000003F
  t ← RA^{j::4}
  for b = 0 to 31
    if b + s < 32 then r_b ← t_{b+s}
    else r_b ← 0
  end
  RT^{j::4} ← r
end
\end{verbatim}
Shift Left Word Immediate

\texttt{shli\ rt,ra,value}

<table>
<thead>
<tr>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>I7</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
<td>13</td>
</tr>
</tbody>
</table>

For each of four word slots:

- The contents of register RA are shifted to the left according to the count in bits 12 to 17 of the I7 field.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 31, the result is zero.
- Bits shifted out of the left end of the word are discarded; zeros are shifted in at the right.

s ← RepLeftBit(I7,32) & 0x0000003F
For j = 0 to 15 by 4
  t ← RA^{j::4}
  for b = 0 to 31
    if b + s < 32 then r_b ← t_{b+s}
    else r_b ← 0
  end
  RT^{j::4} ← r
end
Shift Left Quadword by Bits

```
shlqbi rt,ra,rb
```

```
0 0 1 1 1 0 1 1 1
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
0 1 2 3 4 5 6 7 8
```

The contents of register RA are shifted to the left according to the count in bits 29 to 31 of the preferred slot of register RB. The result is placed in register RT. A shift of up to 7 bit positions is possible.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bits shifted out of the left end of the register are discarded, and zeros are shifted in at the right.

```
s ← RB_{29:31}
for b = 0 to 127
    if b + s < 128 then r_b ← RA_{b+s}
    else r_b ← 0
end
RT ← r
```
Shift Left Quadword by Bits Immediate

```
shlqbii     rt,ra,value

0 0 1 1 1 1 1 1 0 1 1
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

The contents of register RA are shifted to the left according to the count in bits 15 to 17 of the I7 field. The result is placed in register RT. A shift of up to 7 bit positions is possible.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bits shifted out of the left end of the register are discarded, and zeros are shifted in at the right.

```
s ← I7 & 0x07
for b = 0 to 127
    if b + s < 128 then r_b ← RA_{b+s}
    else r_b ← 0
end
RT ← r
```
Shift Left Quadword by Bytes

shlqby rt, ra, rb

The bytes of register RA are shifted to the left according to the count in bits 27 to 31 of the preferred slot of register RB. The result is placed in register RT.

If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 15, the result is zero.

Bytes shifted out of the left end of the register are discarded, and bytes of zeros are shifted in at the right.

s ← RB_{27:31}
for b = 0 to 15
    if b + s < 16 then r^b ← RA^{b+s}
    else r^b ← 0
end
RT ← r
Shift Left Quadword by Bytes Immediate

shlqbyi rt, ra, value

The bytes of register RA are shifted to the left according to the count in bits 13 to 17 of the I7 field. The result is placed in register RT.

If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 15, the result is zero.

Bytes shifted out of the left end of the register are discarded, and zero bytes are shifted in at the right.

s ← I7 & 0x1F
for b = 0 to 15
  if b + s < 16 then rb ← RAb + s
  else rb ← 0
end
RT ← r
Shift Left Quadword by Bytes from Bit Shift Count

shlqbybi rt,ra,rb

The bytes of register RA are shifted to the left according to the count in bits 24 to 28 of the preferred slot of register RB. The result is placed in register RT.

If the count is zero, the contents of register RA are copied unchanged into register RT. If the count is greater than 15, the result is zero.

Bytes shifted out of the left end of the register are discarded, and bytes of zeros are shifted in at the right.

```
s ← RB_{24:28}
for b = 0 to 15
    if b + s < 16 then r^b ← RA^{b+s}
    else r^b ← 0x00
end
RT ← r
```
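One way to read the two count fields is that bits 24 to 28 and bits 29 to 31 of the same word together form an 8-bit bit count, so a byte shift taken "from bit shift count" followed by a shift by bits moves the quadword by an arbitrary number of bit positions. The Python sketch below illustrates that reading; registers are modelled as 128-bit integers, the count is the preferred-slot word of RB, and all function names are assumptions of this sketch.

```python
MASK128 = (1 << 128) - 1


def shift_bytes_from_bit_count(ra, rb_word):
    """Byte shift using bits 24 to 28 of the count word (count divided by 8)."""
    s = (rb_word >> 3) & 0x1F
    return (ra << (8 * s)) & MASK128 if s < 16 else 0


def shift_by_bits(ra, rb_word):
    """Bit shift using bits 29 to 31 of the count word (count modulo 8)."""
    s = rb_word & 0x07
    return (ra << s) & MASK128


def shift_left_quadword(ra, n):
    """Shift a 128-bit value left by n bits by combining the two steps."""
    return shift_by_bits(shift_bytes_from_bit_count(ra, n), n)
```

The same count word is passed to both steps unchanged, so no separate divide or mask operation is needed to split the count.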
Rotate Halfword

For each of eight halfword slots:

- The contents of register RA are rotated to the left according to the count in bits 12 to 15 of register RB.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT.
- Bits rotated out of the left end of the halfword are rotated in at the right end.

Note: Each halfword slot has its own independent rotate amount.

```
For j = 0 to 15 by 2
  s ← RB^{j::2} & 0x000F
  t ← RA^{j::2}
  for b = 0 to 15
    if b + s < 16 then r_b ← t_{b+s}
    else r_b ← t_{b+s-16}
  end
  RT^{j::2} ← r
end
```
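The rotate described above differs from the shift only in that bits leaving the left end re-enter on the right. A minimal Python sketch, with registers modelled as lists of eight 16-bit integers and a function name chosen for this illustration:

```python
def rotate_halfword(ra_halfwords, rb_halfwords):
    """Rotate each halfword of RA left by the low 4 bits of the matching RB halfword."""
    out = []
    for t, count in zip(ra_halfwords, rb_halfwords):
        s = count & 0x0F
        out.append(((t << s) | (t >> (16 - s))) & 0xFFFF)
    return out
```

Because only four bits of the count are used, a count of 17 rotates by the same amount as a count of 1.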
### Rotate Halfword Immediate

**rothi** \(rt, ra, value\)

<table>
<thead>
<tr>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>0</th>
<th>I7</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
<td>↓</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The contents of register RA are rotated to the left according to the count in bits 14 to 17 of the I7 field.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT.
- Bits rotated out of the left end of the halfword are rotated in at the right end.

s ← RepLeftBit(I7,16) & 0x000F
For j = 0 to 15 by 2
  t ← RA^{j::2}
  for b = 0 to 15
    if b + s < 16 then r_b ← t_{b+s}
    else r_b ← t_{b+s-16}
  end
  RT^{j::2} ← r
end
Rotate Word

\texttt{rot rt,ra,rb}

For each of four word slots:

- The contents of register RA are rotated to the left according to the count in bits 27 to 31 of register RB.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT.
- Bits rotated out of the left end of the word are rotated in at the right end.

For j = 0 to 15 by 4
  s ← RB^{j::4} & 0x0000001F
  t ← RA^{j::4}
  for b = 0 to 31
    if b + s < 32 then r_b ← t_{b+s}
    else r_b ← t_{b+s-32}
  end
  RT^{j::4} ← r
end
Rotate Word Immediate

`roti rt,ra,value`

```
  I7  RA    RT
  0 0 0 1 1 1 0 0 0
   ↓   ↓   ↓   ↓   ↓
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

For each of four word slots:

- The contents of register RA are rotated to the left according to the count in bits 13 to 17 of the I7 field.
- The result is placed in register RT.
- If the count is zero, the contents of register RA are copied unchanged into register RT.
- Bits rotated out of the left end of the word are rotated in at the right end.

```
s ← RepLeftBit(I7,32) & 0x0000001F
For j = 0 to 15 by 4
    t ← RA^{j::4}
    for b = 0 to 31
        if b + s < 32 then r_b ← t_{b+s}
        else r_b ← t_{b+s-32}
    end
    RT^{j::4} ← r
end
```
Rotate Quadword by Bytes

\texttt{rotqby \ rt,ra,rb}

\begin{center}
\begin{tabular}{cccc}
0 & 0 & 1 & 1 \\
\downarrow & \downarrow & \downarrow & \downarrow \\
0 & 1 & 2 & 3 \\
\end{tabular}
\end{center}

The bytes in register RA are rotated to the left according to the count in the rightmost 4 bits of the preferred slot of register RB. The result is placed in register RT. Rotation of up to 15 byte positions is possible.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bytes rotated out of the left end of the register are rotated in at the right.

\begin{verbatim}
t4 ← RB_{28:31}
If t4 = 0 then r ← RA
Else Do
    For i = 0 to 15
        c ← mod(i + t4, 16)
        r^i ← RA^c
    End
End
RT ← r
\end{verbatim}
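The byte rotation above amounts to slicing the 16-byte register at the count and swapping the two pieces. A minimal Python sketch, with the register modelled as a 16-byte `bytes` object (byte 0 leftmost) and the count taken from the preferred-slot word of RB; the function name is an assumption of this sketch:

```python
def rotate_quadword_by_bytes(ra, rb_word):
    """Rotate the 16 bytes of RA left by the rightmost 4 bits of the count."""
    s = rb_word & 0x0F
    return ra[s:] + ra[:s]
```

Result byte i comes from source byte (i + s) mod 16, which matches the `mod(i + t4, 16)` indexing in the pseudocode.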
Rotate Quadword by Bytes Immediate

\texttt{rotqbyi} \hspace{1em} \texttt{rt,ra,value}

\begin{verbatim}
0 0 1 1 1 1 1 1 0 0    I7    RA    RT
\end{verbatim}
Rotate Quadword by Bytes from Bit Shift Count

rotqbybi rt,ra,rb

[Instruction format diagram for rotqbybi: opcode followed by the RB, RA, and RT fields; bits numbered 0 to 31.]

The bytes of register RA are rotated to the left according to the count in bits 25 to 28 of the preferred slot of register RB. The result is placed in register RT.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bytes rotated out of the left end of the register are rotated in at the right.

\begin{verbatim}
s ← RB_{24:28}
for b = 0 to 15
    if b + s < 16 then r^b ← RA^{b+s}
    else r^b ← RA^{b+s-16}
end
RT ← r
\end{verbatim}
Rotate Quadword by Bits

\[
\text{rotqbi} \quad \text{rt, ra, rb}
\]

The contents of register RA are rotated to the left according to the count in bits 29 to 31 of the preferred slot of register RB. The result is placed in register RT. Rotation of up to 7 bit positions is possible.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bits rotated out at the left end of the register are rotated in at the right.

s ← RB_{29:31}
for b = 0 to 127
  if b + s < 128 then r_b ← RA_{b+s}
  else r_b ← RA_{b+s-128}
end
RT ← r
Rotate Quadword by Bits Immediate

\[
\text{rotqbii} \quad \text{rt,ra,value}
\]

\[
\begin{array}{cccccc}
0 & 0 & 1 & 1 & 1 & 1 \\
\downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow \\
0 & 1 & 2 & 3 & 4 & 5 \\
\end{array}
\]

The contents of register RA are rotated to the left according to the count in bits 15 to 17 of the I7 field. The result is placed in register RT. Rotation of up to 7 bit positions is possible.

If the count is zero, the contents of register RA are copied unchanged into register RT.

Bits rotated out at the left end of the register are rotated in at the right.

s ← I7_{4:6}
for b = 0 to 127
  if b + s < 128 then r_b ← RA_{b+s}
  else r_b ← RA_{b+s-128}
end
RT ← r
Rotate and Mask Halfword

rothm rt,ra,rb

For each of eight halfword slots:

- The shift_count is (0 - RB) modulo 32.
- If the shift_count is less than 16, then RT is set to the contents of RA shifted right shift_count bits, with zero fill at the left.
- Otherwise, RT is set to zero.

**Note:** Each halfword slot has its own independent rotate amount.

For j = 0 to 15 by 2
  s ← (0 - RB^{j::2}) mod 32
  t ← RA^{j::2}
  for b = 0 to 15
    if b ≥ s then r_b ← t_{b-s}
    else r_b ← 0
  end
  RT^{j::2} ← r
end

**Programming Note:** The *Rotate and Mask* and *Rotate and Mask Algebraic* instructions provide support for a logical right shift and algebraic right shift, respectively. They differ from a conventional right logical or algebraic shift in that the shift amount accepted by the instructions is the twos complement of the right shift amount. Thus, to shift right logically the contents of R2 by the number of bits given in R1, the following sequence could be used:

\[
\begin{array}{ll}
\text{sfi r3,r1,0} & \text{Form twos complement} \\
\text{rothm r4,r2,r3} & \text{Rotate, then mask}
\end{array}
\]

For the immediate forms of these instructions, the formation of the twos complement shift quantity can be performed during assembly or compilation.
Rotate and Mask Halfword Immediate

`rothmi rt,ra,value`

<table>
<thead>
<tr>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>1</th>
<th>I7</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The shift_count is (0 - I7) modulo 32.
- If the shift_count is less than 16, then RT is set to the contents of RA shifted right shift_count bits, with zero fill at the left.
- Otherwise, RT is set to zero.

```
s ← (0 - RepLeftBit(I7,32)) & 0x0000001F
For j = 0 to 15 by 2
    t ← RA^{j::2}
    for b = 0 to 15
        if b ≥ s then r_b ← t_{b-s}
        else r_b ← 0
    end
    RT^{j::2} ← r
end
```

**Programming Note:** The *Rotate and Mask* and *Rotate and Mask Algebraic* instructions provide support for a logical right shift and algebraic right shift, respectively. They differ from a conventional right logical or algebraic shift in that the shift amount accepted by the instructions is the twos complement of the right shift amount. Thus, to shift right logically the contents of R2 by the number of bits given in R1, the following sequence could be used:

```
sfi r3,r1,0 Form twos complement
rothm r4,r2,r3 Rotate, then mask
```

For the immediate forms of these instructions, the formation of the twos complement shift quantity can be performed during assembly or compilation.
Rotate and Mask Word

\texttt{rotm rt,ra,rb}

<table>
<thead>
<tr>
<th>0 0 0 0 1 0 1 1 0 0 1</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 1 2 3 4 5 6 7 8 9 10</td>
<td colspan="3">11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31</td>
</tr>
</tbody>
</table>

For each of four word slots:

- The shift\_count is \((0 - RB)\) modulo 64.
- If the shift\_count is less than 32, then RT is set to the contents of RA shifted right shift\_count bits, with zero fill at the left.
- Otherwise, RT is set to zero.

\begin{verbatim}
For j = 0 to 15 by 4
    s ← (0 - RB^{j::4}) & 0x0000003F
    t ← RA^{j::4}
    for b = 0 to 31
        if b ≥ s then r_b ← t_{b-s}
        else r_b ← 0
    end
    RT^{j::4} ← r
end
\end{verbatim}

**Programming Note:** The \texttt{Rotate and Mask} and \texttt{Rotate and Mask Algebraic} instructions provide support for a logical right shift and algebraic right shift, respectively. They differ from a conventional right logical or algebraic shift in that the shift amount accepted by the instructions is the twos complement of the right shift amount. Thus, to shift right logically the contents of R2 by the number of bits given in R1, the following sequence could be used:

\begin{verbatim}
sfi r3,r1,0 Form twos complement
rotm r4,r2,r3 Rotate, then mask
\end{verbatim}

For the immediate forms of these instructions, the formation of the twos complement shift quantity can be performed during assembly or compilation.
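The two-instruction sequence in the programming note can be traced in Python. This is a sketch under stated assumptions: registers are lists of four 32-bit integers, `subtract_from_immediate` models `sfi` as computing the immediate minus each word of RA, and all function names are chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF


def subtract_from_immediate(ra_words, imm):
    """Model of sfi: each result word is imm - RA, wrapped to 32 bits."""
    return [(imm - w) & MASK32 for w in ra_words]


def rotate_and_mask_word(ra_words, rb_words):
    """Model of rotm: shift right by (0 - RB) modulo 64, zero if that is 32 or more."""
    out = []
    for t, count in zip(ra_words, rb_words):
        s = (0 - count) % 64
        out.append((t & MASK32) >> s if s < 32 else 0)
    return out


def shift_right_logical(values, counts):
    """Logical right shift built from the sequence in the note."""
    negated = subtract_from_immediate(counts, 0)  # form twos complement
    return rotate_and_mask_word(values, negated)  # rotate, then mask
```

A shift count of 32 yields zero rather than wrapping to an unshifted value, because the negated count is reduced modulo 64 and the result is compared against 32.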
Rotate and Mask Word Immediate

\[ \text{rotmi } rt, ra, \text{value} \]

```
0 0 0 0 1 1 1 1 0 0 1    I7    RA    RT
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

For each of four word slots:

- The shift_count is \((0 - I7) \mod 64\).
- If the shift_count is less than 32, then RT is set to the contents of RA shifted right shift_count bits, with zero fill at the left.
- Otherwise, RT is set to zero.

```
s ← (0 - RepLeftBit(I7,32)) & 0x0000003F
For j = 0 to 15 by 4
    t ← RA^{j::4}
    for b = 0 to 31
        if b ≥ s then r_b ← t_{b-s}
        else r_b ← 0
    end
    RT^{j::4} ← r
end
```

**Programming Note:** The *Rotate and Mask* and *Rotate and Mask Algebraic* instructions provide support for a logical right shift and algebraic right shift, respectively. They differ from a conventional right logical or algebraic shift in that the shift amount accepted by the instructions is the two's complement of the right shift amount. Thus, to shift right logically the contents of R2 by the number of bits given in R1, the following sequence could be used.

```
sfi r3, r1, 0 Form two's complement
rotm r4, r2, r3 Rotate, then mask
```

For the immediate forms of these instructions, the formation of the two's complement shift quantity can be performed during assembly or compilation.
Rotate and Mask Quadword by Bytes

\texttt{rotqmby \, rt,ra,rb}

\[
\begin{array}{|c|c|c|c|}
\hline
\text{0 0 1 1 1 0 1 1 1 0 1} & RB & RA & RT \\
\hline
\text{bits } 0{:}10 & \text{bits } 11{:}17 & \text{bits } 18{:}24 & \text{bits } 25{:}31 \\
\hline
\end{array}
\]

The shift\_count is (0 - the preferred word of RB) modulo 32. If the shift\_count is less than 16, then RT is set to the contents of RA shifted right shift\_count bytes, filling at the left with \texttt{x'00'} bytes. Otherwise, RT is set to zero.

\begin{verbatim}
s ← (0 - RB_{27:31}) & 0x1F
for b = 0 to 15
    if b ≥ s then r^b ← RA^{b-s}
    else r^b ← 0x00
end
RT ← r
\end{verbatim}
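
The byte-granular shift can be sketched in Python on a 16-byte value, with byte 0 as the leftmost byte. This is an illustrative model, not an official reference: `rotqmby` here is a hypothetical function name, and its second argument stands for the preferred word of RB.

```python
def rotqmby(ra: bytes, rb_preferred_word: int) -> bytes:
    """Model rotqmby: shift the 16-byte register right by whole bytes, zero filled."""
    if len(ra) != 16:
        raise ValueError("registers are 16 bytes wide")
    s = (0 - rb_preferred_word) & 0x1F    # shift_count in bytes, 0..31
    if s >= 16:
        return bytes(16)                  # counts 16..31 clear the register
    return bytes(s) + ra[:16 - s]         # s zero bytes enter at the left

ra = bytes(range(1, 17))                  # 0x01, 0x02, ..., 0x10
print(rotqmby(ra, -3).hex())              # three zero bytes, then 01..0d
```

Only the low five bits of the count matter, so a register value of -3 (or 29) requests a right shift of three bytes.
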
Rotate and Mask Quadword by Bytes Immediate

rotqmbyi rt,ra,value

The shift_count is (0 - I7) modulo 32. If the shift_count is less than 16, then RT is set to the contents of RA shifted right shift_count bytes, filling at the left with x'00' bytes. Otherwise, all bytes of RT are set to x'00'.

```
s ← (0 - I7) & 0x1F
for b = 0 to 15
    if b ≥ s then r^b ← RA^{b-s}
    else r^b ← 0x00
end
RT ← r
```
Rotate and Mask Quadword Bytes from Bit Shift Count

**rotqmbybi** \( rt, ra, rb \)

The shift_count is \((0 \text{ minus bits 24 to 28 of } RB) \mod 32\). If the shift_count is less than 16, then RT is set to the contents of RA, which is shifted right shift_count bytes, and filled at the left with \(x'00'\) bytes. Otherwise, all bytes of RT are set to \(x'00'\).

\[
\begin{aligned}
& s \leftarrow (0 - RB_{24:28}) \& \text{0x1F} \\
& \text{for } b = 0 \text{ to } 15 \\
& \quad \text{if } b \geq s \text{ then } r^{b} \leftarrow RA^{b-s} \\
& \quad \text{else } r^{b} \leftarrow \text{0x00} \\
& \text{end} \\
& RT \leftarrow r
\end{aligned}
\]
Rotate and Mask Quadword by Bits

**rotqmbi** \( \text{rt,ra,rb} \)

The shift_count is \((0 - \text{the preferred word of } RB) \mod 8\). RT is set to the contents of RA, shifted right by shift_count bits, filling at the left with zero bits.

\[
\begin{aligned}
& s \leftarrow (0 - RB_{29:31}) \& \text{0x07} \\
& \text{for } b = 0 \text{ to } 127 \\
& \quad \text{if } b \geq s \text{ then } r_{b} \leftarrow RA_{b-s} \\
& \quad \text{else } r_{b} \leftarrow 0 \\
& \text{end} \\
& RT \leftarrow r
\end{aligned}
\]
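
As a rough illustration of the bit-granular form, the sketch below treats the register as a 128-bit unsigned Python integer. The function name `rotqmbi` and the integer representation are assumptions made for this example only.

```python
MASK128 = (1 << 128) - 1

def rotqmbi(ra: int, rb_preferred_word: int) -> int:
    """Model rotqmbi: shift the 128-bit value right by 0..7 bits, zero filled."""
    s = (0 - rb_preferred_word) & 0x07    # shift_count in bits, 0..7
    return (ra & MASK128) >> s

top_bit = 1 << 127                        # only bit 0 (the leftmost bit) set
print(rotqmbi(top_bit, -3) == 1 << 124)   # True: the bit moved right by three
print(rotqmbi(0xFF, -8) == 0xFF)          # True: a count of 8 wraps to 0
```

Because the count is taken modulo 8, this instruction alone never shifts by a full byte; whole-byte shifts are the job of the byte forms described above.
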
Rotate and Mask Quadword by Bits Immediate

The shift_count is \((0 - I7) \mod 8\). RT is set to the contents of RA, shifted right by shift_count bits, filling at the left with zero bits.

\[
\begin{aligned}
& s \leftarrow (0 - I7) \& \text{0x07} \\
& \text{for } b = 0 \text{ to } 127 \\
& \quad \text{if } b \geq s \text{ then } r_{b} \leftarrow RA_{b-s} \\
& \quad \text{else } r_{b} \leftarrow 0 \\
& \text{end} \\
& RT \leftarrow r
\end{aligned}
\]
**Rotate and Mask Algebraic Halfword**

**Syntax:**

```
rotmah rt, ra, rb
```

<table>
<thead>
<tr>
<th>Opcode</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:10</td>
<td>bits 11:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The shift_count is `(0 - RB)` modulo 32.
- If the shift_count is less than 16, then RT is set to the contents of RA shifted right shift_count bits, replicating bit 0 (of the halfword) at the left.
- Otherwise, all bits of this halfword of RT are set to bit 0 of this halfword of RA.

**Note:** Each halfword slot has its own independent rotate amount.

```
For j = 0 to 15 by 2
    s ← (0 - RB^{j::2}) & 0x001F
    t ← RA^{j::2}
    for b = 0 to 15
        if b ≥ s then r_b ← t_{b-s}
        else r_b ← t_0
    end
    RT^{j::2} ← r
end
```
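
The per-slot behavior, including the independent shift amount in each halfword, can be sketched as follows. This is a minimal model under stated assumptions: the register is represented as a list of eight 16-bit integers, and `rotmah` is a hypothetical Python function, not a toolchain API.

```python
def rotmah(ra: list, rb: list) -> list:
    """Model rotmah over eight 16-bit slots; each slot uses its own count."""
    out = []
    for a, b in zip(ra, rb):
        s = (0 - b) & 0x1F                           # shift_count, 0..31
        signed = a - 0x10000 if a & 0x8000 else a    # read the slot as signed
        # Counts of 16..31 fill the slot with the sign bit, which is what an
        # arithmetic shift by 15 produces.
        out.append((signed >> min(s, 15)) & 0xFFFF)
    return out

ra = [0x8000, 0x7FFF, 0x8000, 0x0001, 0, 0, 0, 0]
rb = [-1, -1, -20, 0, 0, 0, 0, 0]                    # two's complement counts
print([hex(h) for h in rotmah(ra, rb)])
```
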

Rotate and Mask Algebraic Halfword Immediate

\textbf{rotmahi} \hspace{1em} \textbf{rt,ra,value}

For each of eight halfword slots:

- The \textit{shift\_count} is \((0 - I7) \mod 32\).
- If the \textit{shift\_count} is less than 16, then \textit{RT} is set to the contents of \textit{RA} shifted right \textit{shift\_count} bits, replicating bit 0 (of the halfword) at the left.
- Otherwise, all bits of this halfword of \textit{RT} are set to bit 0 of this halfword of \textit{RA}.

\begin{verbatim}
s ← (0 - RepLeftBit(I7,16)) & 0x001F
For j = 0 to 15 by 2
    t ← RA^{j::2}
    for b = 0 to 15
        if b ≥ s then r_b ← t_{b-s}
        else r_b ← t_0
    end
    RT^{j::2} ← r
end
\end{verbatim}
Rotate and Mask Algebraic Word

**rotma rt,ra,rb**

For each of four word slots:

- The shift_count is \((0 - RB)\) modulo 64.
- If the shift_count is less than 32, then RT is set to the contents of RA shifted right shift_count bits, replicating bit 0 (of the word) at the left.
- Otherwise, all bits of this word of RT are set to bit 0 of this word of RA.

```
For j = 0 to 15 by 4
  s ← (0 - RB^{j::4}) & 0x0000003F
  t ← RA^{j::4}
  for b = 0 to 31
    if b ≥ s then r_b ← t_{b-s}
    else r_b ← t_0
  end
  RT^{j::4} ← r
end
```
Rotate and Mask Algebraic Word Immediate

\textbf{rotmai} \quad \text{rt},\text{ra},\text{value}

\begin{verbatim}
  0 0 0 0 1 1 1 1 0 1 0 I7 RA RT
  ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
\end{verbatim}

For each of four word slots:

- The shift\_count is (0 - I7) modulo 64.
- If the shift\_count is less than 32, then RT is set to the contents of RA shifted right shift\_count bits, replicating bit 0 (of the word) at the left.
- Otherwise, all bits of this word of RT are set to bit 0 of this word of RA.

\begin{verbatim}
s ← (0 - RepLeftBit(I7,32)) & 0x0000003F
For j = 0 to 15 by 4
  t ← RA^{j::4}
  for b = 0 to 31
    if b ≥ s then r_b ← t_{b-s}
    else r_b ← t_0
  end
  RT^{j::4} ← r
end
\end{verbatim}
7. Compare, Branch, and Halt Instructions

This section lists and describes the SPU compare, branch, and halt instructions. For more information on the SPU interrupt facility, see Section 12 on page 238.

Conditional branch instructions operate by examining a value in a register, rather than by accessing a specialized condition code register. The value is taken from the preferred slot. It is usually set by a compare instruction.

Compare instructions perform a comparison of the values in two registers, or a value in a register and an immediate value. The result is indicated by setting into the target register a result value that is the same width as the register operands. If the comparison condition is met, the value is all one bits; if not, the value is all zero bits.

Logical comparison instructions treat the operands as unsigned integers. Other compare instructions treat the operands as two's complement signed integers.
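
The difference between the two interpretations shows up when the same bit patterns are compared both ways. The sketch below is illustrative only; `to_signed`, `cgt_word`, and `clgt_word` are hypothetical helpers that model a single 32-bit slot.

```python
def to_signed(x: int, bits: int) -> int:
    """Interpret the low `bits` bits of x as a two's complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def cgt_word(a: int, b: int) -> int:
    """Signed compare of two 32-bit patterns; all ones if a > b, else zero."""
    return 0xFFFFFFFF if to_signed(a, 32) > to_signed(b, 32) else 0

def clgt_word(a: int, b: int) -> int:
    """Logical (unsigned) compare of the same patterns."""
    return 0xFFFFFFFF if (a & 0xFFFFFFFF) > (b & 0xFFFFFFFF) else 0

a, b = 0xFFFFFFFF, 0x00000001      # -1 versus 1 when read as signed
print(hex(cgt_word(a, b)), hex(clgt_word(a, b)))   # 0x0 0xffffffff
```

Because the result is a full-width mask of ones or zeros, it can feed a conditional branch or serve directly as a bitwise selection mask.
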

A set of “Halt” instructions is provided that stops execution when the tested condition is met. These are intended to be used, for example, to check addresses or subscript ranges in situations where failure to meet the condition is regarded as a serious error. The stop that occurs is not precise, so execution generally cannot be restarted.

Floating-point compare instructions are listed in Section 9 Floating-Point Instructions on page 189 with the other floating-point instructions.
## Halt If Equal

**heq ra,rb**

<table>
<thead>
<tr>
<th>0</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>1</th>
<th>1</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

The value in the preferred slot of register RA is compared with the value in the preferred slot of register RB. If the values are equal, execution of the program stops at or after the halt.

**Programming Note:** RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

If RA⁰:³ = RB⁰:³ then
   Stop after executing zero or more instructions after the halt.

End
Halt If Equal Immediate

\texttt{heqi ra,symbol}

The value in the I10 field is extended to 32 bits by replicating the leftmost bit. The result is algebraically compared to the value in the preferred slot of register RA. If the value from register RA is equal to the immediate value, execution of the SPU program stops at or after the halt instruction.

\textbf{Programming Note:} RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

If RA^{0:3} = RepLeftBit(I10,32) then
    Stop after executing zero or more instructions after the halt.
End
Halt If Greater Than

hgt ra,rb

0 1 0 0 1 0 1 1 0 0 0

RB RA RT

The value in the preferred slot of register RA is compared with the value in the preferred slot of register RB. If the value from register RA is greater than the RB value, execution of the SPU program stops at or after the halt instruction.

Programming Note: RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

If RA[0:3] > RB[0:3] then
Stop after executing zero or more instructions after the halt.

End
Halt If Greater Than Immediate

hgti ra,symbol

The value in the I10 field is extended to 32 bits by replicating the leftmost bit. The result is algebraically compared to the value in the preferred slot of register RA. If the value from register RA is greater than the immediate value, execution of the SPU program stops at or after the halt instruction.

Programming Note: RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

If RA^{0:3} > RepLeftBit(I10, 32) then
    Stop after executing zero or more instructions after the halt.
End
Halt If Logically Greater Than

\texttt{hlgt ra,rb}

The value in the preferred slot of register RA is logically compared with the value in the preferred slot of register RB. If the value from register RA is logically greater than the value from register RB, execution of the SPU program stops at or after the halt instruction.

**Programming Note:** RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

\begin{verbatim}
If RA^{0:3} >u RB^{0:3} then
    Stop after executing zero or more instructions after the halt.
End
\end{verbatim}
Halt If Logically Greater Than Immediate

hlgti ra,symbol

The value in the I10 field is extended to 32 bits by replicating the leftmost bit. The result is logically compared to the value in the preferred slot of register RA. If the value from register RA is logically greater than the immediate value, execution of the SPU program stops at or after the halt instruction.

**Programming Note:** RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.

If RA^{0:3} >u RepLeftBit(I10,32) then
    Stop after executing zero or more instructions after the halt.
End
Compare Equal Byte

\texttt{ceqb \ rt,ra,rb}

<table>
<thead>
<tr>
<th>0 1 1 1 1 0 1 0 0 0 0</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:10</td>
<td>bits 11:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of 16 byte slots:

- The operand from register RA is compared with the operand from register RB. If the operands are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 8-bit result is placed in register RT.

\begin{verbatim}
for i = 0 to 15
    If RA^i = RB^i then
        RT^i ← 0xFF
    else
        RT^i ← 0x00
End
\end{verbatim}
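
A byte-wise equality mask of this kind can be modeled in a few lines of Python. The sketch is a hedged illustration: `ceqb` is a hypothetical function name, and the registers are represented as 16-byte `bytes` objects with byte 0 first.

```python
def ceqb(ra: bytes, rb: bytes) -> bytes:
    """Model ceqb: 0xFF in every byte slot where RA equals RB, 0x00 elsewhere."""
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("registers are 16 bytes wide")
    return bytes(0xFF if a == b else 0x00 for a, b in zip(ra, rb))

ra = b"SPU compare demo"           # exactly 16 bytes
rb = b"SPU Compare Demo"           # differs at byte slots 4 and 12
mask = ceqb(ra, rb)
print(mask.hex())
```
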
Compare Equal Byte Immediate

ceqbi rt,ra,value

0 1 1 1 1 1 1 0

I10 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

For each of 16 byte slots:
- The value in the rightmost 8 bits of the I10 field is compared with the value in register RA. If the two values are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 8-bit result is placed in register RT.

for i = 0 to 15
    If RA^i = I10_{2:9} then
        RT^i ← 0xFF
    else
        RT^i ← 0x00
End
Compare Equal Halfword

\[ \text{ceqh} \quad \text{rt}, \text{ra}, \text{rb} \]

For each of 8 halfword slots:

- The operand from register RA is compared with the operand from register RB. If the operands are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 16-bit result is placed in register RT.

```
for i = 0 to 15 by 2
    If RA^{i::2} = RB^{i::2} then
        RT^{i::2} ← 0xFFFF
    else
        RT^{i::2} ← 0x0000
End
```
Compare Equal Halfword Immediate

```
ceqhi rt,ra,value
```

<table>
<thead>
<tr>
<th>Opcode</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:7</td>
<td>bits 8:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The value in the I10 field is extended to 16 bits by replicating its leftmost bit and compared with the value in register RA. If the two values are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 16-bit result is placed in register RT.

```
for i = 0 to 15 by 2
    if RA^{i::2} = RepLeftBit(I10,16) then
        RT^{i::2} ← 0xFFFF
    else
        RT^{i::2} ← 0x0000
End
```
### Compare Equal Word

**ceq rt,ra,rb**

For each of four word slots:

- The operand from register RA is compared with the operand from register RB. If the operands are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 32-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 4
    If RA^{i::4} = RB^{i::4} then
        RT^{i::4} ← 0xFFFFFFFF
    else
        RT^{i::4} ← 0x00000000
End
```
Compare Equal Word Immediate

**ceqi** \( \text{rt,ra,value} \)

<table>
<thead>
<tr>
<th>0 1 1 1 1 1 0 0</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:7</td>
<td>bits 8:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of four word slots:

- The value in the I10 field is extended to 32 bits by replicating its leftmost bit and compared with the value in register RA. If the two values are equal, a result of all one bits (true) is produced. If they are unequal, a result of all zero bits (false) is produced.
- The 32-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 4
    if RAi::4 = RepLeftBit(I10,32) then
        RTi::4 ← 0xFFFFFFFF
    else
        RTi::4 ← 0x00000000
End
```
Compare Greater Than Byte

cgtb rt,ra,rb

For each of 16 byte slots:

- The operand from register RA is compared with the operand from register RB. If the operand in register RA is greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 8-bit result is placed in register RT.

```
for i = 0 to 15
    If RA_i > RB_i then
        RT_i <- 0xFF
    else
        RT_i <- 0x00
End
```
Compare Greater Than Byte Immediate

cgtbi rt,ra,value

<table>
<thead>
<tr>
<th>Opcode</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:7</td>
<td>bits 8:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of 16 byte slots:

- The value in the rightmost 8 bits of the I10 field is algebraically compared with the value in register RA. If the value in register RA is greater, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.
- The 8-bit result is placed in register RT.

```plaintext
for i = 0 to 15
    If RA^i > I10_{2:9} then
        RT^i ← 0xFF
    else
        RT^i ← 0x00
End
```
Compare Greater Than Halfword

cgth rt, ra, rb

For each of 8 halfword slots:

- The operand from register RA is compared with the operand from register RB. If the operand in register RA is greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 16-bit result is placed in register RT.

```
for i = 0 to 15 by 2
  If RA^{i::2} > RB^{i::2} then
    RT^{i::2} ← 0xFFFF
  else
    RT^{i::2} ← 0x0000
End
```
**Compare Greater Than Halfword Immediate**

cgthi \( \text{rt,ra,value} \)

<table>
<thead>
<tr>
<th>Opcode</th>
<th>I10</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:7</td>
<td>bits 8:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The value in the I10 field is extended to 16 bits by replicating the leftmost bit and algebraically compared with the value in register RA. If the value in register RA is greater than the I10 value, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.
- The 16-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 2
    If RA^{i::2} > RepLeftBit(I10,16) then
        RT^{i::2} ← 0xFFFF
    else
        RT^{i::2} ← 0x0000
End
```
### Compare Greater Than Word

cgt rt,ra,rb

<table>
<thead>
<tr>
<th>0 1 0 0 1 0 0 0 0 0 0</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:10</td>
<td>bits 11:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of four word slots:

- The operand from register RA is compared with the operand from register RB. If the operand in register RA is greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 32-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 4
    if RA[i:4] > RB[i:4] then
        RT[i:4] ← 0xFFFFFFFF
    else
        RT[i:4] ← 0x00000000
End
```
**Compare Greater Than Word Immediate**

**cgti rt,ra,value**

For each of four word slots:

- The value in the I10 field is extended to 32 bits by sign extension and compared with the value in register RA. If the value in register RA is greater than the I10 value, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 32-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 4
    If RAi:4 > RepLeftBit(I10,32) then
        RTi:4 ← 0xFFFFFFFF
    else
        RTi:4 ← 0x00000000
End
```
Compare Logical Greater Than Byte

`clgtb rt, ra, rb`

<table>
<thead>
<tr>
<th>Opcode</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:10</td>
<td>bits 11:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of 16 byte slots:

- The operand from register RA is logically compared with the operand from register RB. If the operand in register RA is greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 8-bit result is placed in register RT.

```plaintext
for i = 0 to 15
    If RA[i] >u RB[i] then
        RT[i] ← 0xFF
    else
        RT[i] ← 0x00
End
```
Compare Logical Greater Than Byte Immediate

clgtbi rt,ra,value

For each of 16 byte slots:

- The value in the rightmost 8 bits of the I10 field is logically compared with the value in register RA. If the value in register RA is logically greater, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.
- The 8-bit result is placed in register RT.

```
for i = 0 to 15
    if RA^i >u I10_{2:9} then
        RT^i ← 0xFF
    else
        RT^i ← 0x00
End
```
Compare Logical Greater Than Halfword

clgth rt,ra,rb

<table>
<thead>
<tr>
<th>0 1 0 1 1 0 0 1 0 0 0</th>
<th>RB</th>
<th>RA</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:10</td>
<td>bits 11:17</td>
<td>bits 18:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

For each of eight halfword slots:

- The operand from register RA is logically compared with the operand from register RB. If the operand in register RA is greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 16-bit result is placed in register RT.

```plaintext
for i = 0 to 15 by 2
    If RA^{i::2} >u RB^{i::2} then
        RT^{i::2} ← 0xFFFF
    else
        RT^{i::2} ← 0x0000
End
```
Compare Logical Greater Than Halfword Immediate

clgthi rt,ra,value

0 1 0 1 1 1 0 1
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
0 1 2 3 4 5 6 7

I10

RA

RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

For each of eight halfword slots:

- The value in the I10 field is extended to 16 bits by replicating the leftmost bit and logically compared with the value in register RA. If the value in register RA is logically greater than the I10 value, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.
- The 16-bit result is placed in register RT.

for i = 0 to 15 by 2
    If RA^{i::2} >u RepLeftBit(I10, 16) then
        RT^{i::2} ← 0xFFFF
    else
        RT^{i::2} ← 0x0000
End
Compare Logical Greater Than Word

\[
\text{clgt } rt, ra, rb
\]

For each of four word slots:

- The operand from register RA is logically compared with the operand from register RB. If the operand in register RA is logically greater than the operand in register RB, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.
- The 32-bit result is placed in register RT.

for i = 0 to 15 by 4
  If RA[i::4] >u RB[i::4] then
    RT[i::4] ← 0xFFFFFFFF
  else
    RT[i::4] ← 0x00000000
End
## Compare Logical Greater Than Word Immediate

**clgti rt,ra,value**

```
0 1 0 1 1 1 0 0  I10  RA  RT
↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

For each of four word slots:

- The value in the I10 field is extended to 32 bits by sign extension and logically compared with the value in register RA. If the value in register RA is logically greater than the I10 value, a result of all one bits (true) is produced. Otherwise, a result of all zero bits (false) is produced.

- The 32-bit result is placed in register RT.

```
for i = 0 to 15 by 4
  if RA[i::4] >u RepLeftBit(I10,32) then
    RT[i::4] ← 0xFFFFFFFF
  else
    RT[i::4] ← 0x00000000
End
```
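
One consequence of sign-extending the immediate before an unsigned comparison is that a negative I10 value becomes a very large unsigned operand. The following sketch illustrates this under stated assumptions: `rep_left_bit` is a hypothetical Python stand-in for the RepLeftBit pseudocode function, and `clgti` models the four word slots as a list of integers.

```python
def rep_left_bit(value: int, width_in: int, width_out: int) -> int:
    """Extend a width_in-bit field to width_out bits by replicating its leftmost bit."""
    value &= (1 << width_in) - 1
    if value >> (width_in - 1):
        value |= ((1 << width_out) - 1) ^ ((1 << width_in) - 1)
    return value

def clgti(ra_words: list, i10: int) -> list:
    """Model clgti: unsigned compare of each word against the extended immediate."""
    imm = rep_left_bit(i10, 10, 32)
    return [0xFFFFFFFF if (w & 0xFFFFFFFF) > imm else 0 for w in ra_words]

words = [0, 5, 0x80000000, 0xFFFFFFFF]
print([hex(x) for x in clgti(words, 4)])
# ['0x0', '0xffffffff', '0xffffffff', '0xffffffff']
print([hex(x) for x in clgti(words, 0x3FF)])   # I10 = -1 extends to 0xFFFFFFFF
# ['0x0', '0x0', '0x0', '0x0']
```
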
Branch Relative

br symbol

0 0 1 1 0 0 1 0 0

I16 ///

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Execution proceeds with the target instruction. The address of the target instruction is computed by adding the value of the I16 field, extended on the right with two zero bits with the result treated as a signed quantity, to the address of the Branch Relative instruction.

Programming Note: If the value of the I16 field is zero, an infinite one-instruction loop is executed.

PC ← (PC + RepLeftBit(I16 || 0b00,32)) & LSLR
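
The target computation can be sketched in Python as follows. This is an illustrative model only: the `LSLR` value of `0x3FFFF` is an assumption (it corresponds to a 256 KB local store; the actual limit is implementation-dependent), and `sign_extend` and `br_target` are hypothetical helper names.

```python
LSLR = 0x3FFFF   # assumed local store limit mask (256 KB); not fixed by this text

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed quantity."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def br_target(pc: int, i16: int) -> int:
    """Model br: PC <- (PC + sign-extended (I16 || 0b00)) & LSLR."""
    offset = sign_extend((i16 & 0xFFFF) << 2, 18)   # append two zero bits
    return (pc + offset) & LSLR

print(hex(br_target(0x100, 2)))        # 0x108: forward two instructions
print(hex(br_target(0x100, 0xFFFF)))   # 0xfc: back one instruction
print(hex(br_target(0x100, 0)))        # 0x100: branch to itself
```

The final mask makes the address wrap within the local store, so in this model a backward branch from address 0 lands near the top of the store.
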
Branch Absolute

Execution proceeds with the target instruction. The address of the target instruction is the value of the I16 field, extended on the right with two zero bits and extended on the left with copies of the most-significant bit.

\[
\text{PC} \leftarrow \text{RepLeftBit}(I16 \parallel 0b00, 32) \& \text{LSLR}
\]
Branch Relative and Set Link

**brsl**  **rt,symbol**

Execution proceeds with the target instruction. In addition, a link register is set.

The address of the target instruction is computed by adding the value of the I16 field, extended on the right with two zero bits with the result treated as a signed quantity, to the address of the Branch Relative and Set Link instruction.

The preferred slot of register RT is set to the address of the byte following the Branch Relative and Set Link instruction. The remaining slots of register RT are set to zero.

**Programming Note:** If the value of the I16 field is zero, an infinite one-instruction loop is executed.

\[
\begin{array}{|c|c|}
\hline
\text{RT}_{0:3} & \leftarrow (PC + 4) \& \text{LSLR} \\
\text{RT}_{4:15} & \leftarrow 0 \\
\text{PC} & \leftarrow (PC + \text{RepLeftBit}(I16 \parallel 0b00,32)) \& \text{LSLR} \\
\hline
\end{array}
\]
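
A minimal sketch of the link behavior, assuming the register is modeled as a list of four 32-bit words with the preferred slot at index 0. The `LSLR` value and the names `sign_extend` and `brsl` are assumptions made for illustration.

```python
LSLR = 0x3FFFF   # assumed local store limit mask (256 KB); not fixed by this text

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed quantity."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def brsl(pc: int, i16: int):
    """Model brsl: return (RT as four words, new PC)."""
    link = (pc + 4) & LSLR                   # address after the branch
    rt = [link, 0, 0, 0]                     # remaining slots are zeroed
    offset = sign_extend((i16 & 0xFFFF) << 2, 18)
    return rt, (pc + offset) & LSLR

rt, pc = brsl(0x1000, 0x0010)
print([hex(w) for w in rt], hex(pc))   # ['0x1004', '0x0', '0x0', '0x0'] 0x1040
```

A callee can later return with an indirect branch through the saved link value.
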
Branch Absolute and Set Link

brasl rt,symbol

<table>
<thead>
<tr>
<th>0 0 1 1 0 0 0 1 0</th>
<th>I16</th>
<th>RT</th>
</tr>
</thead>
<tbody>
<tr>
<td>bits 0:8</td>
<td>bits 9:24</td>
<td>bits 25:31</td>
</tr>
</tbody>
</table>

Execution proceeds with the target instruction. In addition, a link register is set.

The address of the target instruction is the value of the I16 field, extended on the right with two zero bits and extended on the left with copies of the most-significant bit.

The preferred slot of register RT is set to the address of the byte following the Branch Absolute and Set Link instruction. The remaining slots of register RT are set to zero.

\[
\begin{array}{|c|c|}
\hline
\text{RT}_{0:3} & \leftarrow (PC + 4) \& \text{LSLR} \\
\text{RT}_{4:15} & \leftarrow 0 \\
\text{PC} & \leftarrow \text{RepLeftBit}(I16 \parallel 0b00,32) \& \text{LSLR} \\
\hline
\end{array}
\]
Branch Indirect

bi ra

0 0 1 1 0 1 0 0 0 / D E / / / / RA ///

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Execution proceeds with the instruction addressed by the preferred slot of register RA. The rightmost 2 bits of the value in register RA are ignored and assumed to be zero. Interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

PC ← RA^{0:3} & LSLR & 0xFFFFFFFC
if (E = 0 and D = 0) interrupt enable status is not modified
if (E = 1 and D = 0) enable interrupts at target
if (E = 0 and D = 1) disable interrupts at target
if (E = 1 and D = 1) reserved
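
The address masking and the E/D decoding can be sketched together. This is a hedged model: the default `lslr` of `0x3FFFF` is an assumption, the action strings are labels invented for this example, and raising an exception for the reserved combination is a modeling choice rather than specified behavior.

```python
def bi(ra_preferred: int, e: int, d: int, lslr: int = 0x3FFFF):
    """Model bi: return (new PC, interrupt action) for the given E and D bits."""
    pc = ra_preferred & lslr & 0xFFFFFFFC    # low 2 bits are ignored
    if (e, d) == (0, 0):
        action = "unchanged"
    elif (e, d) == (1, 0):
        action = "enable"
    elif (e, d) == (0, 1):
        action = "disable"
    else:
        raise ValueError("E = 1 and D = 1 is reserved")
    return pc, action

print(bi(0x1237, 1, 0))   # (4660, 'enable'); 4660 is 0x1234
```
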
Interrupt Return

```
iret ra
```

Execution proceeds with the instruction addressed by SRR0. RA is considered to be a valid source whose value is ignored. Interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

```
PC ← SRR0
```

- if (E = 0 and D = 0) interrupt enable status is not modified
- if (E = 1 and D = 0) enable interrupts at target
- if (E = 0 and D = 1) disable interrupts at target
- if (E = 1 and D = 1) reserved
Branch Indirect and Set Link if External Data

```
bisled rt,ra
```

```
  0 0 1 1 0 1 0 1 1 / D E / / / / RA RT
  ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

The external condition is examined. If it is false, execution continues with the next sequential instruction. If the external condition is true, the effective address of the next instruction is taken from the preferred word slot of register RA.

The address of the instruction following the `bisled` instruction is placed into the preferred word slot of register RT; the remainder of register RT is set to zero.

If the branch is taken, interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

```
u ← LSLR & (PC + 4)
t ← RA0:3 & LSLR & 0xFFFFFFFC
RT0:3 ← u
RT4:15 ← 0

if (external event) then
    PC ← t
    if (E = 0 and D = 0) interrupt enable status is not modified
    if (E = 1 and D = 0) enable interrupts at target
    if (E = 0 and D = 1) disable interrupts at target
    if (E = 1 and D = 1) reserved
else
    PC ← u
end
```
Branch Indirect and Set Link

\[ \text{bisl rt,ra} \]

The effective address of the next instruction is taken from the preferred word slot of register RA, with the rightmost 2 bits assumed to be zero. The address of the instruction following the \texttt{bisl} instruction is placed into the preferred word slot of register RT. The remainder of register RT is set to zero. Interrupts can be enabled or disabled with the E or D feature bits (see \textit{Section 12 SPU Interrupt Facility} on page 238).

\[
\begin{aligned}
t &\leftarrow RA_{0:3} \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
u &\leftarrow \text{LSLR} \;\&\; (PC + 4) \\
RT_{0:3} &\leftarrow u \\
RT_{4:15} &\leftarrow 0 \\
PC &\leftarrow t
\end{aligned}
\]

- if (E = 0 and D = 0) interrupt enable status is not modified
- if (E = 1 and D = 0) enable interrupts at target
- if (E = 0 and D = 1) disable interrupts at target
- if (E = 1 and D = 1) reserved
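
The address arithmetic shared by the indirect branch-and-link instructions can be sketched in a few lines of Python. This is an illustrative model only: the function name `bisl_addresses` and the default `LSLR` value of `0x3FFFF` (a 256 KB local store) are assumptions made for the example, not values taken from this section.

```python
# Assumed local store limit for the example (256 KB); the real value is
# implementation-dependent.
LSLR = 0x3FFFF


def bisl_addresses(pc, ra_word, lslr=LSLR):
    """Return (target, link) for an indirect branch that sets a link.

    pc      -- address of the branch instruction
    ra_word -- preferred word slot of register RA
    """
    # Target: drop the rightmost 2 bits and wrap to the local store size.
    t = ra_word & lslr & 0xFFFFFFFC
    # Link: address of the instruction that follows the branch.
    u = lslr & (pc + 4)
    return t, u


target, link = bisl_addresses(0x100, 0x1237)
print(hex(target), hex(link))
```

The link value is what the instruction writes to the preferred word slot of RT; the rest of RT is zeroed and is not modelled here.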
Branch If Not Zero Word

brnz      rt,symbol

Examine the preferred slot; if not zero, proceed with the branch target. Otherwise, proceed with the next instruction.

The address of the branch target is computed by appending two zero bits to the value of the I16 field, extending it on the left with copies of the most-significant bit, and adding it to the value of the instruction counter.

\[
\begin{aligned}
&\text{If } RT_{0:3} \neq 0 \text{ then} \\
&\quad PC \leftarrow (PC + \text{RepLeftBit}(I16 \,\|\, \text{0b00})) \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
&\text{else} \\
&\quad PC \leftarrow (PC + 4) \;\&\; \text{LSLR} \\
&\text{End}
\end{aligned}
\]
Branch If Zero Word

brz rt,symbol

Examine the preferred slot. If it is zero, proceed with the branch target. Otherwise, proceed with the next instruction.

The address of the branch target is computed by appending two zero bits to the value of the I16 field, extending it on the left with copies of the most-significant bit, and adding it to the value of the instruction counter.

\[
\begin{aligned}
&\text{If } RT_{0:3} = 0 \text{ then} \\
&\quad PC \leftarrow (PC + \text{RepLeftBit}(I16 \,\|\, \text{0b00})) \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
&\text{else} \\
&\quad PC \leftarrow (PC + 4) \;\&\; \text{LSLR} \\
&\text{End}
\end{aligned}
\]
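
The relative branch target computation used by `brz`, `brnz`, `brhz`, and `brhnz` can be modelled as follows. This is a minimal sketch: the helper names `rep_left_bit` and `brz_next_pc` and the default `lslr` of `0x3FFFF` are assumptions for illustration.

```python
def rep_left_bit(value, width):
    """Sign-extend a width-bit field to a Python integer."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def brz_next_pc(pc, rt_word, i16, lslr=0x3FFFF):
    """Next program counter for Branch If Zero Word."""
    if rt_word & 0xFFFFFFFF == 0:
        # Append two zero bits to I16, then sign-extend the 18-bit value.
        offset = rep_left_bit((i16 & 0xFFFF) << 2, 18)
        return (pc + offset) & lslr & 0xFFFFFFFC
    # Branch not taken: fall through to the next sequential instruction.
    return (pc + 4) & lslr


print(hex(brz_next_pc(0x100, 0, 0xFFFF)))  # I16 = -1 word
```

The other three instructions differ only in the condition tested, so the same offset logic applies with the comparison replaced.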
Branch If Not Zero Halfword

brhnz rt,symbol

Examine the preferred slot. If the rightmost halfword is not zero, proceed with the branch target. Otherwise, proceed with the next instruction.

The address of the branch target is computed by appending two zero bits to the value of the I16 field, extending it on the left with copies of the most-significant bit, and adding it to the value of the instruction counter.

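\[
\begin{aligned}
&\text{If } RT_{2:3} \neq 0 \text{ then} \\
&\quad PC \leftarrow (PC + \text{RepLeftBit}(I16 \,\|\, \text{0b00})) \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
&\text{else} \\
&\quad PC \leftarrow (PC + 4) \;\&\; \text{LSLR} \\
&\text{End}
\end{aligned}
\]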
Branch If Zero Halfword

brhz rt,symbol

\[
\begin{array}{ccccccccccc}
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & \text{I16} & \text{RT}
\end{array}
\]

Examine the preferred slot. If the rightmost halfword is zero, proceed with the branch target. Otherwise, proceed with the next instruction.

The address of the branch target is computed by appending two zero bits to the value of the I16 field, extending it on the left with copies of the most-significant bit, and adding it to the value of the instruction counter.

```plaintext
If RT_{2:3} = 0 then
    PC ← (PC + RepLeftBit(I16 || 0b00)) & LSLR & 0xFFFFFFFC
else
    PC ← (PC + 4) & LSLR
End
```
Branch Indirect If Zero

If the preferred slot of register RT is not zero, execution proceeds with the next sequential instruction. Otherwise, execution proceeds at the address in the preferred slot of register RA, treating the rightmost 2 bits as zero. If the branch is taken, interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

\[
\begin{aligned}
t &\leftarrow \text{RA}_{0:3} \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
u &\leftarrow \text{LSLR} \;\&\; (\text{PC} + 4)
\end{aligned}
\]

If \( \text{RT}_{0:3} = 0 \) then

- \( \text{PC} \leftarrow t \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \)
  - if (E = 0 and D = 0) interrupt enable status is not modified
  - if (E = 1 and D = 0) enable interrupts at target
  - if (E = 0 and D = 1) disable interrupts at target
  - if (E = 1 and D = 1) reserved

else

- \( \text{PC} \leftarrow u \)

End
Branch Indirect If Not Zero

`binz rt, ra`

If the preferred slot of register RT is zero, execution proceeds with the next sequential instruction. Otherwise, execution proceeds at the address in the preferred slot of register RA, treating the rightmost 2 bits as zero. If the branch is taken, interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

```
t ← RA[0:3] & LSLR & 0xFFFFFFFC
u ← LSLR & (PC + 4)

If RT[0:3] ≠ 0 then
    PC ← t & LSLR & 0xFFFFFFFC
    if (E = 0 and D = 0) interrupt enable status is not modified
    if (E = 1 and D = 0) enable interrupts at target
    if (E = 0 and D = 1) disable interrupts at target
    if (E = 1 and D = 1) reserved
else
    PC ← u
End
```
Branch Indirect If Zero Halfword

`bihz rt,ra`

If the rightmost halfword of the preferred slot of register RT is not zero, execution proceeds with the next sequential instruction. Otherwise, execution proceeds at the address in the preferred slot of register RA, treating the rightmost 2 bits as zero. If the branch is taken, interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

```plaintext
t ← RA0:3 & LSLR & 0xFFFFFFFC
u ← LSLR & (PC + 4)

If RT2:3 = 0 then
    PC ← t & LSLR & 0xFFFFFFFC
    if (E = 0 and D = 0) interrupt enable status is not modified
    if (E = 1 and D = 0) enable interrupts at target
    if (E = 0 and D = 1) disable interrupts at target
    if (E = 1 and D = 1) reserved
else
    PC ← u
End
```

Branch Indirect If Not Zero Halfword

bihnz rt,ra

If the rightmost halfword of the preferred slot of register RT is zero, execution proceeds with the next sequential instruction. Otherwise, execution proceeds at the address in the preferred slot of register RA, treating the rightmost 2 bits as zero. If the branch is taken, interrupts can be enabled or disabled with the E or D feature bits (see Section 12 SPU Interrupt Facility on page 238).

\[
\begin{aligned}
t &\leftarrow \text{RA}_{0:3} \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \\
u &\leftarrow \text{LSLR} \;\&\; (\text{PC} + 4)
\end{aligned}
\]

If \( \text{RT}_{2:3} \neq 0 \) then

- \( \text{PC} \leftarrow t \;\&\; \text{LSLR} \;\&\; \text{0xFFFFFFFC} \)
  - if (E = 0 and D = 0) interrupt enable status is not modified
  - if (E = 1 and D = 0) enable interrupts at target
  - if (E = 0 and D = 1) disable interrupts at target
  - if (E = 1 and D = 1) reserved

else

- \( \text{PC} \leftarrow u \)

End
8. Hint-for-Branch Instructions

This section lists and describes the SPU hint-for-branch instructions.

These instructions have no semantics. They provide a hint to the implementation about a future branch instruction, with the intention that the information be used to improve performance by either prefetching the branch target or by other means.

Each of the hint-for-branch instructions specifies the address of a branch instruction and the address of the expected branch target address. If the expectation is that the branch is not taken, the target address is the address of the instruction following the branch.

The instructions in this section use the variables *brinst* and *brtarg*, which are defined as follows:

- *brinst* = RO
- *brtarg* = I16
Hint for Branch (r-form)

The address of the branch target is given by the contents of the preferred slot of register RA. The RO field gives the signed word offset from the `hbr` instruction to the branch instruction. If the P feature bit is set, the instruction ignores the value of RA and instead allows an inline prefetch to occur. When the P feature bit is set, the RO field, formed by concatenating ROH (high) and ROL (low), must be set to zero.

```
branch target address ← RA(0:3) & LSLR & 0xFFFFFFFC
branch instruction address ← (RepLeftBit(ROH || ROL || 0b00,32) + PC) & LSLR
```
Hint for Branch (a-form)

```
hbra  brinst,brtarg
```

The address of the branch target is specified by an address in the I16 field. The value has 2 bits of zero appended on the right before it is used.

The RO field, formed by concatenating ROH (high) and ROL (low), gives the signed word offset from the `hbra` instruction to the branch instruction.

- `branch target address ← RepLeftBit(I16 || 0b00,32) & LSLR`
- `branch instruction address ← (RepLeftBit(ROH || ROL || 0b00,32) + PC) & LSLR`
Hint for Branch Relative

```
hbrr     brinst,brtarg
```

The address of the branch target is specified by a word offset given in the I16 field. The signed I16 field is added to the address of the `hbrr` instruction to determine the absolute address of the branch target.

The RO field, formed by concatenating ROH (high) and ROL (low), gives the signed word offset from the `hbrr` instruction to the branch instruction.

```
branch target address ← (RepLeftBit(I16 || 0b00,32) + PC) & LSLR
branch instruction address ← (RepLeftBit(ROH || ROL || 0b00,32) + PC) & LSLR
```
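
As an illustration of the two address computations for `hbrr`, the sketch below evaluates both formulas in Python. The 9-bit width given to the RO field (`ro_bits=9`), the default `lslr`, and the function names are assumptions for the example; this section does not state the field widths.

```python
def sign_extend(value, width):
    """Sign-extend a width-bit field to a Python integer."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def hbrr_addresses(pc, ro, i16, ro_bits=9, lslr=0x3FFFF):
    """Return (branch instruction address, branch target address).

    ro  -- ROH || ROL taken as one signed word offset (width assumed)
    i16 -- signed word offset to the branch target
    """
    ro_field = ro & ((1 << ro_bits) - 1)
    # Both offsets are word offsets: append 0b00 before sign extension.
    branch = (sign_extend(ro_field << 2, ro_bits + 2) + pc) & lslr
    target = (sign_extend((i16 & 0xFFFF) << 2, 18) + pc) & lslr
    return branch, target


branch, target = hbrr_addresses(0x200, ro=3, i16=0x10)
print(hex(branch), hex(target))
```

For `hbra` the target line would instead use the sign-extended `I16 || 0b00` value directly, without adding the program counter.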
9. Floating-Point Instructions

This section lists and describes the SPU floating-point instructions. This section also describes the differences between SPU floating point and IEEE standard floating point.

Although the single-precision, floating-point instructions do not calculate results compliant with IEEE Standard 754, the data formats for single-precision and double-precision floating-point instructions that are used in the SPU are those defined by IEEE Standard 754.

9.1 Single Precision (Extended-Range Mode)

For single-precision operations, the range of normalized numbers is extended. However, the full standard is not implemented. The range of nonzero numbers that can be represented and operated on in the SPU is between the minimum and maximum listed in Table 9-1.

<table>
<thead>
<tr>
<th>Number Format</th>
<th>Minimum (Smin)</th>
<th>Maximum (Smax)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Binary</td>
<td>(001)[1.000...000]</td>
<td>(255)[1.111...111]</td>
</tr>
<tr>
<td>Decimal</td>
<td>1 x 2^{-126}</td>
<td>(2 - 2^{-23}) x 2^{128}</td>
</tr>
<tr>
<td></td>
<td>1.2 x 10^{-38}</td>
<td>6.8 x 10^{38}</td>
</tr>
</tbody>
</table>

Zero has two representations:
- For a positive zero, all bits are zero; that is, the sign, exponent, and fraction are zero.
- For a negative zero, the sign is one; the exponent and fraction are zero.

As inputs, both kinds of zero are supported; however, a zero result is always a positive zero.

For single-precision operations:
- Not a Number (NaN) is not supported as an operand, and is not produced as a result.
- Infinity (Inf) is not supported. An operation that produces a magnitude greater than the largest number representable in the target floating-point format instead produces a number with the appropriate sign, the largest biased exponent, and a magnitude of all (binary) ones. It is important to note that the representation of Inf, which is used on the power processor unit (PPU) and conforms to the IEEE standard, is interpreted by the SPU as a number that is smaller than the largest number used on the SPU.
- Denorms are not supported, and are treated as zero. Thus, an operation that would generate a denorm under IEEE rules instead generates a +0. If a denorm is used as an operand, it is treated as a zero.
- The only supported rounding mode is truncation (toward zero).
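
To make the extended-range interpretation concrete, the following sketch decodes a 32-bit pattern the way the rules above describe. The function name `decode_extended` is hypothetical, and the model returns a plain positive zero for every zero or denorm input, so the sign of a negative zero is not modelled.

```python
def decode_extended(bits):
    """Interpret a 32-bit pattern as an SPU extended-range single value."""
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        # Zeros and denorms are both treated as zero.
        return 0.0
    # Every nonzero exponent, including 255, encodes a normalized number.
    return sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


S_MIN = decode_extended(0x00800000)     # smallest nonzero magnitude
S_MAX = decode_extended(0x7FFFFFFF)     # largest magnitude
IEEE_INF = decode_extended(0x7F800000)  # the IEEE infinity bit pattern

print(S_MIN, S_MAX, IEEE_INF)
```

Decoding the IEEE infinity pattern gives 2^128, which is below Smax, in line with the remark that the SPU reads Inf as an ordinary number smaller than its largest value.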

For extended-range arithmetic, four kinds of exception conditions are tested: overflow, underflow, divide-by-zero, and IEEE noncompliant result.

- Overflow (OVF)
  An overflow exception occurs when the magnitude of the result before rounding is bigger than the largest positive representable number, Smax. If the operation in slice \( k \) produces an overflow, the OVF flag for slice \( k \) in the Floating-Point Status and Control Register (FPSCR) is set, and the result is saturated to Smax with the appropriate sign.

- Underflow (UNF)
  An underflow exception occurs when the magnitude of the result before rounding is smaller than the smallest positive representable number, Smin. If the operation in slice \( k \) produces an underflow, the UNF flag for slice \( k \) in the FPSCR is set, and the result is saturated to +0.

- Divide-by-Zero (DBZ)
  A divide-by-zero exception occurs when the input of an estimate instruction has a zero exponent. If the operation in slice \( k \) produces a divide-by-zero exception, the DBZ flag for slice \( k \) in the FPSCR is set.

- IEEE noncompliant result (DIFF)
  A different-from-IEEE exception indicates that the nonzero result produced with extended-range arithmetic could be different from the IEEE result. This occurs when one of the following conditions exists:
  - Any of the inputs or the result has a maximal exponent. (IEEE arithmetic treats such an operand as NaN or Infinity; extended-range arithmetic treats it as a normalized value.)
  - Any of the inputs has a zero exponent and a nonzero fraction. (IEEE arithmetic treats such an operand as a denormal number; extended-range arithmetic treats it as a zero.)
  - An underflow occurs; that is, the result before rounding is different from zero and the result after rounding is zero.

  If this happens for the operation in slice \( k \), the DIFF flag for slice \( k \) in the FPSCR is set.

These exceptions can only be set by extended-range floating-point instructions. Table 9-2 lists the instructions for which exceptions can be set.

Table 9-2. Instructions and Exception Settings

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Set OVF</th>
<th>Set UNF</th>
<th>Set DBZ</th>
<th>Set DIFF</th>
</tr>
</thead>
<tbody>
<tr>
<td>fa, fs, fm, fma, fms, fnms, fi</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>frest, frsqest</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>csflt, cuflt</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>cflts, cfltu, fceq, fcneq, fcgt, fcmgt</td>
<td>No</td>
<td>No</td>
<td>No</td>
<td>No</td>
</tr>
</tbody>
</table>
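
The saturation rules for overflow and underflow in Section 9.1 can be sketched as a small Python function. This is a simplified model under stated assumptions: it takes the result before rounding as a Python float, ignores truncation, and checks the DIFF conditions only on the result (not on the inputs). The name `saturate` is hypothetical.

```python
S_MIN = 2.0 ** -126
S_MAX = (2 - 2.0 ** -23) * 2.0 ** 128


def saturate(exact):
    """Return (result, flags) for one slice of an extended-range operation."""
    flags = set()
    magnitude = abs(exact)
    if magnitude > S_MAX:
        # Overflow: saturate to Smax with the appropriate sign.
        flags.add("OVF")
        result = S_MAX if exact > 0 else -S_MAX
    elif 0 < magnitude < S_MIN:
        # Underflow: the result becomes +0, which also differs from IEEE.
        flags.update({"UNF", "DIFF"})
        result = 0.0
    else:
        result = exact
    if abs(result) >= 2.0 ** 128:
        # The result has a maximal exponent, so IEEE would see Inf or NaN.
        flags.add("DIFF")
    return result, flags


print(saturate(1e39))
print(saturate(1e-40))
```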

9.2 Double Precision

For double precision, normal IEEE semantics and definitions apply. The range of the nonzero numbers
supported by this format is between the minimum and the maximum listed in Table 9-3.

Table 9-3. Double-Precision (IEEE Mode) Minimum and Maximum Values

<table>
<thead>
<tr>
<th>Number Format</th>
<th>Minimum (Dmin) Denormalized</th>
<th>Maximum (Dmax) Normalized</th>
</tr>
</thead>
<tbody>
<tr>
<td>Binary</td>
<td>(0001)[0.]000...001</td>
<td>(2046)[1.]111...111</td>
</tr>
<tr>
<td>Decimal</td>
<td>$2^{-52} \times 2^{-1022}$</td>
<td>$(2 - 2^{-52}) \times 2^{1023}$</td>
</tr>
<tr>
<td></td>
<td>$4.9 \times 10^{-324}$</td>
<td>$1.8 \times 10^{308}$</td>
</tr>
</tbody>
</table>

For double-precision operations:

• Only a subset of the operations required by the IEEE standard is supported in hardware.
• All four rounding modes are supported. The field RN in the FPSCR specifies the current rounding mode.
• The IEEE exceptions are detected and accumulated in the FPSCR. Trapping is not supported.

The IEEE standard recognizes two kinds of NaNs. These are values that have the maximum biased exponent value and a nonzero fraction value. The sign bit is ignored. If the high-order bit of the fraction field is ‘0’, then the NaN is a Signaling NaN (SNaN); otherwise, it is a Quiet NaN (QNaN). When a QNaN is the result of a floating-point operation, the result is always the default QNaN. That is, the high-order bit of the fraction field is ‘1’, all the other bits of the fraction field are zero, and the sign bit is zero.
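
The NaN classification just described maps directly onto the double-precision bit fields. The sketch below is illustrative; the names `classify_nan` and `DEFAULT_QNAN` are chosen for the example.

```python
# Default QNaN: sign 0, maximum exponent, only the high-order fraction bit set.
DEFAULT_QNAN = 0x7FF8000000000000


def classify_nan(bits):
    """Classify a 64-bit double-precision pattern as SNaN, QNaN, or neither."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent != 0x7FF or fraction == 0:
        return "not a NaN"  # finite number, zero, or infinity
    # The sign bit is ignored; the high-order fraction bit decides the kind.
    return "QNaN" if fraction >> 51 else "SNaN"


print(classify_nan(DEFAULT_QNAN))
print(classify_nan(0x7FF0000000000001))
```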

The IEEE standard and the PowerPC Architecture have very strict rules on the propagation of NaNs, which are not implemented in this architecture. Thus, whenever a QNaN result is due to propagating an input QNaN or SNaN, the NAN flag in the FPSCR is set in order to signal a possibly noncompliant result.

Denorms are only supported as results. A denormal operand is treated as zero (this also applies to the setting of the IEEE flags); the sign of the operand is preserved. Whenever a denormal operand is forced to zero, the DENORM flag in the FPSCR is set in order to signal a possibly noncompliant result.

### 9.2.1 Conversions Between Single and Double-Precision Format

There are two types of conversions: one rounding a double-precision number to a single-precision number, the other extending a single-precision number to a double-precision number. Both operations comply with the IEEE standard, except for the handling of denormal inputs, which are forced to zero. Thus, for these two operations, NaNs, infinities, and denormal results are supported in double as well as in single precision. The range of nonzero IEEE single-precision numbers is between the minimum and the maximum listed in Table 9-4.

<table>
<thead>
<tr>
<th>Number Format</th>
<th>Minimum (Smin) Denormalized</th>
<th>Maximum (Smax) Normalized</th>
</tr>
</thead>
<tbody>
<tr>
<td>Binary</td>
<td>(001)[0.000... 001]</td>
<td>(254)[1.111... 111]</td>
</tr>
<tr>
<td>Decimal</td>
<td>$2^{-23} \times 2^{-126}$</td>
<td>$(2 - 2^{-23}) \times 2^{127}$</td>
</tr>
<tr>
<td></td>
<td>$1.4 \times 10^{-45}$</td>
<td>$3.4 \times 10^{38}$</td>
</tr>
</tbody>
</table>

### 9.2.2 Exception Conditions

This architecture only supports nontrap exception handling; that is, exception conditions are detected and reported in the appropriate fields of the FPSCR. These flags are sticky; once set, they remain set until they are cleared by an FPSCR-write instruction. These exception flags are not set by the single-precision operations executed in the extended range. Since the double-precision operations are 2-way SIMD, there are two sets of these flags.

**Inexact Result (INX)**

An inexact result is detected when the delivered result value differs from what would have been computed if both the exponent range and precision were unbounded.

**Overflow (OVF)**

An overflow occurs when the magnitude of what would have been the rounded result if the exponent range were unbounded exceeds that of the largest finite number of the specified result precision.
**Underflow (UNF)**

For nontrap exception handling, the IEEE 754 standard defines the underflow as the following:

\[
\text{UNF} = \text{tiny} \, \text{AND} \, \text{loss\_of\_accuracy}
\]

The standard gives two definitions each for tiny and loss of accuracy, and an implementation is free to choose any of the four combinations. This architecture implements *tiny-before-rounding* and *inexact result* (INX), thus:

\[
\text{UNF} = \text{tiny\_before\_rounding} \, \text{AND} \, \text{inexact\_result}
\]

**Note:** Tiny before rounding is detected when a nonzero result value, computed as though the exponent range were unbounded, would be less in magnitude than the smallest normalized number.
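
The underflow definition can be checked numerically with exact rational arithmetic. The sketch below assumes round-to-nearest (the RN = 00 mode) because that is what Python's `float()` conversion provides; the function name `underflow_flag` is hypothetical.

```python
from fractions import Fraction

# Smallest normalized double-precision magnitude, 2^-1022.
D_MIN_NORMAL = Fraction(1, 2**1022)


def underflow_flag(exact):
    """UNF = tiny_before_rounding AND inexact_result, for an exact Fraction."""
    tiny_before_rounding = exact != 0 and abs(exact) < D_MIN_NORMAL
    # The delivered result is inexact if rounding to a double changes the value.
    inexact_result = Fraction(float(exact)) != exact
    return tiny_before_rounding and inexact_result


print(underflow_flag(Fraction(1, 2**1030)))      # exact denormal result
print(underflow_flag(Fraction(1, 3 * 2**1030)))  # tiny and inexact
```

A tiny result that is exactly representable as a denormal does not set UNF under this definition, which is the practical difference from a tiny-only rule.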

**Invalid Operation (INV)**

An invalid operation exception occurs whenever an operand is invalid for the specified operation. For operations implemented in hardware, the following operations give rise to an invalid operation exception condition:

- Any floating-point operation on a signaling NaN (SNaN)
- Magnitude subtraction of infinities (that is, infinity - infinity) in add, subtract, and fused multiply-add operations
- Multiplication of infinity by zero.

**Note:** Denormal inputs are treated as zeros.

**Not Propagated NAN (NAN)**

The IEEE standard and the PowerPC Architecture require special handling of input NaNs, but SPU implementations can deliver the default QNaN as a result of double-precision operations. When at least one of the inputs is a NaN, the resulting QNaN can differ from the result delivered by a fully PowerPC-compliant design. This is flagged in the NAN field.

**Denormal Input Forced to Zero (DENORM)**

SPU implementations can force certain double-precision denormal operands to zeros before the processing of double-precision operations. If an implementation forces these operands to zeros, the zero will preserve the sign of the original denormal value. When a denormal input is forced to zero, the DENORM exception flag is set in the FPSCR to signal that the result could differ from an IEEE-compliant result.

**Programming Note:** Applications that require IEEE-compliant double-precision results can use the NAN and DENORM flags in the FPSCR to detect noncompliant results. This allows the code to be re-executed in a less efficient but compliant manner. Both flags are sticky, so large blocks of code can be guarded, minimizing the overhead of the code checking. For example,

```c
clear fpscr
fast code block
if (NAN || DENORM)
{
    compliant code block
}
```

On SPUs within CBEA-compliant processors, the SPU can stop and signal the PPE to request that the PPE perform the calculation and then restart the SPU.
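
The guard pattern from the programming note can be modelled in Python to show its control flow. Everything here is a toy stand-in: the `Fpscr` class, `raise_flag`, and `run_guarded` are hypothetical names, and real code would read and clear the register with the FPSCR read and write instructions instead.

```python
class Fpscr:
    """Toy model of the sticky FPSCR flags; not the hardware register."""

    def __init__(self):
        self.flags = set()

    def clear(self):
        self.flags.clear()

    def raise_flag(self, name):
        # Sticky: a flag stays set until clear() is called.
        self.flags.add(name)


def run_guarded(fpscr, fast_block, compliant_block):
    """Run the fast block; redo the work if NAN or DENORM was raised."""
    fpscr.clear()
    result = fast_block(fpscr)
    if fpscr.flags & {"NAN", "DENORM"}:
        result = compliant_block()
    return result


value = run_guarded(Fpscr(), lambda f: 1.0, lambda: 2.0)
print(value)
```

Because the flags are sticky, the check runs once after the whole fast block, which is what keeps the overhead low for large guarded regions.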
Table 9-5 lists the instructions for which exceptions can be set.

### Table 9-5. Instructions and Exception Settings

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Set OVF</th>
<th>Set UNF</th>
<th>Set INX</th>
<th>Set INV</th>
<th>Set NAN</th>
<th>Set DENORM</th>
</tr>
</thead>
<tbody>
<tr>
<td>dfa, dfs, dfm, dfma, dfms, dfnms, dfnma</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>fesd</td>
<td>No</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>frds</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>

### 9.3 Floating-Point Status and Control Register (FPSCR)

The Floating-Point Status and Control Register (FPSCR) records the status resulting from the floating-point operations and controls the rounding mode for double-precision operations. The FPSCR is read by the Floating-Point Status and Control Register Read instruction (fscrrd) and written with the FPSCR-write instruction (fscrwr). Bits [22:23] are control bits; the remaining bits are either status bits or unused. All the status bits in the FPSCR are sticky. That is, once set, the sticky bits remain set until they are cleared by an fscrwr instruction.

The format of the FPSCR is as follows.

<table>
<thead>
<tr>
<th>Bits</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0:21</td>
<td>Unused</td>
</tr>
<tr>
<td>22:23</td>
<td>Rounding mode RN</td>
</tr>
<tr>
<td>00</td>
<td>Round to nearest even</td>
</tr>
<tr>
<td>01</td>
<td>Round towards zero (truncate)</td>
</tr>
<tr>
<td>10</td>
<td>Round towards +infinity</td>
</tr>
<tr>
<td>11</td>
<td>Round towards -infinity</td>
</tr>
<tr>
<td>24:28</td>
<td>Unused</td>
</tr>
<tr>
<td>29:31</td>
<td>Single-precision exception flags for slice 0</td>
</tr>
<tr>
<td>29</td>
<td>Overflow (OVF)</td>
</tr>
<tr>
<td>30</td>
<td>Underflow (UNF)</td>
</tr>
<tr>
<td>31</td>
<td>Nonzero result produced with extended-range arithmetic could be different from the IEEE compliant result (DIFF)</td>
</tr>
<tr>
<td>32:49</td>
<td>Unused</td>
</tr>
<tr>
<td>50:55</td>
<td>IEEE exception flags for slice 0 of the 2-way SIMD double-precision operations</td>
</tr>
<tr>
<td>50</td>
<td>Overflow (OVF)</td>
</tr>
<tr>
<td>51</td>
<td>Underflow (UNF)</td>
</tr>
<tr>
<td>52</td>
<td>Inexact result (INX)</td>
</tr>
<tr>
<td>53</td>
<td>Invalid operation (INV)</td>
</tr>
<tr>
<td>54</td>
<td>Possibly noncompliant result due to QNaN propagation (NAN)</td>
</tr>
<tr>
<td>55</td>
<td>Possibly noncompliant result due to denormal operand (DENORM)</td>
</tr>
<tr>
<td>56:60</td>
<td>Unused</td>
</tr>
<tr>
<td>61:63</td>
<td>Single-precision exception flags for slice 1 (OVF, UNF, DIFF)</td>
</tr>
<tr>
<td>64:81</td>
<td>Unused</td>
</tr>
<tr>
<td>82:87</td>
<td>IEEE exception flags for slice 1 of the 2-way SIMD double-precision operations (OVF, UNF, INX, INV, NAN, DENORM)</td>
</tr>
<tr>
<td>88:92</td>
<td>Unused</td>
</tr>
<tr>
<td>93:95</td>
<td>Single-precision exception flags for slice 2 (OVF, UNF, DIFF)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Bits</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>96:115</td>
<td>Unused</td>
</tr>
<tr>
<td>116:119</td>
<td>Single-precision divide-by-zero flags for each of the four slices</td>
</tr>
<tr>
<td>116</td>
<td>DBZ for slice 0</td>
</tr>
<tr>
<td>117</td>
<td>DBZ for slice 1</td>
</tr>
<tr>
<td>118</td>
<td>DBZ for slice 2</td>
</tr>
<tr>
<td>119</td>
<td>DBZ for slice 3</td>
</tr>
<tr>
<td>120:124</td>
<td>Unused</td>
</tr>
<tr>
<td>125:127</td>
<td>Single-precision exception flags for slice 3 (OVF, UNF, DIFF)</td>
</tr>
</tbody>
</table>
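
Reading individual FPSCR fields from the bit ranges listed above can be sketched as follows. The sketch assumes the 128-bit register value is held as a Python integer with bit 0 as the most-significant bit; the helper names `fpscr_field` and `decode_fpscr` are hypothetical.

```python
ROUNDING_MODES = {
    0b00: "nearest even",
    0b01: "toward zero",
    0b10: "toward +infinity",
    0b11: "toward -infinity",
}


def fpscr_field(fpscr, first, last):
    """Extract bits first..last, where bit 0 is the most-significant of 128."""
    width = last - first + 1
    return (fpscr >> (127 - last)) & ((1 << width) - 1)


def decode_fpscr(fpscr):
    """Pick out a few of the fields described in the table."""
    return {
        "rounding": ROUNDING_MODES[fpscr_field(fpscr, 22, 23)],
        "single_slice0": fpscr_field(fpscr, 29, 31),  # OVF, UNF, DIFF
        "double_slice0": fpscr_field(fpscr, 50, 55),  # OVF .. DENORM
        "dbz": fpscr_field(fpscr, 116, 119),          # slices 0..3
    }


# Example value: RN = 01, DENORM for double slice 0, DBZ for single slice 0.
example = (0b01 << (127 - 23)) | (1 << (127 - 55)) | (1 << (127 - 116))
print(decode_fpscr(example))
```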
Floating Add

fa rt,ra,rb

For each of the four word slots:

- The operand from register RA is added to the operand from register RB.
- The result is placed in register RT.

If the magnitude of the result is greater than Smax, then Smax (with the correct sign) is produced as the result. If the magnitude of the result is less than Smin, then zero is produced.
Double Floating Add

```
dfa rt, ra, rb
```

For each of two doubleword slots:

- The operand from register RA is added to the operand from register RB.
- The result is placed in register RT.
Floating Subtract

fs rt,ra,rb

For each of the four word slots:

- The operand from register RB is subtracted from the operand from register RA.
- The result is placed in register RT.
- If the magnitude of the result is greater than Smax, then Smax (with the correct sign) is produced as the result. If the magnitude of the result is less than Smin, then zero is produced.
Double Floating Subtract

dfs rt,ra,rb

For each of two doubleword slots:

- The operand from register RB is subtracted from the operand from register RA.
- The result is placed in register RT.
### Floating Multiply

fm rt,ra,rb

For each of the four word slots:

- The operand from register RA is multiplied by the operand from register RB.
- The result is placed in register RT.
- If the magnitude of the result is greater than Smax, then Smax (with the correct sign) is produced. If the magnitude of the result is less than Smin, then zero is produced.
Double Floating Multiply

dfm rt,ra,rb

For each of two doubleword slots:

- The operand from register RA is multiplied by the operand from register RB.
- The result is placed in register RT.
Floating Multiply and Add

`fma rt,ra,rb,rc`

For each of the four word slots:

- The operand from register RA is multiplied by the operand from register RB and added to the operand from register RC. The multiplication is exact and not subject to limits on its range.
- The result is placed in register RT.
- If the magnitude of the result of the addition is greater than Smax, then Smax (with the correct sign) is produced. If the magnitude of the result is less than Smin, then zero is produced.
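
The behaviour described for `fma` (an exact product, one rounding by truncation, then saturation) can be sketched with exact rationals. This is a model under assumptions: inputs are Python floats that already hold valid single-precision values, denorm inputs are not flushed, and the names `truncate_to_single` and `fma` are chosen for the example.

```python
from fractions import Fraction

S_MIN = Fraction(1, 2**126)
S_MAX = (2 - Fraction(1, 2**23)) * 2**128


def truncate_to_single(x):
    """Round a nonzero-or-zero Fraction toward zero to a 24-bit significand."""
    if x == 0:
        return x
    mag = abs(x)
    # Estimate floor(log2(mag)) from bit lengths, then correct by one if needed.
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - 23)
    truncated = (mag // ulp) * ulp
    return truncated if x > 0 else -truncated


def fma(a, b, c):
    """One slot of Floating Multiply and Add in extended-range mode."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # product is not rounded
    if abs(exact) > S_MAX:
        return float(S_MAX) if exact > 0 else -float(S_MAX)
    if abs(exact) < S_MIN:
        return 0.0
    return float(truncate_to_single(exact))


x = 1 + 2.0 ** -23
print(fma(x, x, -(1 + 2.0 ** -22)))
```

The example input shows why the exact product matters: rounding the product before the addition would cancel to zero, whereas the fused form keeps the low-order term.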
Double Floating Multiply and Add

dfma rt,ra,rb

For each of two doubleword slots:

- The operand from register RA is multiplied by the operand from register RB and added to the operand from register RT. The multiplication is exact and not subject to limits on its range.
- The result is placed in register RT.
Floating Negative Multiply and Subtract

fnms rt,ra,rb,rc

For each of the four word slots:

- The operand from register RA is multiplied by the operand from register RB, and the product is subtracted from the operand from register RC. The result of the multiplication is exact and not subject to limits on its range.
- The result is placed in register RT.
- If the magnitude of the result of the subtraction is greater than Smax, then Smax (with the correct sign) is produced. If the magnitude of the result of the subtraction is less than Smin, then zero is produced.
Double Floating Negative Multiply and Subtract

dfnms rt,ra,rb

For each of two doubleword slots:

- The operand from register RA is multiplied by the operand from register RB. The operand from register RT is subtracted from the product. The result, which is placed in register RT, is usually obtained by negating the rounded result of this multiply subtract operation. There is one exception: If the result is a QNaN, the sign bit of the result is zero.

- This instruction produces the same result as would be obtained by using the Double Floating Multiply and Subtract instruction and then negating any result that is not a NaN.

- The multiplication is exact and not subject to limits on its range.
Floating Multiply and Subtract

fms rt,ra,rb,rc

For each of the four word slots:

- The operand from register RA is multiplied by the operand from register RB. The result of the multiplication is exact and not subject to limits on its range. The operand from register RC is subtracted from the product.

- The result is placed in register RT.

- If the magnitude of the result of the subtraction is greater than Smax, then Smax (with the correct sign) is produced. If the magnitude of the result of the subtraction is less than Smin, then zero is produced.
### Double Floating Multiply and Subtract

dfms rt,ra,rb

For each of two doubleword slots:

- The operand from register RA is multiplied by the operand from register RB. The multiplication is exact and not subject to limits on its range. The operand from register RT is subtracted from the product.
- The result is placed in register RT.

Double Floating Negative Multiply and Add

dfnma rt,ra,rb

For each of two doubleword slots:

- The operand from register RA is multiplied by the operand from register RB and added to the operand from register RT. The multiplication is exact and not subject to limits on its range. The result, which is placed in register RT, is usually obtained by negating the rounded result of this multiply add operation. There is one exception: If the result is a QNaN, the sign bit of the result is 0.

- This instruction produces the same result as would be obtained by using the Double Floating Multiply and Add instruction and then negating any result that is not a NaN.
Floating Reciprocal Estimate

frest rt,ra

For each of the four word slots:

1. The operand in register RA is used to compute a base and a step for estimating the reciprocal of the operand. The result, in the form shown below, is placed in register RT. S is the sign bit of the base result.

2. The base result is expressed as a floating-point number with 13 bits in the fraction, rather than the usual 23 bits. The remaining 10 bits of the fraction are used to encode the magnitude of the step as a 10-bit denormal fraction; the exponent is that of the base.

3. The step fraction differs from the base fraction (and any normalized IEEE fraction) in that there is a ‘0’ in front of the binary point and three additional bits of ‘0’ between the binary point and the fraction. The represented numbers are as follows:

   - Let \( x \) be the initial value in register RA. The result placed in RT, which is interpreted as a regular IEEE number, provides an estimate of the reciprocal of a nonzero \( x \).
   - If the operand in register RA has a zero exponent, a divide-by-zero exception is flagged.

**Programming Note:** The result returned by this instruction is intended as an operand for the Floating Interpolate instruction.

The quality of the estimate produced by the Floating Reciprocal Estimate instruction is sufficient to produce a result within 1 ulp of the IEEE single-precision reciprocal after interpolation and a single step of Newton-Raphson. Consider this code sequence:

\[
\begin{align*}
\text{FREST} & \quad y0, x \quad \text{// table-lookup} \\
\text{FI} & \quad y1, x, y0 \quad \text{// interpolation} \\
\text{FNMS} & \quad t1, x, y1, \text{ONE} \quad \text{// } t1 = -(x \ast y1 - 1.0) \\
\text{FMA} & \quad y2, t1, y1, y1 \quad \text{// } y2 = t1 \ast y1 + y1 \\
\end{align*}
\]
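The last two instructions of the sequence are one Newton-Raphson step. If the interpolated estimate has relative error d (that is, x × y1 = 1 − d), then x × y2 = 1 − d², so the error is squared. The following Python sketch mirrors only those two arithmetic steps. It uses Python double-precision floats rather than SPU single-precision arithmetic, and the starting value is a hand-picked stand-in for the FREST and FI result, so it shows the convergence behavior rather than the exact bit patterns.

```python
def refine_reciprocal(x: float, y1: float) -> float:
    """One Newton-Raphson step, mirroring the FNMS and FMA lines of the sequence."""
    t1 = -(x * y1 - 1.0)  # FNMS: t1 = -(x * y1 - 1.0)
    y2 = t1 * y1 + y1     # FMA:  y2 = t1 * y1 + y1
    return y2


x = 3.0
y1 = 0.333  # hypothetical stand-in for the interpolated estimate
y2 = refine_reciprocal(x, y1)
print(abs(y1 - 1.0 / x), abs(y2 - 1.0 / x))
```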

Three ranges of input must be described separately:

Zeros

$y_2 = \text{x'7FFF FFFF'}\ (1.999 \times 2^{128})$

Big

If $|x| \geq 2^{126}$, then $1/x$ underflows to zero, $y_2 = 0$.

Note: This underflows for one value of $x$ that IEEE single-precision reciprocal would not. If this is a concern, the following code sequence produces the IEEE answer:

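```assembly
maxnounderflow=0x7e800000        // 2^126, whose reciprocal 2^-126 is still a normalized number
min=0x00800000                   // 2^-126, the smallest normalized magnitude
msb=0x80000000                   // sign-bit mask
FCMEQ selmask,x,maxnounderflow   // selmask = all ones where ABS(x) = 2^126
AND   s1,x,msb                   // s1 = sign bit of x
OR    smin,s1,min                // smin = 2^-126 with the sign of x
SELB  y3,selmask,y2,smin         // y3 = smin where selmask is set, otherwise y2
```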

Normal

$1/x = Y$ where $x \times Y < 1.0$ and $x \times \text{INC}(Y) \geq 1.0$.

$\text{INC}(y)$ gives the sfp number with the same sign as $y$ and next larger magnitude. The absolute error bound is:

$$| Y - y_2 | \leq 1 \text{ ulp} \quad (\text{either } y_2 = Y, \text{ or INC}(y_2) = Y)$$
### Floating Reciprocal Absolute Square Root Estimate

frsqest rt,ra

0 0 1 1 0 1 1 0 0 1

For each of the four word slots:

- The operand in register RA is used to compute a base and step for estimating the reciprocal of the square root of the absolute value of the operand. The result is placed in register RT. The sign bit (S) will be zero.
- Let x be the initial value of register RA. The result placed in register RT, interpreted as a regular IEEE number, provides an estimate of the reciprocal square root of abs(x).
- If the operand in register RA has a zero exponent, a divide-by-zero exception is flagged.

Programming Note: The result returned by this instruction is intended as an operand for the Floating Interpolate instruction.

The quality of the estimate produced by the Floating Reciprocal Absolute Square Root Estimate instruction is sufficient to produce a result within 1 ulp of the IEEE single-precision reciprocal square root after interpolation and a single step of Newton-Raphson. Consider the following code sequence:

```
mask=0x7fffffff
half=0.5
one=1.0
FRSQEST y0,x          // table-lookup
AND     ax,x,mask     // ax = ABS(x)
FI      y1,ax,y0      // interpolation
FM      t1,ax,y1      // t1 = ax * y1
FM      t2,y1,half    // t2 = y1 * 0.5
FNMS    t1,t1,y1,one  // t1 = -(t1 * y1 - 1.0)
FMA     y2,t1,t2,y1   // y2 = t1 * t2 + y1
```
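The four arithmetic instructions at the end of this sequence compute y2 = y1 + (1 − ax × y1²) × (y1 / 2), which is one Newton-Raphson step for the reciprocal square root. A minimal Python sketch of just that refinement follows. It assumes Python double-precision floats instead of SPU single-precision arithmetic, and the starting estimate is a hand-picked stand-in for the table lookup and interpolation.

```python
def refine_rsqrt(ax: float, y1: float) -> float:
    """One Newton-Raphson step for 1/sqrt(ax), following the FM/FM/FNMS/FMA lines."""
    t1 = ax * y1           # FM:   t1 = ax * y1
    t2 = y1 * 0.5          # FM:   t2 = y1 * 0.5
    t1 = -(t1 * y1 - 1.0)  # FNMS: t1 = -(t1 * y1 - 1.0)
    return t1 * t2 + y1    # FMA:  y2 = t1 * t2 + y1


ax = 4.0
y1 = 0.49  # hypothetical stand-in for the interpolated estimate of 1/sqrt(4)
y2 = refine_rsqrt(ax, y1)
print(abs(y1 - 0.5), abs(y2 - 0.5))
```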

Three ranges of input must be described separately:

Zeros, where x fraction ≤ 0x000ff53c: y2 = 0x7fffffff ($1.999 \times 2^{128}$)

Zeros, where x fraction > 0x000ff53c: y2 ≥ 0x7fc00000

The following sequence could be used to correct the answer:

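```
zero=0.0
mask=0x7fffffff
FCMEQ z,x,zero      // z = all ones where ABS(x) compares equal to zero
AND   zmask,z,mask  // zmask = 0x7fffffff for zero inputs, otherwise 0
OR    y3,zmask,y2   // y3 = 0x7fffffff for zero inputs, otherwise y2
```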
Normal

\( \frac{1}{\sqrt{x}} = Y \) where \( x \times Y^2 < 1.0 \) and \( x \times \text{INC}(Y)^2 \geq 1.0 \)

\( \text{INC}(y) \) gives the sfp number with the same sign as \( y \) and next larger magnitude.

The absolute error bound is:

\[ |Y - y_2| \leq 1 \text{ ulp} \quad (0 \text{ and } \pm 1 \text{ are all possible}) \]
### Floating Interpolate

fi rt,ra,rb

For each of the four word slots:

- The operand in register RB is disassembled to produce a floating-point base and step according to the format described in *Floating Reciprocal Estimate* on page 208; that is, a sign, biased exponent, base fraction, and step fraction.

- Bits 13 to 31 of register RA are taken to represent a fraction, \( Y \), whose binary point is to the left of bit 13; that is, \( Y \leftarrow 0.RA_{13:31} \).

The result is computed by the following:

\[ RT \leftarrow (-1)^S \times (1.\text{BaseFraction} - 0.000\text{StepFraction} \times Y) \times 2^{(\text{BiasedExponent} -127)}. \]

**Programming Note:** If the operand in register RB is the result of an `frest` or `frsqest` instruction with the operand from register RA, then the result of the `fi` instruction placed in register RT provides a more accurate estimation.
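The interpolation formula can be sketched in Python for one word slot. The field positions used here are an assumption inferred from the text (sign in bit 0, biased exponent in bits 1 to 8, a 13-bit base fraction in bits 9 to 21, and a 10-bit step fraction in bits 22 to 31, with bit 0 as the most significant bit). The function name is hypothetical, and the result is a Python float that is not rounded to single precision.

```python
def fi_slot(ra_bits: int, rb_bits: int) -> float:
    """Evaluate (-1)^S * (1.Base - 0.000Step * Y) * 2^(exp - 127) for one word."""
    sign = (rb_bits >> 31) & 0x1
    biased_exp = (rb_bits >> 23) & 0xFF
    base_frac = (rb_bits >> 10) & 0x1FFF  # 13-bit base fraction (assumed position)
    step_frac = rb_bits & 0x3FF           # 10-bit step fraction (assumed position)

    y = (ra_bits & 0x7FFFF) / 2**19       # Y = 0.RA[13:31], a 19-bit fraction
    base = 1.0 + base_frac / 2**13        # 1.BaseFraction
    step = step_frac / 2**13              # 0.000StepFraction: 3 zero bits, then 10 bits

    magnitude = (base - step * y) * 2.0 ** (biased_exp - 127)
    return -magnitude if sign else magnitude


# Base 1.0, step 0.0625, Y = 0.5 gives 1.0 - 0.0625 * 0.5.
print(fi_slot(0x00040000, 0x3F800200))
```

Because the step is subtracted, the estimate decreases as Y moves from 0 toward 1 within a table interval, which matches the decreasing shape of both 1/x and 1/sqrt(x).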
### Convert Signed Integer to Floating

csflt rt,ra,scale

For each of the four word slots:

- The signed 32-bit integer value in register RA is converted to an extended-range, single-precision, floating-point value.
- The result is divided by $2^{\text{scale}}$ and placed in register RT. The factor scale is an 8-bit unsigned integer provided by 155 minus the unsigned value from the I8 field. If the value scale is not in the range of 0 to 127, the result of the operation is undefined.
- The scale factor describes the number of bit positions between the binary point of the magnitude and the right end of register RA. A scale factor of zero means that the register RA value is an unscaled integer.
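As a worked illustration of the I8 encoding, the sketch below converts one word slot in Python. The function name is hypothetical. The result is a Python float, so rounding to extended-range single precision is not modeled, which only matters for integers that need more than 24 significant bits.

```python
def csflt_slot(ra: int, i8: int) -> float:
    """Convert one signed 32-bit word to floating point and divide by 2**scale."""
    if not -2**31 <= ra < 2**31:
        raise ValueError("ra must fit in a signed 32-bit word")
    scale = 155 - i8  # the I8 field holds 155 minus the scale
    if not 0 <= scale <= 127:
        raise ValueError("result is undefined for this I8 value")
    return ra / 2.0**scale


# A 16.16 fixed-point value: scale = 16, so the I8 field holds 155 - 16 = 139.
print(csflt_slot(0x00018000, 139))  # 0x18000 / 65536
```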
### Convert Floating to Signed Integer

cflts rt,ra,scale

For each of the four word slots:

- The extended-range, single-precision, floating-point value in register RA is multiplied by $2^{\text{scale}}$. The factor scale is an 8-bit unsigned integer provided by 173 minus the unsigned value from the I8 field. If the value scale is not in the range of 0 to 127, the result of the operation is undefined.

- The product is converted to a signed 32-bit integer. If the intermediate result is greater than $(2^{31} - 1)$, it saturates to $(2^{31} - 1)$; if it is less than $-2^{31}$, it saturates to $-2^{31}$. The resulting signed integer is placed in register RT.

- The scale factor is the location of the binary point of the result, expressed as the number of bit positions from the right end of the register RT. A scale factor of zero means that the value in register RT is an unscaled integer.
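The scaling and saturation rules above can be sketched for one word slot as follows. The function name is hypothetical, and truncation toward zero when discarding the fractional part is an assumption of this sketch, since the rounding of the conversion is not stated in this section.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def cflts_slot(ra: float, i8: int) -> int:
    """Multiply by 2**scale, then convert to a saturated signed 32-bit integer."""
    scale = 173 - i8  # the I8 field holds 173 minus the scale
    if not 0 <= scale <= 127:
        raise ValueError("result is undefined for this I8 value")
    product = ra * 2.0**scale
    if product > INT32_MAX:
        return INT32_MAX  # saturate high
    if product < INT32_MIN:
        return INT32_MIN  # saturate low
    return int(product)   # assumption: fractional part is truncated toward zero


print(cflts_slot(1.5, 172))   # scale = 1, so 1.5 * 2 = 3
print(cflts_slot(1e20, 173))  # far above 2^31 - 1, so the result saturates
```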
### Convert Unsigned Integer to Floating

cuflt rt,ra,scale

For each of the four word slots:

- The unsigned 32-bit integer value in register RA is converted to an extended-range, single-precision, floating-point value.
- The result is divided by $2^{\text{scale}}$ and placed in register RT. The factor scale is an 8-bit unsigned integer provided by 155 minus the unsigned value from the I8 field. If the value scale is not in the range of 0 to 127, the result of the operation is undefined.
- The scale factor describes the number of bit positions between the binary point of the magnitude and the right end of register RA. A scale factor of zero means that the register RA value is an unscaled integer.
### Convert Floating to Unsigned Integer

cfltu rt,ra,scale

For each of the four word slots:

- The extended-range, single-precision, floating-point value in register RA is multiplied by $2^{\text{scale}}$. The factor scale is an 8-bit unsigned integer provided by 173 minus the unsigned value from the I8 field. If the value scale is not in the range of 0 to 127, the result of the operation is undefined.

- The product is converted to an unsigned 32-bit integer. If the intermediate result is greater than $(2^{32} - 1)$ it saturates to $(2^{32} - 1)$. If the product is negative, it saturates to zero. The resulting unsigned integer is placed in register RT.

- The scale factor is the location of the binary point of the result, expressed as the number of bit positions from the right end of the register RT. A scale factor of zero means that the value in RT is an unscaled integer.
### Floating Round Double to Single

frds rt,ra

For each of the two doubleword slots:

- The double-precision value in register RA is rounded to a single-precision, floating-point value and placed in the left word slot. Zeros are placed in the right word slot.
- The rounding is performed in accordance with the rounding mode specified in the Floating-Point Status and Control Register (FPSCR). Double-precision exceptions are detected and accumulated in the FPSCR.
### Floating Extend Single to Double

fesd rt,ra

For each of the two doubleword slots:

- The single-precision value in the left slot of register RA is converted to a double-precision, floating-point value and placed in register RT. The contents of the right word slot are ignored.
- Double-precision exceptions are detected and accumulated in the Floating-Point Status and Control Register (FPSCR).
### Floating Compare Equal

fceq rt,ra,rb

For each of the four word slots:

- The floating-point value from register RA is compared with the floating-point value from register RB. If the values are equal, a result of all ones (true) is produced in register RT. Otherwise, a result of zero (false) is produced in register RT. Two zeros always compare equal independent of their fractions and signs.
- This instruction is always executed in extended-range mode, and ignores the setting of the mode bit.
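A minimal Python sketch of the comparison for one word slot follows, together with the magnitude variant described in the next section. It works on raw 32-bit patterns and treats any value with a zero biased exponent as zero, which is this sketch's reading of "independent of their fractions and signs". It also assumes that in extended-range mode every other bit pattern denotes a distinct number, so equality reduces to bit equality. All function names are hypothetical.

```python
TRUE_MASK = 0xFFFFFFFF


def is_zero(bits: int) -> bool:
    """Assumption: a zero biased exponent means the value is treated as zero."""
    return ((bits >> 23) & 0xFF) == 0


def fceq_slot(ra: int, rb: int) -> int:
    """All ones if the two values compare equal, otherwise zero."""
    if is_zero(ra) and is_zero(rb):
        return TRUE_MASK  # zeros match regardless of sign and fraction
    return TRUE_MASK if ra == rb else 0


def fcmeq_slot(ra: int, rb: int) -> int:
    """Magnitude compare: clear both sign bits, then compare as above."""
    return fceq_slot(ra & 0x7FFFFFFF, rb & 0x7FFFFFFF)


print(hex(fceq_slot(0x00000000, 0x80000001)))   # +0 against a negative zero-exponent value
print(hex(fcmeq_slot(0x3F800000, 0xBF800000)))  # 1.0 against -1.0 by magnitude
```

The all-ones or all-zeros result is convenient as a select mask, as in the correction sequences shown for the estimate instructions.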
### Floating Compare Magnitude Equal

fcmeq rt,ra,rb

For each of the four word slots:

- The absolute value of the floating-point number in register RA is compared with the absolute value of the floating-point number in register RB. If the absolute values are equal, a result of all ones (true) is produced in register RT. Otherwise, a result of zero (false) is produced in register RT. Two zeros always compare equal independent of their fractions and signs.
- This instruction is always executed in extended-range mode, and ignores the setting of the mode bit.
### Floating Compare Greater Than

**fcgt rt,ra,rb**

0 1 0 1 1 0 0 0 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

For each of the four word slots:

- The floating-point value in register RA is compared with the floating-point value in register RB. If the value in RA is greater than the value in RB, a result of all ones (true) is produced in register RT. Otherwise, a result of zero (false) is produced in register RT. Two zeros never compare greater than independent of their sign bits and fractions.

- This instruction is always executed in extended-range mode, and ignores the setting of the mode bit.
### Floating Compare Magnitude Greater Than

fcmgt rt,ra,rb

For each of the four word slots:

- The absolute value of the floating-point number in register RA is compared with the absolute value of the floating-point number in register RB. If the absolute value of the value from register RA is greater than the absolute value of the value from register RB, a result of all ones (true) is produced in register RT. Otherwise, a result of zero (false) is produced in register RT. Two zeros never compare greater than, independent of their fractions and signs.

- This instruction is always executed in extended-range mode, and ignores the setting of the mode bit.
### Floating-Point Status and Control Register Write

The 128-bit value of register RA is written into the Floating-Point Status and Control Register (FPSCR). The value of the unused bits in the FPSCR is undefined. RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.
### Floating-Point Status and Control Register Read

fscrrd    rt

0  1  1  1  0  0  1  1  0  0  0  ///  ///  RT


This instruction reads the value of the Floating-Point Status and Control Register (FPSCR). In the result, the unused bits of the FPSCR are forced to zero. The result is placed in the register RT.
## 10. Control Instructions

This section lists and describes the SPU control instructions.
### Stop and Signal

stop

0 0 0 0 0 0 0 0 0 0 0 /// Stop and Signal Type

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Execution of the program in the SPU stops, and the external environment is signaled. No further instructions are executed.

PC ← PC + 4 & LSLR

precise stop
### Stop and Signal with Dependencies

stopd

0 0 1 0 1 0 0 0 0 0 0 0 0 0 RB RA RC

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Execution of the program in the SPU stops.

PC ← PC + 4 & LSLR
precise stop

Programming Note: This instruction differs from stop only in that, in typical implementations, instructions with dependencies can be replaced with stopd to create a breakpoint without affecting the instruction timings.
### No Operation (Load)

lnop

```
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 /// /// RT
```

This instruction has no effect on the execution of the program. It exists to provide implementation-defined control of instruction issuance. RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.
### No Operation (Execute)

nop

This instruction has no effect on the execution of the program. It exists to provide implementation-defined control of instruction issuance. RT is a false target. Implementations can schedule instructions as though this instruction produces a value into RT. Programs can avoid unnecessary delay by programming RT so as not to appear to source data for nearby subsequent instructions. False targets are not written.
### Synchronize

`sync`

```
0 0 0 0 0 0 0 0 1 0 C
```

This instruction has no effect on the execution of the program other than to cause the processor to wait until all pending store instructions have completed before fetching the next sequential instruction. This instruction must be used following a store instruction that modifies the instruction stream.

The C feature bit causes channel synchronization to occur before instruction synchronization occurs. Channel synchronization allows an SPU state modified through channel instructions to affect execution. Synchronization is discussed in more detail in Section 13 Synchronization and Ordering on page 240.
### Synchronize Data

dsync

0 0 0 0 0 0 0 0 1 1

This instruction forces all earlier load, store, and channel instructions to complete before proceeding. No subsequent load, store, or channel instructions can start until the previous instructions complete. The dsync instruction allows SPU software to ensure that the local store data would be consistent if it were observed by another entity. This instruction does not affect any prefetching of instructions that the processor might have done. Synchronization is discussed in more detail in Section 13 Synchronization and Ordering on page 240.
### Move from Special-Purpose Register

Special-Purpose Register SA is copied into register RT. If SPR SA is not defined, zeros are supplied.

```plaintext
if defined(SPR(SA)) then
  RT ← SPR(SA)
else
  RT ← 0
```
### Move to Special-Purpose Register

mtspr sa,rt

0 0 1 0 0 0 1 1 0 0 /// SA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

The contents of the preferred slot of register RT are written to Special-Purpose Register SA. If SPR SA is not defined, no operation is performed.

if defined(SPR(SA)) then
    SPR(SA) ← RT
else
    do nothing
## 11. Channel Instructions

The SPU provides an input/output interface based on message passing called the “channel interface”. This section describes the instructions used to communicate between the SPU and external devices through the channel interface.

Channels are 128-bit wide communication paths between the SPU and external devices. Each channel operates in one direction only, and is called either a read channel or a write channel, according to the operation that the SPU can perform on the channel. Instructions are provided that allow the SPU program to read from or write to a channel; the operations performed must match the type of channel addressed.

An implementation can implement any number of channels up to 128. Each channel has a channel number in the range 0-127. Channel numbers have no particular significance, and there is no relationship between the direction of a channel and its number.

The channels and the external devices have capacity. Channel capacity is the minimum number of reads or writes that can be performed without delay. Attempts to access a channel without capacity cause instruction processing to cease until capacity becomes available and the access can complete. The SPU maintains counters to measure channel capacity and provides an instruction to read channel capacity.

So long as capacity is available, the channels and external devices can service a burst of SPU accesses without requiring the SPU to delay execution. An attempt to write to a channel beyond its capacity causes the SPU to hang until the external device empties the channel. An attempt to read from a channel when it is empty also causes the SPU to hang until the device inserts data into the channel.
### Read Channel

`rdch rt,ca`

```
0 0 0 0 0 0 0 1 1 0 1 /// CA RT
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
```

The SPU waits for data to become available in channel CA (capacity is available). When data is available in the channel, it is moved from the channel and placed into register RT.

If the channel designated by the CA field is not a valid, readable channel, the SPU will stop on or after the `rdch` instruction.

```plaintext
if readable(Channel(CA)) then
    RT ← Channel(CA)
else
    Stop after executing zero or more instructions after the `rdch`.
```
### Read Channel Count

rchcnt rt,ca

0 0 0 0 0 1 1 1 1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

The channel capacity of channel CA is placed into the preferred slot of register RT. The channel capacity of unimplemented channels is zero.

\[
\text{RT}^{0:3} \leftarrow \text{Channel Capacity(CA)} \\
\text{RT}^{4:15} \leftarrow 0
\]
### Write Channel

wrch ca,rt

0 0 1 0 0 0 1 1 0 1 //

if writeable(Channel(CA)) then
    Channel(CA) <- RT
else
    Stop after executing zero or more instructions after the wrch.

The SPU waits for capacity to become available in channel CA before executing the wrch instruction. When capacity is available in the channel, the contents of register RT are placed into channel CA. Channel writes targeting channels that are not valid writable channels cause the SPU to stop on or after the wrch instruction.
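The interaction of capacity with the three channel instructions can be modeled with a small Python sketch. This is an illustrative toy, not a description of any implementation: the class and method names are hypothetical, the queue depth is a parameter chosen by the caller, and a `WouldStall` exception stands in for the point at which a real SPU would stop issuing instructions and wait.

```python
from collections import deque


class WouldStall(Exception):
    """The access has no capacity; a real SPU would wait here."""


class ReadChannel:
    """Capacity of a read channel: number of entries ready to be read."""

    def __init__(self):
        self._data = deque()

    def device_insert(self, value):  # external-device side
        self._data.append(value)

    def rchcnt(self) -> int:
        return len(self._data)

    def rdch(self):
        if not self._data:
            raise WouldStall("channel is empty")
        return self._data.popleft()


class WriteChannel:
    """Capacity of a write channel: number of free entries."""

    def __init__(self, depth: int):
        self._depth = depth
        self._data = deque()

    def device_remove(self):  # external-device side
        return self._data.popleft()

    def rchcnt(self) -> int:
        return self._depth - len(self._data)

    def wrch(self, value):
        if self.rchcnt() == 0:
            raise WouldStall("channel is full")
        self._data.append(value)


outbound = WriteChannel(depth=1)
outbound.wrch(0x1234)
print(outbound.rchcnt())  # no free entries remain until the device drains one
```

Software that must not stall can read the channel count first and issue `rdch` or `wrch` only when the count is nonzero.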
## 12. SPU Interrupt Facility

This section describes the SPU interrupt facility.

External conditions are monitored and managed through external facilities that are controlled through the channel interface. External conditions can affect SPU instruction sequencing through the following facilities:

- The **bisled** instruction

  The **bisled** instruction tests for the existence of an external condition and branches to a target, if it is present. The **bisled** instruction allows the SPU software to poll for external conditions and to call a handler subroutine, if one is present.

- The interrupt facility

  When polling is not required, the SPU can be enabled to interrupt normal instruction processing and to vector to a handler subroutine when an external condition appears.

The following indirect branch instructions allow software to enable and disable the interrupt facility during critical subroutines:

- `bi`
- `bisl`
- `bisled`
- `biz`
- `binz`
- `bihz`
- `bihnz`

All of these branch instructions provide the [D] and [E] feature bits. When one of these branches is taken, the interrupt-enable status changes before the target instruction is executed. *Table 12-1* describes the feature bit settings and their results.

**Table 12-1. Feature Bits [D] and [E] Settings and Results**

<table>
<thead>
<tr>
<th colspan="2">Feature Bit Setting</th>
<th rowspan="2">Result</th>
</tr>
<tr>
<th>[D]</th>
<th>[E]</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td></td>
</tr>
</tbody>
</table>

### 12.1 SPU Interrupt Handler

The SPU supports a single interrupt handler. The entry point for this handler is address 0 in local store. When a condition is present and interrupts are enabled, the SPU branches to address 0 and disables the interrupt facility. The address of the next instruction to be executed is saved in the SRR0 register. The **iret** instruction can be used to return from the handler. **iret** branches indirectly to the address held in the SRR0 register. **iret**, like the other indirect branches, has an [E] feature bit that can be used to re-enable interrupts.
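The entry and return sequence described above can be summarized as a small state machine. The Python sketch below is a toy model under stated assumptions: the class and method names are hypothetical, only the program counter, SRR0, and the enable status are tracked, and the external condition is passed in as a boolean.

```python
class InterruptModel:
    """Toy model of interrupt entry and iret for a single handler at address 0."""

    def __init__(self):
        self.pc = 0
        self.srr0 = 0
        self.enabled = False

    def advance(self, next_pc: int, condition_present: bool) -> None:
        """Move to next_pc, or vector to the handler if an interrupt is taken."""
        if condition_present and self.enabled:
            self.srr0 = next_pc   # save the address of the next instruction
            self.enabled = False  # entry disables the interrupt facility
            self.pc = 0           # handler entry point in local store
        else:
            self.pc = next_pc

    def iret(self, e: bool = False) -> None:
        """Return to the saved address; the [E] bit re-enables interrupts."""
        self.pc = self.srr0
        if e:
            self.enabled = True


spu = InterruptModel()
spu.enabled = True
spu.advance(0x104, condition_present=True)
print(hex(spu.pc), hex(spu.srr0), spu.enabled)
spu.iret(e=True)
print(hex(spu.pc), spu.enabled)
```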
### 12.2 SPU Interrupt Facility Channels

The interrupt facility uses several channels for configuration, state observation, and state restoration. The current value of SRR0 can be read from the SPU_RdSRR0 channel, and the SPU_WrSRR0 channel provides write access to SRR0. When SRR0 is written by `wrch 14`, synchronization is required to ensure that this new value is available to the `iret` instruction. This synchronization is provided by executing the `sync` instruction with the `[C]`, or Channel Sync, feature bit set. Without this synchronization, `iret` instructions executed after `wrch 14` instructions branch to unpredictable addresses. The SPU_RdSRR0 and SPU_WrSRR0 channels support nested interrupts by allowing software to save SRR0 to, and restore it from, a save area in local store.
## 13. Synchronization and Ordering

The SPU provides a sequentially ordered programming model so that, with a few exceptions, all previous instructions appear to be finished before the next instruction is started.

Systems including an SPU often feature external devices with direct local store access. Figure 13-1 shows a common organization in which the external devices also communicate with the SPU via the channel interface. These systems are shared memory multiprocessors with message passing.

Figure 13-1. Systems with Multiple Accesses to Local Store

Table 13-1 defines five transactions serviced by the local store. The SPU ISA does not define the behavior of the external device or how the external device accesses the local store. When this document refers to an external write of local store, it assumes the external device delivers data to the local store such that a subsequent SPU load from local store can retrieve the data.

Table 13-1. Local Store Accesses

<table>
<thead>
<tr>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Load</td>
<td>SPU load instruction gets data from local store read.</td>
</tr>
<tr>
<td>Store</td>
<td>SPU store instruction sends data to local store write.</td>
</tr>
<tr>
<td>Fetch</td>
<td>SPU instruction fetch gets data from local store read.</td>
</tr>
<tr>
<td>ExtWrite</td>
<td>External device sends data to local store write.</td>
</tr>
<tr>
<td>ExtRead</td>
<td>External device gets data from local store read.</td>
</tr>
</tbody>
</table>

Interaction between the local store access of the external devices and those of the SPU can expose effects of SPU implementation-specific reordering, speculation, buffering, and caching. This section discusses how to order sequences of these transactions to obtain consistent results.
### 13.1 Speculation, Reordering, and Caching SPU Local Store Access

SPU local store access is weakly consistent (see PowerPC Virtual Environment Architecture, Book II). Therefore, the sequential execution model, as applied to instructions that cause storage accesses, guarantees only that those accesses appear to be performed in program order with respect to the SPU executing the instructions. These accesses might not appear to be performed in program order with respect to external local store accesses or with respect to the SPU instruction fetch. This means that, in the absence of external local store writes, an SPU load from any particular address returns the data written by the most recent SPU store to that address. However, an instruction fetch from that address does not necessarily return that data.

The SPU is allowed to cache, buffer, and otherwise reorder its local store accesses. SPU loads, stores, and instruction fetches might or might not access the local store. The SPU can speculatively read the local store. That is, the SPU can read the local store on behalf of instructions that are not required by the program. The SPU does not speculatively write the local store. If and when the SPU stores access the local store, the SPU only writes the local store on behalf of stores required by the program. Instruction fetches, loads, and stores can access the local store in any order.

### 13.2 Internal Execution State

The channel interface can be used to modify the SPU internal execution state. An internal execution state is any state within an SPU, but outside the local store, that is modified through the channel interface and that can affect the sequence or execution of instructions. For example, programs can change SRR0 by writing the SPU_WrSRR0 channel, and SRR0 is an internal execution state. State changes made through the channel interface might not be synchronized with SPU program execution.

### 13.3 Synchronization Primitives

The SPU provides three synchronization instructions: `dsync`, `sync`, and `sync.c`. These instructions have both coherency and instruction serializing effects, as shown in Table 13-2 Synchronization Instructions on page 242. Programs can use the coherency effects of these primitives to ensure that the local store state is consistent with SPU loads and stores. The instruction serializing effects allow the SPU program to order its local store access.

The `dsync` instruction orders loads, stores, and channel accesses but not instruction fetches. When a `dsync` completes, the SPU will have completed all prior loads, stores, and channel accesses and will not have begun execution of any subsequent loads, stores, or channel accesses. At this time, an external read from a local store address returns the data stored by the most recent SPU store to that address. SPU loads after the `dsync` return the data externally written prior to the moment when the `dsync` completes. The `dsync` instruction affects only SPU instruction sequencing and the coherency of loads and stores with respect to actual local store state. The SPU does not broadcast `dsync` notification to external devices that access local store, and, therefore, does not affect the state of the external devices.

The `sync` instruction is much like `dsync`, but it also orders instruction fetches. Instruction fetches from a local store address after a `sync` instruction return data stored by the most recent store instruction or external write to that address. The `sync.c` instruction builds upon the `sync` instruction. It ensures that the effects upon the internal state caused by prior `wrch` instructions are propagated and influence the execution of the following instructions. SPU execution begins with a start event and ends with a stop event. Both start and stop perform `sync.c`.

Table 13-2. Synchronization Instructions

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Coherency Effects</th>
<th>Instruction Serialization Effects</th>
</tr>
</thead>
<tbody>
<tr>
<td>dsync</td>
<td>Ensures that subsequent external reads access data written by prior stores. Ensures that subsequent loads access data written by external writes.</td>
<td>Forces load and store access of local store due to instructions prior to the dsync to be completed prior to completion of dsync. Forces read channel operations due to instructions prior to the dsync to be completed prior to completion of the dsync. Forces load and store access of local store due to instructions after the dsync to occur after completion of the dsync. Forces read and write channel operations due to instructions after the dsync to occur after completion of the dsync.</td>
</tr>
<tr>
<td>sync</td>
<td>Ensures that subsequent external reads access data written by prior stores. Ensures that subsequent instruction fetches access data written by prior stores and external writes. Ensures that subsequent loads access data written by external writes.</td>
<td>Forces all access of local store and channels due to instructions prior to the sync to be completed prior to completion of the sync. Forces all access of local store and channels due to instructions after the sync to occur after completion of the sync.</td>
</tr>
<tr>
<td>sync.c</td>
<td>Ensures that subsequent external reads access data written by prior stores. Ensures that subsequent instruction fetches access data written by prior stores and external writes. Ensures that subsequent loads access data written by external writes. Ensures that subsequent instruction processing is influenced by all internal execution states modified by previous wrch instructions.</td>
<td>Forces all access of local store and channels due to instructions prior to the sync.c to be completed prior to completion of the sync.c. Forces all access of local store and channels due to instructions after the sync.c to occur after completion of the sync.c.</td>
</tr>
</tbody>
</table>

Table 13-3 details which synchronization primitives are required between local store writes and local store reads to ensure that the reads access data written by the prior writes.

Table 13-3. Synchronizing Multiple Accesses to Local Store

<table>
<thead>
<tr>
<th>Writer</th>
<th>Store</th>
<th>Fetch</th>
<th>Load</th>
<th>ExtRead</th>
</tr>
</thead>
<tbody>
<tr>
<td>Store</td>
<td>nothing</td>
<td>sync</td>
<td>nothing</td>
<td>dsync</td>
</tr>
<tr>
<td>ExtWrite</td>
<td>dsync</td>
<td>sync</td>
<td>dsync</td>
<td>N/A</td>
</tr>
</tbody>
</table>
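For tooling that checks code sequences, Table 13-3 can be transcribed into a lookup. The Python sketch below copies the cells exactly as they appear in the table above, including the Store column. The dictionary and function names are hypothetical, and the strings `"nothing"` and `"N/A"` are kept distinct because the table distinguishes them.

```python
# (writer, subsequent access) -> primitive required between the two, per Table 13-3
REQUIRED_PRIMITIVE = {
    ("Store", "Store"): "nothing",
    ("Store", "Fetch"): "sync",
    ("Store", "Load"): "nothing",
    ("Store", "ExtRead"): "dsync",
    ("ExtWrite", "Store"): "dsync",
    ("ExtWrite", "Fetch"): "sync",
    ("ExtWrite", "Load"): "dsync",
    ("ExtWrite", "ExtRead"): "N/A",
}


def required_primitive(writer: str, access: str) -> str:
    """Return the table entry for a write followed by another access."""
    try:
        return REQUIRED_PRIMITIVE[(writer, access)]
    except KeyError:
        raise ValueError(f"no table entry for {writer!r} followed by {access!r}")


print(required_primitive("Store", "Fetch"))    # the self-modifying code case
print(required_primitive("ExtWrite", "Load"))  # data delivered by an external device
```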

### 13.4 Caching SPU Local Store Access

Implementations of the SPU can feature caches of local store data for either instructions, data, or both. These caches must reflect data to and from the local store when synchronization requires the state of the local store to be consistent. The dsync instruction ensures that modified data is visible to external devices that access the local store, and that data modified by these external devices is visible to subsequent loads and stores. The sync instructions also ensure that data modified by either stores or external puts is visible to a subsequent instruction fetch. For example, an instruction cache that does not snoop must be invalidated when sync is executed, and a copy-back data cache that does not snoop must be flushed and invalidated when either sync or dsync is executed.

13.5 Self-Modifying Code

SPU programs can store instructions in local store and execute them. If the SPU has already fetched instructions from the affected locations before the store is performed, it executes the old instructions rather than the newly stored ones. Self-modifying code should therefore always execute a sync instruction before executing the stored code. The sync instruction ensures that all stores complete before the next instruction is fetched from local store.
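
The hazard can be illustrated with a toy Python model of instruction fetch. The sketch assumes a hypothetical prefetch buffer that holds instructions already read from local store; the class `FetchModel` and its methods are illustrative only and do not describe any particular implementation.

```python
class FetchModel:
    """Toy model: instructions already read are kept in a prefetch buffer."""

    def __init__(self, program):
        self.local_store = list(program)
        self.prefetched = {}  # address -> instruction read earlier

    def prefetch(self, addr):
        # The SPU reads an instruction ahead of executing it.
        self.prefetched[addr] = self.local_store[addr]

    def store(self, addr, instruction):
        # A store changes local store but not the prefetch buffer.
        self.local_store[addr] = instruction

    def sync(self):
        # After sync, the next instruction is fetched from local store again.
        self.prefetched.clear()

    def fetch(self, addr):
        # Execution uses the prefetched copy when one exists.
        if addr in self.prefetched:
            return self.prefetched[addr]
        return self.local_store[addr]
```

With the program `["nop", "nop"]`, prefetching address 1 and then storing `"stop"` there leaves `fetch(1)` returning `"nop"` until `sync()` is called.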

13.6 External Local Store Access

Loads and stores do not necessarily access the local store in program order. Accesses from external devices can be interleaved in ways that are inconsistent with program order. The dsync instruction forces all preceding loads and stores to complete their local store access before allowing any further loads or stores to be initiated, while sync ensures that the next instruction is fetched after the sync instruction is executed. An external device can synchronize with an SPU program through local store access. Table 13-4 shows how an SPU program can reliably send and receive data from an external device, synchronizing only through the local store.

Table 13-4. Synchronizing through Local Store

<table>
<thead>
<tr>
<th>External Device</th>
<th>SPU</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="3">SPU sends data through local store address C</td>
</tr>
<tr>
<td></td>
<td>Store data to C</td>
<td></td>
</tr>
<tr>
<td></td>
<td>dsync</td>
<td>Force subsequent store to follow the store to C</td>
</tr>
<tr>
<td></td>
<td>Store marker to D</td>
<td></td>
</tr>
<tr>
<td></td>
<td>dsync</td>
<td>Force the store to D to access the local store</td>
</tr>
<tr>
<td>eloop: Read D</td>
<td></td>
<td></td>
</tr>
<tr>
<td>If not marker goto eloop</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Read C</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="3">SPU receives data through local store address A</td>
</tr>
<tr>
<td>Write data to A</td>
<td></td>
<td>This is the order in which the external device modifies local store. The ordering is not controlled by the SPU ISA.</td>
</tr>
<tr>
<td>Write marker to B</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>loop: dsync</td>
<td>Force subsequent load to access local store</td>
</tr>
<tr>
<td></td>
<td>Load from B</td>
<td></td>
</tr>
<tr>
<td></td>
<td>If not marker goto loop</td>
<td>Ensure A and B are both written to local store</td>
</tr>
<tr>
<td></td>
<td>dsync</td>
<td>Force subsequent load to execute after load from B</td>
</tr>
<tr>
<td></td>
<td>Load from A</td>
<td>Must get data</td>
</tr>
</tbody>
</table>
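
The sending half of Table 13-4 can be sketched in Python with a deliberately adversarial store buffer. Everything here is a modeling assumption: the class `StoreBufferModel`, the newest-first drain order, and the helper names are hypothetical and serve only to show why the marker must not become visible before the data.

```python
class StoreBufferModel:
    """Toy SPU whose stores may reach local store out of program order."""

    def __init__(self):
        self.local_store = {}
        self.pending = []  # stores issued but not yet in local store

    def store(self, addr, value):
        self.pending.append((addr, value))

    def drain_newest_first(self):
        # Adversarial case: the most recent store becomes visible first.
        if self.pending:
            addr, value = self.pending.pop()
            self.local_store[addr] = value

    def dsync(self):
        # All preceding stores complete their local store access.
        for addr, value in self.pending:
            self.local_store[addr] = value
        self.pending.clear()


def spu_send(spu, data, marker, use_dsync):
    """SPU side of Table 13-4: store data to C, then the marker to D."""
    spu.store("C", data)
    if use_dsync:
        spu.dsync()
    spu.store("D", marker)
    if use_dsync:
        spu.dsync()
    else:
        spu.drain_newest_first()


def external_poll(spu, marker):
    """External side: read D, and read C only once the marker is seen."""
    seen = spu.local_store.get("D") == marker
    data = spu.local_store.get("C") if seen else None
    return seen, data
```

Without the dsync instructions the model lets the external device observe the marker while C is still unwritten; with them, the marker implies the data is present.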

13.7 Speculation and Reordering of Channel Reads and Channel Writes

The SPU does not reorder or speculatively execute channel reads or channel writes. All operations at the channel interface represent instructions in the order they occur in the program.

13.8 Channel Interface with External Device

The channel interface delivers channel reads and writes to the SPU interface in program order, but there are no ordering guarantees with respect to loads and stores. It is possible that a message sent to an external device may trigger the external device to directly access the local store. SPU programs might want to use either `sync` or `dsync` instructions, or both, to order SPU loads and stores relative to the external accesses. Table 13-5 shows how an SPU program might reliably send and receive data from an external device synchronizing through the channel interface.

Table 13-5. Synchronizing through Channel Interface

<table>
<thead>
<tr>
<th>External Device</th>
<th>SPU</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="3">SPU receives data through local store address A</td>
</tr>
<tr>
<td>Write data to A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Send message to channel B</td>
<td><code>rdch B</code></td>
<td>Wait for message</td>
</tr>
<tr>
<td></td>
<td><code>dsync</code></td>
<td>Ensure load from A is executed after <code>rdch</code>, and accesses the data in local store</td>
</tr>
<tr>
<td></td>
<td>Load from A</td>
<td>Must get data</td>
</tr>
<tr>
<td colspan="3">SPU sends data through local store address C</td>
</tr>
<tr>
<td></td>
<td>Store data to C</td>
<td></td>
</tr>
<tr>
<td></td>
<td><code>dsync</code></td>
<td>Ensure data is in local store</td>
</tr>
<tr>
<td></td>
<td><code>wrch D</code></td>
<td>Send message</td>
</tr>
<tr>
<td>Receive message from channel D</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Read data from C</td>
<td></td>
<td>The ordering is not controlled by the SPU ISA.</td>
</tr>
</tbody>
</table>

Note: The SPU architecture does not specify what actions an external device can perform in response to a channel read or write. The SPU does not wait for those actions to complete, and it does not synchronize the local store state prior to or after the channel operation.

13.9 Execution State Set by an SPU Program through the Channel Interface

Some SPU channels can control aspects of SPU execution state; for example, SRR0. State changes made through channel writes might not affect subsequent instructions. Execution of the `sync.c` instruction ensures that the new state does affect the next instruction.

13.10 Execution State Set by an External Device

Execution state changes made by an external device are ordered with respect to other externally requested state changes but not with respect to SPU instruction execution. The external device can stop the SPU, make execution state changes, start the SPU, and be certain the new state is visible to program execution.

Appendix A. Programming Examples

A.1 Conversion from Single Precision to Double Precision

This example converts four single-precision numbers in register rin to two double-precision numbers in each of rout and rout1.

```
shri.q rexph=rin,27              ; High-order part of exponent as an integer
fceq.q rzero=rin,r0              ; Assumes r0 = 0; check for zero or denormal input
rotm.q rsign=rin,-31             ; Copy sign bit to bit 31
andi.q rexph=rexph,0b01111       ; Extract exponent bits 7 to 4
shli.q rsign=rsign,7             ; rsign = 0...0 s 0^7
ai.q rexph=rexph,0b111000        ; Convert exponent to DP bias
shli.q rout=rin,5                ; Preshift of mantissa: e[3:0], f[1:23]^-5
andc.q rexph=rexph,rzero         ; Exponent cleared in case of zero/denormal input
andc.q rout=rout,rzero           ; Mantissa cleared in case of zero/denormal input
or.q rexph=rexph,rsign           ; Sign is ORed in, rexph = (0...0, s, g[10:4])
nop                              ; Delay slot
shufb.q rout=rout,rexph,rindex   ; First pair of DP results
shufb.q rout1=rout,rexph,rindex1 ; Second pair of DP results
```
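
As a cross-check on the bit manipulation above, the following Python sketch performs the same conversion on raw bit patterns, one element at a time. It is a reference model under stated assumptions, not a translation of the listing: the function names are hypothetical, zero and denormal inputs produce a signed zero as the `andc` and `or` steps suggest, every nonzero exponent is treated as an ordinary number, and the assignment of the first two elements to `rout` and the last two to `rout1` is assumed because the contents of `rindex` and `rindex1` are not given.

```python
# Difference between the double-precision and single-precision exponent
# biases. 896 equals 56 * 16, which matches the constant 111000 (read as
# binary) that the listing adds to exponent bits 7 to 4.
BIAS_DIFFERENCE = 1023 - 127


def single_to_double_bits(s):
    """Convert one 32-bit single-precision pattern to a 64-bit pattern."""
    sign = (s >> 31) & 1
    exponent = (s >> 23) & 0xFF
    fraction = s & 0x7FFFFF
    if exponent == 0:
        # Zero or denormal input: exponent and mantissa cleared, sign kept.
        return sign << 63
    return (sign << 63) | ((exponent + BIAS_DIFFERENCE) << 52) | (fraction << 29)


def convert_register(words):
    """Convert four single-precision words into two pairs of doubles."""
    doubles = [single_to_double_bits(w) for w in words]
    return doubles[:2], doubles[2:]  # assumed mapping to (rout, rout1)
```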

A.2 Conversion from Double Precision to Single Precision

This example converts a double-precision number in slot 0 of register rin to a single-precision value in the preferred slot of register rf.

```
or      rhigh=rin,rin     ; High-order part copied
rotqbi  rf=rin,3          ; Collect relevant mantissa bits (g[3:0], f[1:28])
rotm    rhabs,rhigh,-1    ; Dropping the sign bit, shifted off the right end
rotm    rsign,rhigh,-31   ; rsign = 0...0 s
rotm    rexd,rhabs,-25    ; Extract exponent, rexp = 0...0 g[10:4]
rotm    rf=rf,-5          ; rf = 0^5, g[3:0], f[1:23]
ai      rexs,rexd,8       ; rex = rexp + 128/16
cti     rmax,rexp,71      ; rmax = -1 iff overflow; exponent > 128
andi    rexs=rex,'0 1^4'  ; Extract exponent bits, e[7:4]
cgt     rmin,XMIN,rhabs   ; rmin = 0 iff number to be truncated to 0
rotm    rexs,rex,-27      ; Align exponent for single-precision format
rotm    rsign=rsign,-31   ; rsign = s 0...0
a       rf=rf,rex         ; Combine exponent and mantissa: 0, e[7:0], f[1:23]
cgt     rmin=XMIN,rhabs   ; rmin = 0 iff number to be truncated to 0
nop
or      rf=rf,rmax        ; Set to 1...1 if rounded to Xmax
nop
and     rf=rf,rmin        ; Set to 0...0 if truncated to 0
nop
or      rf=rf,rsign       ; OR in the sign bit
```
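
The overall effect of this sequence can be approximated by a short Python reference model that works on raw bit patterns. The sketch rests on assumptions that the listing does not spell out: the mantissa is truncated rather than rounded, a result too small for single precision becomes a signed zero, and a result too large saturates to the largest magnitude (exponent and fraction bits all ones) with the sign preserved. The exact thresholds used by the listing (`XMIN` and the overflow compare) may differ, and the function name is hypothetical.

```python
def double_to_single_bits(d):
    """Convert one 64-bit double-precision pattern to a 32-bit pattern."""
    sign = (d >> 63) & 1
    exponent = (d >> 52) & 0x7FF
    fraction = d & ((1 << 52) - 1)
    rebiased = exponent - (1023 - 127)  # move from bias 1023 to bias 127
    if rebiased <= 0:
        # Too small for single precision (or zero/denormal): signed zero.
        return sign << 31
    if rebiased > 255:
        # Too large: saturate to the maximum magnitude, keeping the sign.
        return (sign << 31) | 0x7FFFFFFF
    # Keep the 23 most significant fraction bits; the rest are truncated.
    return (sign << 31) | (rebiased << 23) | (fraction >> 29)
```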
### Table B-1. Instructions Sorted by Mnemonic (Page 1 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>a</td>
<td>Add Word</td>
<td>55</td>
</tr>
<tr>
<td>absdb</td>
<td>Absolute Differences of Bytes</td>
<td>87</td>
</tr>
<tr>
<td>addx</td>
<td>Add Extended</td>
<td>61</td>
</tr>
<tr>
<td>ah</td>
<td>Add Halfword</td>
<td>53</td>
</tr>
<tr>
<td>ahi</td>
<td>Add Halfword Immediate</td>
<td>54</td>
</tr>
<tr>
<td>ai</td>
<td>Add Word Immediate</td>
<td>56</td>
</tr>
<tr>
<td>and</td>
<td>And</td>
<td>92</td>
</tr>
<tr>
<td>andbi</td>
<td>And Byte Immediate</td>
<td>94</td>
</tr>
<tr>
<td>andc</td>
<td>And with Complement</td>
<td>93</td>
</tr>
<tr>
<td>andhi</td>
<td>And Halfword Immediate</td>
<td>95</td>
</tr>
<tr>
<td>andi</td>
<td>And Word Immediate</td>
<td>96</td>
</tr>
<tr>
<td>avgb</td>
<td>Average Bytes</td>
<td>86</td>
</tr>
<tr>
<td>bg</td>
<td>Borrow Generate</td>
<td>65</td>
</tr>
<tr>
<td>bgx</td>
<td>Borrow Generate Extended</td>
<td>66</td>
</tr>
<tr>
<td>bi</td>
<td>Branch Indirect</td>
<td>173</td>
</tr>
<tr>
<td>bihnz</td>
<td>Branch Indirect If Not Zero Halfword</td>
<td>184</td>
</tr>
<tr>
<td>bihz</td>
<td>Branch Indirect If Zero Halfword</td>
<td>183</td>
</tr>
<tr>
<td>binz</td>
<td>Branch Indirect If Not Zero</td>
<td>182</td>
</tr>
<tr>
<td>bisl</td>
<td>Branch Indirect and Set Link</td>
<td>176</td>
</tr>
<tr>
<td>bisled</td>
<td>Branch Indirect and Set Link if External Data</td>
<td>175</td>
</tr>
<tr>
<td>biz</td>
<td>Branch Indirect If Zero</td>
<td>181</td>
</tr>
<tr>
<td>br</td>
<td>Branch Relative</td>
<td>169</td>
</tr>
<tr>
<td>bra</td>
<td>Branch Absolute</td>
<td>170</td>
</tr>
<tr>
<td>brasl</td>
<td>Branch Absolute and Set Link</td>
<td>172</td>
</tr>
<tr>
<td>brhnz</td>
<td>Branch If Not Zero Halfword</td>
<td>179</td>
</tr>
<tr>
<td>brhz</td>
<td>Branch If Zero Halfword</td>
<td>180</td>
</tr>
<tr>
<td>brnz</td>
<td>Branch If Not Zero Word</td>
<td>177</td>
</tr>
<tr>
<td>brsl</td>
<td>Branch Relative and Set Link</td>
<td>171</td>
</tr>
<tr>
<td>brz</td>
<td>Branch If Zero Word</td>
<td>178</td>
</tr>
<tr>
<td>cbd</td>
<td>Generate Controls for Byte Insertion (d-form)</td>
<td>37</td>
</tr>
<tr>
<td>cbx</td>
<td>Generate Controls for Byte Insertion (x-form)</td>
<td>38</td>
</tr>
<tr>
<td>cdd</td>
<td>Generate Controls for Doubleword Insertion (d-form)</td>
<td>43</td>
</tr>
<tr>
<td>cdx</td>
<td>Generate Controls for Doubleword Insertion (x-form)</td>
<td>44</td>
</tr>
<tr>
<td>ceq</td>
<td>Compare Equal Word</td>
<td>155</td>
</tr>
<tr>
<td>ceqb</td>
<td>Compare Equal Byte</td>
<td>151</td>
</tr>
</tbody>
</table>
### Table B-1. Instructions Sorted by Mnemonic (Page 2 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>ceqbi</td>
<td>Compare Equal Byte Immediate</td>
<td>152</td>
</tr>
<tr>
<td>ceqh</td>
<td>Compare Equal Halfword</td>
<td>153</td>
</tr>
<tr>
<td>ceqhi</td>
<td>Compare Equal Halfword Immediate</td>
<td>154</td>
</tr>
<tr>
<td>ceqi</td>
<td>Compare Equal Word Immediate</td>
<td>156</td>
</tr>
<tr>
<td>cflts</td>
<td>Convert Floating to Signed Integer</td>
<td>214</td>
</tr>
<tr>
<td>cfltu</td>
<td>Convert Floating to Unsigned Integer</td>
<td>216</td>
</tr>
<tr>
<td>cg</td>
<td>Carry Generate</td>
<td>62</td>
</tr>
<tr>
<td>cgt</td>
<td>Compare Greater Than Word</td>
<td>161</td>
</tr>
<tr>
<td>cgtb</td>
<td>Compare Greater Than Byte</td>
<td>157</td>
</tr>
<tr>
<td>cgtbi</td>
<td>Compare Greater Than Byte Immediate</td>
<td>158</td>
</tr>
<tr>
<td>cgth</td>
<td>Compare Greater Than Halfword</td>
<td>159</td>
</tr>
<tr>
<td>cgthi</td>
<td>Compare Greater Than Halfword Immediate</td>
<td>160</td>
</tr>
<tr>
<td>cgti</td>
<td>Compare Greater Than Word Immediate</td>
<td>162</td>
</tr>
<tr>
<td>cgx</td>
<td>Carry Generate Extended</td>
<td>63</td>
</tr>
<tr>
<td>chd</td>
<td>Generate Controls for Halfword Insertion (d-form)</td>
<td>39</td>
</tr>
<tr>
<td>chx</td>
<td>Generate Controls for Halfword Insertion (x-form)</td>
<td>40</td>
</tr>
<tr>
<td>clgt</td>
<td>Compare Logical Greater Than Word</td>
<td>167</td>
</tr>
<tr>
<td>clgtb</td>
<td>Compare Logical Greater Than Byte</td>
<td>163</td>
</tr>
<tr>
<td>clgtbi</td>
<td>Compare Logical Greater Than Byte Immediate</td>
<td>164</td>
</tr>
<tr>
<td>clgth</td>
<td>Compare Logical Greater Than Halfword</td>
<td>165</td>
</tr>
<tr>
<td>clgthi</td>
<td>Compare Logical Greater Than Halfword Immediate</td>
<td>166</td>
</tr>
<tr>
<td>clgti</td>
<td>Compare Logical Greater Than Word Immediate</td>
<td>168</td>
</tr>
<tr>
<td>clz</td>
<td>Count Leading Zeros</td>
<td>78</td>
</tr>
<tr>
<td>cntb</td>
<td>Count Ones in Bytes</td>
<td>79</td>
</tr>
<tr>
<td>csflt</td>
<td>Convert Signed Integer to Floating</td>
<td>213</td>
</tr>
<tr>
<td>cuflt</td>
<td>Convert Unsigned Integer to Floating</td>
<td>215</td>
</tr>
<tr>
<td>cwd</td>
<td>Generate Controls for Word Insertion (d-form)</td>
<td>41</td>
</tr>
<tr>
<td>cwx</td>
<td>Generate Controls for Word Insertion (x-form)</td>
<td>42</td>
</tr>
<tr>
<td>dfa</td>
<td>Double Floating Add</td>
<td>196</td>
</tr>
<tr>
<td>dfm</td>
<td>Double Floating Multiply</td>
<td>200</td>
</tr>
<tr>
<td>dfma</td>
<td>Double Floating Multiply and Add</td>
<td>202</td>
</tr>
<tr>
<td>dfms</td>
<td>Double Floating Multiply and Subtract</td>
<td>206</td>
</tr>
<tr>
<td>dfnma</td>
<td>Double Floating Negative Multiply and Add</td>
<td>207</td>
</tr>
<tr>
<td>dfnms</td>
<td>Double Floating Negative Multiply and Subtract</td>
<td>204</td>
</tr>
<tr>
<td>dfs</td>
<td>Double Floating Subtract</td>
<td>198</td>
</tr>
<tr>
<td>dsync</td>
<td>Synchronize Data</td>
<td>231</td>
</tr>
<tr>
<td>eqv</td>
<td>Equivalent</td>
<td>109</td>
</tr>
</tbody>
</table>
### Table B-1. Instructions Sorted by Mnemonic (Page 3 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>fa</td>
<td>Floating Add</td>
<td>195</td>
</tr>
<tr>
<td>fceq</td>
<td>Floating Compare Equal</td>
<td>219</td>
</tr>
<tr>
<td>fcgt</td>
<td>Floating Compare Greater Than</td>
<td>221</td>
</tr>
<tr>
<td>fcmeq</td>
<td>Floating Compare Magnitude Equal</td>
<td>220</td>
</tr>
<tr>
<td>fcmgt</td>
<td>Floating Compare Magnitude Greater Than</td>
<td>222</td>
</tr>
<tr>
<td>fesd</td>
<td>Floating Extend Single to Double</td>
<td>218</td>
</tr>
<tr>
<td>fi</td>
<td>Floating Interpolate</td>
<td>212</td>
</tr>
<tr>
<td>fm</td>
<td>Floating Multiply</td>
<td>199</td>
</tr>
<tr>
<td>fma</td>
<td>Floating Multiply and Add</td>
<td>201</td>
</tr>
<tr>
<td>fms</td>
<td>Floating Multiply and Subtract</td>
<td>205</td>
</tr>
<tr>
<td>fnms</td>
<td>Floating Negative Multiply and Subtract</td>
<td>203</td>
</tr>
<tr>
<td>frds</td>
<td>Floating Round Double to Single</td>
<td>217</td>
</tr>
<tr>
<td>frest</td>
<td>Floating Reciprocal Estimate</td>
<td>208</td>
</tr>
<tr>
<td>frsqest</td>
<td>Floating Reciprocal Absolute Square Root Estimate</td>
<td>210</td>
</tr>
<tr>
<td>fs</td>
<td>Floating Subtract</td>
<td>197</td>
</tr>
<tr>
<td>fscrrd</td>
<td>Floating-Point Status and Control Register Read</td>
<td>224</td>
</tr>
<tr>
<td>fscrwr</td>
<td>Floating-Point Status and Control Register Write</td>
<td>223</td>
</tr>
<tr>
<td>fsm</td>
<td>Form Select Mask for Words</td>
<td>82</td>
</tr>
<tr>
<td>fsmb</td>
<td>Form Select Mask for Bytes</td>
<td>80</td>
</tr>
<tr>
<td>fsmbi</td>
<td>Form Select Mask for Bytes Immediate</td>
<td>51</td>
</tr>
<tr>
<td>fsmh</td>
<td>Form Select Mask for Halfwords</td>
<td>81</td>
</tr>
<tr>
<td>gb</td>
<td>Gather Bits from Words</td>
<td>85</td>
</tr>
<tr>
<td>gbb</td>
<td>Gather Bits from Bytes</td>
<td>83</td>
</tr>
<tr>
<td>gbh</td>
<td>Gather Bits from Halfwords</td>
<td>84</td>
</tr>
<tr>
<td>hbr</td>
<td>Hint for Branch (r-form)</td>
<td>186</td>
</tr>
<tr>
<td>hbra</td>
<td>Hint for Branch (a-form)</td>
<td>187</td>
</tr>
<tr>
<td>hbrr</td>
<td>Hint for Branch Relative</td>
<td>188</td>
</tr>
<tr>
<td>heq</td>
<td>Halt If Equal</td>
<td>145</td>
</tr>
<tr>
<td>heqi</td>
<td>Halt If Equal Immediate</td>
<td>146</td>
</tr>
<tr>
<td>hgt</td>
<td>Halt If Greater Than</td>
<td>147</td>
</tr>
<tr>
<td>hgti</td>
<td>Halt If Greater Than Immediate</td>
<td>148</td>
</tr>
<tr>
<td>hlgt</td>
<td>Halt If Logically Greater Than</td>
<td>149</td>
</tr>
<tr>
<td>hlgti</td>
<td>Halt If Logically Greater Than Immediate</td>
<td>150</td>
</tr>
<tr>
<td>il</td>
<td>Immediate Load Word</td>
<td>48</td>
</tr>
<tr>
<td>ila</td>
<td>Immediate Load Address</td>
<td>49</td>
</tr>
<tr>
<td>ilh</td>
<td>Immediate Load Halfword</td>
<td>46</td>
</tr>
<tr>
<td>ilhu</td>
<td>Immediate Load Halfword Upper</td>
<td>47</td>
</tr>
</tbody>
</table>
### Table B-1. Instructions Sorted by Mnemonic (Page 4 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>iohl</td>
<td>Immediate Or Halfword Lower</td>
<td>50</td>
</tr>
<tr>
<td>iret</td>
<td>Interrupt Return</td>
<td>174</td>
</tr>
<tr>
<td>lnop</td>
<td>No Operation (Load)</td>
<td>228</td>
</tr>
<tr>
<td>lqa</td>
<td>Load Quadword (a-form)</td>
<td>31</td>
</tr>
<tr>
<td>lqd</td>
<td>Load Quadword (d-form)</td>
<td>29</td>
</tr>
<tr>
<td>lqr</td>
<td>Load Quadword Instruction Relative (a-form)</td>
<td>32</td>
</tr>
<tr>
<td>lqx</td>
<td>Load Quadword (x-form)</td>
<td>30</td>
</tr>
<tr>
<td>mfspr</td>
<td>Move from Special-Purpose Register</td>
<td>232</td>
</tr>
<tr>
<td>mpy</td>
<td>Multiply</td>
<td>67</td>
</tr>
<tr>
<td>mpya</td>
<td>Multiply and Add</td>
<td>71</td>
</tr>
<tr>
<td>mpyh</td>
<td>Multiply High</td>
<td>72</td>
</tr>
<tr>
<td>mpyhh</td>
<td>Multiply High High</td>
<td>74</td>
</tr>
<tr>
<td>mpyhha</td>
<td>Multiply High High and Add</td>
<td>75</td>
</tr>
<tr>
<td>mpyhhu</td>
<td>Multiply High High Unsigned</td>
<td>77</td>
</tr>
<tr>
<td>mpyi</td>
<td>Multiply Immediate</td>
<td>69</td>
</tr>
<tr>
<td>mpyu</td>
<td>Multiply Unsigned</td>
<td>68</td>
</tr>
<tr>
<td>mpyui</td>
<td>Multiply Unsigned Immediate</td>
<td>70</td>
</tr>
<tr>
<td>mtspr</td>
<td>Move to Special-Purpose Register</td>
<td>233</td>
</tr>
<tr>
<td>nand</td>
<td>Nand</td>
<td>107</td>
</tr>
<tr>
<td>nop</td>
<td>No Operation (Execute)</td>
<td>229</td>
</tr>
<tr>
<td>nor</td>
<td>Nor</td>
<td>108</td>
</tr>
<tr>
<td>or</td>
<td>Or</td>
<td>97</td>
</tr>
<tr>
<td>orbi</td>
<td>Or Byte Immediate</td>
<td>99</td>
</tr>
<tr>
<td>orc</td>
<td>Or with Complement</td>
<td>98</td>
</tr>
<tr>
<td>orhi</td>
<td>Or Halfword Immediate</td>
<td>100</td>
</tr>
<tr>
<td>ori</td>
<td>Or Word Immediate</td>
<td>101</td>
</tr>
<tr>
<td>orx</td>
<td>Or Across</td>
<td>102</td>
</tr>
<tr>
<td>rchcnt</td>
<td>Read Channel Count</td>
<td>236</td>
</tr>
<tr>
<td>rdch</td>
<td>Read Channel</td>
<td>235</td>
</tr>
<tr>
<td>rot</td>
<td>Rotate Word</td>
<td>124</td>
</tr>
<tr>
<td>roth</td>
<td>Rotate Halfword</td>
<td>122</td>
</tr>
<tr>
<td>rothi</td>
<td>Rotate Halfword Immediate</td>
<td>123</td>
</tr>
<tr>
<td>rothm</td>
<td>Rotate and Mask Halfword</td>
<td>131</td>
</tr>
<tr>
<td>rothmi</td>
<td>Rotate and Mask Halfword Immediate</td>
<td>132</td>
</tr>
<tr>
<td>roti</td>
<td>Rotate Word Immediate</td>
<td>125</td>
</tr>
</tbody>
</table>
### Table B-1. Instructions Sorted by Mnemonic (Page 5 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>rotm</code></td>
<td>Rotate and Mask Word</td>
<td>133</td>
</tr>
<tr>
<td><code>rotma</code></td>
<td>Rotate and Mask Algebraic Word</td>
<td>142</td>
</tr>
<tr>
<td><code>rotmah</code></td>
<td>Rotate and Mask Algebraic Halfword</td>
<td>140</td>
</tr>
<tr>
<td><code>rotmahi</code></td>
<td>Rotate and Mask Algebraic Halfword Immediate</td>
<td>141</td>
</tr>
<tr>
<td><code>rotmai</code></td>
<td>Rotate and Mask Algebraic Word Immediate</td>
<td>143</td>
</tr>
<tr>
<td><code>rotmi</code></td>
<td>Rotate and Mask Word Immediate</td>
<td>134</td>
</tr>
<tr>
<td><code>rotqbi</code></td>
<td>Rotate Quadword by Bits</td>
<td>129</td>
</tr>
<tr>
<td><code>rotqbii</code></td>
<td>Rotate Quadword by Bits Immediate</td>
<td>130</td>
</tr>
<tr>
<td><code>rotqby</code></td>
<td>Rotate Quadword by Bytes</td>
<td>126</td>
</tr>
<tr>
<td><code>rotqbybi</code></td>
<td>Rotate Quadword by Bytes from Bit Shift Count</td>
<td>128</td>
</tr>
<tr>
<td><code>rotqbyi</code></td>
<td>Rotate Quadword by Bytes Immediate</td>
<td>127</td>
</tr>
<tr>
<td><code>rotqmbi</code></td>
<td>Rotate and Mask Quadword by Bits</td>
<td>138</td>
</tr>
<tr>
<td><code>rotqmbii</code></td>
<td>Rotate and Mask Quadword by Bits Immediate</td>
<td>139</td>
</tr>
<tr>
<td><code>rotqmby</code></td>
<td>Rotate and Mask Quadword by Bytes</td>
<td>135</td>
</tr>
<tr>
<td><code>rotqmbybi</code></td>
<td>Rotate and Mask Quadword by Bytes from Bit Shift Count</td>
<td>137</td>
</tr>
<tr>
<td><code>rotqmbyi</code></td>
<td>Rotate and Mask Quadword by Bytes Immediate</td>
<td>136</td>
</tr>
<tr>
<td><code>selb</code></td>
<td>Select Bits</td>
<td>110</td>
</tr>
<tr>
<td><code>sf</code></td>
<td>Subtract From Word</td>
<td>59</td>
</tr>
<tr>
<td><code>sfh</code></td>
<td>Subtract From Halfword</td>
<td>57</td>
</tr>
<tr>
<td><code>sfhi</code></td>
<td>Subtract From Halfword Immediate</td>
<td>58</td>
</tr>
<tr>
<td><code>sfi</code></td>
<td>Subtract From Word Immediate</td>
<td>60</td>
</tr>
<tr>
<td><code>sfx</code></td>
<td>Subtract From Extended</td>
<td>64</td>
</tr>
<tr>
<td><code>shl</code></td>
<td>Shift Left Word</td>
<td>115</td>
</tr>
<tr>
<td><code>shlh</code></td>
<td>Shift Left Halfword</td>
<td>113</td>
</tr>
<tr>
<td><code>shlhi</code></td>
<td>Shift Left Halfword Immediate</td>
<td>114</td>
</tr>
<tr>
<td><code>shli</code></td>
<td>Shift Left Word Immediate</td>
<td>116</td>
</tr>
<tr>
<td><code>shlqbi</code></td>
<td>Shift Left Quadword by Bits</td>
<td>117</td>
</tr>
<tr>
<td><code>shlqbii</code></td>
<td>Shift Left Quadword by Bits Immediate</td>
<td>118</td>
</tr>
<tr>
<td><code>shlqby</code></td>
<td>Shift Left Quadword by Bytes</td>
<td>119</td>
</tr>
<tr>
<td><code>shlqbybi</code></td>
<td>Shift Left Quadword by Bytes from Bit Shift Count</td>
<td>121</td>
</tr>
<tr>
<td><code>shlqbyi</code></td>
<td>Shift Left Quadword by Bytes Immediate</td>
<td>120</td>
</tr>
<tr>
<td><code>shufb</code></td>
<td>Shuffle Bytes</td>
<td>111</td>
</tr>
<tr>
<td><code>stop</code></td>
<td>Stop and Signal</td>
<td>226</td>
</tr>
<tr>
<td><code>stopd</code></td>
<td>Stop and Signal with Dependencies</td>
<td>227</td>
</tr>
<tr>
<td><code>stqa</code></td>
<td>Store Quadword (a-form)</td>
<td>35</td>
</tr>
<tr>
<td><code>stqd</code></td>
<td>Store Quadword (d-form)</td>
<td>33</td>
</tr>
<tr>
<td><code>stqr</code></td>
<td>Store Quadword Instruction Relative (a-form)</td>
<td>36</td>
</tr>
</tbody>
</table>
### Table B-1. Instructions Sorted by Mnemonic (Page 6 of 6)

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Instruction</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>stqx</td>
<td>Store Quadword (x-form)</td>
<td>34</td>
</tr>
<tr>
<td>sumb</td>
<td>Sum Bytes into Halfwords</td>
<td>88</td>
</tr>
<tr>
<td>sync</td>
<td>Synchronize</td>
<td>230</td>
</tr>
<tr>
<td>wrch</td>
<td>Write Channel</td>
<td>237</td>
</tr>
<tr>
<td>xor</td>
<td>Exclusive Or</td>
<td>103</td>
</tr>
<tr>
<td>xorbi</td>
<td>Exclusive Or Byte Immediate</td>
<td>104</td>
</tr>
<tr>
<td>xorhi</td>
<td>Exclusive Or Halfword Immediate</td>
<td>105</td>
</tr>
<tr>
<td>xori</td>
<td>Exclusive Or Word Immediate</td>
<td>106</td>
</tr>
<tr>
<td>xsbh</td>
<td>Extend Sign Byte to Halfword</td>
<td>89</td>
</tr>
<tr>
<td>xshw</td>
<td>Extend Sign Halfword to Word</td>
<td>90</td>
</tr>
<tr>
<td>xswd</td>
<td>Extend Sign Word to Doubleword</td>
<td>91</td>
</tr>
</tbody>
</table>

Appendix C. Details of the Compute-Mask Instructions

The tables in this section show the details of the masks that are generated by the eight Compute Mask instructions. The masks that are shown are intended for use as the RC operand of the Shuffle Bytes, `shufb`, instruction. Each row in a table shows the rightmost 4 bits of the effective address. An x in the first column indicates an ignored bit. Blanks within the “created mask” are shown only to improve clarity.

For **byte** insertion:

**Table C-1. Byte Insertion: Rightmost 4 Bits of the Effective Address and Created Mask**

<table>
<thead>
<tr>
<th>Rightmost 4 Bits of the Effective Address</th>
<th>Created Mask</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000</td>
<td>03 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0001</td>
<td>10 03 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0010</td>
<td>10 11 03 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0011</td>
<td>10 11 12 03 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0100</td>
<td>10 11 12 13 03 15 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0101</td>
<td>10 11 12 13 14 03 16 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0110</td>
<td>10 11 12 13 14 15 03 17 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>0111</td>
<td>10 11 12 13 14 15 16 03 18 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>1000</td>
<td>10 11 12 13 14 15 16 17 03 19 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>1001</td>
<td>10 11 12 13 14 15 16 17 18 03 1a 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>1010</td>
<td>10 11 12 13 14 15 16 17 18 19 03 1b 1c 1d 1e 1f</td>
</tr>
<tr>
<td>1011</td>
<td>10 11 12 13 14 15 16 17 18 19 1a 03 1c 1d 1e 1f</td>
</tr>
<tr>
<td>1100</td>
<td>10 11 12 13 14 15 16 17 18 19 1a 1b 03 1d 1e 1f</td>
</tr>
<tr>
<td>1101</td>
<td>10 11 12 13 14 15 16 17 18 19 1a 1b 1c 03 1e 1f</td>
</tr>
<tr>
<td>1110</td>
<td>10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 03 1f</td>
</tr>
<tr>
<td>1111</td>
<td>10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 03</td>
</tr>
</tbody>
</table>

For **halfword** insertion:

**Table C-2. Halfword Insertion: Rightmost 4 Bits of the Effective Address and Created Mask**

<table>
<thead>
<tr>
<th>Rightmost 4 Bits of the Effective Address</th>
<th>Created Mask</th>
</tr>
</thead>
<tbody>
<tr>
<td>000x</td>
<td>0203 1213 1415 1617 1819 1a1b 1c1d 1e1f</td>
</tr>
<tr>
<td>001x</td>
<td>1011 0203 1415 1617 1819 1a1b 1c1d 1e1f</td>
</tr>
<tr>
<td>010x</td>
<td>1011 1213 0203 1617 1819 1a1b 1c1d 1e1f</td>
</tr>
<tr>
<td>011x</td>
<td>1011 1213 1415 0203 1819 1a1b 1c1d 1e1f</td>
</tr>
<tr>
<td>100x</td>
<td>1011 1213 1415 1617 0203 1a1b 1c1d 1e1f</td>
</tr>
<tr>
<td>101x</td>
<td>1011 1213 1415 1617 1819 0203 1c1d 1e1f</td>
</tr>
<tr>
<td>110x</td>
<td>1011 1213 1415 1617 1819 1a1b 0203 1e1f</td>
</tr>
<tr>
<td>111x</td>
<td>1011 1213 1415 1617 1819 1a1b 1c1d 0203</td>
</tr>
</tbody>
</table>
For **word** insertion:

### Table C-3. Word Insertion: Rightmost 4 Bits of the Effective Address and Created Mask

<table>
<thead>
<tr>
<th>Rightmost 4 Bits of the Effective Address</th>
<th>Created Mask</th>
</tr>
</thead>
<tbody>
<tr>
<td>00xx</td>
<td>00010203 14151617 18191a1b 1c1d1e1f</td>
</tr>
<tr>
<td>01xx</td>
<td>10111213 00010203 18191a1b 1c1d1e1f</td>
</tr>
<tr>
<td>10xx</td>
<td>10111213 14151617 00010203 1c1d1e1f</td>
</tr>
<tr>
<td>11xx</td>
<td>10111213 14151617 18191a1b 00010203</td>
</tr>
</tbody>
</table>

For **doubleword** insertion:

### Table C-4. Doubleword Insertion: Rightmost 4 Bits of the Effective Address and Created Mask

<table>
<thead>
<tr>
<th>Rightmost 4 Bits of the Effective Address</th>
<th>Created Mask</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xxx</td>
<td>0001020304050607 18191a1b1c1d1e1f</td>
</tr>
<tr>
<td>1xxx</td>
<td>1011121314151617 0001020304050607</td>
</tr>
</tbody>
</table>
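
The four tables follow a single rule: start from the identity selection of the second `shufb` operand (bytes 10 through 1f) and overwrite the addressed slot with the byte numbers of the preferred slot of the first operand. The Python sketch below generates the masks from that rule and applies them with a small model of `shufb`. The function names are hypothetical, and the handling of control bytes with the leftmost bit set follows my reading of the `shufb` definition, so it should be checked against the instruction description.

```python
def insertion_mask(address, size):
    """Return the 16-byte mask for inserting a size-byte element (1, 2, 4, 8)."""
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4, or 8")
    start = (address & 0xF) & ~(size - 1)  # the x bits of the address are ignored
    mask = bytearray(range(0x10, 0x20))    # default: keep the second operand
    first_source = max(4, size) - size     # 3 for bytes, 2 for halfwords, else 0
    for i in range(size):
        mask[start + i] = first_source + i
    return bytes(mask)


def shufb(ra, rb, rc):
    """Model of Shuffle Bytes on three 16-byte operands."""
    source = ra + rb
    result = bytearray(16)
    for i, control in enumerate(rc):
        if control & 0xC0 == 0x80:
            result[i] = 0x00
        elif control & 0xE0 == 0xC0:
            result[i] = 0xFF
        elif control & 0xE0 == 0xE0:
            result[i] = 0x80
        else:
            result[i] = source[control & 0x1F]
    return bytes(result)
```

For example, `insertion_mask(0xB, 2).hex()` reproduces the 101x row of Table C-2, and `shufb(ra, rb, insertion_mask(5, 1))` returns `rb` with byte 5 replaced by byte 3 of `ra`.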
## Revision Log

<table>
<thead>
<tr>
<th>Revision Date</th>
<th>Contents of Modification</th>
</tr>
</thead>
<tbody>
<tr>
<td>August 1, 2005</td>
<td>Initial public release.</td>
</tr>
</tbody>
</table>