These slides are mostly taken verbatim, or with minor changes, from those prepared by Mary Jane Irwin (www.cse.psu.edu/~mji) of The Pennsylvania State University.

Key to the Slides

- The source of each slide is coded in the footer on the right side:
  - **Irwin CSE331** = slide by Mary Jane Irwin from the course CSE331 (Computer Organization and Design) at The Pennsylvania State University.
  - **Irwin CSE431** = slide by Mary Jane Irwin from the course CSE431 (Computer Architecture) at The Pennsylvania State University.
  - **Hegner UU** = slide by Stephen J. Hegner at Umeå University.
The Big Picture: Where are We Now?

- **Multiprocessor** – a computer system with at least two processors

  ![Multiprocessor Diagram]

- Can deliver high throughput for independent jobs via job-level parallelism or process-level parallelism
- And improve the run time of a *single* program that has been specially crafted to run on a multiprocessor - a parallel processing program
Multicores Now Common

- The power challenge has forced a change in the design of microprocessors
  - Since 2002 the rate of improvement in the response time of programs has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year

- Today’s microprocessors typically contain more than one core – **Chip Multicore microProcessors (CMPs)** – in a single IC
  - The number of cores is expected to double every two years

<table>
<thead>
<tr>
<th>Product</th>
<th>AMD Barcelona</th>
<th>Intel Nehalem</th>
<th>IBM Power</th>
<th>Sun Niagara</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cores per chip</td>
<td>4</td>
<td>4</td>
<td>2</td>
<td>8</td>
</tr>
<tr>
<td>Clock rate</td>
<td>2.5 GHz</td>
<td>~2.5 GHz?</td>
<td>4.7 GHz</td>
<td>1.4 GHz</td>
</tr>
<tr>
<td>Power</td>
<td>120 W</td>
<td>~100 W?</td>
<td>~100 W?</td>
<td>94 W</td>
</tr>
</tbody>
</table>
Other Multiprocessor Basics

- Some of the problems that need higher performance can be handled simply by using a cluster – a set of independent servers (or PCs) connected over a local area network (LAN) functioning as a single large multiprocessor
  - Search engines, Web servers, email servers, databases, …

- A key challenge is to craft parallel (concurrent) programs that have high performance on multiprocessors as the number of processors increases – i.e., that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication
Encountering Amdahl’s Law

- Speedup due to enhancement E is

\[
\text{Speedup w/ E} = \frac{\text{Exec time w/o E}}{\text{Exec time w/ E}}
\]

- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

\[
\text{ExTime w/ E} = \text{ExTime w/o E} \times ((1-F) + F/S)
\]

\[
\text{Speedup w/ E} = \frac{1}{(1-F) + F/S}
\]
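
As a quick numerical check of the formula above, here is a minimal Python sketch. The function name `amdahl_speedup` and its parameter names are illustrative choices, not part of the slides.

```python
def amdahl_speedup(fraction_enhanced: float, enhancement_factor: float) -> float:
    """Overall speedup when a fraction F of the task is sped up by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must lie in [0, 1]")
    if enhancement_factor <= 0.0:
        raise ValueError("S must be positive")
    # Speedup w/ E = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_factor)


if __name__ == "__main__":
    # With F = 0.9 the speedup approaches, but never exceeds, 1 / (1 - F) = 10.
    for s in (2, 10, 100, 1_000_000):
        print(s, round(amdahl_speedup(0.9, s), 3))
```

The loop illustrates the limiting behavior: once S is large, the unaffected fraction (1 - F) dominates the execution time.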
Example 1: Amdahl’s Law

\[ \text{Speedup w/ E} = \frac{1}{(1-F) + \frac{F}{S}} \]

- Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
  
  \[ \text{Speedup w/ E} = \frac{1}{(.75 + .25/20)} = 1.31 \]

- What if it’s usable only 15% of the time?
  
  \[ \text{Speedup w/ E} = \frac{1}{(.85 + .15/20)} = 1.17 \]

- Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!

- To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
  
  \[ \text{Speedup w/ E} = \frac{1}{(.001 + .999/100)} = 90.99 \]
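
The last bullet can also be read in reverse: given a target speedup, how large must F be? The sketch below solves Amdahl's formula for F, assuming the enhancement factor S equals the number of processors. The helper name `required_parallel_fraction` is hypothetical.

```python
def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Smallest F for which 1 / ((1 - F) + F / S) reaches the target, with S = processors."""
    if processors < 2:
        raise ValueError("need at least two processors")
    if not 1.0 <= target_speedup <= processors:
        raise ValueError("target speedup must lie between 1 and the processor count")
    # Rearranging the formula: F = (1 - 1/speedup) / (1 - 1/S)
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    f = required_parallel_fraction(90, 100)
    # The scalar (non-parallel) share is what is left over.
    print(f"scalar share allowed: {(1.0 - f) * 100:.3f}%")
```

For a speedup of 90 on 100 processors the allowed scalar share comes out at roughly 0.11%, in line with the slide's round figure of 0.1%.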
Example 2: Amdahl’s Law

\[
\text{Speedup w/ } E = \frac{1}{(1-F) + \frac{F}{S}}
\]

- Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors

  \[
  \text{Speedup w/ } E = \frac{1}{(0.091 + \frac{0.909}{10})} = \frac{1}{0.1819} = 5.5
  \]

- What if there are 100 processors?

  \[
  \text{Speedup w/ } E = \frac{1}{(0.091 + \frac{0.909}{100})} = \frac{1}{0.10009} = 10.0
  \]

- What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?

  \[
  \text{Speedup w/ } E = \frac{1}{(0.001 + \frac{0.999}{10})} = \frac{1}{0.1009} = 9.9
  \]

- What if there are 100 processors?

  \[
  \text{Speedup w/ } E = \frac{1}{(0.001 + \frac{0.999}{100})} = \frac{1}{0.01099} = 91
  \]
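
The four results above can be reproduced by counting additions directly instead of rounding F first. This is a small sketch under the slide's assumptions (the scalar adds run on one processor, the matrix adds divide evenly across all processors); the function name is illustrative.

```python
def matrix_sum_speedup(scalar_adds: int, matrix_dim: int, processors: int) -> float:
    """Speedup for summing `scalar_adds` scalars plus two matrix_dim x matrix_dim matrices."""
    parallel_adds = matrix_dim * matrix_dim          # one add per matrix element
    time_one_proc = scalar_adds + parallel_adds      # every add done sequentially
    time_many_proc = scalar_adds + parallel_adds / processors
    return time_one_proc / time_many_proc


if __name__ == "__main__":
    for dim in (10, 100):
        for procs in (10, 100):
            print(dim, procs, round(matrix_sum_speedup(10, dim, procs), 1))
```

Growing the matrices from 10 by 10 to 100 by 100 shrinks the scalar share of the work, which is why the 100-processor case improves from about 10 to about 91.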
Scaling

- To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  - **Strong scaling** – when speedup can be achieved on a multiprocessor without increasing the size of the problem
  - **Weak scaling** – when speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors

- Load balancing is another important factor. Just a single processor with twice the load of the others cuts the speedup almost in half
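
The load-balancing remark can be checked with a simple model, assumed here for illustration: parallel execution time is set by the most heavily loaded processor, so speedup is total work divided by the largest single load.

```python
def speedup_with_loads(loads: list[float]) -> float:
    """Speedup over one processor when processor i performs loads[i] units of work."""
    if not loads or min(loads) < 0 or max(loads) <= 0:
        raise ValueError("need non-negative loads with at least one positive entry")
    # One processor would need sum(loads); in parallel we wait for the slowest.
    return sum(loads) / max(loads)


if __name__ == "__main__":
    balanced = [1.0] * 100
    one_overloaded = [1.0] * 99 + [2.0]   # a single processor with twice the load
    print(speedup_with_loads(balanced))        # 100.0
    print(speedup_with_loads(one_overloaded))  # 50.5
```

With 100 processors, one doubled load drops the speedup from 100 to 50.5, which is the "almost in half" effect.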
Multiprocessor/Clusters Key Questions

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors can be supported?
**Shared Memory Multiprocessor (SMP)**

- Q1 – Single address space shared by all processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time
- They come in two styles
  - Uniform memory access (**UMA**) multiprocessors
  - Nonuniform memory access (**NUMA**) multiprocessors
- Programming NUMAs is harder
- But NUMAs can scale to larger sizes and have lower latency to local memory
Summing 100,000 Numbers on 100 Proc. SMP

- Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)

      sum[Pn] = 0;
      for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
          sum[Pn] = sum[Pn] + A[i];

- The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors) – reduction

      repeat
          synch();                          /* synchronize first */
          if (half%2 != 0 && Pn == 0)
              sum[0] = sum[0] + sum[half-1];
          half = half/2;
          if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
      until (half == 1);                    /* final sum in sum[0] */
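
To see that the reduction really collects every partial sum, the two phases can be simulated sequentially. The Python sketch below is an illustration, not the slide's code: the list `partial` plays the role of the shared `sum` array, each pass of the `while` loop corresponds to one `synch()`-separated step, and a `while` loop replaces repeat/until so that a single processor also works.

```python
def smp_tree_sum(values: list[int], num_procs: int) -> int:
    """Simulate the shared-memory reduction; returns the final value of sum[0]."""
    if num_procs < 1 or len(values) % num_procs != 0:
        raise ValueError("values must divide evenly among the processors")
    chunk = len(values) // num_procs

    # Phase 1: processor Pn sums its own slice of A into partial[Pn].
    partial = [sum(values[pn * chunk:(pn + 1) * chunk]) for pn in range(num_procs)]

    # Phase 2: tree reduction. All processors hold the same private `half`.
    half = num_procs
    while half > 1:
        if half % 2 != 0:
            partial[0] += partial[half - 1]   # P0 picks up the odd element
        half //= 2
        for pn in range(half):                # processors with Pn < half
            partial[pn] += partial[pn + half]
    return partial[0]


if __name__ == "__main__":
    data = list(range(100_000))
    print(smp_tree_sum(data, 100) == sum(data))
```

Within one step every processor reads only entries at index `half` or above and writes only its own entry, so the order in which the simulation visits processors does not affect the result.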
An Example with 10 Processors

[Figure: reduction tree over sum[P0] … sum[P9] for processors P0 to P9; half takes the values 10, 5, 2, and 1 on successive steps.]
Process Synchronization

- Need to be able to coordinate processes working on a common task

- Lock variables (**semaphores**) are used to coordinate or synchronize processes

- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  - Single bus provides arbitration mechanism, since the bus is the only path to memory – the processor that gets the bus wins

- Need an architecture-supported operation that locks the variable
  - Locking can be done via an **atomic swap operation** (on the MIPS we have **ll** and **sc**, one example of where a processor can both read a location **and** set it to the locked state – **test-and-set** – in the same bus operation)
Spin Lock Synchronization

The *single winning* processor will succeed in writing a 1 to the lock variable - all other processors will get a return code of 0.
Review: Summing Numbers on a SMP

- Pn is the processor’s number, vectors A and sum are shared variables, i is a private variable, half is a private variable initialized to the number of processors.

```c
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];      /* each processor sums its subset of vector A */

repeat                             /* adding together the partial sums */
    synch();                       /* synchronize first */
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
    half = half/2;
    if (Pn<half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);                 /* final sum in sum[0] */
```
An Example with 10 Processors

- `synch()`: Processors must synchronize before the “consumer” processor tries to read the results from the memory location written by the “producer” processor.

- **Barrier synchronization** – a synchronization scheme where processors wait at the barrier, not proceeding until every processor has reached it.

```
P0  P1  P2  P3  P4  P5  P6  P7  P8  P9
```
Barrier Implemented with Spin-Locks

- n is a shared variable initialized to the number of processors, count is a shared variable initialized to 0, arrive and depart are shared spin-lock variables where arrive is initially unlocked and depart is initially locked.

      procedure synch()

          lock(arrive);
              count := count + 1;      /* count the processors as */
              if count < n             /* they arrive at barrier  */
                  then unlock(arrive)
                  else unlock(depart);

          lock(depart);
              count := count - 1;      /* count the processors as */
              if count > 0             /* they leave barrier      */
                  then unlock(depart)
                  else unlock(arrive);
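
The barrier logic can be tried out with threads. The sketch below is an approximation under a stated assumption: Python's `threading.Lock` (which blocks rather than spins, and may be released by a thread other than the one that acquired it) stands in for the spin-lock variables. The class name `TwoLockBarrier` and the helper `run_phases` are hypothetical.

```python
import threading


class TwoLockBarrier:
    """Barrier built from an `arrive` lock and a `depart` lock, as in synch()."""

    def __init__(self, n: int):
        self.n = n
        self.count = 0
        self.arrive = threading.Lock()   # initially unlocked
        self.depart = threading.Lock()
        self.depart.acquire()            # initially locked

    def synch(self) -> None:
        self.arrive.acquire()
        self.count += 1                  # count the threads as they arrive
        if self.count < self.n:
            self.arrive.release()
        else:
            self.depart.release()        # last arrival opens the exit

        self.depart.acquire()
        self.count -= 1                  # count the threads as they leave
        if self.count > 0:
            self.depart.release()
        else:
            self.arrive.release()        # last departure reopens the entry


def run_phases(num_threads: int, num_phases: int) -> list[int]:
    """Each thread logs its phase number, then waits at the barrier."""
    barrier = TwoLockBarrier(num_threads)
    log: list[int] = []

    def worker() -> None:
        for phase in range(num_phases):
            log.append(phase)
            barrier.synch()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier works, no thread can log phase k+1 before every thread has logged phase k, so the list returned by `run_phases` is in non-decreasing order. Keeping `arrive` locked while threads depart is what stops a fast thread from re-entering the barrier early.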
Spin-Locks on Bus Connected ccUMAs

- With a bus based cache coherency protocol (write invalidate), spin-locks allow processors to wait on a local copy of the lock in their caches
  - Reduces bus traffic – once the processor with the lock releases the lock (writes a 0) all other caches see that write and invalidate their old copy of the lock variable. Unlocking restarts the race to get the lock. The winner gets the bus and writes the lock back to 1. The other caches then invalidate their copy of the lock and on the next lock read fetch the new lock value (1) from memory.

- This scheme has problems scaling up to many processors because of the communication traffic when the lock is released and contested
## Aside: Cache Coherence Bus Traffic

<table>
<thead>
<tr>
<th>Step</th>
<th>Proc P0</th>
<th>Proc P1</th>
<th>Proc P2</th>
<th>Bus activity</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Has lock</td>
<td>Spins</td>
<td>Spins</td>
<td>None</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>Releases lock (0)</td>
<td>Spins</td>
<td>Spins</td>
<td>Bus services P0’s invalidate</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td>Cache miss</td>
<td>Cache miss</td>
<td>Bus services P2’s cache miss</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td>Waits</td>
<td>Reads lock (0)</td>
<td>Response to P2’s cache miss</td>
<td>Update lock in memory from P0</td>
</tr>
<tr>
<td>5</td>
<td></td>
<td>Reads lock (0)</td>
<td>Swaps lock (ll,sc of 1)</td>
<td>Bus services P1’s cache miss</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
<td>Swaps lock (ll,sc of 1)</td>
<td>Swap succeeds</td>
<td>Response to P1’s cache miss</td>
<td>Sends lock variable to P1</td>
</tr>
</tbody>
</table>
Message Passing Multiprocessors (MPP)

- Each processor has its own private address space

- Q1 – Processors share data by *explicitly* sending and receiving information (*message passing*)

- Q2 – Coordination is built into message passing primitives (*message send* and *message receive*)
Summing 100,000 Numbers on 100 Proc. MPP

- Start by distributing 1000 elements of vector $A$ to each of the local memories and summing each subset in parallel

  ```
  sum = 0;
  for (i = 0; i<1000; i = i + 1)
      sum = sum + A[i];              /* sum local array subset */
  ```

- The processors then coordinate in adding together the sub sums ($P_n$ is the number of the processor, send($x$, $y$) sends value $y$ to processor $x$, and receive() receives a value)

  ```
  half = 100;
  limit = 100;
  repeat
      half = (half+1)/2;             /* dividing line */
      if (P_n >= half && P_n < limit) send(P_n-half, sum);
      if (P_n < (limit/2)) sum = sum + receive();
      limit = half;
  until (half == 1);                 /* final sum in P0’s sum */
  ```
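
The send/receive pattern can be simulated on one machine to confirm that it also handles odd processor counts. In this sketch, which is an illustration rather than the slide's code, each processor gets a `deque` as a hypothetical mailbox, `append` plays the role of send, and `popleft` plays the role of receive.

```python
from collections import deque


def mpp_tree_sum(values: list[int], num_procs: int) -> int:
    """Simulate the message-passing reduction; returns P0's final sum."""
    if num_procs < 1 or len(values) % num_procs != 0:
        raise ValueError("values must divide evenly among the processors")
    chunk = len(values) // num_procs

    # Each processor sums the slice held in its own private memory.
    local_sum = [sum(values[pn * chunk:(pn + 1) * chunk]) for pn in range(num_procs)]
    mailbox = [deque() for _ in range(num_procs)]

    half = limit = num_procs
    while half > 1:
        half = (half + 1) // 2                    # dividing line, rounded up
        for pn in range(half, limit):             # upper part sends ...
            mailbox[pn - half].append(local_sum[pn])
        for pn in range(limit // 2):              # ... lower part receives
            local_sum[pn] += mailbox[pn].popleft()
        limit = half
    return local_sum[0]


if __name__ == "__main__":
    data = list(range(100_000))
    print(mpp_tree_sum(data, 100) == sum(data))
```

Rounding `half` up means that when `limit` is odd the middle processor neither sends nor receives in that round; its sum is forwarded in a later round.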
An Example with 10 Processors

[Figure: each of P0 to P9 starts with a local sum. In each round the upper part of the active processors send their sums to the lower part: (half, limit) = (10, 10) initially, then (5, 5), (3, 3), (2, 2), and finally half = 1, with the total in P0’s sum.]
Pros and Cons of Message Passing

- Message sending and receiving is much slower than addition, for example.
- But message passing multiprocessors are much easier for hardware designers to design.
  - Don’t have to worry about cache coherency for example.
- The advantage for programmers is that communication is explicit, so there are fewer “performance surprises” than with the implicit communication in cache-coherent SMPs.
- However, it’s harder to port a sequential program to a message passing multiprocessor since every communication must be identified in advance.
  - With cache-coherent shared memory the hardware figures out what data needs to be communicated.
Networks of Workstations (NOWs) Clusters

- Clusters of off-the-shelf, whole computers with multiple private address spaces connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the processor-memory (front side) bus
  - lower speed network links
  - more conflicts with I/O traffic

- Clusters of N processors have N copies of the OS, limiting the memory available for applications

- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability

- Economy-of-scale advantages with respect to costs
## Commercial (NOW) Clusters

<table>
<thead>
<tr>
<th></th>
<th>Proc</th>
<th>Proc Speed</th>
<th># Proc</th>
<th>Network</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dell PowerEdge</td>
<td>P4 Xeon</td>
<td>3.06GHz</td>
<td>2,500</td>
<td>Myrinet</td>
</tr>
<tr>
<td>eServer IBM SP</td>
<td>Power4</td>
<td>1.7GHz</td>
<td>2,944</td>
<td></td>
</tr>
<tr>
<td>VPI BigMac</td>
<td>Apple G5</td>
<td>2.3GHz</td>
<td>2,200</td>
<td>Mellanox Infiniband</td>
</tr>
<tr>
<td>HP ASCI Q</td>
<td>Alpha 21264</td>
<td>1.25GHz</td>
<td>8,192</td>
<td>Quadrics</td>
</tr>
<tr>
<td>LLNL Thunder</td>
<td>Intel Itanium2</td>
<td>1.4GHz</td>
<td>1,024*4</td>
<td>Quadrics</td>
</tr>
<tr>
<td>Barcelona</td>
<td>PowerPC 970</td>
<td>2.2GHz</td>
<td>4,536</td>
<td>Myrinet</td>
</tr>
</tbody>
</table>
Multithreading on A Chip

- Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions.

- **Hardware multithreading** – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
  - Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread.
  - The caches, TLBs, BHT, BTB, RUU can be shared (although the miss rates may increase if they are not sized accordingly).
  - The memory can be shared through virtual memory mechanisms.
  - Hardware must support efficient thread context switching.
Types of Multithreading

- **Fine-grain** – switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads)
  - Processor must be able to switch threads on every clock cycle
  - Advantage – can hide throughput losses that come from both short and long stalls
  - Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads

- **Coarse-grain** – switches threads only on costly stalls (e.g., L2 cache misses)
  - Advantages – thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - Pipeline must be flushed and refilled on thread switches
Multithreaded Example: Sun’s Niagara (UltraSparc T2)

- Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

<table>
<thead>
<tr>
<th>Feature</th>
<th>Niagara 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data width</td>
<td>64-b</td>
</tr>
<tr>
<td>Clock rate</td>
<td>1.4 GHz</td>
</tr>
<tr>
<td>Cache</td>
<td>16K/8K/4M (I/D/L2)</td>
</tr>
<tr>
<td>Issue rate</td>
<td>1 issue</td>
</tr>
<tr>
<td>Pipe stages</td>
<td>6 stages</td>
</tr>
<tr>
<td>BHT entries</td>
<td>None</td>
</tr>
<tr>
<td>TLB entries</td>
<td>64I/64D</td>
</tr>
<tr>
<td>Memory BW</td>
<td>60+ GB/s</td>
</tr>
<tr>
<td>Transistors</td>
<td>??? million</td>
</tr>
<tr>
<td>Power (max)</td>
<td>&lt;95 W</td>
</tr>
</tbody>
</table>
Niagara Integer Pipeline

- Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient

From *MPR*, Vol. 18, #9, Sept. 2004
Simultaneous Multithreading (SMT)

- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  - Most SS processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs’ available ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Need separate rename tables (RUUs) for each thread or need to be able to indicate which thread the entry belongs to
    - Need the capability to commit from multiple threads in one cycle

- Intel’s Pentium 4 SMT is called hyperthreading
  - Supports just two threads (doubles the architecture state)
Threading on a 4-way SS Processor Example

[Figure: issue-slot diagrams (issue slots across, time down) showing how threads A, B, C, and D occupy a 4-way superscalar under coarse MT, fine MT, and SMT.]
Q1 – How do they share data?

Q2 – How do they coordinate?

Q3 – How scalable is the architecture? How many processors?

<table>
<thead>
<tr>
<th></th>
<th></th>
<th># of Proc</th>
</tr>
</thead>
<tbody>
<tr>
<td>Communication model</td>
<td>Message passing</td>
<td>8 to 2048</td>
</tr>
<tr>
<td></td>
<td>Shared address: NUMA</td>
<td>8 to 256</td>
</tr>
<tr>
<td></td>
<td>Shared address: UMA</td>
<td>2 to 64</td>
</tr>
<tr>
<td>Physical connection</td>
<td>Network</td>
<td>8 to 256</td>
</tr>
<tr>
<td></td>
<td>Bus</td>
<td>2 to 36</td>
</tr>
</tbody>
</table>
Flynn’s Classification Scheme

- **SISD** – single instruction, single data stream
  - aka uniprocessor - what we have been talking about all semester

- **SIMD** – single instruction, multiple data streams
  - single control unit broadcasting operations to multiple datapaths

- **MISD** – multiple instructions, single data stream
  - no such machine (although some people put vector machines in this category)

- **MIMD** – multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

- Now obsolete terminology except for . . .
SIMD Processors

- Single control unit (one copy of the code)
- Multiple datapaths (Processing Elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
## Example SIMD Machines

<table>
<thead>
<tr>
<th>Machine</th>
<th>Year</th>
<th># PEs</th>
<th># b/PE</th>
<th>Max memory (MB)</th>
<th>PE clock (MHz)</th>
<th>System BW (MB/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Illiac IV</td>
<td>1972</td>
<td>64</td>
<td>64</td>
<td>1</td>
<td>13</td>
<td>2,560</td>
</tr>
<tr>
<td>DAP</td>
<td>1980</td>
<td>4,096</td>
<td>1</td>
<td>2</td>
<td>5</td>
<td>2,560</td>
</tr>
<tr>
<td>MPP</td>
<td>1982</td>
<td>16,384</td>
<td>1</td>
<td>2</td>
<td>10</td>
<td>20,480</td>
</tr>
<tr>
<td>CM-2</td>
<td>1987</td>
<td>65,536</td>
<td>1</td>
<td>512</td>
<td>7</td>
<td>16,384</td>
</tr>
<tr>
<td>MP-1216</td>
<td>1989</td>
<td>16,384</td>
<td>4</td>
<td>1024</td>
<td>25</td>
<td>23,000</td>
</tr>
</tbody>
</table>

Did SIMDs die out in the early 1990s ??
Multimedia SIMD Extensions

- The most widely used variation of SIMD is found in almost every microprocessor today – as the basis of MMX and SSE instructions added to improve the performance of multimedia programs
  - A single, wide ALU is partitioned into many smaller ALUs that operate in parallel
    - Loads and stores are simply as wide as the widest ALU, so the same data transfer can transfer one 32 bit value, two 16 bit values or four 8 bit values

- There are now hundreds of SSE instructions in the x86 to support multimedia operations
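
The idea of partitioning one wide ALU into several narrow ones can be imitated in software. The sketch below treats a 32-bit integer as four independent 8-bit lanes and adds all four with ordinary integer operations, masking so that carries never cross a lane boundary. The function names are illustrative, and this is a software analogy rather than how MMX or SSE hardware is built.

```python
LANE_MSB = 0x80808080   # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F   # low seven bits of each lane


def packed_add_u8x4(a: int, b: int) -> int:
    """Add four 8-bit lanes of two 32-bit words, wrapping within each lane."""
    # Add the low 7 bits of every lane; these sums cannot spill into a neighbor.
    low = (a & LANE_LOW) + (b & LANE_LOW)
    # Fold in each lane's top bit with XOR, discarding the carry out of the lane.
    return (low ^ ((a ^ b) & LANE_MSB)) & 0xFFFFFFFF


def lanes(word: int) -> list[int]:
    """Split a 32-bit word into its four bytes, most significant first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


if __name__ == "__main__":
    result = packed_add_u8x4(0x01FF7F10, 0x0101017F)
    print(hex(result), lanes(result))
```

In the example the second lane wraps from 0xFF + 0x01 to 0x00 without disturbing the first lane, which is the behavior a partitioned ALU provides in a single operation.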
Vector Processors

- A vector processor (e.g., Cray) pipelines the ALUs to get good performance at lower cost. A key feature is a set of vector registers to hold the operands and results.
  - Collect the data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers, and then write the results back to memory
  - They formed the basis of supercomputers in the 1980’s and 90’s

- Consider extending the MIPS instruction set (VMIPS) to include vector instructions, e.g.,
  - `addv.d` to add two double precision vector register values
  - `addvs.d` and `mulvs.d` to add (or multiply) a scalar register to (by) each element in a vector register
  - `lv` and `sv` do vector load and vector store and load or store an entire vector of double precision data
### MIPS vs VMIPS DAXPY Codes: \( Y = a \times X + Y \)

```assembly
      l.d    $f0,a($sp)        ; load scalar a
      addiu  r4,$s0,#512       ; upper bound to load to
loop: l.d    $f2,0($s0)        ; load X(i)
      mul.d  $f2,$f2,$f0       ; a * X(i)
      l.d    $f4,0($s1)        ; load Y(i)
      add.d  $f4,$f4,$f2       ; a * X(i) + Y(i)
      s.d    $f4,0($s1)        ; store into Y(i)
      addiu  $s0,$s0,#8        ; increment X index
      addiu  $s1,$s1,#8        ; increment Y index
      subu   $t0,r4,$s0        ; compute bound
      bne    $t0,$zero,loop    ; check if done
```

```assembly
      l.d     $f0,a($sp)       ; load scalar a
      lv      $v1,0($s0)       ; load vector X
      mulvs.d $v2,$v1,$f0      ; vector-scalar multiply
      lv      $v3,0($s1)       ; load vector Y
      addv.d  $v4,$v2,$v3      ; add Y to a * X
      sv      $v4,0($s1)       ; store vector result
```
Vector versus Scalar

- Instruction fetch and decode bandwidth is dramatically reduced (also saves power)
  - Only six instructions in VMIPS versus almost 600 in MIPS for 64 element DAXPY

- Hardware doesn’t have to check for data hazards within a vector instruction. A vector instruction will only stall for the first element, then subsequent elements will flow smoothly down the pipeline. And control hazards are nonexistent.
  - MIPS stall frequency is about 64 times higher than VMIPS for DAXPY

- Easier to write code for data-level parallel applications

- Have a known access pattern to memory, so heavily interleaved memory banks work well. The cost of latency to memory is seen only once for the entire vector
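
The instruction counts quoted above follow from the scalar DAXPY listing: two setup instructions plus nine instructions per loop iteration. The Python sketch below does that arithmetic and, for reference, computes DAXPY itself; the function and constant names are illustrative.

```python
SETUP_INSTRUCTIONS = 2       # l.d of scalar a, addiu for the upper bound
LOOP_BODY_INSTRUCTIONS = 9   # l.d, mul.d, l.d, add.d, s.d, addiu, addiu, subu, bne
VMIPS_INSTRUCTIONS = 6       # l.d, lv, mulvs.d, lv, addv.d, sv


def mips_daxpy_instructions(n: int) -> int:
    """Dynamic instruction count of the scalar loop for n elements."""
    return SETUP_INSTRUCTIONS + LOOP_BODY_INSTRUCTIONS * n


def daxpy(a: float, x: list[float], y: list[float]) -> list[float]:
    """Y = a * X + Y, element by element."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(mips_daxpy_instructions(64), "scalar vs", VMIPS_INSTRUCTIONS, "vector")
```

For 64 elements the scalar loop executes 578 instructions, which is the "almost 600" figure, against six vector instructions.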
## Example Vector Machines

<table>
<thead>
<tr>
<th>Machine</th>
<th>Year</th>
<th>Peak perf.</th>
<th># vector Processors</th>
<th>PE clock (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>STAR-100</td>
<td>1970</td>
<td>??</td>
<td>113</td>
<td>2</td>
</tr>
<tr>
<td>ASC</td>
<td>1970</td>
<td>20 MFLOPS</td>
<td>1, 2, or 4</td>
<td>16</td>
</tr>
<tr>
<td>Cray 1</td>
<td>1976</td>
<td>80 to 240 MFLOPS</td>
<td></td>
<td>80</td>
</tr>
<tr>
<td>Cray Y-mp</td>
<td>1988</td>
<td>333 MFLOPS</td>
<td>2, 4, or 8</td>
<td>167</td>
</tr>
<tr>
<td>Earth Simulator</td>
<td>2002</td>
<td>35.86 TFLOPS</td>
<td>8</td>
<td></td>
</tr>
</tbody>
</table>

Did Vector machines die out in the late 1990s ??
Graphics Processing Units (GPUs)

- GPUs are accelerators that supplement a CPU so they do not need to be able to perform all of the tasks of a CPU. They dedicate *all* of their resources to graphics
  - CPU-GPU combination – *heterogeneous* multiprocessing

- Programming interfaces are free from backward binary compatibility constraints, resulting in more rapid innovation in GPUs than in CPUs
  - Application programming interfaces (APIs) such as OpenGL and DirectX coupled with high-level graphics shading languages such as NVIDIA’s Cg and CUDA and Microsoft’s HLSL

- GPU data types are vertex (x, y, z, w) coordinates and pixel (red, green, blue, alpha) color components

- GPUs execute many threads (e.g., vertex and pixel shading) in parallel – *lots* of data-level parallelism
Typical GPU Architecture Features

- Rely on having enough threads to hide the latency to memory (not caches as in CPUs)
  - Each GPU is highly multithreaded

- Use extensive parallelism to get high performance
  - Have extensive set of SIMD instructions; moving towards multicore

- Main memory is bandwidth-driven, not latency-driven
  - GPU DRAMs are wider and have higher bandwidth, but are typically smaller, than CPU memories

- Leaders in the marketplace (in 2008)
  - NVIDIA GeForce 8800 GTX (16 multiprocessors each with 8 multithreaded processing units)
  - AMD’s ATI Radeon and ATI FireGL
  - Watch out for Intel’s Larrabee
Supercomputer Style Migration (Top500)

Uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%. Now it is 98% Clusters and MPPs.
Review: Shared Memory Multiprocessors (SMP)

- Q1 – Single address space shared by all processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time
  
  ![Diagram of processors and caches](image)

- They come in two styles
  - Uniform memory access (UMA) multiprocessors
  - Nonuniform memory access (NUMA) multiprocessors
Message Passing Multiprocessors (MPP)

- Each processor has its own private address space
- Q1 – Processors share data by explicitly sending and receiving information (message passing)
- Q2 – Coordination is built into message passing primitives (message send and message receive)
Communication in Network Connected Multi’s

- Implicit communication via loads and stores
  - Hardware designers have to provide coherent caches and process (thread) synchronization primitives (like `ll` and `sc`)
  - Lower communication overhead
  - Harder to overlap computation with communication
  - More efficient to use an address to remote data when *needed* rather than to send for it in case it *might* be used

- Explicit communication via sends and receives
  - Simplest solution for hardware designers
  - Higher communication overhead
  - Easier to overlap computation with communication
  - Easier for the programmer to optimize communication
IN Performance Metrics

- **Network cost**
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link wires (on chip)

- **Network bandwidth (NB)** – represents the best case
  - bandwidth of each link * number of links

- **Bisection bandwidth (BB)** – closer to the worst case
  - divide the machine in two parts, each with half the nodes and sum the bandwidth of the links that cross the dividing line

- **Other IN performance issues**
  - latency on an unloaded network to send and receive messages
  - throughput – maximum # of messages transmitted per unit time
  - # routing hops worst case, congestion control and delay, fault tolerance, power efficiency
Bus IN

- N processors, 1 switch (●), 1 link (the bus)
- Only 1 transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
Ring IN

- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
Fully Connected IN

- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
  - \( NB = \text{link bandwidth} \times \frac{(N \times (N-1))}{2} \)
  - \( BB = \text{link bandwidth} \times (N/2)^2 \)
Crossbar (Xbar) Connected IN

- N processors, $N^2$ switches (unidirectional), 2 links/switch, $N^2$ links
- N simultaneous transfers
  - $NB = \text{link bandwidth} \times N$
  - $BB = \text{link bandwidth} \times N/2$
Hypercube (Binary N-cube) Connected IN

- N processors, N switches, logN links/switch, \((N\log N)/2\) links
- N simultaneous transfers
  - \(NB = \text{link bandwidth} \times (N\log N)/2\)
  - \(BB = \text{link bandwidth} \times N/2\)
2D and 3D Torus Connected IN

- N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links (2D) or 6N/2 links (3D)
- N simultaneous transfers
  - NB = link bandwidth * 2N (2D) or link bandwidth * 3N (3D), i.e., link bandwidth times the number of links
  - BB = link bandwidth * 2 N^{1/2} (2D) or link bandwidth * 2 N^{2/3} (3D)
## IN Comparison

- For a 64 processor system

<table>
<thead>
<tr>
<th></th>
<th>Bus</th>
<th>Ring</th>
<th>2D Torus</th>
<th>6-cube</th>
<th>Fully connected</th>
</tr>
</thead>
<tbody>
<tr>
<td>Network bandwidth</td>
<td>1</td>
<td>64</td>
<td>128</td>
<td>192</td>
<td>2016</td>
</tr>
<tr>
<td>Bisection bandwidth</td>
<td>1</td>
<td>2</td>
<td>16</td>
<td>32</td>
<td>1024</td>
</tr>
<tr>
<td>Total # of switches</td>
<td>1</td>
<td>64</td>
<td>64</td>
<td>64</td>
<td>64</td>
</tr>
<tr>
<td>Links per switch</td>
<td></td>
<td>2+1</td>
<td>4+1</td>
<td>6+1</td>
<td>63+1</td>
</tr>
<tr>
<td>Total # of links (bidi)</td>
<td>1</td>
<td>64+64</td>
<td>128+64</td>
<td>192+64</td>
<td>2016+64</td>
</tr>
</tbody>
</table>
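
Several rows of the table can be regenerated from the per-topology formulas. The sketch below covers the bus, ring, hypercube, and fully connected networks, in units of one link's bandwidth, taking NB as link bandwidth times the number of network links; the torus is left out for brevity. The function name and dictionary keys are illustrative.

```python
def interconnect_metrics(n: int) -> dict[str, dict[str, int]]:
    """Link count, network bandwidth (NB), and bisection bandwidth (BB) for n nodes."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two (needed for the hypercube)")
    log_n = n.bit_length() - 1            # log2(n) for a power of two

    def entry(links: int, bb: int) -> dict[str, int]:
        # NB is one link bandwidth per network link.
        return {"links": links, "nb": links, "bb": bb}

    return {
        "bus": entry(1, 1),
        "ring": entry(n, 2),
        "hypercube": entry(n * log_n // 2, n // 2),
        "fully connected": entry(n * (n - 1) // 2, (n // 2) ** 2),
    }


if __name__ == "__main__":
    for name, metrics in interconnect_metrics(64).items():
        print(f"{name:16s} NB={metrics['nb']:5d}  BB={metrics['bb']:5d}")
```

For 64 processors this yields NB/BB of 64/2 for the ring, 192/32 for the 6-cube, and 2016/1024 for the fully connected network, matching the corresponding table columns.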
“Fat” Trees

- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.

- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth on a tree is horrible - 1 link, at all times.

- The solution is to 'thicken' the upper links.
  - Having more links as you work towards the root of the tree increases the bisection bandwidth.

- Rather than design a bunch of N-port switches, use pairs of switches.
- N processors, $\log(N-1) * \log N$ switches, 2 up + 4 down = 6 links/switch, $N \times \log N$ links
- N simultaneous transfers
  - NB = link bandwidth $\times N \log N$
  - BB = link bandwidth $\times 4$