1. (10%) Suppose that a disk drive has 1000 cylinders, numbered 0 to 999. The drive is currently serving a request at cylinder 200. The queue of pending requests, in FIFO order, is 80, 910, 190, 120, 10, 280, 900

Starting from the current head position, what is the service order for each of the following disk-scheduling algorithms? (no need to compute the total distance)
(a) SSTF
(b) SCAN (move to higher number first)
(c) C-SCAN (move to higher number first)
Note that SSTF stands for shortest-seek-time-first and C-SCAN stands for circular SCAN.

(a) $200 \to 190 \to 120 \to 80 \to 10 \to 280 \to 900 \to 910$
(b) $200 \to 280 \to 900 \to 910 \to 190 \to 120 \to 80 \to 10$
(c) $200 \to 280 \to 900 \to 910 \to 10 \to 80 \to 120 \to 190$

2. (4%) Figures (a) and (b) below depict two different RAID designs, where $X_p$ is the parity bit of data $X.$ Explain the main advantage of the RAID in Fig. (b) over the RAID in Fig. (a).

3. (10%) Suppose we have a demand paged memory. In the system, we try to reduce the page faults with LRU and FIFO replacement schemes, respectively. Consider the following two C programs: Program A and Program B. Assume that the number N in the program is sufficiently large.

Program A

#include <stdio.h>

double A[N][N];
double Anew[N][N];

Compute() {
int i, j;

for (i = 1; i < (N-1); i++)
for (j = 1; j < (N-1); j++)
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i+2] + A[j][i-1]);
}

Program B

#include <stdio.h>

double A[N][N];
double Anew[N][N];

Compute() {
int i, j;

for (j = 1; j < (N-1); j++)
for (i = 1; i < (N-1); i++)
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i+2] + A[j][i-1]);
}

(a) Assume the LRU replacement algorithm is used. Which program is likely to incur fewer page faults? Please explain the reason.
(b) Assume the FIFO replacement scheme is used. Which program is likely to incur fewer page faults? Please explain the reason.

(a) Program B will incur fewer page faults than Program A, because Program B accesses the arrays in row-major order (the inner loop walks along a row), so each page is faulted in only once, giving only about N-2 page faults in total.
(b) Same as above: Program B.
Note: all of these faults are compulsory misses, so there is no difference between LRU and FIFO.

4. (4%) Which of the following programming techniques and structures are good for a demand-paged environment? Which are “not good”? Explain your answers.

(a) Stack
(b) Sequential Search
(c) Vector operations
(d) Indirection

5. (9%) Consider a system running two I/O-bound processes, P1 and P2, and one CPU-bound process, P3. Assume that each I/O-bound process issues an I/O operation after 0.9 millisecond of CPU computing and that each I/O operation takes 8.9 milliseconds to complete. Assume multiple I/O operations can be executed simultaneously without causing any delay. Also assume that a context switch takes 0.1 millisecond and that all processes are long-running tasks. Suppose the Round-Robin scheduler is used and the time quantum is 3.9 milliseconds.

(a) Suppose initially the processes in the ready queue are P1, P2, and P3, in that order. Draw the Gantt chart for the first 20 milliseconds.
(b) What is the CPU utilization in the long run?
(c) What is the throughput of I/O operations (per second) in the long run?

(a) P1 (0–0.9), switch (0.9–1.0), P2 (1.0–1.9), switch (1.9–2.0), P3 (2.0–13.7; three consecutive quanta, since no other process is ready at 5.9, and at 9.8 P1's I/O completes exactly when a quantum expires and P3 is placed back in the ready queue first), switch (13.7–13.8), P1 (13.8–14.7), switch (14.7–14.8), P2 (14.8–15.7), switch (15.7–15.8), P3 (15.8–20).

(b) $\dfrac{13.5}{13.8} \approx 0.98$
(c) 145 I/O operations per second

6. (8%) A synchronization barrier is a common paradigm in many parallel applications. A barrier is supposed to block the calling thread until all N threads have reached the barrier. The following code uses semaphores to implement the barrier function. The initial values of the variables are:

N = 2 (which means there are two processes, P1 and P2.)
count = 0
R1 = Semaphore(1)
R2 = Semaphore(0)

R1.wait()                      // 1
count = count + 1              // 2
if (count == N) R2.signal()    // 3
R2.wait()                      // 4
R2.signal()                    // 5
R1.signal()                    // 6

(a) Show that the execution sequence P1(1), P1(2), P1(3), P1(4), P2(1) causes a deadlock, where Pi(j) means process i executes line j.
(b) Draw the Resource-Allocation Graph for the system after the execution sequence.
(c) Does your Resource-Allocation Graph have a cycle? If yes, label each edge with an execution code Pi(j) that creates the edge. If no, explain why your Resource-Allocation Graph cannot detect the deadlock.
(d) Rearrange the above code to produce a correct implementation of a synchronization barrier without any deadlock for an arbitrary execution order of P1 and P2. Give your answer as a permutation of the numbers 1 to 6.

(a) P1 acquires R1 at line 1 and then blocks at line 4 waiting for a signal on R2, which is only sent when count == N. But P2 is stuck at line 1 waiting for R1, which P1 still holds, so count never reaches N and neither process can proceed: deadlock occurs.
(b)

(c) No. To use the RAG to detect deadlock, every resource must have exactly one instance. In this case, resource R2 has no instance (its semaphore value is 0), which is why we cannot detect the deadlock with this RAG.
(d) 1, 2, 6, 3, 4, 5.

7. (8%) Given the table of latencies of the individual stages of a MIPS instruction, please answer the following questions:

(a) What is the clock cycle time in a nonpipelined and pipelined processor?
(b) What is the total latency of a lw instruction in a nonpipelined and pipelined processor?
(c) If the time for an ALU operation can be shortened/increased by 25%, how much speedup/slowdown will it be from pipelining? Explain the reason.
(d) If you can split one stage into two, each with half the original latency, to improve the pipelining performance. Which stage would you split and what is the new clock cycle time in a pipelined processor?

(a) Nonpipelined: 1050ps, Pipelined: 400ps.
(b) Nonpipelined: 1050ps, Pipelined: 2000ps.
(c) Shortening or increasing the ALU operation time by 25% does not change the clock cycle time, because the ALU stage is not the longest stage; therefore the speedup obtained from pipelining is unaffected.
(d) Data Access stage. New clock cycle time: 250ps.

8. (a) Describe the following terminologies:

• Memory-mapped I/O.
• I/O instruction.
• Device polling.
• Interrupt-driven communication.

(b) Which communication pattern is most appropriate for a “Video Game Controller”? Explain.
(c) Prioritize the following interrupts caused by different devices:

• Mouse Controller.
• Reboot.
• Overheat.

(a)
Memory-mapped I/O: An I/O scheme in which portions of address space are assigned to I/O devices and reads and writes to those addresses are interpreted as commands to the I/O device.
I/O instruction: A dedicated instruction that is used to give a command to an I/O device and that specifies both the device number and the command word (or the location of the command word in memory).
Device polling: The process of periodically checking the status of an I/O device to determine whether the device needs to be serviced.
Interrupt-driven communication: An I/O scheme that employs interrupts to indicate to the processor that an I/O device needs attention.
(b) Device polling. A game controller is read at a regular, high rate and its input is frequent, so polling it once per frame is simpler and cheaper than taking an interrupt for every input event.
(c) Overheat: 1, Reboot: 2, Mouse Controller: 3.

9. (5%) In general, a fully associative cache is better than a direct-mapped cache in terms of miss rate. However, it is not always the case for a cache, especially for a small cache. Please design an example to demonstrate that a direct-mapped cache outperforms a fully associative cache in terms of miss rate under the replacement policy, Least Recently Used (LRU).

A concrete example: use the cyclic reference string of block addresses 0, 1, 2, 0, 1, 2, … on two caches that each hold two one-word blocks. In the fully associative cache with LRU, every access evicts exactly the block that will be needed next, so every access misses. In the direct-mapped cache, blocks 0 and 2 conflict in set 0 while block 1 stays resident in set 1, so the accesses to block 1 hit after the first. This example shows the miss rate of the fully associative cache (100%) is higher than that of the direct-mapped cache (about 2/3).

10. (10%) Consider adding a new index addressing mode to the machine A such that the following code sequence

ADD R1, R1, R2 // R1 = R1 + R2

LW  Rd, 0(R1)  // Load

can be combined as

LW  Rd, 0(R1 + R2)

(a) Assume that load and store instructions are 10% and 20% of all the benchmarks considered, respectively, and this new addressing mode can be used for 5% of both of them. Determine the ratio of the instruction count on machine A before adding the new addressing mode to the instruction count after adding it.
(b) Assume that adding this new addressing mode also increases the clock cycle by 3%, which machine (either machine A before adding the new addressing mode or machine A after adding the new addressing mode) will be faster and by how much?

(a) $\dfrac{100}{98.5} \approx 1.015$
(b) Machine A before adding the new addressing mode is faster, by a factor of $\dfrac{98.5 \times 1.03}{100} = 1.01455.$

11. (8%) The following Boolean equations describe a 3-output logic function.

O1 = A'BC'D' + AB'C + FGH + E'

O2 = BCD + CDE'H + FGH'

O3 = AB'CD + ABEF + F'G'

(a) Draw a circuit diagram for O1 using only 4-input NAND gates.
(b) If the 3-output function is implemented with a read-only memory, what size of ROM is needed?

(a)

(b) $2^8 \times 3 \text{ bits} = 768 \text{ bits}$

12. (8%) Assume a floating-point number is stored in a 32-bit-wide format. The leftmost bit is the sign bit, the base-16 exponent in excess-64 format is 7 bits wide, and the mantissa is 24 bits long with a hidden 1 assumed.

(a) What is the smallest positive number representable in this format (write the answer in hex and decimal)?
(b) What is the largest positive number representable (write the answer in hex and decimal)?
(c) Assume $A = 2.34375 \times 10^{-2}$ and $B = 2.9015625 \times 10^2.$ Write down the bit pattern using this format for A and B.
(d) Using the format presented in (c), calculate $A - B.$

13. (8%) Assume garbage collection comprises 30% of the cycles of a program. There are two methods to speed up the program. The first would handle garbage collection automatically in hardware; this increases the cycle time by a factor of 1.4. The second provides new hardware instructions to be used during garbage collection; this would use only one-third of the instructions needed for garbage collection but increases the cycle time by a factor of 1.2. Compute the program execution time for both methods and show which one is better.