1. (7%) Consider the mechanisms needed to support a system call.

(a) Normally, a TRAP instruction is used to make a system call. Explain the steps involved in making a system call.
(b) How does the OS know which system call is being made?
(c) Explain why a subroutine CALL/RETURN instruction cannot be used to make/return from a system call.

(a)
1. The user program pushes the system call's parameters onto the stack.
2. It calls the library procedure (wrapper) for the system call.
3. The wrapper places the system-call number in a designated register.
4. The TRAP instruction switches the CPU to kernel mode and transfers control to a fixed entry point in the kernel.
5. The kernel's system-call dispatcher uses the number in the register as an index into the system-call table, invokes the intended system call, and places the status and any return value where the caller can retrieve them.
6. A return-from-trap instruction switches back to user mode and returns control to the wrapper, after which the caller increments the stack pointer to discard the parameters.
(b) From the system-call number placed in the designated register: the trap handler reads it and uses it to index the system-call (dispatch) table. The cause register and interrupt vector only tell the OS that a system-call trap occurred, not which call it was.
(c) A plain CALL does not switch the CPU from user mode to kernel mode, so the kernel could not run with kernel privileges; conversely, a plain RETURN would not switch back, so the returned process would keep executing in kernel mode. Only the TRAP / return-from-trap pair performs the required privilege-mode switch.

2. (6%) Consider process scheduling algorithms FCFS (first-come first-serve), SJF (shortest-job first), and Round-Robin (RR). Each one may be preemptive or nonpreemptive.

(a) Which algorithm(s) can potentially cause starvation? Explain by describing a scenario. Be sure to state whether preemptive or nonpreemptive.
(b) Which two of the six algorithms (preemptive or nonpreemptive FCFS, SJF, and RR) are essentially the same? Why?

(a) SJF (preemptive or nonpreemptive) can cause starvation. For example, under nonpreemptive SJF, whenever the CPU becomes free the shortest waiting job is chosen; if short jobs keep arriving before the running job finishes, a long job already in the ready queue is passed over every time and may never run.
(b) Preemptive FCFS and nonpreemptive FCFS are essentially the same. FCFS dispatches strictly in arrival order, so the running job is always the earliest arrival and a newly arriving job can never rank ahead of it; preemption therefore never triggers, and the two variants behave identically.

3. (6%) Consider a semaphore API where

semaphore S(n)

declares a semaphore named S initialized to the integer n, and functions wait(S) and signal(S) can be invoked on the semaphore.

(a) Show how to declare a semaphore L so that it works like a lock (mutex). How can acquire(L) and release(L) be implemented?
(b) Show how a semaphore can be used to ensure precedence (statement S1 of process P1 executes before statement S2 of process P2). Your pseudocode should declare a semaphore x and add the appropriate signal(x) or wait(x) call before or after statement S1 of P1 or S2 of P2.

(a)

semaphore L(1)

acquire(L) {
    wait(L);
}

release(L) {
    signal(L);
}

(b)

semaphore x(0)

------------------------------
P1               P2
--               --
S1;              wait(x);
signal(x);       S2;
------------------------------

4. (6%) Consider virtual memory support by OS.

(a) Does paging have internal fragmentation? External fragmentation? Explain.
(b) Virtual memory does not have to be implemented by paging. Describe one alternative way to implement virtual memory and give one advantage and one disadvantage compared to paging.

(a) Paging has internal fragmentation but no external fragmentation. Frames are fixed-size and a process rarely needs an exact multiple of the page size, so the last page of each region is partly wasted (internal fragmentation). Since any free frame can hold any page, free memory never becomes unusable because of its layout, so there is no external fragmentation.
(b) One alternative is segmentation. Advantage: no internal fragmentation, and protection and sharing can be applied per logical segment. Disadvantage: variable-size segments cause external fragmentation.

5. (8%) Coroutines are subroutines or tasks executing in turn. They are suitable for cooperative tasks, exceptions, event loops, and pipes. Design a coroutine in C, using the setjmp and longjmp functions to support the coroutine's context switching.

• The setjmp(jmp_buf env) function saves the calling environment in env.
• The longjmp(jmp_buf env, int val) function restores the environment saved by the most recent invocation of setjmp on the same env. The parameter val is the value that this setjmp call then appears to return, so the program continues from there.

6. (4%) Grand Central Dispatch (GCD) is a technology for Apple's Mac OS X and iOS. GCD introduced extensions to the C and C++ languages known as blocks. Explain how the GCD scheme works and the benefits of the GCD scheme.

7. (8%) Consider the OpenMP approach to process synchronization. The code fragments in Fig. 1 and Fig. 2 are used as examples. Assume “result” is a variable shared by all threads and already properly declared. Please answer the following questions.

(a) Please identify the race condition in Fig. 1 and explain why the code fragment in Fig. 2 can handle race conditions.
(b) Further optimize the OpenMP code fragment in Fig. 2 to reduce the amount of critical sections.

Fig 1: Code fragment with OpenMP

void update(int value) {
    ...
    #pragma omp parallel for shared(result)
    for (i = 0; i < 1000; i++)
        result += (2*value*value + 5) * i;
}

Fig 2: Code fragment with the OpenMP critical directive

void update(int value) {
    ...
    #pragma omp parallel for shared(result)
    for (i = 0; i < 1000; i++) {
        #pragma omp critical
        {
            result += (2*value*value + 5) * i;
        }
    }
}

(a) The statement `result += ...` is a read-modify-write consisting of three operations: 1. load result into a register, 2. add in the ALU, 3. write the result back. Because these operations are not executed atomically, two threads can interleave them, one overwriting the other's update, so increments on the shared variable result are lost — a race condition. In Fig. 2, the critical directive lets only one thread at a time execute the update, making the read-modify-write effectively atomic, so no update is lost.
(b)

void update(int value) {
    int i;
    #pragma omp parallel shared(result)
    {
        int sum = 0;                 /* private per-thread partial sum */
        #pragma omp for
        for (i = 0; i < 1000; i++)
            sum += (2*value*value + 5) * i;
        #pragma omp critical
        result += sum;               /* one critical section per thread */
    }
}

8. (5%) Explain Amdahl’s law and describe how it guides the process of system performance improvement.

9. (6%) Assume the required number of parameters and required number of FP multiplication (FP MUL) computation in a convolution are expressed as below.

$$\#parameter = kernel_{size}^2 \times channel_{in} \times channel_{out} \\ \# FP \, MUL \, computation = \#parameter \times x \times y$$

Assume that a convolution layer L has the following parameters: kernel size = 5, channel_in = 3, channel_out = 100, and output size x = 200, y = 200.

The convolution includes three operations: FP multiplication (FP MUL), FP addition (FP ADD), and FP load/store (FP L/S). This convolution has the operation type breakdown shown below: FP MUL is 80% of the operations with CPO 1, FP ADD is 15% with CPO 0.8, and FP L/S is 5% with CPO 0.75. Assume the convolution is run on a processor with a 2 GHz clock. Define CPO as the cycles per operation.

How much is the computation time of this convolution?

10. (a) (4%) What is the function for the following circuit diagram?

(b) (4%) What is the shortest operation time for the circuit if you designed it for a 64-bit operation? Assume the delay times for D flip-flops and NOT, AND, OR, and XOR gates are 180 ps, 4 ps, 120 ps, 100 ps, and 140 ps, respectively. You may ignore the flip-flop setup time.

11. (6%) Given an assembly code as below

find: move $t0, $a0
      sll  $t1, $a1, 2
      add  $t1, $t1, $a0
loop: beq  $t0, $t1, done
      lw   $t2, 4($t0)
      sw   $t2, 0($t0)
skip: addi $t0, $t0, 4
      b    loop
done: jr   $ra

(a) Write the function of the above MIPS assembly code into a pointer-based C code.
(b) What is the number of temporary registers used in the above MIPS code?

(a)

void movement(int *array, int size)
{
    int *p;

    for (p = &array[0]; p < &array[size]; p++)
        *p = *(p + 1);
}

(b) 3 ($t0, $t1, and $t2).

12. (6%) Let $x = 1.9760 \times 10^4, \, y = -4.052734375 \times 10^{-2}.$ Calculate $x \times y.$

Assume x and y are stored in a 16-bit format with sign, exponent, and mantissa fields. The leftmost bit is the sign bit, the exponent is 5 bits wide with a bias of 31, and the mantissa is 10 bits long. A hidden 1 is assumed. Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Write the answer in 16-bit floating-point format in hexadecimal representation.

13. (6%) Consider a direct-mapped cache design with a 64-bit byte-address of the following format:

(a) What is the cache size in kibibytes (i.e., KiB) for the data storage? (note: 1KiB = $2^{10} \text{ bytes}$)
(b) What is the ratio of the actual memory size (data plus tag and valid bits) to the data storage size in (a)?
(c) Given the cache size in (a) as the fixed design parameter, you want to explore different combinations of Index and Offset widths. With a larger block size, will the ratio in (b) be larger or smaller? State a brief reason.

14. (6%) Consider the target application with the following statistics for your team to design a cache:

230 data reads per 1000 instructions;
120 data writes per 1000 instructions.

With the block size of 128 bytes, the instruction cache miss rate is 0.4%, and the data cache miss rate is 2%. Note that the minimum CPI is 1 (i.e., at most one instruction is fetched per cycle). Assume that each miss generates a request for one block.

(a) The read and write bandwidth between RAM and the cache is 1 byte/cycle in the initial design. For a write-through, write-allocate cache, what is the expected CPI?
(b) Your team wants to improve the CPI to less than 1.5 by increasing the read and write bandwidth between RAM and the cache. However, the bandwidth is restricted to $2^k$ bytes/cycle, where $k$ is an integer and $k \geq 0.$ What is the smallest $k$ to achieve the goal? What is the resultant CPI?

(a) At 1 byte/cycle, transferring one 128-byte block takes 128 cycles, and each write hit writes one 4-byte word through to RAM (4 cycles). The stall terms per instruction are: instruction-fetch misses 0.004 × 128 = 0.512; data-read misses 0.23 × 0.02 × 128 ≈ 0.589; data-write misses 0.12 × 0.02 × 128 ≈ 0.307; write-through of write hits 0.12 × 0.98 × 4 ≈ 0.472. So CPI = $1+0.512+0.589+0.307+0.472=2.88$.
(b) A bandwidth of $2^k$ bytes/cycle divides every stall term by $2^k$; k = 1 still gives CPI ≈ 1.94, so the smallest k = 2, with resultant CPI = $1+(0.512+0.589+0.307+0.472)/4=1.47$.

15. (6%) For a specific computation load, the execution time of the parallelizable part is $6t^2$ and that of the sequential part is $3t$, where t denotes the problem size. Assume the parallelizable part of the workload can be ideally sped up by increasing the number of processors, while the execution time of the sequential part is not affected by the number of processors.

(a) For the problem size t = 10, what is the speedup with the 15-processor implementation (i.e., n = 15) as compared with the single processor implementation?
(b) For (a), assume that the load is not well balanced. Two of the processors equally share 20% of the parallelizable load while all the others share the rest. What is the speedup?

16. (6%) For popular multistage network topologies of multiprocessor nodes, a fully connected (crossbar) network allows any processor node to communicate with any other node in one pass through the network. For a crossbar network of eight processor nodes, 64 switches are needed. Consider a switch box with two inputs and two outputs used in the Omega network, as in the following diagram.

(a) Draw the Omega network topology of eight processor nodes (i.e., P0, P1, …, P7) with switch boxes.
(b) What are the advantages and disadvantages of the Omega network compared with the crossbar network in terms of the number of switches and the functionality?
(c) The simple analysis of these networks with the number of switches ignores important practical considerations in the construction of a network. List two practical considerations that will affect the realistic implementation due to the different distance of each link in a network.

(a)

(b) Omega network uses less hardware than the crossbar network but contention can occur between messages. For example, the Omega network in answer (a) cannot send a message from P0 to P6 at the same time that it sends a message from P1 to P4.
(c) 1. The longer a link, the more expensive it is to run it at a high clock rate. 2. Shorter wires are cheaper than longer wires.

10 Comments

1. Ben

Could you explain the solution to 15(b)? My thinking was that 20% of the parallelizable part is handled by 2 processors and the remaining 80% by 13 processors, so the speedup relative to a single processor = 630 / (120/2 + 480/13 + 30) ≈ 4.96. Is this reasoning wrong?

• Reply from the post author

mt

The correct calculation is 630 / (max(120/2, 480/13) + 30) = 630 / (60 + 30) = 7. The reason is that the 120/2 and 480/13 portions run at the same time, so they should not be added separately.

2. QooQ

Could you explain how the numerator in problem 9 is computed?

• Reply from the post author

mt

The numerator is the number of clock cycles this convolution needs. From the information in the problem, the number of FP MULs is 5 * 5 * 3 * 100 * 200 * 200 = 300000000. From the table, the total number of FP operations is 300000000 / 0.8 = 375000000, so the total clock cycle count is 375000000 * 0.8 * 1 + 375000000 * 0.15 * 0.8 + 375000000 * 0.05 * 0.75 = 359062500.

Got it, thanks!

4. Davis

Could you explain where the 0.472 in problem 14(a) comes from? I found someone online saying it is the data cache write-hit term, written as 0.12*0.98*4 — where does the *4 come from? Thanks a lot!

• Reply from the post author

mt

The *4 is because each write stores 1 word = 4 bytes of data.

Thanks for the explanation!