2 An Example: Matrix Multiplication
The most commonly used program as shown in Figure 2 in parallel processing has a uniform workload. Since matrix a rows and matrix b columns are referenced constantly, and the elements of both matrices are not modified, a local cache if available will be very useful to the system. This algorithm requires n3 multiplications and n3 additions, leading to a sequential time complexity of O(n3).
for (i = 0; i < N; i ++ ) /* can be parallel zed */
for (j = 0; j < M; j++ ){ /* can be parallel zed */
c[i] [j] = 0;
for (k = 0; k < P; k++)
c[i] [j] = c[i] [j] + a[i] [k] * b [k] [j];
}
Figure 2: A kernel of Matrix Multiplication.
The serial program computes a result matrix result by multiplying two input matrices, a and b Matrix a consists of N rows by P columns and matrix b contains P rows by M columns. These sizes yield a result matrix c of N rows by M columns. The serial version of the program was quite straightforward.
Let ' s consider what we need to change in order to use PVM. The first activity is to partition the problem so each slave can work on its own assignment in parallel. For matrix multiplication, the smallest sensible unit of work is the computation of one element in the result matrix. It is possible to divide the work in to even smaller chunks, but any finer division would not be useful. For example, the umber of processor is not enough to process, i.e., n2 processors are needed.
The matrix multiplication algorithm is implemented in PVM using the master-slave paradigm. The master task is called master_mm_pvm, and the slave tasks are called slave_mm_pvm. The master reads in the input data, which includes the umber of slaves to be spawned, nTasks. After registering with PVM and receiving a taskid or tid it spawns nTasks instances of the slave program slave_mm_pvm and the distributes the input graph information to each of them. As a result of the spaw function, the master obtains the tids of each of the slaves.Since each slave needs to work on a distinct subset of the set of matrix elements, they need to be assigned instance IDs in the range (0,...,nTask -1). The tids assigned to them by the PVM library do not lie in this range, so the master needs to assigned the instance Ids to the slaves a d se d that information along with the input matrix. The slaves also need to know the total number of slaves in the program, and this information is passed o to them by the master process as an argument to the spaw function since, unlike the instance IDs, this umber is the same for all nTasks slaves.
To send the input data and instance ID information, the master process packs these into the active send buffer, and the invokes the send function. It the waits to receive partial results from each of the slaves. The slaves register with the PVM environment, and the wait for input data from the master, using a wildcard in the receive function to receive a message from any source. Once a message is received, each slave determines the master's tid from the received message buffer properties. Alternatively, the slaves could have determined the master ' s tid by calling the pvm parent() function, which they could have used as the source in their receive function. On receiving the message from the master that contains the input matrix, a slave unpacks this data from the active receive buffer. Each slave the works on its input partition, and send its partial results to the master when it is done. The the master collects these partial results into a output matrix and outputs the results.
I the slave program, we keep the basic structure of the sequential program intact. But now the routine to multiply the two matrices, the main program of slave mm_pvm does not do the actual work itself, only performs the loop partition for each individual portion. In stead, the slave program calls a function matrix multiple to perform real matrix multiplication. The individual slaves each then perform a portion of the matrix multiplication as shown in Figure 3.

Figure 3:The block version of partitioning.
3 Experimental Result
Our SMP cluster is show in Figure 4 that consists of nine PC-based multiprocessors connected by a switched Hub with Fast Ethernet interface. There are one server node a d eight computing nodes. The server node has two Intel Pentium-!!! 667MHz processors and 256MBytes of shared local memory. Each computing node has two Celeron processors and 196MBytes of shared local memory. Each Pentium-!!! has 32K on-chip instruction and data caches (L1 cache), a 256K on-chip four-way second-level cache with full speed of CPU. Each Celeron also has 32K on-chip instruction a d data caches (L1 cache), a 128K on-chip four-way second-level cache with full speed of CPU. The individual processors are rated at 495MHz, and the system bus has a clock rate of 110 MHz.

Figure 4: The snapshot of SMP-based cluster.