Loaded onto the system are two parallel programming environments, PVM and MPI. These environments perform similar functions. We recommend the use of the MPI system, as it is based on industry standards and has a cleaner API.
The hardware of the system comprises a number of compute nodes: one is termed the host node (or server) and the remainder are slave nodes. Both the server and the slave nodes are dual-processor Pentium II machines, interconnected via 100 Mbit/s Ethernet links.
The user is presented with an X-Windows GUI upon login, using the fvwm-95 window manager. If required, the user can use the KDE desktop environment instead: log in, run the command "/opt/kde/bin/usekde" and then log in again. Remote users can either log in via an rlogin or telnet session, or via an X-Windows login from an X-Terminal, Unix workstation or Microsoft PC running an X-Windows emulator. To create an X-Windows session from a remote Linux box, the user could type the following command: X :1 -query <host node name>
The programming model used is MIMD (Multiple Instruction, Multiple Data), with a large number of communicating sequential tasks. These tasks are distributed over the processors in the machine and intercommunicate via the Ethernet connections and shared memory. The programmer must therefore divide the work amongst these intercommunicating tasks.
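A minimal MPI program in 'C' illustrates this model (a generic sketch, not BEAM-supplied code): every task runs the same executable, discovers its own identity at run time and can then follow its own path through the code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, ntasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's identity */
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks); /* total number of tasks */

        printf("Task %d of %d running\n", rank, ntasks);

        MPI_Finalize();
        return 0;
    }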
A conventional programming model is to have a master task and a large number of identical slave tasks. The master task provides a user interface and partitions the work for the slave tasks: it sends work packets with data to each slave and collates the results. Each slave receives work packets, performs the appropriate calculations and returns the result to the master task.
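The sketch below shows this pattern in MPI. It is illustrative only: a single double stands in for a work packet, and squaring stands in for a real calculation.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, ntasks, i;
        double packet, result;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        if (rank == 0) {                    /* master task */
            for (i = 1; i < ntasks; i++) {  /* partition work amongst slaves */
                packet = (double)i;
                MPI_Send(&packet, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            }
            for (i = 1; i < ntasks; i++) {  /* collate the results */
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                printf("Result from task %d: %f\n",
                       status.MPI_SOURCE, result);
            }
        } else {                            /* identical slave tasks */
            MPI_Recv(&packet, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            result = packet * packet;       /* perform the calculation */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }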
From the programmer's perspective each task is a separate program; these programs communicate via the subroutines provided in the PVM or MPI libraries. The diagram below illustrates a conventional program that repeatedly calls a procedure to perform a calculation, and a parallel version that generates the same results.
Any programming language can be used to create the tasks; however, 'C', 'C++' and Fortran are the favourites and are thus better supported.
Host node directories mounted on nodes n1-n*:

| Host Directory | Node Directory |
Each node has a large disk drive fitted, which is used for virtual memory swapping and which, in certain circumstances, can be used to store data locally.
In both PVM and MPI it is necessary to boot the virtual parallel processing machine before performing any work. The parallel virtual machine consists of the nodes in the Beowulf cluster, and a management daemon is started on each node. By default, all of the cluster's nodes are listed in the file /usr/beowulf/nodes.
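Assuming the usual one-node-name-per-line layout (the hostnames below are invented for illustration), /usr/beowulf/nodes might look like:

    beam
    n1
    n2
    n3
    n4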
Once the parallel virtual machine has been booted, users can run parallel tasks over the nodes in the system. There are two ways of accomplishing this:
In this case the master task spawns the slave tasks as required. The MPI and PVM systems both work in this manner; however, the details of operation differ. The source code for many examples is located below the directory
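Under PVM, for instance, the master starts its slaves with the pvm_spawn() library call. The fragment below is a hedged sketch; the executable name "slave" and the slave count are assumptions.

    #include <pvm3.h>
    #include <stdio.h>

    #define NSLAVES 8   /* assumed number of slave tasks to start */

    int main(void)
    {
        int tids[NSLAVES];  /* task identifiers of the spawned slaves */
        int started;

        /* Ask PVM to start NSLAVES copies of the hypothetical "slave"
           executable anywhere on the virtual machine. */
        started = pvm_spawn("slave", (char **)0, PvmTaskDefault,
                            "", NSLAVES, tids);
        printf("Started %d of %d slave tasks\n", started, NSLAVES);

        pvm_exit();         /* detach from the virtual machine */
        return 0;
    }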
This area is obviously the same as for conventional single-processor programming. All of the standard compiler optimisations and coding optimisations should be used. Normally the program runs a very simple algorithm over and over again, so spending time optimising this algorithm can produce large performance gains. Modern pipelined processors are generally best programmed using optimising compilers; however, in certain circumstances the programmer could use assembler code to speed up some operations. For example, use could be made of the Pentium MMX extensions.
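For example, with the GNU compilers a build would normally enable optimisation; the flags below are common choices rather than BEAM-mandated settings:

    gcc -O2 -funroll-loops -o mytask mytask.c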
In a parallel computer there is a performance trade-off between computation time and communication time. For any given application these need to be balanced for the machine the application is running on. For example, on a machine with slow communications it is better to have tasks performing more work per communication than tasks doing little work per communication. The aim is to make sure all processors are kept busy performing the algorithm rather than sitting idle awaiting communications, as the sketch below illustrates.
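The fragment below (a hedged sketch, not BEAM-supplied code) batches GRAIN work units into one MPI message, so the fixed message latency is paid once per GRAIN units of computation. GRAIN is an assumed tuning parameter: larger values mean fewer, larger communications.

    #include <mpi.h>
    #include <stdio.h>

    #define GRAIN 1024              /* assumed: work units per message */

    int main(int argc, char **argv)
    {
        int rank, i;
        double work[GRAIN], result[GRAIN];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* master: run with at least 2 tasks */
            for (i = 0; i < GRAIN; i++)
                work[i] = (double)i;
            /* One send carries GRAIN units instead of one. */
            MPI_Send(work, GRAIN, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(result, GRAIN, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     &status);
        } else if (rank == 1) {     /* slave */
            MPI_Recv(work, GRAIN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     &status);
            for (i = 0; i < GRAIN; i++)
                result[i] = work[i] * work[i];  /* stand-in calculation */
            MPI_Send(result, GRAIN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }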
The number of tasks to run depends on the number of processors in the system, the type of processing performed and the latency in the system. In general, for an application with identical slave tasks it is best to have two tasks running per processor (on BEAM Beowulf systems there are two processors per node, so 4 tasks should be run per node). The reason for running two tasks per processor is to avoid latency delays leaving processors idle: if one task were run per processor, it would send a result packet to the master task and then wait for a new work packet, and during this time the CPU would sit idle. If two tasks are running per CPU, the second task can run during this idle time.

The BEAM bmon application can be used to aid in performance tuning. It shows the usage of each CPU in the system and the network usage to each node. PVM and MPI also have trace applications (xpvm and xmpi) which can perform task communication tracing.
If too many tasks are run per CPU, the CPU will waste a lot of time switching between tasks.
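Putting these figures together for a hypothetical four-node BEAM system (two processors per node and two tasks per processor gives 16 tasks), an MPICH job might be started as follows; the program name and the use of -machinefile are assumptions about the local installation:

    mpirun -np 16 -machinefile /usr/beowulf/nodes ./mytask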