BEAM BEOWULF

Tutorials

Introduction

A new system can be somewhat daunting. In order to gain a degree of confidence in the system we suggest that you have a play with a few of our basic tutorials. These tutorials are not intended to explain complex programming issues, but rather to provide some confidence in building and running applications.

Loaded onto the system are two parallel programming environments, PVM and MPI. These parallel programming environments perform similar functions. We tend to recommend the use of the MPI system as it is based on an industry standard and has a cleaner API.

The hardware of the system comprises a number of compute nodes: one termed the host node or server, and the remainder slave nodes. Both server and nodes are dual-processor Pentium II machines interconnected via 100 Mbit/s Ethernet links.

Login and the user environment

User interaction with the cluster is usually made via a keyboard, monitor and mouse connected directly to the host server node n0, or by remote login from a separate workstation. For diagnostic purposes the server keyboard and monitor can be switched directly onto the nodes. Day-to-day administration or use of the nodes can be performed via remote login or by use of the remote shell command rsh. Each node is deemed to be equivalent for the purpose of password authentication.




The user is presented with an X-Windows GUI upon login, using the fvwm-95 window manager. If required the user can use the KDE desktop environment instead: log in, run the command "/opt/kde/bin/usekde" and then log in again. Remote users can either log in via an rlogin or telnet session, or via an X-Windows login from an X-Terminal, Unix workstation or Microsoft PC using an X-Windows emulator. To create an X-Windows session from a remote Linux box the user could type the following command: X :1 -query <host node name>

Developing parallel programs

Programming parallel applications using either MPI or PVM requires the developer to partition the job: typically this means identifying the compute-intensive aspects of the task, the data required, and some method of splitting this data and the calculations. As they say, "therein lies the art". With respect to performance there is a trade-off between communication time and processing time. If the compute time is low yet the data required is large, then little if any performance gain may be realised. The algorithms best served are those that require a large compute time and little data.

The programming model used is MIMD (Multiple Instruction, Multiple Data), with a large number of communicating sequential tasks. These tasks are distributed over the processors in the machine and intercommunicate via the Ethernet connections and shared memory. This requires the programming work to be divided amongst these intercommunicating tasks.

A conventional programming model is to have a master task and a large number of identical slave tasks. The master task provides a user interface and partitions the work for the slave tasks; it sends work packets with data to each slave and collates the results. Each slave receives work packets, performs the appropriate calculations and returns the results to the master task.

From the programmer's perspective each task is a separate program; these programs communicate via the subroutines provided in the PVM or MPI libraries. The diagram below illustrates a conventional program that repeatedly calls a procedure to perform a calculation, and a parallel version that generates the same results.
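For illustration only, the sketch below shows the shape of such a master/slave pair written as a single MPI program in 'C'. The trivial calculation, the message tags and the packet layout are invented for the example, and error handling is omitted.

    #include <stdio.h>
    #include "mpi.h"

    #define WORK_TAG   1
    #define RESULT_TAG 2

    int main(int argc, char *argv[])
    {
        int rank, size, i;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                          /* master task */
            int work, result, total = 0;
            for (i = 1; i < size; i++) {          /* one work packet per slave */
                work = i;
                MPI_Send(&work, 1, MPI_INT, i, WORK_TAG, MPI_COMM_WORLD);
            }
            for (i = 1; i < size; i++) {          /* collate the results */
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG,
                         MPI_COMM_WORLD, &status);
                total += result;
            }
            printf("total = %d\n", total);
        } else {                                  /* slave task */
            int work, result;
            MPI_Recv(&work, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD, &status);
            result = work * work;                 /* the "calculation" */
            MPI_Send(&result, 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }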


Any programming language can be used to create the tasks; however, 'C', 'C++' and Fortran are the favourites and are thus better supported.

Starting a parallel application

Most developers will be familiar with the invocation of a single application on a single machine. Most will also be familiar with the concept of file servers, where a program file stored remotely can be executed on the user's local machine. In the case of a parallel application on the beowulf system the starting of the tasks on the remote machines is performed automatically, the binary task images being made available by the simple expedient of NFS-mounting the server's development directories onto identically named directories on the nodes.

Host node directories mounted on nodes n1-n*

Host Directory    Node Directory
n0:/usr           /usr
n0:/home          /home
n0:/data          /data
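On each node these mounts appear as ordinary NFS mounts; an /etc/fstab entry for one of them would take the usual NFS form, for example (shown for illustration only):

    n0:/home    /home    nfs    defaults    0 0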

Each node has a large disk drive fitted, which is used for virtual memory swapping and which in certain circumstances could be used to store data locally.

In both PVM and MPI it is necessary to boot the parallel virtual machine before performing any work. The parallel virtual machine consists of the nodes in the beowulf cluster, and a management daemon is started on each node. By default all of the cluster's nodes are listed in the file /usr/beowulf/nodes.
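For example, on a system with a server and three slave nodes the file might simply contain one node name per line (the names below are illustrative):

    n0
    n1
    n2
    n3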

Once the parallel virtual machine has been booted users can run parallel tasks over the nodes in the system. There are two ways of accomplishing this:

A configuration program is used

In this case an application schema file is created which lists the tasks to be run and, possibly, the nodes on which to run them. The application is then run using a "run configuration" program; in MPI this is called mpirun.
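For example, a run of four copies of a program called mandel (an invented name) might be started with a command of the form:

    mpirun -np 4 mandel

The exact options accepted, and the format of the application schema file, depend on the MPI implementation installed; see the MPI tutorial for details.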

Master Task spawns slave tasks

In this case the master task spawns the slave tasks as required.
Both the MPI and PVM systems can work in this manner, although the details of operation differ. The source code for many examples is located below the directory
/usr/beowulf/examples
Please see the tutorials on each system for more information.
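As a simple illustration, under PVM a master can start its slaves with pvm_spawn(); in the minimal sketch below the slave binary name "slave" and the task count are invented for the example.

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int tids[4];
        int started;

        /* ask the PVM daemons to start four copies of the slave binary */
        started = pvm_spawn("slave", (char **)0, PvmTaskDefault, "", 4, tids);
        printf("started %d slave tasks\n", started);

        /* ... send work packets to the slaves and collect their results ... */

        pvm_exit();
        return 0;
    }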

Performance of Parallel Applications

The performance of a parallel application is obviously very important, as generally the reason for using parallel processing is to do something faster than a single processor could do it. There are a few things to consider in getting the maximum performance out of a beowulf cluster; the main areas are:

Compiler Optimizations

This area is much the same as for conventional single-processor programming. All of the standard compiler and coding optimisations should be used. Normally the program runs a very simple algorithm over and over again, so spending time optimising this algorithm can produce large performance gains. Modern pipelined processors are generally best programmed using optimising compilers; however, in certain circumstances the programmer could use assembler code to speed up some operations, for example by making use of the Pentium MMX extensions.

Compute Time vs Communication Time

In a parallel computer there is a performance trade-off between compute time and communication time. For any given application these need to be balanced for the machine the application is running on. For example, on a machine with slow communications it is better to have tasks performing more work per communication than tasks doing little work per communication. The aim is to make sure that all processors are kept busy performing the algorithm and are not sitting around awaiting communications.
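As a rough illustration (the figures are invented): if a slave spends 10 ms computing each work packet and only 1 ms communicating it, the processors can be kept busy for at most 10/11, or roughly 90%, of the time. If the ratio is reversed the ceiling drops to under 10%, and adding further nodes gains almost nothing.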

Number of Tasks and Latency

The number of tasks to run depends on the number of processors in the system, the type of processing performed and the latency in the system. In general, for an application with identical slave tasks it is best to have two tasks running per processor (on BEAM beowulf systems there are two processors per node, so four tasks should be run per node). The reason for running two tasks per processor is to avoid latency delays leaving processors idle. If one task were run per processor it would send a result packet to the master task and then wait for a new work packet; during this time the CPU would sit idle. If two tasks were running per CPU then the second task could run during this idle time.
If too many tasks are run per CPU then the CPU will waste a lot of time switching between tasks.
The BEAM bmon application can be used to aid performance tuning. This application shows the usage of each CPU in the system and the network usage to each node. PVM and MPI also have trace applications (xpvm and xmpi) which can trace task communications.

Debugging Parallel Applications

This can be difficult! It depends to a large extent on the nature of the application. If the application consists of a master task and a number of identical worker tasks, debugging is fairly easy: you can test and debug the master task and the slave tasks separately. Normal Unix debuggers can be used; we would recommend the ddd debugger. Debugging the inter-task communications can be more of a problem, with the possibility of deadlocks and other parallel programming problems.