Pilot Testing of Beowulf Cluster - a PC cluster for Supporting Parallel Computing

With the continuous reduction in cost and increase in processing power of PCs, there has been a notable development in many institutions in the world to set up large clusters of PCs running the Linux operating system to complement their installations of the traditional packaged supercomputers. These systems are generally called Beowulf Clusters because this design was originally conceived by the Beowulf project at NASA Goddard Space Flight Center in the U.S.

The Computer Centre is now working on a pilot implementation of the Beowulf cluster of Linux systems to enable our users to test out the usefulness of this kind of computer system.  We would expect that Beowulf clusters could be an alternative solution to augment our existing high performance computing facility based on an IBM SP2 parallel computer. The cluster being built comprises of eight dual Pentium 800MHz CPU nodes (16 CPUs altogether), each of which is configured with 1GB of RAM and 30GB of local disk. These processor nodes will be interconnected by a dedicated high-speed Gigabit Ethernet switch for inter-processor communications. The system software used for setting up this cluster are the RedHat Linux 6.2 and Portland Group's Cluster Development Kit Software  (http://www.pgroup.com/prodcdk.htm).

We are now in the process of assembling the hardware and software that have been delivered.  The various software are being installed and integration testing is being made.  More progress of this development will be reported in the next issue of Computer News.

For readers interested in reviewing the basic concepts of parallel computing, we have attached below an article on Parallel Computing Concepts which is borrowed from the Paralogic Inc., Bethlehem, PA, USA.

If you require more information about the Beowulf cluster being implemented in the Computer Centre, please contact the undersigned.

Kwan Wing Keung
Tel: 2857-8631
E-mail: kwk@cc.hku.hk


[Contents] [Next] [Previous]
 

Parallel Computing Concepts

Due to the use of multi-tasking Operating Systems, it is possible to do several things at once. This is a natural "parallelism" that is easily exploited by more than one low cost CPU.  Depending on the application, parallel computing can speed things up by any where from 2 to 500 times faster (in some cases even faster). Such performance is not available using a single processor. Let's look at an example of a "parallel computing problem" which we are familiar with while waiting in long lines at a store.

The Parallel Computing Store

Consider a big store with 8 cash registers grouped together in the front of the store.  Assume each cash register/cashier is a CPU and each customer is a computer program. The size of the computer program (amount of work) is the size of each customer's order. The following analogies can be used to illustrate parallel computing concepts.

1. Single-tasking Operating System:

One cash register open (is in use) and must process each customer one at a time. Example: MS DOS

2. Multi-tasking Operating System:

One cash register open, but now we process only a part of each order at a time, move to the next person and process some of their order. Everyone "seems" to be moving through the line together, but if no one else is in the line, you will get through the line faster.  Example: UNIX, NT using a single CPU

3. Multitasking Operating Systems with Multiple CPUs:

Now we open several cash registers in the store. Each order can be processed by a separate cash register and the line can move much faster. This is called SMP - Symmetric Multi-processing. Although there are extra cash registers open, you will still never get through the line any faster than just you and a single cash register.  Example: UNIX and NT with multiple CPUs

4. Threads on a Multitasking Operating Systems extra CPUs:

If you "break-up" the items in your order, you might be able to move through the line faster by using several cash registers at one time. First, we must assume you have a large amount of goods, because the time you invest "breaking up your order" must be regained by using multiple cash registers. In theory, you should be able to move through the line "n" times faster than before; where "n" is the number of cash registers. When the cashiers need to get sub- totals, they can exchange information quickly by looking and talking to all the other "local" cash registers. They can even snoop around the other cash registers to find information they need to work faster. There is a limit, however, as to how many cash registers the store can effectively locate in any one place.  Example: UNIX or NT with extra CPU on the same motherboard running multithreaded programs.

5. Sending Messages on Multitasking Operating Systems with extra CPUs:

In order to improve performance, the store adds 8 cash registers at the back of the store. Because the new cash registers are far away from the front cash registers, the cashiers must call on the phone to send their sub-totals to the front of the store. This distance adds extra overhead (time) to communication between cashiers, but if communication is minimized, it is not a problem. If you have a really big order, one that requires all the cash registers, then as before your speed can be improved by using all cash registers at the same time, the extra overhead must be considered. In some cases, the store may have single cash registers (or islands of cash registers) located all over the store - each cash register (or island) must communicate by phone. Since all the cashiers working the cash registers can talk to each other by phone, it does not matter too much where they are.  Example: One or several copies of UNIX or NT with extra CPUs on the same or different motherboard communicating through messages.

The above scenarios, although not exact, are a good representation of constraints placed on parallel systems. Unlike a single CPU (or cash register) communication is an issue.

The common methods and architectures of parallel computing are presented below. While this description is by no means exhaustive, it is enough to understand the basic issues involved with Beowulf design.

Hardware Architectures

There are basically two ways parallel computer hardware is put together:

I) Distributed memory machines that communicate by messages (Beowulf Clusters)
    example at HKU: the Linux cluster and SP2 supercomputer
II) Shared memory machines that communicate through shared memory (SMP machines)
    example at HKU: the E10000 system with HKUSUA, HKUSUB etc.)
A typical clustered machine is a collection of single CPU machines connected using fast Ethernet and is, therefore, a distributed memory machine.

A 4 way SMP box is a shared memory machine and can be used for parallel computing - parallel applications communicate using shared memory. Just as in the computer store analogy, local memory machines (individual cash registers) can be scaled up to large numbers of CPUs, while the number of CPUs shared memory machines (the number of cash registers you can place in one spot) can have is limited due to memory contention.

It is possible, however, to connect many shared memory machines to create a "hybrid" shared memory machine. These hybrid machines "look" like a single large SMP machine to the user and are often called NUMA (non uniform memory access) machines because the global memory seen by the programmer and shared by all the CPUs can have different latencies. At some level, however, a NUMA machine must "pass messages" between local shared memory pools.

It is also possible to connect SMP machines as distributed memory compute nodes. Typical CLASS I motherboards have either 2 or 4 CPUs and are often used as a means to reduce the overall system cost. The Linux internal scheduler determines how these CPUs get shared. The user cannot (at this point) assign a specific task to a specific SMP processor.  The user can however, start two independent processes or threaded processes and expect to see a performance increase over a single CPU system.

Software API Architectures

There are basically two ways to "express" concurrency in a program:

    1. Using Messages sent between processors
    2. Using operating system Threads

Other methods do exist, but these are the two most widely used. It is important to remember that the expression of concurrency is not necessarily controlled by the underlying hardware. Both Messages and Threads can be implemented on SMP, NUMA-SMP, and clusters, with efficiency and portability being important issues.

Copyright (c) 1999 by Paralogic Inc., Bethlehem, PA, USA, All Rights Reserved.



[Contents] [Next] [Previous]