The free availability, high reliability, and relative efficiency of Linux have been a boon to computational scientists. Using Linux, scientists have been able to turn off-the-shelf personal computers (PCs) into effective Unix workstations suitable for a number of tasks, including number crunching for scientific models. Beowulf-style cluster computing--pioneered by Thomas Sterling, Donald Becker, et al., at NASA's Goddard Space Flight Center (see www.beowulf.org)--has extended the utility of Linux to the realm of high-performance parallel computing. Further, the Open Source nature of Linux (see www.opensource.org) has allowed programmers to add features directly to the operating system to meet the unique needs of cluster computing. A collection of these enhancements is now distributed under the name Extreme Linux (see www.extremelinux.org)--It's Hot and It's Cool--by Red Hat Software, Inc. (see www.redhat.com).
Figure 1: Forrest Hoffman (left) and Bill Hargrove sitting on the Stone SouperComputer.
We became involved in cluster computing in the summer of 1997 by leading the development of a proposal to fund the construction of a then-large Beowulf-style cluster of new PCs. The proposed cluster would support the development of parallel environmental applications and would be used for research runs by Laboratory staff. This proposal was rejected. With significant effort already expended toward the design of a new high-resolution landscape ecology application, we decided to build a cluster anyway using the resources that were readily available: surplus Intel 486 PCs destined for salvage. Commandeering a nearly-abandoned computer room and scavenging as many surplus machines as possible--from Oak Ridge National Laboratory, the Y-12 production plant, and the former K-25 site (all federal facilities in Oak Ridge, Tennessee)--we set up a "chop shop" to process machines, and by September 16, 1997, we had a functional parallel computer system built out of no-cost hardware.
Aptly named the Stone SouperComputer, after the children's fable entitled Stone Soup (see The Story of Stone Soup below), the highly heterogeneous cluster grew slowly to 126 nodes as surplus PCs became available. The nodes, arranged in back-to-back rows on the raised floor of the old computer room, contain a wide variety of motherboards, processors (of varying speed and design, primarily 66 MHz 486DX2s, some 100 MHz 486DX4s, and 16 Pentiums), controllers, disk drives, and cases (see Figure 1). Each node has at least 20 MB of memory (most have 32 MB), at least 400 MB of disk space (for booting and local file access), and is connected to a private 10 Mb/s Ethernet network for inter-node communications. The first node in the cluster is also connected to the external network through a second Ethernet card for logins and file transfers.
Like the Borg warping through the galaxy looking for species to assimilate, we are always looking for unused PC hardware which can be added to the Stone SouperComputer. When a surplus PC is collected, it undergoes a "triage" process to determine if it can be made into a node. Component parts taken from a large number of "organ donor" machines in a "morgue" are combined to build a node meeting the minimum criteria for inclusion in the collective machine. We've developed a "triage disk" containing utilities which can be used to boot prospective nodes, determine their CPU characteristics, and configure their network cards. Masking tape "toe tags" on the top of each CPU case allow for easy identification of internal components. Because nodes are "headless" (i.e., they do not have their own keyboards and monitors), a pair of "crash carts"--each containing a monitor, a keyboard, a triage disk, and Linux boot disks--are wheeled around to "sick" nodes for diagnosis or to load Linux onto new nodes (see Figure 2).
If a new machine or an existing node offers resistance--if it behaves erratically, becomes unstable, or exhibits signs of hardware failure--we cannibalize it, thereby making it an "organ donor," and retire it to the "morgue." In this world, hardware is disposable and no maintenance contracts are required. As new versions of Microsoft Windows are released, better hardware becomes available for assimilation into the cluster, since users must frequently buy new desktop PCs in order to run the latest release. Staying just behind the curve means the Stone SouperComputer will have an endless supply of upgrades.
Red Hat Linux is loaded onto new or replacement nodes over the private Ethernet network from the first node in the collective. Unlike the configuration of some Beowulf clusters in which the root filesystem is NFS-mounted from a central server, a complete operating system is loaded onto every node of the Stone SouperComputer in order to minimize unnecessary traffic on the private network and to give each node some autonomy. Swap partitions twice the real memory size are created on the local disk of each node for fastest virtual memory access. After the base operating system is loaded onto a node, an "assimilation script" is executed which loads additional components and configures the node for coordinated cluster operation. While some large disks scattered throughout the cluster are NFS-mounted onto every node, each node has its own work disk which is used by parallel applications for local storage. This requires additional work when setting up each node, but results in much better application performance. Large disks are used for long term storage of model results or for large input datasets which cannot easily be distributed or held in memory.
We have encountered very few problems in the construction and operation of the Stone SouperComputer. Red Hat Linux is extremely stable, and nodes rarely crash or drop out due to software problems. Occasionally, as expected with such a large number of machines, nodes experience hardware failures--most often a hard disk crash or a fried Ethernet card. Such nodes receive an "organ transplant," are closed back up, and are slid back into line.
Management of such a large, heterogeneous cluster poses many challenges. In order to keep the machine synchronized as users, NFS disks, and software are added or removed, a script is run by each node every hour. This script does a number of housekeeping tasks, including updating the password and group files, checking for new NFS disk resources to mount, creating directories on the local work disk for users, copying configuration files, and executing any other series of commands which are added to the script kept on the master node. Script execution on the other nodes is staggered in time so that all the nodes are not trying to get information from the master node simultaneously. While it isn't elegant, this method of keeping nodes in sync is fairly efficient and eliminates the need for multiple daemons which would have to run continuously. Cluster management tools are now being developed in many Computer Science departments to assist with these routine tasks.
System backups are performed only on home directories and NFS filesystems which contain valuable datasets. Since local work disks on nodes are used only for temporary storage, these partitions are not backed up. When these disks fail, we merely replace, reformat, and reinstall them.
The most unusual problem we've encountered is an intermittent loss of performance on certain 486 systems with Northgate or Mylex BIOS versions 6.xx on the motherboard. This problem drove us to develop a very simple benchmark, which performs many multiplication operations, just to check CPU performance on a regular basis. While most machines in the cluster run this benchmark in approximately the same amount of time on every execution, these machines "flip" between a normal user time and a much longer user time, up to 14 times slower. These unusual nodes alternate between normal and slow execution apparently at random, and never exhibit intermediate execution times. Rebooting them does not necessarily cure the sluggish behavior, and within 24 hours every affected node can be observed operating slowly at least once. Fortunately, we have never experienced this kind of problem with AMI BIOS 486 motherboards or any kind of Pentium motherboard. We no longer include these questionable motherboards in the Stone SouperComputer.
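A minimal sketch of such a multiplication benchmark, using standard Unix process timing and an assumed loop count, might look like the following (illustrative only, not the actual benchmark code):

/* multbench.c -- a minimal sketch of a multiplication benchmark of the
 * kind described above (hypothetical; not the actual benchmark code).
 * The node's user time for the loop is reported so runs can be compared
 * across nodes and across time on the same node.
 */
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

#define ITERATIONS 10000000L   /* assumed loop count; tune per cluster */

int main(void)
{
    struct tms start, end;
    long ticks = sysconf(_SC_CLK_TCK);
    double x = 1.000001;
    long i;

    times(&start);
    for (i = 0; i < ITERATIONS; i++)
        x *= 1.000001;          /* many multiplications, as in the text */
    times(&end);

    /* Print x so the compiler cannot optimize the loop away. */
    printf("result %g  user time %.2f s\n", x,
           (double)(end.tms_utime - start.tms_utime) / ticks);
    return 0;
}

Run regularly on every node, a fixed loop like this makes the "flipping" behavior obvious: the reported user time on an affected node jumps between its normal value and one many times larger.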
By tracking system performance in compiling and running the benchmark, we can identify realized CPU speeds and, to a lesser extent, input/output speeds. In this heterogeneous environment, these speeds fall into a few recognizable tiers representing the various CPU types (66 MHz 486s, 100 MHz 486s, 90 MHz Pentiums, 120 MHz Pentiums, 166 MHz Pentiums, etc.). Small differences within these tiers reflect configuration differences or minor hardware differences. The benchmark results are often useful in identifying poorly configured machines, and are also used to generate a list of nodes sorted by approximate performance from fastest to slowest. Parallel applications use this "machines" list to choose which nodes will be used for model runs. Since the list is sorted by speed, if 10 nodes are desired for a particular run, the top ten performers are automatically selected and used.
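As a sketch of how such a list might be consumed, the following hypothetical helper reads a speed-sorted machines file (one hostname per line, fastest first, which is an assumed format) and prints the top N entries for building a host list for a particular run:

/* pickhosts.c -- hypothetical helper that selects the top N nodes from
 * a speed-sorted "machines" file (one hostname per line, fastest first).
 * The file name and format are assumptions based on the description above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int n = (argc > 1) ? atoi(argv[1]) : 10;   /* how many nodes to use */
    FILE *fp = fopen("machines", "r");
    char line[256];

    if (fp == NULL) {
        perror("machines");
        return 1;
    }
    while (n-- > 0 && fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';      /* strip trailing newline */
        printf("%s\n", line);                  /* fastest nodes first */
    }
    fclose(fp);
    return 0;
}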
The Stone SouperComputer uses the GNU C (gcc), C++, and FORTRAN (g77) compilers as well as a commercial Pro Fortran compiler provided by Absoft Corporation (see www.absoft.com). For inter-node communication within parallel applications, we use either PVM (Parallel Virtual Machine, see www.epm.ornl.gov/pvm) or MPI (Message Passing Interface, see www.mcs.anl.gov/mpi). These freely available application programming interfaces (APIs) allow the programmer to define the set of cluster nodes to be used for an application and to make simple function calls in order to pass data between the nodes during computation.
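To illustrate the style of programming these APIs support, the following minimal MPI program (a generic sketch, not code from our applications) has each node determine its rank and the size of the run, then passes a single integer between the first two nodes:

/* ping.c -- a minimal MPI example.  Each node learns its rank and the
 * total node count; node 0 sends an integer to node 1, which sends it
 * back.  Generic illustration only.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many nodes in this run? */

    if (size > 1) {
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("node 0 got %d back from node 1\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}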
Because ecological applications tend to be statistical in nature, they are particularly amenable to parallelization. This means that these kinds of problems can often be decomposed into smaller pieces which are solved independently. As a result, we frequently write parallel applications in a master/slave organization in which a single node (the master) distributes the work to many slave nodes. The analogy of a card game is useful here. The master node acts as a card dealer and the card deck represents the entire problem which is divisible into smaller parcels of work--the cards. This kind of organization is particularly good for a heterogeneous cluster of nodes if the work is split into many more parcels than there are players (slave nodes), because as players finish processing their cards, they return to the dealer for another "hit." This results in dynamic load balancing because the faster nodes end up doing more work than the slower nodes, and the problem finishes much sooner than it would if the work were evenly distributed to all nodes. Moreover, the heterogeneity works in our favor because no two machines will finish at exactly the same time. This reduces network contention for communications with the master.
We generally use the Single Program Multiple Data (SPMD) method for writing parallel codes. This means that a single program is run on all nodes, but the program branches to different instructions depending on whether it is being run by a master or a slave. SPMD codes are usually easier to write, run, and debug than codes which use separate programs for master and slave operations. We prefer writing parallel codes in C using MPI for message passing. Codes developed on the Stone SouperComputer are directly portable to even the largest and most expensive parallel computers because C and MPI are available on all such systems.
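The card-game organization can be sketched in a few dozen lines of SPMD code. The following MPI program is illustrative only (the work routine is a placeholder, and the message tags and parcel count are arbitrary assumptions), but it shows a single program branching into dealer and player roles and handing out parcels on demand:

/* deal.c -- an SPMD master/slave sketch with dynamic load balancing.
 * Every node runs this same program; rank 0 branches into the "dealer"
 * role and hands out parcels of work on demand, so faster nodes simply
 * come back for more "cards."  The work routine and parcel contents are
 * placeholders, not our actual model code.
 */
#include <stdio.h>
#include <mpi.h>

#define NPARCELS  100          /* many more parcels than slave nodes */
#define TAG_WORK    1
#define TAG_STOP    2

/* Stand-in for a real computation on one parcel of work. */
static double do_work(int parcel)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < 100000 * (parcel % 7 + 1); i++)
        sum += (double)i * 1.0e-6;
    return sum;
}

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                            /* master: the dealer */
        int next = 0, active = 0, slave;
        double result;
        MPI_Status status;

        /* Deal one parcel to each slave to get things started. */
        for (slave = 1; slave < size; slave++) {
            if (next < NPARCELS) {
                MPI_Send(&next, 1, MPI_INT, slave, TAG_WORK, MPI_COMM_WORLD);
                next++;
                active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, slave, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* As each result comes back, deal another parcel to that slave. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            slave = status.MPI_SOURCE;
            printf("parcel result %g from node %d\n", result, slave);
            if (next < NPARCELS) {
                MPI_Send(&next, 1, MPI_INT, slave, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, slave, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                    /* slave: play the cards */
        int parcel;
        double result;
        MPI_Status status;

        while (1) {
            MPI_Recv(&parcel, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP)
                break;
            result = do_work(parcel);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Launched across the first N entries of the speed-sorted machines list (with an MPICH-style command such as mpirun -np 10 -machinefile machines ./deal; the exact flags depend on the MPI implementation), the faster nodes simply return to the dealer more often, which is the dynamic load balancing described above.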
Since its creation, the Stone SouperComputer has been successfully used for applications as diverse as temperature and humidity interpolation, multivariate geographic clustering, individual-based fish simulation modeling, finite-element groundwater simulation, and continental vegetation modeling. A few of these applications are not inherently parallel, but because they must be run on multiple datasets or multiple times with varying parameters, they benefit from the availability of many nodes on this parallel system. Although denigrated by some, this mode of operation requires virtually no additional coding time and enjoys nearly perfect linear speedup.
While not offering the performance of newer Beowulf clusters, this machine is an excellent platform for developing parallel models which will port directly to other systems and for solving easily parallelizable problems like multivariate geographic clustering (see www.esd.ornl.gov/~hnw/esri98). The Stone SouperComputer has proven to be a fast, cheap, and robust scalable parallel machine which is dedicated to our ecological applications and controlled by our own priorities. Moreover, it has inspired a number of small universities and colleges to build their own clusters using existing equipment. Working with a highly heterogeneous machine where node memory is tight forces one to produce clean, efficient, load-balancing, fault-tolerant parallel code. The advantages of being forced into good coding practices are myriad.
We continue upgrading the Stone SouperComputer as newer surplus hardware becomes available, and expect to add new hardware in the near future. In addition, new environmental applications are being developed for the system. The latest description and photos of the machine as well as more information about its applications are available at www.esd.ornl.gov/facilities/beowulf.
Commodity cluster computing is moving into the mainstream. Today hardware vendors--including VA Research (see www.varesearch.com) and others--are offering a range of commodity computer systems sporting the Linux operating system, and many are even selling fully configured parallel cluster systems. These offerings, in combination with the availability of easy-to-use tools and the increasing popularity of the Linux operating system, mean that grass-roots cluster computing is now accessible to industry as well as academia.
__________
*Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the U.S. Department of Energy under contract number DE-AC05-96OR22464.
"The submitted manuscript has been authored by a contractor of the U.S. Government under contract No. DE-AC05-96OR22464. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes."