Logo GISdevelopment.net

GISdevelopment > Proceedings > ACRS > 2002


1989 | 1990 | 1991 | 1992 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2002
Sessions

GIS, GPS & Data Integration

Land Use Land Cover

Hazard Mitigation and Disaster Management

Photogrammetry

Forestry

Earth Observation from Space

Mountain Environment and Mapping

Data processing, Algorithm and Modelling

Urban Mapping

Hyperspectral Data Acquisition and Systems

AIT: Digital Asia

SAR / InSAR

Very High Resolution Mapping

Soil and Agriculture

Water Resources

Geology / Geomorphology

Education

Ecology, Environment & Carbon Cycle

Infrastructure Planning and Management

Oceanography and Coastal Zone Monitoring

Poster Sessions

Poster 1

Poster 2

Poster 3



ACRS 2002


Data Processing, Algorithm and Modelling


PC clusters as a platform for image processing: promises and pitfalls


Basic Architecture
The notion of obtaining high-powered computing capabilities based on inexpensive PC's congealed somewhere during the late 1990's, in various U.S. universities and government labs, into the Beowulf architecture (Sterling et al., 1999). Since that time, use of this architecture has spread around the world, so that today several of the world's fastest "supercomputers" are actually large Beowulf clusters. In fact, there are other possible architectures for a PC cluster; Beowulf just happens to be the easiest and least expensive.


Figure 1: Basic Cluster Architecture

The Hardware
The basic design objective of a Beowulf is that it should use only mass produced, commonly available hardware and software. Thus the hardware, as shown in the illustration, is usually standard PC's connected by high-speed ethernet. (Of course, the cheapest computers are those you already own, so if you happen to have a bunch of Sparc's or Macintosh Unix machines, those would be fine.) Traditionally high-speed ethernet switches have been fairly expensive, but prices are coming down rapidly and, in any case, you only need one of them.

The PC's, or nodes, do not have to be all the same, either in hardware or operating system, although system administration can get to be a headache if they're very different. Generally, the software which needs to be executed is installed in advance on each node to avoid the delays of loading across the network.

One of the nodes is special, as shown, in that it connects both to the cluster and to the "outside world". This double-headed node acts as a firewall: it filters and controls all data arriving from the outside. This allows the other nodes to run with all security restrictions turned off. The double-headed node also frequently handles administration, status monitoring, and other chores.

We should emphasize that, despite the term "high speed ethernet", the use of ethernet for communication is actually the slowest part of the system. Its major virtue is that it is much less expensive (both in money and effort) than the next best choice. However, this fact imposes some constraints on what types of problems are suitable for Beowulf processing. The good news is that there are still many suitable computational tasks in remote sensing and GIS.

The Operating System
There are a number of possible operating systems available, but Linux is almost always the final choice. There are several reasons for this:
  • Linux is free. This is a strong argument, although Linux is not the only free OS.
  • Linux runs on most processor types that you might want to use, including Intel, Alpha, Sparc, and PowerPC. In fact, Linux was the first operating system available for the Itanium IA-64.
  • There is a variety of special software and knowledge available for Linux from the various Beowulf projects, particularly in the area of optimized network drivers.
  • Linux is well known. Since Linux is essentially Unix, many people have experience working with it. Of course, especially in an organization which is not particularly computer-science-oriented, Unix expertise is not as common as Windows expertise. Nevertheless, the other considerations listed usually rule out a Windows -based approach.
  • Finally, as discussed below in more detail, the typical Unix style of software, based on independent tools and utilities, make it particularly easy to use already-existing applications and components.
Normally, the term Beowulf is only applied to clusters running Linux.

The Software
Not every computing task or algorithm is appropriate for a cluster. A task may not be parallelizable because its essential nature requires that each step be completed before the next step is begun; there is no work that can accomplished in parallel by multiple processors. It is also possible that a task might be parallelizable but not appropriate to the slow Beowulf communication.

The most common question, however, assuming you are not planning to develop the software yourself, is whether the software you actually have available can be used in a parallel fashion. In essence, software that is parallelizable needs to consist of separate components which can run semi-independently on each node in the cluster, plus software to coordinate the different processes and partial results. (Semi-independently means, among other things, that the components should not require much user interaction.)

In some cases, you may be able to find software (such as our Parallel Dragon) which includes all the necessary components together. However, you can also create your own parallel system using available processing modules and script-based coordination. Alternatively, if you have the source code for a software package, you may be able to parallelize it yourself, or hire a consultant to do it.

Multiple Simultaneous computations with process-level parallelism
There is an important difference between the two examples described earlier. In the first example, the only concern was throughput. Each computer was operating independently, converting one single disk file (a piece of music) into one new disk file. It did not matter when each file got finished, only that the full set of them should completed as soon as possible. Thus, coordination among the computers was the only issue.

The task in the second example was more difficult (and accordingly more interesting). In that case, a single task was to be completed more quickly by using multiple computers. The need, therefore, was to be able to break that single task apart, somehow, and then eventually to reassemble the pieces in a useful way.

The first example is an illustration of process-level parallelism (or what is sometimes called an embarrassingly parallel task). (We use the term process to refer to a single indivisible executable such as a Windows .EXE file, which accomplishes a single objective.) This type of parallelism requires:
  • Some already existing program which can perform one instance of an operation (such as compressing music to MPEG format) which you need done multiple times, and which can be run without user intervention (i.e. can be given its input parameters on the command line, in a file, or in some other non-interactive way). Notice that this is exactly the same program that you would run on a single computer, without modifications. You don't need to write the program, rewrite the program, have the source code, or even know anything about how the program does its job. (You do, of course, need a license to run it on multiple computers.)

  • A program or script which can distribute the particular parameters for each instance out to the cluster nodes, and (possibly) collect the results when done.

    This control/administrative program does need to know about your cluster, but it does not need to know anything about the nature of the task you are trying to accomplish. Therefore this program is the added value which you, or somebody, must supply in order for your cluster to function.
By referring to the control program as a single program, we are of course glossing over many possible variations in functionality and organization. You might choose to write this yourself, for a single purpose (e.g. converting music) and for a known number of nodes. It might look like this:

rsh node1.mycluster.myhome.edu bladeenc Stones_LetItBleed.wav
rsh node2.mycluster.myhome.edu bladeenc Beatles_Revolver.wav
rsh node3.mycluster.myhome.edu bladeenc RaviShankar_Duets.wav

(This is a script which invokes the Linux rsh utility to run the bladeenc music conversion program on different computers, using different input files).

A more general solution would be able to adapt to different kinds of tasks and to different and changing cluster configurations, and would provide administrative, job control, and status reporting capabilities. The PVM (standing for Parallel Virtual Machine) package (Geist et al., 1997) provides one approach and set of facilities for creating such control programs. Perl also has such capabilities available. Given an adequately sophisticated control and administration paradigm, process level parallelism can be extended beyond simple multiple execution of independent instances, to include various pipeline and dataflow tasks. The only requirement would be that each pipeline or dataflow stage be capable of being executed as a stand-alone process. However, this is too complex a topic to consider in detail here.

Page 2 of 3
| Previous | Next |

Applications | Technology | Policy | History | News | Tenders | Events | Interviews | Career | Companies | Country Pages | Books | Publications | Education | Glossary | Tutorials | Downloads | Site Map | Subscribe | GIS@development Magazine | Updates | Guest Book