High availability GIS: Beyond the application to the operating environment
High Availability enables recovery from server failure
While recovery from disk failure and network failure must be an integral part of any High
Availability solution, the scope of this paper is specifically recovery from server failure.
Consider the impact of losing a GIS application server or data server: all clients accessing
the spatial applications and files on that server are affected; all users’ activities are
suspended until the server problem is diagnosed and resolved. High Availability
solutions are designed for computing installations that require critical applications to be
automatically and seamlessly restarted in the event of server failure, ensuring that data
remains accessible and that applications are kept running, even during a prolonged failure
of the second and third tiers of a client/server implementation.
When considering a High Availability solution, one often thinks of expensive and custom
hardware, and sophisticated, costly application design. But the more practical and robust
High Availability GIS implementation is a combination of commercial off-the-shelf
(COTS) hardware and software, and multiple instances of hardware (redundant standard
servers, storage, and network adapters) and shared data (RAID). Such a combination
enables recovery from failure of any critical component.
Consider the following analogy: A farmer has a sled that must be pulled across the
country, and can choose between using one horse or several dogs to do the pulling.
Assuming either choice would provide sufficient power to do the job, is either preferable?
If one of the farmer’s requirements is that the sled always keep moving, then the
“multiple dog” choice is better. If one dog becomes disabled, the remaining ones can
work a bit harder, even though the sled may move a bit more slowly until the one dog
recovers or is replaced. On the other hand, if the one horse becomes disabled, the sled
will go nowhere for the duration. One less than the optimum number of dogs is better
than no horse!
When configuring a computing environment to run an enterprise’s set of applications, a
similar decision must be made regarding appropriate systems: one “large” one, or
multiple “smaller” ones. While there are valid reasons for both approaches, a High
Availability environment is strongly biased toward the “multiple smaller server”
approach. Even when one smaller system fails, the application environment is still
available to the users as long as the surviving servers cooperate and take on the work of
the failed one.
Indeed, the multiple servers can work both independently and cooperatively. Under normal
circumstances, each runs its own set of applications or serves its own set of users; when
one server fails, the survivors take on the workload of their failed counterpart. In the
event of failure, the software component of a High Availability solution detects the
failure and automatically imposes the cooperation: the applications running on the failed
server are started up on (“failed over” to) the surviving servers, and the users connected
to those applications are switched over as well.
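
The detect-and-fail-over behavior described above can be sketched in a few lines of
Python. This is a minimal illustration only, not the mechanism of any particular
clustering product: the Node class, the heartbeat timestamps, the 5-second timeout, and
the "least-loaded survivor" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server in the cluster (hypothetical model, for illustration only)."""
    name: str
    last_heartbeat: float                      # time of last heartbeat, in seconds
    apps: list = field(default_factory=list)   # applications currently hosted


def fail_over(nodes, now, timeout=5.0):
    """Restart the applications of failed nodes on the surviving nodes.

    A node is treated as failed when its last heartbeat is older than
    `timeout` seconds (an assumed detection rule). Each orphaned application
    is placed on whichever survivor currently hosts the fewest applications.
    Returns the list of surviving nodes.
    """
    failed = [n for n in nodes if now - n.last_heartbeat > timeout]
    failed_names = {n.name for n in failed}
    survivors = [n for n in nodes if n.name not in failed_names]
    if failed and not survivors:
        raise RuntimeError("no surviving node is available for failover")
    for dead in failed:
        for app in dead.apps:
            target = min(survivors, key=lambda n: len(n.apps))
            target.apps.append(app)
        dead.apps = []
    return survivors


# Example: "alpha" has been silent for 10 seconds and is declared failed.
nodes = [
    Node("alpha", last_heartbeat=100.0, apps=["gis"]),
    Node("beta", last_heartbeat=109.0, apps=["billing"]),
    Node("gamma", last_heartbeat=109.5),
]
fail_over(nodes, now=110.0)
print({n.name: n.apps for n in nodes})
# prints {'alpha': [], 'beta': ['billing'], 'gamma': ['gis']}
```

In this sketch the GIS application lands on the idle node; a real cluster manager would
also have to move network addresses and storage access along with the application.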
The multiplicity of servers, and the software that enables the cooperation among them, is
known as clustering, and should be implemented at both the middle and back tiers of a
three-tier client/server configuration for complete High Availability. Rather than
implement a mission-critical GIS on a hardware platform that incurred the engineering
cost of fault-tolerant design, an organization can much more economically configure its
GIS on a cluster of two (or more) standard, inexpensive and cooperating servers. In the
event of failure on one node, the surviving nodes can take on the workload of the failed
system until the problem is diagnosed and resolved.
Since the nodes of a cluster are typically configured to run independent workloads under
normal circumstances, the nodes are not merely redundant. Thus an additional benefit of
clustering, besides providing the basis of an affordable High Availability solution, is
scalability: an organization can add members to a cluster over time to run the enterprise’s
various tasks and meet its computing needs. (A farmer can add more dogs to pull a
heavier sled, or to pull the same sled faster, or temporarily to pull the same sled up a
steep hill!)
And the benefits of scaling compound: the more nodes in a cluster, the better the
performance of the cluster after a system failure. When one system fails in a two-node
configuration, the survivor must take on, in addition to its own normal workload, 100%
of the load of those applications on the failed server that have been set up as being highly
available. In a four-node configuration, the three survivors can share the additional load
of those highly available applications from the failed node, each taking on one-third of
the load. Thus as the computing environment scales, the burden of cooperation is shared,
and the performance degradation of any of the survivors during the outage of the failed
server can be minimized. Indeed, the additional burden per surviving node is inversely
proportional to the number of surviving servers in the cluster. (The more dogs pulling the
sled, the less the absence of
one will be noticed!)
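
The arithmetic behind this claim can be checked with a short Python sketch. It assumes,
for illustration, that every node normally carries one equal unit of highly available
load and that the load of a failed node is split evenly among the survivors; the function
name is hypothetical.

```python
from fractions import Fraction


def extra_load_per_survivor(nodes_in_cluster, failed=1):
    """Fraction of one node's highly available load that each survivor absorbs.

    Assumes equal load per node and an even split among survivors.
    """
    survivors = nodes_in_cluster - failed
    if failed < 0 or survivors < 1:
        raise ValueError("at least one node must survive")
    return Fraction(failed, survivors)


for n in (2, 4, 8):
    print(n, "nodes:", extra_load_per_survivor(n))
# prints:
# 2 nodes: 1
# 4 nodes: 1/3
# 8 nodes: 1/7
```

The two-node and four-node results match the 100% and one-third figures given above.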
A key component in the design of High Availability clustering is its ability to detect
individual system failure, and automatically and seamlessly start up the affected
applications on surviving nodes. If the GIS server fails, the GIS application can be
automatically started up on a cooperating server, and client processes transparently
switched over with it. The simplest High Availability configuration requires two
clustered servers, although any discussion pertaining to a two-node cluster also applies
to clusters of more than two servers. Under normal circumstances, one server within a
cluster could run the GIS application (and possibly additional applications as well), and
other servers could run other unrelated applications. Alternatively, both servers could run
the GIS application, with each instance serving a subset, or partition, of the GIS data, or a
subset of the clients.
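
As a rough illustration of the second arrangement, the Python sketch below spreads
clients across the GIS instances that are currently alive. Hash-based assignment is an
assumption chosen for brevity, not something the configuration requires, and the server
and client names are hypothetical.

```python
import zlib


def assign_server(client_id, live_servers):
    """Pick a GIS server for a client by hashing its id over the live servers.

    The same client always maps to the same server while the list of live
    servers is unchanged.
    """
    if not live_servers:
        raise ValueError("no live GIS server is available")
    index = zlib.crc32(client_id.encode("utf-8")) % len(live_servers)
    return live_servers[index]


# Normal operation: both instances share the clients.
print(assign_server("client-17", ["gis-a", "gis-b"]))

# After gis-a fails, every client is directed to the survivor.
print(assign_server("client-17", ["gis-b"]))
```

Note that with this simple modulo scheme, removing a server from a larger cluster can
also move some clients that were connected to surviving servers; a production design
would likely pin those clients in place.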
Where is the High Availability?
High Availability can either be built in at the operating environment level, or can be
engineered by the application software vendor into the GIS itself. High Availability as an
integral component of the operating environment benefits both the user (most
applications running on the cluster can easily take advantage of all the High Availability
features) and the application software vendor (no incremental cost and complexity of
engineering High Availability into the application). GIS developers can focus on their
expertise in developing GIS solutions, and need not be concerned with the details of
detecting system failure or of implementing seamless and transparent failover among
cooperating servers.
While UNIX systems provide a solid foundation, meeting the availability requirements of
mission-critical applications demands a more comprehensive and dependable clustering
solution. And to be truly affordable, such a solution must require no unique system
configurations, specialized operating system variants, or proprietary storage components:
the clustering environment must use the same standard server platforms, operating
environment, storage architecture, fibre channel, disk controllers, and network adapters as
other systems.
One model for High Availability designed into the operating environment is Compaq
Computer Corporation’s Tru64™ UNIX operating system and TruCluster™ Available
Server, which provides a complete and robust High Availability environment on the
Compaq AlphaServer™ platform.