GISdevelopment.net ---> GITA 2000 ---> System Architecture

Up-Time All the Time: Designing GIS for High Availability

Don Brady
Director, GIS Segment
High Performance Computing
Compaq Computer Corporation
Marlboro, MA
Email: don.brady@compaq.com


Introduction
GIS has expanded into the information mainstream as a core enterprise technology. As organizations re-engineer core applications to be spatially enabled, and as the whole enterprise becomes spatially enabled, we are witnessing rapid growth in the amount of spatial data and in the number of GIS users. And as GIS integrates what used to be islands of data into large – sometimes tens of terabytes! – enterprise databases housing both spatial and tabular data, the domain of mission-critical applications expands to spatial applications.

With these changes come several major challenges: GIS data and applications are being treated as a corporate resource, just like more traditional IT implementations; and spatial applications are now commonly subjected to many of the same design principles as traditional enterprise-wide, mission-critical applications. But computer hardware can fail, and such failures are costly to an organization if mission-critical applications cannot be kept running effectively – that is, “available”.
    Your GIS server experiences a hardware fault or a power failure, but your Emergency 911 system is dependent on continued operation of your computer system...

    Your customers report a power outage in their neighborhood, just as you experience a network failure, or your server crashes from a software problem. Will your work crew be able to locate the source of the power outage? ...

    One of the disks storing your GIS database crashes. Can your users continue to work productively and without interruption? …

    There’s a network failure, but your field personnel need uninterrupted access to your AM/FM data…
GIS user applications are "available" only if they allow users to access the GIS server applications and the GIS data files. High Availability environments are designed for computing installations that require critical systems to be automatically and seamlessly restarted in the event of a hardware failure. They can ensure that data remains accessible, and that applications can be kept running, even during a prolonged hardware failure.

This paper will investigate the nature and architecture of a High Availability GIS: the use of standard hardware and software components to provide automatic failover and continuous operation in the event of system failure. It will describe new features of the Tru64 UNIX operating system and TruCluster software from Compaq Computer Corporation that automatically enable this functionality in any supported application environment. At the same time, it will demonstrate how High Availability was earlier implemented by two major GIS software vendors on the Compaq AlphaServer platform, to minimize interruptions to applications and to keep file systems continuously available.

The GIS Environment
GIS implementations today are prototypical client/server environments, minimally consisting of clients, servers, storage, and a network. On the front end are the clients, or users, accessing applications and data. Clients’ systems most commonly run the Windows NT operating system on an Intel platform. Clients can be “thin”, loosely defined as using a graphical user interface (GUI) to access applications that execute on a server; or “fat”, meaning that the spatial applications process on the client hardware. On the back end are data servers, providing access to what are often very large data sets of both spatial and tabular data through a standard programming or user interface. The hardware platforms for these data servers are quite often larger servers, running UNIX or Windows NT. These same hardware systems may also run the enterprise’s applications, including the GIS. Increasingly though, the trend in GIS development is the introduction of a middle tier, forming a three-tier client/server implementation. In such a configuration, the GIS and other applications would be physically moved off the data server and placed on the middle tier.

Two important benefits of this type of configuration are
  1. Each system can be scaled and tuned to most efficiently provide the type of service required of it;
  2. Functional applications can be run in a different operating environment than the database.
In either client/server model, data is stored centrally on a server system. Clients access the data by making requests over the network to a program on the server. The server program coordinates the clients’ access to the data, and satisfies clients’ requests by accessing the data store and responding to the clients.

Risks to the Environment
Inherent to any application environment are various potential system failures that can make the applications and the data inaccessible to their users. These failures can be caused by hardware crashes, software faults, or environmental problems. They can occur on the client systems, on the data or application servers, on the storage systems, or on the network. It is incumbent on those responsible for implementing a GIS – or any mission-critical application – to consider all potential causes of failure, assess their impact, and plan accordingly. In the context of a High Availability solution, our interest lies with the continued accessibility of applications and data served by the second and third tiers of a client/server implementation.

Conversely, a High Availability solution does not specifically address the loss of a client, storage device, or network. A client failure, while annoying, is generally not critical, and easily recoverable. Often a reboot solves the problem, or the user can move to another system. Whatever the case, other users are totally unaffected by the situation.

A disk failure can cause loss of data, and will affect however many users access any files stored on the failed volume. It is decidedly more critical than a client failure, but is solved by RAID (Redundant Array of Independent Disks) technology, with mirroring and striping. While a RAID solution is an integral and essential component of a complete High Availability solution, it is not a topic unique to High Availability, and hence is not a focus of this paper.

Similarly, network failure can be catastrophic. Clearly if a network becomes unavailable, users cannot access the application or data servers. Proper planning to minimize its occurrence and reduce its impact is necessary for any implementation that requires continuous uptime. However, treatment of redundant networks is also a separate topic, and is not treated by this paper.

High Availability configuration
High Availability addresses the perils caused by the failure of the server node which is running an enterprise’s GIS application or data. Consider the impact of losing a server: all clients which are accessing the server application and files on that server are impacted. All users’ activities are suspended until the server problem is diagnosed and resolved, and the system is rebooted.

High Availability is a combined hardware and software solution that minimizes downtime of the complete GIS environment. It requires redundant hardware (multiple servers and network) and redundant and shared data (RAID). It also requires clustering software and services.

The simplest High Availability configuration requires two clustered server systems, both connected to a shared SCSI bus which connects RAID storage volumes that are visible to both servers: both servers can share access to all volumes. The servers do not need to be similar models or similarly configured. Depending on the implementation, each server may have its own system disk (not on the shared SCSI bus), or there may be a single system image for each of the nodes in the cluster.

One server, call it Server1, runs the GIS application (and possibly additional applications as well), and the other server, call it Server2, runs other unrelated applications. Alternatively, both servers could be running the GIS application, with each instance serving a subset, or partition, of the GIS data, or a subset of the clients.

Detection of server failure
The two servers would be connected both to the main network, and also to a typically point-to-point private network used primarily by the High Availability features of the clustering software. The clustering software uses the network to regularly “ping” each of the servers in the configuration: each response to the ping is an indication that the particular responding server is running and available; a lack of response from any particular server is an indication that that server is not available. Detection of such a situation will initiate the High Availability services. Assume the server running the GIS application does not respond to a ping: the GIS server has failed, and thus the GIS application is not available.
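
The detection step described above can be sketched as a small heartbeat monitor. This is only an illustration under stated assumptions, not the TruCluster implementation: the class name `HeartbeatMonitor`, the injected `ping` callable, and the `missed_limit` parameter are all hypothetical. With the default `missed_limit=1`, a single missed ping marks a server as failed, which matches the description in this paper; a deployment might prefer to tolerate more.

```python
from typing import Callable, Dict, Iterable, List


class HeartbeatMonitor:
    """Tracks ping responses and reports servers presumed failed."""

    def __init__(self, servers: Iterable[str], ping: Callable[[str], bool],
                 missed_limit: int = 1) -> None:
        self.ping = ping                  # hypothetical probe: True if the server answers
        self.missed_limit = missed_limit  # consecutive misses that count as failure
        self.missed: Dict[str, int] = {name: 0 for name in servers}

    def poll(self) -> List[str]:
        """Ping every server once and return those that have now missed
        at least `missed_limit` consecutive pings."""
        failed = []
        for name in self.missed:
            if self.ping(name):
                self.missed[name] = 0     # any response resets the count
            else:
                self.missed[name] += 1
                if self.missed[name] >= self.missed_limit:
                    failed.append(name)
        return failed


if __name__ == "__main__":
    responding = {"Server1": True, "Server2": True}
    monitor = HeartbeatMonitor(responding, ping=lambda name: responding[name])
    print(monitor.poll())            # both servers answer
    responding["Server1"] = False    # simulate a crash of the GIS server
    print(monitor.poll())            # Server1 is reported as failed
```

In a cluster, the list returned by `poll` would be the trigger for the High Availability services described in the next section.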

High Availability services
As previously indicated, all applications and data served by the failed system will be (temporarily) unavailable, and all its users and clients will remain idle until the problem is resolved. The High Availability services seamlessly take three essential actions in order to restore normal service to the affected users:
  1. Storage accessed by the failed server is accessed by the surviving server;
  2. The GIS application that was running on the failed server is started on the surviving server;
  3. Client requests to the failed server are re-directed to the surviving server.
These actions are taken not by the applications or by the clients, but by the High Availability services of the clustering software, automatically and transparently to the users. Only after these events are carried out will the clients be able to resume normal operations. While this is occurring, a message may appear on the users’ monitors informing them of a temporary problem that is being resolved. The time required to carry out the automatic failover will vary depending on a number of factors; a minimum of 15-30 seconds is a realistic expectation.
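
The three actions listed above can be sketched as a single routine. This is a simplified model, assuming hypothetical names (`Server`, `fail_over`, and a `routes` table mapping each application to the server that currently handles its client requests); it is not how any particular clustering product is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Server:
    name: str
    volumes: Set[str] = field(default_factory=set)       # shared-bus volumes in use
    applications: Set[str] = field(default_factory=set)  # applications running here


def fail_over(failed: Server, survivor: Server,
              routes: Dict[str, str]) -> List[str]:
    """Move the failed server's work to the survivor.

    Returns the names of the applications that were relocated.
    """
    # 1. Storage accessed by the failed server is taken over by the survivor.
    survivor.volumes |= failed.volumes
    failed.volumes = set()

    # 2. Applications that ran on the failed server are started on the survivor.
    moved = sorted(failed.applications)
    survivor.applications |= failed.applications
    failed.applications = set()

    # 3. Client requests for those applications are redirected to the survivor.
    for app in moved:
        routes[app] = survivor.name
    return moved


if __name__ == "__main__":
    server1 = Server("Server1", {"gis_data"}, {"gis"})
    server2 = Server("Server2", {"mail_spool"}, {"mail"})
    routes = {"gis": "Server1", "mail": "Server2"}
    print(fail_over(server1, server2, routes))
    print(routes)
```

The steps run in the order given in the list: in this sketch the storage is attached first so that the application finds its data when it starts, and requests are redirected only once the application is running.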

During the time that the failed server is out of commission, the surviving server may suffer a performance degradation due to its increased workload. It is imperative that resource planning exercises take this into account when implementing a GIS. It is important to note that while this discussion has alluded to a two-server configuration, a High Availability solution is not limited to only two servers. In fact, in a many-server configuration, failover can occur to multiple nodes, so that the load of the failed server can be shared by multiple surviving servers, thereby spreading out the increased load over several systems.
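
To illustrate how the load of a failed server might be shared among several survivors, the following sketch places each orphaned application on whichever surviving server is currently least loaded. The greedy policy, the function name `spread_failover`, and the load figures (arbitrary units) are assumptions made for the example; the paper does not prescribe a placement policy.

```python
from typing import Dict


def spread_failover(orphaned: Dict[str, float],
                    survivors: Dict[str, float]) -> Dict[str, str]:
    """Assign each orphaned application to a surviving server.

    `orphaned` maps application name to its estimated load;
    `survivors` maps server name to its current load.
    Returns a mapping of application name to chosen server.
    """
    if not survivors:
        raise ValueError("no surviving servers to fail over to")
    load = dict(survivors)  # work on a copy; the caller's figures are untouched
    placement: Dict[str, str] = {}
    # Place the heaviest applications first (ties broken by name).
    for app, cost in sorted(orphaned.items(), key=lambda kv: (-kv[1], kv[0])):
        # Pick the least-loaded survivor (ties broken by name).
        target = min(load, key=lambda name: (load[name], name))
        placement[app] = target
        load[target] += cost
    return placement


if __name__ == "__main__":
    orphaned = {"gis": 40.0, "rdbms": 30.0, "mail": 10.0}
    survivors = {"Server2": 20.0, "Server3": 35.0}
    print(spread_failover(orphaned, survivors))
```

Running the same calculation with a single survivor shows the degradation a two-server cluster must be sized for: every orphaned application lands on the one remaining node.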

When the failed server is repaired and brought back online into the cluster environment, the pinging by the High Availability services will then receive a response from that server and recognize it is once again available. “Failback” can now be initiated automatically or manually: those applications and users that failed over to other nodes can be returned, achieving a load balancing across the entire cluster. The failback will require the same seamless and automatic steps (and delay to the clients) as the initial failover, but in reverse from the surviving server to the resurrected server.

Failover can also be initiated manually on a given node. This action is useful under two circumstances:
  1. Load balancing: if one node in a cluster becomes over-burdened, one or more of its applications can be failed over to other nodes;
  2. Planned upgrades: to perform a hardware or software upgrade on a node, all applications running on that node can be failed over to other nodes to allow it to be temporarily taken out of service.

Automatic failover
Traditionally a client is aware of a number of physical server systems available on a network. But if considered logically, the client sees the network as providing a set of application services, such as e-mail, accounts and the GIS application.

The clients don’t need to know on which hardware systems these services are running; the clients locate and access the services by using a logical network address assigned to each service.

The logical network address for each service is in turn mapped to the network address of the physical server (the hardware) on which the service is running. If, say, Server1 fails, its GIS application will be relocated to Server2, and the GIS service (virtual server) will be re-mapped from Server1 to Server2. From the client’s perspective, the client is accessing the logical address of the GIS service, which does not change regardless of which physical server the GIS application runs on.

For clarity, suppose two applications (services) are running on one server. Each application would have a unique logical network address, which the clients would use to access either of the applications. The server would have a unique physical network address. Both logical addresses of the two applications would map to the same physical address of the server. If the server failed, the High Availability services would fail over both applications to another server, which would have its own unique physical network address. The High Availability services would re-map both applications’ logical network addresses to the physical network address of the surviving server.
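
The two-application example above can be modelled as a small lookup table. This is a sketch only: the class `ServiceDirectory`, the logical names under `example.org`, and the addresses (taken from the 192.0.2.0/24 range reserved for documentation) are hypothetical and do not describe a specific product.

```python
from typing import Dict


class ServiceDirectory:
    """Maps logical service addresses to the physical address of their host."""

    def __init__(self, server_addresses: Dict[str, str]) -> None:
        # server name -> physical network address (fixed per machine)
        self.server_addresses = dict(server_addresses)
        # logical service address -> name of the server currently hosting it
        self.hosts: Dict[str, str] = {}

    def publish(self, logical_address: str, server: str) -> None:
        """Register a service as running on `server`."""
        if server not in self.server_addresses:
            raise KeyError(f"unknown server: {server}")
        self.hosts[logical_address] = server

    def resolve(self, logical_address: str) -> str:
        """Return the physical address a client request is delivered to."""
        return self.server_addresses[self.hosts[logical_address]]

    def remap(self, failed: str, survivor: str) -> int:
        """Point every service hosted on `failed` at `survivor`.

        Returns the number of services that were re-mapped.
        """
        if survivor not in self.server_addresses:
            raise KeyError(f"unknown server: {survivor}")
        count = 0
        for logical_address, server in self.hosts.items():
            if server == failed:
                self.hosts[logical_address] = survivor
                count += 1
        return count


if __name__ == "__main__":
    directory = ServiceDirectory({"Server1": "192.0.2.11",
                                  "Server2": "192.0.2.12"})
    directory.publish("gis.example.org", "Server1")
    directory.publish("mail.example.org", "Server1")
    print(directory.resolve("gis.example.org"))   # physical address of Server1
    directory.remap("Server1", "Server2")         # Server1 has failed
    print(directory.resolve("gis.example.org"))   # same logical name, new host
```

The clients only ever call `resolve` with the logical address, so nothing on the client side changes when the mapping is updated.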

Summary
We have tested and fully characterized High Availability services as described in this paper with two GIS software platforms and a standard RDBMS on TruCluster Available Server for Compaq Tru64 UNIX. Both application environments implemented differential SCSI, an RA 7000 array and RAID 0 partitioning for the database. Clients were mixed UNIX and Windows NT. The cluster interconnect was Memory Channel from Compaq.

We have demonstrated and verified an affordable and easily repeatable solution of standard hardware and software components for Municipal Governments and Utilities that provides transparent failover to clients and serves as a model for future solutions.

© GISdevelopment.net. All rights reserved.