High availability GIS: Beyond the application to the operating environment
Today every business relies on “place” information. Immediate or real-time access to
place information is essential, either because the spatial data itself is critical, or because
the spatial information is a tightly integrated component of a broader mission-critical
application. Utility companies need to locate the source of a power failure when
customers report an outage; a package delivery franchise needs to optimize delivery
routes. What will be the effect on those businesses if their data server hangs as a result of
too much load? Or if a system disk crashes?

Don Brady
Compaq Computer Corporation
200 Forest Street MRO1-3/K14, Marlboro, MA 01752, USA
don.brady@compaq.com

As organizations spatially enable core applications, and integrate large – sometimes tens of terabytes! – enterprise databases housing both spatial and tabular data, the domain of mission-critical applications expands to spatial applications. With these changes come several major challenges: GIS data and applications are being treated as a corporate resource, just like more traditional IT implementations; and spatial applications are now commonly subjected to many of the same design principles as traditional enterprise-wide, mission-critical applications.

But computers can fail: a High Availability GIS solution ensures that spatial data remains available and spatial applications continue running, even during prolonged hardware failure. Standalone computer systems typically can achieve 98 to 99 percent “uptime” – about three and a half to seven days of downtime per year – which for non-critical computing environments is generally acceptable. But mission-critical, can’t-do-business-without-my-computer-system environments can tolerate no more than a few hours per year of downtime, and in intervals of no more than a minute or so per instance. This is the essence of High Availability, and a primary concern of core business operations: 99.9 percent uptime, and downtime lasting for not more than a few seconds to a minute at a time.

It is important to note that “High Availability” does not mean “Fault Tolerant”. A High Availability solution accepts that hardware and software failures will occur; through proper planning and contingency measures it provides quick and seamless recovery from such failure without incurring the added cost of a fault-tolerant implementation. So even though a particular component (e.g., a server) of a computing environment may not be available for an extended period, the entire GIS remains fully available to all users.
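The uptime percentages quoted above can be converted into annual downtime with simple arithmetic. The following is a minimal sketch; the function name `annual_downtime_hours` and the use of a 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


# Compare the standalone (98-99 percent) and High Availability (99.9 percent) cases.
for pct in (98.0, 99.0, 99.9):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% uptime: {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

Under these assumptions, 98 and 99 percent uptime correspond to roughly 7.3 and 3.65 days of downtime per year, while 99.9 percent corresponds to under nine hours per year.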
And such failure must be transparent to the clients: when a server fails, users continue their work as if no problem occurred, even if their application was running on that server, or their data was stored on that server. An implication here is that some other server must substitute for the one that failed. Further, users should not even be aware that a problem occurred, requiring that the substitution be swift. This requirement is what produces the metric that inaccessibility of applications and data must not exceed a few seconds to a minute at a time: anything longer would exceed the user’s threshold for system performance, and consequently would not be transparent.

High Availability and the GIS environment

Large GIS implementations have evolved into prototypical client/server environments. On the front end are the clients, or users, accessing applications and data. On the back end are data servers, providing access to what are often very large sets of both spatial and tabular data through a standard programming or user interface.

Data server hardware platforms are quite often large servers, usually running UNIX, or perhaps Windows NT. These same hardware systems may also run the enterprise’s applications, including the GIS, but increasingly the trend is toward the addition of a middle tier, creating a three-tier client/server environment. In such a configuration, the GIS as well as any of the other applications would be physically relocated to the middle tier. Two important benefits of a three-tier configuration are that each system can be scaled and tuned to most efficiently provide the type of service required of it (that is, for example, one system tuned for database access, another for web-serving or GIS applications); and functional applications can be run in a different operating environment (UNIX, NT) than the database.

In either model, data is stored centrally on a server system.
Clients access the data by making requests over the network to a program (database manager) on the server.

Inherent to any application environment are various potential failures – the result of hardware crashes, software faults, or environmental problems – that can make the applications and the data inaccessible to their users. They can occur on the client systems, on the data or application servers, on the storage systems, or on the network. It is incumbent on those implementing a GIS – or any mission-critical application – to consider all potential causes of failure, assess their impact, and plan accordingly.

High Availability enables recovery from server failure

While recovery from disk failure and network failure must be an integral part of any High Availability solution, the scope of this paper is specifically recovery from server failure. Consider the impact of losing a GIS application server or data server: all clients accessing the spatial applications and files on that server are affected; all users’ activities are suspended until the server problem is diagnosed and resolved. High Availability solutions are designed for computing installations that require critical applications to be automatically and seamlessly restarted in the event of server failure, ensuring that data remains accessible and that applications are kept running, even during a prolonged failure in the second or third tier of a client/server implementation.

When considering a High Availability solution, one often thinks of expensive and custom hardware, and sophisticated, costly application design. But the more practical and robust High Availability GIS implementation is a combination of commercial off-the-shelf (COTS) hardware and software, and multiple instances of hardware (redundant standard servers, storage, and network adapters) and shared data (RAID). Such a combination enables recovery from failure of any critical component.
Consider the following analogy: A farmer has a sled that must be pulled across the country, and can choose between using one horse or several dogs to do the pulling. Assuming either choice would provide sufficient power to do the job, is either preferable? If one of the farmer’s requirements is that the sled always keep moving, then the “multiple dog” choice is better. If one dog becomes disabled, the remaining ones can work a bit harder, even though the sled may move a bit more slowly until the one dog recovers or is replaced. On the other hand, if the one horse becomes disabled, the sled will go nowhere for the duration. One less than the optimum number of dogs is better than no horse!

When configuring a computing environment to run an enterprise’s set of applications, a similar decision must be made regarding appropriate systems: one “large” one, or multiple “smaller” ones. While there are valid reasons for both approaches, a High Availability environment is strongly biased toward the “multiple smaller server” approach. Even when one smaller system fails, the application environment is still available to the users as long as the surviving servers cooperate and take on the work of the failed one.

Indeed, the multiple servers can work both cooperatively – when one server fails, the survivors take on the workload of their failed counterpart; and independently – under normal circumstances, each is running its own set of applications or serving its own set of users. In the eventuality of failure, the software component of a High Availability solution can detect the failure and automatically impose the cooperation – those applications running on the failed server are started up on (“failed over” to) the surviving servers, and those users connected to those applications are switched over as well.
The multiplicity of servers, and the software that enables the cooperation among them, is known as clustering, and should be implemented at both the middle and back tiers of a three-tier client/server configuration for complete High Availability. Rather than implement a mission-critical GIS on a hardware platform that incurs the engineering cost of fault-tolerant design, an organization can much more economically configure its GIS on a cluster of two (or more) standard, inexpensive and cooperating servers. In the event of failure on one node, the surviving nodes can take on the workload of the failed system until the problem is diagnosed and resolved.

Since the nodes of a cluster are typically configured to run independent workloads under normal circumstances, the nodes are not merely redundant. Thus an additional benefit of clustering, besides providing the basis of an affordable High Availability solution, is scalability: an organization can add members to a cluster over time to run the enterprise’s various tasks and meet its computing needs. (A farmer can add more dogs to pull a heavier sled, or to pull the same sled faster, or temporarily to pull the same sled up a steep hill!)

And the benefits of scaling are cyclic: the more nodes in a cluster, the better the performance of the cluster after a system failure. When one system fails in a two-node configuration, the survivor must take on, in addition to its own normal workload, 100% of the load of those applications on the failed server that have been set up as being highly available. In a four-node configuration, the three survivors can share the additional load of those highly available applications from the failed node, each taking on one-third of the load. Thus as the computing environment scales, the burden of cooperation is shared, and the performance degradation of any of the survivors during the outage of the failed server can be minimized.
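The two-node and four-node cases above generalize to any cluster size. The sketch below assumes that the failed node's highly available workload is spread evenly over the survivors, which is a simplification; the function name `extra_load_per_survivor` is hypothetical.

```python
def extra_load_per_survivor(nodes: int) -> float:
    """Fraction of one failed node's highly available workload that each
    survivor absorbs, assuming the load is spread evenly (a simplification)."""
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes to fail over")
    survivors = nodes - 1
    return 1.0 / survivors


# Show how the per-survivor burden shrinks as the cluster grows.
for n in (2, 3, 4, 8):
    share = extra_load_per_survivor(n)
    print(f"{n}-node cluster: each survivor takes on {share:.0%} of the failed node's load")
```

With two nodes the lone survivor absorbs the full load; with four nodes each survivor absorbs one-third, matching the figures in the text.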
Indeed, the additional burden per surviving node is inversely proportional to the number of surviving servers in the cluster. (The more dogs pulling the sled, the less the absence of one will be noticed!)

A key component in the design of High Availability clustering is its ability to detect individual system failure, and automatically and seamlessly start up the affected applications on surviving nodes. If the GIS server fails, the GIS application can be automatically started up on a cooperating server, and client processes transparently switched over with it.

The simplest High Availability configuration requires two clustered servers; to be sure, any discussion pertaining to a two-node cluster also applies to clusters of more than two servers. Under normal circumstances, one server within a cluster could run the GIS application (and possibly additional applications as well), and other servers could run other unrelated applications. Alternatively, both servers could run the GIS application, with each instance serving a subset, or partition, of the GIS data, or a subset of the clients.

Where is the High Availability?

High Availability can either be built in at the operating environment level, or can be engineered by the application software vendor into the GIS itself. High Availability as an integral component of the operating environment benefits both the user (most applications running on the cluster can easily take advantage of all the High Availability features) and the application software vendor (no incremental cost and complexity of engineering High Availability into the application). GIS developers can focus on their expertise of developing GIS solutions, and need not be concerned with the details of detecting system failure or of implementing seamless and transparent failover among cooperating servers.
While UNIX systems provide a solid foundation, meeting the availability requirements of mission-critical applications demands a more comprehensive and dependable clustering solution. And to be truly affordable, such a solution must require no unique system configurations, specialized operating system variants, or proprietary storage components: the clustering environment must use the same standard server platforms, operating environment, storage architecture, fibre channel, disk controllers, and network adapters as other systems. One model for High Availability designed into the operating environment is Compaq Computer Corporation’s Tru64™ UNIX operating system and TruCluster™ Available Server, which provides a complete and robust High Availability environment on the Compaq AlphaServer™ platform.

High Availability provided by the operating environment

TruCluster Available Server provides multi-host access to shared disks and a generic failover mechanism, making applications and data highly available. Individual servers in a cluster each run an independent workload of applications or network services. Each system monitors the health of the others by watching for "heartbeat" signals sent over both network and SCSI channels, ensuring reliable failure detection, while differentiating among network, I/O, and host failures. If one of the systems stops signaling, High Availability services detect the problem and automatically initiate a failover of applications to the remaining systems.

As failover is initiated, the recovering system takes over the failed system’s network identity and storage devices, using either the Advanced File System (AdvFS) built into Tru64 UNIX or a standard database management system (DBMS). The recovery takes just a few seconds for AdvFS; DBMS recovery time will vary, depending on the DBMS and the size of the database involved. TruCluster Available Server configurations generally accomplish failover within 15 to 30 seconds.
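The idea of watching heartbeats on two independent channels to tell host, network, and I/O failures apart can be illustrated with a small model. This is only a sketch of the general technique, not TruCluster Available Server's actual algorithm or interface: the names `PeerStatus` and `classify`, the status strings, and the 5-second timeout are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PeerStatus:
    """Times (in seconds) at which a peer's heartbeat was last seen."""
    last_network_beat: float  # last heartbeat received over the network
    last_scsi_beat: float     # last heartbeat received over the shared SCSI bus


def classify(peer: PeerStatus, now: float, timeout: float = 5.0) -> str:
    """Classify a peer by which heartbeat channels have gone silent."""
    network_silent = now - peer.last_network_beat > timeout
    scsi_silent = now - peer.last_scsi_beat > timeout
    if network_silent and scsi_silent:
        return "host_failure"     # silent on both channels: trigger failover
    if network_silent:
        return "network_failure"  # host is still alive on the I/O path
    if scsi_silent:
        return "io_failure"       # host is still alive on the network
    return "healthy"


# A peer last heard on the network 12 s ago but on the SCSI bus 1 s ago.
print(classify(PeerStatus(last_network_beat=88.0, last_scsi_beat=99.0), now=100.0))
```

Requiring silence on both channels before declaring a host failure is what keeps a mere network outage from triggering an unnecessary failover in this model.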
As has been stated, the surviving server(s) may suffer a performance degradation during the time that the failed server is out of commission, due to the increased workload of the applications that had been running on the failed server. It is imperative that resource planning exercises take this into account when implementing a GIS. In a many-server configuration, High Availability services can be configured to set multiple specific nodes as failover targets, thereby spreading out the load of the failed server among multiple surviving servers.

When the problem with the failed server is resolved, and the server brought back online, it resumes sending “heartbeats”, which signal its restored availability to the High Availability software of the cluster. All other members of the cluster will automatically be notified, and failback will be automatically initiated: those applications and users that failed over to other nodes will be returned, restoring the load balance across the entire cluster. The failback will be achieved in an even more seamless and transparent manner: the GIS application will be started up on its resurrected node while the failed-over instance is still serving users on the cooperating server. The time to switch a user context from one server application instance back to the original is then negligible.

The advantages of engineering High Availability into the operating environment are not limited to failover being available to all applications and users. Indeed there are additional benefits that would not be possible if High Availability were merely a design feature of the GIS itself. Through the system administration interface, user-transparent application failover can be invoked manually, an action which is common practice for planned maintenance, for hardware and software upgrades, and for re-assigning a server to a set of high-priority tasks that may need dedicated resources.
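Configured failover targets, failback, and administrator-invoked failover all rest on the same relocation step, which can be pictured with a toy model. This is a sketch under stated assumptions, not the TruCluster administration interface: the `Cluster` class, its methods, and the node and service names are all hypothetical.

```python
class Cluster:
    """Toy model of service placement with per-service failover targets."""

    def __init__(self, home, targets):
        self.home = dict(home)            # service -> node where it normally runs
        self.targets = {s: list(t) for s, t in targets.items()}  # ordered preferences
        self.placement = dict(home)       # service -> node where it runs right now
        self.up = set(home.values()) | {n for t in targets.values() for n in t}

    def fail_over(self, node):
        """Relocate every service on `node` to its first surviving target.

        The same step serves an unexpected failure or planned maintenance."""
        self.up.discard(node)
        moved = {}
        for service, current in self.placement.items():
            if current != node:
                continue
            target = next((n for n in self.targets.get(service, []) if n in self.up), None)
            if target is None:
                raise RuntimeError(f"no surviving failover target for {service}")
            moved[service] = target
        self.placement.update(moved)
        return moved

    def fail_back(self, node):
        """Return services to `node`, their home, once it is back online."""
        self.up.add(node)
        returned = [s for s, h in self.home.items()
                    if h == node and self.placement[s] != node]
        for service in returned:
            self.placement[service] = node
        return returned


cluster = Cluster(
    home={"gis": "node1", "web": "node1", "billing": "node2"},
    targets={"gis": ["node2", "node3"], "web": ["node3", "node2"], "billing": ["node3"]},
)
print(cluster.fail_over("node1"))  # the two node1 services go to different survivors
print(cluster.fail_back("node1"))  # both services return home
```

Giving each service its own ordered target list is one simple way to spread a failed node's load across several survivors rather than piling it onto one.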
When a node of a cluster needs to be taken out of service for any reason, the system administrator can do so in a manner transparent to users, invoking the same seamless failover that occurs automatically with unexpected server failure. All applications running on the node will be failed over to other members of the cluster to allow it to be taken out of service. As with normal failback, the applications will be started up on the target nodes even while the node to be brought down continues to serve users, making the overall process even more transparent.

Summary

We have looked at a High Availability solution for GIS, enabled by Tru64 UNIX. Clusters of standard systems, rather than specialized and expensive fault-tolerant hardware, are the platform on which the environment runs; service failover is expedient and transparent to all client users. In most cases users will not even be aware that a server has failed. All applications – not only the GIS – installed on the cluster can easily avail themselves of these features, with no engineering cost to the software vendor or end user. Application failover can also be invoked manually to facilitate load balancing, software upgrades, or system maintenance.

The heterogeneous nodes of the cluster work both cooperatively and independently. They take on the load of a failed member, but also run their own workloads and can be added dynamically to the cluster to scale the environment as the customer’s needs grow. As an organization grows a cluster to increase scalability, the environment becomes even more available as failover load can be spread among surviving nodes; conversely, if an organization grows a cluster to ease the burden of recovery from a server failure, the environment scales proportionately.

AlphaServer, TruCluster, and Tru64 are registered in the U.S. Patent and Trademark Office. Windows is a registered trademark of Microsoft Corporation in the United States and/or other countries.
All other product names mentioned herein may be trademarks or registered trademarks of their respective companies.

© GISdevelopment.net. All rights reserved.