GISdevelopment.net ---> GITA 1999 ---> User Perspectives

Outage Management Architecture Decisions at Niagara Mohawk

Brian A. Detota
IT GIS Project Manager
Niagara Mohawk Power Corporation
300 Erie Boulevard West
Syracuse, New York 13202
Phone: (315) 428-6691
Email: detotab@nimo.com


Introduction
Niagara Mohawk Power Corporation (NMPC) is a public-owned electric and gas utility with a service territory of 21,000 square miles reaching throughout Upstate New York. It serves 1.5 million electric and 500,000 gas customers. Its main office and computer center are located in Syracuse, New York. The geographic area spans 37 counties including 670 cities, towns and villages. A sampling of the electric facilities profile shows 131,000 circuit miles, 2,200 feeders, 1.2 million poles and 345,000 transformers.

In 1999 NMPC will complete a three-year effort to convert data from various disjointed systems into one seamless GIS database. As data for a region is converted, it is made available in production, where GIS users have access to a suite of applications such as work order design, estimating and records management. An extensive QA/QC plan was developed to ensure network connectivity throughout the electric model down to individual customer locations. This intensive effort continues because this connectivity is the foundation for the Outage Management System NMPC envisions.

Niagara Mohawk, like most utilities, is under transformation, with re-engineering efforts addressing the way the business can respond to the competitive marketplace. Several of the outcomes from these efforts can have a significant impact on the architecture decisions to support the Outage Management System. For instance, when performance feasibility studies were done at NMPC in early 1997 there were four Regional Control Centers (RCCs) that managed outages:
  1. Central area RCC located in Syracuse.
  2. Western area RCC located in Buffalo.
  3. Eastern area RCC located in Albany.
  4. Northern area RCC located in Watertown.
The plan then was to have a distributed environment with each center being capable of stand-alone operation. Since that time the Northern RCC has been closed and the management of its geographic area consolidated with the Central RCC. There is now a strong indication that over the next two years all centers will be consolidated into one centralized location. However, a requirement surfaced that suggests 10 to 12 district offices may require access to the system during certain storms. Additionally, NMPC is planning on implementing a new Customer Information System (CIS) in 1999. The CIS is the entry point for customer trouble calls, so we need to plan on handling calls from both the old and the new CIS. In short, the Outage Management business processes demand an adaptive Outage Management technical architecture that can support the changing requirements of the business, yet integrate with the enterprise-wide technical architecture.

Outage System Requirements
The NMPC Business Requirements outlined performance and technical criteria for the Outage System. NMPC has chosen an Outage Management package where certain technical criteria were included with the functionality of the product. The rest of the criteria would be dependent on factors within the NMPC environment such as interfacing with the Customer Information System. The following requirements focus on those that are pertinent in considering the technical architecture for implementing the system.

Business Requirements
The system needs to provide the capability for assignments of dispatchers within Dispatch Areas by geographic location with the ability to easily shift a dispatcher from one area to another. Additionally, if one of the regional control centers operations is lost another center can take over dispatching for the lost center dispatch areas. In certain storms the system must be made available to district offices. The system will provide customer outage feedback information to the Customer Information System.
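The takeover requirement can be pictured as a simple reassignment of dispatch areas from a lost center to a surviving one. The following is only a minimal sketch under stated assumptions: the dispatch-area names and the `take_over` helper are hypothetical and are not part of the Outage Management package.

```python
from typing import Dict

# Hypothetical dispatch areas mapped to the regional control center that
# currently dispatches them. The area names are illustrative only.
area_to_center: Dict[str, str] = {
    "Syracuse-North": "Central",
    "Watertown": "Central",
    "Buffalo-City": "Western",
    "Albany-City": "Eastern",
}


def take_over(assignments: Dict[str, str], lost: str, backup: str) -> Dict[str, str]:
    """Return a new mapping in which every area of `lost` is handled by `backup`."""
    return {
        area: (backup if center == lost else center)
        for area, center in assignments.items()
    }


# The Western center is lost, so the Central center takes over its areas.
after_loss = take_over(area_to_center, lost="Western", backup="Central")
print(after_loss["Buffalo-City"])  # Central
```

Returning a new mapping rather than changing the original keeps the normal assignments available for the moment the lost center comes back.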

Performance Requirements
Many Outage Management Systems are hobbled by the time to build and manage updates for the dynamic electric model required for an Outage System from a relatively static GIS model. This system must be able to manage incremental updates from GIS on a nightly basis without interrupting work on an active Outage. Other performance considerations include:
  • The prediction engine processing 30,000 trouble calls per hour from the CIS interface.
  • Graphics rendering and screen redraws must be fast enough so as not to impact dispatcher performance.
  • 100 switching operations per hour.
  • 1,000 dispatch requests per hour.

Technical Requirements
Because this is a mission-critical system, highly available 7 x 24 operation is essential. This includes automatic detection of background process failures, such as a failure of the process that retrieves trouble tickets from CIS. Additionally, the system is required to run on the NT operating system with dual-head monitors on the client side. This is a must to integrate with the NMPC desktop strategy.

Requirements Analysis
The outage package NMPC selected met the requirements that were attributed to its functionality. Based on some feasibility tests, the vendor felt a distributed architecture would satisfy the performance requirements and also provide multiple stand-alone environments that would offer built-in redundancy if a site were lost. For example, if the WAN link to a site were lost, the site could continue to operate.

As NMPC IT reviewed the requirements and the suggested architecture, several issues arose regarding the cost of the distributed architecture and the deviation from our current technical architecture that it would require. Although two years ago we envisioned implementing the distributed environment, our thinking changed as we examined the changing business and what our current infrastructure could support. We felt the package was flexible enough for us to consider a centralized approach, which would meet the business requirements and leverage our existing infrastructure and tool sets.

Decisions to Support a Centralized Architecture

Detailed Analysis
In early 1998 detailed discussions began with different groups from IT to review the technical requirements. The outage project team, enterprise technology assessment group, distributed computing, network services, database administration and mid-range technical services were all represented. Discussions on feasibility issues had taken place with these areas previously, but these were the formal discussions to determine a supportable architecture.

The project team approached these discussions with a bias toward a distributed architecture that at the time would be implemented at four regional control centers. Each site would consist of two UNIX servers each managing different database engines and an NT server to run background trouble call retrieval and prediction processes. Database replication services would be used to synchronize the data between the four sites, so each site would have access to updates from other sites. This provided the benefit of making each site capable of stand-alone operation in the event of a failure such as losing a WAN link.

Additionally, the probability of at least one center being available in a catastrophic event would be near 100 percent. If a remote site required access to the system, that could be accomplished by running products such as Citrix WinFrame or Terminal Server, with the remote site being served from the closest control center. It seemed this architecture would satisfy the requirements, particularly the flexibility in having centers take over dispatching operations for other centers, and would mitigate the risk of not having an operating center available in the event of a catastrophe.

As the group mapped the components to support the distributed environment it became clear this approach did not fit with the NMPC enterprisewide technical architecture. We did not have a support staff to administer the UNIX servers in the regional control centers. UNIX is our mid-range standard and although we are looking at NT servers for particular applications, we didn’t feel it would be appropriate for this mission critical system. The DBA group had not worked with database replication and from their research weren’t comfortable with its reliability and resilience for this application in our environment.

Another major consideration is that our Customer Information System is centralized running on mainframe databases. This system is the primary point for the entry of trouble calls by customer representatives or through voice response units and without this the use of the Outage Management System is limited. So, this identified a weak link where, although each control center could operate stand-alone, it may not be able to get trouble calls and communicate with the CIS.

At this time we were getting information from re-engineering that the system would be required in 10 to 12 district offices for certain storm situations, but that the regional control centers would move toward consolidation over the next two years. In evaluating this we would need to buy UNIX servers for each site and administer them or try and run an emulation product such as Citrix. We felt an emulation product was not the direction we’d want to take at this time. Therefore, for each additional site we would incur the cost of buying servers, replication licenses and support services. For each site we calculated the server hardware and database software including replication would be around $75,000. We didn’t try very hard to calculate a support cost because we knew we had to investigate a centralized option which could still meet the requirements.

As we considered a centralized approach we saw issues with two requirements: regional control centers taking over dispatch operations for each other, and performance, particularly graphics rendering and screen redraws. We had always thought it would be an advantage to manage an outage from the regional control center closest to it, in the event communication was lost across the WAN. However, after we reviewed this with the business analysts again, it was clarified that this was not a must; the requirement was only that any other center could take over.

This was a key factor because it meant a centralized architecture could still meet this requirement as long as the central computer center was operational and at least one WAN link to an RCC was functional. NMPC has considerable experience and metrics for supporting mission-critical systems from a central computer center. One of these systems is the mainframe Storm Restoration System, which has been in operation for 12 years and is being replaced by the Outage Management System. Investments have been made in hardware, software and a fail-over site to provide high availability for the computer center. This has led to a 99.9% availability metric for the mission-critical systems.
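To put the 99.9% figure in concrete terms, an availability percentage can be converted into the downtime it permits over a period. The sketch below assumes a non-leap year of 8,760 hours; the function name is illustrative, and the source does not say over what period the metric is measured.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted by an availability fraction over a period."""
    return (1.0 - availability) * period_hours


# 99.9% availability leaves 0.1% of the year, a little under nine hours.
print(round(allowed_downtime_hours(0.999), 2))  # 8.76
```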

NMPC feels comfortable with the risk of the centralized approach and continues to invest in fail-over strategies, particularly in the mid-range area. NMPC is also investing strategically in high-speed networking with redundancy to the major offices. An ATM backbone is already in place to the existing regional control centers at OC3 (155 Mbps) or DS3 (45 Mbps) speeds. This strategy has addressed the risk of losing all WAN communications and the concerns with performance. Our LAN standard is 100 Mbps Ethernet, so running at OC3 across the WAN should effectively be the same. This seemed to cover the major centers, but we still needed to address performance for the 10 to 12 district offices that may require access to the system, which would likely be on existing T1 lines. The decision was made to upgrade those sites to DS3 or OC12, since this would fit with the overall NMPC network upgrade strategy, except that the cost would be incurred sooner than expected.

For this approach we would buy two higher-end UNIX servers for the central computer center at a cost of roughly $120,000. These servers would have a minimum of 1 GB of memory and dual 180-MHz processors. Additional costs would occur as decisions were made on high availability hardware and/or software products. But these decisions would be based on utilizing existing products or leveraging new products as part of an enterprise mid-range HA strategy.
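The server figures quoted for the two approaches can be set side by side with simple arithmetic. This is a rough sketch only. It uses the roughly $75,000 per-site estimate for the distributed option and the roughly $120,000 for the two central servers. It deliberately leaves out support staff, network upgrades and high availability products, which the paper does not price. The site counts of 13 and 15 assume every one of the 10 to 12 district offices would get its own servers on top of the three remaining control centers, which is an assumption made here for illustration.

```python
PER_SITE_DISTRIBUTED = 75_000  # server hardware + database software incl. replication
CENTRAL_SERVERS = 120_000      # two higher-end UNIX servers at the computer center


def distributed_cost(sites: int) -> int:
    """Server and database cost of equipping `sites` locations."""
    return sites * PER_SITE_DISTRIBUTED


# 3 remaining RCCs, the original 4 RCCs, and 3 RCCs plus 10 or 12 district offices.
for sites in (3, 4, 13, 15):
    print(f"{sites:2d} sites: ${distributed_cost(sites):,} distributed "
          f"vs ${CENTRAL_SERVERS:,} centralized")
```

Even at three sites the distributed figure is nearly double the central server cost, and it grows linearly with every site added.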

The outcome of these sessions was a decision to pursue a centralized architecture and to develop and evaluate benchmarks for various network and hardware configurations. We also realized that, depending on the benchmark results, we might still need to move to a distributed architecture. But we felt the costs we would incur in trying the centralized approach would not be wasted, because everything we were doing aligned with the NMPC enterprise architecture strategy.

Implementation Plan
The plan to implement this system consisted of a phased approach beginning with a pilot area of 50,000 customers handled by the Syracuse RCC. This went live this past December and is running in parallel with the mainframe Storm Restoration System (SRS).

The SRS is coupled with the CIS system and is used by customer representatives to create trouble tickets from customer call-ins. It prints the trouble tickets in the control centers, where the dispatcher reviews the tickets and manages the rest of the tasks associated with an outage manually. Although limited and now technically outdated, this system has been critical to NMPC and the process of managing outages. It provides a contingency for the new Outage System that will allow the dispatchers to work with the new system while the re-engineered processes mature and the software is customized from their feedback.

Through the first six months of 1999 the use of the system will be phased in for all operating districts managed by the Syracuse RCC (about 400,000 customers), still running in parallel with the SRS. This plan affords us the opportunity to gain production experience with the system, develop in-house expertise for supporting and customizing the software, and validate our architecture decisions while still having a fallback system. The rest of 1999 will phase in the operating districts managed by the Western RCC, for a total of about 1 million customers. Since the GIS data for the remaining eastern areas will not be converted until the end of 1999, we don’t expect to be fully implemented until Q2 of 2000. Built into the SRS interface is the ability to turn off printing of the trouble tickets by operating district once we’re confident of the new system’s operation and process for an area.

Customer System Interface
As mentioned, customer representatives create trouble tickets from customer call-ins using the current SRS and CIS systems. These are mainframe systems based on IMS databases. In 1999 the CIS will be replaced with a new system based on mainframe DB2 databases. This new system will continue to work in conjunction with the SRS system. However, it will assume all customer contact user interface functionality previously handled by the Storm Restoration System. The SRS will still be responsible for printing the trouble tickets via an interface with the new CIS.

Together with analysts from CIS, we built an architecture that would meet the following interface requirements:
  • Accessing trouble tickets for the Outage system under the old SRS - CIS interface.
  • Accessing trouble tickets under the new CIS for Outage and for SRS printing to allow the systems to run in parallel.
  • Operation of the new CIS - SRS interface for areas where Outage is not implemented.
  • The Outage System or SRS providing outage status information for the customer representative.
  • Eliminating the SRS interface when the Outage system is fully implemented and does not need to run in parallel (Q3, 2000).


Figure 1- CIS - Outage Interface

Figure 1 outlines the architecture that was implemented in December 1998 to interface the current CIS with the Outage System. The major changes were the addition of two new DB2 tables, one for trouble ticket data and one for outage feedback data. When a trouble ticket is created, it is printed at the RCC and its data is written to both the SRS database and the new DB2 trouble ticket table. A polling process then reads this new table for tickets in operating districts where the Outage System is operational and writes the data to an Oracle table, which triggers the prediction process. As the status information for an outage changes, a process updates the outage feedback DB2 table for customer representative communications with affected customers.
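The polling step described above can be sketched as follows. This is a minimal illustration, not the production code: an in-memory SQLite database stands in for both the mainframe DB2 table and the Oracle table, and every table name, column name and district code is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the DB2 trouble ticket table and the Oracle
# table read by the prediction process. All names here are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE trouble_ticket (
        id INTEGER PRIMARY KEY, district TEXT, picked_up INTEGER DEFAULT 0);
    CREATE TABLE outage_call (ticket_id INTEGER PRIMARY KEY, district TEXT);
""")
db.executemany("INSERT INTO trouble_ticket (id, district) VALUES (?, ?)",
               [(1, "SYR-01"), (2, "BUF-07"), (3, "SYR-02")])


def poll_once(conn, active_districts):
    """Copy unprocessed tickets for active districts; return the ids copied."""
    if not active_districts:
        return []
    placeholders = ",".join("?" for _ in active_districts)
    rows = conn.execute(
        "SELECT id, district FROM trouble_ticket "
        f"WHERE picked_up = 0 AND district IN ({placeholders}) ORDER BY id",
        list(active_districts)).fetchall()
    for ticket_id, district in rows:
        # Hand the ticket to the outage side, then mark it as processed.
        conn.execute("INSERT INTO outage_call VALUES (?, ?)", (ticket_id, district))
        conn.execute("UPDATE trouble_ticket SET picked_up = 1 WHERE id = ?",
                     (ticket_id,))
    conn.commit()
    return [ticket_id for ticket_id, _ in rows]


first = poll_once(db, ["SYR-01", "SYR-02"])   # only districts running Outage
second = poll_once(db, ["SYR-01", "SYR-02"])  # nothing new on the next poll
print(first, second)  # [1, 3] []
```

The ticket for district BUF-07 is left untouched because, in this example, the Outage System is not yet operational there.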

For the new CIS system, the Outage System interface with the new mainframe tables remains the same. But changes are made to the new CIS system so that it writes to the trouble ticket table and reads from the outage feedback table via its API. At this point the SRS system is separate from the new CIS, and a new SRS process polls the trouble ticket table for trouble tickets to print. A third DB2 table, a control table, was added so the SRS process could tell whether the Outage System was active for an operating district and whether it should print the trouble ticket. We would update this control table to turn off the printing when running in parallel is no longer required. This design met the business requirements for supporting the old and new CIS interfaces with a minimum of rework.

But we also had to decide on the communication strategy between the different platforms. As everyone has probably experienced, there is an enormous number of gateway products on the market that seemingly can do anything. It becomes overwhelming trying to compare products and understand the support issues involved.
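The control-table check that lets the SRS process decide whether to print a ticket can be sketched as a small lookup. The row layout, district codes and the fallback for unknown districts are assumptions made for illustration; the paper does not describe the actual columns.

```python
from typing import Dict

# Hypothetical contents of the control table: one row per operating district.
control: Dict[str, Dict[str, bool]] = {
    "SYR-01": {"outage_active": True, "print_ticket": True},    # running in parallel
    "SYR-02": {"outage_active": True, "print_ticket": False},   # parallel run ended
    "ALB-03": {"outage_active": False, "print_ticket": True},   # Outage not implemented
}


def should_print(district: str) -> bool:
    """Decide whether the SRS process prints a ticket for this district."""
    row = control.get(district)
    if row is None:
        # Assumption: a district missing from the table falls back to printing.
        return True
    # Always print where Outage is not active; otherwise honor the print flag.
    return (not row["outage_active"]) or row["print_ticket"]


for district in ("SYR-01", "SYR-02", "ALB-03"):
    print(district, should_print(district))
```

Turning off printing for a district is then a single update to its row, with no change to the SRS process itself.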

NMPC has a strategic direction to reduce the number of gateways in its enterprise, so we wanted to work with a product that would be included in that direction. For the pilot we wanted something we could develop quickly with less consideration given to reliability. We used DB2 Connect, based on ODBC, running on a dedicated NT workstation and were successful in its use for the pilot implementation. Based on stress tests we were able to pull 15,000 calls from the mainframe trouble ticket table in an hour, exceeding our pilot requirements.

Beyond the pilot we needed a solution that would provide reliability and recoverability. A problem with the DB2 Connect approach is its synchronous operation: the Outage polling process updates the mainframe trouble ticket table and has to wait for an acknowledgement before reading the next ticket. Our solution was to use MQSeries, which is already used by other applications within NMPC. This is a commercial messaging product that provides asynchronous operation, so communication between the platforms is not dependent on responses from each other. If the network connection becomes unavailable, the messaging software knows where to pick up when the connection returns, assuring once-only delivery of each message (e.g., a trouble ticket).
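The store-and-forward behavior described above can be illustrated with a small in-memory model. To be clear, this is not the MQSeries API and does not show how that product works internally. It is a sketch, with hypothetical class names, of the general idea: the sender queues messages without waiting, delivery resumes where it left off when the link returns, and the receiver applies each message once.

```python
from collections import deque


class Receiver:
    """Applies each message once, ignoring any message id seen before."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def accept(self, msg_id, body):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(body)


class Sender:
    """Queues messages locally and forwards them whenever the link is up."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.outbox = deque()
        self.next_id = 1
        self.link_up = True

    def put(self, body):
        # Returns immediately; the caller never waits on the other platform.
        self.outbox.append((self.next_id, body))
        self.next_id += 1

    def flush(self):
        """Forward queued messages in order; return how many were sent."""
        sent = 0
        while self.outbox and self.link_up:
            msg_id, body = self.outbox[0]
            self.receiver.accept(msg_id, body)
            self.outbox.popleft()  # remove only after it has been handed over
            sent += 1
        return sent


receiver = Receiver()
sender = Sender(receiver)
sender.link_up = False                 # network connection is unavailable
for ticket in ("T1", "T2", "T3"):
    sender.put(ticket)
print(sender.flush())                  # 0, nothing can be sent yet
sender.link_up = True                  # connection is restored
print(sender.flush())                  # 3
receiver.accept(1, "T1")               # a duplicate delivery is ignored
print(receiver.delivered)              # ['T1', 'T2', 'T3']
```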

To implement this we needed to establish message queues on the mainframe and on an NT workstation and write some application code to read from and write to the queues. This method eliminates the need for polling and uses the network only when a new message needs to be delivered. We haven’t stress tested this approach yet, but vendor benchmarks indicate it is capable of processing 1 million messages per hour.
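The hourly figures quoted in this paper are easier to compare as per-second rates. The sketch below simply converts the three numbers given in the text. Note that the 30,000 calls per hour is the full-system requirement, the 15,000 is the pilot stress-test result, and the 1 million is a vendor benchmark that NMPC had not yet verified, so they are not like-for-like measurements.

```python
rates_per_hour = {
    "prediction engine requirement": 30_000,
    "DB2 Connect pilot stress test": 15_000,
    "MQSeries vendor benchmark": 1_000_000,
}


def per_second(per_hour: float) -> float:
    """Convert an hourly rate to a per-second rate."""
    return per_hour / 3600


for name, rate in rates_per_hour.items():
    print(f"{name}: {per_second(rate):.1f} per second")
```

This prints roughly 8.3, 4.2 and 277.8 messages per second respectively. The pilot result is half of the full-system requirement, while the vendor benchmark is about 33 times that requirement.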

High Availability Approach
Since our implementation plan provides for running parallel systems most of 1999 we have delayed the drill down on HA strategies to Q1 of 1999. However, we have identified some weak links and have some ideas on addressing them.

For several critical processes that run on NT we’re planning on running them under NT server software which integrates better with our corporate System Management Software. Between some custom code and the SMS software, failures can be alerted to the NMPC command center for appropriate action. Additionally, we’re investigating NT clustering technologies. Another area to address is how to detect if any of the MQSeries components have failed for the CIS interface.
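One common way for custom code to detect a failed background process is a heartbeat check: each process records a timestamp at regular intervals, and a monitor flags any process whose last timestamp is too old. The sketch below illustrates that idea only. It is not NMPC's custom code or any System Management Software API, and the class name, process names and 60-second timeout are hypothetical.

```python
import time
from typing import Callable, Dict, List


class Watchdog:
    """Flags background processes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat: Dict[str, float] = {}

    def beat(self, name: str) -> None:
        """Called by a process to report that it is still running."""
        self.last_beat[name] = self.clock()

    def stalled(self) -> List[str]:
        """Names of processes that have not reported within the timeout."""
        now = self.clock()
        return sorted(name for name, t in self.last_beat.items()
                      if now - t > self.timeout_s)


# A fake clock makes the example deterministic.
now = [0.0]
dog = Watchdog(timeout_s=60, clock=lambda: now[0])
dog.beat("ticket_retrieval")
dog.beat("prediction")
now[0] = 45.0
dog.beat("prediction")        # only the prediction process reports again
now[0] = 90.0
print(dog.stalled())          # ['ticket_retrieval']
```

In practice the list returned by `stalled()` would be what gets forwarded as an alert to the command center.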

This project and other projects are looking to solve this problem by integrating with the SMS software. On the UNIX master server side we’re investigating enterprise clustering and database parallel systems solutions. For the network, we’re comfortable with the NMPC network redundancy strategy. Finally, NMPC needs to implement a corporate disaster recovery plan for client/server and web applications, as it has done well for mainframe applications.

Future Plans
With the goal of providing more efficient response and restoration for outage occurrences, along with the ability to provide timely and accurate status information back to the customer, we plan on integration with other systems. Through functionality provided in the Outage product and NMPC customization, we’ll build interfaces with our SCADA, Automated Meter Reading (AMR), Computer-Aided Dispatch, Materials, Work Management and Information Warehouse systems.

A focus of the SCADA and AMR interfaces would be to push the known outage information to the CIS system so customer representatives would have up-to-date information regarding affected customers. When the Outage System is fully implemented we’ll probably rework the CIS interface architecture to eliminate the DB2 tables created to support the scenario described under the Customer System Interface section. The CIS trouble ticket entry process can then write directly to an MQSeries queue, and CIS may be able to get outage feedback data directly from the Outage System. Additionally, we see certain Outage System functions being done via the NMPC Intranet with lightweight clients. We see any opportunity to meet requirements via a Web interface as a major thrust for the future.

© GISdevelopment.net. All rights reserved.