Turning legacy into useable reality
Neil Tansley
Severn Trent Water,Waterworks Road Edgbaston, Birmingham B16 9DD Albert Sarvis Stoner Associates, P.O. Box 86 Carlisle, PA 17013 Introduction Analysis of the Severn Trent Water wastewater data migration project can be viewed as a case study in legacy system migration. The general strategies and methodologies used in this project can be applied to a variety of legacy migration projects; and the details of this migration are quite similar to other projects of its kind. The objective of this paper is to provide an overview of the Severn Trent Water data migration project, the challenges to migration, and the business implications inherent in legacy system migration. Discussion of the initial stages of the migration project require a clear description of the legacy and target systems data models and a thorough mapping of how data was moved between the two models. An emphasis has been placed on retaining mission-critical data, system functionality, and the customized rules found in the legacy system. Following explanation of these specifications, the actual translation stages and the validation of the converted data are described. Project Background Severn Trent Water (STW) is the second largest water utility in the United Kingdom (UK). With an area of 22,000 square kilometers, the STW service area is located in the Midlands of the country, surrounding the City of Birmingham. This massive data conversion project was a key component of a £32 million pound (approx. $54 million) GIS implementation project, which commenced in 1998. The project scope included development and implementation of GIS applications and enhanced business processes for more than 1,000 of STW's 4,500 employees. The scope of the data conversion included both STW's service areas for clean water distribution (41,800 km) and wastewater collection (52,300 km). While the clean water data conversion used mostly paper map sources, STW wastewater mapping was carried out using a proprietary AM/FM package called Thesis, licensed from Oscar Faber Water (UK). The data in the wastewater mapping system had been diligently maintained in Thesis since 1993 by 'agents' of STW and managed by local government organizations. Thesis is a stand-alone, DOS-based GIS that uses Dbase files to hold attribute data. STW and its 'agents' used all modules of the Thesis software program in daily operation. These modules provided view, query, edit, plot and digitizing functionality. The entire STW business relied on Thesis through the complete utilization and integration; this provided a sizable challenge for system migration from volumetric, technical, and business-process standpoints. Project Challenges There were many unique challenges to the Thesis migration project. One notable challenge was the existence of stand-alone Thesis systems originally maintained by 68 individual 'agencies' and seven in-house STW operational sites. Translating from many separate sources, even though they resided the same system, required finding the subtle differences in the way data was maintained. These differences included use of non-valid values, liberal use of a 'remarks' field to store key attributes, and shortcuts in digitizing. A Several non-alphanumeric database fields were also a unique aspect of this project. All coordinate data, and many other graphic description fields, used a form of binary storage that required specialized coding from the software vendor. Subsequent to decrypting the data, interpretation of the resulting values also proved difficult. A detailed discussion of these challenges is included in the section on data modeling. One of the more common challenges faced in any legacy system migration is the existence of proprietary, undocumented data models. Thesis data was transferred from one Thesis machine to another through an exported/imported ASCII file only readable by Thesis. Additionally, migration of this data was carried out using the raw database files and not the native transfer files. Crucial first steps included discovering in which tables the attributes, graphic coordinates, and valid value lists were stored. Lack of a documented data model meant that all of these tasks were performed blindly, requiring a type of reverse engineering. Data model interpretation will be discussed later in the paper with Data Model Development and Data Mapping. Data Freeze and Change Tracking An important business consideration for a legacy system migration is what to do once data has been handed off to the migration vendor. This data must essentially be frozen at a point in time. Business does not stop during this period and changes to the system begin to accumulate. For traditional migration projects, change orders that arose during migration either caused a tremendous backlog of work, or were literally hand-drawn onto paper plots and then set aside until data was ready for updates. In an ideal migration project, the ability to carry out a database comparison or use data in a Records Management System would provide better tools to keep track of normal changes. By comparing the legacy database to the target database, differences could be detected and updated in the target system. In many legacy migration projects, however, this comparison takes more effort than re-creating the data from scratch. . Another method that can be considered is to use date stamp attributes on the legacy system to extract changes that occurred after a defined date, and add those changes to the target system following migration. Unfortunately, this option rarely exists in true legacy migrations, and the STW project was no exception. The STW solution sought to reduce the amount of data tracking necessary by minimizing the time between freeze and delivery of migrated data. Since the real issue concerning change tracking is the amount of data that accumulates over the elapsed time between data collection and migration, arranging for quick turn around almost negates the need for change tracking. The Thesis migration project adapted well to this technique because so many separate systems existed. The files associated with the many users of Thesis were relatively small and constituted neatly organized batches of data for migration. This allowed the project team to work through the data within a few weeks of the data freeze and delivery. More importantly, the small amount of change orders that were generated during this short time could be held until data was redelivered in the target system. At the time of delivery, changes could be handed over to the new centralized Records Management Centers for updating. Data Modeling Logical Data Model Creation An obvious predecessor to the migration of the Legacy Thesis data was the selection and development of a target database/GIS in which to migrate the data. STW's objective was not to simply migrate their Thesis data into a more modern software package but to improve it. The creation of the target system is referred to as Data Modeling. In the early stages of the UADMS project, a Logical Data Model (LDM) was developed to encompass All of the mission-critical wastewater information that was to be contained in the GIS, and to predict the future business needs relying on that data. Logical data models are built without regard to any technical constraints that may exist in a specific database/GIS. A LDM is essentially a representation of all business information and the relationships found within this data. A primary goal of the migration is to have the database/GIS product conform to the needs of the business so that the business does not risk being forced to compromise their operations based on product limitations. The STW wastewater LDM was initially created using the Thesis GIS as the base to build from. All critical features, attributes, and their valid values found in Thesis were carried over to the LDM. Additional assets, attributes, map features, and their relationships were added to this LDM based on gaps found between Thesis and the needs of the business. Thesis users, wastewater system experts, and business stakeholders were all included in this stage of data modeling. Figure 1 shows a small section of an STW wastewater Entity Relationship Diagram (ERD). Diagrams were created to depict node, line, and area features. In Figure 1, entities that refer to STW assets are found within boxes; and arrows connecting these entities symbolize the relationships between these assets. This section of the ERD shows the relationships between Sewer Pipes and several node entities. These diagrams are very useful tools for understanding a Logical Data Model. ![]() (Figure 1) Physical Data Model Creation While the Logical Data Model (LDM) depicted all the information essential to STW's wastewater asset management needs, it was just the first step to a final target database/GIS. The process of migrating the STW wastewater L D M to a Physical Data Model (PDM) involved two tasks: creating a physical database model and implementing relationships through the use of a rulebase. First, the actual Oracle database that was to be used with the UADMS GIS applications was designed by creating tables that contained the spatial and attribute information for each entity in the LDM. These tables were designed based on pre-defined LDM specifications such as attribute data types (character, numeric, etc...), widths, and precision. The tables also contained lists of valid values for those attributes. For the majority of STW entities, a one-to-one relationship existed between LDM entities and PDM tables. In STW's case, there are actually two PDM tables working in conjunction for each entity; the business table contains the attributes and an SDE table contains the geographic coordinates. These unique entity types are referred to in the GIS as Feature Types. A few assets, such as inlets and outlets to a pumping station, were represented as two unique entities in the LDM but could be joined into one table (Feature Type) in the PDM due to matching attributes. Distinction between these two features could be made based on attribute values. These attribute distinctions are referred to in the GIS as Subtypes. Subtypes can also be found within tables storing only one unique LDM entity. Gravity Sewers, for example, were all contained in one PDM table but were broken into multiple subtypes that represented uniqueness in ownership and purpose. (ex. Public-Foul-Gravity Sewer). Subtyping becomes important in the PDM because the LDM describes the relationships at the subtype level. Rulebase Development Coinciding with the Oracle database, the UADMS PDM has data relationship rules defined and controlled by a Rulebase. UADMS uses the ArcFM Rulebase, which works in conjunction with the Oracle database to control many aspects of wastewater data. All of the data validation checks native to Thesis that were deemed useful were built into the UADMS PDM either through the database or rulebase. The functionality of the Thesis validation checks was not only migrated but also vastly improved in UADMS. Thesis validation included checks for null and invalid attribute values where they were not already constrained by valid values called code lists. Connectivity restrictions were built into the Rulebase to control and manage certain situations. For example, certain node assets can connect to not only a Feature Type like Gravity Sewer but also to the Subtype Private Foul Gravity Sewer. The rulebase controls these relationships, allowing connectivity to be established between nodes that have a pre-defined relationship in the rulebase, and disallowing connectivity between features that have not been defined. The Rulebase, in fact, goes even farther by restricting the maximum number of allowable connections between features. The rulebase also defines many other rules governing such things as display, add routines, date stamping, and parent/child polygon relationships. Data Modeling Summary The Data Modeling effort has been shown to provide a comprehensive migration of Mission-Critical Legacy data and validation rules. By creating a Logical Data Model, STW was able to eliminate the business shortfalls of their Legacy system from the new system, as well as design a system that could be expanded in the future. The addition of a rulebase to the data modeling procedure has vastly improved the functionality of the Legacy system and will improve the quality of data that is added in the future. Data Mapping With an approved target data model in place, the next stage in migrating the legacy wastewater GIS was determining how to move the Thesis data into the UADMS PDM. A methodology called Data Mapping was used to step through all the Thesis dbase tables and determine what columns were needed for migration, and how and where to fit them into the UADMS data model. This mapping process can be discussed in two parts: mapping of attribute data and mapping of graphic data. The Thesis Data Mapping document is a spreadsheet containing multiple worksheets that are required to fully map all the tables in Thesis to UADMS. A description of this document and the individual worksheets follows. Thesis Data (Tables Summary) The Tables Summary was done before any other mapping could take place. This table lists all known Thesis dBase tables (.dbf) and the columns within those tables. The worksheet also notes if a specific Thesis table has any information relevant to migration. Each individual table and column that was mapped (translated) to UADMS is noted in the Comments Field of this worksheet. A sample of this worksheet (Table 1) is shown below.
(Table 1)
Attribute Data Mapping As stated in the PDM discussion, the target data model was built with Thesis in mind and most Thesis features have a one-to-one correspondence with UADMS tables. Where UADMS has separate tables for each feature, Thesis does not. As stated, Thesis is a manhole-based system and, as such, stores all node features as manholes. Manholes in Thesis hold attributes defining what type of node feature they really are. Within the data mapping effort, this meant that one Thesis table "manhole" could be mapped to several UADMS tables. This second worksheet compares the Target database (UADMS) to the Source database (Thesis) and shows how and where Thesis data is translated into ArcFM. The attribute matches can be seen across the length of any given row in this worksheet. Where attributes are populated by valid values, references are given to the worksheet that maps the valid value relationships between Thesis and ArcFM. Rules governing the translation of these attributes can be found in the comment field. Where no Thesis data existed that would correspond to a UADMS attribute, those fields remain either null or a default value was used to populate that field. A simplified example of this worksheet (Table 2) is listed below.
(Table 2)
Finally, Thesis defines specific lists of attribute values (code fields) that can be related to the valid values found in UADMS. Since the attributes in Thesis were mirrored in UADMS, these Thesis code fields have been mapped to UADMS valid value tables. This worksheet compares lists of valid values between UADMS attributes and Thesis Attributes. UADMS stores these valid values in Valid Value Tables (VVTs) and Thesis stores valid values in Code Fields. In many cases UADMS, which has more attributes than Thesis, has multiple VVTs that are compared to (derive from) single Thesis Code Fields. A description and sample of this data mapping worksheet is listed in Table 3.
(Table 3)
Graphic Data Mapping Several key Thesis tables contained all of the coordinate data needed for features to be represented graphically within the Thesis module, GeoThesis. Thesis is a manhole-based system, and the coordinates of the manholes define most graphics. Pipes are drawn based on the location of manholes at their origin and destination. All manholes have x and y coordinates stored within a single Thesis table. Curved pipes and area feature geographic definitions are held in two separate tables that reference a large point list file of coordinates. Thesis also stored pipe connections at locations other than where manholes existed. These unique types of connections were stored as a specific distance along a pipe and allowed the Thesis system to link pipes without recording a pipe fitting or manhole. All of the tables, relationships between those tables, and unique methods used to store graphic information were recorded in the data mapping documentation. Data Translation / Validation Graphic Extraction The initial translation involved direct coordinate extraction. The majority of point and line features within Thesis were easily translated directly into their exact coordinate locations based on the values stored in the main Thesis coordinate table. The base map information used within Thesis is the same source used in UADMS and therefore allowed a side-by-side comparison between the two systems. This comparison revealed a precise coordinate translation. The initial extraction of graphic data involved simply translating the entire set of manhole coordinates and the pipes that used those coordinates. Graphic Validation In addition to standard manhole locations and straight sections of pipe, additional features such as curved pipes, polygons, pipe connections drawn in Thesis without manholes, and label positions required translation. These features were translated as a second step and required extensive coding to interpret graphic information from Thesis that was either encrypted and required custom translation algorithms or was stored across many tables. All additional graphic features were added during this stage. Any remaining feature that was not translated was written to an error log file for post translation review. Attribute Translation All graphic features were translated with their unique ID's from Thesis. This allowed all associated attributes, which were linked by these ID's, to be associated with the correct point or line feature. Attributes were translated according to the data mapping work done between the Thesis and UADMS data model. This process was mainly a one to one mapping with a few UADMS fields concatenated or modified from Thesis. Any deviation from the data mapping rules, such as invalid or missing Thesis values, were written to an error log file for post translation review. Import to ArcFM Attribute data was translated initially into MS Access and Graphic data was extracted and validated in ESRI Shape files. The last step in translation was to move this data into the target system. Access tables were moved into Oracle and Shape files translated to SDE files. Data could then be delivered to the Severn Trent Water in the UADMS environment. Conclusions This paper has discussed several critical considerations in the translation of Legacy data. An initial review of the source system is important in order to reveal the mission critical elements necessary in the source data model. Any data validation rules used in the source system must also be studied and carried into the target system. Shortfalls between the Legacy system and business requirements are also exposed during this initial review. Involving all stakeholders in the development of a new data model will ensure a complete and useful target system. With the development of a target data model, a thorough mapping of data between the two systems is crucial. A data map also provides a record of the translation rules and a reference for validating the success of the translation. By using the methodology outlined in this paper a user can ensure a smooth, comprehensive and beneficial move from Legacy to Reality. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| © GISdevelopment.net. All rights reserved. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||