GISdevelopment.net ---> GITA 2001 ---> Direction for Data

Turning legacy into useable reality

Neil Tansley
Severn Trent Water,Waterworks Road
Edgbaston, Birmingham B16 9DD

Albert Sarvis
Stoner Associates, P.O. Box 86
Carlisle, PA 17013


Introduction
Analysis of the Severn Trent Water wastewater data migration project can be viewed as a case study in legacy system migration. The general strategies and methodologies used in this project can be applied to a variety of legacy migration projects; and the details of this migration are quite similar to other projects of its kind. The objective of this paper is to provide an overview of the Severn Trent Water data migration project, the challenges to migration, and the business implications inherent in legacy system migration. Discussion of the initial stages of the migration project require a clear description of the legacy and target systems data models and a thorough mapping of how data was moved between the two models. An emphasis has been placed on retaining mission-critical data, system functionality, and the customized rules found in the legacy system. Following explanation of these specifications, the actual translation stages and the validation of the converted data are described.

Project Background
Severn Trent Water (STW) is the second largest water utility in the United Kingdom (UK). With an area of 22,000 square kilometers, the STW service area is located in the Midlands of the country, surrounding the City of Birmingham. This massive data conversion project was a key component of a £32 million pound (approx. $54 million) GIS implementation project, which commenced in 1998. The project scope included development and implementation of GIS applications and enhanced business processes for more than 1,000 of STW's 4,500 employees. The scope of the data conversion included both STW's service areas for clean water distribution (41,800 km) and wastewater collection (52,300 km).

While the clean water data conversion used mostly paper map sources, STW wastewater mapping was carried out using a proprietary AM/FM package called Thesis, licensed from Oscar Faber Water (UK). The data in the wastewater mapping system had been diligently maintained in Thesis since 1993 by 'agents' of STW and managed by local government organizations. Thesis is a stand-alone, DOS-based GIS that uses Dbase files to hold attribute data. STW and its 'agents' used all modules of the Thesis software program in daily operation. These modules provided view, query, edit, plot and digitizing functionality. The entire STW business relied on Thesis through the complete utilization and integration; this provided a sizable challenge for system migration from volumetric, technical, and business-process standpoints.

Project Challenges
There were many unique challenges to the Thesis migration project. One notable challenge was the existence of stand-alone Thesis systems originally maintained by 68 individual 'agencies' and seven in-house STW operational sites. Translating from many separate sources, even though they resided the same system, required finding the subtle differences in the way data was maintained. These differences included use of non-valid values, liberal use of a 'remarks' field to store key attributes, and shortcuts in digitizing. A Several non-alphanumeric database fields were also a unique aspect of this project. All coordinate data, and many other graphic description fields, used a form of binary storage that required specialized coding from the software vendor. Subsequent to decrypting the data, interpretation of the resulting values also proved difficult. A detailed discussion of these challenges is included in the section on data modeling.

One of the more common challenges faced in any legacy system migration is the existence of proprietary, undocumented data models. Thesis data was transferred from one Thesis machine to another through an exported/imported ASCII file only readable by Thesis. Additionally, migration of this data was carried out using the raw database files and not the native transfer files. Crucial first steps included discovering in which tables the attributes, graphic coordinates, and valid value lists were stored. Lack of a documented data model meant that all of these tasks were performed blindly, requiring a type of reverse engineering. Data model interpretation will be discussed later in the paper with Data Model Development and Data Mapping.

Data Freeze and Change Tracking
An important business consideration for a legacy system migration is what to do once data has been handed off to the migration vendor. This data must essentially be frozen at a point in time. Business does not stop during this period and changes to the system begin to accumulate. For traditional migration projects, change orders that arose during migration either caused a tremendous backlog of work, or were literally hand-drawn onto paper plots and then set aside until data was ready for updates.

In an ideal migration project, the ability to carry out a database comparison or use data in a Records Management System would provide better tools to keep track of normal changes. By comparing the legacy database to the target database, differences could be detected and updated in the target system. In many legacy migration projects, however, this comparison takes more effort than re-creating the data from scratch. . Another method that can be considered is to use date stamp attributes on the legacy system to extract changes that occurred after a defined date, and add those changes to the target system following migration. Unfortunately, this option rarely exists in true legacy migrations, and the STW project was no exception.

The STW solution sought to reduce the amount of data tracking necessary by minimizing the time between freeze and delivery of migrated data. Since the real issue concerning change tracking is the amount of data that accumulates over the elapsed time between data collection and migration, arranging for quick turn around almost negates the need for change tracking. The Thesis migration project adapted well to this technique because so many separate systems existed. The files associated with the many users of Thesis were relatively small and constituted neatly organized batches of data for migration. This allowed the project team to work through the data within a few weeks of the data freeze and delivery. More importantly, the small amount of change orders that were generated during this short time could be held until data was redelivered in the target system. At the time of delivery, changes could be handed over to the new centralized Records Management Centers for updating.

Data Modeling

Logical Data Model Creation
An obvious predecessor to the migration of the Legacy Thesis data was the selection and development of a target database/GIS in which to migrate the data. STW's objective was not to simply migrate their Thesis data into a more modern software package but to improve it. The creation of the target system is referred to as Data Modeling. In the early stages of the UADMS project, a Logical Data Model (LDM) was developed to encompass

All of the mission-critical wastewater information that was to be contained in the GIS, and to predict the future business needs relying on that data. Logical data models are built without regard to any technical constraints that may exist in a specific database/GIS. A LDM is essentially a representation of all business information and the relationships found within this data.

A primary goal of the migration is to have the database/GIS product conform to the needs of the business so that the business does not risk being forced to compromise their operations based on product limitations. The STW wastewater LDM was initially created using the Thesis GIS as the base to build from. All critical features, attributes, and their valid values found in Thesis were carried over to the LDM. Additional assets, attributes, map features, and their relationships were added to this LDM based on gaps found between Thesis and the needs of the business. Thesis users, wastewater system experts, and business stakeholders were all included in this stage of data modeling. Figure 1 shows a small section of an STW wastewater Entity Relationship Diagram (ERD). Diagrams were created to depict node, line, and area features. In Figure 1, entities that refer to STW assets are found within boxes; and arrows connecting these entities symbolize the relationships between these assets. This section of the ERD shows the relationships between Sewer Pipes and several node entities. These diagrams are very useful tools for understanding a Logical Data Model.


(Figure 1)

Physical Data Model Creation
While the Logical Data Model (LDM) depicted all the information essential to STW's wastewater asset management needs, it was just the first step to a final target database/GIS. The process of migrating the STW wastewater L D M to a Physical Data Model (PDM) involved two tasks: creating a physical database model and implementing relationships through the use of a rulebase. First, the actual Oracle database that was to be used with the UADMS GIS applications was designed by creating tables that contained the spatial and attribute information for each entity in the LDM. These tables were designed based on pre-defined LDM specifications such as attribute data types (character, numeric, etc...), widths, and precision. The tables also contained lists of valid values for those attributes.

For the majority of STW entities, a one-to-one relationship existed between LDM entities and PDM tables. In STW's case, there are actually two PDM tables working in conjunction for each entity; the business table contains the attributes and an SDE table contains the geographic coordinates. These unique entity types are referred to in the GIS as Feature Types. A few assets, such as inlets and outlets to a pumping station, were represented as two unique entities in the LDM but could be joined into one table (Feature Type) in the PDM due to matching attributes. Distinction between these two features could be made based on attribute values. These attribute distinctions are referred to in the GIS as Subtypes. Subtypes can also be found within tables storing only one unique LDM entity. Gravity Sewers, for example, were all contained in one PDM table but were broken into multiple subtypes that represented uniqueness in ownership and purpose. (ex. Public-Foul-Gravity Sewer). Subtyping becomes important in the PDM because the LDM describes the relationships at the subtype level.

Rulebase Development
Coinciding with the Oracle database, the UADMS PDM has data relationship rules defined and controlled by a Rulebase. UADMS uses the ArcFM Rulebase, which works in conjunction with the Oracle database to control many aspects of wastewater data. All of the data validation checks native to Thesis that were deemed useful were built into the UADMS PDM either through the database or rulebase. The functionality of the Thesis validation checks was not only migrated but also vastly improved in UADMS. Thesis validation included checks for null and invalid attribute values where they were not already constrained by valid values called code lists.

Connectivity restrictions were built into the Rulebase to control and manage certain situations. For example, certain node assets can connect to not only a Feature Type like Gravity Sewer but also to the Subtype Private Foul Gravity Sewer. The rulebase controls these relationships, allowing connectivity to be established between nodes that have a pre-defined relationship in the rulebase, and disallowing connectivity between features that have not been defined. The Rulebase, in fact, goes even farther by restricting the maximum number of allowable connections between features. The rulebase also defines many other rules governing such things as display, add routines, date stamping, and parent/child polygon relationships.

Data Modeling Summary
The Data Modeling effort has been shown to provide a comprehensive migration of Mission-Critical Legacy data and validation rules. By creating a Logical Data Model, STW was able to eliminate the business shortfalls of their Legacy system from the new system, as well as design a system that could be expanded in the future. The addition of a rulebase to the data modeling procedure has vastly improved the functionality of the Legacy system and will improve the quality of data that is added in the future.

Data Mapping
With an approved target data model in place, the next stage in migrating the legacy wastewater GIS was determining how to move the Thesis data into the UADMS PDM. A methodology called Data Mapping was used to step through all the Thesis dbase tables and determine what columns were needed for migration, and how and where to fit them into the UADMS data model. This mapping process can be discussed in two parts: mapping of attribute data and mapping of graphic data.

The Thesis Data Mapping document is a spreadsheet containing multiple worksheets that are required to fully map all the tables in Thesis to UADMS. A description of this document and the individual worksheets follows.

Thesis Data (Tables Summary)
The Tables Summary was done before any other mapping could take place. This table lists all known Thesis dBase tables (.dbf) and the columns within those tables. The worksheet also notes if a specific Thesis table has any information relevant to migration. Each individual table and column that was mapped (translated) to UADMS is noted in the Comments Field of this worksheet. A sample of this worksheet (Table 1) is shown below.

(Table 1)
Thesis tableName column name data type width   Comments
MANHOLE Manhole Details( Holds the core data relating to Manholes.)
MANHOLE ACCESSTYPE VARCHAR2 1   Mapped
MANHOLE CHAM_SIZE1 VARCHAR2 4   Mapped
MANHOLE CHAM_SIZE2 VARCHAR2 4   Mapped
MANHOLE COVERLEVE L VARCHAR2 7   Mapped
MANHOLE COVERSHAPE VARCHAR2 1   Mapped
MANHOLE COVER_DUTY VARCHAR2 1   Mapped
MANHOLE ENTRY_TYPE VARCHAR2 1   Mapped
MANHOLE GEN_NUMBER VARCHAR2 10   link to graphics and connected pipes
MANHOLE MANHOLENO_ VARCHAR2 10   Mapped
MANHOLE MANHOLEUSE VARCHAR2 1   Mapped
MANHOLE MANSTATUS_ VARCHAR2 2   Mapped
MANHOLE MATERIAL__ VARCHAR2 1   Mapped
MANHOLE OPEN_SIZE1 VARCHAR2 4   Mapped
MANHOLE OTHER_TYPE VARCHAR2 4   Mapped
MANHOLE OTHER_TYPE VARCHAR2 7   not used
MANHOLE QUALITY___ VARCHAR2 1   not used
MANHOLE SHAF_SIZE2 VARCHAR2 4   Mapped
MANHOLE SLAB_TYPE_ VARCHAR2 1   Mapped
MANHOLE STATUS____ VARCHAR2 1   Mapped
MANHOLE TYPE______ VARCHAR2 2   used to determine feature type---mapped
MANHOLE YEAR_BUILT VARCHAR2 4   Mapped
MANHSURV Manhole Survey Details( Holds manhole condition details)
MANHSURV BENCHING__ VARCHAR2 1   Mapped
MANHSURV CHAMB_COND VARCHAR2 1   Mapped
MANHSURV COVER_COND VARCHAR2 1   Mapped
MANHSURV DEPTHSILT VARCHAR2 3   Mapped
MANHSURV DIST1_____ VARCHAR2 10   not used
MANHSURV DIST2_____ VARCHAR2 10   not used
MANHSURV DIST3_____ VARCHAR2 10   not used
MANHSURV GEN_NUMBER VARCHAR2 10   link to graphics
and connected pipes

Attribute Data Mapping
As stated in the PDM discussion, the target data model was built with Thesis in mind and most Thesis features have a one-to-one correspondence with UADMS tables. Where UADMS has separate tables for each feature, Thesis does not. As stated, Thesis is a manhole-based system and, as such, stores all node features as manholes. Manholes in Thesis hold attributes defining what type of node feature they really are. Within the data mapping effort, this meant that one Thesis table "manhole" could be mapped to several UADMS tables.

This second worksheet compares the Target database (UADMS) to the Source database (Thesis) and shows how and where Thesis data is translated into ArcFM. The attribute matches can be seen across the length of any given row in this worksheet. Where attributes are populated by valid values, references are given to the worksheet that maps the valid value relationships between Thesis and ArcFM. Rules governing the translation of these attributes can be found in the comment field. Where no Thesis data existed that would correspond to a UADMS attribute, those fields remain either null or a default value was used to populate that field. A simplified example of this worksheet (Table 2) is listed below.

(Table 2)
ArcFM Thesis
TABLE_
NAME
COLUMN_
NAME
V V T TABLE_
NAME
COLUMN_
NAME
CODE_
FIELD
TUA132_
GRAVITY_
SEWERS
TP_INTERNAL_
CONDITION_
GRADE
INTERNAL_
CONDITION.VVT
PIPE INSPECTED_  
TUA132_
GRAVITY_
SEWERS
TP_LINING_
MATERIAL
SEWER_LINING_
MATERIAL.VVT
PIPE LINING_MAT LINING_MAT
TUA132_
GRAVITY_
SEWERS
TP_MATERIAL SEWER_
MATERIAL_
TYPES.VVT
PIPE PIPE_MATL_ PIPE_MATL
TUA132_
GRAVITY_
SEWERS
TP_PURPOSE SEWER_
PURPOSE.VVT
PIPE SEWERUSE__ SEWERUSE
TUA132_
GRAVITY_
SEWERS
TP_SEWER_
SHAPE
SEWER_PIPE_
SHAPE.VVT
PIPE PIPE_SHAPE PIPE_SHAPE
TUA132_
GRAVITY_
SEWERS
TP_STRATEGIC_
CATEGORY
STRATEGIC_
CATEGORY.VVT
PIPE CATEGORY  

Finally, Thesis defines specific lists of attribute values (code fields) that can be related to the valid values found in UADMS. Since the attributes in Thesis were mirrored in UADMS, these Thesis code fields have been mapped to UADMS valid value tables. This worksheet compares lists of valid values between UADMS attributes and Thesis Attributes. UADMS stores these valid values in Valid Value Tables (VVTs) and Thesis stores valid values in Code Fields. In many cases UADMS, which has more attributes than Thesis, has multiple VVTs that are compared to (derive from) single Thesis Code Fields. A description and sample of this data mapping worksheet is listed in Table 3.

(Table 3)
Mhcodefi.stw
CODE_
FIELD
CODE_
VALUE
CODE_
EXPND
V V T Code Description
MANHOLEUSE C Combined SEWER_
PURPOSE.VVT
C Combined
MANHOLEUSE F Foul SEWER_
PURPOSE.VVT
F Foul
MANHOLEUSE S Surface
Water
SEWER_
PURPOSE.VVT
S Surface
Water
MANHOLEUSE W Culverted
Watercourse
 
MANHOLEUSE D Disposal
Main
SEWER_
PURPOSE.VVT
E Final
Effluent
      SEWER_
PURPOSEn VVT
L Sludge
MANSTATUS_ 24 Former Section
24 applies
ADOPTION_
STATUS.VVT
S24 S24-Section
24 -
MANSTATUS_ CW Culverted
Water
Course
MANSTATUS_ HD Highway
Drain
ADOPTION_
STATUS.VVT
HD Highway
drain -
MANSTATUS_ PR Private ADOPTION_
STATUS.VVT
none Not
adopted
MANSTATUS_ PU Public
MANSTATUS_ 18 Former
Section 18/104
ADOPTION_
STATUS.VVT
S104 S104- Section 104 -
MANSTATUS_ AB Abandoned
      ADOPTION_
STATUS.VVT
S18 S18-
Section 18 -
MATERIAL__ B Common
brick
CONSTR_
MAT.VVT
B Common
Brick
MATERIAL__ I Insitu
concrete
CONSTR_
MAT.VVT
I Insitu
Concrete
MATERIAL__ O Other
(use REMARKS)
CONSTR_
MAT.VVT
U Unknown
MATERIAL__ P Precast
concrete
CONSTR_
MAT.VVT
P Precast
Concrete
MATERIAL__ S Segmental CONSTR_
MAT.VVT
S Segmental
PIPE_MATL_ BLANK SEWER_
MATERIAL_
TYPES.VVT
U Unknown  
PIPE_MATL_ AC Asbestos
Cement
SEWER_
MATERIAL_
TYPES.VVT
AC Asbestos
Cement
PIPE_
MATL_
AK Alkathene SEWER_
MATERIAL_
TYPES.VVT
AK Alkathene
PIPE_MATL_ BR Brick SEWER_
MATERIAL_
TYPES.VVT
BR Brick

Graphic Data Mapping
Several key Thesis tables contained all of the coordinate data needed for features to be represented graphically within the Thesis module, GeoThesis. Thesis is a manhole-based system, and the coordinates of the manholes define most graphics. Pipes are drawn based on the location of manholes at their origin and destination. All manholes have x and y coordinates stored within a single Thesis table.

Curved pipes and area feature geographic definitions are held in two separate tables that reference a large point list file of coordinates. Thesis also stored pipe connections at locations other than where manholes existed. These unique types of connections were stored as a specific distance along a pipe and allowed the Thesis system to link pipes without recording a pipe fitting or manhole. All of the tables, relationships between those tables, and unique methods used to store graphic information were recorded in the data mapping documentation.

Data Translation / Validation

Graphic Extraction
The initial translation involved direct coordinate extraction. The majority of point and line features within Thesis were easily translated directly into their exact coordinate locations based on the values stored in the main Thesis coordinate table. The base map information used within Thesis is the same source used in UADMS and therefore allowed a side-by-side comparison between the two systems. This comparison revealed a precise coordinate translation. The initial extraction of graphic data involved simply translating the entire set of manhole coordinates and the pipes that used those coordinates.

Graphic Validation
In addition to standard manhole locations and straight sections of pipe, additional features such as curved pipes, polygons, pipe connections drawn in Thesis without manholes, and label positions required translation. These features were translated as a second step and required extensive coding to interpret graphic information from Thesis that was either encrypted and required custom translation algorithms or was stored across many tables. All additional graphic features were added during this stage. Any remaining feature that was not translated was written to an error log file for post translation review.

Attribute Translation
All graphic features were translated with their unique ID's from Thesis. This allowed all associated attributes, which were linked by these ID's, to be associated with the correct point or line feature. Attributes were translated according to the data mapping work done between the Thesis and UADMS data model. This process was mainly a one to one mapping with a few UADMS fields concatenated or modified from Thesis. Any deviation from the data mapping rules, such as invalid or missing Thesis values, were written to an error log file for post translation review.

Import to ArcFM
Attribute data was translated initially into MS Access and Graphic data was extracted and validated in ESRI Shape files. The last step in translation was to move this data into the target system. Access tables were moved into Oracle and Shape files translated to SDE files. Data could then be delivered to the Severn Trent Water in the UADMS environment.

Conclusions
This paper has discussed several critical considerations in the translation of Legacy data. An initial review of the source system is important in order to reveal the mission critical elements necessary in the source data model. Any data validation rules used in the source system must also be studied and carried into the target system. Shortfalls between the Legacy system and business requirements are also exposed during this initial review. Involving all stakeholders in the development of a new data model will ensure a complete and useful target system. With the development of a target data model, a thorough mapping of data between the two systems is crucial. A data map also provides a record of the translation rules and a reference for validating the success of the translation. By using the methodology outlined in this paper a user can ensure a smooth, comprehensive and beneficial move from Legacy to Reality.

© GISdevelopment.net. All rights reserved.