Avoiding data de-evolution
Kevin Peters
Advantica Stoner
P.O. Box 86
Carlisle, PA 17013-0086

Abstract

Acquisition of accurate, reliable data is a major investment in any geospatial system. Once data has been acquired through an initial conversion effort, maintaining its accuracy is key to continued success. After conversion or migration into a new geospatial system, the data accuracy has been accepted and is typically well known, because conversion efforts are performed to very specific accuracy criteria. It is important for utilities to recognize that, from that point forward, data evolution takes many paths over time, from the initial implementation of new GIS maintenance tools and procedures to what hopefully becomes a reliable maintenance process. Without a watchful eye on that evolution, the initial known quality standard can quickly diminish. This paper explores the evolution geospatial data undergoes within and across the lifespan of ever-evolving geospatial systems and maintenance processes. In addition, it examines data audit techniques that can help ensure data does not ‘regress’ over time from a known acceptable quality standard. Finally, it describes how some of these techniques were applied at a large utility.

Introduction

When utilities decide to convert their asset data into a geospatial system, they typically go to great lengths to ensure the data adheres to specific quality standards that meet their business needs. This begins with setting up data acceptance criteria that require vendors to provide data that meets quantifiable quality standards. With the criteria in place, a QA/QC process measures the delivered data and ensures it truly meets the required standards. This diligence and care is understandable: not only does the GIS rely on good data to function, but many other applications depend on the geospatial data as well. In fact, geospatial data is becoming more and more critical as it begins to function as an enterprise-wide repository.
A single type of problem, such as poor network connectivity, will impact a multitude of other applications, from network/hydraulic modeling to outage management. Because critical applications depend on the geospatial data, and because data conversion is typically the most costly aspect of a geospatial system implementation, especially where field collection is required, utilities want to safeguard that asset as they would any other large investment. Part of that safeguard is applied in the form of maintenance, where a team of people keeps the data up to date. However, in protecting this critical asset via the maintenance process, it is important to understand that data evolves over time. If not monitored carefully, data can regress, and the quality standard once known and measured during the initial conversion may evaporate. To ensure data does not regress over time, it is critical to recognize the following:
Data Evolution

Geospatial data undergoes an evolution from an initial conversion stage, where data is converted to a digital format, to an early maintenance period, and eventually to a routine maintenance mode. At any point along that evolutionary path, data is subject to mutation via potential migration to new systems. In addition, data quality is subjected to forces all along the evolutionary path that may degrade it.

Initial Conversion

During the initial conversion stage, geospatial data is converted into a digital format from unintelligent hard-copy or CAD source information or, in some cases, via field collection. Utilities typically establish acceptance criteria for the initial digital conversion effort. The criteria typically take the form of error percentages by category, e.g., 98% of all converted attributes match the source data, 100% of all assets meet connectivity rules, etc. Converted data is then measured against the acceptance criteria/standard during the QA/QC check. Assuming these two prerequisites are in place, the quality of the data is known immediately when it is ‘born’ into the GIS. For example, the attribute accuracy was measured to be 98.5%, 100% of the data complied with connectivity rules, etc. Once the data has been converted and measured, maintaining that known accuracy post conversion becomes the challenge.

Early Maintenance

Once converted, geospatial data is updated by a maintenance process in its infancy. There is typically a feeling-out period for the new system, tools, and records maintenance process. This early maintenance mode is a particularly dangerous period where, if not watched, data quality could begin to regress. Even with the best-established processes and training, it should be recognized that new records maintenance technicians are getting used to the new tools and procedures. Where it is a utility’s initial GIS implementation, the GIS itself is a new concept.
Records maintenance staff, especially those with a CAD background, may not grasp that geospatial data is more than a picture; it is also a supporting database with network intelligence, etc. They may not grasp the importance of snapping connected pipes to gain network connectivity, or that annotation is generated by the database and should not be placed or maintained in a free-form text feature. Even those familiar with a legacy system will need to become familiar with the new system’s data model, business rules, and tools. Any of these circumstances has the potential to degrade the quality that was known when the data had just been converted.

Routine Maintenance

At some point the maintenance staff becomes comfortable with the new system, and data maintenance leaves the infancy stage and becomes a routine endeavor. While data appears to be relatively safe during this stage, quality can still be in jeopardy. Records maintenance staff are resourceful and will likely develop methods that are not covered in documented procedures or specifications. Some of those ‘workarounds’ will be helpful, some will be harmless, and some will have a negative effect. Some utilities outsource during this stage, and the introduction of new people to the process can be particularly dangerous to the quality of data, especially if there are undocumented specifications. A typical scenario might look like this: John the engineer always symbolizes valves with a circle on his sketches and then jots down a note about the kind of valve. Unfortunately, a standard symbol is not part of the official documentation, and most engineers show valves with bow tie symbols (circles are used for something else). Jane, the records maintenance technician who inputs all of John’s work, has worked with John long enough that she knows what he means.
Unfortunately, Jane is the only one who knows this, and when a new technician or vendor gets involved, that ‘undocumented’ specification is typically not conveyed.

Migration

Often, just when maintenance becomes routine, new and better technology comes along and data must undergo a migration to a new system. If done properly, migration typically does not jeopardize quality, as the core data should not change; i.e., if your assets were in the right location, with the right attributes and proper connectivity in the old system, then they ought to be able to be migrated with the same characteristics into the new system. While the migration effort itself shouldn’t have a big impact, problems in the legacy data may exist and be transferred or amplified because of the new system. For example, the legacy data/system may have been subject to a less stringent rulebase where a pipe was not required to have a pipe size in order to be posted to the database. If the new system puts such a requirement in place, then much of the legacy data would be in violation of the new standard. Ideally the problematic data would be fixed as part of the migration process, but in reality these types of fixes are often left out of the migration scope or overlooked, and are therefore left to be handled post migration.

Safeguarding the records maintenance process

With a multitude of evolutionary factors to which geospatial data can be subjected, it is important to ensure that safeguards are in place from the inception of a maintenance process. Fortunately, there are several safeguards a utility can put in place to help protect the integrity of data and prevent it from getting worse. The most obvious is the establishment of a QA/QC process whereby updated data is checked prior to being ‘posted’ to the live system. Additional safeguards, above and beyond QA/QC, can be built into the GIS and records maintenance process from its inception.
If these techniques weren’t applied during implementation, they likely can be incorporated into an existing system. Like any other process, the maintenance process itself changes over time and should be open to process improvement.

Measurable Acceptance Criteria/Quality Standards

The assumption so far has been that a set of acceptance criteria was applied during initial conversion and that it met user expectations and needs. If those standards were never put in place, then create a set of acceptance criteria for new and updated data that defines not only the acceptable error percentage by category, but also how errors in each category will be determined and measured. This will help records maintenance staff understand quality expectations and provide a way to quantify improvement in the quality of data.

Close Cooperation with Field Personnel

Field collection personnel are key to the data maintenance process, as they provide the source information for the updates that are to take place in the GIS. Often field personnel and records maintenance staff are in separate ‘departments’, which proves to be a challenge. In addition, the two teams have slightly different drivers. The field personnel are interested in getting things built or fixed and then noting the results, while the records maintenance staff wants to make sure the data in the GIS truly reflects what was built. Especially in maintenance processes involving paper (as-built sketches, etc.), this mode of operation often results in key information not being noted consistently on field sketches and therefore not being available for input into the GIS. The records maintenance staff’s recourse is either to have someone revisit the site in the field, which can prove costly, or to input the data with missing information and allow quality to be jeopardized. There are several proactive steps that can be taken to establish good communication and cooperation between the two groups.
An initial step is to develop a standard records/field sketch process that is based on the geospatial system’s terminology and standards, i.e., standard feature and attribute names, symbology, domain pick lists, etc. It is important to follow up that process with training. While most utilities recognize that GIS training is important, training for field personnel, who provide the source information, is often overlooked. Field personnel should be trained so they have an understanding of the GIS and know what data the GIS requires, what business rules are applied by the rulebase when data is posted, etc. In establishing a process, the records maintenance staff should recognize that the people in the field are both their supplier and their customer. The maintenance process should allow users to be a guide to data quality. When field personnel discover incorrect data, there should be a process for them to generate a job to have it fixed. Finally, establishing a focus group consisting of field personnel, members of the records maintenance staff, and users helps to facilitate communication. The focus group might be charged with directing ‘change management’ by examining the records maintenance business process on a continuous basis, making suggestions for improving the process, and monitoring the results, including changes to data quality.

Documented Maintenance Specifications

Don’t accept the premise that field personnel and records maintenance staff have an ‘unwritten’ understanding because they’ve worked together for so long. All data maintenance specifications, including how data should be collected and noted in the field, should be documented. Documentation provides a standard against which data can be measured and safeguards against staff turnover, where ‘knowledge gaps’ between new and old technicians might exist due to ‘undocumented’ specifications.

Rulebase

A tight set of business rules applied via a rulebase will help prevent errors from being entered.
This is especially useful during the early maintenance period, when procedures, specifications, and operators are in their infancy. The application of business rules within the geospatial system can take many forms, including the use of domains that force operators to choose only those values on a pick list, connectivity rules applied when data is posted that require assets to have valid connections, and the application of ‘mandatory’ fields that require records maintenance staff to populate critical fields. In developing these business rules it is important to ensure that data adheres to the needs of other applications. For example, if elevation is needed to perform hydraulic modeling and hydraulic modeling is a key function of the GIS, then that field should be defined as ‘mandatory’ and the GIS should force the operator to input the data. Note that the use of a strict rulebase can have negative consequences. For example, if there are mandatory fields and the source data is not complete, maintenance staff may enter ‘made up’ values for the sake of being able to complete their work. In addition, the selection of valid values should be done with caution. Values like “unknown”, “x”, and “other” in domains typically add little value and often get used when accurate values should be applied.

Database or Code Triggers

As an extension to the set of business rules, database or code-based triggers can be applied. These types of triggers might address validity checks such as attribute combinations, and can be set up to fire automatically when an event such as a record insert, update, or delete occurs.

Use of Metadata in Data Model Design

When designing the GIS data model, make use of metadata fields to track the date features were created or modified, who created or modified them, etc. This will help to distinguish the source of specific errors and is especially useful in determining those errors that came from legacy data.
In addition, the data added by the records maintenance staff can be identified and measured separately from that which was initially converted, i.e., what is the quality of the data added since initial conversion? Another piece of metadata to consider is a quality code or confidence factor. These types of fields might indicate where the data came from. For example, was the length of a pipe field-verified? Was the location of a pole GPS’ed or was it guessed? Tracking this information would allow a utility to determine, over time, what percentage of its data came from very reliable source information (GPS’ed/field-verified) and what percentage was less reliable. In addition to informing users as to the confidence they should have in the data, it also allows the records maintenance team to focus on and improve those elements that were initially created from less reliable sources.

QA Tool Kit

The records maintenance staff should have a QA Tool Kit containing tools that allow them to measure the data against the quality standards. These tools should not only check for errors but also provide a quantitative measure, i.e., the number of things checked versus the number of errors. The tool kit need not be elaborate and may contain nothing more than a set of SQL scripts. For example, a script might determine how many pipes in the database are missing their nominal size. Applied over time, the QA Tool Kit will allow records staff to track and monitor quality. For example, last year 10% of the pipes were missing nominal size and this year only 5% are missing that information.

Data and process audits

Even with solid safeguards, no data maintenance process is perfect. Therefore, it is important to check up on data quality by performing audits that can quantify quality. As discussed earlier, the most critical time to evaluate data quality is immediately after the initial maintenance process has been put in place.
Ideally a standard set of audits can be imposed that will allow for a comparison to the quality measure that was taken during initial conversion. That same audit approach can be used on a regular basis and allow for a comparison of data quality over time. Checks and approaches to consider for an audit might involve any of the following.

Scripts to Verify Specifications/Business Rules

Specifications that can be automatically checked are likely the easiest to create and apply. This might include checks that:
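As an illustration only, a scripted check of this kind, one that reports the number of features checked alongside the number in error, might be sketched as follows. The sketch uses Python's standard sqlite3 module with an in-memory database; the table name `pipes`, the columns `nominal_size` and `material`, and the helper `missing_rate` are all hypothetical and stand in for whatever the utility's own data model defines.

```python
import sqlite3


def missing_rate(conn, table, column):
    """Return (checked, errors, percent) for rows where `column` is NULL or blank.

    `table` and `column` are assumed to be trusted identifiers supplied by the
    QA staff, not user input, since they are placed directly into the SQL text.
    """
    checked = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    errors = conn.execute(
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {column} IS NULL OR TRIM({column}) = ''"
    ).fetchone()[0]
    percent = 100.0 * errors / checked if checked else 0.0
    return checked, errors, percent


# Hypothetical sample data standing in for the live GIS database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipes ("
    "pipe_id INTEGER PRIMARY KEY, nominal_size TEXT, material TEXT)"
)
conn.executemany(
    "INSERT INTO pipes (nominal_size, material) VALUES (?, ?)",
    [("6", "PVC"), (None, "DI"), ("8", "PVC"), ("", "CI")],
)

checked, errors, percent = missing_rate(conn, "pipes", "nominal_size")
print(f"nominal_size: {errors} of {checked} pipes missing ({percent:.1f}%)")
```

Recording the three numbers from each run, rather than only the error count, is what allows one audit to be compared with the next.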
Sample Checks Against Source Information

A comparison of the updated geospatial data to source information (typically as-built sketches/work orders) will help determine whether information was properly input. This allows for a check of all characteristics of geospatial data, from attributes to connectivity, and can include aspects often overlooked, such as location of assets and annotation placement. Ideally, and perhaps with a healthy budget, a comparison of updated data to what truly exists in the field could be applied. This would test the accuracy of the field collection as well as the data input.

Audit of Records Maintenance Process

Observing the records maintenance staff as they apply the process can be a valuable exercise. While it does not directly measure the quality of data, an audit of the process can lead to the origin and potential causes of data errors. It will identify where specifications and procedures are not being followed, what workarounds have been ‘invented’, whether operators are turning off the rulebase checks, etc. It’s one thing to check the oil in the car; it’s another to audit how it was done and whether the cap to the oil tank was put back on.

Make Use of Metadata

If the geospatial data contains metadata that identifies who created the data, when, etc., then scripts should be devised that make use of that information. Those scripts can determine whether an error is due to updated data or to data converted by a vendor, and can measure the error percentage of data that was added or updated by the records maintenance staff. They can also make use of the creation date to determine whether the percentage of errors is improving over time. Since conversion/migration may leave legacy data problems, checks can also be devised that measure whether the volume of those legacy errors changes over time.

Quantify and Categorize Results

The ultimate goal of a data audit is to provide a measure of quality and thus shed light on how well the maintenance process is working.
Hopefully it is a measure that can be compared to future data audits. This should include measures gauging the frequency of errors and not just the total number of errors. Users might be shocked to find out they have 5,000 features with incorrect attribute information. When reminded that there are over a million features and the 5,000 errors represent less than half a percent, they will likely be less shocked. Another important consideration when assessing results is the severity of a given problem. It might be bad that pipe sizes are missing 5% of the time, but it is likely much worse if connectivity is wrong 1% of the time.

Data audit case study at a large utility

A large utility had just completed a multi-million dollar GIS implementation project. The project included the conversion of over 40,000 miles of pipes from thousands of hard-copy maps and, for some areas, from legacy geospatial systems. The implementation project team’s mandate included the establishment of a records maintenance system, including the tools, procedures, training, and setup for a centrally located records maintenance staff. With such a large conversion effort, the data was converted in batches over a three-year time frame, and a given batch went ‘live’ after it was accepted. Upon acceptance, the quality of the geospatial data was known to meet very specific criteria (no system or programmatic errors existed, attribute accuracy was at 98%, etc.), and the data was turned over to the records maintenance team. After several months of data maintenance, ‘users’ noticed errors in newly added or updated data, including poor connectivity, duplicate records, poor cartographic representation of features, and the use of ‘comment’ fields for storing attribute information that should have been stored in the specific field designed to hold it.
As a result, the utility became concerned that its multi-million dollar investment in quality data was degrading, but it did not know for sure whether this was so, by how much, or why. The implementation team believed they had built in many of the necessary safeguards: measurable acceptance criteria, documented specifications, a solid rulebase, a QA/QC step, metadata, and a rudimentary QA Tool Kit. However, based on the feedback from the users and recognizing that their maintenance process was in its infancy, they decided to perform a data quality and process audit that would:
To check for the above, the data and maintenance process audit called for the following checks.
Having sufficient metadata to identify old versus new errors made it possible to distinguish whether a given error was created during the initial conversion effort, carried over as part of migrated legacy data, or created by the records maintenance team. This breakdown of errors was conveyed in an audit report.
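The paper does not give the scripts used for this breakdown, so the following is only a minimal sketch of how metadata fields could drive it. It assumes each flagged feature carries a `created_by` value and an optional `last_edited_by` value, and that the conversion vendor and the legacy migration each loaded data under a dedicated account; the account names, field names, and the attribution heuristic are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical load accounts; a real system would use its own identifiers.
LOAD_ACCOUNTS = {
    "conversion_vendor": "initial conversion",
    "legacy_migration": "migrated legacy data",
}


@dataclass
class FlaggedFeature:
    """A feature that failed an audit check, with its tracking metadata."""
    feature_id: int
    created_by: str
    last_edited_by: Optional[str] = None


def error_origin(feature: FlaggedFeature) -> str:
    """Attribute an error to its most likely origin using metadata alone.

    Heuristic (an assumption, not a rule from the paper): any feature created
    or last edited by someone other than a load account is attributed to the
    records maintenance team, even though an edit may not have caused the error.
    """
    editor = feature.last_edited_by
    if editor is not None and editor not in LOAD_ACCOUNTS:
        return "records maintenance"
    return LOAD_ACCOUNTS.get(feature.created_by, "records maintenance")


def origin_breakdown(features) -> Counter:
    """Count flagged features by origin category for the audit report."""
    return Counter(error_origin(f) for f in features)


flagged = [
    FlaggedFeature(1, "conversion_vendor"),
    FlaggedFeature(2, "legacy_migration"),
    FlaggedFeature(3, "conversion_vendor", last_edited_by="jsmith"),
    FlaggedFeature(4, "jsmith"),
]
for origin, count in sorted(origin_breakdown(flagged).items()):
    print(f"{origin}: {count}")
```

Dividing each count by the number of features checked in the same category would turn the breakdown into the error percentages discussed earlier.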
*Figure 2 shows an example of the automated scripts and the nature of the results.
*Statistics reflect the nature of the results but do not show actual numbers.

The results showed that in some cases, such as the location of features in the GIS, the data added and modified by the utility’s records maintenance team improved the quality of the data. As might be expected with a maintenance process in its infancy, some aspects of the data, such as attribute accuracy, degraded. The results also indicated that, despite the user perception, a very small percentage of the critical data was in error. The audit via observation of the data maintenance process helped to identify causes behind the problematic data. While the need for retraining on specifications and gaps in the existing documentation were small contributors, communication with field personnel played a significant role. Field workers needed additional training in the use of the standard methodology, which included the use of GIS terminology and requirements. For example, they often failed to supply information for required fields. In addition, the records maintenance team did not have a solid process for rejecting incomplete or inadequate data that was gathered in the field. Data from the field personnel was often input “as is” rather than having the field personnel revisit the sketch area and provide the missing information. Finally, there was a need for an enhanced and expanded tool kit that could be used by the records maintenance QA staff to better monitor quality on a regular basis via automated scripts.

Concluding remarks

The cost of converting data can run upwards of 75% of the overall GIS implementation cost, especially if field collection is involved. Protecting that kind of investment should be a key concern for utilities. Direct threats to the data investment are certainly clear, and protection against them takes the form of scheduled back-ups, firewalls that protect the data from the outside, etc.
However, utilities should recognize that their asset of geospatial data is also threatened by a ‘natural’ evolution, and in order to avoid data de-evolution, data quality must be protected by proactively building in safeguards and then regularly monitoring data via audits.