GISdevelopment.net ---> GITA 2002 ---> Data Development & Evolution-Providing Data to the Masses

Quality Data, How do I recognize it?

Donald E. Ramsay
Logica, 10375 Richmond Ave., Suite 1000
Houston, TX 77042
Email: ramsayd@logica.com


Abstract
The key to any GIS functioning as expected is quality data. The old adage of “garbage in – garbage out” still holds true. With the amount of existing digital data available today, combined with the new and traditional means of developing digital data from paper sources, it is critical that a common means of identifying and describing quality data be introduced. Many GIS Project Managers and users know they need quality data but they are unsure of exactly what quality data is and more importantly how to define it in a set of project specifications. This paper will first describe what quality GIS data should be and then provide the reader with the tools to write a data quality specification. The reader will also gain the knowledge enabling them to establish a quality control procedure that will ensure that quality data is accepted and introduced into the GIS on a consistent basis.

Overview
When developing data quality specifications for accepting data and eventually loading it into an enterprise GIS, exactly what was meant by quality data has historically been a gray area. Many GIS Managers simply went with a generic quality value of anywhere from 96% to 99.5%. This level of accuracy may sound good in a conversion specification, and everyone within the organization may feel that they will be getting quality data. This paper will describe how, even with a quality data standard of 99.5%, the data may not be acceptable for running certain applications, or may not even load into the system at all.

With the amount of existing digital data available on the market today, it is critical that GIS Managers know how to identify and qualify GIS data in order to protect the existing database from corrupt data files. Now that GIS data is firmly entrenched and utilized throughout the enterprise, it is even more critical to protect data integrity. In years past it may have been only the mapping and/or engineering departments that were impacted by GIS data, but now almost every department within a utility has access to GIS data and makes critical decisions based upon it.

Given the number of mergers and acquisitions that occur within the utility industry today, it becomes even more important to have a means of evaluating the quality of GIS data when two or more existing GIS systems are merged into one. This becomes even more critical when you consider that the two systems about to be combined because of a merger or acquisition may not even be the same type of system.

Data Quality Considerations
When attempting to define the data quality specifications for a GIS, there are three main areas that need to be balanced: the cost of data acquisition, data source availability, and the application functionality of the GIS. All three of these areas need to be considered prior to making any decisions on data requirements.


A prime example of how these three components are intertwined is an electric company that wants to design a job to connect 12 new customers on a line extension. To complete this design, the designer needs to know which circuit phases are present and what the existing load is on each phase, in order to determine which phase or phases will feed the new customers. The desired application is for the system to trace each phase within the circuit and determine which existing customers are fed from each phase. Application functionality comes into play insofar as, if the data model of the GIS does not account for individual phase recognition, or if the customer’s feed location is not incorporated into the data model, then no single-phase trace can be completed. Data availability comes into play when the application functionality is present but the data is not populated in the correct database fields, or incorrect data is populated in them; in that case the trace will not be correct. The final area, data cost, plays a role insofar as, if the cost to acquire this data (i.e., a field survey) is more than the utility can justify, then chances are the data will never be captured and populated into the GIS. As this example shows, all of these areas must be considered when defining a set of GIS data quality specifications.

Defining GIS data quality is far more complex than just saying "I want my data to be 99.5% accurate." The data needs to be broken down into its prime components, with accuracy standards applied to each of these components. These GIS data components are:
  • Database Design Conformance
  • Connectivity
  • Database Attributes
  • Spatial Placement
  • Map Aesthetics
  • Age of Data
  • Data Completeness
Even within these main components there are subcategories that must be considered. The following paragraphs describe each of these components and what the minimum acceptable data standard should be for each.

Database Design Conformance
Database design conformance within a GIS refers to how the digital files are structured as compared to the physical and logical data models. When accepting data files to be loaded into the GIS, it is imperative that those files adhere to the database design specifications in both the physical and logical design models. All data fields must be correctly structured and named, and all associations between database tables must be maintained. Should the data tables be structured incorrectly or the data files be in the wrong format, the deliverable files will not load into the GIS, or, if they are force loaded, they will certainly corrupt the database integrity. For this reason the minimum acceptable data quality standard for database design conformance should be 100%.
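A conformance check of this kind lends itself to automation. The following is a minimal sketch, assuming the design is available as a mapping of field names to expected types; the field names and the `conformance_errors` helper are hypothetical and not taken from any particular GIS product.

```python
# Hypothetical design specification: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "pole_id": str,
    "height_ft": float,
    "install_year": int,
}


def conformance_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of ways a delivered record departs from the design."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"pole_id": "P-100", "height_ft": 45.0, "install_year": 1998}
bad = {"pole_id": "P-101", "height": 40.0, "install_year": "1998"}

print(conformance_errors(good))  # no errors for a conforming record
print(conformance_errors(bad))   # misnamed field and wrong type are reported
```

Because the standard is 100%, any non-empty error list would be grounds for rejecting the deliverable before it is loaded.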

Connectivity
Connectivity within a GIS refers to how the objects depicted in the GIS are connected to accurately model the required network configuration. Connectivity can be broken down into three main categories, each with its own requirements. The way the data model has been designed will determine how these connectivity categories are prioritized. As a basic rule, connectivity within the GIS network must be maintained to meet the expectations of the majority of expected system uses. The three main connectivity categories are object connectivity, database connectivity, and device connectivity.

Object Connectivity
Object connectivity deals with how the individual objects are physically connected in the database. All linear objects should be connected to either other linear objects or point objects. There should be no “open points” except where modeled to depict actual open points in the true network. This type of network connectivity is maintained by having the appropriate snapping routines in place to ensure that there are no erroneous gaps or overshoots in the graphic objects. Linear objects should be snapped end point to end point. Point objects should have their insertion point snapped to the end point of a linear object when connected. Polygonal objects should be snapped closed.
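The snapping rules above can be verified with a simple end point check. This is a minimal sketch, assuming lines are stored as pairs of (x, y) end points and that a small snapping tolerance applies; `SNAP_TOLERANCE` and `dangling_endpoints` are illustrative names, and the tolerance value is an assumption.

```python
import math

SNAP_TOLERANCE = 0.01  # assumed tolerance, in map units


def dangling_endpoints(lines, points, known_open=(), tol=SNAP_TOLERANCE):
    """Return line end points that touch nothing else.

    lines: list of ((x1, y1), (x2, y2)) segments
    points: list of (x, y) insertion points of point objects
    known_open: end points that model real open points in the network
    """
    def near(a, b):
        return math.dist(a, b) <= tol

    dangles = []
    for i, line in enumerate(lines):
        for end in line:
            touches_line = any(
                near(end, other_end)
                for j, other in enumerate(lines) if j != i
                for other_end in other
            )
            touches_point = any(near(end, p) for p in points)
            is_known_open = any(near(end, p) for p in known_open)
            if not (touches_line or touches_point or is_known_open):
                dangles.append(end)
    return dangles


# The third line starts 0.5 units short of the second: an erroneous gap.
lines = [((0, 0), (10, 0)), ((10, 0), (20, 0)), ((20.5, 0), (30, 0))]
points = [(0, 0), (30, 0)]
print(dangling_endpoints(lines, points))
```

The pairwise comparison is quadratic in the number of lines; a production check would more likely use a spatial index, which is outside the scope of this sketch.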

Database Connectivity
Database connectivity refers to how the non-graphic database attributes are modeled to depict the live network. Attribute fields may be populated to represent phase configuration, pressure, wire and/or pipe size, and other relevant information that is depicted non-graphically. Attributes can even be used to accommodate the opening or closing of devices for network traces, as long as those traces are run against selected attribute values. Database connectivity is maintained by ensuring that only objects with like or compatible attribute values are adjoined.
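As an illustration of checking that adjoined objects carry compatible attribute values, the sketch below uses electric phases. The compatibility rule applied here (a downstream object may only carry phases present on the object feeding it) is an assumption made for the example, as are the object names.

```python
def incompatible_joins(phases, joins):
    """Return joins whose downstream phases are not carried upstream.

    phases: object id -> set of phase letters, e.g. {"A", "B", "C"}
    joins: list of (upstream_id, downstream_id) pairs
    """
    return [
        (up, down)
        for up, down in joins
        if not phases[down] <= phases[up]  # subset test on phase sets
    ]


phases = {
    "main": {"A", "B", "C"},
    "lateral_1": {"A"},
    "lateral_2": {"B", "C"},
    "tap_1": {"B"},
}
joins = [("main", "lateral_1"), ("main", "lateral_2"), ("lateral_1", "tap_1")]

# tap_1 claims phase B but is fed from an A-phase lateral.
print(incompatible_joins(phases, joins))
```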

Device Connectivity
Device connectivity is one of the most difficult types of connectivity to model in a GIS. In order to model device connectivity, switching devices must be modeled so as to have true open and closed statuses. When device connectivity is modeled correctly the only connection between two objects is through the selected device. By having true device connectivity, networks within the GIS can more accurately duplicate the networks in the field.
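To show why true open and closed statuses matter, here is a minimal sketch of a network trace that only passes through closed devices. The edge list representation, node names, and `trace` function are assumptions for illustration, not a description of any specific GIS data model.

```python
from collections import deque


def trace(start, edges, device_status):
    """Breadth-first trace that only passes through closed devices.

    edges: list of (node_a, node_b, device_id or None)
    device_status: device_id -> "open" or "closed"
    """
    neighbours = {}
    for a, b, device in edges:
        if device is not None and device_status[device] == "open":
            continue  # an open device breaks the connection
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    reached = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


edges = [
    ("substation", "n1", None),
    ("n1", "n2", "switch_1"),
    ("n2", "n3", None),
    ("n1", "n4", "switch_2"),
]
status = {"switch_1": "closed", "switch_2": "open"}

# n4 sits behind the open switch, so the trace does not reach it.
print(sorted(trace("substation", edges, status)))
```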

In all categories of connectivity it is crucial to have 100% accuracy. Without 100% connectivity, the GIS will not be able to perform many of its more basic applications.

Database Attributes
The minimum acceptable accuracy levels for database attributes in a GIS can be broken down into two categories: valid values and data content. Although both have a direct impact on the integrity of the database, each has its own means of validation. Valid values can be verified through automatic editors, while data content needs to be verified through manual edits. The cost of editing and verifying may have a direct impact on what acceptable levels are established for each category.

Valid Values
Valid values are a predefined set of values that can populate a database field, with no exceptions. Any value entered into one of these database fields that falls outside the predefined set of valid values is deemed an error. Some examples of sets of valid values would be:
  • Conductor size: 6, 2, 4, 1/0, 4/0, 336, 500, etc.
  • Pipe size: 1, 2, 4, 6, 8, 10, 12, etc.
  • Numeric values: any numeric value characters
  • Standardized street name suffixes: St, Wy, Cr, Ave, Hgwy, etc.
  • Standard Defaults: any value that is a standard default value in a particular field, this can also include “nulls” and or blank fields.
Valid values should have a minimum acceptable accuracy standard of 100%.
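A valid values check is a straightforward set membership test. The sketch below uses the conductor sizes and street suffixes listed above; since those lists end in "etc.", the sets here are deliberately partial, and the field names are hypothetical.

```python
# Domains taken from the examples above (partial); field names are hypothetical.
VALID_VALUES = {
    "conductor_size": {"6", "4", "2", "1/0", "4/0", "336", "500"},
    "street_suffix": {"St", "Wy", "Cr", "Ave", "Hgwy"},
}


def invalid_values(records, domains=VALID_VALUES):
    """Return (row number, field, value) for every value outside its domain."""
    errors = []
    for row_number, record in enumerate(records):
        for field, allowed in domains.items():
            value = record.get(field)
            if value not in allowed:
                errors.append((row_number, field, value))
    return errors


records = [
    {"conductor_size": "4/0", "street_suffix": "Ave"},
    {"conductor_size": "3/0", "street_suffix": "Street"},
]
print(invalid_values(records))
```

Where a field allows nulls or blanks as a standard default, those would simply be added to the field's allowed set.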

Data Content
Data content on the other hand is much more difficult to verify. It requires a manual effort to physically edit the values populated in each database field and make a decision as to if it is the correct value as compared to the original source documents used to capture the information. The value populated in the field may even be a valid value but is it the correct valid value?

Because of the effort involved in verifying every field in every attribute table the minimum acceptable accuracy standard for data content is generally set at 98%.
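Once the manual comparison against source documents has been recorded, the accuracy figure itself is simple arithmetic. This is a minimal sketch, assuming the reviewer's reading of the source is keyed by object and field; the object IDs and field names are invented for the example.

```python
def content_accuracy(captured, source):
    """Fraction of checked fields whose captured value matches the source."""
    checked = 0
    correct = 0
    for object_id, source_fields in source.items():
        for field, source_value in source_fields.items():
            checked += 1
            if captured.get(object_id, {}).get(field) == source_value:
                correct += 1
    return correct / checked if checked else 1.0


# Values as read from the source documents by the reviewer (hypothetical).
source = {
    "T1": {"kva": 25, "phase": "A"},
    "T2": {"kva": 50, "phase": "B"},
}
# Values as populated in the GIS. Phase "C" is a valid value, but not the correct one.
captured = {
    "T1": {"kva": 25, "phase": "A"},
    "T2": {"kva": 50, "phase": "C"},
}

accuracy = content_accuracy(captured, source)
print(accuracy, accuracy >= 0.98)
```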

Spatial Placement
Spatial placement within the GIS refers to where the objects within the GIS are in reference to their actual locations in the real world. Spatial placement can be broken down into two areas of reference. These two areas are absolute placement and relative placement. Although these two areas can define the location of the same object or objects they are quite different.

Absolute Placement
Absolute placement within a GIS refers to where an object is located in reference to its location in the real world according to its control coordinate system. This control is generally established using the Global Positioning System (GPS). To verify an object’s absolute accuracy, the (X,Y) coordinate of the object in the GIS should be the same (X,Y) coordinate value when checked in the field. Although the absolute placement of an object may be the most desirable means of placing an object in a GIS, the cost of doing so, along with the issue of the absolute accuracy of the landbase within the GIS, makes it the most difficult to maintain and therefore the least common method used. Many times the absolute position of an object, or its GPS coordinates, is stored as an attribute value of that object and used for various engineering applications, while the object is positioned in the GIS using relative placement techniques.

The accuracy standards for placing objects within a GIS by their absolute locations can be anywhere from ± 0.5 feet to ± 5 feet.
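An absolute placement check compares the GIS coordinate with the field-checked coordinate. The sketch below assumes both are expressed in the same projected coordinate system with units of feet; the coordinates and function names are illustrative.

```python
import math


def absolute_error(gis_xy, field_xy):
    """Straight-line distance between the GIS and field coordinates."""
    return math.dist(gis_xy, field_xy)


def within_tolerance(gis_xy, field_xy, tolerance_ft):
    return absolute_error(gis_xy, field_xy) <= tolerance_ft


gis_xy = (1000.0, 2000.0)    # coordinate stored in the GIS (hypothetical)
field_xy = (1003.0, 2004.0)  # coordinate measured in the field (hypothetical)

print(absolute_error(gis_xy, field_xy))         # 5.0
print(within_tolerance(gis_xy, field_xy, 5.0))  # True
print(within_tolerance(gis_xy, field_xy, 0.5))  # False
```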

Relative Placement
Relative placement is the most common means of positioning objects within a GIS. Relative placement refers to placing an object in relative position to the objects around it. This can be done by visual placement or by using standard offsets. An object may not be in its absolute position, but it is correctly offset from other objects within the GIS. For example, poles are the correct distance from each other and in the correct property lot; valves are the correct distance from pavement edges; and spans of wire and/or pipe are the correct lengths and correctly offset from landmark features.

The accuracy standards for placing objects within a GIS by their relative locations can be anywhere from ± 0.5 feet to ± 10 feet.
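For relative placement, what is checked is the distance between objects rather than their coordinates. A minimal sketch, assuming a known span length from the source documents and pole positions in feet (all values hypothetical):

```python
import math


def span_error(pole_a, pole_b, source_length_ft):
    """Difference between the span length in the GIS and on the source."""
    gis_length = math.dist(pole_a, pole_b)
    return abs(gis_length - source_length_ft)


pole_a = (0.0, 0.0)
pole_b = (150.0, 200.0)    # 250 ft from pole_a in the GIS
source_length_ft = 245.0   # span length recorded on the source document

error = span_error(pole_a, pole_b, source_length_ft)
print(error, error <= 10.0)
```

Both poles could be shifted together by any amount without changing this result, which is exactly the distinction between relative and absolute placement.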

Map Aesthetics
Of all the components of a GIS to which accuracy standards need to be applied, map aesthetics is by far the most difficult to quantify. The difficulty comes into play because you really won’t know what looks good until you see it, and several people will have differing opinions as to what looks good. No matter how detailed the graphic placement specifications for rotation, offsets, and size may be, there are always instances that are the exception. Because map aesthetics are the first thing a user may see, it is important that the map products be visually appealing, with all objects legible and easily decipherable. Consideration must be given to the amount of time and resources that could be spent continually repositioning objects to achieve a higher level of aesthetics that may or may not affect the overall performance of the GIS and its applications.

The accuracy standard for map aesthetics is generally established at 98%: that is, 98% of the objects must be correctly placed with regard to rotation and/or offset, with minimal or no overstrikes for displayed text.

Age of Data
Although the age of the data correlates directly with the sources used to create it, it is still important to have up-to-date data in the GIS to facilitate acceptance, and therefore use, of the GIS throughout the user community. If system users are not confident that the data they are accessing is up to date and useful to them in making their decisions, they will not rely on it and, as an end result, will not use the GIS. Every effort must be made to use only the most recent sources available to capture the objects included in the GIS. In addition, an aggressive maintenance and update process must be in place to ensure the continued integrity of the database.

The data within the GIS should be no more than 6 months out of date.
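If each object or source carries a last-updated date, the six-month rule can be checked directly. This sketch assumes such a date is available and approximates six months as 183 days; both the field and the day count are assumptions.

```python
from datetime import date

MAX_AGE_DAYS = 183  # roughly six months; the exact day count is an assumption


def is_current(last_updated, as_of):
    """True if the data was updated within the allowed age window."""
    return (as_of - last_updated).days <= MAX_AGE_DAYS


as_of = date(2002, 6, 1)
print(is_current(date(2002, 1, 15), as_of))  # updated about 4.5 months earlier
print(is_current(date(2001, 6, 1), as_of))   # updated a year earlier
```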

Data Completeness
Even if all the data captured in the GIS is accurate and meets the minimum accuracy standards, has all of the data from the original sources been captured? Data completeness addresses this concern. By comparing the information, object by object, to the data depicted on the original sources, an accurate evaluation can be made. Although this is a manual process, the effort spent to verify data completeness can be worthwhile. Should several critical objects not have been captured, such as a number of transformers, valves, or switches, it could have a serious impact on the effectiveness of the GIS applications.

The minimum acceptable accuracy standard for data completeness should be 98%.
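Once the objects on the source documents have been tallied by hand, the completeness figure can be computed from the two sets of identifiers. A minimal sketch, assuming each object has an identifier that appears both on the source and in the GIS (the transformer IDs are invented):

```python
def completeness(source_ids, captured_ids):
    """Return (fraction of source objects captured, sorted missing ids)."""
    source_ids = set(source_ids)
    missing = source_ids - set(captured_ids)
    if not source_ids:
        return 1.0, []
    ratio = (len(source_ids) - len(missing)) / len(source_ids)
    return ratio, sorted(missing)


source_ids = [f"T{i}" for i in range(1, 51)]            # 50 transformers on the source
captured_ids = [oid for oid in source_ids if oid != "T7"]  # one was not captured

ratio, missing = completeness(source_ids, captured_ids)
print(ratio, missing)
```

Reporting the missing identifiers alongside the ratio matters because a file can meet 98% overall while still lacking a critical device.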

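The following bullets summarize the minimum acceptable standards for each of the data quality categories.
  • Database Design Conformance: 100%
  • Connectivity (object, database, and device): 100%
  • Valid Values: 100%
  • Data Content: 98%
  • Spatial Placement, absolute: ± 0.5 feet to ± 5 feet
  • Spatial Placement, relative: ± 0.5 feet to ± 10 feet
  • Map Aesthetics: 98%
  • Age of Data: no more than 6 months out of date
  • Data Completeness: 98%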


Data quality acceptance
When developing a data acceptance plan, the main goal should be to verify the maximum amount of data with the least amount of resources. The way to accomplish this is to combine automated and manual edits of the data with an efficient means of reporting on the data’s acceptance at certain stages of the acceptance process. With these touch points in place, data found to be unacceptable at an earlier edit point will not continue through the acceptance process, wasting time and resources on editing unacceptable data.

By running automated edits, with automated reporting and little or no operator intervention, on those areas that lend themselves to automated routines, large amounts of data can be edited in a short amount of time with very few resources expended. The following data quality categories can be completely edited using automated routines.
  • Database Design Conformance
  • Connectivity
  • Valid Values
The following data quality categories must be edited using manual edits. The extent of these edits will depend on the amount of data edited in each deliverable file.
  • Data Content
  • Spatial Placement
  • Age of Data
  • Map Aesthetics
  • Data Completeness
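The split between automated and manual edits suggests running the automated checks first, as gates, so that a failing file never reaches the costlier manual stages. The following is a minimal sketch, assuming each check is a function that returns a list of errors; the stage names follow the categories above, while the stub checks and `run_acceptance` are hypothetical.

```python
def run_acceptance(deliverable, stages):
    """Run checks in order and stop at the first stage that reports errors.

    stages: list of (name, check) where check(deliverable) -> list of errors
    Returns (accepted, report); report lists (stage name, error count)
    for each stage that actually ran.
    """
    report = []
    for name, check in stages:
        errors = check(deliverable)
        report.append((name, len(errors)))
        if errors:
            return False, report  # reject here; later stages are skipped
    return True, report


# Stub checks standing in for real automated edit routines.
stages = [
    ("design conformance", lambda d: []),
    ("connectivity", lambda d: ["gap at (20, 0)"]),
    ("valid values", lambda d: []),
]

accepted, report = run_acceptance({"file": "delivery_001"}, stages)
print(accepted, report)
```

Rejecting on any error is appropriate for these three stages because each carries a 100% standard; the manual stages, with their 98% standards, need a threshold instead.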
When conducting a manual edit, an intelligent selection of objects should be identified for edits. This selection process should identify those objects that have a potential for errors. These can include areas of heavy density, areas where a large number of changes are occurring, or areas of insufficient source data. Anywhere from 10% to 25% of the objects should be selected for manual edits. If errors are identified in these samples, a greater percentage should be selected for additional edits. The purpose of a quality control edit is to identify errors to the point of acceptance or rejection, not to edit every object in the database. It can be assumed that if 25% of the objects are acceptable, the remaining objects are as well. Conversely, as soon as an edit uncovers sufficient errors to reject a file, the edits should stop and the file should be returned to production for corrections.
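The sampling and early-stop logic described above can be sketched as follows. Note one simplification: the sample here is drawn at random, whereas the text calls for an intelligent selection weighted toward error-prone areas. The function names, the fixed seed, and the 98% threshold used as the default are assumptions for illustration.

```python
import math
import random


def sample_for_manual_edit(object_ids, fraction=0.10, seed=0):
    """Pick a fraction of objects for manual edit (10% to 25% in the text)."""
    rng = random.Random(seed)
    object_ids = list(object_ids)
    count = max(1, round(len(object_ids) * fraction))
    return rng.sample(object_ids, count)


def review(sample, is_correct, required_accuracy=0.98):
    """Check sampled objects, stopping as soon as rejection is certain.

    Returns ("accept" or "reject", number of objects actually checked).
    """
    # Small epsilon guards against floating point error in the product.
    allowed_errors = math.floor(len(sample) * (1 - required_accuracy) + 1e-9)
    errors = 0
    for checked, object_id in enumerate(sample, start=1):
        if not is_correct(object_id):
            errors += 1
            if errors > allowed_errors:
                return "reject", checked
    return "accept", len(sample)


# 100 sampled objects at 98% allow 2 errors; the third error ends the edit.
sample = list(range(100))
known_bad = {3, 5, 8}
print(review(sample, lambda oid: oid not in known_bad))
```

In practice `is_correct` stands for the editor's manual judgment on each object, and a rejected file would go back to production rather than having the remaining objects checked.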

The following flow chart depicts these edit points and where in the process the incremental accept / reject decisions can be made:

