GISdevelopment.net ---> GITA 2000 ---> Data Development and Evolution

Data integrity and quality - How do you get there?

Bob Britton
US WEST, 700 W. Mineral Ave., Littleton, CO 80120

Darrell Rhodes
Analytical Surveys, Inc., 941 N. Meridian St.
Indianapolis, IN 46204


Quality Control and Quality Assurance (QA/QC) Processes
Data integrity and quality are critical to the success of any geospatial program implementation and are achieved through effective QA/QC and data validation processes, especially during the data conversion phase of the project. The importance of these processes becomes more evident as the project stakeholders recognize that the initially converted database will be used as the foundation for all future business applications and functions requiring geospatial data and analysis. Typically, additional layers or levels of data will be added on top of the originally converted data as future enhancements are made to the geospatial system, adding more importance to the initial quality and integrity of the primary system data.

Typically, conversion vendors will utilize detailed quality assurance and quality control steps within their conversion process to ensure the data specifications are met prior to delivering the data to the client. With the increasing focus on ISO 9000 implementation and certification, quality assurance and control procedures are beginning to adhere to a basic set of standards recognized throughout the GIS community as well as other industries, resulting in improved QA/QC processes.

During data conversion, quality control is focused on "inspection" and will include full manual and automated checks of the converted data against the source information and specifications at defined checkpoints within the conversion process. Manual checks may include a one-to-one comparison of the source document with a hard copy plot as well as consistency checks performed digitally on screen. Automated checks typically involve very specific validation routines that are run at the completion of each task. These routines check feature-level and model-level requirements and report any violations to an operator for correction. This process is generally cyclical: it repeats until all the errors have been identified and corrected, and only then does work move to the next task.

Quality assurance is focused on the "process" as well as "validation". The conversion process must be engineered to ensure the quality of the data is "built in" rather than "inspected in". This involves various types of validations built into the conversion software and process to minimize the "human error" factor on the data.

For example, data model requirements can be incorporated into the conversion software, enabling the data being entered by an operator to be validated "on the fly". In other words, if an operator is capturing attributes for a cable feature, the software would only allow the operator to key in legal values for the attributes as defined by the data model. For system defined attributes, the software would populate these fields automatically and would eliminate any operator intervention thereby reducing the risk of error. In addition to process engineering, quality assurance also involves data validation utilizing random sampling techniques at major points within the conversion process. This generally occurs on the final software platform and will consist of running QA scripts and checking reports as well as reviewing hard copy check plots and performing onscreen integrity checks. Generally, better results are achieved if the conversion vendor can replicate the random sampling process / technique utilized by the client.
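The "on the fly" validation described above can be illustrated with a minimal Python sketch. The attribute names, the legal values, and the `batch_id` field are hypothetical and stand in for whatever the project's data model actually defines; this is not taken from any particular conversion package.

```python
# Hypothetical fragment of a data model: legal values for two cable attributes.
CABLE_DOMAINS = {
    "placement": {"AERIAL", "BURIED", "UNDERGROUND"},
    "gauge": {19, 22, 24, 26},
}


def is_legal(attribute, value, domains=CABLE_DOMAINS):
    """Return True if the keyed value is legal for the attribute."""
    if attribute not in domains:
        raise KeyError(f"attribute not in data model: {attribute}")
    return value in domains[attribute]


def capture_cable(keyed_values, batch_id):
    """Accept operator-keyed values only if every one of them is legal.

    The system-defined field (here a hypothetical batch identifier) is
    filled in by the software, so the operator never types it.
    """
    illegal = [name for name, value in keyed_values.items()
               if not is_legal(name, value)]
    if illegal:
        raise ValueError(f"illegal values for: {', '.join(sorted(illegal))}")
    record = dict(keyed_values)
    record["batch_id"] = batch_id  # system-populated, not operator-keyed
    return record
```

Rejecting the record at entry time, rather than flagging it in a later report, is what moves the check from "inspected in" to "built in".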

The client must fully understand the conversion vendor's QA/QC processes in order to establish their own checks and be assured that the data will comply with the specifications and meet the desired quality levels. In this regard, the client and the conversion vendor must work together and share in the QA/QC responsibility. Providing the conversion vendor with concise requirements and targets is the first step in setting up good quality methods. This is generally accomplished through the development of acceptance criteria, which are discussed in the last section of this paper.

Most clients try to minimize the resources required to perform the QA/QC analysis of the data delivered by the conversion vendor. In addition, knowledgeable resources are not always available, and when they are available they are often needed for other programs within the company. Beyond resource issues, there are also time constraints that must be managed as part of the data review. Generally, specific time periods for data review and acceptance have been defined in the contract. With this in mind, the client usually uses a statistical sampling scenario to make the best use of limited resources and time when validating data coming from the vendors. It is also important to note that the client is counting on much of the actual detailed quality control to have taken place before the data is delivered.

To establish a proper statistical sampling method, the client must adhere to an established set of criteria such as those presented in the ANSI Sampling Procedures and Tables for Inspection by Attributes. It is critical that once the inspection criteria are established, the staff assigned to perform the inspection follow the criteria to the letter. A common obstacle with this type of analysis is convincing the users that it works. The people who are performing the inspection are usually ex-records staff who will not be tolerant of ignoring "other errors" that they notice outside of the sample set of data. For random sampling to work, however, you must record errors only on the items randomly selected for the sample.

One method to help mitigate this natural reluctance to look past additional errors is to use a separate error tally sheet to record the "other errors" found during the inspection. The project team will then need to decide whether action is required to correct these errors before the data is turned over to the end user. If the data were accepted based on the sample, typically the "other errors" would not be corrected before the data is turned over to the client. If the data were rejected based on the sample, typically the "other errors" are returned to the conversion vendor for correction as part of the normal rework cycle for the rejected delivery.

A typical random sampling process will include four components: inspection conditions, characteristics to be inspected, inspection methods, and the consolidation of results. With these components in place, one should be able to ensure that the converted data is complete and accurate. Again, to establish proper statistical sampling processes, one should follow accepted standards such as ANSI's Guide to Inspection Planning and Sampling Procedures and Tables for Inspection by Attributes.

Inspection Conditions
Inspection conditions include determining the inspection timing, personnel requirements and facility requirements. Inspection timing is usually a contractual obligation that is tied to the time period between when the data is delivered and final payment is made. The timing requirement and number of data deliveries (batches) will drive the personnel and facility requirements. Personnel requirements typically include a Process Coordinator, Records Inspector, and Database Inspector. Facility requirements will be based on the physical and logistical characteristics of the inspection process including space and furnishings.

A Process Coordinator will be responsible for the process implementation and its continuous monitoring. This person should have the following basic skills:
  • Scheduling and resource management
  • Comprehensive understanding of statistical principles and inspection processes
  • Comprehensive understanding of the Client's sources
  • Comprehensive understanding of the Client's Geospatial System Requirements
  • Comprehensive understanding of contractual terms and conditions
A Records Inspector will be responsible for identifying and inspecting specific features and tallying those results. This person should have the following basic skills:
  • Familiarity with the statistical principles upon which the inspection process is based
  • Comprehensive understanding of the Client's sources
  • Comprehensive understanding of the Client's Geospatial System Requirements
  • Ability to analyze source and conversion data and tabulate inspection results
A Database Inspector will be responsible for the inspection of the non-graphic database products produced in the conversion. This person should have the following basic skills:
  • Familiarity with the statistical principles upon which the inspection process is based
  • Comprehensive understanding of the Client's sources
  • Comprehensive understanding of the Client's Geospatial System Requirements
  • Basic system navigation and file processing skills
The facility requirements should be determined based upon the specific parameters of the project. The number of batches to be delivered, and the time frame over which they arrive, will determine most facility requirements. Keep in mind the following key points when determining your specific facility requirements:
  • There will typically be many boxes of paper documents arriving with each batch that will need to be inventoried and filed while the batch is being processed.
  • The paper and data received for each batch will typically need to be saved for the extent of the contract.
  • The inspection process will generate large amounts of documentation, which will have to be filed and stored for the duration of the contract.
  • Inspectors will require adequate work surfaces with sufficient room to lay out D- or E-size drawings.
  • Copy and FAX equipment should be available and in close proximity.
Characteristics To Be Inspected
Characteristics to be inspected will be specific to your project's requirements. Typically, you may want to group characteristics into some General categories as well as some Feature Specific categories. Within each of these categories, both major and minor characteristics will need to be defined. The major characteristics correspond to critical product requirements that can lead to material defect in the geospatial application. Minor characteristics relate to data non-conformities. This distinction will be critical when defining your acceptance criteria, as the limits on major characteristic errors will typically be much more stringent than those on minor characteristic errors.

As an example of how you might categorize your inspection characteristics, consider a cable feature in a telephony geospatial conversion. The cable feature would have some major characteristics such as whether or not the feature was captured, whether or not it was captured along the correct side of the street, or whether its database relationships are modeled correctly. If any of these major characteristics are captured incorrectly, they will have significant effects on the usability of the data in its various applications.

If the cable were not captured, then the connectivity for those cable pairs would be corrupted in all field side plant from that point on as well as increasing the potential for a cut cable in the future. For these reasons, these types of errors are weighted more heavily during the inspection. In order to account for the significance of the error and provide the statistical results, these types of errors are counted as if all the attributes for this feature were converted wrong. If a cable has been defined as having twelve attributes in the sampling scheme, then a missed cable would result in twelve errors in the inspection tally. The same holds true for a cable captured on the wrong side of the street. In some geospatial data models, this is a critical error and could cause cable cuts and added engineering and construction costs due to the incorrect placement information. This example would also result in twelve errors tallied during inspection.
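The weighting rule above can be sketched in a few lines of Python. The function name and parameters are hypothetical; the only rule taken from the text is that a major defect (a missed cable, or a cable on the wrong side of the street) is tallied as if every attribute of the feature were wrong. Counting each ordinary attribute error as one nonconformance is an assumption of this sketch.

```python
def feature_nonconformances(attribute_count, major_defect, wrong_attributes=0):
    """Nonconformances to tally for one sampled feature.

    attribute_count  -- attributes defined for the feature in the sampling scheme
    major_defect     -- True if the feature is missing or misplaced
    wrong_attributes -- attribute values found in error (assumed to count 1 each)
    """
    if attribute_count <= 0:
        raise ValueError("attribute_count must be positive")
    if major_defect:
        # A major defect is counted as if all attributes were converted wrong.
        return attribute_count
    if not 0 <= wrong_attributes <= attribute_count:
        raise ValueError("wrong_attributes must be between 0 and attribute_count")
    return wrong_attributes
```

For the twelve-attribute cable in the example, `feature_nonconformances(12, major_defect=True)` gives the twelve errors described above, while three wrong attribute values on a correctly placed cable give three.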

In the case of the cable relationships within the data model, the accuracy of this data would be highly valued and therefore would be heavily weighted during inspection. These data relationships are the foundation of the geospatial system's operation, and errors in them could cause severe data corruption in many cases. For this reason, it is typical that the acceptance criteria will mandate that any data relationships or model requirements must be 100% correct. The client will usually develop a suite of automated QA scripts that are run against the data to check the validity of this type of relationship as well as other data model requirements. Typically, the QA scripts are made available to the conversion vendor because most clients prefer that the quality of this data be assured before the data is delivered the first time. Also, if the conversion vendor has the ability to run these scripts on the data, it will help them identify potential data and/or process errors and enable them to modify the conversion process to prevent the problem(s) from occurring on future data sets.
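As an illustration of what one such automated QA script might look like, the Python sketch below checks two relationship rules on every cable in a delivery, since these requirements are checked at 100% rather than sampled. The record layout (`id`, `record_id`, `from_terminal`, `to_terminal`) is hypothetical; a real script would be written against the project's own data model and platform.

```python
def find_relationship_errors(cables, terminals):
    """Return (cable_id, problem) pairs for every relationship violation.

    cables and terminals are lists of dicts in a hypothetical layout:
    each cable has 'id', 'record_id', 'from_terminal' and 'to_terminal';
    each terminal has 'id'.
    """
    terminal_ids = {terminal["id"] for terminal in terminals}
    errors = []
    for cable in cables:
        # Every graphic feature must have a database record.
        if cable.get("record_id") is None:
            errors.append((cable["id"], "missing database record"))
        # Both ends of the cable must reference a terminal that exists.
        for end in ("from_terminal", "to_terminal"):
            if cable.get(end) not in terminal_ids:
                errors.append((cable["id"], f"{end} does not reference a known terminal"))
    return errors
```

An empty list means the delivery passes these two rules; any entry at all would fail a 100% requirement.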

The same cable could have many additional minor characteristics. These characteristics would include the actual values of the attributes captured or the symbology used for the feature. Errors of this type are commonly introduced as "human errors" during the conversion process, or result from source inconsistencies and legibility issues. In most cases involving these types of errors, the functionality of the geospatial system and applications would be less affected, although an error is still undesirable. Based on the nature of these characteristics and the business rules that generally apply to them, it is very difficult to write automated scripts that will identify the errors. For these reasons, these types of errors are best suited for statistical sampling. Your inspection sheets should be tailored to easily record these types of errors. In addition, these types of errors usually serve as the basis for the base units in the overall sampling scheme. In other words, these unit totals form the statistical sampling criteria for the sampling tables. This would be defined in the overall acceptance criteria as an allowable error percentage (number of errors divided by the total units in the sample set).
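The allowable error percentage is simple arithmetic, sketched here in Python. The function names are hypothetical, and the 2.0% figure in the usage note is only an example value; the actual allowance comes from the project's acceptance criteria.

```python
def sample_error_percent(nonconformances, sample_units):
    """Errors found, as a percentage of the units inspected in the sample."""
    if sample_units <= 0:
        raise ValueError("sample_units must be positive")
    return 100.0 * nonconformances / sample_units


def within_allowance(nonconformances, sample_units, allowed_percent):
    """True if the sample's error percentage does not exceed the allowance."""
    return sample_error_percent(nonconformances, sample_units) <= allowed_percent
```

With 200 units inspected and an allowance of 2.0%, four errors is within the allowance and five is not.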

Inspection Methods
Now we will discuss some details of the actual inspection methods used in a random sampling analysis. The methods are presented in two phases, a Preparation phase and an Inspection phase. We will discuss the basic concepts that should be used in each of these phases as they relate to a typical geospatial data batch delivery. A typical example of a batch sampling summary sheet is provided at the end of the paper in Exhibit B. Many of the items discussed below will be recorded on this sheet prior to the actual inspection of the data. In the Preparation phase, the following bullets describe the major tasks that need to be accomplished:
  • Determine Sampling Scheme: A determination is made as to which sampling tables the accept/reject percentages will be based on. The sampling schemes are described as Normal, Reduced, or Tightened. To make this determination, the history of the deliveries must be reviewed and compared to the switching rules given in the ANSI Z1.4 standard. The results of this first analysis are recorded on the sampling summary sheet for use in determining the accept/reject of the batch upon completion of the inspection.

  • Determine the Required Sample Size: The actual number of units to be inspected for the delivery batch must be determined. Again, the ANSI Z1.4 specifications will provide the tables needed to choose the proper sample size for the batch being inspected. It is important to note here that there must be a convenient method to know how many units are delivered in each batch. The actual definition of a QA unit is the foundation of the inspection process. Typically, the lowest measurable unit is the single attribute value of each feature in the geospatial system. Therefore, the total units in a batch are typically the total of all attributes for all features delivered in the batch. This usually involves the conversion vendor providing these counts and the client confirming them upon receipt by running a simple script. The total units delivered in the batch, along with the acceptance error rate allowed, will be used in the sampling tables to return the number of units needing to be inspected.

  • Determine the Accept/Reject Values for Number of Non-Conformances: This step will provide the results criteria that will determine whether the delivery batch is accepted, rejected, or requires a re-sample. Referring to the ANSI standard sampling tables, we can obtain the appropriate accept (Ac) and reject (Re) numbers to record on the sampling summary sheet for the batch. The total number of nonconformances recorded during inspection will be compared to these numbers to determine the outcome of the batch.

  • Identify Source Units for Inspection: In this step of the preparation phase, the data sources being inspected need to be easily identified, and there must be a means to segregate out the inspection units. A method that has been used successfully in the past is to first assign a sequential number to each source record or facilities map. Second, a random number generator is used with the highest number equal to the total number of all records associated with the delivery batch. The source records selected for inspection are based on the random numbers returned by the program. The required sample size will give the number of sources to be chosen. Next, the individual feature on each source needs to be randomly selected from the selected source map(s). It is easy to accomplish this with the use of a Mylar overlay printed with a simple numbered grid. Again, a random number generator is used to select the grid numbers. The overlay is then placed over the source, and the closest feature to the selected grid number will be the feature identified for inspection.

    Note: A simple shortcut has been taken in this sampling scenario by assuming an equal average number of attributes per actual feature. If this is not the case in your specific project, additional source and feature selections may be required to achieve the total attributes needed for the sample.
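The unit count and the random selection of sources and grid numbers described in the preparation steps can be sketched in Python as follows. All names and parameters are hypothetical. The sketch applies the same shortcut as the note above (an equal number of attributes per feature), and it takes the required sample size as an input rather than reproducing the ANSI Z1.4 tables. Discarding repeated (source, grid) pairs is an assumption of this sketch, not something the procedure specifies.

```python
import math
import random


def count_qa_units(features):
    """Total QA units in a batch: the sum of attribute counts over all features.

    features maps a feature identifier to a collection of its attributes.
    """
    return sum(len(attributes) for attributes in features.values())


def select_inspection_targets(total_sources, grid_cells, sample_units,
                              attrs_per_feature, seed=None):
    """Pick (source number, grid number) pairs for inspection at random.

    total_sources     -- sequentially numbered source records in the batch
    grid_cells        -- numbered cells printed on the overlay
    sample_units      -- required sample size, taken from the sampling tables
    attrs_per_feature -- assumed equal for every feature (the shortcut above)
    """
    rng = random.Random(seed)
    features_needed = math.ceil(sample_units / attrs_per_feature)
    if features_needed > total_sources * grid_cells:
        raise ValueError("sample is larger than the available source/grid pairs")
    targets = set()
    while len(targets) < features_needed:
        # Repeated pairs are discarded so each target is inspected once.
        targets.add((rng.randint(1, total_sources), rng.randint(1, grid_cells)))
    return sorted(targets)
```

Passing a fixed `seed` makes the selection repeatable, which may help when the conversion vendor is asked to replicate the client's sampling technique.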

The Inspection phase involves the actual comparison of the source information for that feature against the corresponding feature in the converted data. The inspectors assigned to the inspection effort must be knowledgeable about the source records, the geospatial application, and the conversion rules and standards used in both. The process requires the inspectors to start with the feature selected for inspection on each source record, locate the corresponding feature in the geospatial application, and record any non-conformances on the feature inspection sheet. A sample of a feature inspection sheet is included at the end of this paper as Exhibit C. Remember also that only the selected feature is to be analyzed and recorded on the inspection sheet. If "other errors" are going to be identified during the inspection, they must be recorded separately and accounted for outside the sampling needed to achieve these results.

Many challenges can occur in this seemingly simple step. Locating features in the geospatial system is often made more difficult because the source record grids may not have been incorporated into the geospatial data. Inspection of a feature (e.g., a cable) is made more challenging because there may be many features with the same size, type, counts, etc. as the sample selected. If the source records are schematic in nature and the converted data is now geographic, additional challenges arise in trying to locate the target feature in the converted data. Processes will need to be developed during the initial batch inspections to resolve these feature location challenges.
Consolidation of Results
The final stage of the random sampling process is the consolidation of results. During this phase, someone will be responsible for tallying the non-conformances identified on each inspection sheet onto the batch summary sheet. The total of all non-conformances is calculated and then compared to the accept/reject values listed. Nonconformance totals less than or equal to the "Ac" value indicate an acceptance of the batch. Totals greater than or equal to the "Re" value indicate a rejection of the batch, and anything in between indicates that a re-sampling is required.
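The comparison against the accept and reject numbers can be written as a short Python sketch. The function name is hypothetical, and the Ac and Re values must come from the sampling tables for the batch; the numbers in the usage note are made up for illustration and are not taken from the standard.

```python
def batch_decision(sheet_totals, ac, re_number):
    """Consolidate inspection sheet tallies and decide the batch outcome.

    sheet_totals -- nonconformances recorded on each feature inspection sheet
    ac           -- accept number (Ac) recorded on the batch summary sheet
    re_number    -- reject number (Re) recorded on the batch summary sheet
    """
    if ac >= re_number:
        raise ValueError("Ac must be smaller than Re")
    total = sum(sheet_totals)
    if total <= ac:
        return "accept"
    if total >= re_number:
        return "reject"
    return "re-sample"  # the total falls strictly between Ac and Re
```

With illustrative values Ac = 5 and Re = 9, sheet totals of `[1, 0, 2]` lead to acceptance, `[4, 3]` to a re-sample, and `[6, 3]` to rejection.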

In this final step of the inspection phase, the inspectors' work is typically spot-checked by a supervisor to ensure accuracy and consistency. This is a good idea because of the depth of knowledge required across the sources and geospatial requirements; it is best to catch anything missed or incorrectly marked as a nonconformance before the final accept or reject decision is made. Also in this final step, copies of all inspection findings should be made to provide detailed feedback to the conversion vendor on the errors found. Once the results are confirmed, the notice of acceptance or rejection should be given to the conversion vendor and project team. The conversion vendor will use the inspection results to enhance their in-house QA/QC processes to better "trap" the non-conformances in future deliveries. The project team will track results of the batches in order to determine the level of sampling required in future deliveries.

Acceptance Criteria
The acceptance criteria defined for the geospatial conversion project are one of the most important factors in a successful project. Careful analysis of system requirements, source record quality, and business needs will all influence what acceptance criteria are required. The acceptance criteria will have a bearing on conversion cost and schedule, so it is important to have at least a general outline of the acceptance criteria before finalizing the contract for conversion. It is very important that the client and the conversion vendor work together to finalize the acceptance criteria to ensure that both parties fully understand the requirements and expectations of the project.

There are usually several options and alternatives available for defining acceptance criteria for a data conversion project. Typically, there could be one standard for one type of data (data relationships, data formats, etc.) and a second standard for another type of data (physical data attributes, symbology, etc.). Generally, there are different acceptance requirements depending on the nature of the data and its overall importance to the system's functions and applications. For most projects, there will be a requirement to meet 100% of the data model / relationship requirements, whereas the requirement for the capture of the physical features and attributes will be 98%. These requirements apply to the major characteristic errors and minor characteristic errors discussed earlier in this paper.

Since there are so many options available to develop comprehensive acceptance criteria to meet the needs of your project, an example of written acceptance criteria has been provided in Exhibit A to give a sample of some of the language that has been used on other successful projects. This is not an all-inclusive list and is only meant to provide a representative example. The level of detail and the amount of criteria developed will depend solely on the specific project requirements.

Exhibit A

Standard Acceptance Criteria
The following criteria shall serve as the guidelines for the acceptance of the converted facility data for the client. The acceptance percentage for the facility data is based on the total number of features/database attributes in a unit of delivery (batch). Conversion source documents refers to any maps, drawings, documents, and digital data provided by the client to be used in the conversion effort for capturing required data.

The conversion vendor is responsible for achieving a 100.0% accuracy rate for the data complying with the database model specifications and all topological/system requirements (computer checkable). These requirements are system specific and should be listed in detail. Some examples are as follows:
  1. None of the features that are defined in the database design/schema as having a database record, shall be missing a database record.
  2. None of the database records will have incorrect database relationships.
  3. Features (data elements) must correspond to their specific database record.
  4. Attributes are populated with legal value ranges.
  5. Data lies within the map extents.
  6. Edgematching tolerances are as specified.
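Two of the computer-checkable items above, data lying within the map extents and edgematching tolerances, might be scripted along the following lines. This is only a sketch in Python under stated assumptions: coordinates are plain (x, y) pairs in map units, the extent is an axis-aligned rectangle, and the tolerance value is whatever the project specifications define. The function names are hypothetical.

```python
import math


def within_extent(point, extent):
    """True if an (x, y) point lies inside or on the map extent.

    extent is (xmin, ymin, xmax, ymax) in the same units as the point.
    """
    x, y = point
    xmin, ymin, xmax, ymax = extent
    return xmin <= x <= xmax and ymin <= y <= ymax


def edgematch_ok(end_a, end_b, tolerance):
    """True if two line ends meeting at a tile edge are within tolerance."""
    return math.dist(end_a, end_b) <= tolerance
```

Because these requirements carry a 100.0% accuracy rate, such checks would be run over every feature in the delivery rather than over a sample.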
The conversion vendor is responsible for achieving a 98.0% accuracy rate for the capture of required features depicted on the conversion source document(s). Accuracy is defined as the existence, completeness, placement, and cartographic representation of the geographic features or data. If more than 2.0% of the geographic features are in error, the conversion vendor shall be responsible for correcting the errors identified by the client. If 2.0% or fewer of the geographic features are in error, then the client shall deem the delivery acceptable and be responsible for correcting those errors in house.

The conversion vendor is responsible for achieving a 98.0% accuracy rate for the capture of the required database attributes when compared to the conversion source document(s). If more than 2.0% of the database attributes are in error, the conversion vendor shall be responsible for correcting the errors identified by the client. If 2.0% or fewer of the database attributes are in error, then the client shall deem the delivery acceptable and be responsible for correcting those errors in house.

The conversion vendor is responsible for achieving a 100.0% accuracy rate for graphic cross-tile/source connectivity (edgematching) for all linear features within a delivery area. The conversion vendor is responsible for achieving a 99.0% accuracy rate for attribute cross-tile/source (edgematching) for all linear and polygon features. This excludes features that can change attribute values between tiles as defined by the data model / specifications.

When the legibility of the features and the associated annotation on the source document(s) are difficult to interpret, the conversion vendor will make every effort to capture this data correctly. However, these situations are interpretative and will not be subjected to or considered in the acceptance percentage. Errors generated due to conflicts that exist between conversion source document(s) will not be subjected to or considered in the acceptance percentage. This excludes conflicts that were resolved between the conversion vendor and the client via problem/action reports or any other documented method.

Digital data/documents provided by the client to be used by the conversion vendor in the conversion process will be considered a conversion source. Any errors contained within the client supplied data/documents, or any errors caused by the client supplied data/documents not specifically identified as being within the scope of work to find and correct, will not be subjected to or considered in the acceptance percentage. Anomalies that the conversion vendor cannot resolve shall be brought to the attention of the client and will not be subjected to or considered in the acceptance percentage.

The conversion vendor is not responsible for "errors" generated due to any proposed or pending change orders. These "errors" will not be subjected to or considered in the acceptance percentage. Change orders must be agreed to by both the conversion vendor and the client and formally executed prior to implementing the change in the conversion process. Approved change orders do not affect the error calculation formulas and specifications until the areas where the changes were implemented have been delivered to the client. The client reserves the right to reject a delivery in total if in their judgement it is unusable for the QA/QC analysis and inspection (e.g., unreadable tapes, illegible plots, or when the error rate reaches 10% before the inspection process is complete).



