Foundation of the quality assurance odyssey
Kevin Peters
Stoner Associates, P.O. Box 86, Carlisle, PA 17013-0086

Introduction

With limited resources and compressed project schedules, it is difficult for utilities and municipalities to perform quality assurance/quality control on 100 percent of converted data. Because of these constraints, quality assurance/quality control is typically performed via sampling techniques. Considering the massive investment that is tied to data, sampling alone often does not seem adequate and often results in a lack of confidence in the converted data. To help build confidence that data can be validated via sampling, it is important that the client and vendor work together to build quality into the conversion process at the beginning of a project. To ensure that this solid foundation for quality control is established, it is critical that technical conversion specifications be developed that:
Source to target conversion specifications

It is important that both the vendor and customer have a clear understanding of how source data is to be interpreted, put into the target GIS system, and displayed in the GIS. Detailed specifications covering these issues can be mutually developed and addressed through a Data Source Matrix, Source Interpretation Document, and Map Standards.

Data Source Matrix

The source of converted data for any given project can come in a variety of formats, including hard copy maps, digital CAD files, tabular data, legacy AM/FM/GIS data, or in some cases, data generated via software. In addition, different sources often depict the same entities or attributes in different ways, and quite commonly the 'same' information on two different sources varies. For example, it is not uncommon to find the same entity located in two different geographic locations on two different sources, or a given attribute for the same entity with two different values. To help organize how the source data is to be used, a Data Source Matrix should be developed. The Data Source Matrix not only defines how the source data is to be used, it also establishes a hierarchy describing which source takes priority over another. Specifically, the source matrix should:
Figure 1 shows an example of a Data Source Matrix.

The creation of the Data Source Matrix should be a mutual exercise between the organizations that are familiar with the source data and the conversion team. In order to complete the source matrix, it is important to understand not only what data exists on a source, but also how it is depicted on the source. Therefore, it should be created in conjunction with the Source Interpretation Document.

Source Interpretation Document

Whether source data exists as hard copy maps, tabular formats, or legacy GIS data, specifications must exist for defining how the source data is to be interpreted against the target data model. The Source Interpretation Document will do this, and in doing so will demonstrate to the customer, before any data is converted, that the conversion team has a solid understanding of their source data. The methods for creating the Source Interpretation Document and determining its content vary depending upon the type of source data; however, it must include a full listing of the entities, attributes, and values that exist in the target system.

For Hard Copy Source Maps

If all hard copy source maps contained detailed legends covering every symbol and piece of text on every source map, the need for a Source Interpretation Document would be diminished. However, hard copy source data typically contains a variety of graphic representations (symbols, lines, and text) for the same entity/attribute. For example, an open Valve may be represented in two ways:

Open -----X---- or -----O-----

It should be no surprise to find varying symbology on hard copy source data; after all, one of the goals of a GIS/AM/FM system is to standardize mapping symbology. The variations may be due to changes made to the map standards (legend) over time, or simply due to 'cartographers' not adhering to the map standards.
While variations within a given set of data may be minimal, the situation is compounded when data for an organization has been maintained by separate departments or record offices, and is compounded even further when the data comes from a utility company that was acquired at some point in the past. The main purpose of the Source Interpretation Document is to compile all varying source symbologies and align them to a common entity, attribute, and value (and thus a common symbol) in the target GIS/AM/FM system. This process begins by identifying sets of source maps that display unique sets of symbology. For example, all maps from a given record office may have used a common symbol set or legend. A sample of maps from each unique set should be fully reviewed. If map standards or a legend exist, they provide a good starting point. Where a unique symbol, line, text, or combination of these elements exists, it should be noted, copied, and matched to an entity/attribute/value in the target data model. In theory, the sample maps should be reviewed to the point where no new unique symbology is being found. It is important to use a photocopy from the source map itself when conveying the symbology. This ensures the full context of the source symbology is conveyed. For example, an 'X' along a Water Main might mean Closed Valve but represent an entirely different entity along a Sewer Main. In addition to a photocopy of the source symbology, any specifications or comments describing how the asset should be interpreted and converted should be included.
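The alignment of source symbology to the target model can be pictured as a lookup table. The following is a minimal sketch under stated assumptions, not a format prescribed by this paper: the map set names `record_office_a` and `record_office_b`, the `Status` attribute, the sewer-main interpretation, and the function `interpret` are all hypothetical. Each symbol is keyed by map set and host feature, so the same mark can resolve differently in different contexts, and uncatalogued symbology returns nothing so that it can be queried rather than guessed.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    """An entity/attribute/value triple in the target data model."""
    entity: str
    attribute: str
    value: str


# Key: (source map set, host feature, source symbol).
# Every entry here is illustrative, not taken from a real project.
SYMBOL_TABLE = {
    # Two source depictions that align to one target value.
    ("record_office_a", "Water Main", "X"): Target("Valve", "Status", "Open"),
    ("record_office_a", "Water Main", "O"): Target("Valve", "Status", "Open"),
    # Another map set uses 'X' for a closed valve.
    ("record_office_b", "Water Main", "X"): Target("Valve", "Status", "Closed"),
    # Hypothetical: the same mark on a sewer main is a different entity.
    ("record_office_b", "Sewer Main", "X"): Target("Cleanout", "Type", "Standard"),
}


def interpret(map_set: str, host: str, symbol: str) -> Optional[Target]:
    """Return the target triple, or None when the symbol is not catalogued."""
    return SYMBOL_TABLE.get((map_set, host, symbol))
```

A `None` result is the point at which a source query would be raised instead of a best guess being entered.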
Figure 2 shows an example of a Source Interpretation Document for a Hard Copy Map.

Side Note: There is an additional benefit to performing this exercise: it can assist in determining the map standards for the GIS/AM/FM system. A good start in determining what standardized symbology should exist in the new system begins with knowing exactly what is currently being used for manual mapping.

For Tabular Source Data

Interpretation of tabular data, whether from hard copy, a database, or comma-delimited files, is less complicated than interpretation of maps but is still important and should be documented. Evaluating tabular data typically consists of a column mapping from the tabular data to the target entities, attributes, and values, along with any specifications or migration rules, such as parsing of fields, that might pertain to the target system. Depending on the amount of data that is coming from tabular sources, it could be included in the mapping document or source matrix. For example,
For Legacy GIS Source Data

Where source data exists in a legacy GIS/AM/FM system, the source interpretation document typically evolves into a data mapping document where the entities, attributes, and values from the legacy system are mapped to the target system. However, the source interpretation goes much further and also must cover the migration of graphic elements, including x and y coordinates, as well as connectivity.

Map Standards Document

Prior to starting conversion, it is important to have a clear understanding of what the converted data should look like graphically. Specifically, there is a need to define the symbols, lines, and text that will be used to represent the various combinations of features, attributes, and values. The size of the symbols/text and the thickness of lines must also be defined. In addition, the frequency of annotation placement must be specified. The sum of these specifications results in a Map Standard, much of which is typically determined by or becomes the target system's rulebase. A Map Standard such as the one in Figure 3 compiles the graphic information and allows for an easier understanding of what the final map product should look like.
Figure 3: Map Standards Example

Mechanism for resolving source/specification queries

No matter how much detail is established through the source matrix and source interpretation document, the unknown always arises: there are bound to be unique instances of symbology on the source data that are not covered by, or perhaps conflict with, the specifications in the Source Interpretation Document. Another typical scenario occurs when a source document is simply not legible. These conflicts essentially represent occurrences outside the reasonable constraints of the Source Matrix/Source Interpretation Documents. The customer will feel more comfortable with the delivered data knowing that 'guessing' is not involved in handling source anomalies, but rather that they are being queried and resolved. Therefore, a mechanism for resolving these source conflicts, and updating the specifications with the resolution, should be part of the planning process.

Source conflicts are raised by the team performing the conversion. That team provides queries to the experts at the utility who best understand the source data. Those queries are answered by the source expert and returned to the conversion team. In this process, it is important that all parties use the common terminology of the target data model, since the use of 'local' terminology often leads to confusion. For example, if the model contains specific types of valves such as Flow Regulating Valve, Pressure Regulating Valve, or Control Valve, it is important that both the query and resolution address exactly what type of valve is in question. Otherwise, the answer might refer to an 'isolation valve', leaving the recipient guessing as to exactly which valve is being discussed. Ideally, all specification and source queries should be logged in a database, by entity, attribute, and value where possible.
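One way to picture such a query log is sketched below using Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table name `query_log`, its columns, and the helper functions are hypothetical and are not a schema described in this paper. Queries are keyed by entity, attribute, and value in the terminology of the target data model, so that earlier queries on the same item can be looked up.

```python
import sqlite3


def open_query_log() -> sqlite3.Connection:
    """Create an in-memory query log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE query_log (
               id INTEGER PRIMARY KEY,
               entity TEXT NOT NULL,       -- target-model entity, e.g. Valve
               attribute TEXT NOT NULL,    -- target-model attribute, e.g. Size
               attr_value TEXT NOT NULL,   -- value in question
               source_ref TEXT NOT NULL,   -- map or document that raised it
               question TEXT NOT NULL,
               resolution TEXT             -- NULL until the source expert answers
           )"""
    )
    return conn


def log_query(conn, entity, attribute, attr_value, source_ref, question):
    """Record a new query and return its id."""
    cur = conn.execute(
        "INSERT INTO query_log (entity, attribute, attr_value, source_ref, question) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity, attribute, attr_value, source_ref, question),
    )
    return cur.lastrowid


def find_existing(conn, entity, attribute, attr_value):
    """Return earlier queries on the same entity/attribute/value, oldest first."""
    return conn.execute(
        "SELECT id, source_ref, question, resolution FROM query_log "
        "WHERE entity = ? AND attribute = ? AND attr_value = ? ORDER BY id",
        (entity, attribute, attr_value),
    ).fetchall()


def resolve(conn, query_id, resolution):
    """Store the source expert's answer against a logged query."""
    conn.execute(
        "UPDATE query_log SET resolution = ? WHERE id = ?",
        (resolution, query_id),
    )
```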
Logging queries this way helps ensure that the same query is not asked twice (perhaps from a different map or source material), and it organizes the queries and resolutions in a manner that facilitates updates to future versions of the conversion specifications.

Specifications for addressing source queries should also specify a time frame for their resolution. Many a project schedule has suffered due to delays in answering queries. If turnaround time becomes a project risk, it might be wise to implement defaults for specific types of queries, which keep production moving once a fixed response period has expired. These predefined, mutually agreed defaults could be determined by examining past queries and their associated resolutions. For example, if a common query was that the size of a valve was not shown on the source, and the answer to that query nine out of ten times was "set it to 4", then that answer could be used as the default. The ability to track and identify these query/resolution trends is yet another benefit of managing them through a database.

Acceptance Criteria

A lot of attention is focused on defining the acceptance criteria with regard to the percent accuracy that should be met. However, many RFPs and proposals refer to meeting different categories of accuracy without fully defining them or discussing the means by which they should be measured. A set of acceptance criteria should be established that defines not only the error percentage by category, but also how errors in each category will be determined and how they will be measured. The methodology for measuring an error percentage depends heavily on the category of accuracy being measured. Accuracy requirements for categories that are set at 100 percent typically only need a definition.
For example, if the expectation is that connectivity and programmatically derived data must always be correct, then measuring those against the number of attributes or entities is not critical. However, if an accuracy requirement for a category is less than 100 percent, it becomes critical for the measurement approach to be defined. For example, completeness and attribute and location accuracy might be measured as follows:

Attribute Accuracy

This measurement criterion is based on the total number of attribute errors found divided by the total number of possible attributes for those entities that are sampled. An attribute error is recorded when any attribute value for a given entity is not correct as reflected in the source and conversion specifications. X percent of all attributes must correspond to the value that is reflected in the source, conversion specifications, etc.

Positional Accuracy

This measurement criterion is based on the number of graphic features whose symbols or lines have positional errors divided by the total number of graphic features having symbols or lines that are contained in a given sample. A positional accuracy error will be recorded when any symbol or line is not correctly located as reflected in the position shown on the source or in the Map Standard placement rules. X percent of all sampled graphic features with symbols or lines must correspond to the locational placement rules as defined in the Map Standards and/or the relative location reflected on the source document or material.

Entity Completeness Accuracy

This measurement criterion is based on the number of entities that are found to be missing (or extra) divided by the total number of entities that were sampled. A completeness error will be recorded when any entity depicted on the source document or material is not reflected in the data. The count of the entities captured must be X percent of the number of entities in the sample.
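Each of the three measurements above is an error count divided by a sample total, which can be sketched as follows. This is a minimal illustration, not a prescribed formula: the function names are hypothetical, the sample counts are invented, and the 98 percent threshold simply stands in for the unspecified "X percent".

```python
def accuracy_pct(errors: int, total: int) -> float:
    """Percent accuracy for one category: 100 * (1 - errors / total).

    `total` is the denominator defined for the category: possible attributes
    (attribute accuracy), sampled graphic features (positional accuracy),
    or sampled entities (completeness).
    """
    if total <= 0:
        raise ValueError("the sample must contain at least one item")
    if errors < 0:
        raise ValueError("the error count cannot be negative")
    return 100.0 * (1 - errors / total)


def meets_criterion(errors: int, total: int, required_pct: float) -> bool:
    """True when the measured accuracy reaches the agreed percentage."""
    return accuracy_pct(errors, total) >= required_pct


# Hypothetical sample: 200 entities with 10 attributes each, 30 attribute errors.
attribute_accuracy = accuracy_pct(errors=30, total=200 * 10)

# Hypothetical sample: 4 missing or extra entities out of 200 sampled.
completeness_ok = meets_criterion(errors=4, total=200, required_pct=98.0)
```

With these invented counts, attribute accuracy works out to 98.5 percent, and the completeness sample sits exactly on the assumed 98 percent threshold.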
All of these accuracy definitions must delineate the precision of measurement down to the number of attributes/entities. Further precision could be given by adding weighting factors to the entities/attributes, so that more critical items carry more weight in the score. For example, if it is more critical that the ownership of an asset be correct than its installation date, then the acceptance criteria could assign the ownership attribute a weight of 1.5 and the installation date a weight of 0.5.

Data Test Plan

With all necessary conversion specifications in place, including the Source Matrix, Source Interpretation Document, Map Standards, and Acceptance Criteria, the next step is to develop a plan for testing all of the converted data against those specifications. With a solid Data Test Plan, the customer can be assured that data is tested thoroughly. The Data Test Plan does that by charting the strategy for testing data. That includes:
Typically, this means a row for physical location, connectivity, completeness and each attribute. For each one of these entity characteristics, one or more of the QC check columns is noted, indicating a test for data quality must be performed on that particular characteristic. Sorting entities and attributes by QC phase provides a list of all checks for that phase. This process also shows how many individual entity characteristics are checked during different phases of QC, producing a desired redundancy where necessary and bolstering the comprehensiveness of the QC process.
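The matrix described above can be pictured as a mapping from each entity characteristic to the QC phases that check it. The sketch below is illustrative only: the phase names `Visual check` and `Automated check`, the sample rows, and the helper functions are hypothetical and do not reproduce Figure 4.

```python
from collections import defaultdict

# Rows: (entity, characteristic). Values: QC phases that test that row.
# Phase names and row contents are invented for illustration.
TEST_PLAN = {
    ("Valve", "Physical location"): {"Visual check"},
    ("Valve", "Connectivity"): {"Automated check"},
    ("Valve", "Completeness"): {"Visual check", "Automated check"},
    ("Valve", "Size"): {"Visual check", "Automated check"},
    ("Valve", "Status"): {"Visual check"},
}


def checks_by_phase(plan):
    """Group rows by QC phase, giving the list of checks for each phase."""
    by_phase = defaultdict(list)
    for row, phases in plan.items():
        for phase in phases:
            by_phase[phase].append(row)
    return {phase: sorted(rows) for phase, rows in by_phase.items()}


def redundant_rows(plan):
    """Rows checked in more than one phase (the desired redundancy)."""
    return sorted(row for row, phases in plan.items() if len(phases) > 1)


def untested_rows(plan):
    """Rows with no QC phase assigned, i.e. gaps in the test plan."""
    return sorted(row for row, phases in plan.items() if not phases)
```

Listing untested rows is an addition to what the text describes, but it falls out of the same structure and makes gaps in coverage visible before validation starts.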
Figure 4: Test Plan Matrix Example

Error Tracking

With a set of specifications, acceptance criteria to measure against, and a strategy for performing the QC, the validation process can begin. However, there needs to be a mechanism for tracking the errors that are found during data validation. This can be accomplished with an Error Tracking database, which can achieve two goals:
To help ensure consistency in reporting errors, the error tracker should also include standardized error descriptions and formats where possible. For example, problems pertaining to annotation can typically be described in one of the following manners: Missing, Extra, Incorrect Rotation, Overstriking (other symbology) or Incorrect Location. Attribute errors might be recorded in a standard format such as 'Entity.Attribute Name should be ___ instead of __'; for example, Valve.Size should be 6" instead of 4".
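The standardized descriptions above lend themselves to small helper functions that build every error message the same way. The following is a sketch under stated assumptions: the `ErrorRecord` fields, the category names `Attribute` and `Annotation`, and the function names are hypothetical and are not the contents of the Error Tracking database shown in Figure 5.

```python
from collections import Counter
from dataclasses import dataclass

# Standard annotation error descriptions listed in the text.
ANNOTATION_ERRORS = {
    "Missing",
    "Extra",
    "Incorrect Rotation",
    "Overstriking",
    "Incorrect Location",
}


@dataclass(frozen=True)
class ErrorRecord:
    category: str      # hypothetical category label, e.g. "Attribute"
    description: str   # standardized description text


def attribute_error(entity: str, attribute: str, expected: str, found: str) -> ErrorRecord:
    """Build 'Entity.Attribute should be ___ instead of __'."""
    text = f"{entity}.{attribute} should be {expected} instead of {found}"
    return ErrorRecord("Attribute", text)


def annotation_error(kind: str) -> ErrorRecord:
    """Accept only the standardized annotation descriptions."""
    if kind not in ANNOTATION_ERRORS:
        raise ValueError(f"non-standard annotation error: {kind!r}")
    return ErrorRecord("Annotation", kind)


def count_by_category(records) -> Counter:
    """Error counts per category, ready to compare with acceptance criteria."""
    return Counter(record.category for record in records)
```

For instance, `attribute_error("Valve", "Size", '6"', '4"')` yields the description `Valve.Size should be 6" instead of 4"`, matching the format in the text.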
Figure 5: Information in Error Tracking Database

A standardized approach to tracking errors, especially through a database, will not only allow for consistency in recording errors but will also make calculating accuracy much easier. The number and type of errors that are found can easily be inserted into a formula to determine whether each error category in the acceptance criteria is met.

Concluding Remarks

The conversion of data from a variety of sources can be a very complex and expensive process. The complete conversion process certainly extends far beyond the technical specifications mentioned in this paper; it will also include conversion procedures, quality control procedures, quality control software design, etc. The information described here points to the importance of key technical conversion specifications and offers some suggestions for how they can be defined. Establishing these specifications at the beginning of the conversion process is as crucial to project success as the procedural aspects of the conversion. By defining how to translate the source data into the target system, what the data in the target system should look like, what the expectation is with regard to accuracy, how the data will be tested to ensure that accuracy, and how errors will be recorded during testing, a utility or agency can create a solid foundation upon which data conversion procedures can be built. Building the foundations of quality into the conversion process in this manner, through client/vendor teamwork, will help ensure that the significant investment associated with creating, converting, and maintaining data is a sound one, and it will build confidence that a 100 percent check of the data may not be necessary.
© GISdevelopment.net. All rights reserved.