GISdevelopment.net: GITA 2003, Data Management - The Evolution of Data

Data quality control in a GIS project

Maria Navarro
Intergraph Utilities & Communications
Mailstop: LR23B2
241 Disk Drive
Madison, Alabama 35758


Abstract
This presentation emphasizes the quality of data, which is one of the most important elements in an organization. When companies purchase sophisticated technology solutions that provide analyses for better decision-making, the acquisition of data (spatial and alphanumeric) can determine whether a project succeeds or fails, because of false expectations or the lack of clear rules for acquiring data that meets the company’s goals. This paper offers some recommendations for establishing data quality control standards to achieve those goals.

Complex and sophisticated tools can help in acquiring high quality data, but company rules and practices also need to be established to determine if the data will meet its requirements.

A search of the sources and types of data will identify the different types of error, which will help to define the standards.

It is important to define and tune workflows that ensure the quality of the data, and to test whether these standards will meet the project’s requirements. The experience and knowledge of the people involved in the project are very valuable for obtaining quality data and helping the project succeed.

A system needs not only to provide good data, but also to integrate it and produce useful information at different levels of the organization.

Data for decision-making
We live in a world that is constantly changing and companies invest in sophisticated technology to capture, store, manipulate, analyze, produce, present and integrate data in order to plan, model and make better decisions about complex problems or phenomena which occur at a certain time and location.

Since phenomena are interrelated with everything, they need to be decomposed into significant parts or units (data) in order to be quantified, analyzed, observed and studied. Each unit combines space, time and characteristics or attributes.

The quality of data will be measured in terms of how accurately it can represent or model phenomena; for example, facility data in a utility company, with all its distribution and transmission infrastructure, customer database, etc.

Complex and sophisticated tools can help in acquiring high quality data, but company rules and practices also need to be established to determine if the data will meet its requirements. Accurate data becomes an important factor that supports decision-makers at every level of the organization: strategic, tactical and operational.

The quality of the data will depend on many different factors, and companies need to make wise decisions in order to set the correct standards: standards that yield good data that meets the company’s goals, but that do not hold up the project because of unnecessarily high expectations.

Once a company decides to invest in acquiring new data or updating existing data, many decisions need to be made: a) Technology: are the existing tools sufficient? b) Human resources: are the people in the organization prepared for the project? If not, what needs to be done to remedy this? c) Time frame: when is the data needed? According to Flowerdew and Chrisman (1991), these are some of the questions that need to be asked:

What is the data for?
What type of data?
What are the sources of the data?
How accurate is the data?
What place and time does the data refer to?
What is the minimal unit of spatial and alphanumeric data?
Will the data be interchanged with other systems?
What is the scope of the project?
What are the costs/benefits?
Will the data also be used in other applications?

The quality of data can vary depending on the answers to the above questions. According to Chrisman’s (1983) definition, data quality is based on “fitness for use”; that is to say, each company defines its own quality based on its own standards and expectations. In order to have clear data standards, the sources and types of errors need to be identified. This makes it easier to define the acceptable quality limits that characterize good data, especially when the data comes from different sources, such as subcontractors, other agencies, aerial photography, analog data such as maps, and digital data from other systems.

Spatial and non-spatial (or attribute) data should also be thought of in terms of its ultimate purpose, as well as the types of systems that it will be integrated into and the expected results. For example: which of the data produced in the GIS will be used to feed another application, and which of the data needed by the GIS could be obtained from other systems?

A good GIS should be able to integrate data from different sources and present it in the form of reports, maps, displays and digital formats, among others, to be used by all the departments in the organization. According to Shepherd (1991), geographical data is especially difficult to integrate because it contains various kinds of inconsistencies, such as: a) variations in resolution, b) differences in the definition of data units, for example political boundaries, c) variations in the use of terminology, d) the information is generally collected at different points in time, e) there is always a human factor, for example differences in interviewing technique, interpretation or observation, and f) the information can be stored in different formats.

The purpose of integrating data from different sources is to provide information that can be communicated to all the different levels of the organization. Standards must be clear for locational and non-locational data. Rules and procedures should be set for data extraction and representation; in addition, according to Guptill (1991), a complete specification should be created for each feature of interest. The best way to accomplish this is to create a template with the definition of the feature, including the type of spatial data that will be used to represent it at a certain level of resolution. For example, a line could be used by an electric company to represent a conductor, and the style of the line, such as its color, type or even width, could represent one or more of its attributes, such as the status (e.g. proposed, operational, removed) or the voltage. For certain attributes, such as size, material or manufacturer, a list of valid values should also be set in order to have better quality control. Validation rules, such as domains and constraints that accept only values from reference lists, should be created, and databases make that possible.
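The feature template described above can be illustrated with a minimal Python sketch. The class name `FeatureSpec`, its fields, and the material values are hypothetical and chosen only for illustration; the status values are the ones mentioned in the text. The sketch assumes a record is simply a geometry type plus a dictionary of attributes.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """Hypothetical template describing one feature of interest."""
    name: str
    geometry_type: str                            # e.g. "line", "point", "polygon"
    domains: dict = field(default_factory=dict)   # attribute -> set of valid values

    def validate(self, geometry_type, attributes):
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        if geometry_type != self.geometry_type:
            problems.append(
                f"geometry must be {self.geometry_type}, got {geometry_type}"
            )
        for attr, allowed in self.domains.items():
            if attr not in attributes:
                problems.append(f"missing attribute: {attr}")
            elif attributes[attr] not in allowed:
                problems.append(
                    f"{attr}={attributes[attr]!r} is not in the reference list"
                )
        return problems


# A conductor is represented by a line; each attribute has a reference list.
CONDUCTOR = FeatureSpec(
    name="conductor",
    geometry_type="line",
    domains={
        "status": {"proposed", "operational", "removed"},
        "material": {"copper", "aluminum"},   # illustrative values only
    },
)

good = CONDUCTOR.validate("line", {"status": "operational", "material": "copper"})
bad = CONDUCTOR.validate("point", {"status": "planned"})
```

Here `good` is an empty list, while `bad` reports the wrong geometry type, the invalid status and the missing material. Returning every problem at once, rather than stopping at the first, gives the operator a complete picture of what has to be fixed in a record.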

The creation of a consistent and standardized spatial model in the end is a manual operation. That is to say, after using all the diverse and automatic tools to clean and make the data uniform, the final tuning requires the eye and hand of an expert, who must make the final adjustments to resolve any data inconsistencies.

Standards
There are three main phases in a GIS project: a) compilation of data: in this stage the company needs to review the data that will be collected, based on the type of analysis and results that will be needed. Standards will play an important role in this phase to determine if the data will be useful. Generally, compilation is the longest phase and will continue for nearly the entire duration of the project; b) analysis of the data in order to produce information for planning and decision-making. The results produced will mostly depend on the quality of the data collected. Once the first results are in, the company should review them to see if any standards need to be redefined, and apply the new rules as soon as possible; c) monitoring: for example, an electric company will want to inspect its facilities periodically.

When a company decides to invest in a project that involves the acquisition of new data, it should determine if standards exist inside the company and if these comply with the scope of the project.

In a GIS project, data standards should be considered at different levels (Chrisman 1991); they need to be based on the following: a) positional accuracy, b) attribute accuracy, c) logical consistency, d) timeframe, e) representation, f) abstraction, g) selection/completeness as well as h) integration and i) presentation of the information. Databases have played a growing role in the homogenization and integration of data in GIS applications. They provide (Healey 1991): a) a good way to store data; b) a standardized and consistent way to input and update data; c) a secure environment in which restrictions can be applied for viewing and modifying data; and d) a method for managing various users.
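As one possible illustration of a database providing a standardized and consistent way to input data, the following sketch uses Python's standard `sqlite3` module with an in-memory database. The table and column names (`conductor`, `material`, `status`) and the material values are assumptions made for this example; the paper does not prescribe a schema or a particular database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Reference list of valid materials (illustrative values only).
    CREATE TABLE material (name TEXT PRIMARY KEY);
    INSERT INTO material VALUES ('copper'), ('aluminum');

    CREATE TABLE conductor (
        id       INTEGER PRIMARY KEY,
        status   TEXT NOT NULL
                 CHECK (status IN ('proposed', 'operational', 'removed')),
        material TEXT NOT NULL REFERENCES material(name)
    );
""")


def try_insert(status, material):
    """Insert one conductor; return True if the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO conductor (status, material) VALUES (?, ?)",
                (status, material),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("operational", "copper")
bad_status = try_insert("planned", "copper")       # violates the CHECK constraint
bad_material = try_insert("operational", "wood")   # not in the reference list
row_count = conn.execute("SELECT COUNT(*) FROM conductor").fetchone()[0]
```

Only the first insert succeeds, so `row_count` is 1. Placing the rules in the database means every tool and operator that writes to the table is held to the same standard, regardless of how the data was captured.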

For a project to be successful, two more considerations are necessary: a) workflow: it is important to standardize the various workflows required to acquire data. These need to be clear at every level, inside and outside of the organization (e.g. contractors), and should be included as part of the specifications for designing, evaluating and testing, making sure that the data will meet project requirements. A pilot project will help validate the methods for reviewing, correcting and rejecting data. It will also give an idea of which types of errors can easily be corrected through automated operations, and whether rejecting the data would be more expensive than correcting it. Exceptions should be documented, as well as how to handle them; and b) the human factor: the interpretation and participation of the operator is critical. Workshops should be held periodically in order to promote the exchange of experience and knowledge among project participants, to answer questions and to make sure that the methodology is correct and that the objectives are clear. Based on operator experience, the methods and standards can be tuned or changed at the different stages of the project. Even though this may appear to be costly, it will help prevent misinterpretations as well as delays caused by having to redo work. It is just as important to invest in human resources as in technology.
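The review, correct and reject steps of such a workflow might look like the following Python sketch. The function `review`, the record layout and the single rule it applies (normalizing the spelling of a status value) are hypothetical; a real project would derive its rules from its own standards and from the results of the pilot.

```python
VALID_STATUS = {"proposed", "operational", "removed"}


def review(record):
    """Classify one record as 'accept', 'correct' or 'reject'.

    Returns (decision, record, note). For 'correct', the returned record
    is a fixed copy; the original is left untouched.
    """
    status = record.get("status")
    if status is None:
        return "reject", record, "status missing; send back to the source"
    normalized = status.strip().lower()
    if normalized not in VALID_STATUS:
        return "reject", record, f"unknown status {status!r}"
    if normalized != status:
        fixed = dict(record, status=normalized)
        return "correct", fixed, "status normalized automatically"
    return "accept", record, ""


batch = [
    {"id": 1, "status": "operational"},
    {"id": 2, "status": " Proposed "},   # fixable by an automated operation
    {"id": 3, "status": "planned"},      # not in the reference list
    {"id": 4},                           # attribute missing entirely
]

results = [review(r) for r in batch]
decisions = [decision for decision, _, _ in results]

# Every record that was not accepted as-is is logged, so exceptions are documented.
exceptions = [
    (rec["id"], note) for decision, rec, note in results if decision != "accept"
]
```

For this batch the decisions are accept, correct, reject and reject. Counting how many records fall into each category during a pilot gives a rough basis for judging whether automated correction is cheaper than rejection.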

Summary
Standards for good data need to exist or be defined according to the company’s rules and practices, and should take into account the various sources and types of data, integration with other systems, the available technology and the experience and knowledge of the human resources involved, all toward achieving the primary goal: providing integrated and useful information at every level of the enterprise to facilitate planning and to allow better decision-making.

References
  • Chrisman N.R. (1991). “The error component in spatial data”. Geographical Information Systems. Volume 1: Principles, edited by Maguire D.J., Goodchild M.F., and Rhind D.W. pp. 165-174.
  • Flowerdew R. (1991). “Spatial data integration”. Geographical Information Systems. Volume 1: Principles, edited by Maguire D.J., Goodchild M.F., and Rhind D.W. pp. 375-387.
  • Healey R.G. (1991). “Database management systems”. Geographical Information Systems. Volume 1: Principles, edited by Maguire D.J., Goodchild M.F., and Rhind D.W. pp. 251-265.
  • Navarro Maria D.C., Legorreta Gabriel (1998). “Sistemas de Informacion Geografica. Teoria introductoria y ejercicios con AutoCad e Idrisi”. pp. 55-58.
  • Shepherd I.D.H. (1991). “Information Integration and GIS”. Geographical Information Systems. Volume 1: Principles, edited by Maguire D.J., Goodchild M.F., and Rhind D.W.