Data quality control in a GIS project
Data for decision-making
We live in a world that is constantly changing and companies invest in sophisticated
technology to capture, store, manipulate, analyze, produce, present and integrate data in
order to plan, model and make better decisions about complex problems or phenomena
which occur at a certain time and location.
Because phenomena are interrelated, they need to be decomposed into significant parts or
units (data) that combine space, time and characteristics or attributes before they can
be quantified, analyzed, observed and studied.
The quality of data is measured in terms of how accurately it can represent or model
phenomena; for example, the facility data of a utility company, with all its distribution
and transmission infrastructure, customer database, etc.
Complex and sophisticated tools can help in acquiring high-quality data, but company
rules and practices also need to be established to determine whether the data meets its
requirements. Accurate data is an important factor in supporting decision-makers at
every level of the organization: strategic, tactical and operational.
The quality of the data depends on many different factors, and companies need to
make wise decisions when setting standards, so that the data is good enough to meet
the company’s goals without holding up the project because of unnecessarily high
expectations.
Once a company decides to invest in acquiring new data or updating existing data, many
decisions need to be made: a) Technology: are the existing tools sufficient? b) Human
resources: are the people in the organization prepared for the project? If not, what needs
to be done to remedy this? c) Time frame: when is the data needed? According to Flowerdew
and Chrisman (1991)*, some of the questions that need to be asked are:
What is the data for?
What type of data?
What are the sources of the data?
How accurate is the data?
What place and time does the data refer to?
What is the minimal unit of spatial and alphanumeric data?
Will the data be interchanged with other systems?
What is the scope of the project?
What are the costs/benefits?
Will the data also be used in other applications?
The quality of data can vary depending on the answers to the above questions. According
to Chrisman’s (1983) definition, data quality is based on “fitness for use”; that is to say,
each company defines its own quality based on its own standards and expectations.
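
One way to picture “fitness for use” is as a comparison between what was measured about a
dataset and the limits that a particular use places on it. The sketch below is only an
illustration: the class names, the three quality measures and every number in it are
assumptions made for the example, not values taken from Chrisman or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityRequirement:
    """Hypothetical limits that one use case places on a dataset."""
    max_positional_error_m: float  # largest acceptable positional error, in metres
    max_age_days: int              # oldest acceptable data, in days
    min_completeness: float        # required share of populated attributes, 0..1


@dataclass(frozen=True)
class DatasetQuality:
    """Hypothetical quality measures recorded for one dataset."""
    positional_error_m: float
    age_days: int
    completeness: float


def unmet_requirements(measured, required):
    """Return the names of the requirements that the dataset fails to meet."""
    problems = []
    if measured.positional_error_m > required.max_positional_error_m:
        problems.append("positional accuracy")
    if measured.age_days > required.max_age_days:
        problems.append("currency")
    if measured.completeness < required.min_completeness:
        problems.append("completeness")
    return problems


def is_fit_for_use(measured, required):
    """A dataset is fit for a given use when no requirement is unmet."""
    return not unmet_requirements(measured, required)


# Illustrative values only: the same dataset judged against two different uses.
survey = DatasetQuality(positional_error_m=2.0, age_days=200, completeness=0.95)
planning = QualityRequirement(max_positional_error_m=5.0, max_age_days=365,
                              min_completeness=0.90)
design = QualityRequirement(max_positional_error_m=0.5, max_age_days=90,
                            min_completeness=0.99)

print(is_fit_for_use(survey, planning))
print(unmet_requirements(survey, design))
```

The same dataset can pass one set of limits and fail another, which is the sense in which
quality is defined by the user rather than by the data alone.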
In order to have clear data standards, the sources and types of errors need to be identified.
This makes it easier to define acceptable quality limits, especially when the data comes
from different sources, such as subcontractors, other agencies, aerial photography,
analog data such as maps, and digital data from other systems.
Spatial and non-spatial (or attribute) data should also be thought of in terms of its
ultimate purpose, the types of systems that it will be integrated into and the expected
results. For example, which of the data produced in the GIS will be used to feed another
application, and what does the GIS need that could be obtained from other systems?
A good GIS should be able to integrate data from different sources and present it as
reports, maps, displays and digital formats, among others, to be used by all the
departments in the organization. According to Shepherd (1991)*, geographical data is
especially difficult to integrate because it contains various kinds of inconsistencies, such
as: a) variations in resolution, b) differences in the definition of data units, for example
political boundaries, c) variations in the use of terminology, d) the information is
generally collected at different points in time, e) there is always a human factor, for
example differences in interviewing technique, interpretation or observation, and f) the
information can be stored in different formats.
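
Two of these inconsistencies, variations in terminology and differences in storage format,
lend themselves to a small illustration. The sketch below maps the terms used by different
sources onto one agreed vocabulary and reads dates written in several layouts. The synonym
table and the list of date layouts are assumptions made for the example; a real project
would build them from its own sources.

```python
from datetime import datetime

# Hypothetical mapping from terms found in source data to the agreed vocabulary.
STATUS_SYNONYMS = {
    "proposed": "proposed",
    "planned": "proposed",
    "operational": "operational",
    "in service": "operational",
    "removed": "removed",
    "retired": "removed",
}

# Hypothetical date layouts used by the different sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_status(raw):
    """Translate a source term into the agreed term, or reject it."""
    key = raw.strip().lower()
    if key not in STATUS_SYNONYMS:
        raise ValueError(f"unknown status term: {raw!r}")
    return STATUS_SYNONYMS[key]


def parse_collection_date(raw):
    """Try each known layout in turn and return a datetime.date."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # not this layout, try the next one
    raise ValueError(f"unrecognized date layout: {raw!r}")


def harmonize(record):
    """Return a copy of a source record expressed in the agreed terms."""
    return {
        "status": normalize_status(record["status"]),
        "collected": parse_collection_date(record["collected"]),
    }


print(harmonize({"status": "In Service", "collected": "05/01/2020"}))
```

Terms and layouts that are not in the tables raise an error instead of being guessed, so
that unexpected values are passed to a person for review.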
The purpose of integrating data from different sources is to provide information that can
be communicated to all the different levels of the organization. Standards must be clear
for locational and non-locational data. Rules and procedures should be set for data
extraction and representation and, according to Guptill (1991), a complete
specification should also be created for each feature of interest. The best way to
accomplish this is to create a template with the definition of the feature, including the
type of spatial data that will be used to represent it at a certain level of resolution. For
example, a line could be used by an electric company to represent a conductor, and the
style of the line, such as its color, type or width, could represent one or many of its
attributes, such as the status (e.g. proposed, operational, removed) or the voltage. For
certain attributes, such as size, material or manufacturer, a list of valid values should
also be set for better quality control; databases make it possible to create validation
rules, such as domains and constraints, that accept only values from these reference
lists.
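
A feature template of this kind can be sketched as a small data structure that holds the
geometry type, the reference lists (domains) and the symbology for one feature. The
conductor example and its status values come from the text above; the class and function
names, the material list and the line styles are hypothetical and are there only to make
the sketch complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Template for one feature of interest (hypothetical structure)."""
    name: str
    geometry_type: str   # spatial data type used to represent the feature
    domains: dict        # attribute name -> set of valid values
    status_styles: dict  # status value -> line style used on the map


CONDUCTOR_SPEC = FeatureSpec(
    name="conductor",
    geometry_type="line",
    domains={
        "status": frozenset({"proposed", "operational", "removed"}),
        "material": frozenset({"copper", "aluminum"}),  # assumed reference list
    },
    status_styles={  # assumed symbology
        "proposed": "dashed",
        "operational": "solid",
        "removed": "dotted",
    },
)


def validate_record(record, spec):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if record.get("geometry_type") != spec.geometry_type:
        errors.append(f"geometry must be {spec.geometry_type!r}")
    for attribute, allowed in spec.domains.items():
        value = record.get(attribute)
        if value not in allowed:
            errors.append(f"{attribute}: {value!r} is not in the reference list")
    return errors


def line_style_for(record, spec):
    """Look up the line style that represents the record's status."""
    return spec.status_styles[record["status"]]


record = {"geometry_type": "line", "status": "operational", "material": "copper"}
print(validate_record(record, CONDUCTOR_SPEC))
print(line_style_for(record, CONDUCTOR_SPEC))
```

In a production system the same reference lists would normally be enforced by the database
itself through domains or constraints, so that invalid values are rejected wherever the
data is edited.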
In the end, the creation of a consistent and standardized spatial model is a manual
operation. That is to say, after all the diverse automatic tools have been used to clean
the data and make it uniform, the final tuning requires the eye and hand of an expert,
who must make the final adjustments to resolve any remaining data inconsistencies.