GISdevelopment.net ---> GITA 2000 ---> The best of the rest

Data quality: Defining an achievable standard

Tony S. Holmwood
Brown & Root Services, Halliburton Brown & Root
Hill Park Court, Leatherhead, Surrey KT22 7NL
United Kingdom


Background

'Data is the Cinderella of system implementation' - Why so?
It is generally accepted that 50-85% of the cost of building and implementing an asset management/GIS system lies in the provision of data. Yet population of a new system's database is seldom considered until late in the day. Data collection and data conversion are not as interesting as hardware and software. The requirement for data is seldom thought about until the project has been under way for some time and the major decisions on platform and functionality have been made. The inter-related issue of Data Quality is often ignored completely.

However, the issue of Data Quality is taking on increasing importance. As the performance of large public service companies and authorities gains a higher public profile, so management is forced to implement better systems with which to manage their infrastructure. The demand by the regulatory authorities that the major utility companies achieve greater efficiency and better safety is forcing an increased understanding of the assets which make up the backbone of their business. This in turn is placing ever greater emphasis on their asset management and GIS systems. In the United Kingdom the issue of public and employee safety has become a major subject of concern and debate.

Both the regulatory authorities and company managers at board level are increasingly aware that safety can only be delivered where information about the maintenance, performance and condition of the assets is accurately recorded and readily available. Therefore the issue of Data Quality is becoming increasingly important.

Purpose
It is a business requirement to have the GIS database populated with data of a known and agreed quality.

The purpose of this paper is to raise the profile of Data Quality, and in so doing to consider some of the questions which arise when the subject is considered in any depth. The basic proposition that the paper addresses is stated above.

The paper sets out to describe an approach to Data Quality which has been developed within the United Kingdom Utilities, Telecommunications and Rail industries. However, elements of the approach can be applied to any situation where it is necessary to have a quantified basis for data quality. In particular the primary role of the Data Quality Strategy at the early stages of a project is emphasised.

The ideas put forward here are not intended to be prescriptive. Each business and project needs to address the issue of data quality in the light of its own circumstances, business objectives and constraints. However, it is hoped that elements of the approach discussed in this paper can be modified and expanded to meet individual needs. An important objective of this paper is to promote reaction and discussion.

Overview
The need for Data Quality impinges on every stage of the development and implementation lifecycle.

Data Quality is considered here in the context of the standard development project life cycle. To achieve a known and required quality of data after the event - that is, after completion of system development and implementation - is generally difficult, often impossible. It is important therefore that the matter of Data Quality is considered at the earliest stages of design - where design includes:
  • system function
  • database structure
  • collection methods and tools
  • data migration and conversion
  • data maintenance.
The need for data quality impacts the design activity in all these areas.

The individual business perspective
It is important to emphasise that there are no absolutes when it comes to Data Quality. The impact of legacy systems (electronic or paper) on the data collection and conversion activity can be huge. Inevitably there is a cost associated with quality, and therefore it is important that the trade-off between quality and cost is understood by the business. This understanding can only be achieved where a framework exists in which data quality can be properly considered and acted on. The starting point for this is the development of a Data Quality Strategy which can be agreed by the business. The activity of producing this forces the business to consider the importance of individual elements of data and the consequences (e.g. costs) of inaccuracies.

The approach to Data Quality
It is suggested that no GIS project is meeting its corporate responsibility unless the following questions have been successfully addressed:
  • Which data is most important?
  • What is the relative importance of different data?
  • How will Data Quality be measured?
  • At what stages will Data Quality be measured?
  • How will we know if the important data is good enough?
  • What will be done when data fails to meet the set standard?
This paper outlines an approach which allows such questions to be considered rationally, and then answered in the light of the business needs.

The 'top down' approach to quality
In this discussion the development of the approach to Data Quality follows a 'top down' sequence, which reflects the sequence of events that the project itself needs to follow.
  • Developing a Data Quality Strategy
  • Setting the Data Quality Standard
  • Defining Data Quality
  • Measuring Data Quality
  • Definitions and examples
  • Application of the Standard
  • Supporting and Enhancing Data Quality

Developing a Data Quality Strategy
The Data Quality Strategy is the starting point for all work in connection with Data Quality.

The strategy is important because:
  • It provides the first outward and visible sign of the approach to be adopted
  • It is an important input to the design stage
  • It ensures that the business has a vehicle within which the issues surrounding data quality (or lack of it) can be aired, and decisions made
  • It has an impact on the overall cost benefit analysis
  • It is the basis for all further work in relation to Data Quality.
As such it must be developed and agreed at the earliest possible stage in the project. An effective strategy will reflect the needs of the business while remaining pragmatic and achievable.

Purpose
The purpose of the Data Quality Strategy is to provide the framework in which the business requirements for data quality can be realised in the most cost effective manner. The strategy must provide the basis for data quality in all areas of the project and at all stages; that is through the full project life cycle:
  • System functional design (e.g. to enable quality related meta data to be acted upon)
  • Data design (e.g. to allow quality related meta data to be held and accessed)
  • Data collection design (to ensure that the needs for a measured quality of data are designed into the collection procedures and tools)
  • Data conversion design (to ensure that converted data meets the quality hurdle)
  • Data maintenance (to ensure that data maintenance procedures address the need to maintain and improve data quality).
In addition, the quality of data which transfers between systems must be considered. Few systems exist in isolation. The interface to other systems (electronic or clerical) poses various issues for Data Quality. The strategy must address the allocation of responsibility between communicating systems.

Principles
When developing a strategy, it is necessary to establish an agreed set of principles that underpin the strategy. Although some of the principles may be considered to be truisms, they will in themselves raise questions which it is necessary to address. In this respect they aid clarity of thought. Examples of such principles might be:
  • It is preferable to have no data than unreliable data
  • It is uneconomic to check all data before use
  • All data checks should be against site or other primary source
  • All potential sources should be quality checked before selection
  • The best source of data is the equipment itself - i.e. site
  • All data cleansing should take place external to the target system.
Such principles may have to be violated under specific and difficult circumstances. However, the existence of the principles means that the issues will get properly aired and debated. Formal Change Control should then come into play when it is found necessary to alter these fundamental assumptions.

Scope and Objectives
It will be easier to develop and agree the Data Quality Strategy once its scope and objectives have been established clearly. Examples of such objectives include:
  • The progressive improvement of Data Quality during system life
  • Flexibility to meet future needs (e.g. anticipated business need for higher quality of some classes of data)
  • Provision of quantified benchmarks
  • Achieving consistency of data quality for data from variable quality sources
  • Basis for supply of data from different external suppliers (e.g. basis of contract with conversion vendors or collectors)
  • Assessing the marginal cost of changes in data quality
  • Establishing meaningful Corrective Action Procedures.

Setting the Data Quality Standard
It is necessary to define and set a standard against which measurements of Data Quality can be compared. The Data Quality Standard (DQS) is the benchmark for Data Quality. Measuring Data Quality is not enough. Only by providing a benchmark (standard) is it possible to say that the required level of quality has been reached.

Objectives of a Standard
The objectives of a DQS include:
  • To provide a benchmark for developers and implementers to work to
  • To allow the end users to know the confidence which they can place in the data available to them
  • To enable regional variances in quality to be identified and addressed (or allowed for)
  • To provide a basis for planned enhancement of Data Quality, by incrementing the standard
  • To allow resources to be allocated to priority areas through differentiated standards
  • To allow marginal improvements in quality to be agreed, costed and planned.
By meeting these objectives the DQS addresses the need for a quantified basis for the assessment of Data Quality. The capability for decision making and corrective action then follows.

Acceptable Quality Level
The standards for Data Quality are defined and set in terms of Acceptable Quality Levels (AQLs). Because there is a cost associated with each increment in Data Quality it is essential that the AQLs are established and agreed by the business (e.g. the project sponsor).

An AQL can be set for any grouping of data. At the lowest level an AQL can be set for a specific characteristic of the data, such as its currency. At the other extreme an AQL can be set for an equipment type (e.g. to reflect its relative importance), or for the whole set of data in a geographic area.

AQLs can also be time dependent, allowing the Standard to reflect the need to improve the data by setting progressively higher standards. The eventual Data Quality target is sometimes referred to as the Target Quality Level (TQL), recognising that the goal for quality may not be achievable immediately. An AQL represents the minimum level of quality that is acceptable to the business.
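One way to picture a time-dependent AQL is as a schedule of dated thresholds that rises towards the TQL. The following Python sketch is illustrative only: the class name `AqlSchedule`, the dates and the percentages are assumptions made for the example, not values taken from this paper.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AqlSchedule:
    """Hypothetical schedule of AQL steps rising towards a TQL."""
    steps: tuple        # (effective_from, aql_percent) pairs, sorted by date
    tql_percent: float  # eventual Target Quality Level

    def aql_on(self, day: date) -> float:
        """Return the AQL in force on the given day."""
        dates = [effective for effective, _ in self.steps]
        index = bisect_right(dates, day) - 1
        if index < 0:
            raise ValueError("no AQL is in force before the first step")
        return self.steps[index][1]


# Illustrative figures only (assumed, not from the paper).
currency_aql = AqlSchedule(
    steps=((date(2000, 1, 1), 90.0), (date(2001, 1, 1), 95.0)),
    tql_percent=98.0,
)
print(currency_aql.aql_on(date(2000, 6, 30)))  # AQL in force during the first year
```

Holding the TQL alongside the steps keeps the eventual goal visible even while a lower AQL is in force.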

Defining Data Quality
There are a number of characteristics of data that taken together define Data Quality.

As in other areas of life, data quality can be broadly defined as fitness for purpose. It follows therefore that the level of quality is defined as the degree of fitness for purpose. For example, data with a direct impact on safety will probably have a much more stringent requirement to be correct than data which is descriptive. If the former is 100% correct and the latter only 80% correct, both may still be described as having adequate quality.

Quality Indicators
However, fitness for purpose is not in itself a quantifiable measure of quality. Therefore, it is necessary to agree a number of characteristics that can be taken together to indicate the overall fitness for purpose of a data item, or collection of data items. These characteristics are referred to as Quality Indicators (QI). It is essential that the QIs are capable of meaningful measurement. The DQS must identify an agreed measurement method for each QI.

Importance of data
When establishing the required data standards, using AQLs, it is necessary to take account of the relative importance of different parts of the infrastructure (i.e. items of equipment) and the different items of data (attributes) which describe it (e.g. safety data may be considered of greater importance than measurement data). It is also necessary to recognise that the Quality Indicators themselves will have different levels of importance to the business. For example the need for the database to be up-to-date (currency) may be considered more important than the need for traceability. It should be a business responsibility (with help from the project) to establish the importance of the data and the QIs in terms of an agreed set of AQLs. For these reasons the approach outlined here allows the AQLs to be set, and the quality to be measured, at data item level.

Measuring Data Quality

The basis of Data Quality measurement lies in comparing samples of the data against an agreed source
Measurement of data quality is based on a comparison of individual data items with the most appropriate source for the data. This may or may not be the data source from which the data now being checked was originally derived. In many cases the best source is the actual infrastructure item itself. This implies the need for site survey as part of the quality measurement process, with its associated cost implications. In other cases the most appropriate source for comparison will be paper or electronic records. As already outlined, the Data Quality Strategy needs to address the matter of appropriate sources, both for acquiring the data and for measuring its quality.

Data Sampling
Clearly it is not practical to measure every item of data that is to be entered into the system. Therefore it is necessary to sample the data in accordance with an agreed sampling regime. This is done on a batch basis, where the batch consists of the records for a group of assets and where the batch size is chosen to be convenient in relation to the overall process for populating the database (i.e. collection and conversion). Data samples are taken from each batch of records in a formalised manner. This can be based on a known standard such as:

Sampling procedures for inspection by attributes - ISO 2859-1

This standard is designed to cover a wide range of circumstances and can become complex in its application. A pragmatic approach to its use is therefore required.
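As a minimal sketch of the sampling step, the Python fragment below draws a simple random sample from one batch. It does not implement ISO 2859-1: the sample size is passed in as a parameter, on the assumption that it would be read from the standard's tables for the batch size and inspection level in use. The function name `draw_sample` and the record identifiers are hypothetical.

```python
import random


def draw_sample(batch, sample_size, seed=None):
    """Draw a simple random sample of records from one batch, without replacement.

    sample_size is an input here; under ISO 2859-1 it would be taken from
    the standard's tables (not reproduced in this sketch).
    """
    if sample_size > len(batch):
        raise ValueError("sample size exceeds batch size")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable for audit
    return rng.sample(batch, sample_size)


# Hypothetical batch of 500 asset record identifiers.
batch = [f"ASSET-{n:04d}" for n in range(1, 501)]
sample = draw_sample(batch, sample_size=50, seed=1)
print(len(sample), len(set(sample)))
```

Recording the seed with the batch would allow the same sample to be re-drawn if a result is later disputed.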

Measuring the Quality Level
On a batch by batch basis, the Quality Level (QL) is measured separately for each QI using the data sample taken and checking the data against the agreed sources.

This allows the DQS to be applied to each batch.
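The measurement can be sketched as follows, assuming (this paper does not give a formula) that the QL for a QI is expressed as the percentage of sampled checks that conform to the agreed source. The function name and the figures are hypothetical.

```python
from collections import defaultdict


def measure_quality_levels(check_results):
    """Return the measured QL (percent conforming) for each QI.

    check_results is a list of (qi_name, conforms) pairs, one per check
    made on the sample against the agreed source.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for qi_name, conforms in check_results:
        totals[qi_name] += 1
        if conforms:
            passes[qi_name] += 1
    return {qi: 100.0 * passes[qi] / totals[qi] for qi in totals}


# Hypothetical outcome of checking one sample of 50 records.
results = (
    [("correctness", True)] * 47 + [("correctness", False)] * 3
    + [("presence", True)] * 50
)
print(measure_quality_levels(results))
```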

Application of the DQS
The objective is to establish the Pass or Fail status for every batch of data.

To do this the DQS is applied by comparing the measured QL of the sampled data for each QI against the AQL that has been set as the standard for that QI. The result is a Pass or Fail for each QI for the data batch. A failure for any QI would normally result in failure of the batch as a whole. However, the concepts of 'Pass' and 'Fail' are tied in to the Corrective Action Procedures which need to be established well in advance, not least because there may well be contractual implications with third parties as a result of failure. The challenges and difficulties of Corrective Action Procedures can be complex and are addressed in another paper.
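The comparison described above can be sketched in a few lines of Python. The function name `apply_dqs` and the percentages are assumptions for illustration; the rule that any failing QI fails the batch follows the normal case described in the text.

```python
def apply_dqs(measured_ql, aql):
    """Compare the measured QL with the AQL for each QI.

    Returns (per_qi, batch_passes). A QI passes when its measured QL is at
    least the AQL; the batch passes only if every QI passes.
    """
    per_qi = {qi: measured_ql[qi] >= threshold for qi, threshold in aql.items()}
    return per_qi, all(per_qi.values())


# Hypothetical figures, in percent.
aql = {"correctness": 95.0, "presence": 98.0}
measured = {"correctness": 94.0, "presence": 100.0}
per_qi, batch_passes = apply_dqs(measured, aql)
print(per_qi, "PASS" if batch_passes else "FAIL")
```

Returning the per-QI results as well as the overall status gives the Corrective Action Procedures something specific to act on.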

Definitions and Examples
The effectiveness and complexity of the approach must be customised to the needs of the business

In order to apply the DQS it is necessary to put the foregoing considerations into effect. It is at this stage that the levels of cost and complexity of the overall process are determined.

Factors affecting cost of implementation
There are many factors which will determine the cost of implementation of the Data Quality Strategy and a DQS based on the approach outlined here:
  • the AQL will have a significant impact on the cost of collection and conversion
  • as will the definitions of Data Importance
  • the sample size and frequency will impact the cost of quality measurement
  • the batch size impacts the cost of rework.
The DQS has to be set taking account of these and other factors, in such a way that the conflicting demands for Data Quality and cost containment can be met.

Quality Indicators
Seven Quality Indicators are identified. The following list gives their meaning and the level at which they are used.

Data Item Level
  • Correctness: Used to indicate whether the value of a data item is correct (when compared to an 'agreed source')
  • Presence: Used to indicate whether a data item holds a value, or has been left unfilled

Record Level
  • Currency: Used to indicate whether a record has been updated to reflect a change to the infrastructure
  • Traceability: Used to indicate whether the source of data in a record can be identified
  • Completeness: Used to indicate the extent to which the data for whole assets (i.e. records) is missing

Record Group Level
  • Consistency: Used to indicate the variability and uniformity of presentation of like data across assets (record group)
  • Context integrity: Used in relation to groups of records to indicate the extent to which interdependent data items are correctly recorded (record group)

It is seen from the above that the QIs apply at three different levels:
  • Data item
  • Record
  • Record group.
This allows the quality to be measured at the most economic and appropriate level.
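For a system that holds quality-related metadata, the seven QIs and their levels could be represented as a simple lookup. The Python sketch below uses the indicator names and levels listed above; the identifiers `Level`, `QI_LEVEL` and `indicators_at` are hypothetical.

```python
from enum import Enum


class Level(Enum):
    DATA_ITEM = "data item"
    RECORD = "record"
    RECORD_GROUP = "record group"


# Level at which each of the seven QIs is measured.
QI_LEVEL = {
    "correctness": Level.DATA_ITEM,
    "presence": Level.DATA_ITEM,
    "currency": Level.RECORD,
    "traceability": Level.RECORD,
    "completeness": Level.RECORD,
    "consistency": Level.RECORD_GROUP,
    "context integrity": Level.RECORD_GROUP,
}


def indicators_at(level):
    """Return the names of the QIs measured at the given level."""
    return [qi for qi, qi_level in QI_LEVEL.items() if qi_level is level]


print(indicators_at(Level.RECORD))
```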

Importance of Data
The importance of the data is normally a major influence on the AQL. Therefore it is necessary to define Categories of Importance (CI) so that the AQL can be set appropriately. Although this may seem to be another complexity, it allows the more stringent (and therefore costly) AQL levels to be applied only to that data which really justifies the cost associated with high quality.

Importance is separately defined for attributes (data items) and for assets (records). The reason for this is apparent from the definitions. Three CIs are recognised for each:

Data Items
  • Key: Data items, which when combined together provide unique asset identification
  • Essential: Data items regarded as essential to the business
  • Needed: Data items required by the business

Records
  • Cat A: Safety Critical assets
  • Cat B: Business Critical (e.g. operations dependent) assets
  • Cat C: All other assets.
Other categories of importance can be defined if required to meet the business need.
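The two sets of categories lend themselves to a small enumeration, with the AQL then looked up by category. In the Python sketch below the category names come from the definitions above, but the AQL percentages are invented for illustration and are not the paper's example values; all identifiers are hypothetical.

```python
from enum import Enum


class ItemImportance(Enum):
    KEY = "Key"
    ESSENTIAL = "Essential"
    NEEDED = "Needed"


class RecordImportance(Enum):
    CAT_A = "Safety Critical"
    CAT_B = "Business Critical"
    CAT_C = "All other assets"


# Hypothetical AQLs (percent) for one QI, by data item category.
CORRECTNESS_AQL = {
    ItemImportance.KEY: 99.0,
    ItemImportance.ESSENTIAL: 97.0,
    ItemImportance.NEEDED: 90.0,
}


def aql_for(importance):
    """Look up the illustrative correctness AQL for a data item category."""
    return CORRECTNESS_AQL[importance]


# The more important category carries the more stringent AQL.
print(aql_for(ItemImportance.KEY) > aql_for(ItemImportance.NEEDED))
```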

Setting the Acceptable Quality Levels (AQL)
An AQL is set for each QI and CI. The following table gives examples of AQLs and Quality Measures for the 7 Quality Indicators.

[Table not reproduced in the source text.]


Summary
Achieving a known and agreed level of Data Quality incurs real and direct cost. Failure to achieve the required level of Data Quality may well incur a much higher cost.

There are many other matters relating to data quality which there has been no opportunity to discuss in this paper. These include such important issues as:
  • The impact of quality measurement on third party contracts
  • Approaches to Corrective Action Procedures
  • Progressive enhancement of Data Quality through the data maintenance process
  • Formalised evaluation of potential data sources
  • Establishing the marginal cost of Data Quality
  • Building a quality regime into field data collection.