Data quality: Defining an achievable standard
Tony S. Holmwood
Brown & Root Services, Halliburton Brown & Root, Hill Park Court, Leatherhead, Surrey KT22 7NL, United Kingdom
Background
'Data is the Cinderella of system implementation' - why so?
It is generally accepted that 50 - 85% of the cost of building and implementing an asset management/GIS system lies in the provision of data. Yet population of a new system's database is seldom considered until late in the day. Data collection and data conversion are less interesting than hardware and software, so the requirement for data is seldom thought about until the project has been under way for some time and the major decisions on platform and functionality have been made. The inter-related issue of Data Quality is often ignored completely.
However, the issue of Data Quality is taking on increasing importance. As the performance of large public service companies and authorities gains a higher public profile, so management is forced to implement better systems with which to manage their infrastructure. The demand by the regulatory authorities that the major utility companies achieve greater efficiencies and better safety is forcing an increased understanding of the assets which make up the backbone of their business. This in turn is placing ever greater emphasis on their asset management and GIS systems.
In the United Kingdom the issue of public and employee safety has become a major subject of concern and debate. Both the regulatory authorities and company managers at board level are increasingly aware that safety can only be delivered where information about the maintenance, performance and condition of the assets is accurately recorded and readily available. Therefore the issue of Data Quality is becoming increasingly important.
Purpose
It is a business requirement to have the GIS database populated with data of a known and agreed quality.
The purpose of this paper is to raise the profile of Data Quality, and in so doing to consider some of the questions which arise when the subject is considered in any depth. The basic proposition that the paper addresses is stated above.
The paper sets out to describe an approach to Data Quality which has been developed within the United Kingdom Utilities, Telecommunications and Rail industries. However, elements of the approach can be applied to any situation where it is necessary to have a quantified basis for data quality. In particular, the primary role of the Data Quality Strategy at the early stages of a project is emphasised.
The ideas put forward here are not intended to be prescriptive. Each business and project needs to address the issue of data quality in the light of its own circumstances, business objectives and constraints. However, it is hoped that elements of the approach discussed in this paper can be modified and expanded to meet individual needs. An important objective of this paper is to promote reaction and discussion.
Overview
The need for Data Quality impinges on every stage of the development and implementation lifecycle.
Data Quality is considered here in the context of the standard development project life cycle. To achieve a known and required quality of data after the event - that is, after completion of system development and implementation - is generally difficult, often impossible. It is important therefore that the matter of Data Quality is considered at the earliest stages of design - where design includes:
The individual business perspective
It is important to emphasise that there are no absolutes when it comes to Data Quality. The impact of legacy systems (electronic or paper) on the data collection and conversion activity can be huge. Inevitably there is a cost associated with quality, and therefore it is important that the trade-off between quality and cost is understood by the business. This understanding can only be achieved where a framework exists in which data quality can be properly considered and acted on. The starting point for this is the development of a Data Quality Strategy which can be agreed by the business. The activity of producing this forces the business to consider the importance of individual elements of data and the consequences (e.g. costs) of inaccuracies.
The approach to Data Quality
It is suggested that no GIS project is meeting its corporate responsibility unless the following questions have been successfully addressed:
The 'top down' approach to quality
In this discussion the development of the approach to Data Quality follows a 'top down' sequence, which mirrors the sequence of events that the project itself needs to follow.
The Data Quality Strategy is the starting point for all work in connection with Data Quality. The strategy is important because:
Purpose
The purpose of the Data Quality Strategy is to provide the framework in which the business requirements for data quality can be realised in the most cost effective manner. The strategy must provide the basis for data quality in all areas of the project and at all stages; that is, through the full project life cycle:
Principles
When developing a strategy, it is necessary to establish an agreed set of principles that underpin the strategy. Although some of the principles may be considered to be truisms, they will in themselves raise questions which it is necessary to address. In this respect they aid clarity of thought. Examples of such principles might be:
Scope and Objectives
It will be easier to develop and agree the Data Quality Strategy once its scope and objectives have been established clearly. Examples of such objectives include:
It is necessary to define and set a standard against which measurements of Data Quality can be compared. The Data Quality Standard (DQS) is the benchmark for Data Quality. Measuring Data Quality is not enough. Only by providing a benchmark (standard) is it possible to say that the required level of quality has been reached.
Objectives of a Standard
The objectives of a DQS include:
Acceptable Quality Level
The standards for Data Quality are defined and set in terms of Acceptable Quality Levels (AQLs). Because there is a cost associated with each increment in Data Quality, it is essential that the AQLs are established and agreed by the business (e.g. the project sponsor).
An AQL can be set for any grouping of data. At the lowest level an AQL can be set for a specific characteristic of the data, such as its currency. At the other extreme an AQL can be set for an equipment type (e.g. to reflect its relative importance), or for the whole set of data in a geographic area. AQLs can also be time dependent, thus allowing the Standard to reflect the need to improve the data by setting progressively higher standards. The eventual Data Quality target is sometimes referred to as the Target Quality Level (TQL), thus recognising the need to set a goal for quality which may not be achievable immediately. An AQL represents the minimum level of quality that is acceptable to the business.
Defining Data Quality
There are a number of characteristics of data that, taken together, define Data Quality.
As in other areas of life, data quality can be broadly defined as fitness for purpose. It follows therefore that the level of quality is defined as the degree of fitness for purpose. For example, data with a direct impact on safety will probably have a much more stringent requirement to be correct than data which is descriptive. If the former is 100% correct and the latter only 80% correct, both may still be described as having adequate quality.
Quality Indicators
However, fitness for purpose is not in itself a quantifiable measure of quality. Therefore, it is necessary to agree a number of characteristics that can be taken together to indicate the overall fitness for purpose of a data item, or collection of data items. These characteristics are referred to as Quality Indicators (QI). It is essential that the QIs are capable of meaningful measurement.
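As an illustration of the time-dependent AQLs described earlier in this section, the following minimal sketch represents an AQL as a dated schedule whose final step is the TQL. The names `AqlStep` and `aql_on`, the dates, and the levels are assumptions made for the example; they are not taken from the paper or from any standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AqlStep:
    """One step in a time-dependent AQL: the level that applies from a given date."""
    effective_from: date
    level: float  # minimum acceptable fraction of correct items, 0.0 to 1.0


def aql_on(schedule: list[AqlStep], when: date) -> float:
    """Return the AQL in force on `when`: the latest step already effective."""
    in_force = [step for step in schedule if step.effective_from <= when]
    if not in_force:
        raise ValueError("no AQL is in force on the requested date")
    return max(in_force, key=lambda step: step.effective_from).level


# Hypothetical schedule for the 'currency' characteristic.
# The last step plays the role of the Target Quality Level (TQL).
currency_schedule = [
    AqlStep(date(2000, 1, 1), 0.80),
    AqlStep(date(2001, 1, 1), 0.90),
    AqlStep(date(2002, 1, 1), 0.95),
]

print(aql_on(currency_schedule, date(2001, 6, 30)))
```

Holding the schedule as data rather than as a single number lets the business agree the whole improvement path up front, while each batch is still judged only against the level in force on the day it is checked.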
The DQS must identify an agreed measurement method for each QI.
Importance of data
When establishing the required data standards, using AQLs, it is necessary to take account of the relative importance of different parts of the infrastructure (i.e. items of equipment) and the different items of data (attributes) which describe it (e.g. safety data may be considered of greater importance than measurement data). It is also necessary to recognise that the Quality Indicators themselves will have different levels of importance to the business. For example, the need for the database to be up-to-date (currency) may be considered more important than the need for traceability. It should be a business responsibility (with help from the project) to establish the importance of the data and the QIs in terms of an agreed set of AQLs. For these reasons the approach outlined here allows the AQLs to be set, and the quality to be measured, at data item level.
Measuring Data Quality
The basis of Data Quality measurement lies in comparing samples of the data against an agreed source.
Measurement of data quality is based on a comparison of individual data items with the most appropriate source for the data. This may or may not be the data source from which the data now being checked was originally derived. In many cases the best source is the actual infrastructure item itself. This implies the need for site survey as part of the quality measurement process, with its associated cost implications. In other cases the most appropriate source for comparison will be paper or electronic records. As already outlined, the Data Quality Strategy needs to address the matter of appropriate sources, both for acquiring the data and for measuring its quality.
Data Sampling
Clearly it is not practical to measure every item of data that is to be entered into the system. Therefore it is necessary to sample the data in accordance with an agreed sampling regime.
This is done on a batch basis, where the batch consists of the records for a group of assets and where the batch size is chosen to be convenient in relation to the overall process for populating the database (i.e. collection and conversion). Data samples are taken from each batch of records in a formalised manner. This can be based on a known standard such as:
Sampling procedures for inspection by attributes - ISO 2859-1
This standard is designed to cover a wide range of circumstances and can become complex in its application. A pragmatic approach to its use is therefore required.
Measuring the Quality Level
On a batch by batch basis, the Quality Level (QL) is measured separately for each QI using the data sample taken and checking the data against the agreed sources. This allows the DQS to be applied to each batch.
Application of the DQS
The objective is to establish the Pass or Fail status for every batch of data. To do this the DQS is applied by comparing the measured QL of the sampled data for each QI against the AQL that has been set as the standard for that QI. The result is a Pass or Fail for each QI for the data batch. A failure for any QI would normally result in failure of the batch as a whole. However, the concepts of 'Pass' and 'Fail' are tied in to the Corrective Action Procedures, which need to be established well in advance, not least because there may well be contractual implications with third parties as a result of failure. The challenges and difficulties of Corrective Action Procedures can be complex and are addressed in another paper.
Definitions and Examples
The effectiveness and complexity of the approach must be customised to the needs of the business.
In order to apply the DQS it is necessary to put the foregoing considerations into effect. It is at this stage that the levels of cost and complexity of the overall process are determined.
Factors affecting cost of implementation
There are many factors which will determine the cost of implementation of the Data Quality Strategy and a DQS based on the approach outlined here:
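Returning to the batch procedure described under 'Measuring the Quality Level' and 'Application of the DQS', the following minimal sketch draws a sample from a batch, measures a QL for each QI against a reference source, and compares each QL with its AQL. It is a simplification: it uses a fixed sample size and a direct proportion comparison, and does not reproduce the sampling plans of ISO 2859-1. The function names, the QI names 'accuracy' and 'completeness', the record layout and the AQL values are all assumptions made for the example.

```python
import random


def draw_sample(batch, sample_size, seed=None):
    """Draw a simple random sample of records from one batch, without replacement."""
    rng = random.Random(seed)
    return rng.sample(batch, min(sample_size, len(batch)))


def measure_quality_levels(sample, reference, indicators):
    """Return, per QI, the fraction of sampled records that pass that QI's check.

    `indicators` maps a QI name to a function (record, reference_record) -> bool.
    """
    if not sample:
        raise ValueError("cannot measure quality on an empty sample")
    levels = {}
    for name, check in indicators.items():
        passed = sum(1 for rec in sample if check(rec, reference[rec["id"]]))
        levels[name] = passed / len(sample)
    return levels


def apply_dqs(levels, aqls):
    """Compare each measured QL with its AQL; the batch fails if any QI fails."""
    per_qi = {name: levels[name] >= aqls[name] for name in aqls}
    return per_qi, all(per_qi.values())


# Hypothetical reference source (e.g. site survey results), keyed by asset id.
reference = {i: {"id": i, "diameter_mm": 150} for i in range(100)}

# Hypothetical batch of 100 records; every tenth record holds a wrong value.
batch = [{"id": i, "diameter_mm": 150 if i % 10 else 100} for i in range(100)]

# Hypothetical QIs and their agreed measurement methods.
indicators = {
    "accuracy": lambda rec, ref: rec["diameter_mm"] == ref["diameter_mm"],
    "completeness": lambda rec, ref: rec["diameter_mm"] is not None,
}

# Hypothetical AQLs agreed by the business for each QI.
aqls = {"accuracy": 0.95, "completeness": 0.90}

sample = draw_sample(batch, 20, seed=42)
levels = measure_quality_levels(sample, reference, indicators)
per_qi, batch_passed = apply_dqs(levels, aqls)
print(levels, per_qi, "PASS" if batch_passed else "FAIL")
```

Keeping the per-QI results alongside the overall batch status matters in practice, because the Corrective Action Procedures are likely to differ depending on which QI caused the failure.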
Quality Indicators
7 Quality Indicators are identified. The following table gives their meaning and the level at which they are used.
Data Item Level
Record Level
Record Group Level
It is seen from the above that the QIs apply at three different levels:
Importance of Data
The importance of the data is normally a major influence on the AQL. Therefore it is necessary to define Categories of Importance (CI) so that the AQL can be set appropriately. Although this may seem to be another complexity, it allows the more stringent (and therefore costly) AQL levels to be applied only to that data which really justifies the cost associated with high quality.
Importance is separately defined for attributes (data items) and for assets (records). The reason for this is apparent from the definitions. Three CIs are recognised for each:
Data Items
Setting the Acceptable Quality Levels (AQL)
An AQL is set for each QI and CI. The following table gives examples of AQLs and Quality Measures for the 7 Quality Indicators.
Summary
Achieving a known and agreed level of Data Quality incurs real and direct cost. Failure to achieve the required level of Data Quality may well incur a much higher cost.
There are many other matters which need to be addressed in relation to data quality, which there has been no opportunity to discuss in this paper. These include such important issues as:
© GISdevelopment.net. All rights reserved.