Abstract
Utility organisations are coming under increasing pressure to consolidate front and back office systems and to integrate legacy data, creating single sources of accurate information that drive efficiency and satisfy regulation. In the UK in particular, the Regulator has recently imposed heavy fines for lack of accurate information. The convergence of engineering-based CAD and GIS further complicates an already difficult task. Data quality is fundamental to bringing data together into modern systems and information supply chains, a process that often also involves re-purposing. Measuring and assessing the quality of data sources, so that data can be migrated from source and integrated meaningfully and effectively, is an enormous task that is sometimes ignored and almost always underestimated. Imposing a data quality regime, which is then fundamental to the on-going ‘health’ of the information systems, suffers similarly from a lack of understanding and a lack of relevant tools and methods for spatial data quality management.
Recent advances in technology and standards initiatives are beginning to change the face of spatial data quality assessment. Advances in the flexibility and power of IT enable spatial data quality to be treated as an integral part of mainstream systems and not as a specialised ‘silo’ application. Standards are also emerging from ISO and the OGC that allow practical data quality solutions to be put in place, supporting the integration and re-purposing of spatial data.
This paper will examine the business need for such spatial data quality tools and management systems and will explore the modern tools and techniques available. It will show, using a real-world example, how the implementation of such solutions is not only practical in the modern era, but can produce significant cost savings and contribute to overall business efficiency and regulatory compliance.
INTRODUCTION
Effective planning and decision support regardless of the business or industry is dependent on accurate and up-to-date information that originates from multiple data sources. These data must be integrated to provide meaningful information upon which decisions and forward planning can be based. Under increasing pressure to provide a better and more efficient service to both its customers and its shareholders we see the Utility Industry increasingly attempting to integrate front and back office systems and provide a ‘single source of truth’ upon which to plan the business. However such moves have begun to highlight operational and economic problems as the complexity of data integration and the effort required becomes known. Not only this but the penalties for getting things wrong are not just felt in the failure to recognise the targeted efficiency gains. In the recent past in the UK well-publicised and severe fines have been handed out by the regulator to Utility organisations that fail to serve their customers through a combination of a lack of information and poor quality data.
What is becoming obvious is that the critical factor in the integration of data and the interoperability of that data is quality. This paper examines this issue of data quality, specifically with regard to spatial data, an increasingly prevalent type of data within the information supply chains that underpin Utility organisations' performance.
SPATIAL DATA IN UTILITY ORGANISATIONS
System and Data integration is not simply about the physical merging of the various datasets into one single ‘mass’. A particular individual or group typically collects data, regardless of type, for one or a small number of specific uses. The nature of the collection process can prescribe that many aspects of the data are specific to its original purpose, and such constraints within the data acquisition process can limit the useful range of data application in the future. The mere presence of particular elements within datasets does not guarantee that they are fit for purpose for new applications and processes and it is crucial that the quality of data is understood and considered when attempting to reuse and repurpose that data.
This is particularly true in Utility organisations where large amounts of spatially based datasets can be found. These range from asset records themselves digitised from paper maps back in the 1980s in the UK, through to Customer Billing records, metering requests, and even to street excavations (soon to be required by Law in April 2008 in the UK under the Traffic Management Act to provide accurate spatial references). In general two distinct sources for pure spatial data exist, namely Geographic Information Systems (GIS) and Computer Aided Design (CAD) packages.
A great deal of time and money has already been spent collecting both the engineering and spatial data. These data have been collected over the last 20 – 25 years. In terms of spatial data, almost certainly they will have been collected with a primary purpose of producing maps or drawings. They will have been collected using data schemas that, in some cases, are at best partially understood by successor bodies. Almost certainly they will have been collected in proprietary systems. As a result, these datasets contain a plethora of data quality issues, especially when attempts are made to reuse the data for other purposes and/or to integrate these datasets with other spatial and non-spatial data to support the decision-making process. One natural conclusion might be to start over and re-survey and re-collect the legacy data. However, this is certainly not desirable for a number of reasons, not least because many of these datasets represent historical situations that are no longer present. Furthermore, these legacy datasets represent such a huge investment that it would simply not be economically viable to attempt to re-capture them. Solving the repurpose/reuse issue is the only way forward.
This situation is being compounded by ever-increasing streams of spatial data coming in from the field, and the advent of more advanced collection techniques such as satellite-based GPS. The demand for the convergence of the engineering world with its CAD and Building Information Models (BIM) and the GIS domain is also now gaining pace as the benefits are being recognised (Zeiss, 2007).
Often the situation is complicated by the fact that the data, including the spatial references, are spread across multiple databases serving various operational users, which for historical reasons have often been kept quite separate. Mergers and acquisitions in the Utility space have contributed to the myriad of data sources commonly found within a modern Utility IT environment.
It is also worth noting that the use of spatial data is no longer restricted to within a single organisation; the whole supply chain concept now extends between organisations, as regulation, legal directives and consumer demand all drive an explosion in the requirement for information, specifically spatially-based information, across all walks of life. This brings to light an often overlooked issue. As more and more spatial data is shared, such sharing becomes the focus of contractual agreements and service levels are imposed. Policing these contracts and enforcing service levels is not a simple task.
The challenge facing all Utility organisations is how to tackle the integration and repurposing problems in an efficient manner, on an on-going reproducible basis, that will provide a single source of truth view of the business to serve as a stable basis for the operation and financial planning of the business.
STANDARDS
Reproducibility and measurable efficiencies require Standards in one form or another, both in process and measure. Standards do exist in spatial data quality but are not well known. In a recent survey carried out by the Open Geospatial Consortium (OGC) Working Group on Data Quality, just short of 60% of respondents said they were not using any recognised standards in data quality work being carried out in their organisation. The Standards that do exist result from the same challenge that we now face, first encountered when the Digital Chart of the World (DCW) was produced. This project involved bringing together data collected by a large number of bodies with different objectives. In essence this can be thought of as interoperability long before the term was ever used, and on a scale that was unique and possibly will never be seen again. Once the DCW was constructed, opportunities were seen to use it for purposes outside of the specification from which it was created, namely operational navigation charts. This was a natural expectation as the dataset was public domain. However, issues immediately became apparent; it was not fit for purpose. Indeed, why should a dataset created for the purposes of medium altitude en route navigation by dead reckoning visual pilotage be of any value in assessing drainage basin characteristics?
These realisations led to the creation of ISO19113/19114 (for a summary of these see Chapter 15 in ‘Spatial Data Quality’, Shi et al. (2002); ISBN 0-415-25835-19). The opening sentence of ISO19113 sets the tone:
‘Geographic datasets are increasingly being shared, interchanged and used for purposes other than their producers’ intended ones’.
The standard work was carried out under the auspices of ISO/TC 211 and then applied as a case study to the DCW project (see: http://www.nlh.no/ikf/gis/dcw/). By 1995 the initiative had run out of steam because only non-quantitative assessments of geographic datasets could take place. Quantitative assessments were not possible for large GIS datasets because of the lack of processing power and the absence of an agreed method to assess and communicate quality measures. However, it is this type of assessment that is really valuable for assessing logical consistency and positional accuracy.
Today things have changed. The processing power is available and mainstream IT architectures have changed. Open, Service Oriented Architectures (SOA) provide a modern platform for sharing and distributing software systems, and the OGC has carried out significant work on standards to complement the ISO work described above. The ISO work produced a set of measures that would form the basis of quantification of spatial data quality, which are reproduced in Table 1. The framework is in place. What remains?
| Measure | Description |
| --- | --- |
| Accuracy | Data should be sufficiently accurate for its intended purposes, representing clearly and in sufficient detail the interaction provided at the point of activity. Data should be captured only once, although it may have multiple uses. Accuracy is most likely to be secured if data is captured as close to the point of activity as possible. |
| Validity | Data should be recorded and used in compliance with relevant requirements, including the correct application of any rules or definitions. This will ensure consistency between periods and with similar organisations. |
| Reliability | Data should reflect stable and consistent data collection processes. |
| Timeliness | Data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period. |
| Relevance | Data captured should be relevant to the purposes for which it is used. It may be necessary to capture data at the point of activity which is relevant only for other purposes, rather than for the current intervention. Quality assurance and feedback processes are needed to ensure the quality of such data. |
| Completeness | Monitoring missing, incomplete, or invalid records can provide an indication of data quality and can also point to problems in the recording of certain data items. |
Table 1. Measures of Data Quality
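Two of the measures in Table 1, Completeness and Validity, lend themselves readily to quantification. The following is a minimal sketch in Python of how they might be expressed as simple ratios over a set of asset records. The record structure, the list of permitted materials and the year range are all hypothetical and chosen purely for illustration; they are not taken from ISO 19113/19114 or from any Utility's schema.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical asset record; the field names are illustrative only.
@dataclass
class AssetRecord:
    asset_id: str
    material: Optional[str]
    install_year: Optional[int]


# Assumed domain rules, not taken from any standard.
VALID_MATERIALS = {"PVC", "DI", "AC", "PE"}
YEAR_RANGE = (1850, 2008)


def completeness(records):
    """Fraction of records with every attribute populated."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if r.material is not None and r.install_year is not None
    )
    return complete / len(records)


def validity(records):
    """Fraction of records whose populated attributes obey the assumed rules."""
    if not records:
        return 0.0

    def ok(r):
        material_ok = r.material is None or r.material in VALID_MATERIALS
        year_ok = (r.install_year is None
                   or YEAR_RANGE[0] <= r.install_year <= YEAR_RANGE[1])
        return material_ok and year_ok

    return sum(1 for r in records if ok(r)) / len(records)


records = [
    AssetRecord("P1", "PVC", 1985),
    AssetRecord("P2", None, 1990),    # missing material
    AssetRecord("P3", "WOOD", 1975),  # material outside the permitted list
    AssetRecord("P4", "DI", 2150),    # implausible installation year
]
print(f"completeness = {completeness(records):.2f}")
print(f"validity = {validity(records):.2f}")
```

In this sketch a missing value counts against Completeness but not against Validity, so the two measures can be reported and tracked separately.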
SOLVING THE DATA QUALITY PROBLEM
An issue that, despite being a stumbling block, is often overlooked is the semantic quality of the data. Combining datasets from different sources and different databases often means integrating similar datasets but from differing schemas. This can be hugely problematic. The use of ontologies is beginning to help provide a framework for describing the meaning or purpose of spatial objects and datasets, which gives a basis from which to work. But even then, measuring quality and assessing fitness for purpose still remains an issue.
It is important that quality assessment goes beyond simple geometry checks looking for overlaps, overshoots etc. and makes use of this semantic information. Combining data requires us to check that the descriptions of features match and make sense if we are to use these combined datasets for decision-making purposes. Is this type of valve permitted on this pipe? Does this building type make sense for this generator? Both are valid questions that go beyond checking that the geometries of such objects coincide.
Measurement and maintenance of such criteria is imperative to the usefulness of spatial data.
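The valve question above can be sketched as a simple semantic check. The Python fragment below is an illustration only: the pipe and valve types, the compatibility table and the dictionary-based feature layout are all assumptions made for this example and do not come from any real asset schema.

```python
# Assumed compatibility table: which valve types may sit on which pipe types.
PERMITTED_VALVES = {
    "trunk_main": {"gate", "butterfly"},
    "distribution_main": {"gate", "air"},
    "service_pipe": {"stop"},
}


def check_valves(valves, pipes):
    """Return (valve_id, reason) for every valve that fails the semantic rule."""
    pipe_types = {p["id"]: p["type"] for p in pipes}
    failures = []
    for v in valves:
        pipe_type = pipe_types.get(v["pipe_id"])
        if pipe_type is None:
            failures.append((v["id"], "references unknown pipe"))
        elif v["type"] not in PERMITTED_VALVES.get(pipe_type, set()):
            failures.append(
                (v["id"], f"{v['type']} valve not permitted on {pipe_type}")
            )
    return failures


pipes = [
    {"id": "P1", "type": "trunk_main"},
    {"id": "P2", "type": "service_pipe"},
]
valves = [
    {"id": "V1", "pipe_id": "P1", "type": "gate"},
    {"id": "V2", "pipe_id": "P2", "type": "butterfly"},
    {"id": "V3", "pipe_id": "P9", "type": "stop"},
]
for valve_id, reason in check_valves(valves, pipes):
    print(valve_id, reason)
```

Note that the geometry of every valve here could coincide perfectly with its pipe and the check would still fail for V2; the rule operates on the meaning of the features, not their shape.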
Recent advances in rules-based spatial data quality measures are gaining acceptance and proving an effective solution. The OGC itself is now investigating this approach in an attempt to formulate standards in this area.
Within the last 18 months the OGC has made moves towards quantitative data quality assessment and communication with two main initiatives. The first of these was the Topology Quality Assessment Service (TQAS), developed and deployed within the Geo-processing workflow thread of the OWS-4 interoperability test bed. This project recognised that the syntax, or encoding, rules of the information must be well understood by the exchanging parties. It is clear that Geography Markup Language (GML) already provides a good foundation for ensuring syntactic interoperability, based as it is on XML and XML Schema with their own well-defined syntactic structure.
However, while these syntactic constraints are necessary for reliable data transport, they are not sufficient to ensure that the exchanging parties can correctly interpret the meaning of, and exploit the features correctly for decision support purposes. To guarantee that the features are consistent with a particular domain interpretation (for example buried asset data), it is necessary to describe the logical consistency constraints within that domain in a formal way and test the features against these constraints or rules. We know that, in general, pipes should connect. However, simply building a GML application schema containing these terms (syntactic structure) does not guarantee that feature instances encoded in the GML satisfy the logical domain constraints (semantic structure) eg, what valves are permitted on what pipes. These formal semantic rules must therefore be expressed in addition to the GML application schema.
(OGC, 2007)
To achieve this, this project importantly produced a rules language, capable of describing syntactic and semantic data characteristics in a way that could be quantified and measured. With this it suddenly becomes possible to assess the compliance of datasets with a standard or measure in a meaningful way. This was a large step forward that, in hindsight, may prove to be a ground-breaking moment. (Watson, 2007)
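The principle of expressing rules in a form that yields a measurable result can be illustrated with a small sketch. The Python below is not the OWS-4 rules language, whose syntax is not reproduced in this paper; it is a hypothetical stand-in in which each rule is a named predicate and conformance is reported as the fraction of features that pass. The pipe features, their attributes and the two rules are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical features: pipes with 2D end points and a diameter attribute.
pipes = [
    {"id": "P1", "start": (0.0, 0.0), "end": (10.0, 0.0), "diameter_mm": 150},
    {"id": "P2", "start": (10.0, 0.0), "end": (10.0, 8.0), "diameter_mm": 100},
    # P3 stops 5 cm short of P2 and carries a zero diameter.
    {"id": "P3", "start": (10.0, 8.05), "end": (20.0, 8.0), "diameter_mm": 0},
]


def endpoint_counts(features):
    """Count how many pipe ends share each exact coordinate."""
    counts = Counter()
    for f in features:
        counts[f["start"]] += 1
        counts[f["end"]] += 1
    return counts


def make_rules(features):
    """Build (name, predicate) pairs; a predicate returns True on a pass."""
    counts = endpoint_counts(features)
    return [
        ("diameter is positive", lambda f: f["diameter_mm"] > 0),
        ("pipe joins another pipe",
         lambda f: counts[f["start"]] > 1 or counts[f["end"]] > 1),
    ]


def conformance(features, rules):
    """Return {rule name: (pass fraction, ids of failing features)}."""
    if not features:
        return {}
    report = {}
    for name, predicate in rules:
        failing = [f["id"] for f in features if not predicate(f)]
        passed = len(features) - len(failing)
        report[name] = (passed / len(features), failing)
    return report


report = conformance(pipes, make_rules(pipes))
for name, (fraction, failing) in report.items():
    print(f"{name}: {fraction:.0%} conformant, failing: {failing}")
```

The connectivity rule in this sketch compares coordinates exactly, which is why the small undershoot on P3 is reported; a practical implementation would more likely apply a snapping tolerance. The point of the sketch is only that one attribute rule and one topological rule can share the same reporting structure, so a dataset can be scored against a whole rule set.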
The second, and follow-up, initiative was the formation of a Data Quality Working Group, whose objective is to build on the OWS-4 work and to examine the possibility of a rules-based approach to both measuring and reporting spatial data quality.
This group’s objective is stated as being:
“to establish a forum for describing an interoperable framework, or model, for OGC Quality Assurance Web Services to enable access and sharing of high quality geospatial information, improve data analysis and ultimately influence policy decisions.” ……. (OGC, 2006)
This Group is attempting to define a framework and a grammar for the certification and communication of spatial data quality covering both syntactic and semantic measures.
EXAMPLE
An example of one utility organisation attempting to tackle the data quality issue is MidCoast Water, in New South Wales, Australia. MidCoast Water serves an area of 7000 square kilometers and the authority is responsible for reticulated water supply and sewerage systems to communities in the Manning and Great Lakes regions. The delivery of services to such a vast area brings with it many challenges. Formed just under 10 years ago from the water and sewerage sections of three local authorities, MidCoast Water has quickly grown to be an industry leader - and sees its encouragement of innovation as the key to this positioning.
MidCoast Water introduced a programme to provide improved access to accurate geographical and asset information. The programme was developed in two stages:
- Improving the efficiency and accuracy with which information is gathered and recorded
- Improving the accessibility of this information
MidCoast Water needed to ensure its data quality in order to provide the best possible service to its end customers. They needed to be confident in the reliability of their data at both the attribute and
level so that no manual checking was required and that time was not wasted searching for assets in the field due to errors within their data. Topological connectivity between networks also needed to be assured to prevent the duplication of editing tasks, which could provide a drain on their manpower. As well as MidCoast’s ongoing programme of internal improvement, from 1st July 2006, state government regulations required utility organisations to have the ability to accurately pinpoint their assets. MidCoast Water needed to ensure that they were fully compliant with this before the legislation came into force.
Following the implementation of a spatial data quality regime, MidCoast Water now enjoys the following benefits:
- Interoperability – data is error-free and accessible via multiple applications across the company
- Enhanced productivity – significant time and cost savings have been achieved through increased query performance
- Enterprise-wide data management – business and spatial data has been centralised into a single database, reducing the duplication of effort for maintenance
- Improvements in data gathering – it now takes just a few hours to translate spatial data into maps, instead of a week
- Efficient processing of property information – MidCoast Water has experienced a 60% reduction in staff time for this task when combined with other processes
- Versatility of application – the return on investment has increased due to an ability to apply the data in a variety of new ways and through data mining opportunities.
The return on investment for MidCoast Water has been substantial. Not only have they met their objectives to centralise business data and to accurately manage and pinpoint their assets, they have also achieved measurable time and cost savings. As just one example, before the implementation of the new regime it appeared that two sewer stations were needed at a particular development. Under the new system, with the resultant improvement in data quality, the information supplied made it clear that only one sewer station was required, saving MidCoast Water up to $300,000 AUD.
CONCLUSIONS
The flow of information across the Utility industry is rapidly growing in size and complexity, and as legacy data is re-worked and different data types (CAD/GPS) are included, larger data integration problems are being created. The data are increasingly being subject to legal contract (the digital rights management requirement) and service level agreements. The data are required for performance measurement. There is a hugely complex supply chain in which spatial data is playing an increasing role serving these connected consumers and suppliers. This problem is going to get worse as the engineering world seeks to integrate fully with what has traditionally been thought of as the GIS world, within mainstream IT, to create the single source of truth vital to the business.
Data quality, the ‘forgotten common sense’, is of vital importance in enabling such integration and sharing of data to underpin the decision support process.
Recent advances led by such bodies as ISO and the OGC, and the use of rules to provide quantitative data quality description and measurement, provide a means to implement data quality regimes that not only enable supply chains to operate efficiently, but also provide real commercial benefit to contributing organisations.
REFERENCES
Data Quality Working Group Charter, (2006) Open Geospatial Consortium, Inc.
ISO 19113:2002 Geographic information -- Quality principles
ISO 19114:2003 Geographic information -- Quality evaluation procedures
ISO 19115:2003. Geographic Information -- Metadata
ISO 19125-2:2004. Geographic Information -- Simple feature access -- Part 2: SQL option
ISO 19139:2007. Geographic Information
OWS-4 - Topology Quality Assessment IPR, 2007, Open Geospatial Consortium, Inc.
Watson, P. (2007) Formal languages for expressing data consistency rules and implications for reporting of Quality Metadata. International Symposium on Data Quality 2007, ITC, Enschede.
Zeiss, G. (2007) GIS/CAD Convergence. GITA Australia Annual Conference, Brisbane, August, 2007