Integration across heterogeneous spatial data and applications within a large cyberinfrastructure project
Service-oriented architecture for GIS interoperability
A web services architecture relies on language- and system-independent functional components (services) described using WSDL (Web Services Description Language, W3C 2003a) and accessed using SOAP (Simple Object Access Protocol, W3C 2003b). The GEON cyberinfrastructure design is consistent with principles developed within the Open Grid Services Architecture (OGSA) model [Foster et al. 2001, 2002], which, in particular, provides interfaces for service security management and for service deployment and invocation.
The GEON system architecture includes several layers of services (Figure 1). Data services provide access to relational databases and files registered and managed by GEON grid software (hosted data), as well as to external map services: ArcIMS sources and OGC WMS and WFS sources. The data grid is a collection of PoP (Point of Presence) nodes running a GEON software stack that includes services for managing storage and replication, grid monitoring, access control and logging, versioning, and querying of hosted data. For data hosted on the GEON grid, "GIS versions" of these low-level management services were developed. For example, the GEON grid monitor service, which reflects the continuously updated state of grid nodes and provides access to each node's configuration and status, includes an ArcIMS component that shows the state of PoP nodes as an interactive map (Figure 2). Non-hosted sources are map and data services at USGS, EPA, and other organizations, indexed, in particular, by the Geography Network (www.geographynetwork.com). Unlike hosted sources, they support only access control and query services.

Figure 2. GEON grid monitor service
The next level of services includes data registration, which covers both hosted and non-hosted sources. For each source, the data source registry records: the source schema; index metadata that situate the dataset in spatial, temporal and ontological (using OWL; see W3C 2004) contexts; data access services; permissions; and the type of dataset management (hosted vs. non-hosted). Further, the spatial data integration services of the middle tier include data discovery, conversion, query rewriting and execution, and results assembly services. The services assembled here support the mediation strategy of data integration [Wiederhold 1992, and in the geospatial context: DeVogele et al. 1998, Gupta et al. 1999, Shimada and Fukui 1999, Boucelma et al. 2002], which is appropriate for a system composed of individually created and maintained sources, some of which are external to the community grid.
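The registry contents listed above can be illustrated with a minimal Python sketch. It is not the GEON registry implementation: the class name `RegistryEntry`, its field names, the `discover` function and the sample entries are all hypothetical, and the ontological context is reduced to a flat list of concept labels rather than OWL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as (min_lon, min_lat, max_lon, max_lat); a simplifying assumption.
BBox = Tuple[float, float, float, float]


@dataclass
class RegistryEntry:
    """One registered source; all field names are hypothetical."""
    name: str
    schema: Dict[str, str]        # source schema: attribute name -> type
    bbox: BBox                    # spatial context
    years: Tuple[int, int]        # temporal context (first year, last year)
    concepts: List[str]           # ontological context (stand-in for OWL terms)
    access_service: str           # e.g. "WMS", "WFS", "ArcIMS", "hosted-db"
    hosted: bool                  # hosted vs. non-hosted management
    permissions: List[str] = field(default_factory=lambda: ["public"])


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def discover(registry: List[RegistryEntry], bbox: BBox, concept: str,
             role: str = "public") -> List[RegistryEntry]:
    """Data discovery: sources matching area, concept and permission."""
    return [e for e in registry
            if bbox_intersects(e.bbox, bbox)
            and concept in e.concepts
            and role in e.permissions]


# Illustrative entries only; they do not describe real GEON sources.
registry = [
    RegistryEntry("geology_map", {"LITH": "str"}, (-120.0, 30.0, -100.0, 45.0),
                  (1990, 2000), ["geology"], "WMS", hosted=False),
    RegistryEntry("gravity_db", {"anomaly": "float"}, (-80.0, 25.0, -70.0, 35.0),
                  (1995, 2003), ["gravity"], "hosted-db", hosted=True),
]

matches = discover(registry, (-110.0, 35.0, -105.0, 40.0), "geology")
print([e.name for e in matches])
```

Keeping the hosted flag and the access service in the same record as the index metadata lets a discovery step hand the mediator everything it needs to decide how a matching source should be queried.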

Figure 3. Internal organization of the map assembly service
Finally, the services can be accessed from both desktop (ArcGIS) and web clients (through a web portal). The portal allows users to formulate queries against the global schema and send them to the mediator for execution. The result is a composite map, as described in the scenario below.
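The mediation flow just described (a query against the global schema, rewriting for each source, execution, and assembly of the results) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the GEON mediator: sources are in-memory lists of dictionaries, a query is a set of attribute-value equalities, and the names `rewrite`, `execute`, `mediate` and the sample sources are hypothetical.

```python
def rewrite(query, mapping):
    """Translate a global-schema query into one source's vocabulary.

    `mapping` maps a global attribute to (local attribute, value translations).
    Returns None when the source cannot answer a queried attribute.
    """
    local = {}
    for attr, value in query.items():
        if attr not in mapping:
            return None
        local_attr, value_map = mapping[attr]
        # Fall back to the global value when no translation is registered.
        local[local_attr] = value_map.get(value, value)
    return local


def execute(rows, local_query):
    """Run an equality query against a source held as a list of dicts."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in local_query.items())]


def mediate(query, sources):
    """Rewrite and run the query per source; return one layer per source."""
    layers = []
    for name, (mapping, rows) in sources.items():
        local = rewrite(query, mapping)
        if local is None:
            continue  # source is not relevant to this query
        layers.append({"source": name, "features": execute(rows, local)})
    return layers


# Hypothetical sources that encode the same concept differently.
sources = {
    "state_a_geology": ({"rock_type": ("LITH", {"granite": "GR"})},
                        [{"id": 1, "LITH": "GR"}, {"id": 2, "LITH": "BA"}]),
    "state_b_geology": ({"rock_type": ("rocktype", {})},
                        [{"id": 7, "rocktype": "granite"}]),
    "roads": ({"road_class": ("CLASS", {})},
              [{"id": 9, "CLASS": "highway"}]),
}

layers = mediate({"rock_type": "granite"}, sources)
for layer in layers:
    print(layer["source"], [f["id"] for f in layer["features"]])
```

In this sketch the list of layers stands in for the composite map: each layer keeps its source name so that an assembly step could style and order the layers independently.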
The additional abstraction layers inherent in the data grid solution alleviate the structural and syntactic problems of GIS interoperability. These layers include a logical name space for data resources, common data and collection management operations, a standard data access API and security abstractions, and uniform services-based access protocols.
However, other types of source heterogeneity remain challenging for integration:
- semantic heterogeneity, i.e. disagreements about meaning or interpretation of similar data, which requires registering semantics of the datasets and rewriting user queries based on such semantic descriptions;
- management heterogeneity, i.e. different sets of functionality enabled over datasets that are hosted versus external to the grid. For example, hosted datasets may follow replication and transformation policies adopted by the grid, while external datasets do not. Such heterogeneity needs to be considered for efficient query execution planning;
- different dataset sizes and shipping costs, which necessitate different access strategies, e.g. "staging" large datasets for fast access from web clients or parallel data transfer, versus packaging collections of small files. This last issue became especially challenging within the BIRN project, where, in order to deliver large (2-4 Gb) brain images to a Java-based Web atlas tool, they had to be exposed as transient "image map services" [Zaslavsky et al. 2004b]. Dynamic on-demand generation of map services is the central feature of the map assembly techniques described below, which are the core component of spatial data integration services.
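How management heterogeneity and dataset size might feed into an access decision can be sketched as a simple rule table in Python. The rules, the strategy labels, the two thresholds and the names `Dataset` and `choose_access_strategy` are assumptions made for illustration only; they are not the planning logic used in GEON or BIRN.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
LARGE_DATASET_BYTES = 1 * 1024**3    # treat 1 GiB and above as "large"
SMALL_FILE_BYTES = 10 * 1024**2      # treat files under 10 MiB as "small"


@dataclass
class Dataset:
    name: str
    size_bytes: int      # total size of the dataset
    hosted: bool         # hosted on the grid vs. external
    file_count: int = 1


def choose_access_strategy(ds: Dataset, web_client: bool) -> str:
    """Pick an access strategy from management type, size and client type."""
    if not ds.hosted:
        # External sources expose only access control and query services.
        return "query-remote-service"
    if ds.size_bytes >= LARGE_DATASET_BYTES:
        # Large hosted data: stage behind a transient map service for web
        # clients, otherwise move it with parallel transfer.
        return "stage-as-map-service" if web_client else "parallel-transfer"
    if ds.file_count > 1 and ds.size_bytes / ds.file_count < SMALL_FILE_BYTES:
        # Many small files are cheaper to ship as one package.
        return "package-small-files"
    return "direct-transfer"


brain_image = Dataset("brain_image", 3 * 1024**3, hosted=True)
field_notes = Dataset("field_notes", 40 * 1024**2, hosted=True, file_count=200)
agency_layer = Dataset("agency_layer", 5 * 1024**3, hosted=False)

print(choose_access_strategy(brain_image, web_client=True))
print(choose_access_strategy(field_notes, web_client=True))
print(choose_access_strategy(agency_layer, web_client=True))
```

Checking the hosted flag first reflects the point that external sources offer fewer management services, so size-based strategies such as staging apply only to data the grid controls.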

Figure 4. Example of mapping query results from a non-hosted database with a dynamically generated ArcIMS service (BorderSafe project)