GISdevelopment.net ---> GITA 2002 ---> System Architecture

Version Management Revisited

Peter M. Batty
GE Smallworld
5600 Greenwood Plaza Blvd.
Englewood, CO 80111


Abstract
This paper discusses version management, which is now widely accepted in the industry as an essential technique for managing the design lifecycle. All major vendors now claim to have version management as part of their solution. Despite this, version management and the long transaction problem that it solves are still not widely understood.

This paper summarizes the fundamentals of version management, and looks at different approaches to implementing it: deep and shallow version management. It then goes on to explain how basic version management, while an essential prerequisite, is just the beginning of a full solution for managing the design process. Other important issues that need to be addressed include design versus as-built views, future views of the network, partial job completion, jobs built on jobs, handling historical information, and support for detached design work in the field. Each of these is explained and approaches to implementing them are discussed.

The Long Transaction Problem
For a detailed discussion of the long transaction problem, see Newell and Easterfield, 1990, and Newell and Batty, 1994.

The basic technical requirement for a long transaction is the ability to lay out a design - do inserts, updates and deletes - in such a way that the changes being made are not visible to other users of the system (until the design reaches a stage where it is appropriate to share it with others). Since the user has in some sense a private copy of the data, concurrency control needs to be addressed: what happens if two people want to update something in the same area? In general, since these transactions can take a long elapsed time (weeks or months), it is unacceptable to insist that all data in an area be locked. Therefore, most approaches use an optimistic form of concurrency control in which data is not locked, but any conflicting updates are identified and resolved at some point before the transaction is completed.

Checkout
Checkout has been the most commonly used approach to the long transaction problem: a small geographic area is copied to a separate database or file where the work is done, and changes are passed back to the master database later. This has a number of drawbacks. The time taken to create the checked-out dataset is often significant (minutes rather than seconds in many cases). Since only a restricted area is checked out, it is hard to run an analysis of how the design affects a broader area of the network. With any reasonably sophisticated data model, it can also be very hard to determine exactly what data should be checked out, because there are many difficult issues regarding the handling of data that is related to objects that are geographically within the selected area.

Version Management
Another approach to handling long transactions is to use version management. With this approach it is possible to create different “versions” or “alternatives” of the database. Each alternative is logically equivalent to a replica of the whole database; a user can make changes within an alternative that are not seen by any other users, and the user in the alternative does not see changes made by other users (until she asks to see them). The data is not physically replicated to create an alternative: only the changes relative to the parent version are stored in an alternative. There are significantly different approaches to implementing version management, which will be discussed in the next section.

Changes are propagated between versions in a controlled way. This paper uses the terminology that a “merge” operation propagates changes down from a parent to a child alternative, and a “post” operation propagates changes up from a child to a parent. Beware, though, that Oracle’s new version management technology, known as workspace management, uses different terminology: it uses the term “refresh” for propagating changes down, and “merge” for propagating changes up.

In general, version management uses an optimistic approach to concurrency control, and any conflicts are detected and corrected when a merge is done. It is also possible to have a tree structure of alternatives, so in addition to handling simple long transactions, this approach also provides a mechanism for handling alternative designs in an elegant way. Version management overcomes all the problems with checkout mentioned above: there is no initial retrieval time, no copying of data is required, and the user has access to the whole database at all times.
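The merge and post terminology used in this paper can be illustrated with a minimal Python sketch. It is not the implementation of any product: the Alternative class, its method names and the DELETED marker are hypothetical, records are reduced to key/value pairs, and the parent state is copied into a snapshot only to keep the example short (a real system would share storage rather than copy it).

```python
DELETED = object()  # hypothetical marker for a record removed in an alternative


class Alternative:
    """One version of the database, holding only changes relative to its parent."""

    def __init__(self, parent=None):
        self.parent = parent
        # State of the parent as last seen by this alternative. A real system
        # would share storage with the parent; the copy keeps the sketch short.
        self.base = parent.view() if parent is not None else {}
        self.changes = {}  # key -> new value, or DELETED

    def view(self):
        """Return the data as seen from inside this alternative."""
        data = dict(self.base)
        for key, value in self.changes.items():
            if value is DELETED:
                data.pop(key, None)
            else:
                data[key] = value
        return data

    def put(self, key, value):
        self.changes[key] = value

    def delete(self, key):
        self.changes[key] = DELETED

    def merge(self):
        """Bring parent changes down; return the keys changed on both sides."""
        latest = self.parent.view()
        changed_in_parent = {
            key for key in set(latest) | set(self.base)
            if latest.get(key, DELETED) != self.base.get(key, DELETED)
        }
        self.base = latest
        return changed_in_parent & set(self.changes)

    def post(self):
        """Push changes up to the parent; a merge must come first."""
        if self.parent.view() != self.base:
            raise RuntimeError("parent has changed: merge before posting")
        self.parent.changes.update(self.changes)
        self.changes = {}
        self.base = self.parent.view()
```

In this sketch a child keeps its own value for a conflicting key after a merge, and the returned set tells the user which records need to be reviewed before posting. That policy is an assumption made for illustration; real conflict resolution is considerably more involved.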

While this technology has been available for ten years now, only recently has the superiority of this approach been widely acknowledged. It is now accepted as the industry standard approach, with all the major GIS vendors and Oracle announcing support for version management. Recent implementations of version management use a significantly different underlying architecture from the longer established one, and the differences are discussed in the following sections.

Version Management Approaches

Deep Version Management
The most established way to implement version management, which has been available for about ten years and is in production use at hundreds of sites, is termed deep version management in this paper. This approach is described in detail by Newell and Batty (1994), and this paper will just briefly summarize the approach and its characteristics.

Deep version management is implemented as part of the fundamental design of the database management system: it cannot be added later as an afterthought. Briefly, it works by allowing B-trees, the low-level data structures that make up database tables, to share disk blocks between versions. This means that there is no processing overhead at all associated with versioning. When in a version, the database only sees data in that version and does not require any additional processing to filter out other data.

An important feature is that the contents of a data disk block never change; whenever something changes, a copy of the disk block is made and the changes are applied to that copy. This means that it is possible to implement a client-oriented DBMS, in which most data is cached in memory on the client, and the great majority of processing is done on the client rather than the server. This makes the DBMS enormously more scalable than a traditional server-oriented DBMS for certain types of queries, most notably indexed queries where the same data is typically retrieved multiple times, which is the case for GIS display operations. This idea can be extended to a “persistent cache”, in essence a demand-driven distributed database. It works very well for applications where geographic subsets of the database are generally accessed in regional offices, as is typically the case with many spatial applications.

It is the author’s experience that a client-oriented DBMS requires one to two orders of magnitude less processing on the server than a traditional server-oriented DBMS, which leads to significant scalability and cost advantages. There are sites with thousands of concurrent update users working against a single database using deep version management. Another feature of this approach is that it is possible to version manage the data model in addition to the data, which is very useful for development and testing purposes.
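The block-sharing idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: BlockStore and Version are hypothetical names, a flat list of block numbers stands in for the B-tree index (a real B-tree would also copy the index blocks on the path to the root), and blocks are held in memory rather than on disk.

```python
class BlockStore:
    """Append-only block storage: a stored block is never modified."""

    def __init__(self):
        self.blocks = []

    def add(self, records):
        self.blocks.append(tuple(records))  # tuples make the block immutable
        return len(self.blocks) - 1         # block number


class Version:
    """A version is just a list of block numbers (stand-in for a B-tree)."""

    def __init__(self, store, block_ids=()):
        self.store = store
        self.block_ids = list(block_ids)

    def branch(self):
        # A new version shares every block with the version it came from.
        return Version(self.store, self.block_ids)

    def read(self, slot):
        # No filtering is needed: the version only refers to its own blocks.
        return self.store.blocks[self.block_ids[slot]]

    def write(self, slot, index, record):
        # Copy the block, change the copy, and point this version at the copy.
        records = list(self.read(slot))
        records[index] = record
        self.block_ids[slot] = self.store.add(records)
```

Because a block is never changed once written, a client can cache it indefinitely without asking the server whether it is still valid, which is the property the client-oriented design relies on.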

Shallow Version Management
Recent implementations of version management have all used what is termed shallow version management in this paper. In this approach, versioning is not an intrinsic part of the DBMS design. Instead it is done by storing records belonging to all versions in the same relational table, together with additional key information relating to the version(s) that a given record applies to. Queries perform additional processing to determine which records apply to the current version. To make this transparent to the application, queries are usually done against table views, rather than physical tables. The attraction of this approach is that it can be done on top of a mainstream DBMS, using the DBMS in a conventional way. (It is also possible to implement deep version management on top of a mainstream DBMS, but only by using the DBMS as a rather unintelligent storage mechanism for blocks of binary data, and having a whole additional query engine on the front end.) The main drawback is that it requires significantly more server processing than the deep version management approach. The other potential drawback is the relative immaturity of most of these implementations, which is discussed in the next section.
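As a rough illustration of the shallow approach, the following Python sketch uses the standard sqlite3 module. The schema is an assumption made for this example and is not taken from any vendor's product: the table pole_rows, the view pole, the one-row table current_version and the deleted flag are all hypothetical, and only a single level of alternatives below the top version (version 0) is handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Rows for every version live in one physical table.
CREATE TABLE pole_rows (
    id INTEGER, version INTEGER, material TEXT, deleted INTEGER DEFAULT 0
);
-- One-row table naming the version the session is working in.
CREATE TABLE current_version (version INTEGER);
INSERT INTO current_version VALUES (0);
-- Applications query the view, which filters rows for the current version.
CREATE VIEW pole AS
SELECT r.id, r.material
FROM pole_rows r, current_version c
WHERE r.deleted = 0
  AND (r.version = c.version
       OR (r.version = 0
           AND NOT EXISTS (SELECT 1 FROM pole_rows o
                           WHERE o.id = r.id AND o.version = c.version)));
""")

conn.executemany(
    "INSERT INTO pole_rows VALUES (?, ?, ?, ?)",
    [
        (1, 0, "wood", 0), (2, 0, "wood", 0), (3, 0, "wood", 0),  # top version
        (1, 7, "steel", 0),     # version 7 updates pole 1
        (2, 7, None, 1),        # version 7 deletes pole 2
        (4, 7, "concrete", 0),  # version 7 inserts pole 4
    ],
)


def poles_in(version):
    """Switch the session to `version` and read through the view."""
    conn.execute("UPDATE current_version SET version = ?", (version,))
    return conn.execute("SELECT id, material FROM pole ORDER BY id").fetchall()
```

Every read through the view evaluates the version predicate and the NOT EXISTS subquery, which is the additional server processing referred to above.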

Considerations in evaluating version management systems
This section briefly outlines some considerations in evaluating version management systems. The key recommendation can be summarized in two words: check references. It is relatively easy to produce a demo showing basic version management functionality. It is much harder to produce a system that provides good performance and scalability. In addition, it is difficult to produce a system that is robust enough for production use in a utility. In particular, there are some complex issues relating to system administration, such as garbage collection or database compression. Some people apparently do not realize that data that is stored in versions other than top is essential data that needs to be maintained at all times. It is unacceptable for any database operation to require that all versions be removed or posted to the top level. A typical medium to large utility needs thousands of alternatives to be managed at any given point in time, all of which contain essential data. Another operation that must be supported on versioned data is changing the data model, which obviously needs to be done on all production systems from time to time. The ability to version manage the data model makes this operation much easier.

In summary, when evaluating a version management system, look for several reference sites with a similar number of users and database size to your own system. At each site, look at the performance, and what sort of server is necessary to achieve this. And check on any operational issues; in particular, find out whether versions have ever had to be removed for administrative reasons.

Additional Requirements For Design Applications
As mentioned earlier, while version management is a key piece of functionality in implementing a good design application, it does not solve all the issues that need to be addressed. This section looks at a number of additional requirements and discusses potential solutions in each area.

First a quick note on terminology: this paper uses the terms “design”, “work order” and “job” fairly interchangeably, and loosely, to refer to a group of proposed changes that is intended to be carried out by a worker in the field. Different organizations may use these terms in more specific ways, and may have hierarchical relationships; for example, a work order may consist of multiple work requests, each of which can have one or more designs. Handling these relationships is straightforward and will not be discussed here – apart from phased completion, which is not simple and will be discussed below.

Views of the data - requirements
An important generic concept is that multiple views of data are required. A single application will usually need most or all of these. Basic views required include:

The "as-built" view
In this view, the user only sees the state of the network that is currently built, with no interference from designs that have not yet been built. A simple example is shown here:


Figure 1: Simple example of an as-built view

The "individual work order" view
In this view, there is a clear graphical indication of the changes that are being made as a part of a work order. This view is used to create the work order document, which a crew will use to carry out work in the field. Objects to be added, replaced and removed will all need to be marked with distinctive styles. For example, cross-hatching is often used to indicate objects to be removed.


Figure 2: Individual work order view for a simple network extension

An issue that is somewhat independent of the main one being discussed is that the work order view may need to contain temporary graphical changes to make it easier for the crew to interpret the printed work order. One common requirement is to make selected existing objects invisible (if they are not relevant to the current work order). Another is to temporarily offset objects or to add temporary notes, to make it easier for the crew to interpret the work order document. These temporary changes are likely to be specific to the individual work order view and not applicable to a multiple work order view.

The "future as-built" view
In this view, a selected set of pending designs is added onto the current as-built view to create a new view that represents the anticipated state of the network in the future after these changes have been made. No distinction is apparent between objects which currently exist in the field and those which are part of the selected pending designs. This allows analysis routines (for example, load flow analysis) to look at the impact of these future changes without needing any special logic. Multiple future as-built views can be created, with different sets of designs incorporated into each one. They can either be created on demand, or can be maintained permanently – for example, some organizations permanently maintain a view that includes all approved jobs.


Figure 3: Future as-built view, showing the completed design

The "multiple work order" view
Another common requirement is the ability to get a graphical indication of all the proposed work in an area (which may consist of data in many work orders). Temporary graphical changes, which have been made in an individual work order view, may not be applicable in a multiple work order view (particularly if specific objects were made invisible in an individual work order). The multiple work order view is often used when a user is working on a new job, to display other pending jobs that may overlap or impact the current job.

Views of the data – solutions
This section discusses various areas that need to be considered in addressing the problem of providing the various views described in the previous section. There is a graphical aspect of this, which relates to providing alternative graphical representations for work order and as-built views. There are many variations in the details of how this can be implemented which will not be discussed in detail here. However, in general, most of these solutions rely on storing a status field on each object (containing a value such as Proposed, Existing, or To be Removed), and having the ability to define the style (symbology) of that object depending on the value in that status field and the type of view that is currently being displayed (work order versus as-built).
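The status-driven styling described above amounts to a lookup keyed on the view type and the status value. A minimal Python sketch follows; the view names and style names are hypothetical, and the as-built entries assume that proposed objects are simply hidden in that view.

```python
# (view type, status) -> style name; None means "do not draw".
STYLES = {
    ("work_order", "Proposed"): "bold",
    ("work_order", "Existing"): "normal",
    ("work_order", "To be Removed"): "cross-hatched",
    ("as_built", "Proposed"): None,
    ("as_built", "Existing"): "normal",
    ("as_built", "To be Removed"): "normal",
}


def style_for(view, status):
    """Return the style for an object, or None if it should be invisible."""
    return STYLES.get((view, status))
```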

A more interesting question is how to handle trade-offs between isolating the as-built data from work that is in progress, and the requirement to view data from multiple work orders at the same time, which is required for both the future as-built view and the multiple work order view.

Using traditional version management with merge and post
Traditional version management, using only merge and post operations between parent and child alternatives, has some limitations in meeting the requirements laid out in the previous section. The requirement for the as-built view is that it should be cleanly separated from any work in progress. If the data for a design is stored in an alternative until the work is completed in the field (i.e., until it becomes as-built), and it is posted at this point, then the requirement for having an as-built view is satisfied. In this scenario, the top alternative always represents the as-built view. However, this approach does not handle the future as-built view well, since all designs are in their own alternatives and they cannot be combined easily. It is possible to implement a multiple work order view using this approach if the drawing routines are flexible enough to change alternatives dynamically to draw relevant information from each design of interest. However, this merely provides a graphical view, not one that can be queried.

A fairly common alternative approach is to post jobs to top once they are approved, but before they are built. This enables better handling of both the multiple work order view and the future as-built view, but introduces some compromises into the as-built view. When using this approach, a necessary addition is to have some sort of explicit relationship between a work order object and the objects that belong to that work order, in order to identify the elements of the work order in the top version of the database.

Since data is posted to top before it is "as-built" in this approach, the "as-built" view needs to be created by adding application functionality on top of the database. Visually, an "as-built" view can be achieved by making objects with a status of "Proposed" or "Replacing" invisible, while displaying objects with a status of "Existing", "To be replaced", or "To be removed". All the visible objects are displayed with the same "as-built" style. For query and analysis operations, some more care is required to provide the as-built view. For (topology based) network tracing, this can be done reasonably well by setting up connectivity rules so that Proposed or Replacing geometry does not connect to geometry with any other status. For queries and reports (such as listing all transformers in a tax district), it is necessary to make sure that an additional predicate is added to only include objects with appropriate status values.
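The additional predicate needed for as-built queries can be sketched as follows. The record layout (dictionaries with id, type, district and status fields) and the function names are assumptions made for illustration; the status values are the ones used in the paragraph above.

```python
# Statuses that represent objects physically present in the field today.
AS_BUILT_STATUSES = {"Existing", "To be replaced", "To be removed"}


def as_built(objects):
    """Keep only the objects that belong in the as-built view."""
    return [obj for obj in objects if obj["status"] in AS_BUILT_STATUSES]


def transformers_in_district(objects, district):
    """Example report: as-built transformers in one tax district."""
    return [
        obj["id"]
        for obj in as_built(objects)
        if obj["type"] == "transformer" and obj["district"] == district
    ]
```

Every query and report has to remember to apply the same filter, which is where the approach becomes error-prone as the data model grows.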

While this approach is acceptable for certain applications with a well known set of operations, it does not work as well in certain situations - especially if the data model has many relationships. If, for example, you want to query from a rack to the shelves inside it to the cards on the shelves, it is very inconvenient and error-prone to have to validate the status of every record as you do these joins.

While this approach introduces these compromises into the as-built view, the advantage is that it supports the multiple work order view and the future as-built view better than the other options considered so far. The multiple work order view is generated by making all objects visible and using the same styles (dependent on status) which are used in the individual work order view. A future as-built view can be generated by creating an alternative, selecting the work orders which we want to complete in this view, and going through the completion process in that alternative (converting "Proposed" to "Existing", deleting "To be removed", etc.).
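The completion process used to build a future as-built view might look like the following Python sketch. The record layout and the work_order field are hypothetical. The handling of "Replacing" (converted to "Existing") and "To be replaced" (deleted) is an assumption about what the "etc." above covers.

```python
def complete(objects, work_orders):
    """Return a copy of `objects` with the selected work orders completed."""
    result = []
    for obj in objects:
        if obj.get("work_order") not in work_orders:
            result.append(dict(obj))  # not part of a selected work order
            continue
        status = obj["status"]
        if status in ("To be removed", "To be replaced"):
            continue  # the object disappears once the work is done
        if status in ("Proposed", "Replacing"):
            result.append({**obj, "status": "Existing"})
        else:
            result.append(dict(obj))
    return result
```

Run inside a separate alternative, this leaves the top version untouched while giving analysis routines a view in which the selected work orders look as if they had been built. Applying the same status changes to a subset of one work order corresponds to partial completion.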

In addition, this approach can support the idea of "partial completion" (a.k.a., "partial posting") reasonably well. A user can select a subset of a work order and apply the necessary status changes to convert the subset to "Existing". There are still potential consistency/integrity issues, especially with complex data models, but in general it is a workable approach for reasonably simple models.

The sideways merge approach
Both approaches described in the previous section have some drawbacks. The one in which jobs are not posted to top until they are completed does a good job of modeling the "as-built" network cleanly with no interference from proposed work, but it is poor at handling the multiple work order and future as-built views. The second approach addresses the latter two problems much better, but introduces potentially serious compromises for the as-built view, especially for complex data models. This makes it unsuitable as a generic solution. The “sideways merge” approach introduces some new ideas, which are a significant step forward in tackling this problem. It uses the top alternative strictly for storing as-built data, as in the first option above. This gives the advantage of providing a very clean and generic way of separating proposed work from the current network, as noted before.

The sideways merge is an operation that allows changes from one alternative to be applied to another alternative elsewhere in the hierarchy, instead of restricting changes to being propagated only to immediate parent or child alternatives, which is what traditional merge and post operations do. This technique can be used to create either future as-built views or multiple work order views, by copying data from all the relevant design alternatives into a new alternative. Another way to implement the multiple work order view, which is appropriate when a small number of jobs need to be displayed graphically in addition to the current job, is to implement the ability for the drawing routines to render graphics from more than one alternative on the fly.
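A heavily simplified Python sketch of using a sideways merge to build a future as-built view is given below. It assumes that each design alternative can be represented as a dictionary of changes relative to the top alternative, with None marking a deletion; the function names are hypothetical, and a robust implementation working against a real version hierarchy is far more involved.

```python
def sideways_merge(target_changes, source_changes):
    """Apply one alternative's changes to another; return conflicting keys."""
    conflicts = {
        key for key, value in source_changes.items()
        if key in target_changes and target_changes[key] != value
    }
    for key, value in source_changes.items():
        if key not in conflicts:
            target_changes[key] = value
    return conflicts


def future_as_built(as_built, designs):
    """Combine several design alternatives on top of the as-built data."""
    combined, conflicts = {}, set()
    for changes in designs:
        conflicts |= sideways_merge(combined, changes)
    view = dict(as_built)
    for key, value in combined.items():
        if value is None:          # None marks a deletion in this sketch
            view.pop(key, None)
        else:
            view[key] = value
    return view, conflicts
```

The combined changes live in a new alternative, so the top alternative continues to hold only as-built data, and the result can be queried like any other version.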

While the sideways merge is a fairly complex operation to implement robustly, it is a very clean approach, and is by far the best solution of those discussed here.

Phased completion of jobs

Jobs built on jobs
An additional requirement is that it may be necessary to create a design that is dependent on another design that has not yet been built. In general, this is more common in communications than in utilities. In communications in particular, there is a requirement to record which jobs are dependent on other jobs, and to easily distinguish which subsequent jobs are affected if a given job is cancelled or rescheduled. There are at least two concepts that are important when implementing a full solution to this problem. One is to be able to create a "future as-built" view on which the new job can be based (by selecting the other jobs on which this job is dependent, and by making those jobs look as if they have been completed). The second concept is that when the job is completed, some sort of validation is required to determine that the necessary work in jobs that it was dependent on has also been completed. In some cases, there may also be a requirement that if a job is cancelled, any dependent jobs need to be checked to see if they are still valid. In general, this sort of validation is very application-specific.
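The dependency bookkeeping described above can be sketched in Python as follows. The data structure (a dictionary mapping each job to the set of jobs it depends on) and the function names are assumptions made for illustration; the application-specific validation itself is not modeled.

```python
from collections import deque


def affected_by(job, depends_on):
    """Jobs that directly or indirectly depend on `job`."""
    dependents = {}
    for other, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(other)
    seen, queue = set(), deque([job])
    while queue:
        for nxt in dependents.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def can_complete(job, depends_on, completed):
    """A job can be completed only when every job it depends on is done."""
    return all(dep in completed for dep in depends_on.get(job, ()))
```

When a job is cancelled or rescheduled, affected_by gives the list of subsequent jobs that need to be re-examined.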

Partial Completion (partial posting)
Sometimes a job may be completed in several pieces, each of which needs to be converted to become "as-built" independently. If this phasing is known in advance, the problem is the same as the previous one (jobs built on jobs). If not, the issue is a little different. This is an interesting problem, in the sense that it may seem straightforward to the casual observer who hasn’t studied the problem in detail, but in fact it is extremely complex to implement a generic solution. The difficulty is making sure that the subset of data that is "posted" (in a general sense, not necessarily in a strict version management sense) is completely consistent. This is very similar to the task of ensuring consistency in a spatial replica or an extract of a database, and it is possible for most applications to set up a framework to ensure that the chosen subset of data is consistent (in conjunction with appropriate application specific rules).
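One simple consistency rule of this kind can be sketched in Python. It assumes, purely for illustration, that consistency means every object referenced by the posted subset is either in the subset or already as-built; real applications would need further, application-specific rules, and the names used here are hypothetical.

```python
def missing_references(subset, references, as_built):
    """Map each selected object to the referenced objects it would leave behind."""
    missing = {}
    for obj in subset:
        gaps = {
            ref for ref in references.get(obj, ())
            if ref not in subset and ref not in as_built
        }
        if gaps:
            missing[obj] = gaps
    return missing
```

An empty result means the subset can be converted to as-built under this rule; otherwise the user is told which related objects must be included first.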

Identifying potential conflicts early
The optimistic concurrency control and conflict resolution used by version management have been proven to work very well in production systems. However, it is desirable to highlight and avoid potential conflicts wherever possible. One common approach is for a user to sketch a rough outline of the area that they expect their design to cover before they start it. This is either stored in a short transaction partition, or posted to top immediately, so that it is visible to all users. A test is then run to see if this area overlaps with any other currently active jobs. If so, both the current user and the owners of the overlapping jobs are notified so that they can investigate further what the other jobs involve, and whether they are likely to conflict with each other. This is independent of the other issues covered here, but it is worth mentioning as a useful technique to use in conjunction with version management.
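The overlap test can be sketched in Python under the simplifying assumption that each job outline is reduced to a bounding rectangle (xmin, ymin, xmax, ymax). Real outlines would be polygons and would need a proper geometry library; the function names here are hypothetical.

```python
def overlaps(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_jobs(new_area, active_jobs):
    """Return the ids of active jobs whose outline overlaps `new_area`."""
    return sorted(
        job for job, area in active_jobs.items() if overlaps(new_area, area)
    )
```

The owners of the returned jobs are the ones to notify. Rectangles that merely touch along an edge are not reported in this sketch.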

Mobile design
There is a very strong requirement to be able to carry out design operations in the field and synchronize these with the central database. This could include both the initial creation of the design and the process of making as-built changes. Therefore, it is important that the version management approach can support the appropriate use of detached databases. This will not be discussed in detail in this paper due to lack of space, but it is a complex topic in its own right.

Archiving and historical information
Another common requirement is the ability to look back in time. Requirements vary extensively in this area. At a simple level, many companies require work orders to be archived for a number of years (usually the requirement is to archive the equivalent of the paper work order document). At the other extreme, some companies would like to be able to completely reconstruct their entire network as it was at any point in the past. This is another problem that is largely independent of the core "work order problem", and for space reasons we will not discuss it further here.

Conclusions
While most vendors have now moved beyond checkout to provide some basic version management, there are some significant differences between the implementations in the marketplace. In evaluating implementations, it is especially important to look at existing operational systems, in particular examining performance, scalability and reliability. Also consider functionality differences, such as whether version management of the data model is supported, and how sophisticated the conflict resolution is.

This paper has also explained how there are significant additional requirements for a design application in addition to basic version management. The ability to do a sideways merge operation is a very important aspect of providing a comprehensive solution. Other areas to evaluate in a design application include support for jobs built on jobs; partial completion/posting; mobile design capabilities; and archiving capability.

References
  • Newell, R.G., and Easterfield, M.E., 1990: Version Management - the problem of the long transaction. Proceedings of the Mapping Awareness conference.
  • Newell, R.G., and Batty, P.M., 1994: GIS databases are different. Proceedings of AM/FM Conference XVII, Denver, pp. 279-288. (Smallworld technical paper 19).
  • Oracle Technical Network, Workspace Management information: http://otn.oracle.com/products/workspace_mgr/content.html