GITA 2001: System Architecture

Practical Object Versioning for Distributed Mobile Databases

Robert Wills
Severn Trent Systems, Alexander House
19 Fleming Way, Swindon SN1 2NG, United Kingdom


Extending the Enterprise Database to Mobile Devices
Distributing enterprise data to mobile devices for use by field personnel makes good business sense for a wide variety of applications. This is most easily achieved with a “thin client” architecture, where the data and application logic reside in the controlled environment of a central data center, and only the user interface executes on the mobile device. There are scenarios, however, where the field worker must be able to run applications when no connection to the data center is available. To achieve this, a subset of the enterprise database and application code must be deployed to a “thick client” device and executed locally.

When a field worker updates the database on their disconnected mobile device, the changes cannot be immediately validated against data elsewhere in the enterprise. There needs to be some way of reconciling the overlapping edits and other data inconsistencies that can occur. If the data volumes are large, this reconciliation process needs to be automated wherever possible.

The discipline of Software Configuration Management (SCM) provides a solid theoretical framework based around the concept of “versioned objects”, which can be usefully adapted to solve some of the problems inherent in building a distributed enterprise-wide database.

Software configuration management (SCM) concepts

“Object”

Objects are collections of data which are treated as an atomic unit for the purposes of versioning. For a SCM system, objects are typically text files containing program source code. Any non-trivial software project will involve a large number of source files, which are usually organized into folders. Figure 1a shows a UML class model which describes a hierarchy of folders containing files - the essential concept behind most computer storage at the operating system level.


Figure 1a

Figure 1b

More complex associations and dependencies between these files are implied by the source code they contain, although it requires an appropriate compiler to figure them out. Some software development environments with SCM features make these associations more explicit by storing and versioning fragments of source code at a more granular level. Figure 1b is a simplistic model of part of the Java language, illustrating some typical fragments.

Formal programming language definitions and the full UML metamodel allow this process of decomposition to be taken much further, modeling the building blocks of individual executable statements and expressions. This level of analysis adds little value to a SCM process, and current SCM tools rarely have an object model more complex than figure 1a above.

The principles of object versioning can be usefully applied outside the SCM context to any model, no matter how simple or complex. The objects being versioned can be traditional RDBMS table rows just as easily as program source files.

Object identity is an important issue for any system that supports versioning. The RDBMS approach is to identify an object by some of its attributes, known as its “primary key”. An OODBMS identifies objects with some sort of extra reference or “handle”, which might be a memory address or a number which acts as a unique object identifier (“OID”). For a distributed, version-aware application there are advantages to the object-oriented approach, since it allows two objects created on different remote systems but with the same primary key to exist in the same distributed database. This situation implies some form of conflict which will eventually need resolving, but the system must have somewhere to store the two objects in the meantime.
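The point about identity can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `StoredObject` and `Store` classes, the use of a random UUID as the OID, and the hydrant asset number used as a primary key. The sketch only shows that keying storage on the OID leaves room for two objects that claim the same primary key, while still allowing the clash to be reported for later resolution.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    primary_key: str        # business key, e.g. an asset number (hypothetical)
    attributes: dict
    # System-assigned identity, independent of the attribute values.
    oid: str = field(default_factory=lambda: uuid.uuid4().hex)


class Store:
    """Keeps objects by OID, so duplicate primary keys can coexist."""

    def __init__(self):
        self.by_oid = {}

    def add(self, obj):
        self.by_oid[obj.oid] = obj

    def key_conflicts(self):
        """Return the primary keys claimed by more than one object."""
        seen = {}
        for obj in self.by_oid.values():
            seen.setdefault(obj.primary_key, []).append(obj.oid)
        return {pk: oids for pk, oids in seen.items() if len(oids) > 1}


store = Store()
store.add(StoredObject("HYDRANT-17", {"status": "ok"}))       # created on device A
store.add(StoredObject("HYDRANT-17", {"status": "leaking"}))  # created on device B
conflicts = store.key_conflicts()   # both objects are stored; the clash is reported
```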

“Version”
When an object is changed, an SCM system does not throw away the old state of the data. Each state is called a “version”, and is given some form of unique version identifier (“VID”). More importantly, the SCM system keeps a record of which previous version the new data is based upon, creating a directed graph of the versions of a particular object. Other housekeeping information is typically stored with each version, such as the name of the user making the change, a timestamp, and comments about why the change was made.


Figure 2

An important property of object versions is their stability. Once created, the contents of an object version should not change for the entire lifetime of the SCM project. Change always implies creating a new version.
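As a rough illustration, a version record and its link to a predecessor might look like the following Python sketch. The `Version` class, the `commit` and `ancestry` helpers and the field names are hypothetical. A single `parent` field is a simplification: it is enough for a tree of versions, whereas a version produced by merging would need to record more than one predecessor.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)          # frozen: a version cannot be altered once created
class Version:
    vid: str                     # unique version identifier
    parent: Optional[str]        # VID this version is based on (None for the first)
    author: str
    comment: str
    content: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


history = {}                     # VID -> Version, for one object


def commit(vid, parent, author, comment, content):
    """Record a new version; existing versions are never modified."""
    if vid in history:
        raise ValueError(f"version {vid} already exists")
    if parent is not None and parent not in history:
        raise ValueError(f"unknown parent {parent}")
    history[vid] = Version(vid, parent, author, comment, content)
    return history[vid]


def ancestry(vid):
    """Follow the parent links back to the first version."""
    chain = []
    while vid is not None:
        chain.append(vid)
        vid = history[vid].parent
    return chain


commit("1.0", None, "alice", "initial version", "int x;")
commit("1.1", "1.0", "bob", "rename variable", "int count;")
```

Attempting to assign to a field of a stored `Version` raises an error, so any change has to go through `commit` and produce a new VID.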

Simplistic SCM systems can only store multiple versions of file objects. More practical SCM systems allow both folders and files to be versioned, and this principle can be extended to a full application data model, where some or all of the different object types are versioned.

The existence of versions gives rise to two different types of object reference. If the example above is part of the version history of a file called “collections.cpp”, then a version-specific reference which specifies “collections.cpp, version 2.8”, will always resolve to the same piece of data. A “bill of materials” describing the contents of a software release is one context where the stability of a version-specific reference is very useful. If a second source file contains the statement #include “collections.cpp” then it is making a generic reference to the collections.cpp file, relying on some other mechanism, such as a “view”, to select a specific version.

“Baseline” or “Stripe”
In SCM terms, a “baseline” or “stripe” is something that describes a snapshot of the system, and can be used to reconstruct that snapshot later. The snapshot might represent the source files used to build a particular beta release, for example. Physically, the baseline consists of a list of version-specific object references or a rule which can be used to derive the list.

Baselines are typically used to record stable, consistent states that the objects within the system achieve as a group. In the SCM context this might mean that the source files build successfully, or that the resulting executable code does something useful. In the version-aware application context it could mean that all referential integrity and other validation constraints are fulfilled. Baselines can be used as part of the implementation of a “long transaction”, where they act in a similar way to a SQL savepoint.
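A baseline held as an explicit list of version-specific references can be sketched as follows. The `repository` layout, the file contents and the helper names are invented for this example; a rule-based baseline would compute the same list rather than store it.

```python
# Hypothetical repository: object name -> {version label -> content}
repository = {
    "collections.cpp": {"2.7": "old container code", "2.8": "new container code"},
    "widget.java": {"1.3": "beta code", "1.4": "feature code"},
}


def make_baseline(selection):
    """Freeze a set of version-specific references as an immutable tuple."""
    for name, vid in selection.items():
        if vid not in repository[name]:
            raise KeyError(f"{name} has no version {vid}")
    return tuple(sorted(selection.items()))


def reconstruct(baseline):
    """Rebuild the snapshot that a baseline describes."""
    return {name: repository[name][vid] for name, vid in baseline}


beta1 = make_baseline({"collections.cpp": "2.8", "widget.java": "1.3"})

# Later work adds versions but cannot disturb the baseline.
repository["widget.java"]["1.5"] = "merged code"
snapshot = reconstruct(beta1)
```

Because versions never change once created, the baseline needs to store only references, not copies of the data.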

“Check-in”, “Check-out”

A developer using an SCM system is often required to “check-out” a file before making any changes to it. This applies a lock to that particular version of the file which prevents any other developers from starting work on it, until it has been “checked-in” again. Most SCM systems can also be configured to only lock object versions during the brief period when a developer’s changes are being merged back to the code repository. These two approaches correspond closely to the “pessimistic” and “optimistic” locking strategies available in most RDBMS products.

In a distributed database with intermittent connections, a pessimistic lock or “check-out” may take a long time to complete. It requires an exchange of messages with the central datastore or another lock manager, which is not possible while the device is disconnected. A strategy of optimistic locking, coupled with automated tools for merging changes when conflicts are detected later, is effective for implementing most (but not all) application functionality requirements.
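The optimistic strategy amounts to checking, at the moment an edit is submitted, whether the data has moved on since the editor last read it. The following Python sketch shows only that detection step. The `Record` class, the `submit` function and the pipe attributes are hypothetical, and the record is overwritten in place for brevity where a versioned store would add a new version instead.

```python
class ConflictError(Exception):
    """Raised when an edit was based on a version that is no longer current."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 1


def submit(record, based_on_version, new_value):
    """Accept an edit only if the record is unchanged since the editor read it."""
    if record.version != based_on_version:
        raise ConflictError(
            f"edit based on v{based_on_version}, record is now at v{record.version}"
        )
    record.value = new_value
    record.version += 1


pipe = Record({"diameter_mm": 100})
seen_by_office = pipe.version      # both users read version 1; no lock is taken
seen_by_field = pipe.version

submit(pipe, seen_by_office, {"diameter_mm": 150})
try:
    submit(pipe, seen_by_field, {"diameter_mm": 125})
    outcome = "applied"
except ConflictError:
    outcome = "needs merge"        # hand over to a merge step instead of failing
```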

“Branch” or “Activity”
When looking at the directed graph of versions of one particular object, there may be occasions when two or more versions exist at the same point in time and are derived from the same predecessor, or “parent”. The graph is then said to “branch” at that parent. Some SCM systems use a form of Dewey-decimal numbering to encode an object’s version graph within the names assigned to versions.

In the SCM context, branches occur in the version graphs of objects when different development activities are running in parallel. A typical example is release bug-fixing, where one group of developers make small, carefully-controlled fixes to a code baseline which has been released to customers, while others are working on major new functionality for the next release. Most SCM tools allow branches in individual object version graphs to be associated with a label representing the global context, or “activity” for which they were created.


Figure 3

While the version label “1.3.1.0” is specific to the object “widget.java”, the branch name “V2.2_BETA1_BUGFIX” is global and may involve changes to many objects.
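The labels in this example suggest how a numbering scheme can carry the shape of the version graph. The Python sketch below assumes one particular convention, inferred from the labels 1.3, 1.3.1.0 and 1.3.1.1 used here: a branch label is the parent label, then a branch number, then a sequence number starting at zero. Real SCM tools differ in the details, and the function names are made up for the illustration.

```python
def first_on_branch(parent, branch_no=1):
    """Label of the first version on a new branch off `parent` (1.3 -> 1.3.1.0)."""
    return f"{parent}.{branch_no}.0"


def successor(label):
    """Next version on the same line of development (1.3.1.0 -> 1.3.1.1)."""
    head, _, last = label.rpartition(".")
    return f"{head}.{int(last) + 1}"


def branch_point(label):
    """Version a branch was split from, or None for a two-part main-branch label."""
    parts = label.split(".")
    return ".".join(parts[:-2]) if len(parts) > 2 else None


bugfix_1 = first_on_branch("1.3")   # first fix on the beta bug-fix branch
bugfix_2 = successor(bugfix_1)      # second fix on the same branch
mainline = successor("1.3")         # parallel work on the main branch
```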

In the context of a distributed mobile database application, global branches can be used to keep changes applied on one mobile device separate from those applied on another. A branch might have the title “changes made on device 21”, for example.

“View”
A view is similar to a baseline in that it is a collection of rules used to derive a list of object versions. Unlike a baseline, a view does not represent a special state of the system, and the contents of a view can change from moment to moment as new object versions are created.

The most important view is often referred to as “main-latest”, meaning “the most recent version on the main branch of every object in the system”. In the SCM context this is the cutting edge of code that the developers are working on. In the distributed application context, this is the best consensus state of the data. A distributed database application will probably enforce rules to ensure that the data visible through the “main-latest” view goes from one valid state to another in a transactionally-secure way. Good SCM practice endeavors to do this, but proving that the source code is “valid” may involve building, deploying and running the code, so automated SCM tools have to be less rigorous in this area.

For a distributed database application, other views can be defined which show the state of the data as it appears on a specific remote device. This is particularly useful when designing an architecture that must provide a consistent user experience across both thick and thin client devices, since a web/application server can show each individual user slightly different data which matches what they see when working disconnected on a device with local data storage.
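A view can be thought of as a rule that turns a generic reference into a version-specific one. The sketch below assumes a very simple kind of rule, an ordered list of branch names to search, and a hypothetical `versions` table. The branch name “device-21” and the object names are examples only.

```python
# Hypothetical store: (object name, branch) -> version labels, oldest first
versions = {
    ("widget.java", "main"): ["1.2", "1.3", "1.4"],
    ("widget.java", "device-21"): ["1.3.1.0", "1.3.1.1"],
    ("pump.java", "main"): ["1.0", "1.1"],
}


def view(branch_preference):
    """Build a resolver that picks the latest version on the first matching branch."""
    def resolve(name):
        for branch in branch_preference:
            labels = versions.get((name, branch))
            if labels:
                return labels[-1]
        raise KeyError(name)
    return resolve


main_latest = view(["main"])
device_21 = view(["device-21", "main"])   # local changes first, then the main branch

office_sees = main_latest("widget.java")  # latest version on the main branch
field_sees = device_21("widget.java")     # latest version made on device 21
unchanged = device_21("pump.java")        # no local change, so falls back to main
```

Because the resolver looks up the table on every call, the result of a view changes as new versions are added, which is what distinguishes it from a baseline.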

“Merge”
Where branching provides a way of splitting the “version-lifeline” of an object, merging enables the separate branches to be brought back together again. This is best illustrated with an example from a fictitious SCM scenario:

When AcmeJumble 2.2 Beta 1 was released, “widget.java” was at version 1.3. Two minor bug fixes were made and issued to selected customers as patches – these produced “widget.java” versions 1.3.1.0 and 1.3.1.1. Meanwhile the main development team were adding a new feature which the product managers insisted on including in the final public release of AcmeJumble 2.2 - this produced version 1.4 of “widget.java”. The beta period has now finished, and the bug fixes that were applied need to be merged back into the main branch so that the developers can finish work on AcmeJumble 2.2 Beta 2.
Figure 4 shows the situation graphically.

It turns out that an automated SCM tool can do a surprisingly good job of merging these different versions of “widget.java” back together. The details of this operation are described below, and are key to understanding how the SCM versioning model improves upon “replication” or “synchronization” for reconciling changes in a distributed database.
  • Using the directed graph that links the versions, find the most recent common ancestor of the two versions to be merged – this turns out to be version 1.3.
  • Compute the differences between version 1.3.1.1 and the common ancestor, 1.3.
  • Compute the differences between version 1.4 and the common ancestor, 1.3.
  • If these two sets of differences overlap, then there are conflicting changes – ask the user to choose, interactively.
  • In this case there are no overlaps, so create version 1.5 automatically by taking version 1.4 and applying the (V1.3.1.1 – V1.3) differences already computed.
For program source files, a “difference” is usually defined as one or more adjacent lines of text which differ from the corresponding region of a second file by something more significant than whitespace characters (space, tab, carriage-return, etc.). For structured data, differences will be found at the level of individual object attributes, to which the above algorithm can be more easily applied.
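The steps listed above can be sketched for structured data, where a difference is simply an attribute whose value has changed. The version labels follow the “widget.java” example, but the attribute names and values are invented, and the sketch assumes that every version carries the same set of attributes and has exactly one parent. Conflicting attributes are reported rather than resolved, since that decision belongs to the user.

```python
# Hypothetical version graph for one record: label -> (parent label, attributes)
graph = {
    "1.3":     (None,      {"material": "iron", "depth_m": 1.2, "status": "ok"}),
    "1.3.1.0": ("1.3",     {"material": "iron", "depth_m": 1.2, "status": "leaking"}),
    "1.3.1.1": ("1.3.1.0", {"material": "iron", "depth_m": 1.5, "status": "leaking"}),
    "1.4":     ("1.3",     {"material": "PVC",  "depth_m": 1.2, "status": "ok"}),
}


def ancestors(label):
    """Return the label followed by all of its ancestors, most recent first."""
    chain = []
    while label is not None:
        chain.append(label)
        label = graph[label][0]
    return chain


def common_ancestor(a, b):
    """Most recent version from which both a and b are derived."""
    seen = set(ancestors(a))
    return next(label for label in ancestors(b) if label in seen)


def diff(base, derived):
    """Attributes whose value in `derived` differs from the value in `base`."""
    return {k: v for k, v in derived.items() if base.get(k) != v}


def merge(ours, theirs):
    """Three-way merge; returns (merged attributes, conflicting attribute names)."""
    base = graph[common_ancestor(ours, theirs)][1]
    our_changes = diff(base, graph[ours][1])
    their_changes = diff(base, graph[theirs][1])
    conflicts = {k for k in our_changes.keys() & their_changes.keys()
                 if our_changes[k] != their_changes[k]}
    merged = {**graph[ours][1], **their_changes}
    for k in conflicts:
        merged[k] = graph[ours][1][k]   # unresolved: keep ours until a user decides
    return merged, conflicts


merged, conflicts = merge("1.4", "1.3.1.1")
```

Here the two sets of changes touch different attributes, so `conflicts` is empty and `merged` combines the new material from 1.4 with the depth and status from 1.3.1.1.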

Replication/Synchronization versus SCM Merge
To push this point a little further, imagine you only have “widget.java” versions 1.3.1.1 and 1.4, and wish to “synchronize” them – perhaps 1.3.1.1 was created while working in the field on a laptop, and 1.4 on a desktop PC. If we compare them, there will be five differences, {a,b,c,d,e}. From timestamp information we also know that 1.3.1.1 is the more recent version. Without access to their common ancestor, there is simply no way that we can figure out whether to choose text from 1.3.1.1 or 1.4 at each of the five ambiguous regions of the file. At best we can adopt a “last one in, wins” approach and discard the work that went into version 1.4, or interactively offer the user both versions and invite them to sort the problem out manually. The interactive approach works well for two-way synchronization of personal data on a PDA with a desktop PC, but is not appropriate for a high-volume enterprise-wide distributed database application.

Most industrial-strength database replication mechanisms suffer from exactly the same problem. The word “replication” itself provides a clue about the sort of tasks this functionality was designed for – making replicas of a database to act as a backup, hot standby, or perhaps a local cache for improved performance. It may be possible to configure how frequently a replication process executes and what subset of data gets replicated, but when it comes to conflict resolution the choice is usually to treat one end of the link as the “master”, or to write all conflicts to an error table. Retaining some portion of the version history of objects/rows is the key to improving on this situation, and at the time of writing there do not appear to be any mainstream RDBMS products which do anything “out of the box” to support this.

Instead of program source code, imagine a database of more structured information appropriate to field working: asset catalogs, construction crews, repair jobs, GIS data, and so on. When an office worker changes one attribute of a particular asset record, for example, and a field worker changes another on a disconnected mobile device, there is no reason why tried and trusted SCM principles should not be used to automatically merge the changes. This keeps the enterprise data accurate and up to date, and allows user interaction to be focused where it is needed: on resolving overlapping edits which are truly irreconcilable.

Practical architecture using SCM concepts
Figure 5 is a simple outline architecture for a distributed enterprise database application where the mobile devices can operate as both thick and thin clients, and all persistent data is stored as versioned objects.


Figure 5

Message-Oriented Middleware
Message-oriented middleware (MOM) in the form of store and forward message queues is an ideal technology for propagating changes between distributed databases. It provides storage for queuing up messages which cannot be delivered while the mobile device is disconnected, and all non-trivial implementations will deliver messages in a well-defined order. Some MOM products can also participate in distributed transactions using a two-phase commit protocol, which ensures that messages are never lost as they cross from one system to another.

A major advantage of MOM is its simplicity – as a consequence, implementations tend to be robust. Arguably the biggest disadvantage is the asynchronous nature of a message – by the time you receive it, the world may have moved on and the data in the message is out of date. The ability of a distributed database that uses versioned objects to refer explicitly and precisely to past system states therefore dovetails very neatly with MOM. Transactions can be effectively implemented by collecting a set of object versions into a single message which is then delivered by the MOM and applied to a remote database as an atomic unit.
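The idea of a transaction as one message carrying a set of object versions can be sketched as follows. An in-memory `deque` stands in for the store-and-forward queue and a dictionary for the receiving database; both are assumptions, as are the message layout and the function names. No real MOM product or API is being modeled.

```python
import json
from collections import deque

outbox = deque()   # stands in for a store-and-forward message queue
central = {}       # receiving database: (oid, vid) -> attributes


def send_transaction(versions):
    """Bundle several object versions into a single message."""
    outbox.append(json.dumps({"versions": versions}))


def deliver_all():
    """Deliver queued messages in order; apply each one completely or not at all."""
    while outbox:
        message = json.loads(outbox[0])
        staged = {}
        for v in message["versions"]:
            key = (v["oid"], v["vid"])
            if key in central or key in staged:
                # Versions are immutable, so an existing VID cannot be overwritten.
                raise ValueError(f"version {key} already exists")
            staged[key] = v["attributes"]
        central.update(staged)     # reached only if every version was acceptable
        outbox.popleft()           # the message leaves the queue once applied


# Queued while the device is disconnected, delivered when a connection returns.
send_transaction([
    {"oid": "job-7", "vid": "1.1", "attributes": {"state": "done"}},
    {"oid": "valve-3", "vid": "2.4", "attributes": {"position": "closed"}},
])
deliver_all()
```

A message that is rejected stays at the head of the queue and leaves the database untouched, so the two object versions above are either both visible or both absent.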

Conclusion
Taking the step from relational or even object-oriented data to a distributed database of fully versioned objects is not a small one, and the costs in terms of development time, data volumes and application performance should not be underestimated. There are commercial OODBMS products with some level of object versioning support which may provide a useful foundation, but at the time of writing there are no fully satisfactory commercial off-the-shelf products available.

There are two particular functionality requirements for which a versioned database is a good technical solution:
  • Different users in different places often edit the same data, and the machines they work on are not always connected over some kind of network.
  • All users need to be able to see up-to-date data incorporating changes made by other physically remote users no more than a few seconds after new data is received.
If both of these are high up on your list of product requirements (and probably not otherwise), then you should think seriously about designing a system around versioned objects.

References
The following books provide a good starting point, and both contain excellent bibliographies:
Leon, A., 2000, A Guide to Software Configuration Management
Meyer, B., 1997, Object-Oriented Software Construction (Second Edition)
There is an interesting comparative review of OODBMS products at: http://www.dacs.dtic.mil/techs/oodbms2/oodbms2.pdf
In a world increasingly awash with data, versioning features are beginning to appear independently in many new contexts outside of software development, such as the “WebDAV” project (“Web-based Distributed Authoring and Versioning”): http://www.webdav.org/