Development of a Vision-Based Positioning System for High Density Urban Areas
Tianen Chen and Ryosuke Shibasaki
Center for Spatial Information Science, University of Tokyo
4-6-1, Komaba, Meguro-Ku, Tokyo 153-8505, JAPAN
E-mail: chen,shiba@skl.iis.u-tokyo.ac.jp
Keywords: High Accuracy Navigation, Mobile GIS, Image Sequence Analysis, Absolute Orientation, Relative Orientation
Abstract
An approach to determining the position and rotation of a camera, for the purpose of developing Augmented Reality GIS and
autonomous navigation systems in urban areas, is presented in this paper. The method combines a CCD camera, DGPS
(Differential Global Positioning System), INS (Inertial Navigation System), a magnetometer, a gyroscope, and image sequence
analysis technology. It is also assumed that a 2D/3D GIS (Geographic Information System) of the area is provided. Along
a street in an urban area, a sequence of street scene images is captured with the CCD camera and matched to the 2D/3D GIS models to
determine the system’s position and orientation continuously. Relative orientation is used to determine a newly captured
image’s translation and orientation relative to its predecessor when no models can be clearly seen in the two
neighboring images. The developed methods have been tested with real image and GIS data in outdoor environments. The
results indicate that the method is potentially applicable to personal navigation, mobile GIS positioning, automatic land vehicle
navigation in urban areas where GPS cannot be used reliably, and Augmented Reality research.
Introduction
Highly precise automatic location tracking is a major
component of direction-providing systems,
autonomous navigation systems, autonomous mobile robots,
and Augmented Reality research. For vehicle location, many
approaches have been studied and realized before, such as
dead reckoning systems, beacon-based navigation systems,
radio navigation systems, and GPS. Among those systems,
dead reckoning systems are least expensive, but positional
errors accumulate and cannot be corrected without other
information sources. In fact, dead reckoning is seldom used
alone; it is often embedded in other systems, such as
beacon-based systems, which have to rely on dead
reckoning in areas not covered by beacons. A major problem
with beacon-based systems is the high initial cost of
installing hundreds or thousands of beacons in a large city
and the subsequent cost of maintaining them. Radio
navigation systems use radio signals transmitted between
fixed stations and vehicles to determine vehicle locations.
For example, some such systems use cellular phone services
for signal transmission. The position information provided
by current radio navigation systems is not very accurate;
average errors may be up to 200 meters. GPS can provide
accurate position information for most parts of the world, but
it may have trouble operating in urban areas where GPS
satellite signals are often blocked by tall buildings, trees,
and overpasses. We acquired a DGPS receiver
unit and tested it in Tokyo. We found that the unit
frequently did not receive enough satellite signals to
determine its location in 95% of the areas. Sometimes, GPS did
not output any position information for a long time. Also,
GPS is not an option for indoor robot navigation.
Another way to acquire position information is by computer
vision. In fact, many animals depend on their vision to
locate themselves. Vision-based systems are attractive in
that they are self-contained, in the sense that they require no
external infrastructure such as beacons, radio stations, or
satellites. Vision-based systems in principle can operate
indoors and outdoors, virtually anywhere, as long as there
are rich visual features for place recognition.
In this paper, an approach to determining the position and attitude
of a camera for the purpose of developing AR-type GIS and
autonomous navigation systems in urban areas is presented,
which combines a CCD camera, GPS, a gyro sensor, and image
sequence analysis technology.
Matching for Image Navigation
Matching is the bottleneck of landmark-based
navigation. It can be defined as the establishment of
correspondences, either image-to-model or
image-to-image.
Image-to-Model Matching
In most image navigation systems, the key issue is to
establish a correspondence between the world model
(map) and the sensor data (image). Once such a
correspondence is established, the position of the vehicle or
aerial photograph can be determined easily by a
coordinate transformation. In order to solve this problem,
we need to extract a set of features from the sensor data
and identify the corresponding features in the world
model. The problem is further complicated by the fact
that the image and the map/model are usually in different
formats.
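
As an illustration of the coordinate transformation mentioned
above, the following minimal sketch projects a 3D model point
into the image with an ideal pinhole camera. The function name,
the rotation convention (world axes to camera axes), and the
focal length in pixels are assumptions made for this example;
they are not taken from the system described in this paper.

```python
import numpy as np

def project_point(X_world, R, C, f):
    """Project a 3D world point into image coordinates (pinhole model).

    R: 3x3 rotation from world axes to camera axes (assumed convention).
    C: camera centre in world coordinates.
    f: focal length in pixels; the principal point is assumed at (0, 0).
    """
    X_cam = R @ (np.asarray(X_world, float) - np.asarray(C, float))
    if X_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division maps camera coordinates to the image plane.
    return f * X_cam[:2] / X_cam[2]

# Camera at the origin looking along +Z; a building corner 10 m ahead,
# 2 m to the right and 1 m up (hypothetical values).
uv = project_point([2.0, 1.0, 10.0], np.eye(3), [0.0, 0.0, 0.0], f=800.0)
print(uv)  # [160.  80.]
```

Determining the camera position is the inverse problem: given
several matched model and image features, R and C are the
unknowns to be estimated.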
The methods used to match image and 3D models
directly can be broadly categorized into three groups: (1)
key-feature algorithms, (2) generalized Hough
transformation and pose clustering, and (3) tree search.
Unlike these approaches, which use the weak perspective
approximation, we handle the model-image feature
correspondence in two steps. The first step is to take a
second image that overlaps sufficiently with the first one
and to match the two neighboring images with the common
image-to-image matching methods described in the
next section. The second step is to select, from the first
image, line features that correspond to landmarks of the GIS
models. Since those image line features
have already been matched with those of the second image,
the correspondence between the second image and the
3D models can be constructed indirectly, with the
first image acting as a “bridge”. This kind of matching
algorithm does not need any initial estimates of
the camera’s position and rotation.
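
The bridging idea can be sketched as a composition of two
correspondence tables. The identifiers below (model edge names
and image line indices) are hypothetical and serve only to show
how model-to-second-image matches follow from the two tables.

```python
def bridge_matches(model_to_img1, img1_to_img2):
    """Compose model->image1 and image1->image2 line matches.

    Only model features whose line in the first image was also
    matched in the second image are carried over.
    """
    return {
        model_id: img1_to_img2[line1]
        for model_id, line1 in model_to_img1.items()
        if line1 in img1_to_img2
    }

# Hypothetical match tables: GIS model edges -> line indices in image 1,
# and line indices in image 1 -> line indices in image 2.
model_to_img1 = {"bldg_A_edge_3": 12, "bldg_B_edge_1": 7, "bldg_C_edge_2": 30}
img1_to_img2 = {12: 15, 7: 9, 21: 4}

model_to_img2 = bridge_matches(model_to_img1, img1_to_img2)
print(model_to_img2)  # {'bldg_A_edge_3': 15, 'bldg_B_edge_1': 9}
```

The edge of building C is dropped because line 30 of the first
image has no counterpart in the second image.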
Image-to-Image Matching
Image matching is one of the most fundamental
processes in computer vision and digital photogrammetry.
The methods for image matching can be divided into
three classes, i.e. area-based matching, feature-based
matching, and symbolic (relational) matching (Lemmens,
1988).
Area-based matching is associated with matching gray
levels. That is, the gray level distributions of small areas
of the two images, called image patches, are compared and
the similarity is measured by cross-correlation
(Lemmens, 1988) or least-squares techniques.
Area-based matching requires very good initial values for
the unknown parameters.
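
A minimal sketch of area-based matching with normalized
cross-correlation is given below. It is an illustration under
simple assumptions (pure translation between patches, exhaustive
search), not the implementation used in our system; the function
names are chosen for this example only.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized gray-level patches."""
    a = np.asarray(patch_a, float)
    b = np.asarray(patch_b, float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant patch has no texture; define its score as 0.
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def best_match(template, search):
    """Slide the template over the search image; return (row, col, score)."""
    th, tw = template.shape
    best = (-1, -1, -2.0)  # any real score is >= -1
    for r in range(search.shape[0] - th + 1):
        for c in range(search.shape[1] - tw + 1):
            s = ncc(template, search[r:r + th, c:c + tw])
            if s > best[2]:
                best = (r, c, s)
    return best

# Synthetic check: cut a patch out of a random image and find it again.
rng = np.random.default_rng(0)
search = rng.random((20, 20))
template = search[5:10, 8:13].copy()
row, col, score = best_match(template, search)
```

In practice the search is restricted to a small window around
the predicted position, which is why good initial values matter
for this class of methods.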
The features used as matching entities in feature-based
matching are derived from the original image. In digital
photogrammetry, interest points are most often used
while in computer vision, edges are preferred. Since
edges are more abstract quantities, matching them is
usually more robust than matching interest points. The
similarity of edges, for example in shape, sign, and strength
(gradient), is measured by a cost function.
Feature-based matching methods are in general more
robust and require less stringent assumptions.
The third method, symbolic matching, sometimes
referred to as relational matching, compares symbolic
descriptions of images and measures the similarity by a
cost function. The symbolic descriptions may refer to
gray levels or to derived features. They can be
implemented as graphs, trees, or semantic nets. In
contrast to the other methods, symbolic matching is not
strictly based on geometric similarity properties. Instead
of using the shape or location as a similarity criterion, it
compares topological properties.
In our system, the feature-based matching method,
combined with the other two matching methods, was
applied.
Straight line extraction and description
Edges are the most fundamental features of objects in the
3D world. Their extraction is important in many computer
vision systems since image edges usually correspond to
some important properties of 3D objects such as object
boundaries. Edges, especially straight lines, are the
main features used in feature-based matching and
relational matching. In our method, straight lines are also
the main features used to construct the correspondence
between 3D models and images. Here, we will briefly
introduce their extraction and attribute description for the
next matching process.
An edge-preserving smoothing filter based on
Nagao and Matsuyama (1979) was used, which strengthens the
gray level discontinuities whilst reducing the gray level
differences in homogeneous regions. Edges are detected
by convolving the image with a kernel, or several kernels.
Here, the Sobel operator was preferred. Both
edge strength and gradient direction were used to extract
straight lines with dynamic programming and least-squares
fitting methods at subpixel accuracy (Chen, 1999).
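
The computation of edge strength and gradient direction with
the Sobel operator can be sketched as follows. This is a plain
NumPy illustration on the valid image region (no border
padding); it applies the kernels by correlation, which differs
from true convolution only in the sign convention of the
gradient. The smoothing and line extraction steps are not shown.

```python
import numpy as np

# Sobel kernels for the horizontal (x) and vertical (y) derivatives.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
KY = KX.T

def filter3x3(img, kernel):
    """Correlate img with a 3x3 kernel over the valid region (no padding)."""
    img = np.asarray(img, float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def sobel(img):
    """Return (edge strength, gradient direction in radians)."""
    gx = filter3x3(img, KX)
    gy = filter3x3(img, KY)
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# Synthetic vertical step edge: dark left half, bright right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
strength, direction = sobel(img)
print(strength[1])  # [0. 4. 4. 0.]
```

The gradient direction is zero along the step, i.e. it points
in the +x direction, perpendicular to the vertical edge.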
An image line can be described by its attributes, such as
position and orientation. These attributes are the most
important measures used in image line matching.
Besides position and orientation, other attributes include
length, contrast, gradient, and texture features (Chen, 1999).
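
As a sketch of how such attributes might be derived, the
example below fits a straight line to a set of edge pixels by
total least squares (via the singular value decomposition) and
returns position, orientation, and length. The choice of total
least squares and the attribute names are assumptions of this
example; the fitting procedure of Chen (1999) may differ.

```python
import numpy as np

def describe_line(points):
    """Fit a line to (x, y) edge pixels and return basic attributes."""
    pts = np.asarray(points, float)
    centroid = pts.mean(axis=0)
    # The first right singular vector of the centred points is the
    # direction of the total-least-squares line through the centroid.
    _, _, vt = np.linalg.svd(pts - centroid)
    direction = vt[0]
    # The extent of the points projected onto the line gives its length.
    t = (pts - centroid) @ direction
    length = t.max() - t.min()
    # A line has no preferred sense, so fold the angle into [0, pi).
    theta = np.arctan2(direction[1], direction[0]) % np.pi
    return {"position": centroid, "orientation": theta, "length": length}

# Five pixels on the diagonal y = x.
attrs = describe_line([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)])
```

For these points the position is (2, 2), the orientation is
pi/4, and the length is 4 * sqrt(2).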
Two-step matching
The problem of straight line matching is to find line
correspondences over two (or more) images. The
similarity of lines in different images is a key feature in
matching. In general, relaxation is often used. However,
since straight lines have many significant attributes, a
more powerful matching function can be defined. As a
result, line matching can be a relatively simple single-pass process.
In this paper, a dynamic programming method was used for
matching (Chen, 1999).
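
The details of the matching function are given in Chen (1999)
and are not reproduced here. The following is only a generic
sketch of order-preserving line matching by dynamic
programming, assuming that each line is described by an
(orientation, length) pair and that the lines of both images
are sorted in a comparable order. The cost weights and the skip
penalty are hypothetical values.

```python
import math

def line_cost(a, b, w_theta=1.0, w_len=0.5):
    """Dissimilarity of two lines given as (orientation_rad, length)."""
    d_theta = abs(a[0] - b[0]) % math.pi
    d_theta = min(d_theta, math.pi - d_theta)  # orientations wrap at pi
    d_len = abs(a[1] - b[1]) / max(a[1], b[1])
    return w_theta * d_theta + w_len * d_len

def dp_match(lines1, lines2, skip_cost=0.3):
    """Return the (i, j) index pairs of the minimum-cost ordered matching."""
    n, m = len(lines1), len(lines2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * skip_cost
    for j in range(1, m + 1):
        D[0][j] = j * skip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + line_cost(lines1[i - 1], lines2[j - 1]),
                D[i - 1][j] + skip_cost,  # line i-1 of image 1 unmatched
                D[i][j - 1] + skip_cost,  # line j-1 of image 2 unmatched
            )
    # Backtrack from the last cell to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + line_cost(lines1[i - 1], lines2[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + skip_cost:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

lines1 = [(0.0, 50), (1.57, 80), (0.5, 30)]
lines2 = [(0.02, 48), (0.8, 20), (1.55, 82), (0.52, 31)]
pairs = dp_match(lines1, lines2)
print(pairs)  # [(0, 0), (1, 2), (2, 3)]
```

The second line of the second image has no similar counterpart
and is skipped, while the remaining lines are matched in a
single pass over the cost table.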