Subpixel Estimation of Impervious Surface Using Regression Tree Model: Accuracy of The Estimation at Different Spatial Scales


Regardless of the method used to quantify impervious surface (spectral mixture analysis, artificial neural networks or machine learning algorithms), subpixel processing techniques have proven effective at increasing the classification accuracy of impervious surface (Slonecker et al., 2001). Civco & Hurd (1997) also concluded that the information about impervious surfaces derived from subpixel classification methods was superior to that derived from traditional land use/land cover based methods.

REGRESSION TREE MODEL
A decision tree method is a multistage or hierarchical decision scheme that recursively partitions a data set in binary fashion into smaller and smaller subdivisions until the final subdivisions can no longer be partitioned or they satisfy some user-defined criteria. The procedure is called a decision tree because it follows an upside-down tree-like structure: it starts with the full data set at the root, proceeds through a series of partitions at internal nodes or splits, and ends with the final subdivisions at the terminal nodes or leaves (Figure 1). The variables used for partitioning at both the root and internal nodes are the predictor or independent variables, while the variable being partitioned is the dependent or target variable. The decision tree method is called a classification tree model if the target variable is discrete or categorical and a regression tree model if the target variable is continuous.

Numerous approaches to the decision tree method have been developed in the past thirty or so years (Friedl & Brodley, 1997). Early tree construction approaches were limited to data sets that were both well understood and well behaved, partly due to the limitations of computing technology at the time. Tree construction then depended solely on analyst expertise to set the a priori threshold values used for splitting the tree nodes. Because these values tend to vary across both time and space, specifying them on the basis of user knowledge alone was very difficult to implement in practice. With today's computing technology, however, it is more common to apply a statistical procedure to a set of training data to estimate the threshold values. The specific techniques used for this work are called learning algorithms, which have been developed within the machine-learning and pattern-recognition communities (Quinlan, 1993). The techniques require high-quality training data from which relations among the independent and dependent variables present within the data can be “learned”. A classic example of this learning algorithm approach is the classification and regression tree (CART) model described by Breiman et al. (1984). In CART, a tree is constructed by recursively splitting the data at each node on the basis of a statistical test that increases the homogeneity of the training data in the resulting descendant nodes. A number of statistical software packages now either handle CART specifically or incorporate it as part of a larger package. Among the widely used ones are See5, Cubist and S-Plus, the last of which is the statistical software used in this study.


Figure 1: Decision tree


Starting with a training data set or learning samples, three steps are necessary in building an optimal regression tree: 1) tree building; 2) tree pruning; and 3) selection of the optimal tree. Tree building based on binary recursive partitioning begins at the root node using all the learning samples, where the CART software finds the best possible variable to split the node into two child nodes. To find the best variable, the software checks all possible splitting variables (independent variables), as well as all possible values of each variable that could be used to split the node. The split selected at each node is the one that partitions the data into two parts such that the sum of the squared deviations from the mean (of the dependent variable) in the separate parts is minimized. The splitting process then continues to successive nodes until terminal nodes are reached under one of the following criteria: (1) a node reaches a user-specified minimum node size (i.e. number of training samples at the node); (2) all observations within a node have an identical distribution of predictor variables, making splitting impossible; or (3) the deviance among the samples at a node is lower than a user-specified value.
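The tree-building step described above can be illustrated with a minimal Python sketch. It is not the S-Plus implementation used in this study; the function names, the parameters `min_node_size` and `min_deviance`, and the tiny training set are all hypothetical and chosen only for illustration. The sketch searches every independent variable and every candidate threshold for the split that minimizes the summed squared deviations in the two child nodes, and stops splitting under the three criteria listed above.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Check every variable and threshold; return (feature, threshold, sse) or None."""
    best = None
    for j in range(len(X[0])):
        distinct = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, t, total)
    return best  # None when all predictor values are identical


def build_tree(X, y, min_node_size=2, min_deviance=1e-9):
    """Grow a regression tree by binary recursive partitioning."""
    node = {"prediction": mean(y)}
    # Stopping criteria (1) minimum node size and (3) low deviance.
    if len(y) <= min_node_size or sse(y) <= min_deviance:
        return node
    split = best_split(X, y)
    if split is None:  # Stopping criterion (2): no split is possible.
        return node
    j, t, _ = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    node["feature"] = j
    node["threshold"] = t
    node["left"] = build_tree([r for r, _ in left], [v for _, v in left],
                              min_node_size, min_deviance)
    node["right"] = build_tree([r for r, _ in right], [v for _, v in right],
                               min_node_size, min_deviance)
    return node


def predict(node, x):
    """Follow the splits from the root down to a terminal node."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]


# Made-up samples: two predictor values per pixel, target = percent imperviousness.
X_train = [[1, 7], [2, 3], [3, 9], [10, 4], [11, 8], [12, 2]]
y_train = [5.0, 5.0, 5.0, 50.0, 50.0, 50.0]
tree = build_tree(X_train, y_train)
```

In this toy data set only the first variable separates the low and high targets cleanly, so the root split falls on that variable and both child nodes become terminal nodes because their deviance is zero.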

Since the resulting ‘maximal’ tree is constructed from training samples, it follows every idiosyncrasy in the learning data set and as a result generally suffers from overfitting. Overfitting reduces the generalization capability of the tree and may degrade its regression accuracy when it is applied to unseen data. Correcting the regression tree for overfitting is the second step in the construction of a regression tree and is called ‘pruning’. Pruning is generally carried out using a fresh set of samples known as a validation data set. The objective of pruning is to minimize the variance of the output (dependent) variable in the validation data. In the pruning process, the last grown node of the maximal tree is removed first, followed by more and more nodes (of increasing importance), resulting in simpler and simpler trees. Each of these simpler trees is a candidate for the appropriately fit final tree or optimal tree, i.e. the one with the lowest or near-lowest output variable variance. The detailed process of tree pruning is described in Quinlan (1993) and Breiman et al. (1984).
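As a rough illustration of pruning against a validation data set, the sketch below uses reduced-error pruning, a simpler technique than the pruning procedures of Breiman et al. (1984) and Quinlan (1993) and not the one used in this study. Working from the bottom of the tree upward, a subtree is collapsed into a terminal node whenever doing so does not increase the squared error on the validation samples that reach it. The tree layout (nested dictionaries), the function names and all numbers are hypothetical.

```python
def predict(node, x):
    """Follow the splits from the root down to a terminal node."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]


def squared_error(node, X, y):
    """Sum of squared differences between actual and predicted values."""
    return sum((yi - predict(node, xi)) ** 2 for xi, yi in zip(X, y))


def count_leaves(node):
    """Number of terminal nodes in the (sub)tree."""
    if "feature" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


def prune(node, X_val, y_val):
    """Return a pruned copy of the tree (reduced-error pruning, bottom-up)."""
    if "feature" not in node:
        return dict(node)
    j, t = node["feature"], node["threshold"]
    left = [(x, v) for x, v in zip(X_val, y_val) if x[j] <= t]
    right = [(x, v) for x, v in zip(X_val, y_val) if x[j] > t]
    pruned = dict(node)
    pruned["left"] = prune(node["left"], [x for x, _ in left], [v for _, v in left])
    pruned["right"] = prune(node["right"], [x for x, _ in right], [v for _, v in right])
    # Collapse to a terminal node if that is no worse on the validation samples.
    # (A node reached by no validation samples is therefore also collapsed.)
    leaf = {"prediction": node["prediction"]}
    if squared_error(leaf, X_val, y_val) <= squared_error(pruned, X_val, y_val):
        return leaf
    return pruned


# Hypothetical overfitted tree: the lower-left split only reflects training noise.
# Each node stores the mean training target it would predict if made terminal.
full_tree = {
    "feature": 0, "threshold": 6.5, "prediction": 28.0,
    "left": {
        "feature": 0, "threshold": 2.5, "prediction": 6.0,
        "left": {"prediction": 4.0},
        "right": {"prediction": 9.0},
    },
    "right": {"prediction": 50.0},
}

# Made-up validation samples (one predictor, target = percent imperviousness).
X_val = [[1], [2], [3], [4], [10], [12]]
y_val = [6, 7, 5, 6, 49, 52]

pruned_tree = prune(full_tree, X_val, y_val)
```

Here the lower-left split is removed because a single terminal node fits the validation samples better, while the root split is kept because collapsing it would greatly increase the validation error.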

Each of the simpler trees generated after pruning may have different combinations of independent variables and different qualities. Generally one wants a tree that is parsimonious in its use of independent variables yet low in its error rates. The quality of a regression tree is measured by the mean absolute error R(T) expressed by:



$$R(T) = \frac{1}{N}\sum_{i=1}^{N}\left|Y_i - \hat{Y}_i\right|$$

where $\hat{Y}_i$ represents the value given by the regression plane through the example set for sample $i$, $N$ is the number of samples used to establish the tree and $Y_i$ is the actual value of the predicted variable. Thus, mean absolute error can be used as the basis for selecting the optimal tree.
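The selection of the optimal tree by mean absolute error can be sketched in a few lines of Python. The function name, the candidate tree labels and all values below are hypothetical; each candidate tree is represented only by the predictions it produces for the same set of samples.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual percent imperviousness for three samples.
actual = [10, 40, 80]

# Made-up predictions from two candidate trees produced by pruning.
candidates = {
    "tree_a": [12, 35, 85],
    "tree_b": [10, 50, 70],
}

# Compute R(T) for each candidate and keep the one with the lowest error.
errors = {name: mean_absolute_error(actual, pred) for name, pred in candidates.items()}
optimal = min(errors, key=errors.get)
```

In practice the comparison would also weigh how many independent variables each candidate tree uses, since a parsimonious tree with a near-lowest error may be preferable.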

The use of decision trees, both classification and regression, has steadily increased in the remote sensing field (Hansen et al., 1996; Friedl & Brodley, 1997). Compared to the traditional supervised classification procedures used in remote sensing, such as maximum likelihood classification, a classification tree has several advantages. A decision tree is, for example, strictly nonparametric and does not require assumptions regarding the distributions of the input data. It also handles nonlinear relations between features and classes, allows for missing values, and is capable of handling both numeric and categorical inputs in a natural fashion (Friedl & Brodley, 1997). A decision tree also has significant intuitive appeal because the classification structure is explicit and therefore easily interpretable. Hansen et al. (1996) tested a classification tree on remotely sensed data and their results showed that the tree performed comparably to a maximum likelihood classifier. In another study, Huang and Townshend (2003) reported that the accuracy and predictability of regression tree models were better than those of simple linear regression models.

DATA AND PREPROCESSING

Study Area
The study area was Wake County in the State of North Carolina, USA (Figure 2). With a land mass of 860 square miles, the county housed an estimated population of 678,651 persons in 2002. Wake County measures about 46 miles from east to west and 39 miles from north to south. Based on the 1998 land use distribution data from North Carolina’s Center for Geographic Information and Analysis (Figure 2 and Table 1), urbanized area covers about 7.3% of the county. Forest cover in the form of evergreen, deciduous, and mixed forests as well as woody wetlands makes up the largest percentage of the land use at 71% of the total area. Agricultural land in the form of crop agriculture and pasture ranks second with 18.7% of the


Figure 2: Study area showing land use distribution


Data and Image Preprocessing
The study used two forms of remote sensing imagery and planimetric data for a small section of the study area. The first type of remote sensing imagery consisted of scenes from Path 15/Row 35 and Path 16/Row 35 of the Landsat ETM+ images captured by the Landsat 7 satellite. The images, captured on 25 April 2002, had already been orthorectified and projected to the UTM Zone 17 coordinate system with the NAD83 datum (Figure 3). The second type of remote sensing imagery was the 0.3m rectified orthophotos from the USGS Earth Resources Observation System Data Center, taken on 28 March 2002 (Figure 4). The orthophotos had likewise been orthorectified and projected to the UTM Zone 17 coordinate system with the NAD83 datum. The high resolution of the orthophotos made it possible to distinguish visually between pervious and impervious materials as well as between different types of impervious materials. For that reason, the orthophotos were used to update the planimetric data used in establishing the training and validation data sets for impervious surface. To avoid potential errors due to temporal differences in image dates, it is best that all the images are of the same date, or at least close enough to assume no changes in land use/land cover properties.

Table 1: Distribution of land use in Wake County


Digital planimetric data for a small urbanized section of the county was acquired and used as the main source of training and validation data sets for impervious surface in building the regression tree model. It is important to note that although the planimetric data contained rich information in vector format delineating as-built boundaries of building footprints, parking lots, roads, footpaths and other structures, its use for determining the amount of impervious surface was limited by its spatial availability. In this case, the planimetric data covers only about ten percent of the study area. In addition, the planimetric data was out of date and had to be updated using the latest high-resolution aerial photos. Once updated using the 0.3m orthophotos, the planimetric data provided accurate and current training and validation data sets for impervious surface.


Figure 3: Landsat ETM+ images

