Machine Learning Methods to Identify Mislabeled Training Data and Appropriate Features for Global Land Cover Classification
Jonathan Cheung-Wai Chan, Matthew C. Hansen and 1Ruth S. DeFries
Laboratory for Global Remote Sensing Studies,
University of Maryland, College Park, MD 20742, USA
1 Also with Earth System Science Interdisciplinary Center,
University of Maryland, College Park, MD 20742, USA
Tel: +1 301 405 8696, Fax: +1 301 314 9299
Email: jonchan@glue.umd.edu
Key Words
machine learning, filtering, mislabeled data, feature selection
Abstract
In an effort to improve the AVHRR 1km global land cover product (Hansen et al., 2000), the original training data set used to derive the supervised decision tree classifier is evaluated using machine learning methods with two aims: to identify mislabeled pixels and to optimize the input feature subset. A filtering process is built from three data modelers: a Decision Tree Classifier, Instance-Based Learning and Learning Vector Quantization. Cross-validation is used to mark every instance in the data set as correctly or incorrectly classified. Incorrectly classified instances are treated as possible mislabels. With consensus filtering, 74% of the cases identified as mislabeled agree with expert knowledge. More mislabels are identified when cases flagged by only one or two of the data modelers are also considered, but agreement with expert knowledge degrades to 54.2-65.5%. A wrapper approach to feature subset selection is then applied to the improved data set, from which the mislabeled training data have been discarded. A total of 18 features were chosen from the original 41. A map generated using the improved training set was compared with an expert-improved classification. Overall agreement is 53.1%, and per-class agreement ranges from <25% to >70%. Further work is needed to determine what causes the disagreements and how machine learning methods can be better exploited for operational global land cover mapping.
Introduction
Global land cover information is a prerequisite input to many atmospheric models that describe Earth system processes (Seller et al., 1997). In the last two decades, remotely sensed data have been increasingly used to monitor global land cover change on a regular basis. Previous studies have successfully used AVHRR data to produce global vegetation maps at 1km and 8km resolutions (Hansen et al., 2000; DeFries et al., 1998). For the global land cover map at 1km resolution, only training sites interpreted from fine-resolution Landsat Multispectral Scanner System (MSS) and Thematic Mapper (TM) data as containing 100% of the land cover of interest were chosen. Details of the preparation of the training data set are documented in Hansen et al. (2000). Collecting a global training data set is understandably difficult given the diversity within each land cover type. Exhaustive in-situ validation is either too expensive or impossible because the sites are physically inaccessible. Because of these constraints, mislabels in the training data set are not uncommon. Continuing efforts have been made to improve the 1km product. This paper reviews the use of machine learning techniques to identify mislabels and to optimize the feature subset in the training set.
Methodology
Filtering mislabels in training
Weisberg (1985) suggested the use of regression analysis to identify outliers in the training data. Cases that cannot be described by the model and have the largest residual errors are outliers. Motivated by the same idea, Brodley and Friedl (1999) used filtering to remove mislabeled training data. The idea is to use different learning algorithms to generate several classifiers from the training set; together the classifiers form a committee of data modelers. Cases that do not conform to any model can be either noise or exceptions. Since there is no way to distinguish between noise and exceptions in the training set, these outliers are treated as candidate mislabels. Brodley and Friedl (1999) showed that filtering improved classification accuracy significantly for noise levels up to 30%. Whether a consensus filter (which discards cases misclassified by all models) or a majority filter (which discards cases misclassified by more than half of the models) should be used depends on the objectives of the task. In this study, two sets of training labels were used; the second set is an improved version with additional expert input. Mislabeled cases in the first training set identified by machine learning filtering are compared with the improved data set.
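The difference between the two filters can be illustrated with a short sketch. It is not the authors' code: the function name `flag_mislabels` and the small error table are hypothetical, and the table is assumed to record, for each training instance, whether each committee member misclassified it.

```python
def flag_mislabels(errors, rule="consensus"):
    """Return indices of instances flagged as candidate mislabels.

    errors[i][j] is True when data modeler j misclassified instance i
    (hypothetical input layout, one row per training instance).
    """
    flagged = []
    for i, row in enumerate(errors):
        wrong = sum(row)  # number of modelers that disagree with the label
        if rule == "consensus":
            hit = wrong == len(row)      # every modeler disagrees
        elif rule == "majority":
            hit = wrong > len(row) / 2   # more than half disagree
        else:
            raise ValueError("rule must be 'consensus' or 'majority'")
        if hit:
            flagged.append(i)
    return flagged


# Toy error table for a committee of three modelers (illustrative values only).
errors = [
    [True, True, True],     # misclassified by all three
    [True, True, False],    # misclassified by two
    [False, True, False],   # misclassified by one
    [False, False, False],  # classified correctly by all
]
consensus_flags = flag_mislabels(errors, "consensus")
majority_flags = flag_mislabels(errors, "majority")
```

In this toy table the consensus rule flags only the first instance, while the majority rule also flags the second, which shows why the consensus filter is the more conservative of the two.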
To mark each instance as correctly or incorrectly labeled, an n-fold cross-validation is performed over the training data. The training data are subdivided into n equal parts and, for each of the n parts, a classifier is generated from the other n-1 parts and used to classify the held-out part. After the n trials, every instance in the training set has been marked. We used 10-fold cross-validation for our experiments. The committee of data modelers comprises a Decision Tree Classifier, Instance-Based Learning and Learning Vector Quantization.
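The marking step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually run in the study: the names `cross_validation_marks` and `fit_nearest_mean`, the one-feature toy data and the nearest-class-mean learner are all hypothetical stand-ins for the three data modelers.

```python
import random


def cross_validation_marks(X, y, fit, n_folds=10, seed=0):
    """Mark each instance True (label reproduced) or False (candidate mislabel).

    fit(train_X, train_y) must return a function predict(x) -> label.
    Each instance is classified by a model trained on the other folds only.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]  # near-equal parts
    marks = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            marks[i] = predict(X[i]) == y[i]
    return marks


def fit_nearest_mean(train_X, train_y):
    """Toy learner (assumption): assign the class with the nearest mean vector."""
    sums, counts = {}, {}
    for x, label in zip(train_X, train_y):
        s = sums.setdefault(label, [0.0] * len(x))
        for d, v in enumerate(x):
            s[d] += v
        counts[label] = counts.get(label, 0) + 1
    means = {c: [v / counts[c] for v in s] for c, s in sums.items()}

    def predict(x):
        return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

    return predict


# Two well-separated toy classes plus one deliberately mislabeled instance.
X = [[i * 0.1] for i in range(10)] + [[10 + i * 0.1] for i in range(10)] + [[0.45]]
y = ["a"] * 10 + ["b"] * 10 + ["b"]  # the last instance lies in the "a" cluster
marks = cross_validation_marks(X, y, fit_nearest_mean, n_folds=10)
```

Running one such pass per data modeler yields one column of marks per modeler, which is the input a consensus or majority filter needs.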
Decision Tree Classifiers (DTC) recursively subdivide the training set into homogeneous subsets according to split rules. They learn quickly, and their explicit tree structure makes the classification process easier to interpret. Our experiment used C5.0, a commercial successor to C4.5, which was developed by Quinlan (1993). C5.0 chooses its splits according to information gain; details of the criterion can be found in Quinlan (1993).
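As an illustration of the information gain criterion, the sketch below scores a binary threshold split on a single continuous feature. It follows the textbook definition (entropy of the parent minus the weighted entropy of the children) and makes no claim about how C5.0 implements its splits internally; the function names, class labels and feature values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Toy feature values for two classes (illustrative only).
values = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
labels = ["forest"] * 3 + ["grass"] * 3
gain_clean = information_gain(values, labels, 0.45)  # separates the classes fully
gain_poor = information_gain(values, labels, 0.15)   # isolates a single instance
```

A tree builder would evaluate candidate thresholds on every feature in this way and split on the one with the highest score, then recurse on each subset.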
Instance-Based Learning (IB) is described as a lazy learner because it does not process the inputs until it is queried for information (Aha, 1998). As such, it has little training cost. However, because IB stores the training instances and retrieves the most similar ones at query time, classification costs are high, and it is slow when there are many attributes. More about IB can be found in Aha (1992) and Wettschereck (1994). We used the Instance-Based inducer of the SGI MLC++ (Machine Learning library in C++, 1996) Utilities 2.0 for our experiments. The number of nearest neighbors was set to 1.
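The lazy behavior of a 1-nearest-neighbor classifier can be seen in a few lines. The sketch below is an assumption-laden illustration, not the MLC++ inducer: it uses plain Euclidean distance on unscaled features, and the function name and sample data are hypothetical.

```python
from math import dist


def nearest_neighbor_predict(train_X, train_y, query):
    """Return the label of the stored instance closest to `query`.

    There is no training step: all work happens at query time, and the cost
    grows with both the number of stored instances and the number of attributes.
    """
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    return train_y[best]


# Toy stored instances with two attributes each (illustrative only).
train_X = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_y = ["a", "a", "b"]
label_near_b = nearest_neighbor_predict(train_X, train_y, (4.0, 4.5))
label_near_a = nearest_neighbor_predict(train_X, train_y, (0.2, 0.1))
```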
Learning Vector Quantization (LVQ) is a supervised version of Self-Organizing Maps (Kohonen, 1995). By assigning codebook representatives to each class, LVQ defines class borders resembling the Voronoi sets of vector quantization. It is a nearest-neighbor-based classifier. Experiments were performed with LVQ_PAK v3.1, available from the Helsinki University of Technology. We assigned 100 codebook representatives to each class. The first 5,000 training steps used OLVQ1, which adjusts the learning rate automatically. Training then continued for 200,000 steps with LVQ1 at a lower learning rate of 0.05. A snapshot was taken every 10,000 steps, and the snapshot with the highest accuracy was chosen for the trial.
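The LVQ1 update can be sketched as below. This is a toy illustration and not LVQ_PAK itself: it uses one codebook vector per class instead of 100, omits the OLVQ1 phase, and assumes a linearly decaying learning rate that starts at 0.05. The function names and data are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def lvq1_train(codebook, codebook_labels, X, y, steps, alpha0=0.05, seed=0):
    """Run `steps` LVQ1 updates and return the adjusted codebook vectors."""
    rng = random.Random(seed)
    codebook = [list(m) for m in codebook]  # work on a copy
    for t in range(steps):
        alpha = alpha0 * (1 - t / steps)    # assumed linear decay schedule
        i = rng.randrange(len(X))
        x = X[i]
        # Find the codebook vector nearest to the sampled training instance.
        c = min(range(len(codebook)), key=lambda k: squared_distance(codebook[k], x))
        # Attract it if the classes agree, repel it otherwise.
        sign = 1.0 if codebook_labels[c] == y[i] else -1.0
        codebook[c] = [m + sign * alpha * (a - m) for m, a in zip(codebook[c], x)]
    return codebook


def lvq_predict(codebook, codebook_labels, x):
    """Classify x with the label of its nearest codebook vector."""
    c = min(range(len(codebook)), key=lambda k: squared_distance(codebook[k], x))
    return codebook_labels[c]


# Toy one-feature data with one codebook vector per class (illustrative only).
X = [[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]]
y = ["a", "a", "a", "b", "b", "b"]
codebook_labels = ["a", "b"]
trained = lvq1_train([[0.8], [1.6]], codebook_labels, X, y, steps=200)
pred_low = lvq_predict(trained, codebook_labels, [0.1])
pred_high = lvq_predict(trained, codebook_labels, [2.3])
```

Taking snapshots, as described above, would amount to saving a copy of the codebook every fixed number of steps and keeping the copy that scores best on held-out data.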
Optimal feature subset
Finding an optimal feature subset for a task can improve interpretability and reduce computing costs. Reducing computing costs is particularly important in global land cover mapping, given the size of the data set involved. Irrelevant or correlated features are reported to be harmful to learning algorithms such as decision trees (John et al., 1994). Feature selection reduces the input dimension and may also improve accuracy.
Both filter and wrapper approaches can be adopted for feature subset selection. Filter approaches use only the training data in the evaluation, whereas wrapper approaches incorporate the induction algorithm into the evaluation when searching for the best possible feature subset. Wrapper approaches are reported to be superior when applied to decision trees, yielding smaller trees (which are easier to interpret) and higher accuracies (Kohavi and John, 1998). For our experiment, we used a wrapper approach with a decision tree classifier. The induction algorithm is run on the training data set using different subsets of the original feature set. The subset with the highest evaluation is used to generate a classifier.
A forward selection procedure using best-first search is adopted (Ginsberg, 1993). Forward selection means that each expansion adds one feature to the current subset. The search states are nodes representing subsets of features. The idea of best-first search is to jump to the most promising node generated so far that has not yet been expanded. The search stops when no improved node has been found in the previous k expansions. An improved node is defined as a node whose accuracy is at least x percent higher than that of the best node found so far. For our experiment, k and x were set to 5 and 0.001% respectively.
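The search procedure described above can be sketched as follows. The sketch rests on several assumptions: `evaluate` stands for the wrapper's accuracy estimate (in the study, a decision tree evaluated on the training data) but is replaced here by a hypothetical toy scoring function, and the threshold x is interpreted as percentage points of accuracy. The names `forward_best_first` and `toy_accuracy` are illustrative only.

```python
def forward_best_first(n_features, evaluate, k=5, x=0.001):
    """Forward selection with best-first search.

    evaluate(subset) returns an accuracy in percent for a frozenset of
    feature indices. The search stops after k consecutive expansions that
    produce no node at least x better than the best node found so far.
    """
    start = frozenset()
    open_nodes = {start: evaluate(start)}  # generated but not yet expanded
    closed = set()                         # already expanded
    best, best_acc = start, open_nodes[start]
    stale = 0
    while open_nodes and stale < k:
        node = max(open_nodes, key=open_nodes.get)  # most promising open node
        del open_nodes[node]
        closed.add(node)
        improved = False
        for f in range(n_features):
            if f in node:
                continue
            child = node | {f}  # forward selection: add exactly one feature
            if child in closed or child in open_nodes:
                continue
            acc = evaluate(child)
            open_nodes[child] = acc
            if acc >= best_acc + x:
                best, best_acc = child, acc
                improved = True
        stale = 0 if improved else stale + 1
    return best, best_acc


def toy_accuracy(subset):
    """Hypothetical score: features 0 and 2 help, every other feature hurts."""
    useful = {0, 2}
    return 50.0 + 20.0 * len(subset & useful) - 1.0 * len(subset - useful)


selected, accuracy = forward_best_first(5, toy_accuracy, k=5, x=0.001)
```

With this toy score the search settles on features 0 and 2 and then halts after five expansions that fail to improve on that subset. In a real wrapper each call to `evaluate` trains and tests a classifier, so the number of evaluated nodes dominates the running time.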