Support Vector classifiers for Land Cover Classification
2.2 Artificial Neural Network Classifier
A feed-forward artificial neural network (ANN) is used in this study. This is the most widely used neural network model, and its design consists of one input layer, at least one hidden layer, and one output layer. Each layer is made up of non-linear processing units called neurons, and the connections between neurons in successive layers carry associated weights. Connections are directed and allowed only in the forward direction, e.g. from the input layer to a hidden layer, or from a hidden layer to a subsequent hidden layer or to the output layer. Non-linear processing is performed by applying an activation function to the summed inputs to a unit. Back-propagation is a gradient-descent algorithm that minimises the error between the target outputs of the training input/output pairs and the actual network outputs (Bishop, 1995). During training, a set of input/output pairs is repeatedly presented to the network and the error is propagated from the output layer back to the input layer. The weights on the backward path through the network are updated according to an update rule and a learning rate. ANNs are not solely specified by the characteristics of their processing units and the selected training or learning rule. The network topology, i.e. the number of hidden layers, the number of units, and their interconnections, also has an influence on classifier performance. In this study we use the network architecture and number of training patterns suggested by Kavzoglu (2001).
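The forward pass and the back-propagated weight update described above can be illustrated with a minimal sketch. It is not the network used in this study: the single hidden layer of three sigmoid units, the learning rate, the number of epochs, the batch update, and the XOR toy data are all assumptions chosen only to keep the example small.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: inputs -> sigmoid hidden units -> one sigmoid output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
              for row, bias in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def mean_squared_error(data, W1, b1, W2, b2):
    return sum((forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, n_hidden=3, learning_rate=0.5, epochs=2000, seed=0):
    """Batch gradient descent on the half squared error (illustrative settings)."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b2 = 0.0
    step = learning_rate / len(data)
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, target in data:
            hidden, output = forward(x, W1, b1, W2, b2)
            # Error signal at the output unit (derivative of the sigmoid included).
            delta_out = (output - target) * output * (1.0 - output)
            gb2 += delta_out
            for j in range(n_hidden):
                gW2[j] += delta_out * hidden[j]
                # Error propagated back from the output to hidden unit j.
                delta_hidden = delta_out * W2[j] * hidden[j] * (1.0 - hidden[j])
                gb1[j] += delta_hidden
                for i in range(n_in):
                    gW1[j][i] += delta_hidden * x[i]
        # Update rule: move each weight against its averaged gradient.
        for j in range(n_hidden):
            W2[j] -= step * gW2[j]
            b1[j] -= step * gb1[j]
            for i in range(n_in):
                W1[j][i] -= step * gW1[j][i]
        b2 -= step * gb2
    return W1, b1, W2, b2


# Toy data (an assumption for illustration): the XOR problem.
xor_data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
loss_before = mean_squared_error(xor_data, *train(xor_data, epochs=0))
loss_after = mean_squared_error(xor_data, *train(xor_data, epochs=2000))
print(loss_before, loss_after)
```

Changing `n_hidden` in this sketch is a simple way to see the influence of network topology mentioned above.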
2.3 Support vector classifier
In the two-class case, a support vector classifier attempts to locate the hyperplane that maximises the distance from the hyperplane to the closest members of each class. The principle of a support vector classifier is briefly described next.
Assume that the training data with k samples are represented by $\{\mathbf{x}_i, y_i\}$, $i = 1, \ldots, k$, where $\mathbf{x}_i \in \mathbb{R}^n$ is an n-dimensional vector and $y_i \in \{-1, +1\}$ is the class label.
is the class label. These training patterns are said to be linearly separable if a vector w (which determining the orientation of a discriminating plane) and a scalar b (determine offset of the discriminating plane from origin) can be defined so that inequalities (1) and (2) are satisfied.
The aim is to find a hyperplane which divides the data so that all the points with the same label lie on the same side of the hyperplane. This amounts to finding $\mathbf{w}$ and $b$ so that

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 0, \quad i = 1, \ldots, k \tag{3}$$
If a hyperplane exists that satisfies (3), the two classes are said to be linearly separable. In this case, it is always possible to rescale $\mathbf{w}$ and $b$ so that

$$\min_{1 \le i \le k} \; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$$
That is, the distance from the closest point to the hyperplane is $1/\|\mathbf{w}\|$. Then (3) can be written as

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, k \tag{4}$$
The hyperplane for which the distance to the closest point is maximal is called the optimal separating hyperplane (OSH) (Vapnik, 1995). As the distance to the closest point is $1/\|\mathbf{w}\|$, the OSH can be found by minimising $\|\mathbf{w}\|^2$ (equivalently, $\tfrac{1}{2}\|\mathbf{w}\|^2$) under constraint (4).
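The rescaling argument and the resulting margin of 1/||w|| can be checked numerically. The sketch below uses three made-up two-dimensional samples and an arbitrary separating hyperplane; the function names and the data are assumptions introduced only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b, samples):
    """y_i (w . x_i + b) for every labelled sample (x_i, y_i)."""
    return [y * (dot(w, x) + b) for x, y in samples]


def rescale_to_canonical(w, b, samples):
    """Scale (w, b) so that the smallest value of y_i (w . x_i + b) equals 1."""
    smallest = min(functional_margins(w, b, samples))
    if smallest <= 0:
        raise ValueError("the hyperplane does not separate the samples")
    return [wi / smallest for wi in w], b / smallest


def distance_to_closest_point(w, b, samples):
    """Geometric distance from the hyperplane to the nearest sample."""
    return min(functional_margins(w, b, samples)) / math.sqrt(dot(w, w))


# Toy samples (assumed): two points of class +1, one of class -1.
samples = [((1.0, 1.0), +1), ((2.0, 2.0), +1), ((-1.0, -1.0), -1)]
w, b = [3.0, 3.0], 0.0  # one of many (w, b) pairs describing the plane x1 + x2 = 0

w_c, b_c = rescale_to_canonical(w, b, samples)
print(w_c, b_c)                                    # canonical form of the same plane
print(distance_to_closest_point(w_c, b_c, samples))
print(1.0 / math.sqrt(dot(w_c, w_c)))              # equals the distance above
```

Rescaling changes neither the hyperplane nor the geometric distance; it only fixes the scale of w and b so that the distance can be read off as 1/||w||.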
The minimisation procedure uses Lagrange multipliers and Quadratic Programming (QP) optimisation methods.
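If $\lambda_i$, $i = 1, \ldots, k$, are the non-negative Lagrange multipliers associated with constraint (4), the optimisation problem becomes one of maximising (Osuna et al., 1997):

$$\sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{5}$$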
under constraints $\lambda_i \ge 0$, $i = 1, \ldots, k$.
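The step from the constrained minimisation to this maximisation problem is the standard Lagrangian argument. A brief sketch in the notation of this section follows; the symbols $L$ and $W(\lambda)$ are labels introduced here and are not taken from the cited sources.

```latex
\begin{align*}
% Lagrangian of: minimise (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1
L(\mathbf{w}, b, \lambda)
  &= \tfrac{1}{2}\|\mathbf{w}\|^2
     - \sum_{i=1}^{k} \lambda_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right],
     \qquad \lambda_i \ge 0 \\
% Stationarity with respect to w and b
\frac{\partial L}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{k} \lambda_i y_i \mathbf{x}_i \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{k} \lambda_i y_i = 0 \\
% Substituting both conditions back into L removes w and b
W(\lambda)
  &= \sum_{i=1}^{k} \lambda_i
     - \frac{1}{2} \sum_{i=1}^{k}\sum_{j=1}^{k}
       \lambda_i \lambda_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)
\end{align*}
```

The second stationarity condition, $\sum_i \lambda_i y_i = 0$, is an equality constraint that standard treatments of this dual problem list alongside the non-negativity of the multipliers.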
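If $\lambda^a = (\lambda_1^a, \ldots, \lambda_k^a)$ is an optimal solution of the maximisation problem (5), then the optimal separating hyperplane can be expressed as:

$$\mathbf{w}^a = \sum_{i=1}^{k} y_i \lambda_i^a \mathbf{x}_i \tag{6}$$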
The support vectors are the points whose optimal multiplier satisfies $\lambda_i^a > 0$; for these points the equality in (4) holds.
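As an illustration of how the hyperplane and the support vectors follow from the multipliers, the sketch below takes a set of multipliers as given and does not solve the quadratic programme. The three samples and the multiplier values are assumptions chosen so that the optimality conditions can be checked by hand; the offset b is recovered from a support vector using the equality in (4).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hyperplane_from_multipliers(samples, multipliers, tol=1e-9):
    """Return (w, b, support_indices) from Lagrange multipliers taken as given."""
    n = len(samples[0][0])
    w = [0.0] * n
    for (x, y), lam in zip(samples, multipliers):
        for d in range(n):
            w[d] += y * lam * x[d]          # w = sum_i y_i * lambda_i * x_i
    support = [i for i, lam in enumerate(multipliers) if lam > tol]
    # For a support vector y_s (w . x_s + b) = 1, and y_s is +1 or -1,
    # so b = y_s - w . x_s.
    x_s, y_s = samples[support[0]]
    b = y_s - dot(w, x_s)
    return w, b, support


# Toy problem (assumed): the third point lies well inside its own class region.
samples = [((1.0, 1.0), +1), ((-1.0, -1.0), -1), ((2.0, 2.0), +1)]
multipliers = [0.25, 0.25, 0.0]  # optimal for this toy problem, checked by hand

w, b, support = hyperplane_from_multipliers(samples, multipliers)
print(w, b, support)
```

Only the first two samples carry non-zero multipliers, so removing the third sample would leave the hyperplane unchanged.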
If the data are not linearly separable, slack variables $\xi_i$, $i = 1, \ldots, k$, can be introduced with $\xi_i \ge 0$ (Cortes and Vapnik, 1995) such that (4) can be written as

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0 \tag{7}$$
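and the generalised OSH, also called a soft margin hyperplane, can be obtained using the conditions

$$\min_{\mathbf{w}, b, \xi} \left[ \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{k} \xi_i \right] \tag{8}$$

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \ge 0 \tag{9}$$

$$\xi_i \ge 0, \quad i = 1, \ldots, k \tag{10}$$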
The first term in (8) is the same as in the linearly separable case and controls the learning capacity, while the second term controls the number of misclassified points. The parameter C is chosen by the user; larger values of C assign a higher penalty to errors.
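The role of the two terms and of C can be seen by evaluating the soft margin objective for a fixed hyperplane. In the sketch below the samples, the hyperplane, and the two values of C are assumptions for illustration; each slack is taken as the smallest non-negative value that satisfies the relaxed constraint.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 with y (w . x + b) - 1 + xi >= 0."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, samples, C):
    capacity_term = 0.5 * dot(w, w)                      # ||w||^2 / 2
    total_slack = sum(slack(w, b, x, y) for x, y in samples)
    return capacity_term + C * total_slack


# Toy samples (assumed): the last point carries label +1 but lies on the -1 side.
samples = [((1.0, 1.0), +1), ((-1.0, -1.0), -1), ((-0.5, -0.5), +1)]
w, b = [0.5, 0.5], 0.0

slacks = [slack(w, b, x, y) for x, y in samples]
print(slacks)
print(soft_margin_objective(w, b, samples, C=1.0))
print(soft_margin_objective(w, b, samples, C=10.0))
```

With the larger C, the single misclassified point dominates the objective, so an optimiser would accept a narrower margin in order to reduce the slack.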
Where it is not possible to have a hyperplane defined by linear equations on the training data, the techniques described above for linearly separable data can be extended to allow for non-linear decision surfaces. A technique introduced by Boser et al. (1992) maps input data into a high dimensional feature space through some non-linear mapping. The transformation to a higher dimensional space spreads the data out in a way that facilitates the finding of linear hyperplanes. After replacing $\mathbf{x}$ by its mapping in the feature space, $\Phi(\mathbf{x})$, equation (5) can be written as:

$$\sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j \left( \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \right) \tag{11}$$
To reduce computational demands in feature space, it is convenient to introduce the concept of the kernel function K (Cristianini and Shawe-Taylor, 2000; Cortes and Vapnik, 1995) such that:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \tag{12}$$
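A small numerical check may make the kernel idea concrete. The sketch below uses a degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature mapping can be written out explicitly; this kernel, the mapping `phi`, and the sample vectors are assumptions for illustration. A radial basis function is included for comparison, with a hypothetical default for `gamma`.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the kernel (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same inner product as dot(phi(x), phi(z)), computed in input space."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - z||^2); gamma value is illustrative."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product computed in feature space
implicit = poly2_kernel(x, z)    # same value without forming phi
print(explicit, implicit)
print(rbf_kernel(x, z), rbf_kernel(x, x))
```

For the radial basis kernel the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical route.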
Then, to solve equation (11), only the kernel function is computed in place of computing $\Phi(\mathbf{x})$, which could be computationally expensive. A number of kernel functions are used for support vector classifiers. Details of some kernel functions and their parameters used with SVM classifiers are discussed by Vapnik (1995).

SVM was initially designed for binary (two-class) problems. When dealing with several classes, an appropriate multi-class method is needed. A number of methods are suggested in the literature to create multi-class classifiers using two-class methods (Hsu and Lin, 2002). In this study, a "one against one" approach (Knerr et al., 1990) was used to generate multi-class SVMs. In this method, all possible two-class classifiers are evaluated from the training set of n classes, each classifier being trained on only two out of the n classes, giving a total of n(n-1)/2 classifiers. Applying each classifier to the vectors of the test data gives one vote to the winning class, and the pixel is given the label of the class with the most votes. A radial basis kernel function, defined as $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2}$, where $\gamma$ is a kernel parameter, was used.
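The one-against-one voting scheme can be sketched independently of the binary classifier used. In the example below the pairwise classifier is a nearest-class-mean rule standing in for a two-class SVM, and the class names and two-band pixel values are hypothetical; only the pairing and voting logic is the point of the sketch.

```python
from collections import Counter
from itertools import combinations


def train_pairwise(train_fn, samples_by_class):
    """Train one binary classifier for every pair of classes: n(n-1)/2 in total."""
    classifiers = {}
    for label_a, label_b in combinations(sorted(samples_by_class), 2):
        classifiers[(label_a, label_b)] = train_fn(
            samples_by_class[label_a], samples_by_class[label_b], label_a, label_b)
    return classifiers


def predict_one_against_one(classifiers, x):
    """Each pairwise classifier casts one vote; the most voted class wins.

    Ties are broken arbitrarily (first class to reach the top count).
    """
    votes = Counter(classify(x) for classify in classifiers.values())
    return votes.most_common(1)[0][0]


def nearest_mean_trainer(samples_a, samples_b, label_a, label_b):
    """Stand-in binary classifier (NOT an SVM): assign to the nearer class mean."""
    def mean(samples):
        return [sum(column) / len(samples) for column in zip(*samples)]

    mean_a, mean_b = mean(samples_a), mean(samples_b)

    def classify(x):
        dist_a = sum((xi - m) ** 2 for xi, m in zip(x, mean_a))
        dist_b = sum((xi - m) ** 2 for xi, m in zip(x, mean_b))
        return label_a if dist_a <= dist_b else label_b

    return classify


# Hypothetical two-band training pixels for three land cover classes.
samples_by_class = {
    "water": [(0.1, 0.1), (0.2, 0.1)],
    "forest": [(0.2, 0.8), (0.3, 0.9)],
    "urban": [(0.8, 0.5), (0.9, 0.6)],
}

classifiers = train_pairwise(nearest_mean_trainer, samples_by_class)
print(len(classifiers))
print(predict_one_against_one(classifiers, (0.25, 0.85)))
print(predict_one_against_one(classifiers, (0.12, 0.10)))
```

Replacing `nearest_mean_trainer` with a function that fits a two-class SVM would give the multi-class scheme described above without changing the voting code.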