GIS FUNCTIONS - INTERPOLATION¹

by
Ferenc Sárközy
Technical University Budapest
Department of Surveying

Abstract

The main strength of GIS (Geographical Information Systems) is the joint analysis of compound spatial and attributive data. The latter are collected as samples. The values of the sampled data can be expanded to sites where no samples are available using interpolation methods. The interpolation can be performed either in the phase of data production or when the data are used for spatial analysis. In the second case the interpolation is a GIS function.
A brief review of interpolation methods is given, with emphasis on the applicability of each particular method to one of the cases mentioned above.
The possibilities of two new methods, the wavelet transformation and the artificial neural network (ANN) approach, are also discussed.
Special emphasis is given to the possibilities of artificial neural networks in model building and interpolation. The connection between the data model and the ANN approach is examined. In the context of ANN the links between interpolation, classification and GIS functions are explained.
As conclusions, several statements about GIS functions and modeling are given.

Keywords: GIS, GIS functionality, GIS tools, spatial data models, data production, interpolation, classification, artificial neural networks, wavelet transformation.

0. Introduction

In the paper [Sárközy 1997], analyzing the present situation and possible further development trends of GIS in the context of a systematization of GIS functions, I recommended a more exact separation of data production from data analysis, that is, of the functions of data preparation from standard GIS functions.
However, the functions for expanding and deepening the target space can be used in both phases of spatial informatics. At this point I should refer to my recommendation for a new data model representing natural phenomena [Sárközy 1994]. This data model, called the "function field", assumes the determination of functions describing the phenomena. If the input data are compiled as functions and the GIS software can handle this model, then the basic GIS analysis functions do not have to perform further interpolations.
That is, from the point of view of system principles, whether the interpolation functions belong to the data production functions or to the GIS analysis functions also depends on the data model of the particular GIS.
In both cases, however, the importance of interpolation methods is continuously growing, especially when the GIS is connected to some kind of modeling software.

1. The role of interpolation

More and more phenomena can be measured and may be involved in spatial analysis. Among others we can mention precipitation, temperature, soil parameters, ground water characteristics, pollution sources and vegetation data. We are not able to measure the values of a particular phenomenon at every point of the region of interest, but only at sample points. Interpolation gives us values at points where we have no measurements.
The goodness of interpolation can be characterized by the discrepancy between the interpolated value and the true value. Because the true value is generally not known, we can set aside some measured points for testing the interpolation procedure. In the stage of data production we can calculate the values of a particular phenomenon at predefined spots using interpolation procedures. For example, if we want to deliver the data on a regular grid, but the samples are measured at scattered points, we have to calculate the values of the grid points from the samples using interpolation procedures.
In the phase of analysis we face another problem. We may have different attributive data given at different (regular) spots. For example, we have two phenomena, temperature and precipitation (monthly average and monthly total respectively), given on two non-coinciding grids. To perform the interaction computations we should transform the data to a common grid (a common coordinate system) with the same discrete steps.
Seemingly similar interpolation tasks are used for the creation of area objects representing a particular interval of the data in question. The construction of iso-lines (or iso-surfaces in the three dimensional case), however, aims to create and visualize aesthetic rather than highly accurate features: because the individual function values are generalized into interval membership, that is, because of this artificial degradation of resolution, high accuracy in the interpolation is not justified.
Discussing the role of interpolation we should pay attention to the subdivision of interpolation approaches into global and local ones. Even if we take into consideration that global approaches turn into local ones in their numerical solutions, we have to realize that GIS requires more local (or less global) methods, while data production favours more global (less local) procedures.
We also have to note that GIS interpolation computations should be fast and desirably automatic, without the setting of a large number of parameters.

2. Interpolation methods

There are different ways of classifying interpolation methods. My approach probably does not satisfy strict mathematical requirements, but it attempts to clarify the issue for the wide community of people using GIS.

We will discuss the following groups of methods:

2.1 Method based on geometrical nearness - the VORONOI approach.

As a consequence of his research on the geometric aspects of crystallography, the Russian mathematician VORONOY Georgy Fedoszeevich (1868-1908) discovered, in the first decade of this century, a special continuous subdivision of space into convex polyhedra. In two dimensions the polyhedra turn into polygons called Voronoy cells.


figure 1

The geometric rule defining the cell is very simple: each cell contains exactly one sample point, and all other points inside the cell are closer to this sample point than to any other sample outside the cell.
To construct the Voronoy cell of a sample point we connect all data points to the selected one and construct the perpendicular bisector of each link. The bisectors intersect each other. On each bisector we then select the two intersections nearest to the sample point; these determine the cell. In figure 1 only those parts of the effective bisectors are shown which form the cells.
As an interpolation method, Voronoy cells are mostly used for precipitation. Using the measured values of rain at the sample points, the method extends their validity to the particular cell, that is, it supposes that these values are characteristic for the entire cell. This type of interpolation results in sudden changes of the function values on the borders of the cells. Therefore its use is limited. However, at least two other applications make it worth discussing.
The Voronoy approach is suitable for dynamic spatial object condensation (W. Yang and Ch. Gold 1995), that is, for the realization of nested GIS implementation models that make possible the storage and processing of objects on different levels of generalization.
The second application, the Delone (Delaunay) triangulation, is directly related to our topic. First we should consider that three points determine a plane. If we want to model the surface of the function with plane triangles fitted to the measured values at the sample points, then we have to construct a continuous triangulation network on the points. However, if we try to connect the points we will find that different triangulations are possible for the same points. The different triangulations represent different interpolation surfaces and, as a consequence, different interpolated values at identical sites. That means that without a unique triangulation the interpolation by plane triangles is useless. In the thirties of our century DELONE Boris Nikolaevich, professor of mathematics at Moscow University, worked out the dual of the Voronoy subdivision. He proved that a field of scattered points can be uniquely transformed into a triangulated network using the links determined by the Voronoy cells of the points. In other words, if we use the links belonging to the border lines of the cells (thin lines in figure 1), then a unique triangulation is established. The TIN modules of the different GIS software packages work on the basis of this result.
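As an illustration (not part of the original text), the following minimal Python sketch shows how a unique triangulation of scattered samples can serve plane-triangle (TIN) interpolation; it relies on SciPy's Delaunay triangulation and its Delaunay-based linear interpolator, and the sample data and point names are invented for demonstration only.

    import numpy as np
    from scipy.spatial import Delaunay
    from scipy.interpolate import LinearNDInterpolator

    # scattered sample points (x, y) and measured values z (illustrative data)
    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 500.0, size=(200, 2))
    z = 100.0 + 0.05 * pts[:, 0] + 30.0 * np.sin(pts[:, 1] / 80.0)

    tri = Delaunay(pts)                    # unique triangulation (dual of the Voronoy cells)
    interp = LinearNDInterpolator(tri, z)  # a plane is fitted to each triangle

    print(interp(250.0, 250.0))            # interpolated value inside the convex hull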
For correctness we should mention that the interpolation method itself belongs to the third group (methods using basis functions); the geometrical partitioning, however, has such strong ties to the Voronoy cells that it is reasonable to deal with both together.

2.2 Statistical methods

The statistical methods interpolate the function value at an unknown point using a weighted average of the known values at the sample points:

$$\hat F(x_0) = \sum_{i=1}^{n} w_i\, F(x_i) \qquad (1),$$

where $\hat F(x_0)$ is the interpolated value at the place $x_0$, $F(x_i)$ is the value of the measurement at the sample point $x_i$, $i = 1, \dots, n$, and the numbers $w_i$ are the weights.
As a rule the statistical methods are local, because they involve in the average only a number (n in equation (1)) of neighboring samples, usually located inside a distance limit (e.g. the range in kriging) from the point to be interpolated.
All statistical methods have the property of smoothing the data, reducing sudden peaks and dips. As we pointed out in (F. Sárközy and G. Gáspár 1992), statistical methods can interpolate only values that lie inside the interval between the largest and smallest sample values involved in building the average.

2.2.1 Area stealing interpolation (Ch. Gold 1991)


figure 2
This method, often also called 'natural neighbor interpolation', determines the weights using the Voronoy tessellation.
In figure 2 we constructed the Voronoy cells for the sample points P1 ... P7 with measured heights h1 ... h7 (P1 lies outside the figure). We want to interpolate the height hQ of the new point Q.
To interpolate to a point we have to redefine the tessellation with the addition of the interpolation point. Now we have two tessellations: the original one and the new cell system constructed after inserting the new point. The cell of the new point (filled with a pattern in figure 2) covers some parts of the cells originally owned by particular sample points. These particular points will be involved in the interpolation of the new point. That is, this method automatically selects from the sample points those which should take part in the interpolation. These points are called natural neighbors.
The weights of the natural neighbors are nothing else but the areas which the new cell cuts out from the original cell owned by each particular neighbor. These areas are the 'stolen areas', denoted in figure 2 by ti. After computing the weighted average we get hQ = 135.8.

2.2.2 Weighting with inverse distances

This simple method uses as weights some power of the inverse distance:

$$w_i = \frac{1/d_i^{\,p}}{\sum_{j=1}^{n} 1/d_j^{\,p}} \qquad (2),$$

where $d_i$ is the distance between $x_0$ and $x_i$, and p = 1, 2 or 3.
If we select p = 2 and use the same subset of sample points which was determined by the area stealing method, then we get hQ = 129.5. The results are rather different, and the less accurate one is undoubtedly the latter.
Moreover, the main disadvantage of this method is the arbitrary definition of the interpolation subset, since the method itself does not generate the points to be involved in the interpolation.
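A minimal Python sketch of inverse distance weighting is given below for illustration; the choice of the k nearest samples as the interpolation subset is exactly the kind of arbitrary decision criticized above, and the function name, k and the toy data are invented.

    import numpy as np

    def idw(x0, pts, z, p=2, k=6, eps=1e-12):
        """Inverse distance weighted estimate at x0 from the k nearest samples."""
        d = np.linalg.norm(pts - x0, axis=1)
        idx = np.argsort(d)[:k]              # arbitrary neighbor subset (k nearest)
        w = 1.0 / (d[idx] ** p + eps)        # weights proportional to 1 / d^p
        return np.sum(w * z[idx]) / np.sum(w)

    # tiny illustrative example
    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    z = np.array([10.0, 12.0, 14.0, 16.0])
    print(idw(np.array([0.4, 0.6]), pts, z, p=2, k=4))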

2.2.3 Kriging

As a more sophisticated method some version of kriging can be used (for more details about kriging see one of the textbooks of geostatistics, e.g. (E.H. Isaaks, R. M. Srivastava 1989)). Kriging estimates the unknown values with minimum variance if the measured data fulfil certain conditions of stationarity (first order stationarity, second order stationarity, intrinsic stationarity). In cases when no stationarity hypothesis can be stated, universal kriging can be used. This method, however, does not have a unique solution and needs a large amount of interactive input and subjective assessment during the processing.
For the topic of our discussion a relatively new branch of kriging, co-kriging, has particular importance. This method uses one or more secondary features which are usually spatially correlated with the primary feature (e.g. heights as secondary, rain as primary). If the secondary features have denser sample sets than the less intensively sampled primary feature, then with co-kriging the primary feature can be estimated with higher accuracy without any extra expenditure.
The first step in kriging is the computation of a so-called experimental semi-variogram using the following formula:

$$\hat\gamma(h) = \frac{1}{2\,n(h)} \sum_{i=1}^{n(h)} \bigl[ F(x_i) - F(x_i + h) \bigr]^2 \qquad (3),$$

where $\hat\gamma(h)$ is the estimated semi-variance for the distance h, n(h) is the number of measured point pairs in the distance class h, and F(.) is the measured value at (.).
Equation (3) is relatively easily computable if the measured points are ordered in a regular grid and the field is isotropic, that is, $\gamma$ depends only on the length of h but not on its direction. If the known points are not located regularly, distance classes have to be formed, and in the absence of isotropy different semi-variograms should be constructed for the typical groups of directions.
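The following Python sketch, added here for illustration, computes the experimental semi-variogram of equation (3) with distance classes for irregularly located points; the function name, lag parameters and test data are assumptions made only for the example.

    import numpy as np

    def experimental_semivariogram(pts, z, lag_width, n_lags):
        """Estimate gamma(h) for distance classes of width lag_width (isotropic case)."""
        # pairwise distances and squared value differences
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        dz2 = (z[:, None] - z[None, :]) ** 2
        iu = np.triu_indices(len(z), k=1)            # count each point pair only once
        d, dz2 = d[iu], dz2[iu]

        h_centres, gamma = [], []
        for i in range(n_lags):
            lo, hi = i * lag_width, (i + 1) * lag_width
            mask = (d >= lo) & (d < hi)
            if mask.any():
                h_centres.append(0.5 * (lo + hi))
                gamma.append(0.5 * np.mean(dz2[mask]))   # (1 / 2n(h)) * sum of squared differences
        return np.array(h_centres), np.array(gamma)

    # illustrative use on synthetic data
    rng = np.random.default_rng(1)
    pts = rng.uniform(0, 100, size=(80, 2))
    z = np.sin(pts[:, 0] / 20.0) + 0.1 * rng.standard_normal(80)
    h, g = experimental_semivariogram(pts, z, lag_width=10.0, n_lags=8)
    print(np.round(g, 3))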
In the next step the experimental semi-variogram(s) should be approximated by some kind of function fitted to the experimental data by means of the least squares method.
The actual computation of the weights used in (1) to interpolate the unknown function value $F(x_0)$ is performed by solving a system of (n+1) linear equations of the form

$$\mathbf{K}\,\mathbf{w} = \mathbf{C}_0 \qquad (4),$$

where $\mathbf{w} = [w_1, w_2, \dots, w_n, \mu]^T$ ($\mu$ being the Lagrange multiplier), $\mathbf{C}_0 = [c_{10}, c_{20}, \dots, c_{n0}, 1]^T$, and the so-called Krige matrix is

$$\mathbf{K} = \begin{bmatrix} c_{11} & c_{12} & \dots & c_{1n} & 1 \\ c_{21} & c_{22} & \dots & c_{2n} & 1 \\ \vdots & & & \vdots & \vdots \\ c_{n1} & c_{n2} & \dots & c_{nn} & 1 \\ 1 & 1 & \dots & 1 & 0 \end{bmatrix} \qquad (5).$$

The vectors $\mathbf{C}_0$ are different for each new point to be interpolated; the coefficients $c_{ij}$ have to be computed from the fitted (theoretical) semi-variogram. The theoretical semi-variogram reaches its maximum value (or 95% of it) at a distance H, called the range of the phenomenon. The coefficients belonging to points spaced at the range or farther become zero, the rest are calculated by the formula

$$c_{ij} = \gamma(H) - \gamma(h),$$

where h is the distance between the points $x_i$ and $x_j$.
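For illustration only, the sketch below assembles and solves the ordinary kriging system (4)-(5) in Python; a spherical covariance model with an assumed sill and range stands in for a properly fitted theoretical semi-variogram, and all numbers are invented.

    import numpy as np

    def spherical_cov(h, sill, range_):
        """Covariance c(h) = gamma(H) - gamma(h) for a spherical variogram model
        (it becomes exactly zero for h >= range, as described in the text)."""
        h = np.asarray(h, dtype=float)
        c = sill * (1.0 - 1.5 * h / range_ + 0.5 * (h / range_) ** 3)
        return np.where(h < range_, c, 0.0)

    def ordinary_kriging(x0, pts, z, sill=1.0, range_=50.0):
        n = len(z)
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        K = np.ones((n + 1, n + 1))
        K[:n, :n] = spherical_cov(d, sill, range_)
        K[n, n] = 0.0                                   # Lagrange row and column
        c0 = np.ones(n + 1)
        c0[:n] = spherical_cov(np.linalg.norm(pts - x0, axis=1), sill, range_)
        w = np.linalg.solve(K, c0)                      # weights plus Lagrange multiplier
        return np.dot(w[:n], z)

    pts = np.array([[10.0, 10.0], [30.0, 15.0], [20.0, 40.0], [45.0, 35.0]])
    z = np.array([1.2, 0.8, 1.5, 1.1])
    print(round(ordinary_kriging(np.array([25.0, 25.0]), pts, z), 3))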
As mentioned above, if the stationarity hypothesis does not hold, or in less sophisticated words, if the phenomenon (e.g. the thickness of a coal stratum) shows systematic changes as a function of location, then the systematic changes should be separated from the random ones. To this end one can fit a plane or a surface of higher order to the sample data. This surface, of course, cannot pass through the points; it lies only near the points and serves as a new reference frame. Now the value of an interpolated point will consist of two parts: the systematic part, which one computes from the equation of the fitted surface (often called the trend), and the random part, computed by kriging and related to the new reference frame. This complex method is called universal kriging.

Co-kriging uses, besides the semi-variograms of each variable, also cross-variograms for the variable pairs. Let us consider the simple case of two variables. We have n sample points of the primary variable and m sample points of the single secondary variable (for simplicity of notation we suppose m = n + 1). For this case the equation (4) and the matrix (5) take the following form:

$$\mathbf{K}_{c}\,\mathbf{w} = \mathbf{C}_0 \qquad (4a),$$

where $\mathbf{w} = [w_{10}, \dots, w_{n0},\ s_{10}, \dots, s_{m0},\ \mu_1, \mu_2]^T$ (the notation $s_{i0}$ stands for the weights of the secondary phenomenon, $\mu_1$ and $\mu_2$ are the Lagrange multipliers), $\mathbf{C}_0 = [C_{01}, \dots, C_{0n},\ Q_{01}, \dots, Q_{0m},\ 1, 0]^T$, and

$$\mathbf{K}_{c} = \begin{bmatrix} \mathbf{C} & \mathbf{Q} & \mathbf{1} & \mathbf{0} \\ \mathbf{Q}^T & \mathbf{S} & \mathbf{0} & \mathbf{1} \\ \mathbf{1}^T & \mathbf{0}^T & 0 & 0 \\ \mathbf{0}^T & \mathbf{1}^T & 0 & 0 \end{bmatrix} \qquad (5a),$$

where $\mathbf{C} = [c_{ij}]$ is the n by n block of the primary variable, $\mathbf{S} = [s_{ij}]$ is the m by m block of the secondary variable and $\mathbf{Q} = [q_{ij}]$ is the n by m cross block.

The coefficients in (5a) have to be computed from the fitted (theoretical) variogram models: $c_{ij}$ and $s_{ij}$ from the semi-variograms of the primary and secondary variables respectively, and $q_{ij}$ from the cross-variogram; the same is valid for the coefficients in capital letters ($C_{0j}$, $Q_{0j}$) related to the interpolated point. The cross coefficients are obtained from $\gamma_{12}$, the regular (fitted) representation of the empirical cross-variogram

$$\hat\gamma_{12}(h) = \frac{1}{2\,n(h)} \sum_{i=1}^{n(h)} \bigl[ F_1(x_i) - F_1(x_i + h) \bigr]\bigl[ F_2(x_i) - F_2(x_i + h) \bigr]. \qquad (6)$$

In this brief review we cannot discuss more details. However, we should underline that the formulas above relate to the theoretical case of absolute stationarity. In real cases the global shape of the phenomena is first modelled with some kind of functions, usually polynomials. In this practical case these functions are also included in the equation (4a) in place of the rows and columns of ones, which ensure that the sum of the weights related to a particular feature equals unity. For example, if the primary feature's trend surface is the plane f(x,y) = A + Bx + Cy, and the secondary feature is approximated by another plane g(x,y) = D + Ex + Fy, then the formulas (4a) and (5a) should be rewritten in the following way:

$$\mathbf{K}_{u}\,\mathbf{w} = \mathbf{C}_0 \qquad (4b),$$

where the unknown vector is extended by three Lagrange multipliers for each trend, the right hand side vector becomes $\mathbf{C}_0 = [C_{01}, \dots, C_{0n},\ Q_{01}, \dots, Q_{0m},\ 1, x_0, y_0,\ 1, x_0, y_0]^T$, and the Krige matrix takes the form

$$\mathbf{K}_{u} = \begin{bmatrix} \mathbf{C} & \mathbf{Q} & \mathbf{F} & \mathbf{0} \\ \mathbf{Q}^T & \mathbf{S} & \mathbf{0} & \mathbf{G} \\ \mathbf{F}^T & \mathbf{0}^T & \mathbf{0} & \mathbf{0} \\ \mathbf{0}^T & \mathbf{G}^T & \mathbf{0} & \mathbf{0} \end{bmatrix} \qquad (5b),$$

where the i-th row of the n by 3 block $\mathbf{F}$ is $[1, x_i, y_i]$ (the terms of f evaluated at the primary sample points) and the j-th row of the m by 3 block $\mathbf{G}$ is $[1, x_j, y_j]$ (the terms of g evaluated at the secondary sample points).

The short summary of the kriging methods shows that this approach has several drawbacks. First of all I should mention the ambiguities in handling the trends and the anisotropies. That is, the estimates can be biased due to an inappropriate choice of the trend expression and of the analytical model used for the variogram approximation.
Another problem is that for the creation of the empirical cross-variogram only those spots are used where both the oversampled secondary variable and the undersampled primary variable have measured values. Several research projects are under way to take into consideration the entire information content of the oversampled variable in the cross-variogram construction.
The model is a local one, but even so the processing takes significant run time. In particular, the inversion of the huge matrices K can cause numerical trouble and memory overflow. Recent research (Long and Myers 1997) aims to split the co-kriging matrix into the simple kriging matrices of the involved phenomena and to express the co-kriging weights by the inverses of these smaller matrices (and some other terms). For visualization (mapping) purposes a second method, usually spline interpolation, has to be used.
The interpolated points can be stored in a regular grid structure that is convenient for database manipulations and also for different computational purposes.

2.3 Methods using basis functions

There are several methods that reconstruct the function as a linear combination of a set of basis functions with the closest fit to the samples. Depending on the type of the basis functions we can distinguish, among others, polynomial interpolation, splines with tension using radial basis functions, and Fourier and wavelet transformations (Sárközy F. Závoti J. 1995). These latter transform the approximated function from the time or space domain to the frequency domain, that is, the interpolation has to be made there. The numerical methods of transformation and inverse transformation, however, work for regularly located samples and in this form are unusable for the interpolation of scattered point data. Only a few new research projects aim to solve the wavelet transformation of irregularly distributed data. We have to wait some years until wavelet interpolation can be involved in the solution of spatial interpolation problems.

2.3.1 Trend Surface Analysis

GIS deals with spatial objects or spatial phenomena which are at least 2 dimensional, but natural features are often modeled in 3 or 4 dimensions.
TSA aims at the global interpolation of the phenomenon in question; it has no interest in detecting local irregularities. Of course the global interpolation can also be made at different resolution levels. The choice of level depends on the task to be performed and on the behavior of the sample itself.
As we have seen in the context of universal kriging, the trend surface can be approximated by first order, second order, third order, in general n-th order polynomials. In choosing the order of the surface one should take into consideration the rule of thumb: any cross-section of a surface of order n can have at most n-1 alternating maxima and minima.
If applied to the terrain, one can see in advance the main characteristics of the surface and choose the proper order. However, the surface of other phenomena (e.g. temperature) is not visible, and one should make several attempts to find the appropriate order.
The polynomials are fitted to the sample data using least squares adjustment (an alternative name for the fitting procedure is regression).
This method minimizes the sum of squared differences between the given and interpolated values; that is, denoting the i-th sample value by $z_i$, the interpolated value at the same point by $f(x_i)$ and the number of samples by n, the adjustment fulfils the condition

$$\sum_{i=1}^{n} \bigl( z_i - f(x_i) \bigr)^2 = \min .$$

Of course the value of the minimum depends on the suitability of the chosen model for the particular data set. As a measure of the goodness of fit we can use the quantity called the mean square error, abbreviated as MSE.
The flexibility of the polynomial depends on its order. If we choose higher orders we can model more complicated surfaces. However, the higher the polynomial order, the more numerical problems may occur, owing to the risk of huge, ill-conditioned matrices which have to be inverted. As a consequence, for most practical tasks the order does not exceed 3-5.
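A short Python illustration of the least squares trend surface fit follows; it is only a sketch, the design-matrix builder and the synthetic samples are assumptions made for the example, and the MSE is printed as the goodness-of-fit measure mentioned above.

    import numpy as np

    def design_matrix(x, y, order):
        """Complete 2D polynomial terms x^i * y^j with i + j <= order."""
        return np.column_stack([x ** i * y ** j
                                for i in range(order + 1)
                                for j in range(order + 1 - i)])

    # illustrative scattered samples
    rng = np.random.default_rng(2)
    x, y = rng.uniform(0, 500, 200), rng.uniform(0, 500, 200)
    z = 120.0 + 0.05 * x - 0.02 * y + 1e-4 * x * y + rng.standard_normal(200)

    A = design_matrix(x, y, order=2)
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)   # least squares fit (regression)

    z_hat = A @ coef
    mse = np.mean((z - z_hat) ** 2)                # goodness of fit
    print(round(mse, 3))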
The method can also be used for three dimensional modeling; in this case, however, it is called polynomial regression (Collins and Bolstad 1996). The authors used for the estimation of the temperature t the expression $t = b + b_1 x + b_2 y + b_3 x^2 + b_4 y^2 + b_5 xy + b_6 x^3 + b_7 y^3 + b_8 x^2 y + b_9 x y^2 + b_{10} z$.
In their opinion the difference between TSA and polynomial regression consists in the number of independent variables (TSA, as a surface, can have only two) and in the completeness of the polynomial expansion (TSA has a complete expansion, while the polynomial regression can be set up using a linear combination of independent functions). We can see that in the formula for the temperature the variable z appears only in linear form.
The polynomial trend surface has the unpleasant behavior of increasing or decreasing very rapidly in places where the density of samples is sparse, especially on the borders of the region.

2.3.2 Regularized smoothing spline with tension

The first widespread commercial 3D modeling program package (Smith and Paradis 1989) used the three dimensional extension of the two dimensional spline interpolation worked out by (Briggs 1973). The original method defines a set of values at the points of a regular grid without first finding a continuous function of the space variables that takes the values of the measurements at the given positions. To cope with the problems of fitting two dimensional piecewise polynomials to randomly located observation points, Briggs solved the differential equation of a thin sheet, which is equivalent to the third order spline. The solution was obtained numerically with the help of difference equations, thus it gives the interpolated function values on a regular grid. The end product of the procedure was the drawing of contour lines; the intersections of the lines with the grid were interpolated by a four point cubic polynomial and then the intersections were joined by a cubic spline.
(Mitašova and Mitaš 1993); (Mitašova et al. 1993) gave an explicit solution of the variational conditions, with direct estimation of the first and second order derivatives, for the two, three and four dimensional cases.
The regularized smoothing spline with tension is a radial basis function method for interpolation from scattered data. The interpolation is made flexible through the choice of the tension parameter, which controls the properties of the interpolation function, and the smoothing parameter, which enables the noise to be filtered out. The function derived by minimization of the squared smoothness functional containing all derivatives has the following form:
$$F(x) = T(x) + \sum_{j=1}^{N} \lambda_j R(x, x^{[j]}) \qquad (7),$$

where the index j denotes the measured points, $\lambda_j$ are the unknown coefficients and $R(x, x^{[j]})$ are the basis functions. For the cases considered here T(x) = a1 = constant. For the two dimensional case the basis function is

$$R(x, x^{[j]}) = -\bigl[ E_1(\varrho) + \ln \varrho + C_E \bigr], \qquad \varrho = \left( \frac{\varphi\, \lvert x - x^{[j]} \rvert}{2} \right)^{\!2},$$

where $C_E$ = 0.577215... is the Euler constant, $E_1(.)$ is the exponential integral function and $\varphi$ is an arbitrarily selected parameter, called 'tension', controlling the flexibility of the system; the three and four dimensional basis functions are analogous expressions built with the error function erf.
The N+1 unknown parameters $a_1, \lambda_1, \dots, \lambda_N$ are obtainable from the following system of N+1 linear equations, where $z^{[i]}$ is the measured value at the point $x^{[i]}$:

$$a_1 + \sum_{j=1}^{N} \lambda_j R(x^{[i]}, x^{[j]}) = z^{[i]}, \quad i = 1, \dots, N; \qquad \sum_{j=1}^{N} \lambda_j = 0 \qquad (8).$$
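Purely as an illustration, the Python sketch below builds and solves the system (8) with the two dimensional basis function reconstructed above; it ignores smoothing, the tension value and the sample data are invented, and SciPy's exp1 supplies the exponential integral E1.

    import numpy as np
    from scipy.special import exp1   # exponential integral E1

    CE = 0.5772156649015329          # Euler constant

    def rst_basis(r, tension):
        """2D regularized spline with tension basis R(r) = -[E1(rho) + ln(rho) + CE]."""
        rho = (tension * r / 2.0) ** 2
        R = np.zeros_like(rho)
        nz = rho > 0
        R[nz] = -(exp1(rho[nz]) + np.log(rho[nz]) + CE)   # R tends to 0 as r tends to 0
        return R

    def rst_fit(pts, z, tension=0.05):
        """Solve the (N+1) x (N+1) linear system for a1 and the lambda coefficients."""
        n = len(z)
        r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        A = np.zeros((n + 1, n + 1))
        A[:n, 0] = 1.0                      # the constant trend a1
        A[:n, 1:] = rst_basis(r, tension)
        A[n, 1:] = 1.0                      # sum of the lambdas equals zero
        b = np.append(z, 0.0)
        return np.linalg.solve(A, b)

    def rst_eval(x, pts, coef, tension=0.05):
        r = np.linalg.norm(pts - x, axis=1)
        return coef[0] + np.dot(coef[1:], rst_basis(r, tension))

    pts = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0], [50.0, 60.0]])
    z = np.array([10.0, 12.0, 11.0, 15.0, 13.0])
    coef = rst_fit(pts, z)
    print(round(rst_eval(np.array([50.0, 50.0]), pts, coef), 3))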
In connection with this direct method several questions require further analysis. Firstly, the practical application of general global methods is problematic because of the long processing time, which grows proportionally to the third power of the number of measured values. In (Mitašova and Mitaš 1993) the authors recommend a segmented processing procedure that approximates the global method with several overlapping local ones.
From the point of view of GIS functions we should not forget that this global method does work in 3 dimensions, but its results are doubtful for most natural phenomena; nevertheless its visualization possibilities are excellent.

2.3.3 Method of local polynomials

In 1992 the author of the present paper, together with a young associate of the Department of Surveying, Mr. P. Gáspár, published a simple interpolation method (Sárközy and Gáspár 1992). We can summarise the essence of our method as follows.
For each point $r_i$ at which the value $u_i$ of the generally unknown function f(r) is given, we can construct an approximating function $G_i(r)$ that is properly near to the original function in the neighborhood of the point. The function should be defined everywhere over the global model and give a good approximation at least for the neighboring points. The simplest local function is the polynomial, in 3D with three independent variables, or, in the case of the given two dimensional example, a third order polynomial with two independent variables:

$$z_k = z_i + a_1 u_k + a_2 v_k + a_3 u_k^2 + a_4 v_k^2 + a_5 v_k u_k + a_6 u_k^2 v_k + a_7 v_k^2 u_k + a_8 u_k^3 + a_9 v_k^3 .$$

The coordinate transformations are $u_k = \arctan\bigl(c\,(y_k - y_i)\bigr)$, $v_k = \arctan\bigl(c\,(x_k - x_i)\bigr)$. For each polynomial $G_i(r)$ the coefficients $a_j$ are calculated from the given values of the primary and (if needed) secondary neighbors, using weights that take into account the distance and the level of neighborhood as well.
The accuracy of the approximation depends on the distance from the central point of the function. The discrepancies between the local function values and the known data are computed as $\delta_k = G_i(r_k) - u_k$. We can order the discrepancies into distance intervals and compute the squared dispersion belonging to each interval by the formula $\sigma_d^2 = \frac{1}{K} \sum_{k=1}^{K} \delta_k^2$, where K is the number of discrepancies in the particular interval.
In general, for isotropic fields the dispersion can be expressed as a function of the distance d with three parameters a, b, c, where $\sigma_0$ is the dispersion of the function values in the nodal points. The parameters a, b, c can be computed from the related pairs (d, $\sigma_d$). We can also handle anisotropy and the special behavior of each local model.
From the dispersions we compute the weights $w_i = 1 / \sigma_i^2$, and as the weighted average (or linear combination of basis functions) the interpolated value of the function:

$$F(r) = \frac{\sum_i w_i\, G_i(r)}{\sum_i w_i} \qquad (9).$$
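To give the flavour of the approach in code, a much simplified Python sketch follows; it is not the published method: instead of the arctan transform, third order polynomials and dispersion-calibrated weights, it fits second order local surfaces on the k nearest neighbours and combines their predictions with inverse squared distance weights, and all names and data are invented.

    import numpy as np

    def local_poly_interpolate(x0, pts, z, k=10):
        """Simplified local polynomial interpolation: a quadratic surface is fitted
        around each of the k samples nearest to x0, then the local predictions are
        combined with inverse squared distance weights (a stand-in for the
        dispersion based weights of the original method)."""
        d0 = np.linalg.norm(pts - x0, axis=1)
        centres = np.argsort(d0)[:k]
        preds, weights = [], []
        for i in centres:
            # neighbours used to fit the local surface G_i
            d = np.linalg.norm(pts - pts[i], axis=1)
            nb = np.argsort(d)[:k]
            u, v = (pts[nb] - pts[i]).T
            A = np.column_stack([np.ones_like(u), u, v, u**2, v**2, u*v])
            coef, *_ = np.linalg.lstsq(A, z[nb], rcond=None)
            u0, v0 = x0 - pts[i]
            preds.append(coef @ np.array([1.0, u0, v0, u0**2, v0**2, u0*v0]))
            weights.append(1.0 / (d0[i] ** 2 + 1e-9))
        preds, weights = np.array(preds), np.array(weights)
        return np.sum(weights * preds) / np.sum(weights)

    rng = np.random.default_rng(3)
    pts = rng.uniform(0, 500, size=(200, 2))
    z = 100.0 + 25.0 * np.sin(pts[:, 0] / 80.0) * np.cos(pts[:, 1] / 80.0)
    print(round(local_poly_interpolate(np.array([250.0, 250.0]), pts, z), 2))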
For the sake of easier comparison, a 500 m by 500 m simulated study area was covered by a regular, analytically defined test function. First we computed the true values at the points of a 20 m by 20 m grid, as shown in part a of figure 3. After that we randomly generated the coordinates of 200 spots and computed their heights (part b of figure 3).
figure 3 (part a: the true surface on the regular grid; part b: the randomly generated sample points)
Using the heights of the scattered points we interpolated the grid by the methods of inverse distances, kriging and local polynomials (figure 4, parts a, b and c respectively).
figure 4 (part a: inverse distance weighting; part b: kriging; part c: local polynomials)

2.4 Method of Artificial Neural Networks

The ANN method is relatively new in general, and absolutely new in the domain of spatial interpolation. This is the reason why we have to explain some basic ideas of this approach before discussing its application for our purposes.

2.4.1 What is an Artificial Neural Network

Research on the human neural system has shown that it responds to the signals from the sensing organs by means of a highly interconnected, large system of relatively simple activation elements: the neurons. That is, if we have a large number of interconnected simple processing elements we can solve very complex tasks.
This idea initially stimulated a small group of researchers in artificial intelligence to try to solve their tasks with artificial neural networks.
The theory and basic algorithms of ANN were worked out mostly in the last decade, software products have appeared since 1992-93, and the first applications for spatial interpolation emerged in 1997.
There are different types of ANN architectures; however, for interpolation tasks it is sufficient to discuss the default structure: the multilayer feedforward network.

figure 5
In figure 5 we sketched a simple MLP (MultiLayer Perceptron) network (for clarity we plotted only 1 hidden layer, although the multilayer model allows an arbitrary number of them). As we can see, the structure consists of nodes (neurons or perceptrons) organized in layers, and of links. There are three principally different types of layers: the input layer (only one), the hidden layers (arbitrary number) and the output layer (only one).
The number of nodes in the input layer corresponds to the number of independent variables of the particular task, while the number of nodes in the output layer depends on the number of scalar functions involved in the common evaluation, or, if we have to approximate a vector function, the output nodes can represent the components of the vector.
By default each node is connected to each node of the next layer forward. There are algorithms which can prune the ineffective links, and there are network architectures with not only sequential but also backward (recurrent) links.
The nodes of the hidden layers and the output layer own activation functions which can differ from each other by node and/or by layer type. These functions are mostly nonlinear, easily differentiable functions such as the sigmoid, the hyperbolic tangent or the Gaussian bell curve.
The formulae for these functions are as follows.
For the sigmoid: $y = \dfrac{1}{1 + e^{-Ds}}$. The function values are in the domain [0, 1]; if s = 0, y = 0.5. The hyperbolic tangent is determined by the formula $y = \tanh(Ds) = \dfrac{e^{Ds} - e^{-Ds}}{e^{Ds} + e^{-Ds}}$. The interval of output values is the range [-1, +1]; if s = 0, y = 0. Nowadays the Gaussian activation function, with the formula $y = e^{-Ds^2}$, gains more and more popularity. It has output values between 0 and 1; if s = 0, y = 1.
The diagrams of these functions for D = 1 and D = 1.5 are shown in figure 6. Note that the 'useful' input range (for which different inputs produce different outputs) is about [-2.7, +2.7] or [-4, +4], depending on D.

figure 6
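A few lines of Python, added only as an illustration of the activation functions reconstructed above (the gain parameter D and the sample arguments are assumptions), can reproduce the characteristic output ranges:

    import numpy as np

    def sigmoid(s, D=1.0):
        return 1.0 / (1.0 + np.exp(-D * s))        # output in (0, 1), y(0) = 0.5

    def tanh_act(s, D=1.0):
        return np.tanh(D * s)                      # output in (-1, 1), y(0) = 0

    def gaussian_act(s, D=1.0):
        return np.exp(-D * s ** 2)                 # output in (0, 1], y(0) = 1

    s = np.linspace(-4.0, 4.0, 9)
    print(np.round(sigmoid(s, D=1.5), 3))
    print(np.round(tanh_act(s, D=1.0), 3))
    print(np.round(gaussian_act(s, D=1.0), 3))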

2.4.2 How does an Artificial Neural Network work?

The objective of a particular ANN is to give the desired outputs in response to the inputs of the user. For example, if we want to interpolate heights we can choose locations as input and hope that the outputs of the network will contain the heights at the given spots. But how can the network find out the heights? This is done by learning from known input-output pairs of data.
What can change in the network according to the information extracted from the known data, called the training set? The answer is the weights, denoted in figure 5 by $w_{ij}$.
The procedure of adjusting the weights to fit the given data is called training or learning. To understand this process we should first cast a glance at the way the network processes the input data. For the network in figure 5 the t-th record of the training data consists of the input vector $x_j^t$ (j = 1, 2) and the desired output $y^t$. The input passing forward (from left to right) results in the output:
$$o_1^{2,t} = f_1^2\!\left[ \sum_{i=1}^{3} w_{i1}^1 \, f_i^1\!\left( \sum_{j=1}^{2} w_{ji}^0 \, x_j^t \right) \right] \qquad (10).$$

In (10) $f_i^l$ denotes the activation function of the i-th node (counted from top to bottom) in the l-th layer (l = 0, 1, 2), and $w_{ij}^l$ is the weight of the link leaving the i-th node of the l-th layer and entering the j-th node of the next layer.
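As an aside, the forward pass (10) can be written in a few lines of Python; the sketch below assumes the 2-3-1 architecture of figure 5 with sigmoid activations, omits bias terms, and initializes the weights randomly for demonstration.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def forward(x, W0, W1):
        """Forward pass of the 2-3-1 network of figure 5:
        x  : input vector, shape (2,)
        W0 : weights input -> hidden, shape (2, 3)
        W1 : weights hidden -> output, shape (3, 1)"""
        hidden = sigmoid(x @ W0)          # outputs of the hidden nodes
        return sigmoid(hidden @ W1)       # the network output o_1^2

    rng = np.random.default_rng(4)
    W0 = rng.uniform(-0.5, 0.5, (2, 3))
    W1 = rng.uniform(-0.5, 0.5, (3, 1))
    print(forward(np.array([0.2, -0.7]), W0, W1))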
The output value $o_1^{2,t}$ is in general not equal to the desired value $y^t$. To approach this equality we should perform a backward pass to adjust the network's weights. From this processing step the training method obtained the name backpropagation.
First we have to determine an expression characterizing the error on the output, called the error function. This function has to be minimized in the training process. Usually the expression

$$E^t = \frac{1}{2} \sum_{j=1}^{k} \bigl( y_j^t - o_j^t \bigr)^2 \qquad (11)$$

is chosen as the error function, where k is the number of output nodes (in our simple case k = 1).
One of the ways of minimizing a nonlinear objective function is the application of the so-called gradient descent method. The negative gradient is a vector that shows the local direction of descent. If we move along these directions with small steps we can hope to reach the global minimum. Moving along the directions practically means that we change the weights (the independent variables) proportionally to the components of the negative gradient. The step size is called the learning rate $\eta$ in the backpropagation algorithms. Thus

$$\Delta w_{ij}^t = -\eta \, \frac{\partial E^t}{\partial w_{ij}^t} \qquad (12)$$

and

$$w_{ij}^{t+1} = w_{ij}^t + \Delta w_{ij}^t \qquad (13).$$
Do not forget that in (13) both the error and the weights have the values computed in the t-th cycle. The derivatives of a composite function have to be computed applying the chain rule. For the derivative of E with respect to the weights $w_{ij}$ associated with the links between the i-th node of the last hidden layer and the j-th node of the output layer (in our example with respect to $w_{11}^1, w_{21}^1, w_{31}^1$) we have

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \, \frac{\partial o_j}{\partial p_j} \, \frac{\partial p_j}{\partial w_{ij}} \qquad (14),$$

where the input of the j-th node of the output layer is $p_j = \sum_i w_{ij} o_i$. The derivatives in (14) are as follows:

$$\frac{\partial E}{\partial o_j} = -(y_j - o_j),$$
$$\frac{\partial o_j}{\partial p_j} = f_j'(p_j),$$
$$\frac{\partial p_j}{\partial w_{ij}} = o_i .$$

Substituting the derivatives into (14) we get

$$\frac{\partial E}{\partial w_{ij}} = -(y_j - o_j)\, f_j'(p_j)\, o_i \qquad (14a),$$

thus the weights between the last hidden layer and the output layer should be altered by the value

$$\Delta w_{ij} = \eta \, \delta_j \, o_i, \qquad \delta_j = (y_j - o_j)\, f_j'(p_j) \qquad (15).$$

The term $\delta_j$ in equation (15) is the so-called local error, and the expression itself is often cited as the delta rule.
The terms in (15) are known from the training set ($y_j$) and from the forward pass of the t-th cycle. For example, for the network in figure 5, $o_j$ is computed from the entire formula (10), $p_j$ is the term in the square brackets of the same formula, and $o_i$ is the output of the i-th hidden node, that is $o_i = f_i^1\bigl(\sum_j w_{ji}^0 x_j^t\bigr)$.

The delta rule in the form (15) is suitable only for the weights of connections ending at the output nodes, because the term $\delta_j$ is computed from the actual output $o_j$. To find the changes of the weights associated with links pointing to a hidden layer we should 'backpropagate' the error.
Let us suppose we have n hidden layers. For computing the changes of the weights $w_{ij}$ associated with connections between the layers (n-1) and n (that is, i is a node number in layer (n-1), j in layer n) we need to express the local error at node j of the n-th layer. To this end we can use the already computed local errors of the output layer's nodes. Indeed, the backpropagated error is proportional to the local errors of the successor layer's nodes. Using the index k for these nodes, we can express the local error of the nodes in a hidden layer as

$$\delta_j = f_j'(p_j) \sum_k \delta_k \, w_{jk} \qquad (16).$$

Thus the weight updates in this case are also computed by formula (15), but with the local errors expressed by formula (16).
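A compact Python sketch of on-line backpropagation with one hidden layer is given below for illustration only; it uses the delta rule (15) for the output weights and (16) for the hidden weights, omits biases, and the training data, network size and learning rate are invented for the example.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def train_online(X, Y, n_hidden=3, eta=0.2, epochs=2000, seed=0):
        """On-line backpropagation for one hidden layer, using the delta rule
        (15) for the output weights and (16) for the hidden weights."""
        rng = np.random.default_rng(seed)
        W0 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))
        W1 = rng.uniform(-0.5, 0.5, (n_hidden, Y.shape[1]))
        for _ in range(epochs):
            for x, y in zip(X, Y):
                h = sigmoid(x @ W0)                           # hidden outputs
                o = sigmoid(h @ W1)                           # network output
                delta_out = (y - o) * o * (1.0 - o)           # local error, eq. (15)
                delta_hid = h * (1.0 - h) * (W1 @ delta_out)  # backpropagated error, eq. (16)
                W1 += eta * np.outer(h, delta_out)            # weight updates
                W0 += eta * np.outer(x, delta_hid)
        return W0, W1

    # toy training set: approximate z = (x1 + x2) / 4 + 0.5 on scaled inputs
    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, (50, 2))
    Y = ((X[:, 0] + X[:, 1]) / 4.0 + 0.5).reshape(-1, 1)      # targets scaled into (0, 1)
    W0, W1 = train_online(X, Y)
    print(sigmoid(sigmoid(np.array([0.2, 0.4]) @ W0) @ W1))   # compare with the target 0.65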

The most critical point of backpropagation is the choice of the learning rate $\eta$. It can be chosen from the range [0.05, 0.5]. If the rate is too small the learning process is slow; if it is too large the training oscillates around a local or global minimum. There are algorithms which can automatically control the learning rate depending on the slope of the error surface.
Backpropagation can be realized in two ways: on-line backpropagation changes the weights after the processing of each sample, while batch backpropagation performs the weight improvement only after the entire training set has been processed. The latter works more slowly.
To improve the convergence of the learning process, numerous modifications of the standard backpropagation are in use. For the elimination of flat spots on the error surface, backpropagation with momentum gives some hope. The idea of this method is that the weights are improved with a linear combination of the recent and the former weight changes:

$$\Delta w_{ij}^t = \eta \, \delta_j \, o_i + \mu \, \Delta w_{ij}^{t-1} \qquad (17),$$

where the value of the momentum $\mu$ lies in the interval [0, 1].

The method of weight decay tries to prevent the weights from growing too large. The weight updates for this method are computed as follows:

$$\Delta w_{ij}^t = \eta \, \delta_j \, o_i - d \, w_{ij}^t \qquad (18),$$

where the decay coefficient d ranges from 0 to 1.

Resilient backpropagation is essentially not a gradient descent method, but it is in some respects similar to one. The method uses batch learning, that is, the gradient is computed as the sum of the gradients belonging to each sample of the training set.
The method uses only the signs of the gradient components and gives the updates the opposite signs. The absolute value of the update factor depends on the relation between the actual and the former gradient: if their signs are equal this factor is 1.2, if not, it is 0.5. The absolute value of the update itself equals the product of the update factor and the absolute value of the former update. In the exceptional case when the product of the new and the old gradient equals zero, the new update has the same absolute value as the old one. The explanation above can be compressed into the following formulae:

$$\bigl| \Delta w_{ij}^t \bigr| = \begin{cases} 1.2\,\bigl| \Delta w_{ij}^{t-1} \bigr| & \text{if } \dfrac{\partial E^t}{\partial w_{ij}} \cdot \dfrac{\partial E^{t-1}}{\partial w_{ij}} > 0 \\[4pt] 0.5\,\bigl| \Delta w_{ij}^{t-1} \bigr| & \text{if } \dfrac{\partial E^t}{\partial w_{ij}} \cdot \dfrac{\partial E^{t-1}}{\partial w_{ij}} < 0 \\[4pt] \bigl| \Delta w_{ij}^{t-1} \bigr| & \text{if } \dfrac{\partial E^t}{\partial w_{ij}} \cdot \dfrac{\partial E^{t-1}}{\partial w_{ij}} = 0 \end{cases} \qquad (19),$$

$$\Delta w_{ij}^t = -\operatorname{sign}\!\left( \frac{\partial E^t}{\partial w_{ij}} \right) \bigl| \Delta w_{ij}^t \bigr| \qquad (20).$$
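The update rule (19)-(20), as described verbally above, can be sketched in Python as follows; the step limits and the illustrative weight matrix are assumptions not present in the text.

    import numpy as np

    def rprop_update(grad, prev_grad, prev_update,
                     factor_up=1.2, factor_down=0.5,
                     step_min=1e-6, step_max=50.0):
        """One resilient backpropagation step: the new update points opposite to the
        (batch) gradient, and its magnitude is the former magnitude scaled by
        1.2, 0.5 or 1.0 depending on the sign of grad * prev_grad."""
        factor = np.where(grad * prev_grad > 0, factor_up,
                 np.where(grad * prev_grad < 0, factor_down, 1.0))
        magnitude = np.clip(factor * np.abs(prev_update), step_min, step_max)
        return -np.sign(grad) * magnitude

    # illustrative use on a single weight matrix
    rng = np.random.default_rng(5)
    W = rng.uniform(-0.5, 0.5, (3, 1))
    grad = rng.standard_normal((3, 1))
    prev_grad = rng.standard_normal((3, 1))
    prev_update = np.full((3, 1), 0.1)
    W += rprop_update(grad, prev_grad, prev_update)
    print(np.round(W, 3))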

A very special variation of the MLP networks is the radial basis function (RBF) network. This network type has a basically fixed architecture shown in figure 7.


figure 7
The network consists of three layers: the input layer, the single hidden layer and the output layer. In the case of figure 7 the input vector has three components, that is, we have three input nodes, one summing output node and n training patterns.
The activation function in the hidden layer is some type of radial basis function, usually the Gaussian, expressed by the formula

$$h(\mathbf{x}) = \exp\!\left( - \frac{\lVert \mathbf{x} - \mathbf{c} \rVert^2}{r^2} \right) \qquad (21),$$

where the vector c represents the center and the scalar r the radius of the function. If the input patterns x are one dimensional, then the center c is also a one dimensional shift and the Gaussian takes the form

$$h(x) = \exp\!\left( - \frac{(x - c)^2}{r^2} \right) \qquad (22).$$
The output can work in two ways: without an activation function, only summing up its input, or with a sigmoid type activation function with bias, that is, with an unknown shift b added to the input of the output neuron.
When the output operates without activation, as in figure 7, the network can work in the following way. First we create as many neurons in the hidden layer as there are training patterns. The centers can be set equal to the corresponding input scalars (or vectors); the radii depend on the distances of the neighboring input patterns, and for normalized input data one can select 1. In this case the number of unknown weights is equal to the number of training patterns. Because no nonlinearity is involved in the unknowns, the weights can be computed directly by solving a system of linear equations.
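For illustration, a minimal Python sketch of this linear variant follows; one hidden neuron is created per training pattern, the centers are the patterns themselves, and the radius and toy data are assumed values.

    import numpy as np

    def gaussian_rbf(r, radius=1.0):
        return np.exp(-(r / radius) ** 2)      # eq. (21) with ||x - c|| = r

    def rbf_fit(pts, z, radius=1.0):
        """One hidden neuron per training pattern, centres at the patterns,
        linear (summing) output: the weights follow from a linear system."""
        r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        H = gaussian_rbf(r, radius)            # hidden layer outputs for all patterns
        return np.linalg.solve(H, z)           # n equations, n unknown weights

    def rbf_eval(x, pts, w, radius=1.0):
        return np.dot(w, gaussian_rbf(np.linalg.norm(pts - x, axis=1), radius))

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
    z = np.array([0.0, 0.2, 0.4, 0.6, 0.3])
    w = rbf_fit(pts, z, radius=1.0)
    print(round(rbf_eval(np.array([0.25, 0.75]), pts, w, radius=1.0), 3))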
In the case of an activation function with bias on the output we cannot use the linear approach; however, the solution even in this case is much simpler than in the case of general MLP feedforward networks, because the delta rule can be used here in its simple form (we deal only with weight changes linked to the output neuron(s)), expressed in formula (15). The number of hidden nodes in this second case should be less than the number of training patterns; in consequence the determination of the centers is no longer straightforward, but the details of this case are outside the scope of our discussion.

The training itself is not the goal of the network but the tool that prepares it for the interpolation. This statement should be emphasized because several textbooks, as well as software products, are inclined to forget the basic task of the network: to estimate the values of the particular phenomenon at unmeasured spots.
That is, after the training has been performed successfully, we apply as input the independent variables of the unknown points and get on the output the corresponding interpolated function values. This procedure, in contrast to the training, is direct and very fast.

At the end of this section a very important practical remark should be added. The activation functions, depending on D, are sensitive to the input values only in a range of about [-1.5, +1.5]; therefore the input data should be scaled to this interval. With the exception of the linear activation function, all the others can give only outputs whose absolute value is less than 1; therefore the target values should also be scaled to correspond to the output value range of the output activation function, [-1, +1] or [0, +1].
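The scaling itself is a simple linear transformation; the following short Python sketch, with invented coordinate values, illustrates one possible way of doing it.

    import numpy as np

    def scale_to_range(a, lo=-1.0, hi=1.0):
        """Linearly scale the columns of a to the interval [lo, hi]."""
        a_min, a_max = a.min(axis=0), a.max(axis=0)
        return lo + (hi - lo) * (a - a_min) / (a_max - a_min)

    X_raw = np.array([[650200.0, 237900.0], [650800.0, 238400.0], [651500.0, 237600.0]])
    print(scale_to_range(X_raw, -1.0, 1.0))        # inputs scaled to [-1, +1]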

2.4.3 Some remarks on the architecture of Artificial Neural Networks

The main question, how many layers with how many nodes should be included in a particular network, cannot yet be answered directly. There are strategies, algorithms and rules of thumb related to this question; however, for a theoretically consistent answer we still have to wait a while.
In fact, the more layers and nodes are included in the network, the more flexible it is in training. On the other hand, the larger the network, the more processing time is required for one training cycle and the more probable its overtraining becomes.
A compromise can be found by one of the following strategies. One begins the training with a small net of two hidden layers with, say, 5-5 nodes in each. If the training is too slow (the MSE does not decrease in the proper way), then we gradually add new nodes (up to a reasonable limit, say 15) and new layers until the training no longer improves essentially. The algorithm called cascade correlation performs this network expansion automatically.
The other strategy starts with an extended network and gradually eliminates the superfluous nodes. A large number of pruning algorithms can cope with this task automatically.

2.4.4 Artificial Neural Networks in GIS applications

Only a few applications have been reported in GIS papers so far.
Xingong Li (Xingong 1997) reports on a neural network used for precipitation estimation. His MLP feedforward network had three layers: the input layer with 9 nodes, the single hidden layer with 10 nodes and the output layer with one node.
The most interesting point of this case study is the choice of the nine input parameters. These are the precipitation values and heights of the three closest weather stations and the distances from those stations to the interpolation point.
I wonder whether the relative coordinates and the heights as input data could not give a more general solution.
The training set consisted of the data of 50 weather stations for four months, the validation set of 18 similar station records. The interpolation was made at the validation set points to be able to check the results. The interpolation by the neural network was compared to the true values and to the results of other interpolation methods (Voronoy cells, TSA, inverse distance weighting and ordinary kriging), and it was found that in all months the neural network performed the interpolation consistently well, in contrast to the other methods.

A more theoretical paper by D. Pariente (Pariente 1994) deals with the use of neural networks connecting the object oriented spatial data model with the function field data model (Sárközy 1994). This idea can be developed further by associating the data of a function field with a properly composed and trained neural network.

Neural networks have very extensive applications in classification issues. This kind of application can be used in the processing of multispectral remote sensing scenes, in pattern extraction from images, and also for the classification of land quality in GIS, as explained in (R. Muttiah et al. 1996). Though interpolation and classification are related fields (classification is nothing else than interpolation between predefined patterns), we have no possibility in this paper to dwell on the applications in this field in more detail. We only want to call the readers' attention to the fact that the authors of the referenced paper developed a neural network interface for the GRASS GIS to solve classification tasks. I suppose that without essential changes the interface could also be used for interpolation.

3. Discussion

The main questions for discussion are as follows:
  1. Where is the border between interpolation procedures as part of data production and interpolation procedures as standard GIS functions?
  2. For both cases, which particular method has the most benefit?
  3. What is common and what is different in classification and interpolation using artificial neural networks?
  4. What are the aspects of the data model in connection with artificial neural networks?

I hope that my answers to the majority of these questions are clear from the paper. However, to be sure, I will summarize them in the conclusions. But let us first look at some alternatives.

  1. There are three possibilities. In the first case we try to include in the GIS as standard functions at least those interpolation methods which were discussed in the paper. In the second case we include in the GIS only some local interpolation methods. The third case ignores the interpolation procedures as GIS functions.
  2. Here we have several answers. There are people in favor of geostatistical methods for both data preparation and GIS. Others are adherents of geometrical methods based on Voronoy tessellation, first of all for GIS. Specialists concerned with computer resources often support inverse distance weighting. Mathematicians prefer the RBF polynomials, etc. The motivation of every group may be accepted for particular circumstances; however, to include a method in the GIS as a standard function, the method and its implementation should fulfil a couple of conditions. Among others, the method should be local and as precise as possible, the implementation should accept the GIS data model and possess easy-to-use control tools, and it should run fast without exaggerated use of computer resources.
  3. The question can be approached from different directions. From the side of ANN, the MLP feedforward networks are equally usable for interpolation and classification. Other types of ANN are exclusively or mainly suited to unsupervised (SOM) or supervised (LVQ) classification. The main difference between interpolation and classification MLPs is in the content of the input and output. The interpolation network uses point coverages for learning and estimation, while the classification networks also deal with raster (pixel) data. From the side of functionality, classification is nothing else than interpolation of the membership in given clusters. In other words, we define a staircase-like function where a group of intervals of the input variables corresponds to a stair of the function. An even more interesting GIS application of neural networks is environmental (erosion, pollution, etc.) modelling by neural networks. This is essentially the approximation of a complex multivariate function with a reduced number of input variables. The input and output variables are usually raster data.
  4. The data model was already mentioned several times in the previous paragraphs. At this place we only want to remind the reader of the fact that a trained network is nothing else than an approximation of an arbitrary function, valid for the region determined by the convex hull of the training points.

4. Conclusions

5. References


Address of the author:

Dr. Ferenc Sárközy
Department of Surveying
Technical University Budapest
Műegyetem rkp. 3.
1111 Budapest, Hungary

phone: (36 1) 463 3212
fax: (36 1) 463 3209
E-mail: sarkozy@agt.bme.hu

Prepared for electronic publication on April 3, 1998
To be published in Periodica Polytechnica Civil Engineering.

¹ Research partially supported by the Hungarian Science Foundation Grant No. T 016487