Almir Karabegovic1 , Zikrija Avdagic2, Mirza Ponjavic1
1 Geodesy Department, Faculty of Civil Engineering Sarajevo, Patriotske lige 30, 71000 Sarajevo, Bosnia and Herzegovina
{almir, mirza}@gauss.ba
2 Computer Science Department, Faculty of Electrical Engineering Sarajevo, Zmaja od Bosne bb, Kampus Univerziteta,
71000 Sarajevo, Bosnia and Herzegovina
ab.asnu.ftenull@cigadvaz
Abstract. The aim of this work is to explore the method of fuzzy clusterization applied for classification of spatial objects or generic geospatial analysis and cluster analysis as a classification of objects by mutual similarities and organize data into groups. Clusterization techniques fall into unsupervised methods, meaning that they do not use predefined class identifiers. The biggest potential of clusterization is in recognizing the basic data structures, not only for classification and identification of samples, but also the reduction of models and optimization.
Keywords: fuzzy clustering, geospatial analysis.
1 Introduction
Spatial-temporal domain is complex and characterized by a large amount of data. For example, several land cover and land use maps could have terabytes of information that require computer-intensive techniques of analysis. Interesting signals and data are often hidden by stronger signals caused by local effects, such as seasonal variations. Merging different regions of the country also brings more complexity. These effects make it difficult to analyze such data. Disparity in collecting and sampling data sometimes requires indirect measurements and interpolation.
Application of fuzzy logic technique hasseveral advantages. Imprecision and uncertainty in this domain are present on several levels. Ambiguity in attributes occurs when belonging to a class is partial or unclear. This is a serious problem with the data obtained by remote detection, such as aerophoto often interpreted inconsistent. Spatial ambiguity occurs when the sampling resolution is not large enough to accurately identify the location of borders or when gradual transitions occur between classes. Clustering is a technique that helps in the analysis of such large data sets through the definition of regions that have similar characteristics. However, conventional clustering method is inappropriate when conflicting with ambiguities in data obtained by remote detection (ground and other data derived from 3D models). The data are often incomplete or have errors in measurement, and spatial and attribute ambiguities that are characteristic of these data are further difficulties in the analysis. Fuzzy clustering is appropriate for this data because of its ability to naturally incorporate real-world problems. The ability to create gradual boundaries allows enhanced interpretation.
Fuzzy clustering is an extension of classical technique cluster analysis and is used to solve many problems in the areas of pattern recognition and the identification of fuzzy model. Various methods of fuzzy clustering proposed and several of them based on distance criteria. Clustering algorithm fuzzy k-means is widely used to understand the pattern, especially with possible overlapping clusters.
Clustering is a process where the features are grouped in clusters. On the base of a given set of data points, each with a set of features, they are grouped in clusters so that data points in a cluster are similar to each other while other ones in separate clusters are different from each other. Clustering process is to allocate a vector objects in clusters based on similarity measures. Selection of cluster centers (or prototypes) is very important for the process of clustering. Measure of similarity should give priority to vector features that are closer to the center than the cluster of vectors that are further.
2 Experiments with real-world data
This study includes geospatial analysis examples from real world: land cover, geo-demographic and surface analysis. Experiments were made using three most popular algorithms: fuzzy c-means, Gustafson-Kessel and Gath-Geva. Fuzziness parameter was chose m=2. As a computational tool was chose MatLab. Spatial data presentation were done by MapInfo Pro with Vertical Mapper.
The first example, from land cover analysis, is classic example of image classification with high imprecision in pixel identification as member of only one class.Boundaries of resulting areas are mostly unclear and their recognition has very subjective nature. Whether it is about the manual recognizing classes (depending on the operator, its experience, the resolution) or semi-automatic methods (again, the operator gives the parameters on the basis of his/her experience), definition of crisp borders may lead to loss of information. By using fuzzy logic, each pixel can be assigned to different degrees of belonging to the class, which gets one intermediate result in the form of a grid. The grid, i.e. intelligent raster preserves all the relevant information and allows better analysis. The problem that arises here is about increasing amounts of data because the original satellite images are usually very large (high resolution means more pixels) and multi-spectral (more colors means more clusters, i.e. each pixel means more degrees of membership) so that these grids easily grow to several terabytes of disk space occupancy. This problem can be somewhat mitigated by using Image Server for placing and manipulating raster images. This study used GeoRaster functionality, Spatial option of Oracle Database Enterprise Edition Database.
In this way, it seems that the application of fuzzy logic increases the accuracy but also makes it difficult to analyze. Therefore, the work investigates the application of fuzzy clustering as method of addressing just these problems because it is an unattended iterative method, which is not required knowledge of data structure. Of course, the iterative process requires time for analysis, but gives good results. Such a process is not dynamic and does not require analysis in real time.
There were two land cover studies in Bosnia and Herzegovina, first one made from 1998-2000 and the second one from 2006-2008. The last one used for application of the fuzzy c-means algorithm (FCM) for fuzzy clustering (Fig.1.). Analysis of fuzzy clustering is much faster (almost in real time), comparing to classic land cover methodology aproach. This means that it is possible to make more variants and compare the results, which is very difficult with the classic methodology.
Second example of geodemographic analysis was observed in two ways, when fuzzification done only in attribute domain and others in both domains (attributive and geometrically). The first one is more common in practice, because the geodemographic data usually reduced to some administrative boundaries such as the municipalities in the example (Fig.2., left-hand side). But when it comes to the national structure, it is difficult to express precise boundaries between classes and it considers another aproach of fuzzification including geospatial domain. Thus, the example is observed as continual field for the entire territory of BiH. Fig.2. demonstrates results of fuzzy clustering analysis (by FCM algorithm) in such case. Obtained maps gives more realistic fuzzy clustering picture. Both cases of geodemographic analysis were first treated by geostatistic methods to get spatial correlation matrix by which clustering data is made.
3 Fuzzy Clustering Algorithms
The goal of cluster analysis is to divide given a set of data or objects in clusters (subsets, groups, classes). This distribution should take into account the homogeneity within the clusters (data belonging to the same cluster should be as similar as possible) and heterogenic among the clusters (data belonging to different clusters should be as different as possible).
The concept of similarity must be specified depending on the data. When spatial data are observed, the distance between the data can be used as a measure of similarity. Additional difficulty exists when it is needed to analyze spatial and attributes data (e.g. type of way: paved, gravel, etc.), simultaneously. In such circumstances cluster analysis required the application of specific fuzzy clustering algorithms.
The most common algorithms for fuzzy clustering are:
-
fuzzy c-means (FCM): spherical clusters of nearly the same size;
-
Gustafson-Kessel (GK): ellipsoid clusters almost the same size, there is a parallel version with one axis;
-
Gath-Geva (GG) / Gaussian mixture decomposition (GMD) ellipsoid clusters of variable size, also, there is a parallel version with one axis;
-
fuzzy c-varieties (FCV): detection of the final lines of the non-Euclidean (manifold) space;
-
adaptive fuzzy c-varieties (AFC) detection of line segments in the manifold area;
-
fuzzy c-shells (FCS) detection circuits (unclosed prototypes);
-
fuzzy c-spherical shells (FCSS) detection circuits;
-
fuzzy c-rings (FCR): detection of circles;
-
fuzzy c-quadric shells (FCQS): detection of ellipsoid;
-
fuzzy c-rectangular shells (FCRS): detection of rectangles and sub-variant
Three of them (FCM, GK and GG) especially adopted are tested and evaulated to compare their performances for cluster analysis in both domains (spatial and attributive) simultaneously.
3.1 MatLab implementation of Fuzzy C-means algorithm
Using the command line in MatLab Fuzzy Logic Toolbox, FCM algorithm can start guessing the initial cluster centers, which are intended to mark the location of each cluster center. Initial centers of these clusters are, of course, probably wrong. In addition, FCM assigns every point belonging to the appropriate level of the cluster. Through an iterative process, FCM algorithm changes the cluster centers and membership degrees until it finds the right location for centers of the data set. This iteration is based on the minimization of target function that represents the distance of each point to the cluster, where the weighting takes the degree of membership to the point.
Function of FCM provides a list of cluster centers and membership degrees of points. This information can be used for the creation of fuzzy decision systems by creating membership functions representing the fuzzy quality of each cluster.
To illustrate how it works clustering FCM, as an example can be take a layer of eastern exposure for the municipality of Tuzla (Fig. 3.). To load a set of data presented, it should run the following command:
load IstokTuzla.dat
plot (IstokTuzla (:, 3), IstokTuzla (:, 1), ‘s’)
that results to view in MatLab (Fig. 4.).
Now, call a function FCM to find two clusters in this data set such that the target function does not decline more.
[center, U, objFcn] = FCM (IstokTuzla, 2);
where the variable contains the center coordinates of the centers of clusters, containing the degrees of belonging and history objFcn contains the target function through iterations.
This command gives the following result:
Iteration count = 1, obj. fcn = 289861475402.036320
Iteration count = 2, obj. fcn = 236593203055.150300
Iteration count = 3, obj. fcn = 235941130181.474330
…..
Iteration count = 93, obj. fcn = 170281804005.298890
Iteration count = 94, obj. fcn = 170281804005.298890
FCM function is an iterative process that is based on the following routine:
initfcm – initializing problem
distfcm – performs distance calculation
stepfcm – executes one iteration clustering
To see the progress of the clusterization, we could show a chart of the target function (Fig. 5.) using the following command:
figure
plot (objFcn)
title ( ‘The values of target function’)
xlabel ( ‘Number of iterations’)
ylabel ( ‘The value of the target function’)
Based on previous, it is created graphic that shows the target function. It is obvious that it does not change significantly after the tenth iteration (Fig. 5.).
Plotting the obtained cluster centers of FCM algorithm is executed using the following codes:
Max = max (U);
% Find points with the highest degree of membership to cluster 1
index1 = find (U (1) == MAX);
% Find points with the highest degree of belonging to cluster 2
index2 = find (U (2) == MAX);
figure
line (IstokTuzla (index1, 3), IstokTuzla (index1, 1), ‘linestyle’,…
‘none’, ‘marker’, ‘o’, ‘color’, ‘g’);
line (IstokTuzla (index2, 3), IstokTuzla (index2, 1), ‘linestyle’,…
‘none’, ‘marker’, ‘x’, ‘color’, ‘r’);
hold on
plot (center (1.3), center (1.1), ‘ko’, ‘markersize’, 15, ‘LineWidth’, 2)
plot (center (2.3), center (2,1), ‘KX’, ‘markersize’, 15, ‘LineWidth’, 2)
FCM algorithm in MATLAB is possible to call in the following way:
load IstokTuzla.txt
data.X = IstokTuzla (:, [3 1]);
param.c = 14;
param.m = 2;
param.e = 1e-6;
param.ro = ones (1, param.c);
param.val = 1;
clust_normalize data = (data, ‘range’);
result = FCMclust (data, param);
plot (data.X (:, 1), data.X (:, 2), ‘b.’, result.cluster.v (:, 1), result.cluster.v (:, 2), ‘s’ );
hold on
new.X = data.X;
clusteval = eval (new, result, param);
validity result = (result, data, param);
result.validity
The result after FCM application is given in Fig. 6.
If apply the same algorithm on data that is not pure cluster, but more linear nature (e.g. roads), it doesn’t give good results. Fig. 7 shows the result of FCM clustering road infrastructure for the same area as the previous example. It is obvious that FCM clustering is less capable to find the cluster centers for linear objects.
3.1 Implementation of the Gustafson-Kessel algorithm
Gustafson and Kessel extended the standard fuzzy C-means algorithm using adaptive distance norm that it would be possible to detect clusters of different geometrical shapes in one data set.
It is used the following code (with the example parameters) to call GK algorithm in MatLab envoronment:
load IstokTuzla.txt
data.X = IstokTuzla (:, [3 1]);
param.c = 14;
param.m = 2;
param.e = 1e-6;
param.ro = ones (1, param.c);
param.val = 3;
clust_normalize data = (data, ‘range’);
result = GKclust (data, param);
plot (data.X (:, 1), data.X (:, 2), ‘b.’, result.cluster.v (:, 1), result.cluster.v (:, 2), ‘ro’ );
hold on
new.X = data.X;
clusteval = eval (new, result, param);
% validation
validity result = (result, data, param);
result.validity
Result is shown in Fig. 8.
3.2 Implementation of the Gath-Geva algorithm
Clustering fuzzy algorithm for the fuzzy maximum likelihood estimate (FMLE) was proposed by Bazdek and Dunn. Gath and Geva reported that with FMLE algorithm clustering is possible to recognize clusters of different shapes, sizes and density. Covariant matrix of the cluster is used in conjunction with the exponential distance and clusters are not limited capacity. However, this algorithm is less robust in the sense that it should be a good initialization, since the exponential distance norm converges near a local minimum. It is used the following code to call GG algorithm in MatLab using the example parameters:
load IstokTuzla.txt
data.X = IstokTuzla (:, [3 1]);
param.c = 14;
param.m = 2;
param.e = 1e-4;
normalas = 1;
param.val = 1;
clust_normalize data = (data, ‘range’);
result = FCMclust (data, param);
param.c = result.data.f;
result = GGclust (data, param);
plot (data.X (:, 1), data.X (:, 2), ‘b.’, result.cluster.v (:, 1), result.cluster.v (:, 2), ‘ro’ );
hold on
new.X = data.X;
clusteval = eval (new, result, param);
validity result = (result, data, param);
result.validity
Result is shown in Fig. 9.
3.3 Validation of clusters
Cluster validation refers to the problem assess whether given fuzzy partitions corresponding to all data. Algorithms of clustering always trying to find a fixed number of clusters that best matches the cluster forms. However, this does not mean it will have the most meaning. It may happen that the wrong number of clusters or cluster configuration does not correspond to groups of data if the data in general can be meaningfully grouped. The most common and simplest approach for determining the appropriate number of clusters is to start from maximum number of clusters then to gradually reduce the number of composing clusters that are similar (compatible) according to a pre-defined criterion.This approach is called merging compatible clusters.
Another approach is clustering data for different values of c (number of clusters)using a measure of validity to assess the quality of the obtained partitions.
Different scalar validity measures are proposed in the literature. Although, none of them is perfect, the next seven are most commonly used:
-
Sharing coefficient (PC – Partition Coefficient): as measure the size of the overlap between clusters (defined by the Bazdek). The lack of PC is the lack of connection with the nature of the data. Optimal number of clusters is the maximum number.
-
Classification entropy (CE – Classification Entropy): as measure of uncertainty of cluster partition, which is similar to PC.
-
Partition index (SC): as the ratio of the sum of cluster compactness and divisibility. This is the sum of individual cluster validity measures normalized through division by fuzzy cardinality of each cluster. SC is useful when to compare different sharing that have the same number of clusters. The smaller the value of SC indicates a better partition.
-
Separation index (S – Separation Index) in contrast to the SC, using the principle of minimum distance to validate sharing.
-
Xie and Beni index (XB) aims to quantify the relationship of the total variation within the clusters and cluster separation. Optimal number of clusters should minimize the value of the index.
-
Dunn’s index (DI) was originally proposed to be used to identify compact and well split the cluster. Clustering result is calculated as it is a classic clustering. The main disadvantage of Dunn’s index is a calculation, since it is very demanding.
-
Alternative Dunn’s index (ADI): the aim of modifying Dunn’s index is that the calculation becomes easier.
It should be noted that the only difference between the SC, S and XB is the approach to sharing clusters. In the case of overlapping clusters of DI and ADI values are not reliable because the re-partition method results in a classical clustering.
No index is not alone credible, but all should be considered and the optimal solution can be detected only by comparing all results. It can be assumed that the partition with fewer clusters is better, i.e. when the difference between the indexes of validity is minimal.
The main disadvantage of PC index is monotonically decreasing with the number of clusters c and the lack of direct connection to the data. CE has the same problems. The validity measures for example with eastern exposure are shown in Fig.10.
4 Conclusion
The fuzzy clustering methods allow classification of the data, when there is no a priori information about data set or content is not known. In particular, the fuzzy methods allow identifying data in more flexible manner, assigning to each datum degree of membership to all classes.
Here is proposed fuzzy clustering as a newer approach to modeling spatial objects. Generality of approach in application of fuzzy clustering algorithms is realized by selecting the test examples that represent classes of the typical geospatial problems. The obtained results indicate the great potential and satisfactory efficiency of the methodology implemented, and the possibilities of further use in solving this class of problems.
The essence of fuzzy clustering algorithms is to use an iterative process. Since the exact number of steps to achieved convergence is not known in advance, these algorithms are not always suitable for application in real time. This is especially characteristic for complex problems. However, for problems that do not have to be solved in real time, conventional methods have the advantage because of their robustness and low-storage in relation to more advanced techniques.
It is shown that performances of fuzzy clustering are significantly influenced by selection initial parameters and that is reason for its special considering in implementation phase. During testing each test is repeated enough times to exclude the influence of stochastic nature of the algorithm on the reliability of the results. Testing results, obtained by fuzzy cluster analysis, showed a significantly higher quality of clustering and more compact clusters than the classical clustering. This methodology provides almost automatic data classification within fuzzy expert system. The advantages of presented methodology are based on the fact that the rule base knowledge is now derived from data and has no longer to be predefined by the user.
References
1. Bezdek, J.C., Keller, J., Krisnapuram, R., Pal, N.R.: Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Springer (1999)
2. Zhang, J., Goodchild, M.F.: Uncertainty in geographical information, CRC Press (2002)
3. Batty, M., Longley, P.A.: Advanced Spatial Analysis: The CASA Book of GIS, ESRI Press (2003)
4. Höppner, F., Kruse, R., Klawonn, F., Runkler, T.: Fuzzy Cluster Analysis : Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons (1999)
5. Abonyi, J., Feil, B.: Cluster Analysis for Data Mining and System Identification, Springer (2007)
6. Sato-Ilic, M., Jain, L. C., Innovations in Fuzzy Clustering : Theory and Applications (Studies in Fuzziness and Soft Computing), Springer (2006)
7. Lazzerini, B., Fuzzy Sets & their Application to Clustering & Training, CRC (2000)