Spatial Cluster Ordering and Encoding for High-Dimensional Geographic Knowledge Discovery

 

Diansheng Guo

 

GeoVISTA Center and Department of Geography, Pennsylvania State University

 

(814) 865-9656 (office), (814) 863-7943 (fax), Email: dguo@psu.edu

 


ABSTRACT:

Clustering is one of the most important tasks for geographic knowledge discovery. However, existing clustering methods have two severe drawbacks for this purpose. First, spatial clustering methods have so far been mainly focused on searching for patterns within the spatial dimensions (e.g., location <x, y>), while more general-purpose high-dimensional (multivariate) clustering methods have very limited power in recognizing spatial patterns that involve neighbors. Secondly, existing clustering methods tend to be ‘closed’ and are not geared toward allowing the interaction needed to effectively support a human-led exploratory analysis. To successfully leverage domain knowledge, human users need to be able to control and steer the process interactively and iteratively. The paper introduces a novel computational approach, which is open and highly interactive, to integrate the spatial cluster structure (within the spatial dimensions) into the non-spatial attribute space (comprising many non-locational attributes, e.g., the population, area, household, etc. of cities) and then to use this combined space for discovering high-dimensional geographic patterns. Main contribution of the paper includes three parts. (1) An effective and efficient method is developed for hierarchical spatial clustering within the spatial dimensions. It is graph-based and avoids the single-link effect at different hierarchical levels. (2) A 1-D spatial cluster ordering is derived from the hierarchical spatial clustering result and preserves all hierarchical clusters and some other spatial proximity information. Such an ordering can then be an additional “common” attribute for any high-dimensional clustering or classification method to automatically search high-dimensional geographic patterns. (3) The clustering method is implemented in a fully open and interactive manner and supported by various visualization techniques. This opens up the “black box” of the clustering process for easy understanding, configuration, and interpretation.

 

Keyword: Geographic Knowledge Discovery, Data Mining, Hierarchical Clustering, Cluster Ordering, Visualization


 

1.      INTRODUCTION

The unprecedented large size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. Spatial data analysis capabilities currently available have not kept up with the need for deriving the full potential of these data (Openshaw 1991; Miller and Han 2001). It is critical to develop new techniques to efficiently and effectively assist in deriving information from these large and heterogeneous spatial data repositories. Towards this goal, spatial data mining and knowledge discovery has been gaining momentum (Miller and Han 2001).

 

Clustering is to organize a set of objects into groups (or clusters) such that objects in the same group are similar to each other and different from those in other groups (Gordon 1987; Gordon 1996). While clustering is one of the most important tasks in data mining and knowledge discovery literature (Fayyad, Piatetsky-Shapiro et al. 1996), spatial clustering has also long been used as an important process in geographic analysis. To identify clusters over geographical space, various approaches have been developed, based on statistics (Zhang and Murayama 2000), Delaunay triangulation (Kang, Kim et al. 1997; Estivill-Castro and Lee 2000; Zhang and Murayama 2000), a density-based notion (Ester, Kriegel et al. 1996; Xu, Ester et al. 1998), a grid-based division (Wang, Yang et al. 1997), random walks (Harel and Koren 2001), and even a brute-force exhaustive searching method (Openshaw, Charlton et al. 1987). However, on one hand, existing spatial clustering methods can only deal with a low-dimensional space (usually 3-D space: 2 spatial dimensions, e.g., location <x, y> of a city, and a non-spatial dimension, e.g., the population of the city). On the other hand, general-purpose high-dimensional clustering methods developed in the data mining and knowledge discovery literature mainly deal with non-spatial feature spaces and have very limited power in recognizing spatial patterns that involve neighbors.

 

Spatial dimensions (e.g. location <x, y>) cannot simply be treated as 2 additional attributes in high-dimensional clustering methods because of two important reasons. First, the combination of spatial dimensions, which are not independent from each other, bears unique and real-world meanings. Their unique and complex inter-relationship can cause some difficulties for clustering methods (Gahegan 2000). Second, spatial clustering methods often adopt real-world dissimilarity (or distance) measures, e.g., road distance, travel time, etc., and consider complex situations, e.g. geographic obstacles (Tung, Hou et al. 2001). Such unique clustering considerations are hard to integrate within high-dimensional clustering methods. Thus, existing spatial clustering methods are geared to the analysis of spatial dimensions, while general-purpose high-dimensional clustering methods are good at identifying multivariate clusters. For geographic knowledge discovery, it is important to be able to identify high-dimensional (multivariate) spatial clusters, which involves both the spatial dimensions and several (or many) non-spatial dimensions. To meet this need, it is critical to find a way to integrate the power of both spatial and high-dimensional clustering methods.

 

Moreover, existing clustering methods tend to be ‘closed’ and are not geared toward allowing the interaction needed to effectively support a human-led exploratory analysis. To achieve both efficiency and effectiveness for exploring very large spatial data sets, it is critical to develop a highly interactive analysis environment that integrates the best of both human and machine capabilities. Computational methods can search large volumes of data for specific types patterns very quickly with mechanical accuracy and without bias, but they have very limited ability in adapting to various data sets and interpreting complex patterns. In contrast, humans can visually pick out complex and unexpected patterns very quickly, attach meaning to patterns (judge and interpret patterns), and generate hypotheses for further analysis. A desirable knowledge discovery system should have automated computational methods integrated with interactive visualization techniques that leverage the human expert’s knowledge and inference capabilities, to support a better human-machine collaboration.

 

The paper presents a novel approach to integrate the power of both spatial and high-dimensional clustering methods for identifying high-dimensional spatial clusters. The approach compresses the hierarchical spatial cluster structure within a spatial data set into a 1-D cluster ordering/encoding. Then the ordering/encoding (as an additional “common” attribute) is combined with other non-spatial attributes to form a new feature space. This combined space can be processed by any general-purpose high-dimensional clustering method for discovering high-dimensional spatial clusters. The contribution of the paper can be summarized into three parts. (1) An efficient and effective hierarchical spatial clustering method is developed for identifying arbitrary-shaped 2-D spatial clusters. It is a graph-based approach and potentially can accommodate various real-world metrics (e.g., road distance, travel time, etc.). It avoids the single-link effect by singling out boundary points at various hierarchical levels for special treatment. (2) The above hierarchical clustering method then can generate a spatial cluster ordering/encoding that fully preserves and represents all hierarchical clusters. By transforming hierarchical spatial clusters into a linear ordering/encoding, the integration of spatial and non-spatial information is made simpler since the spatial cluster structure is reduced to a single axis (a “common” attribute) in the feature space. (3) The clustering method is implemented in a fully open and interactive manner with support of various visualization techniques. The user can interactively control parameters of the clustering methods and see the immediate result corresponding to the parameter change.

 

The rest of the paper is organized as follows. Section 2 gives a review on related research. Section 3 presents the interactive hierarchical spatial clustering, ordering and encoding. Section 4 introduces the importance and usage of the spatial cluster ordering and encoding for searching high-dimensional (multivariate) spatial patterns. Section 5 gives a brief conclusion.

 

2.      RELATED RESEARCH

This section includes a brief review on both general-purposed clustering methods and hierarchical spatial clustering methods.

 

2.1  General-purpose clustering methods

Clustering methods can be divided into two groups: partitioning and hierarchical approaches (figure 1). The partitioning approach aims to divide the data set into several clusters, which may not overlap each other but together cover the whole data space. A data item will be assigned to the “closest” cluster based on a type of proximity or dissimilarity measure. Hierarchical clustering approaches decompose the data set with a sequence of nested partitions, from fine to coarse resolution. Hierarchical clustering can be presented with dendrograms, which consist of layers of nodes, each representing a cluster (Jain and Dubes 1988).

 

Figure 1. Classification of clustering methods

 

Within each group, according to their definitions of a cluster, clustering methods may also be classified into three groups: distance-based, model-based (or distribution-based), and density-based methods (figure 1). Distance-based clustering methods need a distance or dissimilarity measurement, based on which they try to group those most similar objects into one cluster. K-means and CLARANS (Ng and Han 1994) are distance-based partitioning methods, while the single-link and graph-based methods (Gordon 1987; Jain and Dubes 1988; Gordon 1996; Duda, Hart et al. 2000) can perform distance-based hierarchical clustering. Model-based or distribution-based clustering methods assume the data of each cluster conforms to a specific statistical distribution (e.g. Gaussian distribution) and the whole dataset is a mixture of several distribution models (Duda, Hart et al. 2000). Maximum Likelihood Estimation (MLE) and Expectation-Maximization (EM) are two examples of distribution-based partitioning clustering methods (Bradley, Fayyad et al. 1998). Density-based approaches regard a cluster as a dense region (relative to sparse regions) of data objects (Jain and Dubes 1988). Density-based clustering can adopt two different strategies: grid-based or neighborhood-based. A grid-based approach divides the data space into a finite set of multidimensional grid cells, calculates the density of each grid cell, and then groups those neighboring dense cells into a cluster. Such methods include Grid-Clustering (Schikuta 1996), CLIQUE (Agrawal, Gehrke et al. 1998), OptiGrid (Hinneburg and Keim 1999), etc. The key idea of neighborhood-based approaches is that, given a radius (e), the neighborhood of an object has to contain at least a minimum number of objects (MinPts) to form a cluster around this object. Two representative methods are DBSCAN (Ester, Kriegel et al. 1996; Ankerst, Breunig et al. 1999) and OPTICS (Ankerst, Breunig et al. 1999). Among those density-based methods, only OPTICS can perform hierarchical clustering.

 

2.2  Hierarchical spatial clustering

Spatial data sets often contain hierarchical structures and different patterns may exist at different levels (scales) within the hierarchy. Here only hierarchical spatial clustering methods are reviewed, although there also exist many partitioning spatial clustering methods (some of which are mentioned in the introduction section). Two groups of methods have been developed to detect hierarchical clustering structures in spatial data. The first group consists of those traditional hierarchical clustering methods, e.g., the single-link and graph-based methods (Gordon 1987; Jain and Dubes 1988; Gordon 1996; Duda, Hart et al. 2000). For 2-D spatial clustering, Delaunay triangulation has been used by several researchers (Kang, Kim et al. 1997; Estivill-Castro and Lee 2000) to reduce the time complexity for the construction of a dissimilarity matrix. AMOEBA is a Delaunay-based hierarchical spatial clustering method (Estivill-Castro and Lee 2000), which automatically derives a criterion function F(p) as the threshold to cut off long edges and then recursively processes each sub-graph to construct a hierarchy of clusters. AMEOBA tried to avoid the single-link effect by detecting noise points and excluding them in any clusters at each hierarchical level. However, AMOEBA’s criterion function is hard to justify/customize for different application data sets and tasks. Such a subjectively defined function actually predetermines the result clusters and cannot allow human interaction and steering. AMOEBA only works for 2-D point data.

 

The second alternative for hierarchical spatial clustering is to use a density-based partitioning algorithm with different parameter settings. Extended from DBSCAN (Ester, Kriegel et al. 1996), OPTICS (Ankerst, Breunig et al. 1999) is a neighborhood-based hierarchical clustering method (see figure 1). Given a “generating distance” (e) and MinPts, OPTICS first identifies core objects (those have at least MinPts neighboring objects within distance e) and non-core objects. Core objects can be connected with any other core objects, while non-core objects can only be reached via core objects (no connection allowed between non-core objects). OPTICS is better than AMOEBA in that it develops a clustering ordering to fully support interactive exploration of the hierarchical cluster structure. Although OPTICS is a density-based method, after removing the connection between non-core objects, it works exactly like a single-link method. In other words, it can only avoid the single-link effect at high levels (depending on the “generating distance”), i.e., the single-link effect still exists at low hierarchical levels. OPTICS relies heavies on index structures to speed up k-nearest-neighbor query and to maintain an O(nlogn) complexity. It does not discriminate spatial dimensions and non-spatial dimensions in the clustering procedure. Both AMOEBA and OPTICS cannot identify high-dimensional (multivariate) spatial patterns effectively.

 

3.      HIERARCHICAL SPATIAL CLUSTERING AND ORDERING

An efficient method, achieving O(nlogn) complexity without using any index structure, is developed to perform hierarchical spatial clustering and support a human-led interactive exploration of hierarchical clusters. The method has advantages of both AMEOBA and OPTICS. Moreover, it can generate an spatial cluster ordering/encoding to preserve and represent all hierarchical clusters. The method is based on the Delaunay triangulation (DT) and Minimum Spanning Tree (MST) and overcomes the single-link effect at different hierarchical levels via singling out boundary points for special treatment. To simplify the description of the method, we first introduce the method without considering boundary points. Then a method is introduced for singling out boundary points and treating them differently.

 

3.1  Description of the Method

The input consists of a set of 2-D points V = {v1, v2, …, vn}, where vi = <x, y> is a location in the geographic space. Our clustering method takes 4 steps: 1) construct a DT, 2) construct an MST from the DT, 3) derive an optimal clustering ordering and encoding; 4) visualize and interactively explore the hierarchical structure; 5)

 

1)      Construct DT

A Delaunay triangulation is constructed from the input points, using the Guibas-Stolfi algorithm, which adopts a divide-and-conquer strategy and is of O(nlogn) complexity (Guibas and Stolfi 1985). The triangulation result (figure 2 and 4) is stored in memory with efficient access for: each point, each edge, end points of an edge and incident edges on a point. Each edge has a length, which is the dissimilarity between its two end points.

 

2) Construct MST

A spanning tree is a tree constructed from a graph (G) that has the same set of vertices of G. Many spanning trees can be derived from the same graph. A minimum spanning tree (MST) of a graph G is the spanning tree of G whose sum of edge lengths is minimal among all spanning trees of G. Kruskal's algorithm (Baase and Gelder 2000), which is also of O(nlogn) complexity, is used to construct an MST from the DT. Basically an MST is a subset of those edges in a DT. At the beginning of the construction phase, the MST contains no edges and each point itself is a connected graph (altogether n graphs). The algorithm first sorts all edges in the DT in an increasing order. Following this order (from the shortest one), each edge is selected in turn. If an edge connects two points in two different connected graphs, the algorithm adds the edge to the MST. If an edge connects two points in the same component (i.e., forms a cycle in the graph), that edge will be discarded. When all the points are in a single graph, all selected edges form an MST of the DT (figure 2).

 

Figure 2. Construct a MST from a DT: edges are selected in the order of AB, BC, BE, CD, JK, HJ, HI, HG, JL, EF, DL. Numbers indicate the length of each edge.

 

3) Derive a cluster ordering and encoding

During the construction of an MST, an ordering of all points can be derived to completely preserve the hierarchical cluster structure and other spatial proximity information. A cluster (connected graph) can be viewed as a chain of points (Vandev and G.Tsvetanova 1995). At the lowest level, each cluster (or chain) contains a single point. Each chain has two end points (at the very beginning they are the same point). When an edge E is added to the MST, E connects two clusters, say, cluster C1 and C2. Let P11 and P12 are the two end points of C1, and P21 and P22 are the two end points of C2. Thus there are 4 choices to connect C1 and C2: P11 with P21, P11 with P22, P12 with P21, or P12 with P22. The closest two ends (let it be P11 and P21) will be connected and hence P12 and P22 are the end points of the new cluster. Edge E will be placed between P11 and P21. Please note: P11 and P21 are not necessarily the end points of E. The ordering of those points in figure 2 is shown in figure 3. All hierarchical clusters (points underscored by a line) are preserved (i.e., contiguous) in the 1-D ordering. Moreover, the ordering also preserves other spatial proximity information other than hierarchical clusters. For example, when D is merged to cluster {E, C, B, A} with edge DC, it can be placed next to A or next to E in the ordering—either choice will equally preserve the cluster {D, E, C, B, A}. It is placed next to E rather than A in the ordering because DE (dissimilarity between point D and E) is smaller than DA. Thus the proximity among D, E, and C is also preserved although they do not form a hierarchical cluster at any level.

 

Figure 3. Derive an ordering of the points, which preserves all hierarchical clusters (points underscored by a line) and other spatial proximity information.  Red lines are edges in the MST. The two points on each side of an edge here are not necessarily the end points of that edge.

 

Such an ordering of points and the accordingly ordered edges together constitute a complete representation of the hierarchical clustering structure within the point data set. We can further compress the ordering and the length of edges together into a 1-D encoding: each point is assigned a numeric value and the numeric difference between two neighboring points represents the edge between the two points in the ordering (figure 3). The encoding of those points in figure 2 and 3 can be derived at follows. First, an arbitrary number (here 0 is used) is assigned to the first point in the ordering (here it is point F), i.e., F = 0. Then D = F + EF (the edge between point F and D in the ordering) = 0 + 50 = 50. Similarly, E = D + DC = 60, C = E + EB = 68, etc. The final encoding is: 0, 50, 60, 68, 75, 81, 136, 161, 176, 192, 211, 231. Clearly, the hierarchical cluster structure can be fully recovered with this encoding. The absolute value of each number does not bear any meaning but altogether they represent the hierarchical clustering structure within the point data set.

 

4) Visualize the cluster ordering/encoding and interactively explore the hierarchy

Now we consider a larger point data set, which is shown in figure 4. With the above 3 steps, a final cluster ordering/encoding is derived. To facilitate human interaction and exploration of the clustering result, two novel visualization techniques are developed: the ordering viewer and the trend plot (figure 5).

 

Figure 4. A point data set with hierarchical clusters.  Yellow edges represent the DT and blue edges represent the MST derived from the DT.

 

Figure 5. The spatial cluster ordering/encoding (upper half) and the trend plot (lower half). (MinClusSize = 3)

 

The initial idea for visualizing the ordering/encoding has already been sketched in figure 3. Figure 5 shows a more detailed implementation of the idea. The horizontal axis represents the ordering of points (it is labeled instances because this visualization tool can also be used for non-spatial data set and ordering). Here there are altogether 74 points. The vertical axis represents the length of each edge, which is the numeric difference between two neighboring points in the ordering/encoding. Between two neighboring point in the ordering, there is an edge—thus there are altogether 73 edges. With this visualization technique, a cluster appears as a valley in the graph. Distinct clusters are separated by long edges (high ridges). The second horizontal line (other than the bottom axis) is the threshold value for cutting off long edges. By interactively dragging this threshold line (bar), one can interactively explore clusters at different hierarchical level. Clusters are colored differently and visualized in the 2-D map (figure 4) using the same colors.

 

A trend plot (bottom of figure 5) is developed to visualize the relationship between a distance threshold and the total number of clusters in the data set. In a trend plot, the horizontal axis represents values of possible threshold length. The vertical axis indicates the number of clusters given a threshold edge length. The threshold can be interactively set via dragging a vertical bar. This vertical bar is linked with the horizontal bar in the cluster ordering (top of figure 5): when you drag one, the other will move accordingly. To derive the trend plot, one needs to specify the minimal number of points (MinClusSize) that a cluster should have. Here MinClusSize = 3. With the cluster ordering and the trend plot, one can clearly understand the overall hierarchical cluster structure of the data and interactively explore with ease.

 

3.2  Tackling the Single-Link Effect

Without further improvement, a MST-based clustering method can suffer from the single-link effect (also called chaining effect), which is caused by some linearly connected points that run through a sparse area. In figure 7(a), point a, b, c, d, e, f, and g potentially can cause single-link effects at different hierarchical levels. For example, C21 will first merge with C13 rather than C22 through the connection between point a, b, and c. As reviewed in section 2, AMOEBA and OPTICS both tried to avoid the single-link effect but the former cannot support a flexible hierarchical clustering while the latter can only avoid the single-link effect at high levels (because its strategy is based on a generating distance) and rely heavily on an index structure. We develop a measurement—k-Deviation-to-Minimum-Length (k-DML)—to detect boundary points, which locate either on the boundary of a cluster (at any hierarchical level) or on a line in a sparse area. The difference between this measurement and the one used in OPTICS is that k-DML is based on variance of edged incident to a point. Thus k-DML can identify boundary points at various levels. By treating these boundary points differently, the single-link effect at all levels can be avoided. K means that only the shortest k edges incident to a point is considered in the measurement. Since in a Delaunay triangulation a point averagely has about 5-7 edges, here we use all edges incident to a point to derive the DML value. For a point p, its DML value is calculated with the following equation.

N is the number of edges incident to point p in the DT, Le is the length of an edge incident to p, and min is the length of the shortest edge incident to p. A high DML value indicates that the point locates on a boundary—some neighbors are very close while other neighbors are far away. We also develop an interface for the user to interactively change the threshold DML value and visualize those boundary points on a map (figure 6). The identification and special treatment of boundary points will make a difference in the construction of a MST and hence a difference in the ordering and encoding.

 

Figure 6: interactively determine the DML threshold for identifying boundary points.

 

(a)(b)

Figure 7: (a) A common MST, no consideration of boundary points. (b) MST considering boundary points  (cyan points). The single-link effect is avoided at many levels. Core points (purple points) are directly connected to other core points in the MST, while boundary points can be connected to either boundary points or core points.

 

We now name those non-boundary points as core points. In the improved MST, core points can only be connected through core points, i.e., on the path from one core point to another core point in the MST, no boundary point is allowed. Boundary points can connect to other boundary points or core points (figure 7b). The procedure for constructing an MST with special treatment of boundary points is sketched below. As an object in the programming language, each point has a variable named core_neighbor, which is the closest reachable (might through other boundary points) core point for a boundary point in the MST. The construction time of a common MST is O(nlogn). Additional steps in the following procedure that might be time consuming include step 15, 20, and 24. The size of the OrderedNewEdges is much smaller than the size of the data set (n). With binary insertion, adding a new edge to the OrderedNewEdges takes O(logn) time (step 15 and 20). If a connected graph has at least one core point, then all the boundary points in the graph should not have null core_neighbors. Thus step 24 is traversing a connected graph consisting only boundary points. Such a graph should be very small, as can be seen in figure 7. So, step 24 takes nearly a constant time. Thus the new MST construction procedure remains an O(longn) time complexity.

 

1                    MST_new (SetOfPoints, OrderedEdges)

2                        InteractivelyLabelBoundaryPoints (SetOfPoints);

3                        Construct OrderedNewEdges with no edge in it;

4                        WHILE OrderedEdges or OrderedNewEdges has edge(s) not processed yet

5                               Edge E = the shorter one in {OrderedEdges.next(), OrderedNewEdge.next()};

6                               IF both of E’s end points are core points

7                                       Consider E in the MST;

8                               ELSE IF one end point P of E is a boundary point

9                                   IF P.core_neighbor is null

10                                    Consider E in the MST;

11                                    IF E is added to MST

12                                        Set P.core_neighbor = the other end point of E;

13                               ELSE IF P.core_neighbor not null AND P.core_neighbor <> the other end point of E

14                                   Construct a new edge NE = < P.core_neighbor, the other end point of E>;

15                                  Insert NE into OrderedNewEdges;

16                           ELSE IF both of E’s end points (P1 and P2) are boundary points

17                               IF P1.core_neighbor not null AND P2.core_neighbor not null

18                                   IF P1.core_neighbor <> P2.core_neighbor

19                                       Construct a new edge NE with the two core_neighbors;

20                                      Insert NE into OrderedNewEdges;

21                               ELSE IF P1(or P2).core_neighbor is null

22                                   Consider E in the MST;

23                                   IF E is added to MST

24                                      Find all boundary points with null core_neighbors connected to P1(or P2) in current MST;

25                                       Set their core_neighbors equal to P2 (or P1).core_neighbor;

26                                       Set P1 (or P2).core_neighbor = P2 (or P1).core_neighbor;

27                   END; //MST_new

 

3.3  An Example with US Cities

To further demonstrate the above procedure for hierarchical spatial clustering, ordering and encoding, a point data set consisting 2850 USA cities (here we only consider their locations) is used to apply the above method. First, a Delaunay triangulation is constructed and boundary points are identified interactively (see figure 8).

Figure 8: Delaunay Triangulation (DT) with 2850 USA cities and interactive selection of boundary points based on DML values. Red points are boundary points and green points are core points. Dragging the DML threshold bar (current equal to 0.4) can dynamically change the threshold and see the according boundary points on the map.

 

Then an MST (yellow edges in figure 9) is constructed from the DT with special treatment of boundary points that are selected in the above step. It can be clearly seen that core points are directly connected to core points in the MST, i.e., on path from one core point to another core point, no boundary point is allowed. Thus the single-link effect is avoided at many levels.

Figure 9: the MST constructed from the DT with special care of boundary points.

 

Then a spatial cluster ordering/encoding is derived from the MST and visualized with the cluster ordering graph (middle in figure 10 and 11) and the trend plot (bottom in figure 10 and 11). With the visualization interface, one can interactively explore and interpret the spatial clusters at different hierarchical levels (figure 10 and 11).

Figure 10. Current distance (edge length) threshold is 2.709 and 8 major clusters (size>=10) are identified.

 

Figure 11. Current distance (edge length) threshold is 0.752 and 29 major clusters (size>=10) are identified.

 

4.      OREDERING/ENCODING FOR HIGH-DIMENSIONAL GEOGRAPHIC KNOWLEDGE DISCOVERY

As introduced in the previous section, the hierarchical cluster structure within a point data set can be well represented with the cluster ordering and encoding. Spatial clusters at various geographic scales are preserved. By transforming the complex hierarchical spatial clusters into a 1-D encoding, the integration of spatial and non-spatial information is made simpler since the spatial cluster structure is reduced to a single axis in feature space. A general-purpose high-dimensional clustering method can then process this combined feature (attribute) space (consisting the spatial cluster encoding and many other non-locational attributes) to automatically search multivariate spatial clusters.

 

An interactive hierarchical subspace clustering method is developed to search multivariate (high-dimensional) clusters within a very large and high-dimensional data set. By integrating the spatial clustering ordering/encoding with the hierarchical subspace clustering method, a working demo with a census data set (consisting various race population attributes) of 2850 USA cities is developed to demonstrate the usefulness of the spatial cluster encoding for multivariate spatial clustering and geographic knowledge discovery. But this part of research is not the focus of this paper. Related material can be found at http://www.geovista.psu.edu/members/dguo/subspace_clustering.html.

 

5.      CONCLUSION

The paper presents a novel approach for integrating the spatial clustering structure within the non-spatial attribute (or feature) space. The contribution can be summarized into three parts. (1) An efficient and effective hierarchical spatial clustering method is developed for identifying arbitrary-shaped 2-D clusters, which avoids the single-link effect. (2) The above hierarchical clustering method can generate a spatial cluster ordering and encoding that fully preserves all hierarchical clusters. By transforming hierarchical spatial clusters into a linear ordering and encoding, the integration of spatial and non-spatial information is made simpler since the spatial cluster structure is reduced to a single axis (a “common” attribute) in the feature space. With the ordering and encoding, a general-purpose high-dimensional clustering method can effectively identify multivariate spatial clusters. (3) The clustering method is implemented in a fully open and interactive manner with support of various visualization techniques. The user can interactively control parameters of the clustering methods and see the immediate result corresponding to the parameter change.

 

 

ACKNOWLEDGEMENT:

This paper is partly based upon work funded by NSF Digital Government grant (No. 9983445).

 

REFERENCE:

Agrawal, R., J. Gehrke, et al. (1998). Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD international conference on Management of data, Seattle, WA USA.

Ankerst, M., M. M. Breunig, et al. (1999). OPTICS: Ordering Points To Identify the Clustering Structure. Proc. ACM SIGMOD’99 Int. Conf. on Management of Data, Philadelphia PA.

Baase, S. and A. V. Gelder (2000). Computer Algorithms, Addison-Wesley.

Bradley, P., U. Fayyad, et al. (1998). Scaling clustering algorithms to large databases. KDD-98.

Duda, R. O., P. E. Hart, et al. (2000). Pattern classification and scene analysis. New York, Wiley.

Ester, M., H.-P. Kriegel, et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI Press.

Estivill-Castro, V. and I. Lee (2000). Amoeba: Hierarchical Clustering Based On Spatial Proximity Using Delaunaty Diagram. 9th International Symposium on Spatial Data Handling, Beijing, China.

Fayyad, U., G. Piatetsky-Shapiro, et al. (1996). From data mining to knowledge discovery-An review. Advances in knowledge discovery. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusay. Cambridge, MA, AAAI Press/The MIT Press: 1-33.

Gahegan, M. (2000). “On the application of inductive machine learning tools to geographical analysis.” Geographical Analysis 32(2): 113-139.

Gordon, A. D. (1987). “A Review of Hierarchical Classification.” Journal of the Royal Statistical Society. Series A (General) 150(2): 119-137.

Gordon, A. D. (1996). Hierarchical Classification. Clustering and Classification. P. Arabie, L. J. Hubert and G. D. Soete. River Edge, NJ, World Scientific Publ.: 65-122.

Guibas, L. and J. Stolfi (1985). “Primitives for the Manipulation of General Subdivisions and the Computation of Voronoi Diagrams.” ACT TOG 4(2).

Harel, D. and Y. Koren (2001). Clustering spatial data using random walks. Proceedings of the seventh conference on Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, California.

Hinneburg, A. and D. A. Keim (1999). Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. Proceedings of the 25th VLDB Conference, Edingburgh, Scotland.

Jain, A. K. and R. C. Dubes (1988). Algorithms for clustering data. Englewood Cliffs, NJ, Prentice Hall.

Kang, I.-S., T.-W. Kim, et al. (1997). A spatial data mining method by Delaunay triangulation. the 5th international workshop on Advances in geographic information systems, LasVegas, Nevada.

Miller, H. J. and J. Han (2001). Geographic Data Mining and Knowledge Discovery: an overview. Geographic Data Mining and Knowledge Discovery. H. J. Miller and J. Han. London and New York, Taylor & Francis: 3-32.

Ng, R. and J. Han (1994). Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. 20th International Confer-ence on Very Large Databases, Santiago, Chile.

Openshaw, S. (1991). Developing appropriate spatial analysis methods for GIS. Geographical information systems. Vol. 1: principles. D. J. Maguire, Longman/Wiley: 389-402.

Openshaw, S., M. Charlton, et al. (1987). “A Mark 1 Geographical Analysis Machine for the automated analysis of point data sets.” International Journal of Geographical Information Science 1(4): 335-358.

Schikuta, E. (1996). Grid clustering: An efficient hierachical clustering method for very large data sets. 13th Conf. on Pattern Recognition, IEEE Computer Society Press.

Tung, A. K. H., J. Hou, et al. (2001). Spatial clustering in the presence of obstacles. The 17th International Conference on Data Engineering (ICDE'01).

Vandev, D. and Y. G.Tsvetanova (1995). Perfect chains and Single Linkage Clustering Algorithm. Statistical Data Analysis, Proceedings SDA-95.

Wang, W., J. Yang, et al. (1997). STING : A Statistical Information Grid Approach to Spatial Data Mining. 23rd Int. Conf on Very Large Data Bases, Athens, Greece, Morgan Kaufmann.

Xu, X., M. Ester, et al. (1998). “A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases.” IEEE.

Zhang, C. and Y. Murayama (2000). “Testing local spatial autocorrelation using k-order neighbors.” International Journal of Geographical Information Science 14(7): 681-692.