Automated interpretation of the spatial distribution of socio-economic conditions
Dr Jack Massey
Principal Research Fellow, GISCA
The University of Adelaide, Australia
jmassey@gisca.adelaide.edu.au
Abstract
Over the past decade GIS has achieved recognition, consolidation and, for the most part, commercial success. We – the practitioners – can be proud that GIS will have a grand future. But what should we do next? I consider it important to identify our big theoretical problems and start to solve them. The big theoretical problem I am trying to solve is the automated interpretation of the spatial distribution of socio-economic conditions through the manipulation of population and housing census data. I am solving this problem by the quantitative analysis of what I call spatial cognition surfaces™ – perspective views. The results of this work will be most relevant to specific purpose problem solving by members of governmental agencies and private sector entities. Currently, most investigators intuit their decision. Then they find the data and make several choropleth maps from them. The data and the maps substantiate the intuition. At best a choropleth map is an inappropriate representation of the spatial distribution, and, at worst, it contributes manifestly to the already difficult task of interpretation. Contour maps are aesthetically unpleasing, but at least they go part way to characterizing the continuous nature of the spatial distribution of a socio-economic condition. Regardless, similar to choropleth maps, interpreting reality from a contour map is a long and tortuous job. A perspective view gives a good – even exciting – idea of the spatial distribution of a socio-economic condition. However, apparently, it is only a qualitative characterization. Regardless, what is it about perspective views that captures my attention and holds it? Perspective views simulate the way humans construct an answer to a question about the spatial distribution of a socio-economic condition. Surface shape is the key to understanding and, in particular, quantifying this simulation.
I have started to investigate the distribution of the volume beneath a spatial cognition surface™ and above an intersecting plane that lies above and parallel to the x,y-plane, which represents the study region. There appears to be an important relationship between the nature of the spatial distribution and the change in the area circumscribed by what are termed “volume contours”. Success with this research will lead naturally to solving the problem of the automated interpretation of the spatial distribution of a socio-economic condition. If this can be achieved then we are looking forward to a time when substantiation will be replaced by an approach based on a strong, quantitative bridge between data and decision.

Caveat

This paper must not be referred to or quoted from without the written permission of the author. The author reserves the right to edit this paper and publish it elsewhere. The author asserts his moral right to the intellectual property vested in this paper, and states that this intellectual property is owned in its entirety by Spatial Cognition Surfaces Pty Ltd.

GIS Context

Past and future

There have been remarkable advances associated with the implementation of the aftermath of the quantitative revolution in geography. By the late 60s geographic information systems (GIS), then commonly referred to as exercises in “geographic data handling”, represented an important quantitative sub-discipline of geography, but still only one of several. There were also determined efforts in studying applications of univariate statistics, distributions of point patterns, methods of modeling processes, and techniques of multivariate data analysis. However, as time passed many would-be quantifiers drifted back to their idiographic research, while others jumped first on the social justice and then on the post-modernism bandwagons in what has proved to be a futile attempt to adapt a nomothetic explanatory form.
In particular, the post-modernists can be singled out as those responsible for bringing geography that was so strong at the start of the 20th Century into disrepute by the end. There were exceptions to this trend and these persons comprised the young and hard-core investigators of GIS. They persevered, and by the early 80s quantitative geography and GIS were virtually synonymous. GIS development accelerated in the early 80s and has been growing ever since. It has followed closely the development and commercial availability of computational technologies. By the end of the 70s raster and vector technologies - using television monitors as the primary presentation device - made high quality desktop mapping possible. However, it was still restricted to few investigators because of the expense of both monitors and the associated mini-computers. This changed in the early 80s with the introduction of the IBM PC. Within a short period computing in general and color raster graphics monitors in particular became affordable; the impact on GIS was dramatic. GIS development in the 80s could be described as an intellectual and commercial roller coaster ride, but the 90s has been characterized by consolidation and commercial success for those who got into GIS in the early days, maintained their faith in its potential, and worked very hard. If you want evidence of this assertion examine the history of ESRI. Those of us old enough to remember have much of which to be proud. The others can be proud of becoming part of a strong discipline that has a grand history. GIS was born in the irrefutable paradigm of geographic regionalisation or “areal differentiation” of the first half of the 20th Century and it matured during the second half by identifying appropriate applications for which appropriate technologies were developed and applied. Regardless of our age and our experience we can all be proud that GIS – as a result of our efforts – will have a grand future. 
It is to the shaping of that future that this paper is directed. So the question is: What should we do next? It can, in part, be answered in the negative by urging a slow down in new software feature development. We do not need an integer increment in a software version number every several months, but it would be nice to think that efforts were being made to find and correct bugs in existing general-purpose GIS software and improve its performance. Why not a moratorium for the next two years on anything but decimal point increments of version numbers? Additional functionality serving little purpose, referred to as “flabware”, may do more harm than good to the reputations of the established vendors of GIS software products. A positive response to the question is exemplified in the applications orientation of Map India 2002. There seems to be the explicit recognition, at least among Indian practitioners, that GIS is something to be done, done now, and done with what tools are available. There are paper sessions in this conference on applications in agriculture, applications in telecommunications, applications in banking and insurance, applications in highways and transportation, and the list goes on. Each is a frontier that represents important applied problems which when solved in the Indian setting with its multifaceted complexities and subtleties will provide wonderful models – “solution models” - to be adapted to problems elsewhere. We can not afford not to solve these problems; it is time that GIS practitioners realized the broad significance and fundamental importance of our discipline. Another positive response is to urge the GIS community – students, academics, consultants and vendors – to take stock of what GIS has achieved and, in particular, what it has not. As a result we must identify the big theoretical problems of GIS. Out of all the work around the world over, say, the last forty years can’t we distill the essence of these big theoretical problems? 
For lack of time and/or money they remain as obstacles to continued strong growth. They also remain as obstacles to the acceptance of GIS as a legitimate academic field instead of just a sub-discipline of geography obsessed with the development and application of technology. I identified one of these problems five years ago and since then my research has had as its goal the finding of a solution. But this is the effort of only one investigator working alone and without funding. So here is the challenge to the GIS community: Identify the big theoretical problems of GIS and start to solve them. To do otherwise may result in stagnation of the growth we have experienced for more than twenty years.

Fig. 1: A choropleth map of Burnside

Focus

GIS investigators are offered many frontiers, each representing important problems. Many have solutions with immediate and substantial financial reward. Substantial financial reward – but perhaps not immediate – will also be attracted by the solution of the other problems, even those with a large theoretical component. Financial reward is a strong motivator and another motivator is the satisfaction of solving a complex problem. It would be difficult to argue that the problems of developing and applying GIS are not complex.

GIS means many different things to many different investigators. To me it means all aspects of the computational manipulation of population and housing census data. This will be the focus of this paper. Given this focus I will discuss what I consider to be a major theoretical research frontier of GIS and I will describe what I am doing to solve the problem it represents. I hope that this paper will stimulate others to contribute to the solution of this problem. The frontier is the automated interpretation of the spatial distribution of socio-economic conditions through the manipulation of population and housing census data.
I am solving the problem that this frontier represents by examining surfaces that characterise the spatial distribution of a socio-economic condition described by the census data.

Problem Solving Environment

General purpose vs specific purpose

There are two ways of approaching an investigation of spatial distributions of socio-economic conditions through the manipulation of population and housing census data. First, an investigator may wish to find what is commonly referred to as the “general structure” in the data. This implies looking for broad patterns and, more likely than not, broad patterns of interrelationships among a relatively large number of variables. General purpose studies were popular in the 60s and 70s with the major tools being principal components analysis and cluster analysis. The search for general structure has experienced renewed interest with recent experiments in data mining and, in particular, the use of a variety of artificial neural network techniques. All this work can be justified on academic grounds but it is not easy to justify it in the pragmatic problem solving environment of many governmental agencies and most private sector entities. In these environments specific-purpose investigations are most common.

Therefore, the second approach typically involves the examination of individual census data variables each of which is considered relevant to the solution of a tightly defined problem. An “aged care provision” problem may be solved by examining the spatial distribution of persons 65 years old and over; “teaching English as a second language” – persons born in non-English speaking countries; “marketing first home loan products” – married couples with one dependent child; and, “selling income protection insurance” – self-employed persons. In other words given a tightly defined problem the investigator chooses one or several census variables to study in detail.
And the investigator also chooses the geographical region, such as a metropolitan area, and the level of generalisation, such as the street block. The geographical choices, just as with the choices of census variables, are a reflection of a tightly defined problem. It is in this problem solving environment that the research discussed in this paper is considered most relevant.

Analysis and mapping

One might think that given a tightly defined problem the selection of the census variables and choices related to geography would constitute the most difficult aspects of problem solving. After all there is a dazzling array of software products to allow the investigator to effect all manner of data and statistical analyses; many of these products have mapping capabilities. As well there are many other products that offer mapping as their primary feature including a wide variety of ancillary data and statistical analysis capabilities. But having software and knowing how to effect high quality analyses and mapping are not the same thing. High quality analysis implies years of professional training at the postgraduate level and years of on-the-job experience. I argue that obtaining training in the use of mapping procedures is even more difficult. There are relatively few professional training programs and many of the existing programs are strongly oriented to particular proprietary mapping software products. The number of professionals with extensive on-the-job experience in mapping census data is very small. In summary, of the hundreds of thousands of users of population and housing census data few have professional training in the use of the basic tools of interpretation and even fewer have much on-the-job experience.

So what do most of the users – the professional investigators – do? A cursory examination of a small number of reports purporting to interpret the spatial distribution of a socio-economic condition reveals a common treatment.
A typical report consists of a brief description linking the spatial distribution to the purpose of the study. This prose is accompanied by several tables of census data and a few color choropleth maps. One is unlikely to find an explanation of the processes that account for the spatial distribution. One is also unlikely to find the results of anything but the most rudimentary of data and statistical analyses, and these are often of questionable relevance to the purpose of the study. Maps appear to be included for their cosmetic value, barely supporting the arguments for the spatial distribution of the socio-economic condition being investigated.

This is what I call “the substantiation approach”. The user intuits the solution, then finds tables of census data, and then makes choropleth maps using these data in order to support, ie substantiate, the intuition. The reports are used for making decisions, so if the intuition is good then there is the possibility of a good decision. If the intuition is bad, well …? The bridge between data and decision is intuitive or qualitative and, therefore, weak.

The research reported in this paper constitutes the beginning of an attempt to change this. It is an attempt to build a quantitative – an objective – method for the automated interpretation of the spatial distribution of socio-economic conditions through the manipulation of population and housing census data. I assert that this is a major theoretical research frontier of GIS. One motivation for solving the complex problem it represents is satisfaction, and the second is the substantial financial reward that the solution is likely to attract.

Conventional Graphics

Choropleth maps

Figure 1 is a map showing the spatial distribution of Italian born as a percentage of the total population for the City of Burnside in the southeast of the greater Adelaide metropolitan area of South Australia, Australia.
Burnside is a statistical local area (SLA, using the terminology of the Australian Bureau of Statistics (ABS)) comprising 80 census collection districts (CCDs, again using the terminology of the ABS). Blue indicates 0.0% to less than 1.1% (21 CCDs), green – 1.1% to less than 1.7% (21 CCDs), yellow – 1.7% to less than 2.4% (17 CCDs), and red – 2.4% to 18.3% (21 CCDs). Each of the CCDs contains several hundred households. The top of the page is north and a square surrounding the area is roughly 49 square kilometers so you have an idea of the scale.

Isn’t this a nice map to look at? That was a facetious, rhetorical question. You see this type of map every day! I have been looking at computer generated choropleth maps since 1966 when I saw one of the first outputs of SYMAP V, a FORTRAN IV implementation for line-printer output created at the Northwestern Technological Institute and popularized by the Harvard Laboratory for Computer Graphics and Spatial Analysis. In my opinion the map is not just boring. At best, it is an inappropriate representation of the spatial distribution and, at worst, it contributes manifestly to the already difficult task of interpretation.

Five years ago I identified what I did not like about choropleth maps and the following section is a brief exposition of my criticisms. But I don’t intend just to complain; if I did, possibly I would be labeled a post-modernist and, in reality, I am a logical positivist. Instead, I have been trying to find a more effective way of representing the spatial distribution of a socio-economic condition.

Fig. 2: Two CCDs of Burnside

Criticisms of choropleth maps

First, persons who are faint-hearted are advised to avoid the interpretation of choropleth maps. The path from map to reality is circuitous with many traps for the uninitiated. Consider one coloured patch (in the case of the map in figure 1, one of eighty); this patch represents one of two or more intervals of derived data.
The derived data are usually expressed as percentages with the denominator as the total population, but this is by no means the only valid denominator – it all depends on the purpose of the investigation. The raw data used as the numerator are frequency counts, eg in CCD 4121302 there are 52 persons who were born in Italy. Note well, the raw data are summary data – not unit record data – so of the 52 persons born in Italy we are unlikely to discover how many have a high school diploma, nor is it likely that we will discover how many are members of families residing in a semidetached house. Therefore, the choropleth map may stimulate enthusiasm for executing interesting and valuable secondary investigations, but the data used to produce these maps and confidentiality constraints on the use of unit record data often preclude the undertaking. Each frequency count is for a CCD containing several hundred households, so to see one of the 52 Italian born on the ground within CCD 4121302 may be considered a fortuitous experience. But on the ground is reality and the choropleth map – the starting point for the interpretation of the spatial distribution of the socio-economic condition – is a long and tortuous way from that reality. Second, look at the solid colors, the straight lines, and the crisp boundaries. Visually, a choropleth map is a strong statement. It implies intellectual rigor and scientific integrity. Rather than a hypothesis generating starting point, it appears to be a conclusion – something arrived at after much experimentation, analysis and contemplation. But I know, and so do you, that a little change in one of the mapping specifications will dramatically alter the characterization of the spatial distribution of the socio-economic condition. I could specify equal interval instead of equal frequency classification; or, I could specify five classes instead of four. In other words the choropleth map in figure 1 is just one of an infinite set. 
It might be a reasonable characterization of the spatial distribution, but it might not!

Third, the spatial distribution of the socio-economic condition, Italian born as a percentage of the total population, is not discontinuous. But look at two contiguous CCDs from figure 1 which have been enlarged and are presented in figure 2. Figure 2 implies that there is a sudden jump (a discontinuity) from relatively few Italian born living in the western (blue) CCD to relatively many Italian born living in the eastern (red) CCD. Quite frankly this is not true. We will never know the “exact” spatial distribution of Italian born as a percentage of the total population, but one thing we do know is that at all but the individual human level of generalization the distribution must be considered continuous. Therefore, it is characterized poorly by the colored patchwork quilt effect of the choropleth map.

It is easy to write computer programs to generate choropleth maps. I suspect that this is the principal reason that choropleth mapping is the most common form of cartographic representation of census data. But if we accept the criticisms of choropleth mapping and we are inquisitive, then we are motivated to consider alternative types of graphics as the basis of interpretation. Two alternative types are contour maps and perspective views.

Contour maps and perspective views

Is the spatial distribution of a socio-economic condition characterized best by a 2-dimensional or by a 3-dimensional graphic? This is not an easy question to answer so, as argued, let us first agree that the spatial distribution of a socio-economic condition is continuous. Now let us agree that at least at the macro level the landform of the surface of the earth is continuous; and, how would I characterize the landform of the surface of the earth?
I could use a contour map (a 2-dimensional graphic), and to provide an immediate impression of the mountains, valleys and plains I could use a perspective view (a 3-dimensional graphic in a 2-dimensional space – the surface of the paper). Can I not do the same with Italian born as a percentage of the total population? After all it is just another continuous distribution. The answer is yes.

Fig. 3: A contour map of Burnside

Figure 3 provides an example of a contour map. The black indicates virtually no Italian born and the red patch indicates a lot of Italian born relative to the total population. While I do not find contour maps aesthetically pleasing, at least they go part of the way to characterising the continuous nature of the spatial distribution of a socio-economic condition. Regardless, similar to choropleth maps, interpreting reality from a contour map is a long and tortuous journey. There are many contour mapping options, and a change in any one will modify, often dramatically, the characterisation of the spatial distribution. Also, the contour map provides a strong visual statement suggesting the result of hypothesis testing, not what it really is – the starting point of an investigation, a vehicle for hypothesis generation. Finally, there are many methods of contour interpolation (or extrapolation) and smoothing, and these and their implications are considered briefly in the following discussion of perspective views.

A perspective view looking from the southwest at Burnside and showing Italian born as a percentage of the total population is presented in figure 4. It gives a good – even exciting – idea of the spatial distribution of this socio-economic condition. There is a “mountain” of Italian born in the northeast corner and this mountain will be the subject of considerable analysis in my ongoing research. As with most things there is good news and bad news; let us deal with the bad news about perspective views first.
Using the census data for Burnside you prepare to generate a perspective view by building a centroid set containing 80 subsets (one for each CCD) with each subset being an {x,y,z}-tuple. The x and y elements are the easting and northing respectively of the centroid for one of the CCDs and the associated z element is the percentage datum for that CCD. Notwithstanding the fact that the locations of the centroids are usually not determined with anything approaching the rigor of applied plane geometry, thus far the investigator has little to worry about. But that is about to change. The centroid set is used to build a matrix or “grid” with rows and columns. The intersection of a row and column is a cell containing a number which is an interpolated (or extrapolated) and smoothed datum. The grid data will be used to generate the perspective view so the nature of the view is dependent completely on the method used for interpolation (or extrapolation) and smoothing; the same statement applies to the generation of a contour map, as discussed. The selection of this method is anything but straightforward, since even after the method has been selected usually its application can be “tuned” by adjusting several parameters. Kriging is the method used to build the grid data that, in turn, were used to generate the perspective views presented in this paper. But other methods include minimum curvature, polynomial regression, and triangulation with linear interpolation. There are many others and the application of each results in a grid containing very different interpolated (or extrapolated) and smoothed data and, hence, will result in the generation of very different looking perspective views (and contour maps). The question is: Which one is the best perspective view? But wait, there is more. Given the grid data, the technology of generating a perspective view is very involved with a vast array of options. 
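The step from centroid set to grid can be made concrete with a small sketch. It is not the kriging procedure used for the figures in this paper; it swaps in inverse distance weighting (IDW), a much simpler interpolator, purely to show how {x,y,z}-tuples become grid cells and how one tuning parameter changes the result. The function name `idw_grid`, the `power` parameter and the four centroids are all illustrative assumptions, not Burnside data.

```python
def idw_grid(centroids, n_rows, n_cols, power=2.0):
    """Interpolate (x, y, z) centroid tuples onto an n_rows x n_cols grid.

    Inverse distance weighting is used here as a simple stand-in for
    kriging. Each grid node receives a weighted mean of all centroid z
    values, with weights proportional to 1 / distance**power.
    """
    xs = [c[0] for c in centroids]
    ys = [c[1] for c in centroids]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    dx = (x_max - x_min) / (n_cols - 1)  # easting spacing between columns
    dy = (y_max - y_min) / (n_rows - 1)  # northing spacing between rows

    grid = []
    for r in range(n_rows):
        gy = y_min + r * dy
        row = []
        for c in range(n_cols):
            gx = x_min + c * dx
            num = den = 0.0
            exact = None
            for x, y, z in centroids:
                d2 = (gx - x) ** 2 + (gy - y) ** 2
                if d2 == 0.0:
                    exact = z  # node coincides with a centroid
                    break
                w = 1.0 / d2 ** (power / 2.0)
                num += w * z
                den += w
            row.append(exact if exact is not None else num / den)
        grid.append(row)
    return grid


# Hypothetical centroid set: (easting, northing, percentage) for four areas.
centroids = [(0.0, 0.0, 0.5), (10.0, 0.0, 1.5), (0.0, 10.0, 2.0), (10.0, 10.0, 8.0)]

smooth = idw_grid(centroids, 5, 5, power=1.0)
sharp = idw_grid(centroids, 5, 5, power=4.0)
print(smooth[1][1], sharp[1][1])  # same node, different "tuning"
```

Changing `power` alone changes the value at almost every node, which is the sensitivity complained of above: the same centroid set yields different grids, and hence different views, depending on choices the investigator may not realise are being made. Once a grid exists, rendering it as a perspective view involves a further set of choices.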
For example, view point and distance from the view are but two of the choices to be made by the investigator, and they are perhaps the easiest to make. More subtle is the application of colour, the nature of shading and the aspect ratio.

Fig. 4: A perspective view of Burnside – Italian born

The most unsettling aspect of perspective views is related to exactly what is so engaging about them in the first instance: they give a good – even exciting – idea of the spatial distribution of a socio-economic condition. But this, apparently, is where it stops. After looking at the view from several different points and distances and changing the colors and aspect ratio and adjusting a number of other parameters, you are drawn to the conclusion that the perspective view is only a qualitative characterisation of the spatial distribution of the socio-economic condition. It cannot be analysed quantitatively.

The good news: Spatial Cognition Surfaces™

We could leave the topic of perspective views with the conclusion that they are nice to look at but of limited scientific and technological value since they are not quantifiable. However, I chose five years ago to examine the nature of perspective views beyond this seemingly reasonable conclusion. This was done in the hope that the experience might give me direction in my attempt to build a strong, quantitative bridge between data and decision and, as part of this, solve the problem of designing a method for the automated interpretation of the spatial distribution of a socio-economic condition. Therefore, instead of dwelling on aspects of what perspective views show me, eventually, I asked the intriguing question (the formulation of which took almost five years!): Why do I find perspective views so engaging? Or, what is it about perspective views that captures my attention and holds it? The answer is obvious, yet startling.
Perspective views simulate the way humans think about spatial distributions; perspective views simulate the way humans construct answers to questions about spatial distributions. Because of this, the perspective views that I have generated for and presented in this paper, and that I intend to analyze quantitatively, I call “spatial cognition surfaces™”. They will help us know spatial distributions.

Fig. 5: A perspective view of Burnside – United Kingdom born

Let me elaborate with another example. I assume you know Adelaide well, and I ask you: Where is there a concentration of Italian born? Almost instantaneously you amass a large amount of factual and fictional elements of data and information about what you perceive is meant by the phrase, Italian born. Each element has a (probably fuzzy) locational reference. And it is the high correspondence or coincidence of these locational references that implies a volume of evidence for Italian born concentrated in an area of Adelaide. The volume could also be referred to as a “mountain” – a mental mountain – not dissimilar to the Italian born mountain in figure 4 generated from the census data for Italian born as a percentage of the total population. What is important about this mountain from the point of view of interpretation is not its size but its shape. It is the shape of the mental mountain that leads to the interpretation of the spatial distribution of Italian born as concentrated in the northeast portion of Burnside. See the mountain peak in the northeast corner of Burnside in figure 4. The mental mountain and the data mountain lead to the same conclusion.

Now, compare figure 4 with figure 5, a perspective view of United Kingdom born as a percentage of the total population. Figure 5 intimates a plateau and this implies not a concentrated but a ubiquitous spatial distribution. And, if I ask you: Where is there a concentration of United Kingdom born? You might shrug your shoulders and say: “Sorry, that’s a tough one.
They are pretty much everywhere.” And you would be right. But the point is your interpretation and the data plateau of figure 5 lead to the same conclusion.

Observations from and implications of the perspective views in figures 4 and 5 motivate the next phase of my research. This is the quantitative analysis of the shape of the surface characterising the spatial distribution of a socio-economic condition.

Fig. 6: One-tenth of the volume of the Italian born spatial cognition surface™

Figure 6 is a perspective view from the northeast of Burnside showing only one-tenth of the volume of the spatial distribution of Italian born (ie the top of the data mountain). It indicates the very early stage (where I am at the time of writing) in thinking about the quantitative analysis of spatial cognition surfaces™. I have computed the volume beneath the red data surface down to the blue volume “determination” plane that is perpendicular to the z-axis at a given ‘z’ value. Then I projected the intersection of the surface and the plane onto the x,y-plane containing the base map of Burnside showing the CCD boundaries. The intersection is indicated by the red volume contour lines.

There appears to be an important relationship between the nature of the spatial distribution and the change in the circumscribed area from one volume contour to the next. So for the concentrated Italian born spatial distribution the area circumscribed by the one-tenth volume contour is not much smaller than the area circumscribed by the one-fifth volume contour. But suddenly, there is a much larger area circumscribed by, for example, the one-half volume contour, indicating that that volume plane is intersecting much more of the data surface than just the mountain. The z-value for the one-half volume plane is much smaller than the z-value for, say, the one-tenth volume plane; the one-half volume plane is much closer to the x,y-plane.
Such a marked change is not noted in an examination of the surface for United Kingdom born. There are relatively small increases in the area circumscribed by the volume contours as one considers a relatively small volume first and gradually progresses to larger volumes (ie intersecting the data surface with progressively lower volume planes).

These initial observations will take on greater or lesser significance as mathematical techniques are applied to the spatial cognition surfaces™ to effect quantitative analysis in a formal sense. But the point is – and this is crucial – spatial cognition surfaces™ are amenable to quantitative analysis and this research has just begun. Success with this research will lead naturally to solving the problem of the automated interpretation of the spatial distribution of a socio-economic condition. If this can be achieved then we are looking forward to a time when the substantiation approach will be replaced by an approach based on a strong, quantitative bridge between data and decision. This will not preclude or even inhibit investigator creativity. On the contrary, it will allow the trained and experienced investigators to feel confident that their intuitions and the data are “saying” the same thing about the spatial distribution. It will allow the untrained and inexperienced investigators to feel confident that they have a reliable starting point. Regardless, the time that is saved by minimising qualitative aspects of interpretation can be allocated to the much more arduous and important task of describing the relevant socio-economic processes, ie explaining the spatial distributions.
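The volume contour idea described for figure 6 can be illustrated with a short sketch. It is one possible reading of the procedure, not the author's implementation. It assumes the surface is already a grid of non-negative z values on unit cells, finds by bisection the height of the horizontal plane that leaves a given fraction of the total volume between surface and plane, and then measures the area of the cells lying above that plane. The two surfaces are synthetic (a Gaussian "mountain" on a low base and a broad, nearly flat "plateau"), and every function name is hypothetical.

```python
import math


def cap_volume(grid, plane_z, cell_area=1.0):
    """Volume between the surface and a horizontal plane at height plane_z,
    counting only the parts of the surface that rise above the plane."""
    return sum(max(z - plane_z, 0.0) for row in grid for z in row) * cell_area


def plane_for_fraction(grid, fraction, cell_area=1.0, tol=1e-9):
    """Bisect for the plane height whose cap volume equals `fraction`
    of the total volume beneath the surface (assumes all z >= 0)."""
    target = fraction * cap_volume(grid, 0.0, cell_area)
    lo, hi = 0.0, max(z for row in grid for z in row)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cap_volume(grid, mid, cell_area) > target:
            lo = mid  # cap still too large, so raise the plane
        else:
            hi = mid  # cap too small, so lower the plane
    return (lo + hi) / 2.0


def contour_area(grid, plane_z, cell_area=1.0):
    """Area circumscribed by the volume contour, approximated as the
    total area of grid cells whose z value lies above the plane."""
    return sum(1 for row in grid for z in row if z > plane_z) * cell_area


def make_surface(n, peak, width, base):
    """Synthetic n x n surface: a Gaussian bump of height `peak` and
    spread `width`, centred on the grid, sitting on a flat `base`."""
    c = (n - 1) / 2.0
    return [
        [base + peak * math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * width ** 2))
         for j in range(n)]
        for i in range(n)
    ]


def area_profile(grid, fractions=(0.1, 0.2, 0.5)):
    """Circumscribed area for each volume fraction, smallest fraction first."""
    return [contour_area(grid, plane_for_fraction(grid, f)) for f in fractions]


mountain = make_surface(41, peak=18.0, width=3.0, base=1.0)  # concentrated
plateau = make_surface(41, peak=1.0, width=15.0, base=5.0)   # ubiquitous

print("mountain:", area_profile(mountain))
print("plateau: ", area_profile(plateau))
```

On these synthetic surfaces the mountain's circumscribed area stays small for the one-tenth and one-fifth fractions and then jumps once the plane drops below the surrounding base, while the plateau's area barely changes. That is the qualitative contrast reported above for Italian born and United Kingdom born. A ratio such as the one-half area divided by the one-tenth area is one candidate for a shape statistic, though whether it is the right one remains an open question.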
© GISdevelopment.net. All rights reserved.