Detection of Leukemia Epidemiology in Iran using GIS and Statistical Analyses
Zohreh Mossoomy
Student of Geospatial Information Systems
Faculty of Geodesy & Geomatics
K.N.T University of Technology, Tehran, Iran.
Email: z_massoomy@yahoo.com
M. Saadi Mesgari
Assistant professor of Geospatial Information Systems
K.N.T University of Technology, Tehran, Iran.
Email: mesgari@kntu.ac.ir
ABSTRACT:
Public health managers often want to map the locations where a disease occurs and to study its spatial distribution. Proper visualization of the spatial pattern and distribution of the disease can guide such managers in setting priorities for allocating budget, personnel, equipment, etc. Yet visual interpretation of such maps alone cannot show where a disease occurs markedly more often than normal. In other words, alarming concentrations of the disease cannot be identified from the locations of the cases alone; intensive geostatistical analysis is required.
In this research, different methods for studying and visualizing the spatial distribution pattern of leukemia were compared and tested, and a suitable method for detecting spatial clusters of leukemia was selected. Using the selected method, statistics of deaths caused by leukemia in 18 provinces of Iran were then used to determine the clusters of the disease. The spatial unit of these statistics is the township.
The results show that important clusters of leukemia can be found in the central part of the country and in some areas of the north-west.
1. INTRODUCTION
Welfare conditions and disease distribution are closely related to the geography of an area and follow the spatial distribution of other related phenomena. This is why GIS is used for the study and analysis of diseases, their causes and their relations with other spatial phenomena.
One of the most important and popular applications of GIS in welfare management is to detect and visualize the spatial clusters of the occurrences of different diseases. This can be achieved using the combination of statistical and spatial analysis capabilities of GIS. Different methods of clustering used for different types of diseases apply suitable filters on the data of the occurrences of the disease, in order to distinguish the normal and unimportant clusters from the meaningful and warning clusters.
Methods and activities related to the detection of spatial disease patterns have changed with the evolution of GIS and its application in epidemiologic studies. In 1970, clustering methods for point data were used together with GIS for the first time, to study childhood leukemia occurrences near the nuclear facilities of England. Many of the shortcomings of traditional methods were related to the limited data processing capabilities of earlier computers. Such shortcomings have mostly been overcome by continuous improvements in both the storage and processing capabilities of today's computers.
Depending on the disease type and the methods used to register disease occurrences, a variety of methods for clustering and visualization of diseases have been developed and used. They can be categorized in two groups: those used for point-based data (coordinates of disease occurrences) and those used for polygon-based data (counts of disease occurrences in an area).
2. CLUSTER DETECTION ANALYSIS IN POLYGON DATA
2.1 Choropleth Maps
Cromley (2002) states that the most popular and easiest method for analyzing the spatial clusters of a disease is to generate so-called choropleth maps of the disease occurrences. Such maps show the incidence and prevalence of a disease. The most important point to consider is the distribution and density of population in the study area. If population density differs strongly between parts of the area, then the reliability of the disease cluster estimates in those parts also differs. Therefore, it is important to find a way to evaluate and judge all polygons equally and consistently.
As presented in the choropleth map of Figure 1, the effect of population is handled by calculating the number of deaths per 100,000 population. Although this method is easy and fast, it does not provide an accurate and reliable criterion for judging the prevalence of the disease, because it offers no way to distinguish real clusters from unreal (insignificant) ones.

Figure 1- The choropleth map of leukemia in 18 provinces of Iran
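The normalization behind such a choropleth map can be sketched in a few lines of Python. This is only an illustration: the township names and figures below are hypothetical and are not taken from the data of this study.

```python
def rate_per_100k(deaths, population):
    """Crude death rate per 100,000 population for one township."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000

# Hypothetical townships: name -> (deaths, population). Not data from this study.
townships = {"A": (3, 150_000), "B": (1, 20_000), "C": (12, 900_000)}

rates = {name: rate_per_100k(d, p) for name, (d, p) in townships.items()}
highest = max(rates, key=rates.get)  # township with the highest crude rate
```

In this made-up example the township with the smallest population has the highest rate although it reports a single death, which is the kind of unreliable judgment that a plain rate map cannot rule out.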
2.2 Probability Mapping
Probability mapping is a statistical method in which a significance level is mapped instead of the raw disease counts. A goodness-of-fit test can be used to decide whether to accept or reject a cluster. The significance level is defined on the basis of a probability value (Cromley, 2002). A variety of statistical methods are available for calculating this probability; the most popular is to model the disease with a statistical distribution function, chosen according to the statistical characteristics of the disease. For example, the Poisson distribution function is an appropriate model for cancer because, as mentioned by Larson (1992), it can be used when disease occurrences are rare and each event can take only the values 0 and 1, i.e. the disease is either present or absent.
The problem of defining the border between significant and insignificant clusters in polygon-based data can be solved using probability maps. However, as noted by Cromley (2002), this method suffers from two limitations:
- Instead of showing the rate of disease distribution, these maps represent the probability of the disease occurrences, which is obtained from statistical calculations.
- Probability maps depend on the selected significance level, and the significance level in turn depends on the size of the sample set. The bigger the sample size, the more certain and reliable the conclusions are.
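As a minimal sketch of the idea, the following Python code computes, for each polygon, the Poisson probability of observing at least the reported number of deaths. It assumes that the expected count of a polygon is the overall rate multiplied by its population; this choice, the function names and the sample figures are illustrative assumptions, not details given in the cited sources.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    return 1.0 - sum(poisson_pmf(k, expected) for k in range(observed))

# Hypothetical townships: name -> (deaths, population). Not data from this study.
townships = {"A": (3, 150_000), "B": (1, 20_000), "C": (12, 300_000)}

total_deaths = sum(d for d, _ in townships.values())
total_pop = sum(p for _, p in townships.values())
overall_rate = total_deaths / total_pop  # assumed reference rate

p_values = {}
for name, (deaths, pop) in townships.items():
    expected = overall_rate * pop  # deaths expected if the township were typical
    p_values[name] = poisson_upper_tail(deaths, expected)
```

A small probability indicates a count that is unlikely under the reference rate; mapping these probabilities instead of the raw rates gives the probability map.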
3. ANALYSIS METHODS FOR POINT DATA
Different methods are available for the detection of spatial clusters in point-based data, among which a few popular methods will be discussed here. These methods can be classified in two groups:
- methods that represent the clusters in the whole study area
- methods of concentration test, that represent the cluster and concentration of the phenomena around a point source.
Methods of the first group are better suited to implementation with the analytical and visualization capabilities of GIS. Therefore, only the main method of this first group is presented here.
3.1 Spatial Clustering Using Geographical Analysis Machine
The most frequently used method for spatial clustering of point-based data is the Geographical Analysis Machine (GAM). The first stage of the GAM method is to overlay a grid of points on the study area. Around every grid point, circles with different radii are drawn. For every point, the minimum radius is selected such that the neighboring grid points are covered by the circle (Figure 2).

Figure 2- Grid points and their related circles in GAM method
In the next stage, it is determined which of the circles contain a meaningful concentration of the disease. To do this, a suitable statistical distribution function is fitted to the population according to the type and prevalence rate of the disease. Statistical tests are then used to judge the disease prevalence. In other words, for the area inside each circle, the real count of disease is compared with the expected count (predicted by the statistical function on the basis of population density). Circles whose real counts are high in comparison with the predicted counts are drawn on the map. This procedure is repeated for all grid points in the study area.
This method has a few limitations. The result is a collection of circles in the study area that can only be interpreted visually, and there is no statistical analysis to account for the concentration of circles in different parts of the area. In addition, because the areas covered by the circles overlap, the statistical tests that evaluate the disease in the individual circles are not independent. Figure 3 shows the spatial clustering of leukemia around the nuclear facilities of England, generated by this method.
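The procedure can be illustrated with a strongly simplified Python sketch. It is not the original GAM implementation: the Poisson test, the expected count taken as overall rate times the population inside the circle, the user-supplied radii, and the tuple layout of the input are all assumptions made for illustration, and the sample data are invented.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below

def gam_circles(units, grid_step, radii, alpha=0.05):
    """Return circles whose observed count is significantly above expectation.

    units: list of (x, y, population, cases) tuples (hypothetical layout).
    """
    overall_rate = sum(u[3] for u in units) / sum(u[2] for u in units)
    xs = [u[0] for u in units]
    ys = [u[1] for u in units]
    nx = int((max(xs) - min(xs)) / grid_step) + 1
    ny = int((max(ys) - min(ys)) / grid_step) + 1
    flagged = []
    for i in range(nx):
        for j in range(ny):
            cx = min(xs) + i * grid_step
            cy = min(ys) + j * grid_step
            for r in radii:
                inside = [u for u in units
                          if math.hypot(u[0] - cx, u[1] - cy) <= r]
                population = sum(u[2] for u in inside)
                if population == 0:
                    continue  # empty circle, nothing to test
                observed = sum(u[3] for u in inside)
                expected = overall_rate * population
                p = poisson_upper_tail(observed, expected)
                if p < alpha:
                    flagged.append((cx, cy, r, observed, expected, p))
    return flagged

# Invented population points: (x, y, population, cases).
units = [(0, 0, 1000, 1), (10, 0, 1000, 1), (0, 10, 1000, 1), (10, 10, 1000, 12)]
flagged = gam_circles(units, grid_step=5, radii=[3])
```

Each entry of `flagged` corresponds to one circle that would be drawn on the map. Because neighboring circles share population, the individual tests in this sketch are not independent either.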

Figure 3- Application of GAM method for clustering of leukemia cancer in England (Cromley, 2002)
4. DETECTION OF LEUKEMIA CLUSTERS IN IRAN
4.1 Data and Method Used for Clustering
As part of a disease control program, statistics of the deaths in 2001, along with data about age, gender and cause of death, were collected for 18 provinces of the country. Because of the limitations of the program, the township was selected as the spatial unit. This means that, for each township as a whole, the total number of deaths is registered for each combination of age class, gender and cause of death. Therefore, the most important aspect of the analysis, the spatiality of the diseases, is limited by the selection of a large spatial unit.
Since the death reports relate to the whole township, point-based clustering methods cannot be used. Among the polygon-based methods, probability mapping was selected because of its advantages, as discussed in Section 2.2.
4.2 Selection of Statistical Distribution Function
An important part of probability mapping is the selection of a suitable distribution function, which should be based on the type of the disease and the data collected. The Poisson distribution function is the most appropriate choice here, for two reasons. First, the function is discrete, as are the occurrences of cancer. Second, the Poisson probability of high values is low, and the disease is likewise generally rare: the probability of a large number of cancer cases in a normal population is low.
After the distribution function has been selected, its appropriateness should be evaluated statistically. This was carried out using the Kolmogorov-Smirnov test, which determines how well the real cancer data fit the Poisson function. The hypothesis of the test was that the real cancer data fit the Poisson function. This hypothesis was accepted with a probability of 93%.
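The following Python sketch shows one way such a goodness-of-fit check could be set up. It is an assumption-laden illustration, not the procedure used in this study: the Poisson mean is estimated from the sample, the counts are invented, and the critical value 1.36/sqrt(n) is the usual large-sample 5% value for a continuous, fully specified distribution, so for discrete data with an estimated mean it is only approximate.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))

def ks_statistic_poisson(counts):
    """Largest gap between the empirical CDF and a fitted Poisson CDF."""
    n = len(counts)
    lam = sum(counts) / n  # Poisson mean estimated from the data
    d = 0.0
    for k in range(max(counts) + 1):
        ecdf = sum(1 for c in counts if c <= k) / n
        d = max(d, abs(ecdf - poisson_cdf(k, lam)))
    return d, lam

# Invented per-township death counts, for illustration only.
counts = [0, 1, 0, 2, 1, 0, 3, 1, 0, 1, 2, 0]
d, lam = ks_statistic_poisson(counts)
critical = 1.36 / math.sqrt(len(counts))  # approximate 5% critical value
fits_poisson = d < critical
```

If the statistic stays below the critical value, the hypothesis that the counts follow the Poisson function is not rejected.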
4.3 Calculation and Visualization of Disease Clusters
In the next stage, the data of each township were compared with the fitted statistical population of leukemia occurrences, modeled by the Poisson distribution function. In other words, the probability that each township belongs to this population was calculated. Townships whose membership probability falls below the significance level of 0.05 (α = 0.05) were selected as townships with meaningfully high rates of the disease.
In order to remove the effect of population, in the whole process, the data are normalized to the population.
Next, the clusters and spatial distribution of the disease were represented using the visualization tools and concepts of GIS. Figure 4 shows the result of this analysis. Townships with α ≤ 0.05 are represented as having a ‘high’ risk of leukemia. To provide finer separation in the result, townships with 0.05 < α ≤ 0.07 are classified separately as the ‘moderate’ risk class. All other townships are considered as having a ‘low’ risk of leukemia.

Figure 4- The result of probability mapping for leukemia in Iran
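The three risk classes of Figure 4 amount to a simple thresholding of the township probabilities, which can be sketched in Python as follows. The function name and the sample values are hypothetical, and the treatment of values lying exactly on a class boundary is an assumption.

```python
def risk_class(p_value, high=0.05, moderate=0.07):
    """Map a township probability to the risk class shown on the map."""
    if p_value <= high:
        return "high"
    if p_value <= moderate:
        return "moderate"
    return "low"

# Invented probabilities for three townships, for illustration only.
sample = {"T1": 0.01, "T2": 0.06, "T3": 0.40}
classes = {name: risk_class(p) for name, p in sample.items()}
```

The resulting class label per township is the attribute that a GIS would use to symbolize the polygons.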
5. RESULTS, CONCLUSIONS AND RECOMMENDATIONS
The results show that the occurrences of leukemia have a statistical distribution that fits the Poisson function with a probability of 93%. As shown in Figure 4, leukemia clearly has a spatial pattern, and three main clusters can be marked in the middle and north-west of the country. This provides managers with the first level of information needed to allocate their efforts for the prevention and treatment of the disease.
Similar methods can be used to judge the equitable allocation of physicians, hospitals, equipment, etc. on the basis of population or disease distribution. In general, GIS can be used effectively for the storage and retrieval of welfare-related spatial data, which is the first requirement for both monitoring and management of public welfare. Moreover, there is a wide variety of spatial analyses, including monitoring, modeling, simulation and prediction, which are of great use in public welfare and could not be carried out without GIS.
The data used in this research were originally collected for a different purpose. As a result, the spatial unit was too large and was not appropriate for our objective. If the spatial unit were smaller, the results would be more realistic. Besides, the data include only death reports, which relate to the township where the death was reported and not to the place where the patient lived. In other words, the place of living is sometimes different from the place of treatment or death.
There is another serious problem with the data used: the reports on which the data are based do not include occurrences of the disease that were cured. In general, if the data covered all occurrences of the disease, whether cured or not, and the spatial unit were population centers such as towns and villages represented as points, then the more advanced spatial analysis capabilities of GIS could be used more effectively to calculate and visualize the clusters and spatial pattern of the disease.
References
- Larson, H.J., 1992, Introduction to Probability Theory and Statistical Inference, 2nd edition, (New York: John Wiley and Sons).
- Mood, A.M., Graybill, F.A., and Boes, D.C., 1974, Introduction to the Theory of Statistics, 3rd edition, (New York: McGraw-Hill Book Company).
- Burrough, P.A., 1986, Geographic Information Systems for Natural Resources Assessment, 2nd edition, (New York: Oxford University Press).
- Cromley, E.K., and McLafferty, S.L., 2002, GIS and Public Health, 1st edition, (New York: The Guilford Press).
- Elliott, P., Wakefield, J.C., Best, N.G., and Briggs, D.J., 2000, Spatial Epidemiology: Methods and Applications, 1st edition, (New York: Oxford University Press).
- Melnick, A.L., 2002, Introduction to Geographical Information Systems in Public Health, (Gaithersburg: Jones and Bartlett Publishers).