Data Mining Of Geospatial Database For Agriculture Related Application


C.Kiran Mai
Asso Professor
VNRVJ Institute of Engg. & Technology
Hyderabad, India


I V Murali Krishna
Asso Professor
JNT University,
Hyderabad, India


A.Venugopal Reddy
Asso Professor
Osmania University
Hyderabad, India.


ABSTRACT
Data mining, or Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. It is the process of identifying valid, potentially useful, and ultimately understandable structure in data. The study reported in this paper is concerned primarily with thematic information related to agriculture that also has spatial attributes. Spatial data mining methods can be applied to extract interesting and regular knowledge from large spatial databases. This study aims at discerning trends in agricultural production with reference to the availability of inputs. The Predicted and Real vs. Counter graph illustrates how closely the PolyAnalyst prediction follows the actual value of an attribute over the range of the dataset. By applying data mining techniques to agriculture, the target for different food grains can be achieved, and the techniques help the decision maker to meet that goal. The study demonstrates the scope for applying spatial mining tools to a utility study and analysis. The specific application of PolyAnalyst gave clear scope for evaluating and comparing predicted and real values.

INTRODUCTION
India is a predominantly agriculture-based country, with more than two thirds of its population living in rural areas where agriculture is the main occupation. A major task in agricultural production is water and fertilizer management. Excessive fertilizer application not only wastes the limited financial resources of poor farmers but also pollutes the environment. Deficient application, however, may limit crop growth. Because the agricultural field environment (e.g. soil, climate, terrain) varies widely in space, spatial data plays a very important role in identifying the issues critical for crop growth management. Agricultural operations are closely connected with natural resources that have an obvious spatial character, and handling spatial character is the essential feature of Geographic Information Systems (GIS). Spatial data therefore has an important function in agricultural production, especially in field irrigation and fertilizer application.

IMPORTANCE OF SPATIAL DATA MINING
Data mining, or Knowledge Discovery in Databases, encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. The study reported in this paper is concerned primarily with thematic information related to agriculture that also has spatial attributes.

Spatial data mining, i.e., mining knowledge from large amounts of spatial data, is a highly demanding field because huge amounts of spatial data have been collected in various applications, ranging from remote sensing to geographical information systems (GIS), computer cartography, and environmental assessment and planning [2]. Recent studies on data mining have extended its scope from relational and transactional databases to spatial databases. Spatial data carries topological and/or distance information, and it is often organized by spatial indexing structures and accessed by spatial access methods. These distinct features of a spatial database pose challenges and bring opportunities for mining information from spatial data. In this study the database has been mined using the PolyAnalyst software package.

POLYANALYST
PolyAnalyst [8] is a multi-strategy data mining system that implements a broad variety of mutually complementing methods for automated data analysis. It works with data extracted from flat files or relational databases, and can handle numerical (floating-point), integer, yes/no (binary), date, and discrete (categorical or string) variables. Using this data and PolyAnalyst's suite of analytical algorithms, relationships in the data are discovered, predictions are made, and the data is classified and organized. The study reported in this paper aims at rigorous analysis of the geospatial data derived from thematic maps, integration of this data with non-spatial data, and application of data mining tools to obtain specific information related to agriculture. Its specific data mining goal is to discern trends in agricultural production with reference to the availability of inputs.

STUDY AREA:
Andhra Pradesh is India's fifth largest state, with an approximate area of 277,000 sq km and a 1,000 km coastline on the Bay of Bengal. The altitude of the land varies from sea level along the coastal plains to 1,500 m in the Eastern Ghats. The mean annual rainfall ranges from 500 mm to 1,000 mm and the mean annual temperature from 12°C to 42°C. Four types of vegetation zones, Tropical Dry Deciduous, Tropical Moist Deciduous, Dry Tropical Thorn and Littoral, and Dry Evergreen, are found in the state. Agriculture is the main occupation, and 70 percent of the population is engaged in agriculture and related activities. Rice is the major food crop and staple food of the state. Forest covers 23 percent of the state's area. The details of land utilization are given in Table 1.

Table 1: Land utilization particulars of A.P from 1991 to 2001


REQUIREMENT ANALYSIS:
A requirement is a feature that includes a method of capturing or processing data, producing information, controlling an activity, or supporting management. Data mining techniques help to describe the data and to predict future trends or variations. As applied to agriculture in this study, these techniques can help decision makers to suggest methods for improving agricultural productivity or to identify the factors responsible for its increase or decrease.

In general, data mining techniques can be divided into two broad categories, viz. discovery data mining and predictive data mining. Both are used in this study. The results provide a wealth of information that can be difficult to interpret. Interpretation and deployment therefore often require assistance from a domain expert who can translate the mining results back into an understanding of the application. To assist in this process, a range of tools is available that enable visualization of the results and provide the statistical information needed for their interpretation. Progress can be achieved by improving agricultural techniques, by using improved seed and crop varieties, and by converting unutilized land to utilized land. The database schema is given in Table 2.

Table 2: Attributes of database


Year wise: the year for which the mining is being attempted, say 2002-03.
District wise: the district name in the state of Andhra Pradesh.
Unutilisedland: cultivated land, in hectares, which is not utilized.
Supplyof_water: the water supply to each district, in percentage.
Demand_of_water: the demand of water for each district, in percentage.
Supplyof_fertiliser: the percentage of fertilizer supplied to each district.
Demandof_fertiliser: the demand of fertilizer.
Supplyof_pesticides: the percentage of pesticides supplied.
Demandof_pesticides: the demand for pesticides.
Seed Quality: the quality of seeds used; it is of Boolean type.
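
The attribute list above can be pictured as one record per district and year. The following is a minimal sketch only: the class name, the field names, and the example values are illustrative assumptions and are not taken from the PolyAnalyst project or from the study's data.

```python
from dataclasses import dataclass


@dataclass
class DistrictYearRecord:
    """One row of the dataset (hypothetical Python rendering of Table 2)."""
    year: str                        # e.g. "2002-03"
    district: str                    # district name in Andhra Pradesh
    unutilised_land_ha: float        # cultivated land not utilized, in hectares
    supply_of_water_pct: float       # water supply, in percentage
    demand_of_water_pct: float       # water demand, in percentage
    supply_of_fertiliser_pct: float  # fertilizer supply, in percentage
    demand_of_fertiliser_pct: float  # fertilizer demand (the rules treat it as a percentage)
    supply_of_pesticides_pct: float  # pesticides supplied, in percentage
    demand_of_pesticides: float      # pesticide demand (unit not stated in the text)
    seed_quality: bool               # Boolean seed-quality flag


# Hypothetical values, for illustration only.
example = DistrictYearRecord(
    "2002-03", "District A", 700.0, 30.0, 32.0, 45.0, 40.0, 20.0, 20.0, True
)
```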
A set of rules was formed for the evaluation of trends, as given below:
water_demand:-
if( "Supplyof_water(in%)" >= "Demand_of_water(in%)" ,~1 ,~0 )
This rule acts as a derived attribute in the world dataset; it checks whether the supply of water is greater than or equal to the demand of water.


Figure 1: District Wise Demand of Water



Figure 2 : 2-D Chart by Supply of Water (%) Vs Demand of Water in 2000-2003


fertiliser_demand:-
if( "Supplyof_Fertiliser(in%)" >= "Demandof_Fertilisers(in%)" ,~1 ,~0 )

This rule acts as a derived attribute in the world dataset; it checks whether the supply of fertilizer is greater than or equal to the demand of fertilizer.

pesticides_demand:-
if (Supplyof_pesticides >= Demandof_pesticides, ~1, ~0)

This rule acts as a derived attribute in the world dataset; it checks whether the supply of pesticides is greater than or equal to the demand of pesticides.

all_conditions:-
if ( "Supplyof_water(in%)" >= "Demand_of_water(in%)" and
"Supplyof_Fertiliser(in%)" >= "Demandof_Fertilisers(in%)" and
Supplyof_pesticides >= Demandof_pesticides, ~1, ~0)

This rule acts as a derived attribute in the world dataset; it checks whether all three supply conditions are satisfied together.
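
The four rules above can be mirrored outside PolyAnalyst as a small sketch. It assumes that `~1` and `~0` stand for the flag values 1 and 0; the dictionary keys and the sample record are hypothetical and are not taken from the study's data.

```python
def supply_meets_demand(supply, demand):
    """Return 1 if supply >= demand, else 0 (assumed reading of if(..., ~1, ~0))."""
    return 1 if supply >= demand else 0


def derive_flags(record):
    """Compute the four derived rule attributes for one record (a dict)."""
    water = supply_meets_demand(record["supply_water"], record["demand_water"])
    fertiliser = supply_meets_demand(
        record["supply_fertiliser"], record["demand_fertiliser"]
    )
    pesticides = supply_meets_demand(
        record["supply_pesticides"], record["demand_pesticides"]
    )
    return {
        "water_demand": water,
        "fertiliser_demand": fertiliser,
        "pesticides_demand": pesticides,
        # all_conditions is 1 only when every individual flag is 1
        "all_conditions": 1 if (water and fertiliser and pesticides) else 0,
    }


# Hypothetical record: water supply falls short, the other inputs are met.
sample = {
    "supply_water": 30.0, "demand_water": 32.0,
    "supply_fertiliser": 45.0, "demand_fertiliser": 40.0,
    "supply_pesticides": 20.0, "demand_pesticides": 20.0,
}
flags = derive_flags(sample)
```

Keeping the comparison in one helper makes the "greater than or equal" semantics of all four rules explicit in a single place.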

REPORTS:-
Reports are the main asset of any analysis, as they give valuable information regarding the goal. The data mining techniques generate the following reports:
Summary Statistics
Linear Regression
Decision Tree
Find Dependencies
Nearest Neighbour

SUMMARY STATISTICS:
The Summary Statistics exploration engine provides basic statistics about the data, including means, standard deviations, and frequencies. In addition, the Summary Statistics report includes frequency charts for each category, string, and yes/no variable. Typical examples are given in Figure 3 and Figure 4, which show the statistics pertaining to demand and supply of water for irrigation in selected districts.
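
The kind of output described above (means, standard deviations, and frequencies) can be reproduced with the Python standard library. This is a sketch under assumed inputs; the values are hypothetical and the function names are not part of PolyAnalyst.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_numeric(values):
    """Basic statistics for a numerical attribute (sample standard deviation)."""
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def frequencies(values):
    """Frequency count for a categorical, string, or yes/no attribute."""
    return dict(Counter(values))


# Hypothetical values, for illustration only.
demand_of_water = [32.0, 28.0, 35.0, 30.0, 25.0]
seed_quality = [True, True, False, True, False]

water_summary = summarize_numeric(demand_of_water)
seed_counts = frequencies(seed_quality)
```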


Figure 3: Frequencies Chart giving statistics report of Demand of Water



Figure 4: Frequencies Chart giving statistics report of Supply of Water


LINEAR REGRESSION:
The Predicted vs. Real graph shows that the real percentage demand of water does not agree closely with the predicted value. In Figure 5, the points for a well-predicted demand of water would lie near the central diagonal; the observed departure from the diagonal means that the demand of water does not follow the predicted value. The Predicted and Real vs. Counter graph shows the predicted value against the real value with respect to the record number. For record number 1, for example, the predicted value is 34.188 against a real value of 32.


Figure 5: Linear Regression graphs for Demand of Water


LR_Demand_of_water (in %)
The reports give a comparative study of predicted and real values based on the available data.

Predicted vs. Real:
The Predicted vs. Real graph shows all the data points in the dataset being explored, along with where the model would have predicted them to fall. This gives a quick look at the accuracy and predictive power of the model. The real location of each data point is shown along the x-axis, while the predicted location is shown along the y-axis. If the predictive rule were perfect, all the points would lie along the central diagonal; the more accurate the predictive rule, the closer the points are to the central diagonal. In addition, the shape of the distribution of points indicates the nature of the predictive model. For instance, if points are consistently below the central diagonal toward the edges of the chart but right on it toward the middle, the model is predicting values that are too low at the extremes but correct for average values. This may indicate a need to try the same exploration in a different exploration engine: commonly, a curved shape in the Predicted vs. Real graph of a Linear Regression exploration indicates a need to try a Find Laws or PolyNet Predictor exploration instead.
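
The idea behind the Predicted vs. Real graph can be sketched with an ordinary least-squares line fitted to a single input. This is an assumption-laden illustration: the study does not state which inputs its regression used, the supply and demand values below are hypothetical, and the functions are not PolyAnalyst calls.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predicted_vs_real(xs, ys):
    """Return (real, predicted) pairs; a perfect model gives real == predicted."""
    slope, intercept = fit_line(xs, ys)
    return [(y, slope * x + intercept) for x, y in zip(xs, ys)]


def mean_distance_from_diagonal(pairs):
    """Average |real - predicted|; 0 means every point lies on the diagonal."""
    return sum(abs(real - pred) for real, pred in pairs) / len(pairs)


# Hypothetical supply (input) and demand (target) percentages.
supply = [20.0, 30.0, 40.0, 50.0]
demand = [32.0, 30.0, 41.0, 45.0]
pairs = predicted_vs_real(supply, demand)
spread = mean_distance_from_diagonal(pairs)
```

Plotting the pairs with the real value on the x-axis and the predicted value on the y-axis gives the graph described in the text.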

Predicted and Real vs. Counter:
The Predicted and Real vs. Counter graph shows how closely the PolyAnalyst prediction follows the actual value of the attribute over the range of the dataset. The red line represents the PolyAnalyst prediction, while the blue line represents the real values. The record number is plotted along the x-axis. If the predictive rule were perfect, the red line would cover the blue line over the entire graph; where the prediction deviates strongly for certain records, the lines are further apart at that point on the graph. This graph is most informative when representing time series data. The graph is also interactive. Dragging the mouse left and right on the graph controls the position of a vertical green line, which represents the current record number. While the line is moved, the values of the real, predicted, and counter variables appear in the correspondingly colored text boxes on the right side. In addition, a pair of horizontal dashed lines appears, showing the particular values of the real and predicted variables.
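
The record-by-record comparison described above can be tabulated directly. In this sketch only the first record (real 32, predicted 34.188) comes from the text; the remaining values and the function names are hypothetical.

```python
def deviation_by_record(real, predicted):
    """Return (record_number, real, predicted, gap) rows, numbering records from 1."""
    return [
        (number, r, p, abs(r - p))
        for number, (r, p) in enumerate(zip(real, predicted), start=1)
    ]


def largest_gap(real, predicted):
    """Row where the predicted and real lines are furthest apart."""
    return max(deviation_by_record(real, predicted), key=lambda row: row[3])


# Record 1 is the example quoted in the text; the other records are hypothetical.
real_values = [32.0, 30.0, 41.0]
predicted_values = [34.188, 29.5, 36.0]

rows = deviation_by_record(real_values, predicted_values)
worst = largest_gap(real_values, predicted_values)
```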

DECISION TREE:
The Decision Tree algorithm solves the task of classifying cases into multiple categories. It takes a target value and categorizes the dataset according to the attributes on which the target depends. It selects the most influential parameters and derives the dependencies from them, so that the predicted value and the dependencies generated by Find Dependencies match the demand of water.


Figure 6: Decision Tree Algorithm output


In this study the target attribute is utilised land. The decision tree categorises utilised land into less than 695 hectares and greater than 695 hectares. The sub-tree for utilised land greater than 695 hectares analyzes, district by district, the demand and supply of water, fertilizers, and pesticides per year.
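
How a tree chooses a split can be sketched with a single threshold search using Gini impurity. This is a generic illustration and not PolyAnalyst's Decision Tree implementation, whose splitting criterion is not described here. Only the 695-hectare boundary comes from the text; the data values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for a pure or empty group)."""
    if not labels:
        return 0.0
    positive = sum(labels) / len(labels)
    return 2 * positive * (1 - positive)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns None when all values are identical.
    """
    best = None
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical records; the class label is "utilised land greater than 695 ha".
utilised_land = [500.0, 650.0, 700.0, 900.0]
labels = [1 if area > 695 else 0 for area in utilised_land]
supply_of_water = [20.0, 25.0, 40.0, 45.0]

split = best_threshold(supply_of_water, labels)
```

A full tree would repeat this search over every input attribute and then recurse into each branch.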

FIND DEPENDENCIES
The Find Dependencies algorithm discovers the existence of arbitrary, even weak, multidimensional, or non-linear relations in data. It also gleans some statistical properties of the relations found. Find Dependencies solves the following main tasks:
  • Discovers the set of independent variables that most influence the target variable, without determining the exact form of this dependence
  • Sifts out exceptional, far outlying points that do not match the rest of the data
The Find Dependencies exploration engine obtains results quickly and can be used as a preprocessing module for other, more computationally intensive PolyAnalyst engines. This synergy significantly reduces the search time of the Find Laws, PolyNet Predictor, or Memory Based Reasoning algorithms by selecting only a few of the most influential independent variables for further exploration.
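
The two tasks listed above can be illustrated with a deliberately simple stand-in: ranking inputs by absolute Pearson correlation and flagging outliers by z-score. This is not the Find Dependencies algorithm, which also detects non-linear relations; the column names, data values, and the z-score limit are assumptions made for illustration, and the correlation assumes no column is constant.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither list is constant)."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def rank_influence(columns, target):
    """Order input columns from most to least linearly related to the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def outlier_indices(values, z_limit=2.0):
    """Indices of points more than z_limit standard deviations from the mean."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - centre) / spread > z_limit]


# Hypothetical data, for illustration only.
demand_of_water = [10.0, 20.0, 30.0, 40.0, 50.0]
inputs = {
    "supply_of_pesticides": [5.0, 1.0, 4.0, 2.0, 3.0],
    "supply_of_water": [1.0, 2.0, 3.0, 4.0, 5.0],
}
ranking = rank_influence(inputs, demand_of_water)
outliers = outlier_indices([10.0, 11.0, 9.0, 10.0, 10.0, 50.0])
```

Passing only the top-ranked columns to a slower method mirrors the preprocessing role described in the text.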


Figure 7: Predicted Graphs by Find Dependencies for Demand of water


MEMORY BASED REASONING
The PolyAnalyst Memory Based Reasoning (MBR) algorithm performs classification into multiple categories and prediction of numerical values using a combination of the Nearest Neighbor algorithm and a genetic algorithm. It determines the value of a numerical or categorical variable X (the target attribute) for a database record by selecting a set of the most similar records from the training dataset and averaging the values of X in these neighboring records (or choosing the most frequent value of X in the case of a classification problem). The measure of similarity, the number of points in a neighborhood, and other parameters that provide the most exact prediction are selected by a genetic algorithm. Thus, the method can be used to solve two problems:
  1. Selection of the best metric, or measure of similarity, between records.
  2. Prediction of values of numerical or categorical variables using the metric found in step 1 or specified by the user.
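
The nearest-neighbour part of the method can be sketched as follows. The sketch fixes the similarity measure (Euclidean distance) and the neighbourhood size k by hand, whereas the text states that PolyAnalyst selects them with a genetic algorithm; the training records, the query, and the function names are hypothetical.

```python
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_targets(features, targets, query, k):
    """Target values of the k training records closest to the query."""
    order = sorted(range(len(features)), key=lambda i: euclidean(features[i], query))
    return [targets[i] for i in order[:k]]


def knn_predict(features, targets, query, k=3):
    """Numerical prediction: average target of the k nearest records."""
    neighbours = nearest_targets(features, targets, query, k)
    return sum(neighbours) / len(neighbours)


def knn_classify(features, labels, query, k=3):
    """Classification: most frequent label among the k nearest records."""
    neighbours = nearest_targets(features, labels, query, k)
    return Counter(neighbours).most_common(1)[0][0]


# Hypothetical training records: (supply of water %, supply of fertiliser %).
train = [(20.0, 40.0), (25.0, 45.0), (30.0, 50.0), (60.0, 80.0)]
demand = [30.0, 32.0, 34.0, 70.0]
level = ["low", "low", "low", "high"]

estimate = knn_predict(train, demand, (26.0, 46.0), k=3)
category = knn_classify(train, level, (26.0, 46.0), k=3)
```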

Figure 8 : Predicted Graphs by Memory Based Reasoning for Demand of Water


CONCLUSIONS:
By applying data mining techniques to agriculture, the target for different food grains can be achieved, and the techniques help the decision maker to meet that goal. This project helps the Government to direct attention to the agriculture sector. It emphasizes the agricultural inputs: if the demand for inputs is satisfied, the productivity of particular food grains improves.

The study has demonstrated the scope for applying spatial mining tools to a utility study and analysis. The specific application of PolyAnalyst gave clear scope for evaluating and comparing predicted and real values. It also brought out the deficiencies in meeting the demand for agricultural inputs. Although the study at this stage could not deliver results on a turnkey basis, it highlights the utility of the technique and of the software for discerning trends. Any further study on this topic requires a larger database.

REFERENCES

  1. R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant. The Quest Data Mining System. In Proc. of the 2nd Int. Conf. on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
  2. W. Lu, J. Han, and B. C. Ooi. Discovery of General Knowledge in Large Spatial Databases. In Proc. of Far East Workshop on Geographic Information Systems, World Scientific, Singapore, pp. 275-289, 1993.
  3. R. T. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. In Proc. of 1994 Int. Conf. on Very Large Data Bases, Morgan Kaufmann, San Francisco, CA, pp. 144-155, 1994.
  4. K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. In Advances in Spatial Databases, 1995.
  5. M. Ester, H.-P. Kriegel, and J. Sander. Spatial Data Mining: A Database Approach. In Proc. Int. Symp. on Large Spatial Databases (SSD '97), Berlin, Germany, pp. 47-66, 1997.
  6. TERI Energy Data Dictionary and Year Book.
  7. Agricultural Statistics at a Glance. New Delhi: Ministry of Agriculture, Government of India.
  8. http://www.megaputer.com