
Natural language processing to query a Geographic Information System (India) knowledgebase

Mukesh Kumar Rohil
Information Systems Group, Faculty Division - I
Birla Institute of Technology and Science
Pilani - 333 031 (Rajasthan)
rohil@bits-pilani.ac.in


Abstract:  
This paper presents a system that successfully answers a user's simple, English-like queries about the geography of India. The user is not required to know the logical structure of the underlying data, because the software is implemented using Logic Programming constructs and uses a knowledgebase rather than a database. A user-interface system is developed that accepts queries keyed in as natural English. It interprets the meaning of the question asked and then analyzes and searches the knowledgebase to answer it, reducing the search space using artificial intelligence techniques. If the system does not understand a query, it reports to the user the words that are not available in the knowledgebase, as well as the relations between entities that it cannot currently answer. The knowledgebase, of about 32 K, stores the following geographical entities and their inter-relationships, all related and limited to the Indian peninsula:
  • State with capital, area, population, growth-rate and ratio of female to male.
  • Name, area and population of districts of states.
  • Location of major cities along with PIN, STD code, population, growth-rate, percentage of literacy.
  • Length, origin and name of rivers along with the list of states and countries through which these rivers flow.
  • Name and area of lakes, rivers causing the lake if it is not artificial, state in which the lakes are located.
  • National boundary of states and boundary of states common with international boundary.
  • Highest and lowest points in state and mountains of country.
  • Some roads (national highways only), the list of cities along them and distances between these cities.

Introduction
Geographic data are being acquired, stored and processed for analysis and decision making to an extent never before possible, using satellite imaging, low-cost and reliable high-memory digital computers, and highly specialized software called Geographical Information Systems (GIS).

A GIS stores both geographically referenced spatial data and attribute data about entities and their relationships. The attribute data are widely stored using data models such as RDBMS and object-oriented extensions to RDBMS. To query these data, users are provided with a high-level query language, SQL, and some object-oriented flavors of SQL. But none of these languages is close to a natural language like English: they still require a particular set of constructs for structuring queries, and the user still needs to know the structure of the database at the logical level. Moreover, the user interface of most GIS software expects a high degree of skill in fields such as Geography, Image Processing, Remote Sensing, Computer Graphics, Color Models, File Formats and computer fundamentals. So it is always desirable to have an easy-to-use interface such as a natural language interface [1].

Developing programs that understand a natural language is a difficult problem. Natural languages are large and contain an infinite number of different sentences. There is also much ambiguity in a natural language, because many words have several meanings (such as can, bear, fly and orange) and sentences can have different meanings in different contexts. This makes the creation of programs that "understand" a natural language one of the most challenging tasks in the design of user interfaces. It requires that a program transform sentences occurring as part of a dialog into data structures that convey the intended meaning of the sentences to a reasoning program. In general, this means that the reasoning program must know a lot about the structure of the language, the possible semantics, the beliefs and goals of the user, and a great deal of general world knowledge [2, 3]. We say a program understands a natural language if it takes a (predictably) correct or acceptable action in response to the input sentence.

The component forms of knowledge needed for an understanding of natural language are classified according to the following levels [3]:
  1. Phonological: the knowledge that relates sounds to the words we recognize.
  2. Morphological: lexical knowledge about how words are constructed from basic units called morphemes (the smallest units of meaning); e.g. the root "remote" and the suffix "ly" make the word "remotely".
  3. Syntactic: knowledge of how words are put together or structured to form grammatically correct sentences in the language.
  4. Semantic: knowledge concerned with the meanings of words and phrases and how they combine to form the meaning of a sentence.
  5. Pragmatic: high-level knowledge that relates to the use of sentences in different contexts and how the context affects the meaning of a sentence.
  6. World: the knowledge of the world that a language user must have in order to understand and carry on a conversation.

Methodology
A natural language interface is usually designed as a domain-specific rather than a general-purpose user interface. A domain-specific natural language interface uses the entities in the system, the attributes of those entities and the relations defined for the system. For example, in the present implementation the schema (the description of the logical structure of the knowledgebase) is the "entity network" for the language. A schema entry follows the form "Entity Association Entity", which signifies that the two entities are bound together by the given association. Some examples are "district in state", "area of lake", "rivers in state" and "roads of state" [3]. The knowledgebase is developed using a logic programming construct, first-order Horn clauses, as described in [4, 5].
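As an illustration of the "Entity Association Entity" form, the following is a minimal sketch in Python rather than in the Prolog Horn clauses the system actually uses. The four schema entries come from the examples above (entity names written in the singular); the names SCHEMA, is_valid_link and associations_between are hypothetical and do not come from the original software.

```python
# Hypothetical schema: (entity, association, entity) triples that mirror
# the "Entity Association Entity" form. Entries follow the paper's examples.
SCHEMA = {
    ("district", "in", "state"),
    ("area", "of", "lake"),
    ("river", "in", "state"),
    ("road", "of", "state"),
}


def is_valid_link(left, association, right):
    """Return True if the schema binds the two entities by the association."""
    return (left, association, right) in SCHEMA


def associations_between(left, right):
    """Return, in sorted order, every association defined from left to right."""
    return sorted(a for (l, a, r) in SCHEMA if l == left and r == right)


if __name__ == "__main__":
    print(is_valid_link("district", "in", "state"))
    print(associations_between("area", "lake"))
```

A phrase whose entities and association do not match any schema entry, such as "lake in district" here, can be rejected before any search of the knowledgebase is attempted.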

The query processor is designed using the following classes of words [1, 3, 4]:
  1. Relational operators: "under", "less", "shorter" and "smaller" specify less than (<); "over", "more", "longer", "greater", "bigger" and "above" specify more than (>); and "same" specifies equal to (=).
  2. Associations: in (in, running through, runs through, passing through, of), with (traversed, traversed by, has, have), border (limit, boundary, frontier, borders, borders on) and of (in).
  3. Minimum: smallest, shortest, least, lowest, minimum.
  4. Maximum: greatest, largest, biggest, most, longest, highest, maximum.
  5. Synonyms: town and city; point and places; population, people, citizens and inhabitants; and possibly others.
  6. Size of entities: river long, city large, mountain high, lake big etc.
  7. Units of entities: kilometers, square kilometers, females per thousand males, citizens etc.
  8. Words to ignore: the, an, a, list, all, some etc.
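One way to picture how such word classes might be applied is to map each query word to a canonical token before parsing. The sketch below is in Python, not the system's Prolog. The word lists are those given above, while the table layout, the canonical tokens ("min", "max", "city", "population", "point") and the function name normalize are assumptions made for illustration.

```python
# Word lists are taken from the vocabulary above; the canonical tokens
# chosen for each class are illustrative assumptions.
RELATIONAL = {
    "<": ["under", "less", "shorter", "smaller"],
    ">": ["over", "more", "longer", "greater", "bigger", "above"],
    "=": ["same"],
}
MINIMUM = ["smallest", "shortest", "least", "lowest", "minimum"]
MAXIMUM = ["greatest", "largest", "biggest", "most", "longest",
           "highest", "maximum"]
SYNONYMS = {
    "town": "city",
    "places": "point",
    "people": "population",
    "citizens": "population",
    "inhabitants": "population",
}
IGNORE = {"the", "an", "a", "list", "all", "some"}

# Build one lookup table from surface word to canonical token.
CANONICAL = {}
for symbol, words in RELATIONAL.items():
    for word in words:
        CANONICAL[word] = symbol
for word in MINIMUM:
    CANONICAL[word] = "min"
for word in MAXIMUM:
    CANONICAL[word] = "max"
CANONICAL.update(SYNONYMS)


def normalize(query):
    """Lower-case the query, drop ignorable words, map the rest to tokens."""
    words = query.lower().rstrip(".?").split()
    return [CANONICAL.get(w, w) for w in words if w not in IGNORE]


if __name__ == "__main__":
    print(normalize("list the population of the largest town."))
```

Words that appear in none of the tables pass through unchanged, so entity names such as "hyderabad" survive normalization.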

The syntactic structure of a sentence is determined by parsing. To detect errors, once the input string has been converted to words, each word in the list is checked against the domain-specific language knowledgebase to see whether it is known. The unknown words are collected into a list, which the system reports to the user.
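The unknown-word check described above can be sketched as follows, in Python rather than the system's Prolog. The lexicon KNOWN_WORDS is a tiny hypothetical stand-in for the language knowledgebase, and the function names tokenize and unknown_words are assumptions, not names from the original software.

```python
import re

# Hypothetical, deliberately tiny lexicon standing in for the
# domain-specific language knowledgebase.
KNOWN_WORDS = {
    "area", "of", "the", "lakes", "largest", "smallest",
    "state", "which", "borders",
}


def tokenize(query):
    """Convert the input string to a list of lower-case words."""
    return re.findall(r"[a-z0-9]+", query.lower())


def unknown_words(query, lexicon=KNOWN_WORDS):
    """Collect, in order of first appearance, the words not in the lexicon."""
    unknown = []
    for word in tokenize(query):
        if word not in lexicon and word not in unknown:
            unknown.append(word)
    return unknown


if __name__ == "__main__":
    print(unknown_words("Area of the volcanoes of the largest galaxy"))
```

An empty result means every word is known and the query can be passed on to the parser; otherwise the collected list is what would be reported back to the user.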

Results and Conclusion
Software has been designed and developed to support a natural language interface to a GIS (India). The system displays correct results for queries stated in simple English without logical constructs such as "and", "or" and "not". Sample queries for which the system has reported correct results are:
  1. districts of the state which borders the state with capital hyderabad.
  2. population of the districts of the state which borders the state with capital hyderabad.
  3. area of the lakes of the largest state which borders the smallest state.
  4. length of the rivers of the state with district kanpur.
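
The nesting in the first sample query can be read from the innermost phrase outwards: find the state with capital hyderabad, then the states bordering it, then the districts of those states. The Python sketch below illustrates that reading only; the real system is written in Prolog and its evaluation strategy is not described here. Apart from "hyderabad", every fact is a placeholder (state_a, district_b1 and so on) rather than real geographic data or the contents of the actual knowledgebase, and all function names are hypothetical.

```python
# Placeholder facts for illustration only; they are NOT real geographic
# data and do not come from the paper's knowledgebase.
CAPITAL = {"state_a": "hyderabad", "state_b": "city_b", "state_c": "city_c"}
BORDERS = {("state_a", "state_b"), ("state_a", "state_c")}  # symmetric pairs
DISTRICTS = {
    "state_a": ["district_a1"],
    "state_b": ["district_b1", "district_b2"],
    "state_c": ["district_c1"],
}


def states_with_capital(capital):
    """Innermost phrase: 'the state with capital <capital>'."""
    return {s for s, c in CAPITAL.items() if c == capital}


def states_bordering(states):
    """Middle phrase: 'the state which borders <states>'."""
    result = set()
    for a, b in BORDERS:
        if a in states:
            result.add(b)
        if b in states:
            result.add(a)
    return result


def districts_of(states):
    """Outermost phrase: 'districts of <states>', in sorted order."""
    return sorted(d for s in states for d in DISTRICTS.get(s, []))


def answer_sample_query_1():
    """Evaluate the nested query step by step, innermost phrase first."""
    inner = states_with_capital("hyderabad")
    neighbours = states_bordering(inner)
    return districts_of(neighbours)


if __name__ == "__main__":
    print(answer_sample_query_1())
```

Each step passes only a small set of candidate states to the next, so the districts of unrelated states are never examined.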

Scope for further development
The system neither recognizes the logical constructs (and, or, not) for combining sentences, nor displays graphical output to the desired level. The system can be enhanced to incorporate these features.

References
  1. Silberschatz Abraham, Korth Henry F. & Sudarshan S., 1997: 275-289, 381-434, 710-719, Database System Concepts, Singapore: McGraw Hill Book Company
  2. Rich Elaine & Knight Kevin, 1991, Artificial Intelligence, New Delhi: Tata McGraw Hill Publishing Company
  3. Patterson Dan W., 1992: 225-265, Introduction to Artificial Intelligence and Expert Systems, New Delhi: Prentice Hall of India Private Limited
  4. Steel Brian D., 1998, Programming Guide, LPA Prolog, London: Logic Programming Associates Limited
  5. Clocksin W. F. & Mellish C. S., 1989, Programming in PROLOG, New Delhi: Narosa Publishing House
© GISdevelopment.net. All rights reserved.