GISDECO Proceedings (GISdevelopment.net)

Locating socio-economic activities with GIS in Chinese cities

Zhengdong Huang Ian Masser
(hzhd@itc.nl) (masser@itc.nl)
ITC, P.O. Box 6, 7500 AA Enschede, the Netherlands



Abstract
In building urban information systems, it is important to identify the spatial locations of socio-economic phenomena. In Western urban areas, street addresses, street intersections and postcodes have been widely used for locating socio-economic activities. However, these methods may not work equally well in cities of developing countries, partly because the referencing bases (street address and postcode) are inconsistent and incomplete, and partly because addresses are expressed differently. This paper discusses these problems and proposes possible solutions. The referencing systems currently used in different sectors are briefly introduced. Two basic schemes of location referencing, the name-based and the street-based scheme, are considered with respect to their possible applications in China. Taking Wuhan, a metropolitan city in China, as a case study, experiments are carried out to show the feasibility of the proposed methods. To implement the schemes, data consistency and completeness have to be achieved within organizations, and close cooperation is needed among the public agencies that control the base referencing data.

Introduction
In urban areas it is important to identify the spatial locations of socio-economic activities, a process called location referencing. Many location referencing methods have been developed and implemented in different fields, e.g. street address and postcode. The TIGER/Line files developed by the US Census Bureau contain line, landmark and polygon features that provide locational references for field staff and map users (Bureau of the Census, 1997). In urban areas, street address ranges have been widely applied for address matching. The postcode system of the postal service is another important referencing base that has been used for location referencing in travel and activity surveys. From a geographical point of view, postcodes are becoming a widely used and general method of describing the position or location of places, areas or objects on the earth (Raper et al., 1992). A postcode may be designed to indicate a very small area or even an individual property, as in the Postal Address File (PAF) in the UK and ZIP+4 in the US. At a detailed level, postcodes and street addresses are linked with the properties they represent, which provides a more flexible means of location referencing.

The methodologies of location referencing have been discussed from different perspectives. The Geographic Information System (GIS) itself is a spatial referencing system for geographic phenomena. Laurini and Thompson (1992) state that positioning objects in spatial referencing systems involves several considerations, such as the geometric character of the reference system, the measurement metrics, the nature of the origin, and whether references are discrete or continuous. The referencing bases in GIS are generally either a coordinate system or a grid system. In urban applications, more attention has been given to relative referencing methods that can locate socio-economic activities within existing spatial bases. Goodwin et al. (1995) have summarized several location referencing methods, i.e. link ID, linear referencing system, coordinate system, street addresses, cross-street matching, and their combinations. Apart from the coordinate system, these methods require only names or numbers, rather than coordinates, to locate the sites.

In the field of urban transport, location has been regarded as a key to integrating socio-economic data with transport data (e.g. Ries, 1995). A well-established location referencing system may facilitate transport data integration and data sharing. A GIS-T enterprise data model that incorporates transport features, transport events and other related attributes is necessary to meet the needs of transport data sharing (Dueker & Butler, 1998; Dueker & Butler, 2000). The importance of referencing bases is also emphasized in building transport information systems. For example, Wright et al. (2000) contend that road networks should be divided into lowest-common-denominator units in order to build an integrated urban transport information system (UTIS). A closely related matter in location referencing is the accuracy of the road centerlines that serve as the referencing base in transport (Noronha & Goodchild, 2000).

For most Western cities, although there are still uncertainties in location referencing, referencing systems such as street address and postcode have been set up in a comprehensive way with relatively complete data. These systems can generally satisfy the needs of transport data processing. In most cities of developing countries, however, there are no consistent referencing frameworks, and the data on referencing bases are incomplete. For example, postcodes in Chinese cities represent spatial areas that are too large to be useful in location referencing. The street numbering scheme is systematic in theory, but in practice it contains many errors. Moreover, even when these referencing data are available, the techniques of location referencing have to be adapted to meet local requirements. Address expression in China is very different from the Western style, which prevents direct application of the address matching process that is readily available in most GIS packages.

This paper discusses these problems and proposes solutions in the context of China. Existing referencing methods are explored under the general rubric of two schemes: the street-based and the name-based scheme. An experimental implementation of these methods is presented based on data from Wuhan, a metropolitan city in central China. Institutional issues are discussed concerning data availability and data sharing in the context of Chinese cities.

Methods of location referencing in China

The process and schemes

Location referencing identifies the spatial locations of socio-economic activities in the context of a geographical base, i.e. it assigns spatial coordinates to the activity sites. The process is shown in Figure 1.


Figure 1 The process of location referencing


People usually express a location in quite different ways. In a trip survey, for example, some respondents may provide a street address as required, some may not know the street number but can give the name of their institution, and some may only be willing to give a general description of the place they come from. A location referencing system should be able to cope with all these situations and correctly find the activity sites.

Based on their spatial characteristics, existing referencing methods can be categorized into three types of location referencing schemes: the name-based scheme, the street-based scheme, and the coordinate-based scheme. The first two schemes mainly deal with locating socio-economic activities by semantic expression, and will be examined in detail. Examples of the first two schemes are:
  • Name-based (point/polygon): building, place name, large or popular site, street intersection, street, administrative unit, telecom zone, postcode zone
  • Street-based (line): street address, mile-post
The name-based scheme provides a direct linkage between activities and base maps, using semantic names as keywords. The referencing base is a set of geographical entities with which names are associated. A scoring system can be set up to cope with mismatched cases. When no correct match is found, a remedial process can be started to search for possible candidates and to call for human interaction. The street-based scheme is basically a linear matching method in which the relative distance from a known origin is needed to determine a location. The scheme is applied on road networks to locate traffic events and road conditions. Its main feature is that the process includes linear interpolation to identify a location along a road.
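The linear interpolation used by the street-based scheme can be sketched as follows. This is a minimal illustration, assuming the road is stored as a list of planar (x, y) vertices and the location is given as a distance from the first vertex; the function name `locate_along` and the sample coordinates are hypothetical and do not come from any GIS package.

```python
import math

def locate_along(polyline, distance):
    """Return the (x, y) point lying `distance` units from the start of `polyline`.

    `polyline` is a list of (x, y) vertices. Distances before the start or
    beyond the end are clamped to the first or last vertex.
    """
    if distance <= 0:
        return polyline[0]
    travelled = 0.0
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if seg_len > 0 and travelled + seg_len >= distance:
            # Fraction of the current segment that still has to be covered.
            t = (distance - travelled) / seg_len
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        travelled += seg_len
    return polyline[-1]

# Hypothetical road: 100 units east, then 50 units north.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
print(locate_along(road, 120.0))
```

A mile-post reference works the same way: the mile-post value is the distance argument, and the known origin is the first vertex of the road.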

Referencing bases

The conditions of the location referencing bases under the two schemes vary from country to country. Some of the practiced or possible methods in China are briefly discussed here.

Street name and street intersection. Since the street network forms the spatial structure of a city, streets can be effective in location referencing. The problem with locating by street lies in geographical precision: a shorter street indicates a more precise spatial location than a longer one. When street addresses are not available, street names can be used as a substitute. In many cases street intersections are used as referencing points, e.g. in traffic navigation and incident management. The potential problem with street name and street intersection matching is that street names are often expressed incompletely or incorrectly.

Street address. The street address has been the most popular and important means of locating socio-economic activities in well-developed cities, because it ties an activity directly to a geographical location. In China the system is created and maintained by the public security sector, and is widely used by the postal service for address expression. However, Chinese address expressions are not directly suitable for a standard address matching process. The major difficulty is that the road name and the street number are hard to extract from a conventional address expression, so a translating program is needed before Chinese addresses can be used in the standard address matching process. Another problem is that Chinese street numbers are sometimes not assigned in a systematic way. This is partly due to indifference or inefficiency in the responsible offices, and partly due to rapid renovation and reconstruction during the economic boom.

Postcode. Postcodes are tightly associated with mail delivery. A postcode system is formulated systematically, with codes allocated hierarchically. The accuracy of a postcode is determined by the spatial units represented by the lowest level of the code. Postcodes can also be represented in detail by points, as is the case with the ZIP+4 address points in the US postal database (Cowen, 1997). The postcode in China is a national 6-digit system. For location referencing within a city, for example in transport modeling, the postcode zones are too large in geographical area. The built-up area of Wuhan, for instance, is about 230 km², yet it is divided into only 25 postcode zones. Furthermore, the shapes and sizes of postcode areas differ greatly from one zone to another.

Place name. From a geographical point of view, a conventional place represents an area that may relate to a legend or an event in history. If all the places in a city are geo-referenced, they can be used for less precise location referencing. One characteristic of places is that their areas may vary dramatically and they do not have clear boundaries, which poses a troublesome problem of data representation. A practical solution is to find a point that best represents the place and to treat the place as a point entity in the database. All place points in a city together may cover the urban area by using an irregular tessellation representation in a GIS (Worboys, 1995). An interesting fact is that bus stops are mostly named after the conventional names of places. Given the importance of public transport in Chinese urban systems, the names of transit stops are well known to travelers. More importantly, the stops cover a large proportion of the built-up area. These stops could therefore serve as location references in the absence of more precise location measures. Although the spatial accuracy of place references is not very high, collecting these data is not demanding. The reference base can thus be set up with relatively little input and still satisfy the needs of some applications.
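Treating places as point entities can be sketched as below. One common irregular tessellation built from points is the Voronoi (Thiessen) diagram, in which every location belongs to its nearest point; whether this is the tessellation the authors have in mind is an assumption. The stop names and planar coordinates are hypothetical.

```python
import math

# Hypothetical stop names and planar coordinates in metres, for illustration only.
STOPS = {
    "Stop A": (0.0, 0.0),
    "Stop B": (1000.0, 0.0),
    "Stop C": (0.0, 1500.0),
}

def locate_place(name, places=STOPS):
    """Name-to-point lookup: return the representative point of a place, or None."""
    return places.get(name)

def nearest_place(point, places=STOPS):
    """Return (name, distance) of the place point closest to `point`.

    Assigning every location to its nearest place point is equivalent to
    looking up the Voronoi cell that contains the location.
    """
    name = min(places, key=lambda n: math.dist(point, places[n]))
    return name, math.dist(point, places[name])

print(locate_place("Stop B"))
print(nearest_place((900.0, 100.0)))
```

The first function answers "where is this named place?", the second "which named place best describes this coordinate?".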

Site of employer (work unit). A site of employer (or a work unit in China) is a landmark place where employment takes place, such as a public agency, a business park, a university, or a shopping center. Because they are well known, these sites are frequently used as references by travelers in everyday life. The sites themselves can be located precisely and cover a moderate area. In urban travel demand modeling, employers' sites are important attraction points and are especially emphasized in disaggregate transport surveys. In a metropolitan city like Wuhan, many large work units can serve as referencing places because of their popularity or prominence.

Building. The most accurate way of identifying locations is to designate the building (landmark) in which activities happen. Buildings can be obtained from the topographical or cadastral maps held by the departments of land administration in Chinese municipalities. If buildings are to be used as a referencing base for activities, their names have to be clearly identified in the database. Using buildings as a referencing base still faces many difficulties. The biggest problem is the feasibility of generating and maintaining such a huge data set. The closest successful example is the Address-Point dataset maintained by the Ordnance Survey of the UK, which contains more than 25 million addresses.

Administrative unit. The administrative hierarchy of an urban municipality in China is as follows: City - District - Street - Resident Committee. The system has been used for resident registration and socio-economic statistics. Because of its strong structure, the hierarchy is frequently used in expressing the location of activities. However, the resident committee rarely appears in location expressions, and its geographical boundary is not very clear. In address expressions the administrative units are usually used to supplement street addresses.

Implementing location referencing in Wuhan, China
Wuhan is a metropolitan city with a population of about 7 million in central China. It bears typical characteristics that can be found in other large cities of China.

The components of address expressions
The standard postal address should contain a street address and a postcode. In Wuhan, as the postcode areas are large and do not correspond to any other spatial units, they cannot fulfill the needs of location referencing on their own. The street address has been recommended by postal offices and is used more and more frequently. In the absence of street addresses, names of companies or institutions are usually used as substitutes in the postal service. In practice, various methods may be used together when stating an address. This poses the problem of identifying the different parts of an address expression. An ad-hoc survey usually uses pre-defined forms that separate the address sections clearly. In ordinary files, however, addresses are often written without spaces or commas between the parts. In such cases an extraction algorithm is necessary to differentiate the components (Huang & Masser, 2001).

A list of addresses from various sources was compiled to explore patterns of address expression. The list contains about 1,430 addresses with great variation in expression. The addresses can be broken down into several parts or items that can be classified into different types of units. These units have been discussed in the previous section and are referred to here as address items or address units. The address items include:
  • Administrative city (AC) – the administrative city.
  • Administrative district (AD) – the administrative district.
  • Administrative street (AS) – the administrative street.
  • Place name (PN) – conventional place name.
  • Street name (SN) – street name, including road intersection.
  • Street number (SNR) – street number.
  • Work unit (WU) – enterprise, agency, school, hospital etc.
  • Building name (BN) – prominent building name.
  • Building number (BNR) – building or housing number in work unit or residential area.
  • Relative orientation (RO) – position relative to a known place or building, including distance.

The sequence of units in addresses is generally from large to small, without any space to separate them, i.e. AC->AD->AS->PN->SN->SNR->WU->BN->BNR->RO. Understanding this sequence helps to identify the different units in an address. Processing the sample list reveals that a majority of addresses (about 84%) contain street names, and that addresses with both street names and street numbers dominate the sample set (Table 1). These statistics indicate that street names and street numbers are popular components of address expressions. They also imply that street-based address matching is as important in Chinese cities as in Western ones. However, this does not mean that name-based matching is unnecessary, because name descriptions are also popular in address expressions.
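The large-to-small ordering can be exploited to split an unseparated address into units. The following is a rough sketch only, not the extraction algorithm of Huang & Masser (2001): it assumes that units can be recognized by common suffix characters (市 for city, 区 for district, 街道 for administrative street, 大道/路/街/巷 for street name, 号 for street number), and it leaves everything after the street number unclassified. The pattern, the function name `decompose` and the sample strings are illustrative and are not taken from the paper's address list.

```python
import re

# Assumed suffix markers; real address data would need a much richer rule set.
ADDRESS_PATTERN = re.compile(
    r"^(?P<AC>[^市区路街号\d]+市)?"         # administrative city
    r"(?P<AD>[^区路街号\d]+区)?"            # administrative district
    r"(?P<AS>[^号\d]+?街道)?"               # administrative street
    r"(?P<SN>[^号\d]+?(?:大道|路|街|巷))?"   # street name
    r"(?P<SNR>\d+号)?"                      # street number
    r"(?P<REST>.*)$",                       # work unit, building, orientation
    re.DOTALL,
)

def decompose(address):
    """Split an unseparated address string into the units that could be recognized."""
    match = ADDRESS_PATTERN.match(address.strip())
    return {unit: text for unit, text in match.groupdict().items() if text}

print(decompose("武汉市武昌区珞喻路129号武汉大学"))
print(decompose("解放大道1095号"))
```

The unclassified remainder (REST) is where work unit names, building names and relative orientations would have to be separated, for example by looking them up in the name-based referencing bases.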

Table 1 General components of the sample addresses

Address group                                           Cases   Percentage
Addresses with street names                              1210     84.5%
  Of which: with both street names and street numbers    1029     71.9%
            with only street names                        181     12.6%
Addresses without street info                             222     15.5%
Total                                                    1432    100%
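The percentages in Table 1 follow directly from the case counts, as the short check below shows. The dictionary keys are shorthand labels introduced here for illustration.

```python
# Case counts taken from Table 1.
counts = {
    "street name and number": 1029,
    "street name only": 181,
    "no street info": 222,
}

total = sum(counts.values())
with_street_names = counts["street name and number"] + counts["street name only"]

# Share of each group in percent, rounded to one decimal place.
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}
street_share = round(100 * with_street_names / total, 1)

print(total, with_street_names, street_share, shares)
```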


Street as referencing base

The basic structure of street numbering in Wuhan follows the general principle that odd and even numbers are allocated separately to the two sides of a street. Specifically, odd numbers are assigned to the western or northern side of a street, while even numbers are allocated to the eastern or southern side. However, street addresses in Wuhan have some special characteristics.
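The odd/even rule combines naturally with address-range interpolation. The sketch below assumes that each street segment stores the lowest and highest odd and even numbers found along it and that numbers are spread evenly between the two ends; the class, field names and sample values are hypothetical and do not reproduce the data model of any particular GIS package.

```python
from dataclasses import dataclass

@dataclass
class StreetSegment:
    name: str
    start: tuple        # (x, y) where numbering begins
    end: tuple          # (x, y) where numbering ends
    odd_range: tuple    # (lowest, highest) odd number on the segment
    even_range: tuple   # (lowest, highest) even number on the segment

def geocode(segment, number):
    """Interpolate a street number along a segment; return ((x, y), side) or None."""
    is_odd = number % 2 == 1
    low, high = segment.odd_range if is_odd else segment.even_range
    if not low <= number <= high:
        return None  # number outside the address range of this segment
    t = 0.5 if high == low else (number - low) / (high - low)
    x = segment.start[0] + t * (segment.end[0] - segment.start[0])
    y = segment.start[1] + t * (segment.end[1] - segment.start[1])
    # Wuhan convention described above: odd = west/north, even = east/south.
    side = "west/north" if is_odd else "east/south"
    return (x, y), side

segment = StreetSegment("Example Road", (0.0, 0.0), (0.0, 200.0), (1, 41), (2, 42))
print(geocode(segment, 21))
print(geocode(segment, 100))
```

A full implementation would also offset the point to the correct side of the centerline; here the side is only reported as a label.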

Firstly, while the numbering method is systematic, numbering along some streets may not be complete. Wuhan, like other cities in China, has experienced dramatic changes in spatial structure due to economic and social development since the 1950s, and especially after the economic reform of the last twenty years. The rapid changes of land use alongside streets have been accompanied by the merging or dividing of addresses, which makes it difficult to allocate street numbers in a comprehensive way. The updating of street numbers has not kept pace with the changing reality, and mistakes may occur in the process.

Secondly, there are exceptions. In one case, the street numbers of some special agencies are prefixed by the character “te”. In the sample set of 1432 addresses, for example, 44 cases contain this special character. The special character denotes the importance of the agency in its industry, and the numbers themselves are in line with their surroundings. In another case, an old address unit is divided into several units but no new numbers are available for them, or the new units are members of a new group. The new units may then be assigned sub-address numbers, e.g. 22-1, 22-2, and so on. These special numbers cannot be reflected in the address range of a street, and only the main numbers are assumed to be available in the address matching process.
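Before matching, such special numbers can be reduced to their main number. The sketch below assumes that “te” is written either as the Chinese character 特 or in its romanized form, and that sub-numbers are separated by a hyphen; both assumptions, the pattern and the function name are illustrative.

```python
import re

# Assumption: the special prefix appears as 特 or as the romanized "te".
NUMBER_PATTERN = re.compile(
    r"^(?P<special>特|te)?\s*(?P<main>\d+)(?:-(?P<sub>\d+))?\s*号?$",
    re.IGNORECASE,
)

def parse_street_number(text):
    """Split a street number into main number, sub-number and special flag."""
    match = NUMBER_PATTERN.match(text.strip())
    if match is None:
        return None
    sub = match.group("sub")
    return {
        "main": int(match.group("main")),       # the only part used in range matching
        "sub": int(sub) if sub else None,       # e.g. the 1 in 22-1
        "special": match.group("special") is not None,
    }

print(parse_street_number("22-1"))
print(parse_street_number("特8号"))
```

Only the `main` value would be passed to the address matching process; the sub-number and the special flag can be kept as attributes of the matched record.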

Thirdly, a street may have undergone so many changes that its numbers have to be re-assigned to cope with the new situation and to achieve a better numbering system. However, a change of street address may have fundamental societal effects and inconvenience the address holders. For a period, both the new and the old numbers may exist. For example, the street numbers of one trunk road in the Wuhan study area jump at an intersection from 369 to 975, indicating the coexistence of the new and old schemes.
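Breaks of this kind can be flagged automatically before a street is used as a referencing base. The sketch below assumes the numbers observed along one side of a street are available in order; the gap threshold of 50 is an arbitrary illustrative value, and apart from the 369/975 pair the sample numbers are made up.

```python
def find_jumps(numbers, max_gap=50):
    """Return consecutive pairs of street numbers that differ by more than `max_gap`."""
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if abs(b - a) > max_gap]

# Hypothetical odd-side numbers along a road with one break in the scheme.
observed = [361, 365, 369, 975, 979]
print(find_jumps(observed))
```

Each flagged pair marks a point where the address range of the street should be split into separate segments rather than interpolated across.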

Finally, some technical barriers have to be overcome. Techniques for matching Chinese characters have to be developed. In Chinese, a name is composed of several characters, each of which is represented by two bytes in double-byte encodings such as GB 2312. In some cases, different characters with the same pronunciation (pinyin) are used for the same name. There are also similar-looking characters in Chinese, which may need to be considered in character matching. If the address units are successfully decomposed, the standard street address matching process that is available in most GIS packages can be used. Sometimes information on street name and number is missing, e.g. respondents in a travel survey may not know their street address. In the absence of street numbers, it is assumed that other information can be provided, such as a company name or a place of residence.
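Matching names written with different characters of the same pronunciation can be sketched by comparing pinyin keys instead of raw characters. The lookup table below is a tiny hand-made one that covers only the characters of this example and ignores tones; a real system would need a complete character-to-pinyin dictionary. The example pair (阅马场 written with 厂 instead of 场) is illustrative. The last line shows the two-bytes-per-character storage under the GB 2312 encoding, which is an assumption about the encoding the text refers to.

```python
# Hypothetical, hand-made pinyin table covering only this example (tones ignored).
PINYIN = {"阅": "yue", "马": "ma", "场": "chang", "厂": "chang"}

def pinyin_key(name):
    """Replace each character by its pinyin; unknown characters are kept as-is."""
    return tuple(PINYIN.get(ch, ch) for ch in name)

def same_sound(a, b):
    """True if two names are pronounced identically according to the table."""
    return pinyin_key(a) == pinyin_key(b)

print(same_sound("阅马场", "阅马厂"))

# Three characters occupy six bytes in GB 2312 (two bytes each).
print(len("阅马场".encode("gb2312")))
```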

Figure 2 shows address matching results in part of the study area using MapInfo and ArcView. The two systems give the same results for addresses with complete information. For incomplete information they apply different matching strategies. The two points circled in the two maps represent the same street address but are interpreted differently by the two systems.


Figure 2 Address matching sample results

The correctness of address matching results can be checked either by field visits or by comparison with existing data. Figure 3 illustrates three examples of verifying matching results against an existing topographic map of the study area. The topographic map contains, among other things, building outlines and annotations of large agency names, which are shown as background features in the figure. Agency names were also collected and are kept together with their street addresses in the address table; they are shown in the lists with gray background in the figure. The stars are the locations matched from the listed street addresses, and the name annotations from the topographic source are also displayed. Comparing the annotations with the agency names in the lists indicates that these matches are successful.



Figure 3 Checking address matching results against topographic maps


From the above example it can be concluded that, given a reliable street referencing base and reliable address expressions, the general process of street address matching in GIS packages is applicable to the Wuhan situation. In real applications, however, it is precisely the quality of the source information, i.e. the street referencing base and the addresses themselves, that determines the success of the matching process. The street numbering system is incomplete and error-prone. There is no standard for address expression, which makes it difficult to separate the components of an expression. Address matching in a Chinese environment also requires more character processing than in a Western one.

Name as referencing base

Name-based location referencing requires that the spatial locations of entities serve as referencing bases. Work units and buildings are important references as their locations can be identified precisely. These data can be derived from cadastral maps or building permit maps, and supplemented by field investigation. Work units usually refer to large governmental agencies, institutions, and enterprises. Therefore a work unit may have many buildings.

The case of the place name is more complicated. A place has no definite spatial boundary, and may either be part of another larger place or contain other smaller places. Conventional place names are available from an office under the local municipality. Places are generally named after their geographically related stories in history. For example, urban Wuhan is divided into three parts by two rivers, and the three parts are historically called “the three towns of Wuhan”. The names of the three “towns” frequently appear in address expression and can be used as the highest level in the place hierarchy. The names of districts are sometimes also borrowed to indicate places, which can be considered as the second level. Other smaller places are associated with individual locations and can be regarded as the third level.

The relationships among the name-based referencing bases have to be clarified so that an appropriate spatial referencing database can be set up. Figure 4 illustrates the hierarchical structure and the relationships among the entities. According to the area covered, place names are divided into two levels: level 1 covers areas larger than districts, and level 2 falls between work units and administrative streets. The dominant relationship among the entities is one-to-many.



Figure 4 Hierarchy and relationships among name-based referencing entities


Figure 5 presents an example of organizing referencing layers in GIS. For the matching process, the hierarchy of these layers has to be maintained according to their spatial resolutions. In principle, name-based matching is a one-to-one match between names in referencing base and names to be located, regardless of the types of spatial entities in the referencing base. The matching of names is not a problem as long as the names are correctly spelled. However, the real situation is much more complicated. On the one hand, it is difficult to extract names from address expressions, as there have been no rules to apply. Even if the names are extracted, it is still hard to identify which referencing level a name corresponds to. On the other hand, names in referencing bases belong to different layers at different levels. For efficient search, these names have to be linked properly.
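Searching the layers in order of spatial resolution, with a similarity score to tolerate misspelled names, can be sketched as follows. The layer contents, names, coordinates and the 0.8 threshold are all hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever scoring system an implementation might choose.

```python
from difflib import SequenceMatcher

# Hypothetical referencing layers, ordered from finest to coarsest resolution.
LAYERS = [
    ("building", {"North Library": (1210.0, 3400.0)}),
    ("work unit", {"Example University": (1200.0, 3350.0)}),
    ("place name", {"Old Town": (900.0, 3000.0)}),
]

def match_name(query, layers=LAYERS, threshold=0.8):
    """Return the best match from the finest layer that reaches `threshold`, else None."""
    for layer_name, entries in layers:
        best = None
        for name, location in entries.items():
            score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
            if best is None or score > best["score"]:
                best = {"layer": layer_name, "name": name,
                        "location": location, "score": score}
        if best is not None and best["score"] >= threshold:
            return best
    return None  # no acceptable candidate: hand the case to a human operator

print(match_name("Example Univercity"))  # misspelled on purpose
print(match_name("Qqq"))
```

Searching the finest layer first means that a name found among buildings is located more precisely than the same name found among place names; the returned layer tells the caller which spatial resolution the match has.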



Figure 5 An example of name-based referencing bases


Institutional Issues
The public agencies in Chinese cities hold most of the referencing base data. If GIS technology is applied in these agencies, the various referencing methods can benefit the general public in different uses. For example, the Bureau of Public Security in Wuhan maintains a street address book, which has attracted the interest of many other institutions. Yet no GIS database of the addresses has been established.

Table 2 shows the agencies that hold information on the referencing bases.

Table 2 Referencing data and their holders in Wuhan

Referencing bases            Available from
Street name, Place name      Office of Place Names
Street address               Bureau of Public Security
Postcode                     Bureau of Post
Work unit & building         Urban Construction Committee
Administrative unit          Municipality
Telecom zone                 Bureau of Telecommunications


To make the best use of referencing data, it is extremely important for these agencies to share their information with the general public. However, data sharing among public agencies has been a major problem in many governments. Research on British local government has shown that only half of the organisations are in favor of cooperative information sharing (Masser & Campbell, 1995). Institutional aspects are regarded as the most difficult factor in data sharing among governmental agencies, which is especially true in the Third World (Batty, 1992). In Wuhan (and other cities in China), agencies have traditionally been separated and supervised by several provincial or national sectors. For reasons of self-interest and security, municipal agencies have been reluctant to exchange information with each other unless they have to. Although the situation has improved considerably with the widespread use of information technology, such as the government online project, information exchange among municipal units remains difficult. A balance has to be found between the benefits to the agencies and the requirements of information technology; as Alfelor (1995) has pointed out, the challenge is to make everyone better off as a result of the adjustment brought about by information sharing. This applies to data sharing both within an agency and among different agencies.

As different agencies may use the same sort of data (e.g. streets), duplicated data collection should be avoided in order to gain efficiency and preserve the integrity of the data. To achieve this, a consensus on data sharing has to be reached. Under such a sharing mechanism, data updated by the responsible agency can be used immediately by other sectors. An example of adding a new road to the road database in Wuhan is proposed in Figure 6.



Figure 6 The handling of urban streets in a data sharing framework


To turn the possibility into reality, the public institutions need technological and personnel input. Some agencies have been building spatial databases. For these databases to serve as a location referencing base, more information is needed from related agencies. It has to be recognized that a common referencing base benefits not only the participating agencies themselves but also the general public.

Although many possible methods exist, they vary in terms of precision, geo-reference, and completeness. In practice no comprehensive location referencing system has been developed in Wuhan (or in any other city in China). This may be due to the following facts:
  • The methods themselves are not consistent.
  • Incomplete data. Each municipal department may keep a relatively complete data set for its own purpose. Usually the update of the data cannot keep pace with fast changing urban environment.
  • Awareness of usefulness. Apart from their own usage, departments have limited understanding of the implication of their data to others. The linkage between departments has been weak.
  • Lack of standard. Even if the departments would like to cooperate, under current situation they would meet with many technical problems. The lack of standard makes it difficult to link different data sets. The data frameworks may vary in terms of coordinate system, entity representation, scale of data collection, and underlying spatial database system.
  • Lack of appropriate technology. Although GIS technology has been quite popular in some departments, not all agencies are aware of its advantages or have the necessary technical staff to do the job.

Conclusions
Location referencing is a two-fold process. On the one hand, the addresses to be geo-coded have to be clearly expressed so that they can be decomposed into appropriate address units. On the other hand, the referencing bases have to be properly organized according to the referencing methods. Street-based referencing will be the dominant matching method because street addresses are so widely used. Name-based referencing provides a useful complement for cases where no street information is available. These methods can be implemented in a flexible way, depending on the requirements of various tasks. A fully developed location referencing system should be able to handle different kinds of location descriptions and should incorporate various location referencing methods. The two schemes are realized in different ways in the computer, one using linear interpolation and the other using direct name matching. The name-based methods are fundamentally operated in a relational database management system (RDBMS), while address matching is a standard function of GIS. Incorporating these methods therefore implies integrating the systems as well as the data. This role can only be fulfilled in a GIS environment in which different systems are linked together. Judged by the requirements of location referencing, existing GIS packages are not yet sufficient to constitute a comprehensive location referencing system.

Public agencies are important sources of data for setting up location referencing bases. These agencies maintain the data in the course of carrying out their functions, which ensures that the data are continuously updated. Generally speaking, the agencies holding referencing data have not realized the potential of their data in information technology applications. It has to be realized that one agency alone will not be able to build such a system, because its data cover only one aspect of a referencing system. To build such a system, data from all related agencies have to be incorporated, and this will require cooperation among the agencies.

References
  • Alfelor, R.M. (1995) GIS and integrated highway information system. In: Sharing Geographic Information (ed. Onsrud, H. and Rushton, G.). Center for Urban Policy Research, New Brunswick, New Jersey.
  • Batty, M. (1992) Sharing Information in Third World Planning Agencies: Perspectives on the Impact of GIS. Technical report 92-8, NCGIA, Buffalo.
  • Bureau of the Census (1997) 1997 TIGER/Line® Files Technical Documentation. Washington, DC.
  • Cowen, D. J. (1997). Discrete Georeferencing. NCGIA Core Curriculum in GIS. http://www.ncgia.ucsb.edu/giscc/units/u016/u016_f.html.
  • Dueker, K. J. and J. A. Butler (1998). GIS-T enterprise data model with suggested implementation choice. URISA Journal 10(1).
  • Dueker, K. and A. Butler (2000). A geographic information system framework for transportation data sharing. Transportation Research Part C. 8.
  • Goodwin, C.W.H., Gordon S.R., Siegel D. (1995) Reinterpreting the location referencing problem: a protocol approach. Proceedings of GIS-T 95 Symposium. Washington DC. American Association of State Highway and Transportation Officials (AASHTO).
  • Huang, Z. and I. Masser (2001). Decomposing address expressions for location referencing in Chinese Cities. In: Urban Geoinformatics - 2001 International Conference Proceedings. Beijing: Publishing House of Surveying and Mapping.
  • Laurini, R. and D. Thompson (1992) Fundamentals of spatial information systems. London etc. : Academic Press.
  • Masser, I. & Campbell, H. (1995) Information sharing: the effects of GIS on British local government. In: Sharing Geographic Information (ed. Onsrud, H. and Rushton, G.). Center for Urban Policy Research, New Brunswick, New Jersey.
  • Noronha, V. and M. Goodchild (2000) Map accuracy and location expression in transportation - reality and prospects. Transportation Research Part C: Emerging Technologies. 8.
  • Raper, J., D. Rhind, et al. (1992). Postcodes: the New Geography. Longman Group UK Limited.
  • Ries, T. (1995) Integrating governments for transportation purpose using a geospatial framework. GIS-T Proceedings. University of Nevada/Las Vegas.
  • Worboys, M. F. (1995). GIS: A Computing Perspective. London etc.: Taylor & Francis.
  • Wright, S., M. Bell, et al. (2000) Introduction to the UTIS database for Newcastle upon Tyne. Traffic Engineering and Control. 41(3).