Spatially enabling high-end business applications
Xavier R. Lopez Ph.D. Sr. Product Manager Oracle Corporation USA Introduction Organizations have long been data rich and information poor, resembling information wastelands incapable of providing useful business intelligence to management either because of incompatible datasets and dysfunctional networks or as a result of ambiguity of definitions leading to irrelevant data. The ability to now store spatial data in enterprise heralds the emergence of a new capability for spatially enabling high-end business applications like: customer care, data warehousing, ERP, financial, and marketing. In short, we are now moving toward a strategy of empowering end-users by giving them the information they need and when they need it. This paper will look at the server-based capabilities necessary for carrying this out. Demand for Business Intelligence Tools Understanding and responding to customer demands for products and services is an important business driver in the telecommunications and utilities space. Until recently, telecommunication and utilities focused on gaining market share through mass marketing of products and services. This value-proposition was predicated on increased advertising and marketing costs while discounting products and services. While this has, in some cases, led to increased market share, customer loyalty is not guaranteed. With recent changes in deregulation, customers are no longer locked into a single utility provider. They now have more power in the market to switch providers in response to alternative services or competitor discounts. Since the switching costs associated with a competitor's alternative are relatively low, traditional service providers are watching their disconnect rates soar as customers, and hence, profits hemorrhage. In response, organizations are looking to utilize their information and technological assets to reduce churn and increase customer loyalty. Ana/ytica/ information systems that generates business intelligence about customers, who they are, where they live, usage patterns, elasticity to demand, and some knowledge of what kinds of services they expect are key to increasing customer care. Deregulating utilities are now expending large efforts to sustain their customer base, while new start-ups aggressively market products and services to capture and retain new customers. In this environment, organizations must anticipate and meet very high customer expectations for unique products and services. Data warehousing provides the underlying technology for supporting such analytical business functions. The decision support functions of successful business in today's hyper-competitive environment demands that organizations leverage their organizational information systems to meet analytical and strategic objectives. Some common decision support and analysis functions are listed in Table 1.
Nearly all of the reporting and analysis activities noted in Table 1 have a spatial element. Spatially enabled data and tools greatly enhance decision support. For example, customer penetration and profitability can be spatially enabled through a simple map graphic depicting clustered point data. Likewise, customer complaint analysis can be analyzed with a network infrastructure and network utilization to investigate whether reported service problems are resulting from localized infrastructure problems or temporal capacity constraints. Other examples include the generation of statistical models to examine a customer's likelihood to:
Data Warehousing: High-end Business Analysis Tool Data warehousing has quickly taken center stage in the information technology infrastructures of many organizations. A data warehouse is a collection of data in support of management's decisions. It is first and foremost, an ana/ytica/ tool oriented around the major subjects of the enterprise, e.g., customer, vendor, product and productivity. It focuses on data modeling and database design exclusively, and it excludes data that is not useful for decision support processing. In contrast, the operational (OLTP) applications are designed around processes and functions (e.g., loans, savings, bank card and trust transactionsfor a financial institution). Organizations are adopting data warehouse technology in response to an evershifting competitive landscape. Data warehousing is now a $17billion global market and growing 10-15 percent annually. According to META Group of Westport, Connecticut, by early 1996, over 90?40of large companies had implemented or were planning to develop data warehouses. Hardware, software, and consulting vendors have quickly developed numerous new products and services that specifically target this rapidly expanding market segment. In short, successful organizations have recognized the benefits of deploying data warehousing to address customer demands and market forces with great speed and dexterity. Data Warehousing and On-line Transaction Processing (OLTP) Corporate applications are essentially optimized for OLTP (On-line Transaction Processing) applications. OLTP transactions are characterized by fast access times, no end user customization, and multiple simultaneous user access. The majority of legacy systems are OLTP based and consequently, the data resides in a variety of data sources. The data sources can be in flat files, hierarchical databases (like IMS), network structured databases (like IDMS), inverted list databases (like DATACOM/ DB), relational systems (like Oracle) or more commonly a combination of the above. Businesses need to access the wealth of information that is stored in these data sources, collate the data and determine trends. Organizations that have been able to obtain such data are naturally better able to base decisions and strategy on past historical trends, and in doing so, gain competitive advantage. The problem is data stored across multiple databases and repositories is inconsistent, incomplete and optimized for task-dependent OLTP transactions. The data structure itself, does not lend itself to analysis and data query. In addition businesses cannot allow the production systems to be exposed to massive queries and statistical computations that might reduce their performance. The requirement is simple: collate data from these multiple sources into one logical database that is optimized for OLAP (On-line Analytical Processing) transactions and allow knowledgeable data users to access this data. This is what is referred to as a data warehouse. The notion of data warehousing is not a new idea but is rather the formalization of lessons learned in Decision Support Systems (DSS) and the growing demand for historical data to be analyzed and manipulated by end users with little or no IS involvement. An important difference between the data warehouse and operational applications is how they model data. Operational data maintains an ongoing relationship between two or more tables based on a business rule that is in effect. In contrast, data warehouse content spans a spectrum of time, maintains many relationships, and represents many static business rules (and correspondingly, many data relationships) between two or more tables. In other words, all data in the data warehouse is "time variant", i.e., accurate as of some moment in time, whereas in the operational environment data is accurate as of the moment of access. The time horizon represented for the data warehouse is much longer (which can involve years) than that for the operational environment (which ranges from the current values of today to ninety days). Every key structure in the data warehouse contains an element of time either implicitly or explicitly. Lastly, the data warehouse is nonvolatile. Data warehouse data is a long series of snapshots, and cannot be updated once correctly recorded, while record-to-record real-time updates -- inserts, deletes, and changes -- are done regularly to the operational environment. That is, once data is loaded into the warehouse from the application-oriented operational environment (and/or external sources), it does not change, but is merely accessed there. Therefore, there is no need to be cautious of the update anomaly, an important factor to consider in operational application systems; nor does data warehousing require the complex technologies supporting backup and recovery, transaction and data integrity, and the detection and remedy of deadlock. Data "updating" in the data warehousing environment consists of periodic mass loading of data from the operational environment. The simplicity of data management and the much less rigid response time requirements allow data warehouse designers to take liberties in optimizing the access of data. De-normalization of the physical data model is conducted to enhance performance and simplicity, which are more prominent for data warehouse operations because the amount of data involved is typically very large.
Data contained in a data warehouse is integrated in a singular and consistent structure. Considerable effort must be undertaken in coordinating the acquisition, transformation, and consolidation of information from various departmental and external sources. Because data from disparate application platforms is likely to be inconsistent, data scrubbing is required for generating credible and consistent results. This task is non-trival and generally involves the greatest cost. Implementing consistent naming conventions, measurement of variables, encoding structures, physical definition of attributes are challenging organizational issues, but are key to successful warehouse creation. For example, coming to terms with a consistent customer naming and addressing conventions are real challenges where multiple legacy systems are in place. Further, the integration of new data types, such as spatial, image, and time series must be suppoft at the data server tier to support analysis. As data is loaded from an operational environment or some external source, it is transformed and then goes into the current detail level, where it is used in the calculation of the lightly summarized data and the highly summarized data or in some ad hoc queries. Data Warehouse Query Processing In the area of query processing, a quick comparison against transaction processing applications is useful. Typical OLTP transactions are predictable and each unit of work is small, accessing or changing just a few rows. The real challenge is in terms of performing hundreds, perhaps thousands, of these well-defined, relatively small units of work concurrently. In contrast, decision support queries usually involve large amounts of data and thus each unit of work is typically much larger. There may not be quite as many concurrent users, although that is starting to change. Further, typical data warehousing queries involve complex operations such as multi-table joins, sorting, and aggregation. Often these operations are set-oriented, operating on groups of records meeting a specified criteria, as opposed to the record-level operations of transaction processing. Another main difference relates to the organization of the data being accessed. Unlike OLTP systems that employ transaction-centric modeling focused on efficiency, warehouses and data marts typically utilize query-centric dimensional models such as star schemas that facilitate easy visualization and analysis. And finally, unlike transactional systems, where SQL is carefully hand-written and pre-tuned by developers, decision support queries for the most part are not even written by humans. They are dynamically generated by end-user tools, posing some unique challenges for the query optimizer. These differences translate into a need for specialized techniques in the database server, in every aspect of query processing query optimization, access and join methods, and query execution. Query Processing The database server needs to provide parallel query technology to enable efficient execution of queries against large volumes of data. The primary benefits of parallelism scale-up and speed-up are basic requirements in warehousing. The database server has to offer a comprehensive set of parallel operators that cover every operation involved in query execution, including table and index scans, sorts, joins, aggregation, and grouping, as well as the creation of summary tables, inserts, updates and deletes. The parallel query technology needs to support all the parallel hardware architectures: Symmetric Multi-Processors (SMP), Clusters, and Massively Parallel Processing (MPP) systems. Most importantly, all of these query processing technologies have to be integrated. Contrary to what some vendors may claim, there is no one silver bullet for warehouse query processing. That is, meeting the complex, varied query needs of data warehousing is not just a matter of having a single specialized technique such as star query optimization or bitmap indexes. It's not even achieved by implementing a comprehensive set of techniques; that is just the pre-requisite. The real challenge lies in the seamless integration of all the capabilities in query optimization, access and join methods, and query execution that enables them to work together and deliver the full power of the query processing engine. Data Management Data Management refers to all of the activities related to getting data into the warehouse, organizing it for efficient access and easy manageability, and maintaining it. In database terminology, it covers operations such as data loading, building indexes, collecting statistics on the data, enforcing constraints, reorganizing tables and indexes, building aggregates or summaries, and data purging. While these activities are not unique to data warehousing, the sheer volume of data places special requirements in terms of scalability and performance. To effectively meet the needs of data warehousing, the database server has to provide a data management infrastructure that delivers:
In the context of data warehousing, there are two sides to scalability:
Data warehouses and data marts are fast becoming business-critical systems, requiring the same high levels of availability that we typically associate with operational transaction processing systems. The database server at the heart of the warehouse needs to deliver data access despite failures and during maintenance operations. Below are some key requirements:
In addition to the traditional requirements, spatial data warehousing in this network computing era presents some significant new challenges in distributed environments. First, there is a 'iSize Challenge. " As most of us have witnessed over the past few years, warehouses and data marts seem to have a natural tendency to grow in size. As business users start to realize value from data warehousing, they typically want to grow the warehouse data adding more history and more detail, as well as capturing more attributes. The end result is that warehouses that start out in the gigabytes quickly grow into the hundreds of gigabytes or terabytes. Larger ones that start at the terabyte level are often projected to grow into the tens of terabytes. This explosive growth in data volumes presents significant challenges to the database server in terms of data management and query processing capabilities. Second, there is a Users Challenge. As data warehousing goes mainstream, helped in part by the ubiquitous intranet technologies, it's not just for a few analysts anymore. It's virtually for every employee, and may sometimes reach further out to include partners and suppliers as well. This growth in user population translates into scalability and resource management challenges for the database server. The next new challenge is a Performance Challenge. Data access is quickly moving from today fs combination of batch reports and occasional interactive analysis to a direct model involving predominantly interactive, self-service access to decision support information, This change in usage translates into faster query processing and more stringent response time requirements for the database server. The final challenge is the "Availability Challenge. J' Today, data warehousing is not just useful, it's business-critical. This transformation means that warehouses and data marts have to be available all the time, irrespective of failures and even during maintenance operations. In short, the new reality is that the database server has to support more data and more users, provide faster performance, and be available all the time. Data Warehouse Engine The core data warehouse technology is the database engine. The database stores the information and drives the analytical application. Data warehouse implementation places a premium on the adoption of relational database technology to deliver high performance, scalability, flexibility, openness, and ease of use. Over the last couple of years, object-relational database management systems (ORDBMS), or Universal Data Servers, have emerged as the new standard for overcoming the limitations that relational databases have in terms of storing a limited number of data types: integers, dates, character strings, etc. The main features of ORDBMS include: High performance Scalability Enterprise access to complex data types (spatial, image, time series, video) Robust storage, index, and query Open standards-based data access Extensible frameworks to support application development Major database vendors like Oracle, Informix, and IBM have adopted ORDBMS technology as the next logical advance in data warehousing. These vendors recognize the value of storing all of the enterprise data in the database and of managing enterprise data in open, standards-based, SQL-accessible data formats. These latter requirements help to overcome the proprietary data "lock-in" that prevents applications from sharing data. Spatial Obiect Technolocw Extensible RDBMS technologies break new ground by providing an efficient way to store, access, and manage spatial, along with imagery, video, and time-series information in a single database. By effectively managing spatial and attribute information in the database, users benefit from standard features like a flexible clientherver architecture, object capabilities, robust data management utilities, ensuring data integrity, recovery, and security features that are virtually impossible to obtain in hybrid GIS architectures. Spatially enabled databases can now: Manage complex data types as native objects rather than BLOBS. Eliminate data synchronization issues common in hybrid file system/database solutions Reduce redundant data stores and servers Provide efficient data administration through reduced overhead and maintenance Minimize training and support for stand-alone database solutions Overcome proliferation of non-interoperable data types in different data servers Put all the data in common database environment Scale from gigabyte to terabyte Integrate with enterprise applications (OLTP, ERP, GIS, CAD, CIS) Performance Database performance features to consider when evaluating data warehouse systems include: query optimization, cost based optimizers, data model definition, and query performance and management. Quetv Optimizer The query optimizer is often described as the brain of the query processing engine. The optimizer has the responsibility for identifying the most efficient execution path for each query, taking into account a variety of factors relating to the data, query execution capabilities of the server engine and available system resources. Cost-Based Optimizer Sophisticated query optimization for data warehousing began with the introduction of the cost-based optimizers (CBO). The CBO provides a solid foundation for data warehouse operations query optimization that takes into consideration details related to the size of tables, range of attribute values, and the distribution of data. CBOS utilizes data value histograms to effectively deal with real-world data skew. Data Model Definition The data model for the data warehouse is of paramount importance to the overall performance, flexibility, scalability, and capability of the system. This is generally not to be taken lightly, and is best undertaken with the assistance of data modeling specialists. The leading data warehouse providers, including Oracle, IBM, and Informix, generally recommend a star schema for data modeling. Star schemas are optimized for performance by taking advantage of database features such as: parallel query processing, table partitioning, parallel indexes, and intelligent optimizers. For example, the Parallel Bitmap Star-Join in combination with intelligent optimizers as the foundation for the Oracle data warehouse. This data model offers a complete parallel architecture to support parallel query, parallel insert delete, and parallel load. In addition, intelligent optimizers analyze table volume, skew, and partition strategy to determine the best execution plan for each query. Performance and Management Performance and scalability are two key features to consider in data warehouse solutions. The system must be built on technology that can scale effectively as data volumes grow. In addition, it must provide the management capabilities and tools to support databases that will likely increase into multiterabyte size. The management of the data warehouse is also something to consider when looking at available technologies. Scalability: Telecom and utilities warehouses must scale to very large sizes from gigabytes to hundreds of terabytes. Also, support for large object (LOB) storage along with multimedia (spatial time-series, and image) data types is increasingly important. Table Partitioning: The ability to partition tables eases loading, dropping, and localizes maintenance operations on single table partitions, and isolates failure so that unaffected data remains available. Parallel Operations: Parallelism can support loading of data, index creation, constraint enforcement, index scanning, statistics generation. In addition, parallel insert, update, and delete statements can be enabled leading to significant performance increases when dealing with large insertion rates. Automated Recovery Management: Automated management supports data archival, manages records of archived processes, reduces human error during back-up, and enables incremental backups for large databases. Management of Complex Data Tvpes: Data warehouses must support emerging complex data types such as spatial, time-series, image, and video. For example, the spatial enabled analysis can bring a high return on investment by reporting observed spatial clustering of customer segmentation, fraud, and customer profitability. In short, the database is the engine that drives data warehousing, just as it drives OLTP. Organizations must carefully select database technology that provides the performance, scalability, openness, and ease of use to support complex data warehouses. As new data types like spatial are incorporated into the analytical process, object-relational databases offer new opportunities to spatially-enable the data warehouse. Conclusion Data warehousing has taken the information industry by storm, and it is now poised to transform it. Warehouses from vendors like Oracle have become essential technologies to success in a highly competitive, dynamic, and information-intensive market. Large organizations now realize the strategic value of data warehouse tools and being competitive today means harnessing organizational knowledge through these tools. With the addition of spatial data types, organizations can now incorporate spatial tools to enhance analytical processes. Through tools such as OLAP, end users are provided with the ability to drill down and massage information and have full use of complex data types to enhance their decision making tasks. | ||||||||
|
|