The Data Warehouse
William R. Donaldson Senior Consultant, Convergent Group, 6200 South Syracuse Way, Englewood, Colorado 80111 Abstract “A data warehouse is a subject-oriented integrated non-volatile, time variant collection ofdata suited to the needs of management. ” [Inmon, 1994] This paper will take that definition apart and describe how the data warehouse can provide decision support information and contribute to the integrated solution of AM/FM/GIS in the enterprise data environment. The data warehouse is more than a central copy of system data and less than a cure-all. The data warehouse can be an effective and efficient means of delivery of the information created and maintained by the AM/FM/GIS. It can be the solution that makes the difference between success and failure of a project. The paper will illustrate the technical and economic factors to be examined when considering a data warehouse in the project architecture. It examines the details of the 12 criteria defining what the data warehouse is (and is not), explains the five elements of data warehouse architecture, and concludes with a review of the potential hazards and measures of success. The Enterprise Environment “The biggest mistake any company can make is to promise that one of the GIS benefits will be that it will help reduce personnel. “ [Muench, 1996] In a large utility company, AM/FM/GIS may not be cost effective as a stand-alone application. This assertion was considered almost heretical a few years ago, but more and more we are seeing projects falling short of projected benefits. The most commonly seen justification for implementing an AM/FM/GIS is that it will reduce cost by eliminating people. Planners project improvements in productivity and related savings to be 30 percent or more for a stand-alone system. That makes for a pretty attractive return on investment. Yet, when we look at actuals from systems that have been installed for a year or more, we see few, if any, headcount reductions. On the other hand, the case is being made that the information kept in the AM/FM/GIS is incredibly valuable, not only to facilities engineers, but to the entire company-but only if it’s available in a usefhl form to everybody who needs it. It’s not the AM/FM/GIS application that has value to company; it’s the information the system maintains that has value. Or to paraphrase a catch-phrase from the 1992 presidential campaign, “It’s the data, stupid.” Experts are coming to realize how the value of the information multiplies many times over when it can be shared across the company. There are four architectures that can be used to satisfy the needs of enterprise data:
Most systems today use the central database architecture, either as a single corporate database, or multiple, stand-alone databases located in districts, Distributed databases are defined as multiple, homogeneous databases connected by a network and having data services that direct the application to the correct location for specific data. Few AM/FM/GIS systems today use this architecture. Federated databases are similar to distributed databases except that they are heterogeneous and thus more suited to integration with legacy systems however, the difficulty and cost of integrating federated databases are enormous. Pair-wise interfaces between systems and databases do not constitute distributed or federated databases and are a poor substitute for enterprise data architectures. Enterprise data is that which meets the following criteria
The Data Warehouse Defined The definition of a data warehouse seems to be quite simple, ye~ as Hackathorn [1993] says, “there has been much confusion and even controversy over what constitutes a warehouse.” Hackathom attempts to define the subject as “a collection of data objects that have been packaged and inventoried for distribution to a business community.” This definition seems too generic and ambiguous in describing a warehouse and how it differs from other data stores. For example, his deftition could describe a simple database extract. We are inclined to ask, so what’s the big deal with definitions so long as it stores my data? In fact, the data warehouse is as different from other data stores as a database is from a flat file. It’s critical to understand the differences, at least at the conceptual level, if we are to communicate requirements and design options among users, designers, and suppliers, and to make good choices in our project planning. Bill Inmon, the self-proclaimed “Father of the Data Warehouse,” describes 12 rules of the data warehouse that pin it down in greater detail and clarity. Inmon’s Twelve Rules Rule 1: Warehouse data is logically and physically separate from operations data. The operations environment contains data that is needed to run the everyday operations of the business. The data warehouse contains data that is used to support strategic decision making. Warehouse data must be kept separate to prevent decision support queries from degrading performance of the real-time operations support systems. An example of this occurs in the telco industry with loop assignment applications, like the Loop Facilities Assignment and Control System (LFACS). The application is designed as an operations system and works well to support loop assignments, but when facilities engineers get into the application to count cable ffls--a decision support function--the performance of the application is degraded. A warehouse could solve this problem by taking periodic summaries of cable fills that the engineers could access independently Ilom the real-time system. Rule 2: Warehouse data is integrated. There must be a single, uniform representation of data throughout the warehouse, including aspects of element names, measurements, attributes, and physical definition. Integration is necessary for users to receive a uniform, coherent representation of the business. When data is kept redundantly in multiple applications there are inevitably variations in the form and content horn one system to another leaving users to wonder which is valid. Perhaps, they are all valid for the purposes of each specific application, but data that serves the enterprise must be valid for the whole. There cannot be ambiguities or inconsistencies. Rule 3: Warehouse data is historical. This is not to suggest the warehouse is merely an archive of operations data. Rather it is selected data that represents a view of the business at points in time. The warehouse is not intended to duplicate the real-time operations systems. It is not a central archive or backup of district databases. Such an archive maybe a valid requirement of a system, but that is not the purpose of a data warehouse. Rule 4: Warehouse data is a “snapshot” view of the business at a particular point in time at which the data was relevant. As such, the warehouse data will not be updated. The warehouse may contain many such historical snapshots. Moreover, since warehouse data is intended to be aimed at decision support, it is unlikely to be a complete replication of operations systems data. Rather, it will be selected aggregate data taken from operations data. As an example, the warehouse would not keep a complete inventory of each section of cable in the network, but it might be used to record a monthly summary of sheath-miles by gauge. It would not keep a record for each terminal in service, but it might keep a weekly record of how many of each type of terminal were in service. Rule 5: Warehouse data is organized by subject along the guidelines of subject data areas without influence from applications or functions. Users who may not be familiar with the source application need to be able to browse and locate required data. Users are much more likely to be able to locate data stored by subject, e.g., “customers” or “products,” than that stored along the lines of an application or function they are not familiar with. Restructuring data from its functionally orientated design to one of subject areas for the data warehouse can be challenging. Most data in the AM/FM/GIS would likely fall into the subject area of Installed Plant (which maybe further divided by the warehouse into plant classifications), but some data stewarded by the AM/FM/GIS system--for example, customer address--may fit better under the subject area classification, “Customer.” The data warehouse polling applications need to be designed to collect the data from the correct source and put it into the correct area. Rule 6: Source data is operations data. “Under all normal conditions data is not directly entered or changed in the dda warehouse. ” [In.rrwn, 1994] Since the warehouse data represents a snapshot of selected operations data from a point in time, once a snapshot is taken it is unlikely that data will ever be changed unless it is discovered the data was not a true representation of the business at that time, and it was judged worthwhile to correct it to make a valid historical record. Rule 7: Development life cycle is different for warehouse data Systems development life cycle is typically requirements-driven and the warehouse life cycle is data-driven. The warehouse is typically developed in an iterative fashion with each step building on the previous. Unlike “RAD’ or prototype development of systems, the fust warehouse iteration is usually a production database of a subset of the ultimate warehouse data. Rule 8: Warehouse data must have a standard structure. A key design issue is that of granularity of the warehouse data. The data is typically structured in several levels of granularity. For example, the warehouse may contain current (most recent snapshot) detail data, historical detail data, lightly summarized data, and highly summarized data. These design criteria will have a large impact on the volume of data maintained and the performance of the warehouse. Rule 9: Warehouse data technology is different. Operations systems are designed to support rapid, real-time transactions. The transaction architecture is usually kept small and the operations are supported by high-performance processing. A warehouse, on the other hand, typically requires handling huge volumes of data in large transactions without updates. Processing speed is less critical. Rule 10: There must be only one “source of record.” Each data element must have only one operations systems source in order to maintain high-quality, integrated warehouse data. Identitlcation of the single source is a critical design criterion. As noted in Rule 2, operations data is likely to vary in format and content from one system to another. The warehouse designers must choose which operations system contains the values that will be suitable to the entire enterprise. That system will be the one and only source of record. This rule is consistent with the enterprise data mandate that data be stewarded. The single stewarding application or database will also be the source of record for the warehouse. The warehouse itself will never be the source of record. Rule 11: Warehouse data contains metadata. Metadata, i.e., data about data, is maintained as a part of the warehouse. This information includes:
Rule 12: There must be a chargeback structure for warehouse data. It is essential to demonstrate the value of the warehouse by tracking its usage and charging users accordingly. Tools that are considered infrastructure, with the cost spread across the enterprise, tend to be overdeveloped or “gold plated.” A chargeback process tends to limit the warehouse design to its most efficient functionality. A concise definition that seems to accommodate Inmon’s 12 rules is: A data warehouse is a separate database of integrated subject area data designed specifically for use as a management decision support system and made up of snapshots of historical data derivedfrom operation systems. The Five Elements Haderle [1994] describes five elements that are essential in the design of a data warehouse. The fiist component of the complete data warehouse system is that of the operations data stores that may be in any of several operating environments. This is consistent with Inmon’s assertion that operations data is logically and physically separate from the data warehouse. This data, in “several operating environments” describes the legacy data environment precisely. The second element is the access to a distribution network. The suitable network will support the delivery of data from the operations systems to the data warehouse, and, consequently, the information from the data warehouse to potential users across the entire company. In a large company, this will likely constitute a complex system of local area networks and wide area networks. The concept of enterprise data requires that any user needing information and having adequate security access be able to retrieve the information easily and quickly. The third element of the warehouse is data delivery, the ability to move data from the operational sources to the data warehouse. This requires not only the network system described above, but a process and system for extracting and summarizing data from operations systems at the correct intervals. A graphical user interface (GUI) front-end with the ability to locate available data is the fourth element. Here is an essential component for using the data warehouse as enterprise data that Inmon doesn’t mention. The warehouse must be accessible to all who need to use its data. This is a weakness of the stand-alone AM/FM/GIS in that the only people who can access it are the users of the application. Without direct access, if the CEO wants a summary of sheath-miles of cable, it’s necessary to call the engineer, or someone else with an application workstation, to retrieve and deliver it. With the warehouse, the CEO can retrieve the information directly. The fifth and final component, according to Haderle, is that of end-user knowledge tools that provide decision support functionality. I would argue that the data warehouse is valuable as a repository of information even without additional knowledge tools. So long as the warehouse is designed and indexed adequately with a supportive GUI, users will be able to find and compile the information they need. A Practical Application Let’s take a look at a mythical large telecommunications company called Foobar Telecom. Like many companies in the field, Foobar is overwhelmed with data and looking for better ways to manage that data. The company is over 100 years old, has 10 million access lines in 1,500 wire centers, and spans 900 miles across several states. Foobar has hundreds of operations systems, each with its own data. Decision support data is scattered across multiple systems, non-integrated, redundant, often inaccessible, contradictory, inaccurate, and costly. Countless staff are dedicated to compiling information in a form suitable for decision support and preparing reports for the executives. Foobar is also in the process of implementing an AM/FM/GIS system. They have budgeted $100 million over seven years to complete the project. The system is designed with a central data server in each engineering district linked by local area network to the engineers’ workstations. Systems administration and backups are managed by the Information Systems Department at company headquarters performing nightly backups from each district over the company’s internal broadband wide area network. Knowing the value of enterprise data, the company has attempted to integrate the new system with the existing systems. All attempts to link the AM/FM/GIS system to legacy systems with simple interfaces have failed, however. Either the legacy systems don’t have a suitable open architecture, or they are not up to the task of supporting the additional load of real-time inquiries and updates launched by the new system. Foobar knows replacing all the legacy systems with modem databases is impractical, interfaces are not satisfactory, and yet they still want to achieve enterprise data. The addition of a data warehouse (Figure 1) offers one solution. The data warehouse is maintained centrally. It is designed around subject areas; for example, “Customers,” “Facilities,” “Billing,” “Markets,” and so on. At designated intervals (different intervals for different sets of data) the operations systems are polled during off hours for specific current information. The data is aggregati and summarized as decision support information and stored in the central data warehouse where it can be retrieved on demand by anyone in the company with the need and proper security access. ![]() Figure 1 The data warehouse is a read-only data store. It can only be updated by the periodic polling of operations systems. It may not be changed by end users. It polls for operations system data only during off hours so the operations systems’ performance will not be degraded. Unlike the operations systems servers, the warehouse server is a database machine optimized to manage terabytes of information in large transactions where real-time, sub-second response is not important. Inmon emphasizes how the data warehouse must be carefully designed to anticipate and support the data requirements of decision support functions. It will not be a complete replication of operations da~, that would be inefficient and impractical, perhaps even impossible. Rather, the system should retain summarized information in the smallest form that satisfies the anticipated need. For example, a manager may set out to learn how many poles are in a distribution area. The decision support system, anticipating someone would ask that question one day, should keep the summarized number of poles and not a detailed list of each and every pole, which can still be found in the operations system if necessary. Conclusion The data warehouse is a useful solution for integrating AM/FM/GIS data into the enterprise. It is not a substitute for the AM/FM./GIS system, but it can be used to make selected information available to many more people than will be able to access it through the AM/FM/GIS application. It is relatively easy and inexpensive to implement and it can be implemented in stages with each stage quickly returning full benefit. The data warehouse does not provide a linkage or interfaces to legacy systems to ensure consistent data between systems. It does, however, act as a repository of “official” data selected from many systems, so that users will know that if there are ambiguities between systems, the data in the warehouse is the correct, stewarded information. It’s possiblc+even temptin&o limit the initial design of the warehouse to only AM/FM/GIS data to quickly gain the benefit of being able to share the data with users outside the application. However, the warehouse should be designed with an eye toward the future and the ultimate needs of the entire company. There should be one warehouse for all subject data areas of the company. Thus, when considering a data warehouse in support of the AM/FM/GIS application, the system planner should be working closely with enterprise data planners. The successful data warehouse will be application-independent. It will provide summary information in a form useful across the company. Users will be able to access the system and collect needed information without extensive training or assistance. It will quickly become the first choice of users to research decision support information. The result will be executives, managers, and engineers making better, more informed, more consistent decisions faster with less time and effort spent on searching for information. Bibliography
| ||
|
|