Spatially enabling high-end business applications
Data contained in a data warehouse is integrated in a singular and consistent
structure. Considerable effort must be undertaken in coordinating the acquisition,
transformation, and consolidation of information from various departmental and
external sources. Because data from disparate application platforms is likely to be
inconsistent, data scrubbing is required for generating credible and consistent
results. This task is non-trival and generally involves the greatest cost. Implementing
consistent naming conventions, measurement of variables, encoding structures,
physical definition of attributes are challenging organizational issues, but are key to
successful warehouse creation. For example, coming to terms with a consistent
customer naming and addressing conventions are real challenges where multiple
legacy systems are in place. Further, the integration of new data types, such as
spatial, image, and time series must be suppoft at the data server tier to support
analysis. As data is loaded from an operational environment or some external
source, it is transformed and then goes into the current detail level, where it is used
in the calculation of the lightly summarized data and the highly summarized data or
in some ad hoc queries.
Data Warehouse Query Processing
In the area of query processing, a quick comparison against transaction processing
applications is useful. Typical OLTP transactions are predictable and each unit of
work is small, accessing or changing just a few rows. The real challenge is in terms
of performing hundreds, perhaps thousands, of these well-defined, relatively small
units of work concurrently. In contrast, decision support queries usually involve large
amounts of data and thus each unit of work is typically much larger. There may not
be quite as many concurrent users, although that is starting to change. Further,
typical data warehousing queries involve complex operations such as multi-table
joins, sorting, and aggregation. Often these operations are set-oriented, operating on
groups of records meeting a specified criteria, as opposed to the record-level
operations of transaction processing.
Another main difference relates to the organization of the data being accessed.
Unlike OLTP systems that employ transaction-centric modeling focused on
efficiency, warehouses and data marts typically utilize query-centric dimensional
models such as star schemas that facilitate easy visualization and analysis.
And finally, unlike transactional systems, where SQL is carefully hand-written and
pre-tuned by developers, decision support queries for the most part are not even
written by humans. They are dynamically generated by end-user tools, posing some
unique challenges for the query optimizer. These differences translate into a need
for specialized techniques in the database server, in every aspect of query
processing query optimization, access and join methods, and query execution.
Query Processing
The database server needs to provide parallel query technology to enable efficient
execution of queries against large volumes of data. The primary benefits of
parallelism scale-up and speed-up are basic requirements in warehousing. The
database server has to offer a comprehensive set of parallel operators that cover
every operation involved in query execution, including table and index scans, sorts,
joins, aggregation, and grouping, as well as the creation of summary tables, inserts,
updates and deletes. The parallel query technology needs to support all the parallel
hardware architectures: Symmetric Multi-Processors (SMP), Clusters, and Massively
Parallel Processing (MPP) systems.
Most importantly, all of these query processing technologies have to be integrated.
Contrary to what some vendors may claim, there is no one silver bullet for
warehouse query processing. That is, meeting the complex, varied query needs of
data warehousing is not just a matter of having a single specialized technique such
as star query optimization or bitmap indexes. It's not even achieved by
implementing a comprehensive set of techniques; that is just the pre-requisite. The
real challenge lies in the seamless integration of all the capabilities in query
optimization, access and join methods, and query execution that enables them to
work together and deliver the full power of the query processing engine.
Data Management
Data Management refers to all of the activities related to getting data into the
warehouse, organizing it for efficient access and easy manageability, and
maintaining it. In database terminology, it covers operations such as data loading,
building indexes, collecting statistics on the data, enforcing constraints, reorganizing
tables and indexes, building aggregates or summaries, and data purging. While
these activities are not unique to data warehousing, the sheer volume of data places
special requirements in terms of scalability and performance.
To effectively meet the needs of data warehousing, the database server has to
provide a data management infrastructure that delivers:
-
Capacity to deal with large data volumes. Again, as in query processing, this
need translates into a requirement for parallelism in the execution of data
management operations.
-
Efficient operations within finite maintenance windows. This need is commonly
achieved through fine-grained maintenance operations.