The new InfoSphere Warehouse 10 and DB2 10 put a strong
emphasis in delivering faster query performance in the warehouse space. I’d like to take this opportunity to share
with you some of the new features and enhancements. These features and enhancements are on by default;
there are no configuration settings or changes to the SQL statements required. So, our customers will enjoy the performance
Let’s begin with hash join.
It is an efficient join method often chosen by warehouse queries. In Version 10, we’ve made numerous
improvements in hash joins, which include better memory usage, code
optimization, early filtering, and better optimizer costing. These improvements together provide
performance boost in hash join processing.
Furthermore, we allow hash joins to be considered in a wider scenario. For example, hash joins will now be considered
even if the two columns in the join are not of the same data type, or the join
predicate contains an expression like T.C1 =
UPPER(T1.C3). For the latter
case, we were restricted to using nested loop join in the previous releases.
When you have a star schema, DB2 provides two special join
methods, namely semi-join with indexAnding (a.k.a. star join) and Hub join (a.k.a.
Cartesian join). In the new release, a
new join method called zigzag join (ZZJOIN)
is introduced to expedite the processing of queries based on a star
schema. This join method filters the
fact table on two or more dimension tables simultaneously and skips probes into
fact table which are known to be unproductive.
Zigzag join has been shown to be an efficient method for querying large
fact table, and significantly reduces IO.
Aggregation (e.g., sum, count, distinct) is also common in
warehouse queries. Version 10 introduces
3 new operations to improve the processing of aggregation. They are partial early distinct (pUnique), partial early aggregation (pGRPBY), and multiple distinct progressing (MGSTREAM). Partial early distinct partially removes
duplicates early in the processing of a query if unique/distinct processing is
required. Partial early aggregation, on
the other hand, attempts to do partial aggregation of data early on. Both of them aim at reducing the amount of
data that must be processed later in the query evaluation and reducing the chance of
running out of sort heap memory, leading to a significant speed-up in query
performance. Finally, for multiple
distinct queries like SELECT K, COUNT(DISTINCT
C1), SUM(DISTINCT C2) FROM T GROUP BY K, we now have a new method to
process this type of queries, which will avoid redundant work of sorting on the
grouping key and reading the base tables.
As many of you know, queries against tables with composite
indexes present a particular challenge.
Ideally, a query’s predicates align with a table’s composite
index. However, this is not always the case.
Consider, for example, a table T with a composite index (C1, C2,
C3). When we have a query SELECT * from T WHERE C1=10 and C3=5, we have
a gap in C2. Using the composite index
will likely result in processing many unnecessary keys. In Version 10, we can now skip over
parts of the index that does not yield any results, and improve the performance
of an index scan. This new feature is
known as jump scan. Also because we can now
make an effective use of composite indexes, the number of indexes required may
be reduced. Talking about index access,
I have to mention our new smart index prefetching, which enables more efficient
prefetching for index scan (IXSCAN).
Similar approach is available for data prefetching when we use an index
to retrieve the data page (ISCAN-FETCH).
Smart index and data prefetching help reduce synchronous IO and the need for table/index reorg.
Large data warehouse systems often employ database partitioning
feature (DPF) where data is distributed across multiple partitions. These partitions may be resided in different
physical machines to provide us with a scale-out solution. In
such an environment, table queues are the mechanism used by DB2 for passing
data between partitions during query evaluation. Hence, they play an important role in how
fast a query will perform. In Version 10,
we’ve made several table queue enhancements, e.g., adding intelligence to
improve table queue’s throughput, and reducing potential spilling from table
There are many, many new features and enhancements in the
new InfoSphere Warehouse 10 and DB2 10 to improve overall query
performance. The above is by no means an
exhaustive list. Version 10 has demonstrated
significant performance boost in warehouse queries over the previous releases. I’m sure that our customers will be pleased
and enjoy this faster query performance.
Michael Kwok, Ph.D.
Manager, DB2 LUW / ISW Warehouse Performance
IBM Toronto Lab