InfoSphere Warehouse Cubing Services is designed to provide
a multidimensional view of data stored in a relational database, thus
dramatically improving the performance of online analytical processing (OLAP)
queries. The Cubing Services Cube Server provides access to mainstream OLAP
Multi-Dimensional eXpressions (MDX) functionality by using industry-standard
OLAP connectivity. MDX is a de facto industry standard that is used to
efficiently and intuitively express OLAP queries. The Cube Server supports MDX
queries from many users against many different OLAP cubes.
What are OLAP services?
OLAP is a category of software technology that enables
business analysts, managers, and executives to gain insight into data through
fast, consistent, interactive access to a wide variety of views of information.
The raw data has been transformed into information that reflects the real
dimensionality of the business. OLAP provides the following benefits:
- Multidimensional views: OLAP services provide the
multidimensional views, business logic, and data aggregations of interest to
business users.
- Interactive analysis: OLAP helps you to synthesize
business information through the analysis of historical and projected data in
various "what-if" scenarios.
- Speed: OLAP services provide rapid responses to queries,
regardless of database size and complexity. You can pre-aggregate consolidated
business data with the hierarchies in all dimensions to reduce the runtime
costs for building OLAP reports.
What are InfoSphere Warehouse Cubing Services?
InfoSphere Warehouse Cubing Services dramatically improve
multidimensional access to relational data. These services provide an
infrastructure to model, optimize, and deploy OLAP cubes. There are services to
model the multidimensional metadata and map the metadata to relational data.
Other services streamline the deployment and management of the OLAP cubes.
The Cube Server is designed to enable fast multidimensional
access to relational data that is referenced by the OLAP cubes defined in the
Cubing Services metadata database. The Cube Server relies on the relational
data source to store the persistent low-level and aggregated data, which is
fetched into Cube Server memory only as needed. When the Cube Server starts, it
loads the cubes that are configured to start at the same time as the Cube
Server. The Cube Server loads all of the dimensional metadata of each of these
cubes into memory (the member cache).
After a cube is loaded, MDX queries can request data from
the cube. The Cube Server checks to see whether the results are in the data
cache. If they are, the results are immediately available. If not, data cache
misses are serviced through SQL queries submitted to the relational database;
hence, SQL query performance is crucial. The presence of appropriate materialized
query tables (MQTs, also known as materialized views, summary tables, or
aggregate tables) and indexes on the database tables can make speed-of-thought
analysis with InfoSphere Warehouse Cubing Services possible. Cube modeling
also plays an important role in the performance of the Cube Server.
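The role an MQT plays can be sketched with SQLite standing in for DB2: the aggregation is precomputed into a summary table so that a query can read one row instead of scanning the fact table. The table and column names here are hypothetical, not from the product.

```python
import sqlite3

# Hypothetical star-schema fragment: a fact table plus a summary table that
# plays the role of an MQT (sales_fact and sales_by_region are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (region_id INTEGER, amount REAL);
INSERT INTO sales_fact VALUES (1, 100.0), (1, 50.0), (2, 75.0);

-- The summary (aggregate) table precomputes the SUM so OLAP-style queries
-- can read it instead of re-aggregating the fact table at query time.
CREATE TABLE sales_by_region AS
SELECT region_id, SUM(amount) AS total
FROM sales_fact
GROUP BY region_id;

-- An index on the grouping column speeds point lookups into the summary.
CREATE INDEX idx_sales_by_region ON sales_by_region (region_id);
""")

total = conn.execute(
    "SELECT total FROM sales_by_region WHERE region_id = 1"
).fetchone()[0]
print(total)  # 150.0
```

In DB2, an MQT additionally lets the optimizer reroute a query written against the fact table to the summary automatically; this sketch only shows the precomputation idea.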
Cube Server performance
The first thing to note about performance is that the Cube
Server uses DB2 relational OLAP (ROLAP) technologies. The Cube Server itself
does not provide any persistence for the data objects of a cube in
multidimensional form. Instead, the Cube Server relies on the relational data
source to store the persistent low-level and aggregated data, which is fetched
into the memory of the Cube Server only as needed.
Dimension member information is loaded from the underlying
relational data source. Thus, when a cube is started, several SQL queries are
submitted to generate the dimensions: one SQL query per dimension and one SQL
query for each level that has member properties. Each of these queries is a
SELECT DISTINCT query, so for large dimensions, such queries can take a long
time. The dimension metadata (members) can be stored in one of two caching modes:
- Static caching mode: All dimension metadata is stored in
memory. This has implications for scalability.
- Dynamic caching mode: The dimension members are read
from the database and stored in an indexed format in a user-specified
repository on the file system (thus, it is outside the relational database).
These members are cached in memory as needed by the queries. Thus, at any given
time, only a subset of the dimension members is in memory.
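The dynamic mode can be sketched as a bounded least-recently-used map from member keys to member records. The class and the in-memory stand-in for the on-disk repository are illustrative; the real Cube Server spills members to a file-system repository rather than a Python dict.

```python
from collections import OrderedDict

# Minimal sketch of a dynamic member cache: a bounded LRU map. On a miss,
# the member is fetched from the repository; when the cache is full, the
# least recently used member is evicted.
class MemberCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, key, fetch):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            return self._cache[key]
        value = fetch(key)                   # miss: read from the repository
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value

repository = {i: f"member-{i}" for i in range(100)}  # illustrative store
cache = MemberCache(capacity=3)
for k in (1, 2, 3, 1, 4):   # fetching 4 evicts 2; reuse kept 1 alive
    cache.get(k, repository.__getitem__)
print(sorted(cache._cache))  # [1, 3, 4]
```

The capacity corresponds to the member cache size discussed later: a larger cache raises hit rates at the cost of memory.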
After the cube has been loaded, dimensional metadata is
either entirely in memory (if static caching is used) or partly in memory (if
dynamic caching is used), but the cube cell data itself is still in the
relational data source.
Several factors play significant roles in the performance of
an MDX query:
- Whether the cube cell data required by the query is in
the data cache
- How many SQL queries must be submitted to the relational
data source, for cells not in the data cache
- How well the SQL queries perform
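The hit-or-miss flow above can be sketched as follows, with SQLite standing in for the relational data source and a dict standing in for the data cache; the table, column, and function names are hypothetical.

```python
import sqlite3

# Sketch of the data-cache path: a cell request is served from memory when
# possible and otherwise falls back to one SQL query against the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?)",
    [("West", 2008, 10.0), ("West", 2008, 5.0), ("East", 2008, 7.0)],
)

data_cache = {}
sql_queries = 0

def cell_value(region, year):
    global sql_queries
    key = (region, year)
    if key in data_cache:            # cache hit: no SQL is issued
        return data_cache[key]
    sql_queries += 1                 # cache miss: one SQL round trip
    row = conn.execute(
        "SELECT SUM(amount) FROM fact WHERE region = ? AND year = ?", key
    ).fetchone()
    data_cache[key] = row[0]
    return row[0]

print(cell_value("West", 2008))  # 15.0 (miss: SQL issued)
print(cell_value("West", 2008))  # 15.0 (hit: served from the data cache)
print(sql_queries)               # 1
```

The second request costs nothing beyond a dictionary lookup, which is why data-cache hit rates dominate MDX query latency.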
The cube definition itself can impact the performance of the
Cube Server. As the number of dimensions in a cube grows, the size of the cube
cellset address grows, and the more expensive cellset address comparisons
become. In addition, with more dimensions in a cube, the SQL queries that are
generated to fetch cellset values generally require more joins. Essentially,
fetching cube cellset data for a cube with n dimensions from a relational data
source might require n-way joins. This complexity might impact query
optimization, and MQT rerouting might also become an issue.
Explore the creation of artificial levels in the hierarchy
if it is too flat. Dimension members with many (thousands of) children can
adversely affect performance, and the data cache becomes less effective.
Dimensions with many members but few levels (short, wide dimensions) produce
the same problems. In general, dimensions with many levels are supported
without a problem in the Cube Server, and you should use such dimensions if
they result in members without thousands of children.
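One way to introduce an artificial level is to bucket a parent's many children under intermediate members; here, hypothetically, by the first letter of the member name.

```python
# Sketch of an artificial level for a short, wide dimension: group a
# parent's thousands of children under artificial intermediate members
# (bucketed by first letter). The city names are illustrative.
cities = ["Austin", "Atlanta", "Boston", "Berkeley", "Chicago"]

artificial_level = {}
for city in cities:
    bucket = city[0]                          # artificial parent, e.g. "A"
    artificial_level.setdefault(bucket, []).append(city)

# Each artificial member now has far fewer children than the original
# parent had, which helps both navigation and data-cache effectiveness.
print(artificial_level["A"])   # ['Austin', 'Atlanta']
print(len(artificial_level))   # 3 artificial parents instead of 1 wide one
```

In a real model, the bucketing expression would be defined on the dimension table column so the artificial level participates in the generated SQL.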
Level keys uniquely identify a member at a level. Follow
these recommendations while specifying the level keys:
- Keep the level keys small: For example, a level key that
consists of level expressions from all ancestors of the member, such as
[USA].[West].[CA].[Santa Clara County].[San Jose], is typically less efficient
than a compact level key such as [<City_id of San Jose>].
- Avoid redundant information in the keys: For example, in
a Time dimension, the Month level keys do not need to contain the Quarter
level: that is, [Year].[Month] is a more efficient representation than
[Year].[Quarter].[Month]. The elimination of redundant information also keeps
the keys small.
- Avoid using a level key whose expression resolves to a
string: Level keys are used for grouping expressions in SQL queries; in
general, grouping over numbers is more efficient than grouping over strings.
String objects in Java have a large fixed overhead (about 48 bytes on a 64-bit Java Virtual Machine [JVM]) and a
level key that evaluates to a string uses more memory. The only exception to
this is when the string value is also the member name; the Cube Server stores
the member name as part of its core metadata.
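The size argument for compact keys can be illustrated directly. The measurements below are CPython's object sizes rather than the JVM's, and the city ID is hypothetical, but the direction of the comparison is the same.

```python
import sys

# A compact numeric level key versus a fully qualified string key for the
# same member. 40507 is a made-up City_id for San Jose.
compact_key = 40507
ancestor_key = "[USA].[West].[CA].[Santa Clara County].[San Jose]"

print(sys.getsizeof(compact_key))                                # small int
print(sys.getsizeof(ancestor_key))                               # long string
print(sys.getsizeof(compact_key) < sys.getsizeof(ancestor_key))  # True
```

The same comparison applies on the SQL side: grouping rows by an integer column is generally cheaper than grouping by a long character column.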
Cube Server scalability
You can consider scalability in a number of ways, such as
the number of cubes, the size of a cube, the number of dimensions, the number
of members in a dimension, the size of the relational fact table related to a
cube, the number of concurrent queries, or the number of users.
The Cube Server runs as a single Java process, so in one
way, the scalability of the Cube Server is fundamentally limited by the Java
heap size, process size, or both. A 64-bit system and 64-bit JVM can
theoretically use very large memory configurations. However, a realistic limit
on the size of a Java process might be somewhere between 10 GB and 100 GB (at
which point, the Java garbage collection cost increases dramatically). When
that limit is reached, the performance of the process suffers significantly,
and scalability is effectively limited.
Performance implications of dynamic member caching
A dynamic member caching policy has performance implications
in the following areas:
- Cube-loading time
- Query performance
- Memory and disk space utilization
The cube-loading time is the elapsed time needed to make a
cube query-ready. Under dynamic caching, dimension rows are retrieved from
the underlying database exactly as they are retrieved for static caching.
Additionally, as the members are extracted from the rows, they are stored in an
indexed format in a repository. Thus, the cube-loading time consists of the
time needed to retrieve the dimension rows, the time needed to write to the
repository, and the time needed to build the indexes in the repository.
Three factors are crucial to cube-loading performance under dynamic caching:
- Performance of the underlying data source and the JDBC
connection. (This also applies to static caching.) Each cube dimension maps to
a separate (load) thread that independently accesses the underlying data source
and retrieves rows. All these threads share the same JDBC connection. Thus,
performance depends on the number of dimensions, the number of rows in each
dimension, and the type of SQL query used for this dimension object.
- Size of the member cache. This value determines the
amount of buffer cache space that is used to buffer the members in memory
before they are flushed (committed) to disk. For example, a value of 10 000
means that up to 10 000 members can be held in memory before a commit is
performed. Fewer commits mean better performance.
- Performance of the underlying file system. A file system
that is mounted over a network incurs a much higher performance penalty for
each write operation than a file system that exists on a local machine. The I/O
performance also depends on the amount of write I/O bandwidth available. For
example, a striped volume allows dimension members to be spread over multiple storage
devices and thus improves I/O bandwidth, reducing the overall loading time.
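The relationship between the member cache size and the number of commits can be sketched as a simple batching loop; the function name and batch policy are illustrative, not the product's actual flush algorithm.

```python
# Sketch of batched commits during cube loading: members accumulate in a
# memory buffer and are flushed (committed) to the repository whenever the
# buffer reaches the configured member cache size.
def load_members(members, cache_size):
    commits = 0
    buffer = []
    for m in members:
        buffer.append(m)
        if len(buffer) >= cache_size:
            buffer.clear()        # flush the batch to the repository
            commits += 1
    if buffer:
        commits += 1              # final partial batch
    return commits

members = list(range(25_000))
print(load_members(members, cache_size=10_000))  # 3 commits
print(load_members(members, cache_size=1_000))   # 25 commits
```

A tenfold larger buffer cuts the commit count roughly tenfold, which is why the member cache size matters for loading time as well as for query hit rates.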
Query performance under dynamic member caching depends on the following factors:
- Size of the member cache. This value determines the
maximum number of members that can be cached in memory. The query performance
depends on the hit rates in the member cache, which in turn depends on the size
of the member cache.
- Locality of access among queries. The member cache hit
rates of the queries also depend on the amount of locality (sharing) of access
among queries. For example, if the majority of the queries are accessing
information about the cities of California, the members corresponding to these
cities are cached in memory and are likely never discarded as a result of cache replacement.
- Performance of the underlying file system. The I/O
performance issues identified earlier for cube-loading performance also apply
to query performance. The main difference is that queries depend on the
read-I/O performance, whereas loading depends on the write-I/O performance.
Where to find more information:
InfoSphere Warehouse Cubing Services User’s Guide and Reference
InfoSphere Warehouse Cubing Services Best Practices
Multidimensional Analytics: Delivered with InfoSphere Warehouse Cubing Services
Speed your complex queries with InfoSphere Warehouse Cubing Services
Optimize your data warehouse using IBM InfoSphere Warehouse Cubing Services for OLAP