Unleashing the Power in DataStage 8.5
Issue 2 - Caching in the Transformer
This is the second in a series of blog posts on the new features of InfoSphere DataStage 8.5. In this series I'll provide some detail on the new functionality and explain where I think customers will receive the most significant benefit. The first issue, "Looping Transformer", is available at the following link. This post builds on some of those concepts, so you may want to review it before reading this one.
One of the guiding principles of the DataStage roadmap is simplifying even the most complicated data integration challenges. When we can achieve that and make data processing more efficient at the same time, it's the best of both worlds. The new cache mechanism serves both of those goals. We'll illustrate it with an example that calculates percentages within a group.
What's a Transformer Cache?
The Transformer Cache is an in-memory storage mechanism available from within the Transformer stage, used to help solve complex data integration scenarios. The cache is a first-in, first-out (FIFO) construct and is accessible to the developer via two new functions:
- SaveInputRecord: stores the current input row at the back of the cache
- GetSavedInputRecord: retrieves a saved row from the front of the cache
In most cases these functions should be called from the stage variable or loop variable sections of the Transformer.
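Although the real cache is only reachable through DataStage expressions, its FIFO behavior is easy to model. The sketch below is a hypothetical Python stand-in (the names `save_input_record` and `get_saved_input_record` are my own, and the return value of the save function is an assumption, not the documented contract):

```python
from collections import deque

cache = deque()  # stand-in for the Transformer's in-memory FIFO cache

def save_input_record(row):
    """Model of SaveInputRecord: append the current input row to the back."""
    cache.append(row)
    return len(cache)  # assumption: returns the count of records now cached

def get_saved_input_record():
    """Model of GetSavedInputRecord: remove and return the row at the front."""
    return cache.popleft()
```

Saving two rows and then retrieving one hands back the first row saved, which is the first-in, first-out ordering the Transformer relies on.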
Developers will find the cache most useful when a set of records needs to be analyzed as a single unit and a result derived from the whole group appended to each record in it. Let me describe some scenarios where using a cache will prove very helpful:
1. The input data set is sorted by fund id and valuation date in ascending order, with an unknown number of records per fund. The requirement is to output the five most recent valuations for each fund, and to output none at all for a fund with fewer than five.
2. A varying number of clients (N) is related to each salesperson. The requirement is to label each client detail record "1 of N", "2 of N", and so on.
3. An input file contains multiple bank accounts for each customer. The requirement is to show, on each individual account record, that account's percentage of the customer's total balance.
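All three scenarios share the same shape: nothing can be written out until the whole group has been seen. As a plain-Python illustration of that shape (not DataStage code, and with made-up sample data), scenario 2 might look like this:

```python
from itertools import groupby

# Hypothetical sample input, sorted by salesperson
rows = [
    {"salesperson": "A", "client": "Acme"},
    {"salesperson": "A", "client": "Bolt"},
    {"salesperson": "B", "client": "Core"},
]

labeled = []
for _, group in groupby(rows, key=lambda r: r["salesperson"]):
    clients = list(group)     # the whole group must be held before labeling
    n = len(clients)          # N is unknown until the group ends
    for i, r in enumerate(clients, start=1):
        labeled.append({**r, "label": f"{i} of {n}"})
```

The buffering step is the crux: N is only known at the key break, so every record in the group has to wait somewhere until then, and that "somewhere" is what the Transformer cache provides.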
Perhaps one or more of these sounds familiar to you. I've needed to implement the third scenario a few times myself and have seen many customers do so with a similar design pattern. While that standard pattern was effective, the cache solution is more resource efficient. I'll describe it below; you may also refer to the Information Server InfoCenter for more detail on this solution.
Percentage of Total Scenario
I've included a picture of the Transformer with a few annotations to aid understanding. The logic of this Transformer is as follows:
1. Input data is read and stored in the cache via the SaveInputRecord function.
2. Key breaks are tested via the LastRowInGroup and LastRow functions.
3. A running balance is accumulated while the group key is unchanged, and the total balance field is set at the key break.
4. If there is a key break, the RowsAtBreak variable is set to the number of records in the cache and we skip to step 6.
5. If there is no key break, the Loop Condition tests false (since RowsAtBreak is 0), causing the next input record to be read and processed (note the loop pictured on the top left).
6. The Transformer loops while @ITERATION <= RowsAtBreak, reading records back from the cache (this is the loop pictured on the bottom right).
7. For each output record processed, the percentage is calculated as Balance/TotalBalance.
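The steps above can be sketched in ordinary Python to make the control flow concrete (this illustrates the pattern, not the Transformer's actual expression syntax, and the sample data is made up):

```python
from collections import deque

# Sample input, sorted by customer id: (customer, account balance)
rows = [("c1", 100.0), ("c1", 300.0), ("c2", 50.0)]

cache = deque()
output = []
total = 0.0

for i, (cust, bal) in enumerate(rows):
    cache.append((cust, bal))          # step 1: SaveInputRecord
    total += bal                       # step 3: running balance for the group
    # step 2: key-break test (the LastRowInGroup / LastRow equivalent)
    last_in_group = (i + 1 == len(rows)) or (rows[i + 1][0] != cust)
    rows_at_break = len(cache) if last_in_group else 0   # steps 4 and 5
    for _ in range(rows_at_break):     # step 6: loop while @ITERATION <= RowsAtBreak
        c, b = cache.popleft()         # GetSavedInputRecord
        output.append((c, b, b / total * 100))  # step 7: Balance/TotalBalance
    if last_in_group:
        total = 0.0                    # reset the running balance for the next group
```

For the sample above, customer c1's two accounts come out as 25% and 75% of a 400.0 total, and c2's single account as 100%.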
Most of this logic will seem fairly intuitive but two items need additional clarification.
In step 5, we noted that RowsAtBreak must be 0 until we are ready to produce output. This prevents the Transformer from entering the loop construct: @ITERATION equals 1 before the loop is entered, and since 1 <= 0 tests false, the Transformer reads the next input record instead. Assuming we then hit a key break, RowsAtBreak is set to 2 and we iterate through the loop twice.
The second item to clarify is a simpler one. In the loop variable we call the GetSavedInputRecord function. When it is called, the output derivations for "DSLink2" redirect to the row pulled from the cache. This allows us to code a single output derivation (for instance "DSLink2.Balance/TotalBalance") that works for either a cached row or one coming directly from the input link.
At this point, we've solved scenario 3. You can probably surmise the variations required to solve scenarios 1 and 2; they need only a few small tweaks to this model.
Earlier I mentioned that there are other ways to solve these data integration challenges in DataStage. One of the more common design patterns splits the data down two links, aggregates by ID on one of them, and then joins the data back together to put the ID total balance on each detail record before calculating the percentage. In comparison, the cache solution described in this post requires fewer system resources. Developers familiar with 3GL coding will also likely appreciate the loop condition construct as a more natural way to solve the challenge.
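For comparison, that traditional split/aggregate/join pattern looks roughly like this in the same plain-Python terms (the sample data is again made up for illustration):

```python
# Sample detail rows, one per account: (customer, balance)
rows = [("c1", 100.0), ("c1", 300.0), ("c2", 50.0)]

# "Aggregator" branch: total balance per customer
totals = {}
for cust, bal in rows:
    totals[cust] = totals.get(cust, 0.0) + bal

# "Join" branch: re-attach each group's total to every detail row,
# then calculate the percentage
output = [(cust, bal, bal / totals[cust] * 100) for cust, bal in rows]
```

The extra branch and the join keyed on every group's total are what the cache solution avoids: the cache only ever holds the current group in memory.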
OK, so now here's a challenge for you. I have a hardcopy of the redbook "Deploying a Grid Solution with IBM InfoSphere Information Server" that I'll send to the first person who posts a comment to this thread describing a variation on this cache solution that requires one less stage variable (I have one in mind ... perhaps you'll think of another).
Also, stay tuned for the next post on vertical pivoting.