Unleashing the Power in DataStage 8.5 Issue #4: Scalable and Intuitive XML
You know the story of the ugly duckling? A mother duck's eggs all hatch and one of the baby "ducks" is rather odd looking. The bird is teased for being different. However, after winter ends and spring arrives, the bird has matured into a beautiful swan.
If you have used the previous DataStage XML pack, you would probably agree it was a little awkward; XML in ETL tools has never been simple. It could handle moderate to complex requirements, but it demanded a lot from the user during the job design process. Additionally, since it relied on a standard industry parser, it wasn't easily scalable.
The perspective on 8.5 is very different... winter has ended and spring has arrived. However, I don't want to call the new XML pack a "swan". Sure, it includes a very attractive UI component, but its runtime engine component is also incredibly powerful. So, if you don't mind me mixing bird metaphors, let me call it a "phoenix" and I'll use this blog to focus on both aspects... pretty and packing a punch.
New UI Components - Schema Library Manager and Assembly Editor
The 8.5 XML pack has been built specifically with hierarchical data in mind. This means that the various components of the data integration process - metadata representation, data operations, root selection, field mappings, etc. - all present the hierarchical model in a very intuitive way in order to simplify the job design process and thus accelerate the user's time to value.
Pictured to the right is the "Schema Library Manager" (please click on the pic to see it - the blog emulation doesn't scale well). In 8.5 the xsd files are imported into the repository in their native format, which ensures that the metadata is preserved entirely. The UI has been enhanced to show the tree structure and the extended attributes and facets for each node of the hierarchy. Users may build multiple "libraries" in order to group related xsds into the most logical buckets. The Schema Library Manager is accessible from both the main Designer menu and from within the Assembly Editor (discussed below), presenting an integrated path from metadata import into job design.
Once the metadata has been imported, the developer works in the XML Stage's "Assembly Editor" to define the logic for the job. An "assembly" defines a series of steps/tools that parse, compose, and transform hierarchical data. That last point is one I should stress: these tools are designed to work within the hierarchical data. That means a join can combine two XML data streams at a specific level of the hierarchy. This enables very complex integration tasks to be handled very simply.
Pictured to the left are the Assembly Outline and Assembly Palette. The palette contains each of the tools that can be selected by the user, who drags any of these tools directly into the outline. Jobs that read an xml document use the XML_Parser step to extract the required fields from the doc. Jobs that write data use the XML_Composer step to form the new xml. Additionally, a user may use both of these tools within a single assembly in order to transform an xml doc from one form to another - without ever having to convert the data into a relational format.
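The parse-then-compose pattern can be sketched outside DataStage with a few lines of Python (again, purely illustrative - the element names are invented and this is not the product's API): one document shape is read in, and a different shape is emitted, with no relational intermediate.

```python
# Illustrative sketch only: transforming one XML shape into another
# without an intermediate relational form, in the spirit of chaining
# an XML_Parser step into an XML_Composer step.
import xml.etree.ElementTree as ET

source = """\
<orders>
  <order id="1"><customer>Acme</customer><total>250.00</total></order>
  <order id="2"><customer>Globex</customer><total>99.50</total></order>
</orders>"""

root = ET.fromstring(source)          # the "parse" half

invoices = ET.Element("invoices")     # the "compose" half: a new hierarchy
for order in root.findall("order"):
    inv = ET.SubElement(invoices, "invoice", number=order.get("id"))
    ET.SubElement(inv, "billTo").text = order.findtext("customer")
    ET.SubElement(inv, "amount").text = order.findtext("total")

print(ET.tostring(invoices, encoding="unicode"))
```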
As you would expect, each step contains a set of properties that are necessary to define the logic for that stage. For example, the Sort step requires that the user select the level of the hierarchy that is to be sorted (i.e., what level of the doc is being sorted) and the key field within that scope. Pictured to the right is the Mappings section for the Output step (in this scenario we are converting the XML data into a relational stream). At the output link level, the user specifies the primary scope element from the hierarchy and can then either map each field individually or use the Auto Map feature, which selects the best candidate column through a context-sensitive scoring algorithm. You can find a much more comprehensive view of these features in our publicly available Information Center.
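What "choosing a scope element" means for the relational output can be shown with a small hand-rolled Python sketch (hypothetical element names; this is the concept, not the Output step's implementation): the scope element determines row granularity, and fields from enclosing levels are repeated onto each row.

```python
# Illustrative sketch only: flattening a hierarchy into rows by picking
# a "scope" element, in the spirit of the Output step's Mappings section.
import xml.etree.ElementTree as ET

doc = """\
<customers>
  <customer name="Acme">
    <order id="1" total="250.00"/>
    <order id="2" total="99.50"/>
  </customer>
</customers>"""

root = ET.fromstring(doc)
rows = []
# Scope = <order>: one output row per order, with the parent customer's
# name repeated on every row, as a relational stream requires.
for customer in root.findall("customer"):
    for order in customer.findall("order"):
        rows.append((customer.get("name"),
                     order.get("id"),
                     order.get("total")))

print(rows)  # [('Acme', '1', '250.00'), ('Acme', '2', '99.50')]
```

Choosing `<customer>` as the scope instead would yield one row per customer, which is why selecting the right scope element is the first decision in the mapping.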
Engine Runtime Component
Like the user interface, the engine component has been built specifically for processing hierarchical data. Rather than relying on open source parsing technology, which has historically had scalability issues, IBM has engineered a highly performant engine component for this solution. Some of the distinguishing features of this runtime are:
Streaming – Support for any document size without specific memory requirements
Parallelism – Multi-threading (multiple threads in a single stage), partitioning (DataStage partition parallelism can run multiple instances of the stage on multiple cpus/hosts), and document partitioning (partition parallelism for a single large document)
Optimized – Compile-time optimization to remove unnecessary processing and data, plus runtime scratch-disk optimizations
Large Object support – Can stream large objects from source Connectors through XML stage and to target Connectors
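The streaming point deserves a concrete picture. DataStage's engine is proprietary, but the principle - process a document of any size without holding it in memory - can be demonstrated with Python's standard-library incremental parser. This is a sketch of the idea only, not the product's mechanism.

```python
# Illustrative sketch of streaming: iterparse visits elements as the
# input is read, and we clear each one after use, so memory stays flat
# regardless of document size.
import io
import xml.etree.ElementTree as ET

# Simulate a large document arriving as a byte stream.
big_doc = io.BytesIO(
    b"<log>"
    + b"".join(b"<event id='%d'/>" % i for i in range(100_000))
    + b"</log>"
)

count = 0
for _, elem in ET.iterparse(big_doc, events=("end",)):
    if elem.tag == "event":
        count += 1
    elem.clear()  # release the parsed content immediately

print(count)  # 100000
```

A DOM-style parser would have to materialize all 100,000 elements before the first field could be read; the streaming approach touches each one and lets it go, which is what makes "any document size without specific memory requirements" possible.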
Customers who have already adopted this XML pack have had truly fantastic results. In comparison to using a standard industry parser, one customer found that without even implementing any of the parallelism techniques, they were already able to process the same use case in 1/9th of the time and with a fraction of the resource requirements. Another customer (known for processing very large data) did something of a bake-off by running gigabytes of XML data through both the new XML stage and custom code they had previously written for their use case. While both solutions performed similarly, the DataStage features were compelling since the performance came without having to invest significant time in tuning custom code that only applied to a single solution.
If you have the need to process large and complex XML, I encourage you to take a look at DataStage 8.5. If you're looking for more information, explore the Information Center link I supplied above or Ernie Ostic's "Real-Time Data Integration" blog. Ernie is an expert resource on several topics and has a wealth of material available on his site related to XML.
No waterfowl or mythological winged creatures were harmed in the writing of this blog.