This week I was speaking with an organization that is considering a data integration purchase and they asked a rather common question ...
Why is Information Server unique when compared to <<insert competitor here>>?
For those of us who have followed the various technologies in market for years (15 actually... we just had our birthday in January), it's a fun question to answer. I especially like getting that question these days, since over the last two years IBM has done some unique things that bring a wealth of value to our customers and truly differentiate our suite. While there are several ways we can take that discussion, the question prompted me to get out the blog tonight so I can share some thoughts on one very specific area of differentiation ... data integration styles.
So, how many ways are there to move data?
You'd be surprised.... there really are several. As you can imagine, these align to the ultimate business goal an organization is trying to achieve. For instance, if you are interested in a topic like "real-time data warehousing" then you will need to look at a very different set of features than someone who may be looking at a project like "an application migration". Organizations that are looking for a data integration platform that can serve as the middleware layer supporting a diverse set of projects may need to consider a set of capabilities that reach beyond the current project at hand.
There are several ways that we can logically divide data integration styles, but my top level swags are always "batch" vs "real-time". Those categories are almost universally understood, so they give a good construct with which to then dive into more details. In the following sections we'll look at some details on batch style processing, and then turn to the real-time styles.
Batch processing styles
Type 1: Batch
This is where it all began. When data warehousing was growing up, batch style ETL processing was the only game in town. This is still the most popular style of processing, supporting thousands of organizations' data integration focused projects. Batch processing has the most demanding performance and scalability requirements. It requires that ETL platforms not only run native bulk operations against source and target DBMS systems, but also leverage a variety of techniques for parallelism. Many ETL tools in the market will support data pipelining (think of an assembly line with each step being a different process). The market leaders will also support another style of parallelism ... data partitioning (think of multiple assembly lines). Data partitioning allows you to divide the data that must be processed into a number of smaller sets that can be worked on independently. DataStage provides some unique features in this area, including a shared nothing architecture for true scalability, as well as a concept known as dynamic repartitioning. Dynamic repartitioning (at least the way IBM means it.... don't let your other ETL vendor fool you) is the ability to provide options at runtime to determine the degree of data partitioning that will be run across any set of hardware ... SMP, MPP or grid. Without the ability to do this, an organization that finds they have a large file to process will likely need to go back to their development environment and redesign their job to now run N-way instead of M-way. This can be a costly proposition and one that Information Server customers don't have to worry about.
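To make the "multiple assembly lines" idea a little more concrete, here is a minimal Python sketch of data partitioning where the degree of parallelism is just a runtime argument. This is not how the DataStage engine is implemented; the function names (partition, total_by_customer, run_job) and the sample rows are made up for illustration, and Python threads stand in for what would really be separate processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition(rows, key, degree):
    """Hash-partition rows into `degree` independent buckets by key."""
    buckets = [[] for _ in range(degree)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        buckets[crc32(str(row[key]).encode()) % degree].append(row)
    return buckets


def total_by_customer(bucket):
    """The per-partition work: sum amounts per customer."""
    totals = {}
    for row in bucket:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def run_job(rows, degree):
    """Same job logic for any degree; only the runtime setting changes."""
    buckets = partition(rows, "customer", degree)
    with ThreadPoolExecutor(max_workers=degree) as pool:
        partials = list(pool.map(total_by_customer, buckets))
    merged = {}
    for partial in partials:
        merged.update(partial)  # safe: each customer lives in exactly one bucket
    return merged


rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
    {"customer": "c", "amount": 1},
]

two_way = run_job(rows, 2)
four_way = run_job(rows, 4)
```

The point of the sketch is that run_job(rows, 2) and run_job(rows, 4) use exactly the same design and give the same answer. Going from M-way to N-way is a parameter change, not a redesign.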
Type 2: Micro-Batch
"Micro-batch" is running typical batch style designs on a periodic basis. In some cases, organizations will be interested in micro-batch as a first step toward making their business project closer to real-time... perhaps running these jobs every 10 minutes. In other cases, the organization will have a very specific use case in mind. The classic example when the organization receives files on a regular basis throughout the day and they need to process whatever has arrived as soon as they can. These projects will also generally require more parameterization of the data sources, targets and even data schemas (which is a great use case for DataStage's runtime column propagation). Also, we shouldn't assume that the data volume will be small just because we are running every few minutes. Particularly in certain industries we find that the data sources may still be very large, so there is still a great demand for a performant and scalable data integration engine. One unique feature of DataStage in this regard is to wildcard scan for the number of data sources that are now available and then scale that out over N degrees of parallelism. Finally, there are unique scheduling and workflow requirements around triggering activity based on file arrival or sleeping for a specified period of time. The DataStage sequencer provides these features to support the micro-batch scenario.
Type 3: ELT (or mixed TETLT)
The industry has settled the long war of ELT vs ETL and recognized that each technique has a set of scenarios where it is optimal. I have two scenarios that I use to describe this. Scenario 1 is the conversion of an XML file to a CSV. There is no good reason to move this information into a database only to extract it again - this is clearly predisposed to ETL. Scenario 2 is the creation of an aggregate table from a large detail table. There's no good reason to extract this data from the database since the system is optimized to do this type of transformation and an ETL topology will introduce a good deal of network latency. So, you can imagine that "data proximity" is a good yardstick for the processing type that may be your best fit. The right data integration technology will be able to provide both ETL and ELT and allow developers to switch easily between the two. The Information Server implementation (called "Balanced Optimization") is different from other vendors' in that it leverages a single design canvas and single set of widgets/stages to represent both paradigms. This means that to go from ETL to ELT there is no redesign work required - the user simply hits the "optimize" button and they will have the ELT version of their logic. Information Server is also unique in its blended approach that allows the user to push processing to the source db, target db, or both, as well as subset some processing on the ETL server where required (like our integrated quality components).
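Scenario 2 is easy to show side by side. The sketch below uses Python's built-in sqlite3 module as a stand-in database; the table and column names (sales_detail, sales_agg_etl, sales_agg_elt) are invented for the example, and it says nothing about how Balanced Optimization generates its SQL. It simply builds the same aggregate both ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7)],
)

# ETL style: extract the detail rows, aggregate outside the database, load back
etl_totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales_detail"):
    etl_totals[region] = etl_totals.get(region, 0) + amount
conn.execute("CREATE TABLE sales_agg_etl (region TEXT, total INTEGER)")
conn.executemany("INSERT INTO sales_agg_etl VALUES (?, ?)", list(etl_totals.items()))

# ELT style: push the same logic down as SQL; no detail rows leave the database
conn.execute(
    "CREATE TABLE sales_agg_elt AS "
    "SELECT region, SUM(amount) AS total FROM sales_detail GROUP BY region"
)

etl_rows = conn.execute("SELECT * FROM sales_agg_etl ORDER BY region").fetchall()
elt_rows = conn.execute("SELECT * FROM sales_agg_elt ORDER BY region").fetchall()
```

Both tables end up with the same totals. The difference is where the work happens: in the ETL version every detail row crosses the connection, while in the ELT version only a single statement does.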
Real-time processing styles
Type 4: Change Data Capture integration
Change Data Capture (CDC) technology monitors DBMS inserts, updates, and deletes so they can be replicated to another data store. The InfoSphere CDC technology minimizes impact to the DBMS by triggering off of what is written to the database log files without any requirement to further stage the information. Organizations have historically adopted CDC technologies primarily to support Real-Time Data Warehousing, often in support of web based store fronts and call center portals. There is now also a growing interest in adopting CDC technology to replace portions of batch processing that were focused on finding deltas between full volume extracts. CDC offers a great value proposition for minimizing resource investments in these cases. These delta records can be piped directly into a data integration/quality job so they can be transformed and cleansed to the extent required. Information Server can then write these to a heterogeneous set of data targets while providing two-phase commit support to guarantee delivery of the data. The integration of data replication, data transformation and data quality into a single runtime architecture (for complete scalability), metadata store (for complete lineage), and design environment (to maximize time to value) is unparalleled.
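For readers who haven't worked with delta records before, here is a minimal Python sketch of what replaying captured changes against a target looks like. The record shape (operation, key, row) and the apply_changes function are assumptions made up for this example; they are not the InfoSphere CDC record format or API, and a plain dictionary stands in for the target data store.

```python
def apply_changes(target, changes, transform=lambda row: row):
    """Replay captured inserts, updates and deletes against a keyed target."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = transform(row)  # cleanse/transform on the way in
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical target table and a short stream of captured changes
target = {1: {"name": "ANN"}, 2: {"name": "BOB"}}
changes = [
    ("update", 1, {"name": "anne"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "cy"}),
]

# Example transformation step: standardize names to upper case
apply_changes(target, changes, transform=lambda row: {"name": row["name"].upper()})
```

Only three records were touched here, no matter how large the target is. That is the contrast with the older batch approach of pulling two full volume extracts and comparing them to find the deltas.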
Type 5: Message-queue based integration
For many years IBM has been combining data integration and application integration software to provide unique solutions. Organizations use WebSphere MQ (the market-leading technology) for delivering messages between their applications and other information initiatives. For many years we have been working with organizations who recognized our development tooling as providing a great time-to-value for satisfying complex transformation requirements in this domain. Tens of millions of dollars are transferred using Information Server every day for the largest financial institutions in this fashion. Information Server connects directly to one or more queues as sources and then leverages the full complement of integration and quality components to transform and cleanse this information. Similar to the CDC scenario, this data can then be committed with guaranteed delivery to a heterogeneous set of targets. We refer to this as our distributed transaction stage and it can target all of the leading DBMS vendors along with a range of other data stores. I feel confident with the bold statement that no other ETL vendor is as well integrated with message based technologies.
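The all-or-nothing behavior is the part worth illustrating. The Python sketch below is a deliberately simplified stand-in, not the distributed transaction stage: an in-memory queue.Queue plays the role of the message queue, and the two "targets" (ledger and audit, both invented names) are tables in one SQLite database so that a single local transaction can cover them. A genuine distributed transaction across separate systems needs a two-phase commit coordinator, which is not shown here.

```python
import queue
import sqlite3


def drain(source, conn):
    """Write each message to both targets; a message commits fully or not at all."""
    delivered, rejected = 0, []
    while True:
        try:
            msg = source.get_nowait()
        except queue.Empty:
            return delivered, rejected
        try:
            with conn:  # commits on success, rolls back if any insert fails
                conn.execute(
                    "INSERT INTO ledger VALUES (?, ?)", (msg["id"], msg["amount"])
                )
                conn.execute("INSERT INTO audit VALUES (?)", (msg["id"],))
            delivered += 1
        except sqlite3.Error:
            rejected.append(msg)  # keep the message for retry or inspection


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE audit (id INTEGER PRIMARY KEY)")

source = queue.Queue()
for msg in ({"id": 1, "amount": 100}, {"id": 2, "amount": 250}, {"id": 1, "amount": 999}):
    source.put(msg)

delivered, rejected = drain(source, conn)
ledger_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
audit_count = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
```

The third message succeeds on the ledger insert but fails on the audit insert because of the duplicate key, so the whole message is rolled back and set aside. The ledger never ends up with a row that the audit table doesn't know about.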
Type 6: Information Services
Information Services is a form of data virtualization, providing a common access layer to any type of data source and information processing. A key advantage of adopting such a virtualization strategy is that it insulates the end user from variations in the source technologies and thus accelerates their ability to deliver business value. Services should be accessible through a variety of invocation bindings - EJB, HTTP, JMS, REST, RSS - and in a variety of formats for the consuming applications - such as SOAP, Text, and XML. Over the last 8 years that Information Server has provided these features, some companies have adopted information services for very specific requirements, such as to provide a person matching function within an ERP or MDM solution, while others are building out an information fabric for distributed and mainframe data access. Since our information services are fully integrated with our design and runtime environment, information services can include dozens of data source types (or "information service providers") for dbms, sequential files, xml, vsam, message queue, Hadoop, ERP, etc... and any of our transformation, profiling and cleansing components. This category can be broken down further into three topologies, each of which has its own targeted use cases:
- input/output service: services that are always running/active and can receive an input row at any moment in order to deliver an output result. This topology is typically used to process high volumes of smaller transactions where response time is important. It is tailored to process many small requests rather than a few large requests.
- output service: the service call invokes a job to run where the return values can consist of an atomic value (one column), a structure (multiple columns), or an array of structures (multiple rows). These jobs typically initiate a batch process from a real-time process that requires feedback or data from the results. It is designed to process large data sets and is capable of accepting job parameters as input arguments.
- service triggered batch jobs: the service call invokes a batch job to run on demand. Organizations who use this may be integrating into a business process where an authorized user triggers an event that then launches the delivery of data to a downstream process.
So, how many data integration styles do I need?
Within any single project you are probably looking at just one or two of these. As the next several projects start rolling, you will probably start to consider where growing into the other data integration styles can help accelerate business value for your organization. A key factor in capitalizing on this value is having the ability to leverage the existing tools, infrastructure and skill set you have already established. Information Server provides you an on ramp for not only this set of capabilities, but also a wealth of others around scalability, connectivity, profiling, cleansing, quality monitoring, metadata, governance, etc... If you are interested in learning more about these features and the flexibility of Information Server to handle this breadth of data integration scenarios, you can learn more in our Information Center or send me an email. Always happy to talk anytime.