Online Transactional and Analytics Processing using SPSS, DB2 z/OS, DB2 Analytics Accelerator
by Roland Seiffert, IBM System z-based Workload-optimized Solutions, IBM Systems Optimization Competency Center
IBM System z has a clear leadership position in enterprise data management and online transaction processing (OLTP). However, data warehousing, data mining and business analytics are in many cases implemented on distributed systems. Separating data management for OLTP from online analytical processing (OLAP) causes significant problems for clients running these systems. Primary data that originates in OLTP must be replicated to the distributed system for OLAP, which results in complex ETL setups, massive data movement, security issues around data access, and the unavailability of real-time OLTP data in OLAP as well as of historical data in OLTP. For example, predictive analytics for payment fraud detection requires access to both historical data (observed patterns in the credit card usage of a particular client) and real-time data (the current transaction in flight and the last few transactions for the card and/or client). Scenarios like this are typical of what we call online transactional analytics processing (OLTAP).
IBM has started to implement a strategic agenda to turn System z into the leading platform for OLTAP. A few examples are included below.
I’m part of an organization called the IBM Systems Optimization Competency Center (SOCC). My team’s role is focused on advancing the System z business analytics strategy through optimized integration. We are currently working on an initiative that investigates an important next step in the realization of the System z OLTAP vision. In that vision, the data for OLTP and OLAP resides on the same System z server, business analytics processing is done against DB2, potentially accelerated by DB2 Analytics Accelerator, and predictive analytics (scoring) is executed within DB2.
Today it’s possible for SPSS Modeler to run on zLinux and access data in DB2 for modeling. However, this approach requires that large amounts of data be moved from DB2 to SPSS Modeler so that the compute-intensive mining algorithms can run on the data.
The key question becomes: Can we enhance the performance of data mining and modeling using SPSS Modeler by applying acceleration technologies? The answer is yes. The basic idea is to push down the execution of the data mining algorithms to the data, in this case to IBM DB2 Analytics Accelerator. The approach promises huge performance gains because it does the following:
Further performance optimizations can be applied, such as keeping temporary data (intermediate results) when executing an SPSS stream inside the IBM DB2 Analytics Accelerator.
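The pushdown idea can be illustrated with a minimal sketch. It uses Python's standard-library sqlite3 module purely as a stand-in for DB2 and the DB2 Analytics Accelerator. The table, the column, the bin width and the helper functions are hypothetical and are not taken from the SOCC prototype. The sketch compares pulling every row to the client with binning inside the database, where the intermediate result stays in a temporary table and only the small summary is returned.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create an in-memory table standing in for a large database table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions (id, amount) VALUES (?, ?)",
        ((i, float(i % 500)) for i in range(n_rows)),
    )
    return conn


def histogram_client_side(conn, bin_width):
    """Fetch every row to the client and bin locally (the data-movement-heavy path)."""
    rows = conn.execute("SELECT amount FROM transactions").fetchall()
    counts = {}
    for (amount,) in rows:
        bin_id = int(amount // bin_width)
        counts[bin_id] = counts.get(bin_id, 0) + 1
    return counts, len(rows)  # second value: number of rows moved to the client


def histogram_pushed_down(conn, bin_width):
    """Bin inside the database; the intermediate result stays there as a temp table."""
    conn.execute("DROP TABLE IF EXISTS temp.binned")
    conn.execute("CREATE TEMP TABLE binned (bin INTEGER)")
    # Discretization step: runs where the data lives, nothing leaves the database.
    conn.execute(
        "INSERT INTO binned SELECT CAST(amount / ? AS INTEGER) FROM transactions",
        (bin_width,),
    )
    # Only the aggregated histogram (one row per bin) is returned to the client.
    rows = conn.execute("SELECT bin, COUNT(*) FROM binned GROUP BY bin").fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    local_counts, moved_local = histogram_client_side(conn, 100)
    pushed_counts, moved_pushed = histogram_pushed_down(conn, 100)
    print("same histogram:", local_counts == pushed_counts)
    print("rows moved to client:", moved_local, "vs", moved_pushed)
```

Both functions produce the same histogram, but the pushed-down version returns one row per bin instead of one row per record. SQLite offers no parallelism or acceleration of its own, so the sketch shows only the data-movement side of the argument, not the speedups discussed here.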
In addition to significant performance enhancements, we expect to dramatically improve scalability. In today's implementation, the practical difficulty of transferring massive amounts of data to the data analysis and mining application forces data analysts to work with only subsets of the available data (samples). By applying highly scalable and parallel algorithms directly on the data nodes of a cluster, the full data sets can be used in place for modeling, even in interactive data analysis. This will improve the overall quality of the resulting models.
The SOCC proved the feasibility of the approach by building a working prototype that allows a data analyst to use SPSS Modeler to create a “modeling stream”, push down the stream execution via DB2 to DB2 Analytics Accelerator, generate a model, export the model to DB2 and deploy the model with the SPSS Modeler Scoring Adapter in DB2 for z/OS.
This demonstrates how the workflow involved in data analysis and mining performed by a data analyst using SPSS Modeler can be transparently optimized to take advantage of the DB2 Analytics Accelerator, thereby significantly improving performance and scalability.
The prototype created by the SOCC was used for a performance evaluation of the approach for a number of basic operations regularly involved in predictive analytics. For interactive data analysis, SPSS Modeler offers support like the “Data Audit” node, which generates histograms and statistical summaries for the analyzed data. In the prototype, the existing SPSS Modeler took two-and-a-half hours to process a table with 15 million records, while the IBM DB2 Analytics Accelerator was able to do comparable computations in less than two minutes, with a single histogram on all data being processed in five seconds.
For simple data transformations like data discretization, the size of the processed table has no significant impact on the overall performance of the IBM DB2 Analytics Accelerator up to a certain point: it takes 9.7 seconds to process 100,000 rows and 10.6 seconds to process 2.5 million rows. Even a table with 250 million rows is processed in 45 seconds. In contrast, SPSS Modeler Server execution time is directly proportional to the number of processed rows. For the simple data transformation scenario, we measured the IBM DB2 Analytics Accelerator to be eight times faster when processing 100,000 rows and more than 400 times faster when processing 10 million rows.
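As a rough consistency check, the figures quoted above can be combined in a few lines of Python. This is back-of-the-envelope arithmetic, not additional measurement. It treats "eight times" and "400 times" as exact factors and assumes strict proportionality for SPSS Modeler Server, so the derived values are only approximate.

```python
# Accelerator timings quoted in the text (rows -> seconds).
accel_seconds = {100_000: 9.7, 2_500_000: 10.6, 250_000_000: 45.0}

# Rows processed per second by the accelerator at each table size.
throughput = {rows: rows / secs for rows, secs in accel_seconds.items()}

# Implied SPSS Modeler Server time at 100,000 rows, from the quoted 8x factor.
spss_100k = 8 * accel_seconds[100_000]            # about 77.6 s

# Assumption: execution time scales strictly linearly with row count.
spss_10m = spss_100k * (10_000_000 / 100_000)     # about 7,760 s, roughly 2.2 hours

# "More than 400 times faster" then bounds the accelerator time at 10 million rows.
accel_10m_upper_bound = spss_10m / 400            # about 19.4 s

if __name__ == "__main__":
    for rows, rate in throughput.items():
        print(f"{rows:>11,} rows: {rate:,.0f} rows/s on the accelerator")
    print(f"implied SPSS Modeler Server time at 10 million rows: {spss_10m:,.0f} s")
    print(f"implied accelerator time at 10 million rows: under {accel_10m_upper_bound:.1f} s")
```

The implied bound of about 19.4 seconds for 10 million rows falls between the quoted 10.6 seconds for 2.5 million rows and 45 seconds for 250 million rows, so the quoted numbers are mutually consistent under these assumptions.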
To create predictive models like decision trees or association models, SPSS Modeler is, not surprisingly, faster for small tables. The DB2 Analytics Accelerator outperforms SPSS Modeler once the input data is large enough. For example, it was 6.2 times faster than SPSS Modeler on an IBM retail blueprint-based scenario with over 15 million input rows.
The SOCC initiative has clearly shown that analytic processing can be optimized on the System z platform in a way that creates an ideal environment for OLTAP. The tight integration between transactional and analytical processing is a unique capability of the System z platform that provides significant value to clients. IBM will continue to enhance the implementation of this vision by investing in new technology like the work described here.
Please contact me, Roland Seiffert, via LinkedIn if you would like more information on this work. My team and I would appreciate any feedback, comments or suggestions on the topic.
Roland is the technical leader of the Böblingen, Germany division of a new IBM organization called the Systems Optimization Competency Center. He is focused on advancing the IBM System z business analytics strategy through optimized integration. Recently, Roland was part of the zEnterprise core design team, acting as the architect for the hypervisor integration and management for zBX-integrated systems. He has also been lead architect for hybrid technologies for Linux on System z — with a focus on leveraging heterogeneous and host/optimizer structures in hybrid workloads. Connect with Roland on LinkedIn.