LSPR workload categories
Historically, LSPR workload capacity curves (primitives and mixes) have had application names or been identified by a software characteristic. For example, past workload names have included CICS, IMS, OLTP-T, CB-L, LoIO-mix and TI-mix. However, capacity performance has always been more closely associated with how a workload uses and interacts with a particular processor hardware design. With the availability of CPU MF (SMF 113) data on z10, the ability to gain insight into the interaction of workload and hardware design in production workloads has arrived. The knowledge gained is still evolving, but the first step in the process is to produce LSPR workload capacity curves based on the underlying hardware sensitivities. Thus the LSPR introduces three new workload capacity categories which replace all prior primitives and mixes.
Fundamental Components of Workload Capacity Performance
Workload capacity performance is sensitive to three major factors: instruction path length, instruction complexity, and memory hierarchy. Let us examine each of these three.
Instruction Path Length
A transaction or job will need to execute a set of instructions to complete its task. These instructions are composed of various paths through the operating system, subsystems and application. The total count of instructions executed across these software components is referred to as the transaction or job path length. Clearly, the path length will be different for each transaction or job depending on the complexity of the task(s) that must be performed. For a particular transaction or job, the application path length tends to stay the same presuming the transaction or job is asked to perform the same task each time. However, the path length associated with the operating system or subsystem may vary based on a number of factors including: a) competition with other tasks in the system for shared resources – as the total number of tasks grows, more instructions are needed to manage the resources; b) the Nway (number of logical processors) of the image or LPAR – as the number of logical processors grows, more instructions are needed to manage resources serialized by latches and locks.
Instruction Complexity
The type of instructions and the sequence in which they are executed will interact with the design of a micro-processor to affect a performance component we can define as “instruction complexity.” There are many design alternatives that affect this component such as: cycle time (GHz), instruction architecture, pipeline, superscalar, out-of-order execution, branch prediction and others. As workloads are moved between micro-processors with different designs, performance will likely vary. However, once on a processor this component tends to be quite similar across all models of that processor.
Memory Hierarchy and “Nest”
The memory hierarchy of a processor generally refers to the caches (previously referred to as HSB or High Speed Buffer), data buses, and memory arrays that stage the instructions and data needed to be executed on the micro-processor to complete a transaction or job. There are many design alternatives that affect this component such as: cache size, latencies (sensitive to distance from the micro-processor), number of levels, MESI (management) protocol, controllers, switches, number and bandwidth of data buses and others. Some of the cache(s) are “private” to the micro-processor which means only that micro-processor may access them. Other cache(s) are shared by multiple micro-processors. We will define the term memory “nest” for a System z processor to refer to the shared caches and memory along with the data buses that interconnect them.
Workload capacity performance will be quite sensitive to how deep into the memory hierarchy the processor must go to retrieve the workload’s instructions and data for execution. Best performance occurs when the instructions and data are found in the cache(s) nearest the processor so that little time is spent waiting prior to execution; as instructions and data must be retrieved from farther out in the hierarchy, the processor spends more time waiting for their arrival.
As workloads are moved between processors with different memory hierarchy designs, performance will vary as the average time to retrieve instructions and data from within the memory hierarchy will vary. Additionally, once on a processor this component will continue to vary significantly as the location of a workload’s instructions and data within the memory hierarchy is affected by many factors including: locality of reference, IO rate, competition from other applications and/or LPARs, and more.
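The effect of retrieval depth on waiting time can be illustrated with a small weighted-average calculation. This is only a sketch: the level names and latency values below are invented for illustration and do not describe any real processor.

```python
def average_fetch_cycles(source_fractions, latency_cycles):
    """Expected cycles to resolve one cache miss, given where misses are sourced.

    source_fractions: mapping of hierarchy level -> fraction of misses served there
    latency_cycles:   mapping of hierarchy level -> cycles to fetch from that level
    """
    total = sum(source_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("source fractions must sum to 1")
    return sum(frac * latency_cycles[level]
               for level, frac in source_fractions.items())


# Hypothetical latencies in cycles; illustrative values, not measurements.
LATENCY = {"shared_cache": 20, "remote_cache": 100, "memory": 300}

# Two hypothetical workloads: one resolved mostly near the processor,
# one that reaches farther out into the hierarchy more often.
near = {"shared_cache": 0.90, "remote_cache": 0.08, "memory": 0.02}
far = {"shared_cache": 0.60, "remote_cache": 0.25, "memory": 0.15}

near_cycles = average_fetch_cycles(near, LATENCY)
far_cycles = average_fetch_cycles(far, LATENCY)
```

With these made-up numbers the second workload waits longer per miss on average, purely because more of its references are served from farther out.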
Relative Nest Intensity
The most performance-sensitive area of the memory hierarchy is the activity to the memory nest, namely, the distribution of activity to the shared caches and memory. We introduce a new term, “Relative Nest Intensity (RNI),” to indicate the level of activity to this part of the memory hierarchy. Using data from CPU MF, the RNI of the workload running in an LPAR may be calculated. The higher the RNI, the deeper into the memory hierarchy the processor must go to retrieve the instructions and data for that workload.
Many factors influence the performance of a workload. However, for the most part what these factors are influencing is the RNI of the workload. It is the interaction of all these factors that results in a net RNI for the workload, which in turn directly relates to the performance of the workload.
The traditional factors that have been used to categorize workloads in the past are listed along with their RNI tendency in figure 5.
It should be emphasized that these are simply tendencies and not absolutes. For example, a workload may have a low IO rate, intensive CPU use, and a high locality of reference – all factors that suggest a low RNI. But, what if it is competing with many other applications within the same LPAR and many other LPARs on the processor which tend to push it toward a higher RNI? It is the net effect of the interaction of all these factors that determines the RNI of the workload which in turn greatly influences its performance.
Note that there is little one can do to affect most of these factors. An application type is whatever is necessary to do the job. Data reference pattern and CPU usage tend to be inherent in the nature of the application. LPAR configuration and application mix are mostly a function of what needs to be supported on a system. IO rate can be influenced somewhat through buffer pool tuning.
However, one factor that can be affected, software configuration tuning, is often overlooked but can have a direct impact on RNI. Here we refer to the number of address spaces (such as CICS AORs or batch initiators) that are needed to support a workload. This factor has always existed but its sensitivity is higher with today’s high frequency microprocessors. Spreading the same workload over a larger number of address spaces than necessary can raise a workload’s RNI as the working set of instructions and data from each address space increases the competition for the processor caches. Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance. In the LSPR, we tune the number of address spaces for each processor type and Nway configuration to be consistent with what is needed to support the workload. Thus, the LSPR workload capacity ratios reflect a presumed level of software configuration tuning. This suggests that re-tuning the software configuration of a production workload as it moves to a bigger or faster processor may be needed to achieve the published LSPR ratios.
Calculating Relative Nest Intensity
The RNI of a workload may be calculated using CPU MF data. For z10, three factors are used:
- L2LP: percentage of L1 misses sourced from the local book L2 cache
- L2RP: percentage of L1 misses sourced from a remote book L2 cache
- MEMP: percentage of L1 misses sourced from memory.
These percentages are multiplied by weighting factors and the result divided by 100. The formula for z10 is:
Tools available from IBM (zPCR) and several vendors can extract these factors from CPU MF data. For z196 and zEC12 the CPU MF factors needed are:
- L3P: percentage of L1 misses sourced from the shared chip-level L3 cache
- L4LP: percentage of L1 misses sourced from the local book L4 cache
- L4RP: percentage of L1 misses sourced from a remote book L4 cache
- MEMP: percentage of L1 misses sourced from memory
The formula for z196 is:
The formula for zEC12 is:
The formula for z13 is:
The formula for z14 is:
Note these formulas may change in the future.
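The general shape of the calculation described in this section (each percentage multiplied by a weighting factor, the products summed, and the result divided by 100) can be sketched as follows. The function name and the weights shown are placeholders chosen for illustration; they are not the published weighting factors for any processor, and the sketch does not attempt to model any additional terms a given formula may contain.

```python
def relative_nest_intensity(percentages, weights):
    """Weighted sum of CPU MF sourcing percentages, divided by 100.

    percentages: mapping such as {"L2LP": ..., "L2RP": ..., "MEMP": ...}
    weights:     mapping with one weighting factor per percentage name
    """
    missing = set(weights) - set(percentages)
    if missing:
        raise KeyError(f"missing percentages: {sorted(missing)}")
    return sum(weights[name] * percentages[name] for name in weights) / 100.0


# Placeholder weights, NOT the real factors for any machine.
PLACEHOLDER_WEIGHTS = {"L2LP": 1.0, "L2RP": 2.0, "MEMP": 4.0}

# Hypothetical sourcing percentages for one LPAR.
sample = {"L2LP": 80.0, "L2RP": 15.0, "MEMP": 5.0}

rni = relative_nest_intensity(sample, PLACEHOLDER_WEIGHTS)
```

Keeping the weights in a separate mapping per processor family makes it straightforward to swap in the appropriate published factors, which matters because the formulas differ by machine and may change.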
LSPR Workload Categories Based on Relative Nest Intensity
As discussed above, a workload’s relative nest intensity is the most influential factor that determines workload performance. Other more traditional factors such as application type or IO rate have RNI tendencies, but it is the net RNI of the workload that is the underlying factor in determining the workload’s capacity performance. With this in mind, the LSPR now runs various combinations of former workload primitives such as CICS, DB2, IMS, OSAM, VSAM, WebSphere, COBOL and utilities to produce capacity curves that span the typical range of RNI. The three new workload categories represented in the LSPR tables are described below.
LOW (relative nest intensity): A workload category representing light use of the memory hierarchy. This would be similar to past high scaling primitives.
AVERAGE (relative nest intensity): A workload category representing average use of the memory hierarchy. This would be similar to the past LoIO-mix workload and is expected to represent the majority of production workloads.
HIGH (relative nest intensity): A workload category representing heavy use of the memory hierarchy. This would be similar to the past DI-mix workload.
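As a rough illustration of how a computed RNI might be mapped onto these three categories, the sketch below takes the boundaries as caller-supplied arguments. This text does not state numeric boundaries, so the cut points used in the example are hypothetical, and an actual sizing tool may consider more than RNI alone.

```python
def lspr_category(rni, low_cutoff, high_cutoff):
    """Map an RNI value to LOW, AVERAGE, or HIGH using caller-supplied cut points."""
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be below high_cutoff")
    if rni < low_cutoff:
        return "LOW"
    if rni > high_cutoff:
        return "HIGH"
    return "AVERAGE"


# Hypothetical cut points, for illustration only.
categories = [lspr_category(value, low_cutoff=0.6, high_cutoff=1.0)
              for value in (0.4, 0.8, 1.3)]
```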
LSPR Workload Primitives
Various combinations of LSPR workload “primitives” have been and continue to be run to create the capacity ratios given in the LSPR tables. Each individual LSPR workload is designed to focus on a major type of activity, such as interactive, on-line database, or batch. The LSPR does not focus on individual pieces of work such as a specific job or application. Instead, each LSPR workload includes a broad mix of activity related to that workload type. Focusing on a broad mix can help assure that resulting capacity comparisons are not skewed.
The LSPR workload suite is updated periodically to reflect changing production environments. High-level workload descriptions are provided below.
z/OS and OS/390
OLTP-T (formerly IMS) - Traditional On-line Workload
The OLTP-T workload consists of moderate to heavy IMS transactions from DLI applications covering diverse business functions, including order entry, stock control, inventory tracking, production specification, hotel reservations, banking, and teller system. These applications all make use of IMS functions such as logging and recovery. The workload contains sets of 12 (17 for OS/390 Version 1 Release 1 and earlier) unique transactions, each of which has different transaction names and IDs, and uses different databases. Conversational and wait-for-input transactions are included in the workload.
The number of copies of the workload and the number of Message Processing Regions (MPRs) configured is adjusted to ensure that the IMS subsystem is processing smoothly, with no unnecessary points of contention. No Batch Message Processing regions (BMPs) are run. IMS address spaces are non-swappable.
This IMS workload accesses both VSAM and OSAM databases, with VSAM indexes (primary and secondary). DLI HDAM and HIDAM access methods are used. The workload has a moderate I/O load, and data in memory is not implemented for the DLI databases.
Measurements are made with z/OS, OS/390, DFSMS, JES2, RMF, VTAM, and IMS/ESA. IMS coat-tailing (enabling reuse of a module already in storage) is not used; since this activity is so sensitive to processor utilization, it could cause distortion when comparing ITRs between faster and slower processors. Beginning with OS/390 Version 1 Release 1, measurements were done with one or more control regions. The numbers of data base copies, MPRs, and control regions (within the limits of granularity) are scaled with the processing power of a particular machine in order to assure equal and normalized tuning. Performance data collected consists of IMS PARS and the usual SMF data, including type 72 records (workload data), and RMF data.
OLTP-W - Web-enabled On-line Workload
The OLTP-W workload reflects a production environment that has web-enabled access to a traditional data base. For the LSPR, this has been accomplished by placing a WebSphere front-end to connect to the LSPR CICS/DB2 workload (described below).
The J2EE application for legacy CICS transactions was created using the CICS Transaction Gateway (CTG) External Call Interface (ECI) connector enabled in a J2EE server in WebSphere for z/OS Version 5.1. The application uses the J2EE architected Common Client Interface (CCI). Clients access WebSphere services using the HTTP Transport Handlers. The appropriate servlet is then run through the Web container, which calls EJBs in the EJB container. Using the CTG ECI, CICS is called to invoke DB2 to access the database and obtain the information for the client.
For a description of the CICS and DB2 components of this workload, please see the CICS/DB2 workload description further below.
WASDB - WebSphere Application Server and Data Base
The WASDB workload reflects a new e-business production environment that uses WebSphere applications and a DB2 data base all running in z/OS.
WASDB is a collection of Java classes, Java Servlets, Java Server Pages and Enterprise Java Beans integrated into a single application. It is designed to emulate an online brokerage firm. WASDB was developed using the IBM VisualAge™ for Java and WebSphere Studio tools. Each of the components is written to open Web and Java Enterprise APIs, making the WASDB application portable across J2EE-compliant application servers.
The WASDB application allows a user, typically using a web browser, to perform the following actions:
- Register to create a user profile, user ID/password and initial account balance.
- Login to validate an already registered user.
- Browse current stock price for a ticker symbol.
- Purchase shares.
- Sell shares from holdings.
- Browse portfolio.
- Logout to terminate the user’s active interval.
- Browse and update user account information.
CB-L (formerly CBW2) - Commercial Batch Long Job Steps
The CB-L workload is a commercial batch job stream reflective of large batch jobs with fairly heavy CPU processing. The job stream consists of 1 or more copies of a set of batch jobs. Each copy consists of 22 jobs, with 157 job steps. These jobs are more resource intensive than jobs in the CB-S workload (discussed below), use more current software, and exploit ESA features. See table 15 for a list of some of the performance metrics for the LSPR batch workloads. The work done by these jobs includes various combinations of C, COBOL, FORTRAN, and PL/I compile, link-edit, and execute steps. Sorting, DFSMS utilities (e.g. dump/restore and IEBCOPY), VSAM and DB2 utilities, SQL processing, SLR processing, GDDM™ graphics, and FORTRAN engineering/scientific subroutine library processing are also included. Compared to CB-S, there is much greater use of JES processing, with more JCL statements processed and more lines of output spooled to the SYSOUT and HOLD queues. This workload is heavily DB2 oriented with about half of the processing time performing DB2 related functions.
Measurements are made with z/OS, OS/390, DFSMS, JES2, RMF, and RACF. C/370, COBOL II, DB2, DFSORT, FORTRAN II, GDDM, PL/I, and SLR software are also used by the job stream. Access methods include DB2, VSAM, and QSAM. SMS is used to manage all data. Performance data collected consists of the usual SMF data, including type 30 records (workload data), and RMF data.
The CB-L job stream contains sufficient copies of the job set to assure a reasonable measurement period, and the job queue is preloaded. Enough initiators are activated to ensure a high steady-state utilization level of 90% or greater. The number of initiators is generally scaled with processing power to achieve comparable tuning across different machines. The measurement is started when the job queue is released, and ended as the last job completes. Each copy of the job set uses its own datasets, but jobs within the job set share data.
ODE-B - On Demand Environment - Batch
The ODE-B workload reflects the billing process used in the telecommunications industry. This is a multi-step approach which includes the initial processing of Call Detail Records (CDRs), the calculation of the telephone fees, and the insertion of the created telephone bills in a database. The CDRs contain the details of the telephone calls such as the source and target numbers along with the time and the duration of the call. The CDRs are stored in flat files within a zFS file system. A feeder application reads the CDRs from the files, converts them into XML format and sends them to a queue. An analyzer application reads the messages from the queue and performs analysis on the data. During the analysis further information is retrieved from the relational database, and the same database is subsequently updated with the newly created telephone bill and new records for each call. The feeder and the analyzer applications are implemented as Enterprise JavaBeans (EJBs) in IBM WebSphere Application Server for z/OS. Using the concept of multi-servant regions, which is unique to the z/OS implementation of WebSphere Application Server, the threads of the feeder and the analyzer applications are distributed over several Java virtual machines (JVMs). The WebSphere internal queuing engine is used as the queue for the message transport between the feeder and analyzer.
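The feeder/queue/analyzer flow described above can be sketched in a few lines. This is an illustrative stand-in, not the workload's implementation: the CDR field layout, the XML element names, the flat per-second rate, and the in-memory dictionary used in place of the relational database are all assumptions, and a standard-library queue replaces the WebSphere queuing engine.

```python
import queue
import xml.etree.ElementTree as ET


def feeder(cdr_lines, out_queue):
    """Read comma-separated CDRs, convert each to XML, and send it to the queue."""
    for line in cdr_lines:
        source, target, start, duration = line.strip().split(",")
        record = ET.Element("cdr")
        ET.SubElement(record, "source").text = source
        ET.SubElement(record, "target").text = target
        ET.SubElement(record, "start").text = start
        ET.SubElement(record, "duration").text = duration
        out_queue.put(ET.tostring(record, encoding="unicode"))
    out_queue.put(None)  # sentinel: no more messages


def analyzer(in_queue, rate_per_second, bills):
    """Read XML messages from the queue and accumulate a fee per source number."""
    while True:
        message = in_queue.get()
        if message is None:
            break
        record = ET.fromstring(message)
        source = record.findtext("source")
        duration = int(record.findtext("duration"))
        bills[source] = bills.get(source, 0.0) + duration * rate_per_second
    return bills


# Hypothetical CDRs: source, target, start time, duration in seconds.
cdrs = [
    "555-0100,555-0199,2024-01-01T10:00:00,60",
    "555-0100,555-0123,2024-01-01T11:30:00,30",
    "555-0111,555-0100,2024-01-01T12:15:00,10",
]

messages = queue.Queue()
feeder(cdrs, messages)
bills = analyzer(messages, rate_per_second=0.5, bills={})
```

Here the two stages run one after the other in a single thread; in the workload as described, they run as separate applications with their threads spread across several JVMs.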
CB-J - JavaBatch
The JavaBatch workload reflects the batch production environment of a clearing bank that uses a collection of Java classes working on a DB2 database and a set of flat files in z/OS. JavaBatch is a native, standard Java application that can be run standalone on a single JVM (Java Virtual Machine) or in parallel with itself on multiple JVMs. Each of the parallel application instances can be tuned separately. All parallel instances work on the same set of flat files and database tables. The JavaBatch application is based on a Java-JDBC framework from an external banking software vendor and has been enhanced and adapted using the WebSphere Application Developer tool. Various properties such as number of banks, number of accounts, and more can be adapted for the specific runtime environment. These are kept in a special properties file, keeping the Java application unchanged.
The JavaBatch application allows a user to perform the following activities:
- initialize the working database
- create a set of flat files, each containing several hundreds to thousands of payments
- read the flat files, perform various syntax-checks and validation for each payment and store the payments to the working database
- read the payments from the database and route them to destination bank's flat files
CICS/DB2 - On-line Workload for pre-z/OS version 1 release 4
The CICS/DB2 workload is an LSPR workload that was designed to represent clients’ daily business by simulating the placement of orders and delivery of products, as well as business functions such as supply and demand management, client demographics and item selling hit list information. The workload consists of ten unique transactions.
CICS is used as a transaction monitor system. It provides both an API for designing the dialogue panels and parameters to drive the interface to the DB2 database. The interface between the two subsystems is fully supported by S/390 and exploits N-Way designs. CICS functions like dynamic workload gathering and function shipping are not exploited in this workload. The CICS implementation uses an MRO model, which is managed by CP/SM. The number of AORs (Application Owning Regions) and TORs (Terminal Owning Regions) used depends on the number of engines of the processor under test. The ratio of TORs to AORs is 1:3. The utilization of the TOR and the AOR regions is kept under 60%.
The application database is implemented in a DB2 subsystem. One of the major design efforts was to achieve a read-to-write ratio exhibited by OLTP clients. Several data center surveys indicate an average read-to-write ratio to be in the range of 4:1 - 6:1. The read-to-write ratio is an indication of how much of the accessed data is also changed. For this CICS/DB2 workload, implemented on an S/390 or z/Architecture system and using DB2 as the database system, an approximation of the read-to-write ratio is the ratio of SQL statements performing 'read' operations (such as select, fetch, open cursor) to SQL statements performing 'write' operations (such as insert, update, delete).
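The approximation described above, reads counted as select, fetch and open cursor statements and writes as insert, update and delete statements, can be sketched as follows. The function name and the statement counts are hypothetical and are not taken from any LSPR measurement.

```python
READ_VERBS = {"SELECT", "FETCH", "OPEN"}      # 'read' SQL statements
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}  # 'write' SQL statements


def read_to_write_ratio(statement_counts):
    """Approximate the read-to-write ratio from per-verb SQL statement counts."""
    reads = sum(count for verb, count in statement_counts.items()
                if verb.upper() in READ_VERBS)
    writes = sum(count for verb, count in statement_counts.items()
                 if verb.upper() in WRITE_VERBS)
    if writes == 0:
        raise ValueError("no write statements; ratio is undefined")
    return reads / writes


# Hypothetical statement counts for one measurement interval.
counts = {"SELECT": 3000, "FETCH": 1500, "OPEN": 500,
          "INSERT": 400, "UPDATE": 500, "DELETE": 100}

ratio = read_to_write_ratio(counts)
```

For these invented counts the result is 5.0, which falls inside the 4:1 to 6:1 range cited from the data center surveys.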
To reduce the number of database locks and the inter system communication required for each database update and to preserve local buffer coherency in data sharing environments, DB2 type 2 indexes have been used. Additionally, row-level-locking has been introduced for some tables. Each table and index is buffered in separate buffer pools for easy sizing and control.
Linux™ on zSeries
WASDB/L - WebSphere Application Server and Data Base under Linux on zSeries
The WASDB/L workload reflects an e-business environment where a full function application is being run under Linux on zSeries in an LPAR. For LSPR this was accomplished by taking the WASDB workload (described above under z/OS), and converting it to run both application and data base server in a single Linux on zSeries image. The WASDB/L workload is basically the same as the WASDB workload on z/OS with the exception of being enabled for Linux on zSeries. See the ‘WASDB - WebSphere Application Server and Data Base’ section for a detailed description.
WASDB/LVm - many Linux on zSeries guests under z/VM running WebSphere Application Server and Data Base
The WASDB/LVm workload reflects a server consolidation environment where each server is running a full function application. For LSPR this was accomplished by taking the WASDB workload (described above under z/OS), and then replicating the Linux guest a number of times based on the N-way of the processor. Guest pair activity was then adjusted to achieve a constant processor utilization for each N-way. Thus the ratios between processors of equal N-way are based on the throughput per guest rather than the number of guests.
The following terms, also denoted by the symbol (™) on this Web page, are trademarks or servicemarks of other companies.
- Linux: Linus Torvalds