DB2 Version 10.1 for Linux, UNIX, and Windows

GDPC High Availability and Disaster Recovery

Geographically dispersed DB2® pureScale® clusters (GDPC) provide high availability and disaster recovery, failing over automatically when a cluster member, a system, or an entire site goes down.

Geographically dispersed DB2 pureScale clusters (GDPC) can automatically and transparently recover from the same hardware or software failures as a single-site DB2 pureScale cluster. In addition, because a GDPC spans multiple physical sites, a GDPC can also automatically and transparently recover from hardware failures that traditionally affect an entire site, for example, localized power outages or localized network disruptions.

The estimated time for a GDPC to recover from software faults is comparable to the recovery time for software faults in a single-site DB2 pureScale cluster. As with non-dispersed pureScale clusters, if SCSI-3 PR is not being used, there is a slightly longer impact to the workload for hardware failures that affect an entire system. Recovery time is dependent on many factors, such as the number of file systems, file size, and frequency of writes to the files.

Care must be taken to ensure that sufficient space is available for critical file systems such as /var and /tmp because a lack of space on these file systems might affect the operation of the cluster services.
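For example, a quick check of the available space on these file systems can be done from any host in the cluster with the operating system's df command (the prompt and the -k option for kilobyte units are illustrative; acceptable thresholds are a matter of local policy):
root@hostA1> df -k /var /tmp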

For a single system failure, any members on that system are restarted in restart-light mode, either on other systems at the same site or on systems at the other site. Note that no preference is given to restarting a member in restart-light mode on another system at the same site. Although that might be the intuitive expectation, it offers no benefit in overall failure recovery time: the restarting member must communicate with members and CFs at both sites equally, so the same member failover logic is used. After a primary CF system failure, the primary CF role fails over to the secondary CF at the surviving site.

Since GPFS™ storage replication is a key component of GDPC, and replication failures are transparent to applications, it is important to monitor the GPFS replication status along with the standard single-site DB2 pureScale instance monitoring. To simplify this monitoring task, the GPFS replication status can be queried with the db2instance command. For example:
db2inst1@hostA1> db2instance -list -sharedfs
If there are no GPFS replication issues, the output from this command is identical to the output from the db2instance -list command. However, if there are replication issues, you will see one of the following two messages:
  • There is currently an alert for the shared file system filesystem_name in the data-sharing instance. Critical data resides on disks that are suspended or being deleted.
  • There is currently an alert for the shared file system filesystem_name in the data-sharing instance. The file system is not properly replicated. Run the db2cluster command: db2cluster -cfs -rebalance filesystem_name.
If these warnings are reported, contact the storage administrator to determine the source of the storage or disk failures and resolve the issue. After the issues have been resolved and the affected disks have been recovered and are available for use, any missing data must be replicated on those disks using the db2cluster command. For example:
db2inst1@hostA1> /home/db2inst1/sqllib/bin/db2cluster -cfs -rebalance -filesystem filesystem_name

The db2cluster -rebalance command is a very I/O-intensive operation and can have a significant impact on the running workload, so it is typically performed when the workload is at its lowest. However, this must be balanced against the need to re-enable full file system replication as soon as possible so that future storage, system, or site failures can be sustained.
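After the rebalance completes, the replication status can be rechecked with the same monitoring command shown earlier; if neither of the alert messages is reported, full replication has been restored:
db2inst1@hostA1> db2instance -list -sharedfs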

Storage replica failure scenario

Consider a scenario where one complete storage replica has failed or becomes inaccessible. This type of failure is handled automatically and transparently by GDPC; however, there will be a short period during which update transactions are delayed. The exact length of the period during which update transactions are affected depends on the number of disks used for all the database file systems, the nature of the workload, and the disk I/O storage controller configuration settings. Note that each disk in a storage replica is considered an independent entity; rather than detecting that an entire storage replica has failed, the GPFS software is informed by the storage controller separately for each disk in the failed storage replica. Therefore, the length of time until all file system accesses return to normal depends on:
  1. How long it takes the workload to drive a filesystem I/O to each of the disks in the failed storage replica,
  2. The length of time for the disk I/O storage controller to report individual disk failures back to the GPFS software, and the time for the GPFS software to mark those affected disks as failed.

Some disk I/O accesses can return to normal while others are still delayed, waiting for a specific disk to return an error. After all disks in a storage replica have been marked as failed, file system I/O times return to normal because GPFS has stopped replicating data writes to the failed disks. Note that even though GDPC remains operational during this entire period, after some disks or an entire storage replica has failed, only a single copy of the file system data is available, which leaves the GDPC exposed to a single point of failure until the problem has been resolved and replication has been restarted.
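To track which disks GPFS has already marked as failed while the remaining disks in the replica are still being detected, the disk states of a shared file system can be listed. This is an illustrative check only; filesystem_name is a placeholder for one of the instance's shared file systems, and the -e option limits the output to disks that are not in the up and ready state:
root@hostA1> /usr/lpp/mmfs/bin/mmlsdisk filesystem_name -e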

As mentioned earlier, the storage failure recovery time depends on the storage controller's configuration, in particular how quickly the storage controller returns an error up to GPFS so that GPFS can mark the affected disks as inaccessible. By default, some storage controllers are configured either to retry indefinitely on errors or to delay reporting errors back up the I/O stack for a lengthy amount of time, sometimes long enough to allow the storage controller to reboot. Although this is usually desirable when only one replica of the storage is available (it avoids returning a file system error if the error might be recoverable at the storage layer), it significantly increases the storage failure recovery time. In some cases it makes the storage layer seem unresponsive, which might be enough to cause the rest of the cluster to assume that all members and CFs are also unresponsive, prompting Tivoli® System Automation MP (TSA) to stop and restart them, which is undesirable. With GDPC, because there is a second replica of the data and a key requirement is automatic and transparent recovery from a wide variety of failures, including storage failures, the storage controller failure detection time should be reduced. A good starting point is to set the storage failure detection time to 20 seconds; the exact mechanism to do this depends on the type of storage and storage controller being used. For an example of how to update the failure detection time for the AIX® MPIO multipath device driver, see Configuring the cluster for high availability.
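As an illustrative sketch only, and not a substitute for the linked topic: on AIX with the MPIO multipath device driver, the per-disk read/write timeout might be inspected and lowered with the lsattr and chdev commands. The device name hdisk2, the rw_timeout attribute, and the value of 20 seconds are assumptions here; the correct attribute and value depend on the storage and driver in use, and chdev might require the -P option and a restart if the device is in use:
root@hostA1> lsattr -El hdisk2 -a rw_timeout
root@hostA1> chdev -l hdisk2 -a rw_timeout=20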

After the storage controller is back online, the disks might still be considered down by GPFS. To check the status of the disks, use the mmlsdisk filesystem -L command. If the storage controller is online but the mmlsdisk command shows the disks as down, you need to bring the disks back up. This example shows how to bring the disks back into the up state, specifying only the computer systems located at the site that contains the affected disks. In this example, assume those disks are located at site A:
root@hostA1> /usr/lpp/mmfs/bin/mmnsddiscover -a -N hostA1,hostA2,hostA3
root@hostA1> /usr/lpp/mmfs/bin/mmchdisk filesystem start -d gpfs_disk_identifier -N hostA1,hostA2,hostA3
root@hostA1> /usr/lpp/mmfs/bin/mmfsck filesystem -o

Note that the tiebreaker is not specified. To confirm that the disks have moved to the up state, use the mmlsdisk command.
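For example, the following command (using the same file system placeholder as the previous commands) lists each disk together with its current status and availability so that you can confirm that the recovered disks show an availability of up:
root@hostA1> /usr/lpp/mmfs/bin/mmlsdisk filesystem -L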

Site failure recovery scenario

Consider a scenario where either site A or site B experiences a total failure, such as a localized power outage, and is expected to eventually come back online. This type of failure is handled automatically and transparently by GDPC. Systems on the surviving site independently perform restart-light member crash recovery for each of the members from the failed site, in parallel. All members that were configured on the failed site remain in restart-light mode on guest systems at the surviving site until the members' home systems on the failed site have been recovered; that is, if only one member system on the failed site recovers, then only the member configured on that system fails back to its home system. If the failed site contained the primary CF, the primary CF role automatically fails over to the secondary CF located on the surviving site. During recovery, there is a period of time during which all write transactions are paused. Read transactions might be paused as well, depending on whether the data being read is already cached by the member and whether it is separate from data that was being updated at the time of the site failure. Data that is not already cached by the member must be fetched from the CF, which is delayed until recovery is complete. The length of time that transactions are paused depends mainly on the time required for GPFS to perform file system recovery. File system recovery time is primarily influenced by the number of file systems as well as the frequency and size of file system write requests around the time of failure, so workloads with a higher ratio of updates might see longer file system recovery times.
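While the failed site is down, the current placement of members and CFs can be checked from any surviving host with the standard instance monitoring command; members whose home hosts are at the failed site appear on guest hosts at the surviving site. In this illustration, hostB1 is assumed to be a host at the surviving site B, following the naming convention used earlier in this topic:
db2inst1@hostB1> db2instance -list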

After one site has failed, the surviving site will:
  1. Have read and write access to the shared file systems (that is, there is full access to the surviving replica of data from the surviving members and CFs).
  2. Service all database transaction requests from clients through the members configured on the systems at the surviving site.
  3. Contain the primary CF (the primary role transparently fails over to the CF host on the surviving site if the primary CF was previously running on the site that failed).
  4. Run the members from the failed site in restart-light mode.

It is important that all the hosts on the surviving site, as well as host T, remain online; otherwise quorum will be lost (to maintain majority quorum, access to all hosts on the surviving site plus the tiebreaker host is needed).
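To verify that the hosts on the surviving site and the tiebreaker host are all still active in the GPFS cluster, the node states can be listed from any surviving host. This is one illustrative way to check; the -L option also reports the number of quorum nodes (the prompt assumes a host at the surviving site):
root@hostB1> /usr/lpp/mmfs/bin/mmgetstate -a -L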

When the failed site eventually comes back online:
  1. The shared file systems must be manually re-replicated to ensure that any data written at the surviving site is replicated to the previously failed site. This can be done with the mmnsddiscover and mmchdisk start commands as described in the previous section, "Storage replica failure scenario" (see the example after this list).
  2. Members will automatically fail back onto their home hosts.
  3. The CF on the failed site will restart as a new secondary CF.
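As an illustration of step 1, assuming that site B was the failed site and that its hosts are hostB1, hostB2, and hostB3 (names that follow the convention used earlier in this topic), the re-replication might look like the following, run against each affected file system:
root@hostA1> /usr/lpp/mmfs/bin/mmnsddiscover -a -N hostB1,hostB2,hostB3
root@hostA1> /usr/lpp/mmfs/bin/mmchdisk filesystem start -d gpfs_disk_identifier -N hostB1,hostB2,hostB3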

Connectivity failure between sites scenario

Consider a scenario where all connectivity is lost between site A and site B (for example, the dark fiber between sites is severed, a switch fails, or an InfiniBand extender fails). To reduce the chance of this type of failure, redundant connectivity between site A and site B is a best practice.

If one site loses all connectivity with the other site and also loses connectivity to the tiebreaker site, this form of connectivity failure is identical to a site failure. The site that can still communicate with the tiebreaker site becomes the surviving site. Until connectivity is restored, all DB2 members from the systems at the failed site are restarted in restart-light mode on hosts at the surviving site, and the primary CF role is moved over to the surviving site, if necessary.

If all connectivity between site A and site B is lost, but both sites retain connectivity with the tiebreaker site, the GPFS software detects the link failure between the two sites and evicts all systems from one of the sites from the GPFS domain. Typically, the GPFS software favors keeping the site that also contains the current GPFS cluster manager (the current cluster manager can be determined by running the GPFS mmlsmgr command). The systems on the losing site are I/O fenced from the cluster until connectivity is restored. In the meantime, TSA responds to the loss of the GPFS connection by restarting all DB2 members from the affected systems in restart-light mode on systems at the surviving site, and moves the primary CF role from the evicted site to the remaining site if necessary. To reduce the amount of DB2 recovery work needed in the event of a connectivity failure between sites, it is preferable that the site containing the primary CF be the one that remains operational. Therefore, if the mmlsmgr command shows that the GPFS cluster manager is located on the site that does not also contain the primary CF (as reported by db2instance -list), you can move it to the same site as the primary CF by using the mmchmgr command. For example:
root@hostA1> /usr/lpp/mmfs/bin/mmchmgr -c primary_cf_system

Because the location of the GPFS cluster manager can change, especially after a node reboot, it should be monitored to ensure that it remains on the same site as the primary CF (a simple check is shown at the end of this topic).

If, instead of a connectivity loss between sites A and B, all connectivity with the tiebreaker site is lost from both sites, the tiebreaker host T is expelled from the cluster. Because no DB2 member or CF runs on host T, there is no immediate functional impact on the GDPC instance. However, in the event of a subsequent site failure, quorum would be lost.
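For the cluster manager monitoring mentioned above, a simple periodic check (an illustrative sketch that uses only the commands already shown in this topic) is to compare the cluster manager location reported by mmlsmgr -c with the location of the primary CF reported by db2instance -list, and to run mmchmgr -c as shown earlier if they are on different sites:
root@hostA1> /usr/lpp/mmfs/bin/mmlsmgr -c
db2inst1@hostA1> db2instance -list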