Skepticism over Clustering and Mirroring as Backup and Recovery Solutions

I recently came across a control in a client’s processes that threw up a red flag and will definitely get a bit more attention from me during our audit. The control mentioned Clustering and Mirroring as part of their Backup and Recovery solution:

Control Statement:
Data mirroring is implemented at the primary data center for backup and recovery. Virtual Machines are hosted on clustered Hosts. The backup management tool is configured to perform backups per the defined backup policies and procedures.

In this post, I will describe what Clustering and Mirroring are and explain why they are not suitable Backup and Recovery options (in many cases). In the case of the control above, the auditor should investigate whether proper backup and recovery solutions are in place, or whether the control simply needs to be re-worded to better represent the solutions actually in use.

After reading this post, the auditor or IT manager will have a better understanding of the technologies and how they best fit into the IT operations of the organization.

What are Clustering and Mirroring?

Clustering and Mirroring are “High Availability” solutions. In information technology, high availability refers to a system or component that can quickly recover from a failure.

Clustering and Mirroring make it possible to achieve very short Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).

Clusters are a collection of IT components (e.g. servers and virtual machines, routers, network switches, hard disks) that are configured to operate as if they were a single component. This allows for uninterrupted access to the service being provided in the event of the failure of a ‘node’ in the cluster.

Cluster

Mirroring is the replication of one disk to separate hard disks in real time. A popular implementation of mirroring that might be implemented on a personal computer is RAID 1 (Redundant Array of Inexpensive Disks Level 1).

Mirroring

Considerations for the Auditor

Let’s assess the high level issues with using clustering and mirroring as part of a Backup and Recovery solution.

Clustering as a Recovery Solution

While clustering offers a higher degree of availability in the event of a system or component failure, the nodes that comprise a cluster typically reside in the same geographic location (e.g. clustered routers sit in the same rack within the same data center), which makes the data center a single point of failure.

Further, in the case of clustered Virtual Machines, all the nodes that comprise a Cluster might reside on a single Host (the physical server that hosts all the Virtual Machines), resulting in the Host being a single point of failure.

Unless WAN Clustering or GeoClustering (i.e. clusters whose nodes are geographically dispersed) is implemented, clustering is not a suitable recovery solution.
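
As a rough illustration of the check an auditor might perform on a cluster inventory, here is a minimal Python sketch. The Node fields and the single_points_of_failure function are hypothetical names invented for this example, not part of any real tool; the sketch simply flags a level (host or data center) when every node in the cluster shares it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str          # cluster member, e.g. a virtual machine (hypothetical field)
    host: str          # physical server the node runs on
    data_center: str   # site where that host is located


def single_points_of_failure(nodes):
    """Return the placement levels that every node in the cluster shares."""
    findings = []
    if len({n.host for n in nodes}) == 1:
        findings.append("host")
    if len({n.data_center for n in nodes}) == 1:
        findings.append("data_center")
    return findings


# All virtual machines on one physical host in one data center.
same_host = [Node("vm1", "hostA", "dc1"), Node("vm2", "hostA", "dc1")]
# Separate hosts, but still a single data center.
same_site = [Node("vm1", "hostA", "dc1"), Node("vm2", "hostB", "dc1")]
# Geographically dispersed cluster.
geo = [Node("vm1", "hostA", "dc1"), Node("vm2", "hostB", "dc2")]

same_host_findings = single_points_of_failure(same_host)
same_site_findings = single_points_of_failure(same_site)
geo_findings = single_points_of_failure(geo)
```

Only the geographically dispersed cluster comes back with no findings; in practice the inventory would come from the client's virtualization or asset management records rather than a hand-typed list.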

Mirroring as a Backup Solution

While Mirroring and Backup involve making copies of data, the two should not be confused.

It is true that Mirroring might allow you to create a backup from a healthy disk if another disk in the array fails, but Mirroring falls short as a backup solution when data is corrupted, for example by database Input/Output errors or by viruses.

Once bad or corrupted data is written to disk, mirroring immediately replicates it to the other disk in the mirror, so every copy ends up corrupted.

On the other hand, a Backup is an image or snapshot of a specific point in time that exists independently of the live disk system. If a disk or database becomes corrupted, a full restore to a specific point in time can be achieved.
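
The difference can be sketched with a toy Python model. This is only a simplified illustration under stated assumptions: MirroredVolume and BackupCatalog are hypothetical classes that use in-memory dictionaries in place of real disks, and they do not represent how any particular RAID controller or backup product works.

```python
import copy


class MirroredVolume:
    """Toy RAID 1 style mirror: every write goes to both disks at once."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, key, value):
        for disk in self.disks:
            disk[key] = value

    def read(self, key, disk=0):
        return self.disks[disk][key]


class BackupCatalog:
    """Toy backup store: independent point-in-time copies of the data."""

    def __init__(self):
        self.snapshots = {}

    def take_snapshot(self, label, volume):
        # Deep copy so later writes to the volume cannot alter the snapshot.
        self.snapshots[label] = copy.deepcopy(volume.disks[0])

    def restore(self, label, volume):
        for i in range(len(volume.disks)):
            volume.disks[i] = copy.deepcopy(self.snapshots[label])


volume = MirroredVolume()
catalog = BackupCatalog()

volume.write("ledger", "balance=100")
catalog.take_snapshot("monday", volume)

# Corruption is written to the live system and mirrored immediately.
volume.write("ledger", "###corrupted###")
both_corrupt = all(d["ledger"] == "###corrupted###" for d in volume.disks)

# Only the independent point-in-time copy can bring the good data back.
catalog.restore("monday", volume)
restored = volume.read("ledger")
```

After the corrupt write, both mirrored disks hold the bad value, and the good value is only recoverable because the snapshot was stored separately from the live disks.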

Conclusion

The short version: if you discover either Clustering or Mirroring being cited as a Recovery or Backup solution, respectively, investigate!

Remember:

  • Unless the components of a cluster are geographically dispersed, the cluster is not suitable for Disaster Recovery purposes.
  • Mirroring and backups both provide data redundancy, but mirroring is real-time redundancy that safeguards against disk failures, while backups are snapshots of data at a specific point in time.
  • Clustering and Mirroring are “High Availability” solutions!

Want to really impress the IT Operations staff? Understand the difference between High Availability and Fault Tolerance and call them out for misusing the terms.

3 thoughts on “Skepticism over Clustering and Mirroring as Backup and Recovery Solutions”

  • Impress…or annoy for calling them out 🙂

    I’ve also seen business units use application clusters for load balancing a high traffic application. To save costs, however, they only use a single web/application server in their backup data center. They would conduct annual business resumption tests at the backup site, but only confirm that the application was up and running normally. They didn’t apply any load to determine if the application could handle normal business traffic in the event of a long term contingency event.

    Something else to consider when looking at application clusters.

  • The Application Server Cluster diagram doesn’t include all of the options.

    Most clients enabling clustering, say with VCS (Veritas Cluster Server), HP Serviceguard or IBM’s PowerHA, will define a primary/redundant configuration across two data centres. The boot and data volumes for each server are served by the SAN located in each data centre.

    The boot volumes aren’t replicated/copied between data centres, though they will be RAID 5 in the SAN disk arrays, providing a degree of redundancy.

    Data volumes, though, are replicated from the primary node SAN to the redundant node SAN using a replication tool. That can be performed with a number of products, such as Oracle Data Guard (at the application level), Veritas Volume Replicator (at the file system level) or, from the SAN side, HORCM (Hitachi Open Remote Copy Manager), also known as TrueCopy if used with Hitachi SAN devices marketed and supported by HP. All of the SAN vendors (IBM, EMC, etc.) have equivalent products.

    For high availability, where the client can’t afford the offline delay whilst a cluster fails over from a primary node to the redundant node, there are ‘active/active’ cluster topologies available. These require additional tools, such as the Veritas CFS (Cluster File System) or equivalent, to enable data volumes to be shared and mounted across multiple hosts (and to address all the potential file-lock issues that can result). Such clusters allow for load-balancing, and for N+1 configurations, where a spare node can be introduced into the cluster to mitigate a failed or failing node.

    You are right to say that clusters aren’t Backup & Recovery solutions. Whatever the topology of the cluster, an error written to disk will be replicated to the other storage devices regardless, and failing over to a redundant node will simply reproduce the same problem on the newly promoted node.

    Clusters are particularly useful for providing a degree of protection against hardware problems, or introduced software issues (such as routine operating environment patching which introduces a new problem not seen in testing).

    Nonetheless, even a seemingly genuine Backup & Recovery solution needs to be assessed. Many years ago I attended a client whose Informix database administrator had accidentally issued and committed a command sequence which wiped the contents of an essential database. The database was backed up, together with redo log files (or ‘logical logs’ or ‘onbar-logs’ in Informix parlance), providing for a perfect RPO. The snag was that the customer hadn’t purchased a comprehensive module for its HP OmniBack II software. When the database was restored from DLT, the very last thing the backup application did was to re-run the redo log files, which promptly wiped the newly restored database! HP eventually provided a solution with the purchase of the upgraded parent product and Informix module, which allowed the redo log files to be replayed only up to a particular moment, but the RTO was missed, causing some revenue loss.
