Disaster recovery assurance for data centre managers

Harnessing the power of automated disaster recovery. By Kevin Moreau, Managing Director, EMEA, Unitrends

Introduction
The spectrum of IT protection and preservation challenges is constantly widening and exerting increased pressure on the Data Centre Manager to keep IT services and applications up and running. A new layer of responsibility for complying with sector-driven standards and disaster preparedness legislation has also been added, making Disaster Recovery (DR) a pressing concern and raising the stakes considerably.

Because of this, there needs to be a greater level of assurance attached to DR processes and procedures along with a renewed emphasis on regular, thorough testing. However, legacy mechanisms for DR testing are expensive manual processes for the most part with typical enterprise-level DR budgets running into tens of millions of dollars. Thus, DR testing is infrequent, conducted only on an annual basis for most organisations. If data centres are to continue to provide non-stop services, a new approach to DR is needed – one where DR testing is fully automated and iterated on daily or even hourly cycles.

Three Challenges of Data Centre Recovery

1. Inconsistent recovery
When individual IT services and associated components are backed up independently, or if replication ends abruptly during an incident, recovery may bring up components out of sync with each other. In these circumstances, applications and services can behave inconsistently, unpredictably, or even fail to start at all, which will often require extensive manual intervention to bring them back.

Solution: Safeguard application and IT service components in a consistent state that can always be recovered.

2. Fragile processes
With huge increases in data volumes and a massive growth in 24x7 IT services, backup jobs for data centres have become larger and more difficult to maintain. This is compounded by a dramatic shrinkage in backup and recovery windows. Thus, when jobs fail, there is rarely enough time to restart them, leaving the backup in an incomplete or unstable state.

Solution: Include a closed-loop mechanism in DR processes to supervise the successful completion of processes and initiate immediate remediation in case of failure.

3. Disaster propagation
Backup or replication software copies data from one device to another and is totally service-unaware. Therefore disasters of a logical nature such as database corruption, bad patches or malware get copied into the secondary (or DR) cloud, and their presence can go undetected until a test is made, or a disaster actually occurs.

Solution: Logical disasters must be detected, and infected/unusable data or components must be discarded immediately and replaced with healthy checkpoints.
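The rule in the third solution, discarding unusable copies and falling back to a healthy checkpoint, can be illustrated with a minimal Python sketch. The `Checkpoint` record and its `healthy` flag are hypothetical names introduced only for illustration; the sketch assumes the flag has been set by some service-aware test rather than by the backup or replication job itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one recovery checkpoint."""
    taken_at: datetime
    healthy: bool  # assumed to come from a service-aware test, not the copy job


def latest_healthy(checkpoints: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Return the most recent checkpoint that passed its health test, if any."""
    candidates = [c for c in checkpoints if c.healthy]
    # default=None covers the case where every checkpoint is unusable
    return max(candidates, key=lambda c: c.taken_at, default=None)
```

A corrupted copy taken after the last good one is simply skipped, so recovery starts from the newest checkpoint known to work rather than the newest checkpoint overall.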

DR, Compliance and eDiscovery
As well as dealing with day-to-day business continuity processes and procedures, Data Centre Managers are now an integral part of initiatives that maintain and manage corporate and legal compliance with such regulations as the UK's Civil Contingencies Act. In addition, government bodies like the European Central Bank and standards-seeking organisations are actively issuing guidelines or preparing BCM programmes.

Increasingly, UK Data Centre Managers must also keep one eye on current US laws and be aware of their potential responsibilities to act on Disclosure Requests from the Securities and Exchange Commission (SEC). This is particularly important for data centres serving international companies that have offices in the US, or trade in US dollars. These are sufficient grounds for an SEC eDisclosure Request, so Data Centre Managers need to be confident that their data can be recovered and supplied to the relevant legal authorities.

Virtualisation offers a new way for Data Centre Managers to address these concerns, particularly through 'frozen-in-time' data snapshots. These snapshots are potentially restorable at a later point in time and executable from that point forward, regardless of the underlying hardware on which they are stored or from which they execute. Data Centre Managers can use them to 1) provide long-term storage and recovery plans for IT services, data and applications with the assurance that they can be restarted on different hardware platforms, and 2) enable compliance testing and eDiscovery to be carried out quickly.

DR Orchestration and Sandboxes
To meet the day-to-day challenges of running a Data Centre and assist with compliance and eDiscovery requests, Data Centre Managers could consider DR orchestration, which creates a completely automated system that increases IT service resiliency. This is achieved by bringing together all the different virtualisation layers in the hypervisors and storage arrays present in a data centre, as well as its replication or backup software, to manage consistent and recoverable copies of IT services.

DR orchestration can be used to create an isolated virtual data centre, to which virtual snapshots can be moved and where the IT service can be brought up to simulate a failover scenario or disclosure request. This allows data, IT service and application consistency to be tested over extended periods of time. Known as a ‘test sandbox’, this environment can be configured either exactly like the primary site (reusing IP addresses within a fenced network environment) or with different IP addresses, enabling some network traffic in and out of the test sandbox. It is also possible to include some service VMs that can take the place of physical resources during a DR exercise.
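The two sandbox layouts described above can be expressed as a small configuration check. This is only a sketch under stated assumptions: `SandboxConfig` and its fields are hypothetical names, and the rule it applies (a sandbox whose address range overlaps production must be fenced) is one reading of the text, not the behaviour of any particular product. It uses Python's standard `ipaddress` module.

```python
from dataclasses import dataclass, field
from ipaddress import ip_network
from typing import List


@dataclass
class SandboxConfig:
    """Hypothetical description of a test sandbox network."""
    production_subnet: str
    sandbox_subnet: str
    # Service VMs standing in for physical resources during the exercise
    stand_in_vms: List[str] = field(default_factory=list)

    @property
    def must_be_fenced(self) -> bool:
        """True when sandbox addresses overlap production addresses.

        Assumption: overlapping ranges would collide with production if
        traffic were allowed out, so the sandbox network has to be fenced.
        """
        prod = ip_network(self.production_subnet)
        test = ip_network(self.sandbox_subnet)
        return prod.overlaps(test)
```

For example, `SandboxConfig("10.0.0.0/24", "10.0.0.0/24")` mirrors the primary site and has to be fenced, whereas a sandbox on a separate range such as `10.99.0.0/24` could be allowed some traffic in and out.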

In addition, DR orchestration can validate candidate recovery points at a much more granular level using runbooks, including the execution of queries, transactions or other relevant workloads. Since the test sandbox is isolated from production, almost any kind of test, from the trivial to the most thorough, can be run against the recovery point candidate.
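A runbook of this kind can be pictured as an ordered list of named checks run against the candidate inside the sandbox. The Python sketch below is illustrative only: `run_runbook` and `runbook_passed` are hypothetical helpers, and each check is assumed to be a plain callable returning `True` on success (in practice it might issue a query or replay a transaction).

```python
from typing import Callable, List, Tuple

# A runbook step: a label plus a check that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run each step in order, stopping at the first failure."""
    results: List[Tuple[str, bool]] = []
    for name, check in steps:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failed step
        results.append((name, ok))
        if not ok:
            break
    return results


def runbook_passed(steps: List[Step], results: List[Tuple[str, bool]]) -> bool:
    """The candidate is valid only if every step ran and every step passed."""
    return len(results) == len(steps) and all(ok for _, ok in results)
```

Stopping at the first failure keeps the per-step results as a record of how far recovery got, which is the kind of detail an analyst would want when diagnosing a failed test.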

This allows DR testing in data centres to be incorporated into a corporate compliance process and lets data centre managers validate every tier of an application and apply business rules to each recovery step, ensuring policy compliance and business continuity. The degree of recovery testing within a test sandbox is limited only by the amount of virtual resources available, and the period of time available for tests. There is no danger of contamination on the production site, and when tests are complete the test sandbox can be shut down and discarded.

Once the recovery policy has been applied and successful completion verified, the candidate recovery point can be certified and kept. The DR orchestrator timestamps and catalogues an entry for the set of snapshots as a single certified recovery point. If, however, the tests fail, then the candidate recovery point is invalid and the DR orchestrator closes the loop: the snapshots are discarded, the test failure is logged and the entire procedure is restarted at once. If the test fails repeatedly, the DR orchestrator raises a 'trouble ticket', usually an SNMP trap and/or email, so that an operations analyst can be called to examine the logs and diagnose the reason for the recovery failure.
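The closed loop described above (test, then certify or discard, retry, and escalate after repeated failure) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Orchestrator` class, its three callbacks and the `max_attempts` threshold are hypothetical, and the text itself says only that a ticket is raised when the test fails "repeatedly".

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class RecoveryPoint:
    snapshots: List[str]
    certified_at: datetime


@dataclass
class Orchestrator:
    take_snapshots: Callable[[], List[str]]  # hypothetical: captures a candidate set
    run_tests: Callable[[List[str]], bool]   # hypothetical: sandbox test of that set
    raise_ticket: Callable[[str], None]      # hypothetical: notification hook
    max_attempts: int = 3                    # assumed threshold for "repeatedly"
    catalogue: List[RecoveryPoint] = field(default_factory=list)
    failure_log: List[str] = field(default_factory=list)

    def certify_once(self) -> Optional[RecoveryPoint]:
        """Try to produce one certified recovery point, retrying on failure."""
        for attempt in range(1, self.max_attempts + 1):
            candidate = self.take_snapshots()
            if self.run_tests(candidate):
                # Passed: timestamp and catalogue the whole set as one point
                point = RecoveryPoint(candidate, datetime.now(timezone.utc))
                self.catalogue.append(point)
                return point
            # Failed: the candidate is dropped, the failure is logged,
            # and the loop restarts with a fresh set of snapshots
            self.failure_log.append(f"attempt {attempt}: candidate failed tests")
        self.raise_ticket(f"{self.max_attempts} consecutive recovery test failures")
        return None
```

Scheduling `certify_once` hourly rather than yearly is then a matter of how often it is called; the loop itself needs no operator unless a ticket is raised.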

This allows Data Centres to test their DR systems and processes as often as they like – moving from once a year to once an hour – all automated and running in the background, but guaranteeing a rapid alert should anything go wrong, and a fast recovery if it does.

The dynamic nature of IT services and, therefore, of data centre management, where changes are constantly being made to hardware, operating systems, middleware and applications, requires this closed-loop mechanism in order to assure recovery. Recovery Point Objectives remain objectives until the loop is closed and they are transformed into guarantees.

Conclusion
Data Centre Managers have touch points on all the major issues that affect a company's well-being, including application and service availability, recovery compliance and eDiscovery. They hold the key to a company's reputation and its perception among customers, investors and prospects as well as industry bodies and government agencies. Data Centre Managers must, therefore, be able to:

• Align business continuity policy with IT disaster recovery
• Deploy fully automated, non-disruptive, continuous DR testing
• Provide guarantees to senior executives that all components of critical IT services are always fully recoverable and discoverable
• Have immediate insight into any compliance deviations and be able to either deal with them directly or flag them for the appropriate senior executive
• Provide achievable and testable Recovery Point Objectives and Recovery Time Objectives

It's a task that, thanks to modern backup and recovery configurations, is within closer reach than ever before. However, independent research conducted this year suggests that only one third of companies test their DR even on a yearly basis, prompting many within the industry to think the majority of businesses will have to insist on a more accountable, responsive and adaptable approach to DR in the future.
