Disaster Recovery Planning
I don’t think anyone can question the importance of a working, tested, reality-based Disaster Recovery Plan (DRP). A disaster recovery plan is a comprehensive statement of consistent actions to be taken before, during, and after a disruptive event that causes a significant loss of information systems resources. Disaster Recovery Plans are the procedures for responding to an emergency, providing extended backup operations during the interruption, and managing recovery and salvage processes afterwards, should an organization experience a substantial loss of processing capability.
The primary objective of the disaster recovery plan is to provide the capability to implement critical processes at an alternate site and return to the primary site and normal processing within a time frame that minimizes the loss to the organization by executing rapid recovery procedures.
When planning for a disaster, it’s important to try to account for the unexpected consequences of both the disaster and the remediation. Trying to “expect the unexpected,” however, doesn’t mean you can literally and financially prepare for every contingency. Preparing as well as possible for what you can foresee will reduce the negative impact of unforeseen events. If 70 percent, 80 percent, or 90 percent of the recovery goes smoothly and according to plan, the unexpected events will have a much smaller impact on the survivability of the business.
Disasters primarily affect availability, which affects the ability of the staff to access the data and access working systems, but a disaster can also affect the other two tenets: confidentiality and integrity.
Goals and Objectives of DRP
A major goal of DRP is to provide an organized way to make decisions if a disruptive event occurs. The purpose of the disaster recovery plan is to reduce confusion and enhance the ability of the organization to deal with the crisis.
Obviously, when a disruptive event occurs, the organization will not have the luxury to create and execute a recovery plan on the spot. Therefore, the amount of planning and testing that can be done beforehand will determine the capability of the organization to withstand a disaster.
The objectives of the DRP are multiple, but each is important. They can include the following:
-
Protecting an organization from major computer services failure
-
Minimizing the risk to the organization from delays in providing services
-
Guaranteeing the reliability of standby systems through testing and simulation
-
Minimizing the decision making required by personnel during a disaster
In this section, we will examine the following areas of DRP:
-
The DRP process
-
Testing the disaster recovery plan
-
Disaster recovery procedures
The Disaster Recovery Planning Process
This phase involves the development and creation of the recovery plans, which are similar to the BCP process. However, BCP is involved in BIA and loss criteria for identifying the critical areas of the enterprise that the business requires to sustain continuity and financial viability; the DRP process assumes that those identifications have been made and the rationale has been created. Now we’re defining the steps we will need to perform to protect the business in the event of an actual disaster.
The steps in the disaster planning process phase are:
-
Data Processing Continuity Planning. Planning for the disaster and creating the plans to cope with it.
-
Disaster Recovery Plan Maintenance. Keeping the plans up-to-date and relevant.
Data Processing Continuity Planning
The various means of processing backup services are all important elements to the disaster recovery plan. Here we look at the most common alternate processing types:
-
Mutual aid agreements
-
Subscription services
-
Multiple centers
-
Service bureaus
-
Other data center backup alternatives
Mutual Aid Agreements
A mutual aid agreement (sometimes called a reciprocal agreement) is an arrangement with another company that may have similar computing needs. The other company may have similar hardware or software configurations or may require the same network data communications or Internet access as your organization.
In this type of agreement, both parties agree to support each other in the case of a disruptive event. This arrangement is made on the assumption that each organization’s operations area will have the capacity to support the other in a time of need. This is a big assumption.
There are clear advantages to this type of arrangement. It allows an organization to obtain a disaster-processing site at very little or no cost, thereby creating an alternate processing site even though a company may have very few financial resources to create one. Also, if the companies have very similar processing needs (that is, the same network operating system, the same data communications needs, or the same transaction processing procedures), this type of agreement may be workable.
This type of agreement has serious disadvantages, however, and really should be considered only if the organization has the perfect partner (a subsidiary, perhaps) and has no other alternative to disaster recovery (i.e., a solution would not exist otherwise). One disadvantage is that it is highly unlikely that each organization’s infrastructure will have the extra, unused capacity to enable full operational processing during the event. Also, in contrast to a hot or warm site, this type of arrangement severely limits the responsiveness and support available to the organization during an event and can be used only for short-term outage support.
The biggest flaw in this type of plan is obvious if we ask what happens when the disaster is large enough to affect both organizations. A major outage can easily disrupt both companies, thereby canceling any advantage that this agreement may provide. The capacity and logistical elements of this type of plan make it seriously limited.
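The capacity assumption behind a mutual aid agreement can be checked with simple arithmetic. The following is a minimal sketch, not a sizing method: the function name, the single shared capacity unit, and the figures are all hypothetical and chosen only for illustration.

```python
def spare_capacity_covers(host_capacity: float,
                          host_current_load: float,
                          partner_critical_load: float) -> bool:
    """Return True if the host's unused capacity can absorb the partner's critical load.

    All three values use the same hypothetical unit (for example, a percentage
    of the host's total processing capacity).
    """
    spare = host_capacity - host_current_load
    return spare >= partner_critical_load


# Hypothetical figures: a host already running at 85 percent cannot absorb
# a partner whose critical processing needs 30 percent of the host's capacity.
print(spare_capacity_covers(100.0, 85.0, 30.0))  # False
print(spare_capacity_covers(100.0, 50.0, 30.0))  # True
```

Running a check like this for both directions of the agreement shows quickly whether the "big assumption" holds for either partner.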
Subscription Services
Another type of alternate processing scenario is presented by subscription services. In this scenario, third-party commercial services provide alternate backup and processing facilities. Subscription services are probably the most common of the alternate processing site implementations. They have very specific advantages and disadvantages, as we will see.
There are three basic forms of subscription services with some variations:
-
Hot site
-
Warm site
-
Cold site
Hot Site
This is the Cadillac of disaster recovery alternate backup sites. A hot site is a fully configured computer facility with electrical power, heating, ventilation, and air conditioning (HVAC) and functioning file/print servers and workstations. The applications that are needed to sustain remote transaction processing are installed on the servers and workstations and are kept up-to-date to mirror the production system. Theoretically, operators and other personnel should be able to walk in and, with a data restoration of modified files from the last backup, begin full operations in a very short time. If the site participates in remote journaling - that is, mirroring transaction processing with a high-speed data line to the hot site - even the backup time may be reduced or eliminated.
This type of site requires constant maintenance of the hardware, software, data, and applications to ensure that the site accurately mirrors the state of the production site. This adds administrative overhead and can be a strain on resources, especially if a dedicated disaster recovery maintenance team does not exist.
The advantages to a hot site are numerous. The primary advantage is that 24/7 availability and exclusivity of use are ensured. The site is available immediately (or within the allowable time tolerances) after the disruptive event occurs. The site can support an outage for a short time as well as a long-term outage.
Some of the drawbacks of a hot site are as follows:
-
It is by far the most expensive of the alternatives. Full redundancy of all processing components (e.g., hardware, software, communications lines, and applications) is expensive, and the services provided to support this function will not be cheap.
-
It is common for the service provider to oversell its processing capabilities, betting that not all its clients will need the facilities simultaneously. This situation could create serious contention for the site’s resources if a disaster is large enough to affect a major geographic region.
-
There also exists a security issue at the hot site, because the applications may contain mirrored copies of live production data. Therefore, all the security controls and mechanisms that are required at the primary site must be duplicated at the hot site. Access must be controlled, and the organization must be aware of the security methodology implemented by the service organization.
-
Also, a hot site may be administratively resource-intensive because controls must be implemented to keep the data up to date and the software patched.
Warm Site
A warm site could best be described as a cross between a hot site and cold site. Like a hot site, the warm site is a computer facility readily available with electrical power, HVAC, and computers, but the applications may not be installed or configured. It may have file/print servers, but not a full complement of workstations. External communication links and other data elements that commonly take a long time to order and install will be present, however.
To enable remote processing at this type of site, workstations will have to be delivered quickly, and applications and their data will need to be restored from backup media.
The advantages to this type of site, as opposed to the hot site, are primarily as follows:
-
Cost. This type of configuration will be considerably less expensive than a hot site.
-
Location. Because this type of site requires less extensive control and configuration, more flexibility exists in the choice of site.
-
Resources. Administrative resource drain is lower than with the maintenance of a hot site.
The primary disadvantage of a warm site, compared to a hot site, is the difference in the amount of time and effort it will take to start production processing at the new site. If extremely urgent critical transaction processing is not needed, this may be an acceptable alternative.
Cold Site
A cold site is the least ready of any of the three choices, but it is probably the most common of the three. A cold site differs from the other two in that it is ready for equipment to be brought in during an emergency, but no computer hardware (servers or workstations) resides at the site. The cold site is a room with electrical power and HVAC, but computers must be brought on-site if needed, and communications links may be ready or not. File and print servers have to be brought in, as well as all workstations, and applications will need to be installed and current data restored from backups.
A cold site is not considered an adequate resource for disaster recovery, because of the length of time required to get it going and all the variables that will not be resolved before the disruptive event. In reality, using a cold site will most likely make effective recovery impossible. It will be next to impossible to perform an in-depth disaster recovery test or to do parallel transaction processing, making it very hard to predict the success of a disaster recovery effort.
There are some advantages to a cold site, however, the primary one being cost. If an organization has very little budget for an alternative backup-processing site, the cold site may be better than nothing. Also, resource contention with other organizations will not be a problem, and neither will geographic location likely be an issue.
The big problem with this type of site is that having the cold site could engender a false sense of security. But until a disaster strikes, there’s really no way to tell whether it works or not, and by then it will be too late.
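The trade-off among the three subscription site types (readiness against cost) can be sketched as a small selection routine. This is only an illustration under loud assumptions: the hours-to-operational figures and the cost ranking below are invented placeholders, not values from the text, and `cheapest_site_meeting` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteOption:
    name: str
    hours_to_operational: float  # ASSUMED illustrative figure, not a benchmark
    relative_cost: int           # 1 = cheapest; expresses ordering only


# Assumed placeholder values: hot is fastest and most expensive, cold is
# slowest and cheapest, warm sits between the two.
SITE_OPTIONS = [
    SiteOption("cold", 168.0, 1),
    SiteOption("warm", 48.0, 2),
    SiteOption("hot", 4.0, 3),
]


def cheapest_site_meeting(max_outage_hours: float) -> Optional[SiteOption]:
    """Pick the cheapest site type that is operational within the allowed outage."""
    candidates = [s for s in SITE_OPTIONS
                  if s.hours_to_operational <= max_outage_hours]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


choice = cheapest_site_meeting(72.0)
print(choice.name if choice else "no subscription option is fast enough")
```

With these placeholder figures, a business that can tolerate three days of outage would land on a warm site, while one that needs to be running within hours has no option but a hot site.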
Multiple Centers
A variation on the previously listed alternative sites is called multiple centers, or dual sites. In a multiple-center concept, the processing is spread over several operations centers, creating a distributed approach to redundancy and sharing of available resources. These multiple centers could be owned and managed by the same organization (in-house sites) or used in conjunction with some sort of reciprocal agreement.
The advantages are primarily financial, because the cost is contained. Also, this type of site will often allow for resource and support sharing among the multiple sites. The main disadvantage is the same as for mutual aid: a major disaster could easily overtake the processing capability of the sites. Also, multiple configurations could be difficult to administer.
Service Bureaus
In rare cases, an organization may contract with a service bureau to fully provide all alternate backup-processing services. The big advantages of this type of arrangement are the quick response and availability of the service bureau, the possibility of testing, and the fact that the service bureau may be available for more than backup. The disadvantages of this type of setup are primarily the expense and resource contention during a large emergency.
Other Data Center Backup Alternatives
There are a few other alternatives to the ones we have previously mentioned. Quite often an organization may use some combination of these alternatives in addition to one of the preceding scenarios.
-
Rolling/mobile backup sites - Contracting with a vendor to provide mobile backup services. This may take the form of mobile homes or flatbed trucks with power and HVAC sufficient to stage the alternate processing required. This is considered a cold site variation.
-
In-house or external supply of hardware replacements - Vendor resupply of needed hardware, or internal stockpiling of critical components inventory. The organization may have a subscription service with a vendor to send identified critical components overnight. This option may be acceptable for a warm site but is not acceptable for a hot site.
-
Prefabricated buildings - It’s not unusual for a company to employ a service organization to construct prefabricated buildings to house the alternate processing functions if a disaster should occur. This is not too different from a mobile backup site - a very cold site.
Transaction Redundancy Implementations
The CISSP candidate should understand the three concepts used to create a level of fault tolerance and redundancy in transaction processing. Although these processes are not used solely for disaster recovery, they are often elements of a larger disaster recovery plan. If one or more of these processes are employed, the ability of a company to get back on-line is greatly enhanced.
-
Electronic vaulting. Electronic vaulting refers to the transfer of backup data to an off-site location. This is primarily a batch process of dumping the data through communications lines to a server at an alternate location.
-
Remote journaling. Remote journaling refers to the parallel processing of transactions to an alternate site, as opposed to a batch dump process like electronic vaulting. A communications line is used to transmit live data as it occurs. This feature enables the alternate site to be fully operational at all times and introduces a very high level of fault tolerance.
-
Database shadowing. Database shadowing uses the live processing of remote journaling, but it creates even more redundancy by duplicating the database sets to multiple servers.
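The difference among the three concepts comes down to when data leaves the primary site and how many copies are kept. The sketch below models that with in-memory lists. It is a toy illustration, not a real replication implementation: the class names and methods are hypothetical, and the "off-site" stores are just Python lists.

```python
class ElectronicVault:
    """Batch transfer: transactions wait locally until a scheduled dump."""

    def __init__(self):
        self.pending = []   # recorded but not yet sent off-site
        self.offsite = []   # what the alternate location holds

    def record(self, txn):
        self.pending.append(txn)

    def run_batch_dump(self):
        self.offsite.extend(self.pending)
        self.pending.clear()


class RemoteJournal:
    """Live transfer: each transaction is sent off-site as it occurs."""

    def __init__(self):
        self.offsite = []

    def record(self, txn):
        self.offsite.append(txn)


class DatabaseShadow:
    """Live transfer, duplicated to several servers for extra redundancy."""

    def __init__(self, server_count):
        self.servers = [[] for _ in range(server_count)]

    def record(self, txn):
        for server in self.servers:
            server.append(txn)


vault, journal, shadow = ElectronicVault(), RemoteJournal(), DatabaseShadow(2)
for txn in ["t1", "t2", "t3"]:
    vault.record(txn)
    journal.record(txn)
    shadow.record(txn)

# If the primary site fails before the next batch dump, the vault's off-site
# copy is missing the pending transactions; the journal and shadows are not.
print(len(vault.offsite), len(journal.offsite), [len(s) for s in shadow.servers])
```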
Disaster Recovery Plan Maintenance
Disaster recovery plans often get out of date. A similarity common to all recovery plans is how quickly they become obsolete, for many different reasons. The company may reorganize, and the critical business units may be different from the ones existing when the plan was first created. Most commonly, changes in the network or computing infrastructure may change the location or configuration of hardware, software, and other components. The reasons may be administrative: Complex disaster recovery plans are not easily updated, personnel lose interest in the process, or employee turnover may affect involvement.
Whatever the reason, plan maintenance techniques must be employed from the outset to ensure that the plan remains fresh and usable. It’s important to build maintenance procedures into the organization by using job descriptions that centralize responsibility for updates. Also, create audit procedures that can report regularly on the state of the plan. It’s also important to ensure that multiple versions of the plan do not exist, because they could create confusion during an emergency. Always replace older versions of the text with updated versions throughout the enterprise when a plan is changed or replaced.
Emergency management plans, business continuity plans, and disaster recovery plans should be regularly reviewed, evaluated, modified, and updated. At a minimum, the plan should be reviewed at an annual audit. The plan should also be reevaluated:
-
After tests or training exercises, to adjust any discrepancies between the test results and the plan
-
After a disaster response or an emergency recovery, as this is an excellent time to amend the parts of the plan that were not effective
-
When personnel, their responsibilities, their resources, or organizational structures change, to familiarize new or reorganized personnel with procedures
-
When policies, procedures, or infrastructures change
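The review rules above (an annual review at minimum, plus event-driven reevaluation) can be expressed as a small check. This is a sketch under assumptions: the event names are hypothetical labels for the four triggers listed, and "annual" is approximated as 365 days.

```python
from datetime import date
from typing import Iterable

# Hypothetical labels for the reevaluation triggers listed above.
REVIEW_TRIGGERS = {
    "test_or_exercise_completed",
    "disaster_response_completed",
    "personnel_or_structure_changed",
    "policy_or_infrastructure_changed",
}


def review_due(last_review: date, today: date, events: Iterable[str]) -> bool:
    """Return True if the plan needs review: a year has passed or a trigger occurred."""
    annual_review_overdue = (today - last_review).days >= 365
    triggered = bool(set(events) & REVIEW_TRIGGERS)
    return annual_review_overdue or triggered


print(review_due(date(2024, 1, 1), date(2024, 6, 1), []))  # False
print(review_due(date(2024, 1, 1), date(2024, 6, 1),
                 ["personnel_or_structure_changed"]))      # True
```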
Testing the Disaster Recovery Plan
Testing the disaster recovery plan is essential. Just as a tape backup system cannot be considered working until full restoration tests have been conducted, a disaster recovery plan has many elements that are only theoretical until they have actually been tested and certified. A test plan must be created, and testing must be carried out in an orderly, standardized fashion on a regular basis.
Reasons for Testing
In addition to the general reasons for testing that we have previously mentioned, there are several specific reasons to test, primarily to inform management of the recovery capabilities of the enterprise. Other specific reasons are as follows:
-
Testing verifies the accuracy of the recovery procedures and identifies deficiencies.
-
Testing prepares and trains the personnel to execute their emergency duties.
-
Testing verifies the processing capability of the alternate backup site.
Creating the Test Document
To get the maximum benefit and coordination from the test, a document outlining the test scenario must be produced, containing the reasons for the test, the objectives of the test, and the type of test to be conducted (see the five following types). Also, this document should include granular details of what will happen during the test, including the following:
-
The testing schedule and timing
-
The duration of the test
-
The specific test steps
-
Who will be the participants in the test
-
The task assignments of the test personnel
-
The resources and services required (supplies, hardware, software, documentation, and so forth)
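One way to keep a test document complete is to treat the items above as required fields and flag any that are still empty. The sketch below assumes a hypothetical `DrpTestDocument` structure; the field names simply mirror the items listed in this section and are not a standard template.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class DrpTestDocument:
    # Field names are illustrative; they mirror the items listed in the text.
    reasons: str = ""
    objectives: str = ""
    test_type: str = ""
    schedule: str = ""
    duration: str = ""
    steps: List[str] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)
    task_assignments: Dict[str, str] = field(default_factory=dict)
    resources: List[str] = field(default_factory=list)


def missing_sections(doc: DrpTestDocument) -> List[str]:
    """Return the names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


draft = DrpTestDocument(
    reasons="Verify recovery procedures",
    objectives="Identify deficiencies in the plan",
    test_type="checklist review",
)
print(missing_sections(draft))
```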
Certain fundamental concepts will apply to the testing procedure. Primarily, the test must not disrupt normal business functions. Also, the test should start with the easy testing types (see the following section) and gradually work up to major simulations after the recovery team has acquired testing skills.
It’s important to remember that the reason for the test is to find weaknesses in the plan. If no weaknesses were found, it was probably not an accurate test. The test is not a graded contest on how well the recovery plan or personnel executing the plan performed. Mistakes will be made, and this is the time to make them. Document the problems encountered during the test and update the plan as needed, and then test again.
The Five Disaster Recovery Plan Test Types
Disaster recovery/emergency management plan testing scenarios have several levels and can be called different things, but there are generally five types of disaster recovery plan tests. The listing here is prioritized, from the simplest to the most complete testing type. As the organization progresses through the tests, each test is progressively more involved and more accurately depicts the actual responsiveness of the company. Some of the testing types, such as the last two, require major investments of time, resources, and coordination to implement. The CISSP candidate should know all of these and what they entail.
The following are the testing types:
-
Checklist review. During a checklist review, copies of the disaster recovery plan are distributed to each business unit’s management. The plan is then reviewed to ensure that it addresses all procedures and critical areas of the organization. This is considered a preliminary step to a real test and is not a satisfactory test in itself.
-
Table-top exercise or structured walk-through test. In this type of test, members of the emergency management group and business unit management representatives meet in a conference room setting to discuss their responsibilities and how they would react to emergency scenarios by stepping through the plan. The goal is to ensure that the plan accurately reflects the organization’s ability to recover successfully, at least on paper. Each step of the plan is walked through in the meeting and marked as performed. Major glaring faults with the plan should be apparent during the walk-through.
-
Walk-through drill or simulation test. The emergency management group and response teams actually perform their emergency response functions by walking through the test, without actually initiating recovery procedures. During a simulation test, all the operational and support personnel expected to perform during an actual emergency meet in a practice session. The goal here is to test the ability of the personnel to respond to a simulated disaster. The simulation goes to the point of relocating to the alternate backup site or enacting recovery procedures, but it does not perform any actual recovery process or alternate processing.
-
Functional drill or parallel test. This type tests specific functions such as medical response, emergency notifications, warning and communications procedures, and equipment, although not necessarily all at once. This type of test also includes evacuation drills, in which personnel walk the evacuation route to a designated area where procedures for accounting for the personnel are tested. A parallel test is a full test of the recovery plan, utilizing all personnel. The goal of this type of test is to ensure that critical systems will actually run at the alternate processing backup site. Systems are relocated to the alternate site, parallel processing is initiated, and the results of the transactions and other elements are compared.
-
Full-interruption or full-scale exercise. A real-life emergency situation is simulated as closely as possible. This test involves all the participants who would be responding to the real emergency, including community and external organizations. The test may involve ceasing some real production processing. The plan is totally implemented as if it were a real disaster, to the point of involving emergency services (although for a major test, local authorities might be informed and help coordinate).
Figure 8-3 lists the five disaster recovery plan testing types in priority order.
Figure 8-3: Disaster Recovery Plan Testing Types
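Because the five test types run from simplest to most complete, they map naturally onto an ordered enumeration. The following sketch assumes hypothetical names (`DrpTestType`, `next_test_type`) and simply encodes the progression described in this section.

```python
from enum import IntEnum
from typing import Optional


class DrpTestType(IntEnum):
    """The five test types, ordered from simplest to most complete."""
    CHECKLIST_REVIEW = 1
    STRUCTURED_WALK_THROUGH = 2
    SIMULATION = 3
    PARALLEL = 4
    FULL_INTERRUPTION = 5


def next_test_type(completed: Optional[DrpTestType]) -> Optional[DrpTestType]:
    """Return the next test to attempt, or None once the full-interruption test is done."""
    if completed is None:
        return DrpTestType.CHECKLIST_REVIEW
    if completed is DrpTestType.FULL_INTERRUPTION:
        return None
    return DrpTestType(completed + 1)


print(next_test_type(None).name)
print(next_test_type(DrpTestType.SIMULATION).name)
```

Using an ordered type makes it easy to enforce the guidance that a recovery team should start with the easy test types and work up to major simulations.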
Disaster Recovery Procedures
This part of the plan details what roles various personnel will take on, what tasks must be implemented to recover and salvage the site, how the company interfaces with external groups, and what financial considerations will arise. Senior management must resist the temptation to participate hands-on in the recovery effort, as these efforts should be delegated. Senior management has many very important roles in the process of disaster recovery, including:
-
Remaining visible to employees and stakeholders
-
Directing, managing, and monitoring the recovery
-
Rationally amending business plans and projections
-
Clearly communicating new roles and responsibilities
Information or technology management has more tactical roles to play, such as:
-
Identifying and prioritizing mission-critical applications
-
Continuously reassessing the recovery site’s stability
-
Recovering and constructing all critical data
Monitoring employee morale and guarding against employee burnout during a disaster recovery event is the proper role of human resources. Other emergency recovery tasks associated with human resources could include:
-
Providing appropriate retraining
-
Monitoring productivity of personnel
-
Providing employees and family with counseling and support
The financial area is primarily responsible for:
-
Reestablishing accounting processes, such as payroll, benefits, and accounts payable
-
Reestablishing transaction controls and approval limits
Isolation of the incident scene should begin as soon as the emergency has been discovered. Authorized personnel should attempt to secure the scene and control access; however, no one should be placed in physical danger to perform these functions. It’s important for life safety that access be controlled immediately at the scene, and only by trained personnel directly involved in the disaster response. Additional injury or exposure to recovery personnel after the initial incident must be prevented.
The Recovery Team
A recovery team will be clearly defined with the mandate to implement the recovery procedures at the declaration of the disaster. The recovery team’s primary task is to get the predefined critical business functions operating at the alternate backup-processing site.
Among the many tasks the recovery team will have will be the retrieval of needed materials from off-site storage - that is, backup tapes, media, workstations, and so on. When this material has been retrieved, the recovery team will install the necessary equipment and communications. The team will also install the critical systems, applications, and data required for the critical business units to resume working.
The Salvage Team
A salvage team, separate from the recovery team, will be dispatched to return the primary site to normal processing environmental conditions. It’s advisable to have a different team, because this team will have a different mandate from the recovery team. They are not involved with the same issues the recovery team is concerned with, such as creating production processing and determining the criticality of data. The salvage team has the mandate to quickly and, more importantly, safely clean, repair, salvage, and determine the viability of the primary processing infrastructure after the immediate disaster has ended.
Clearly, this cannot begin until all possibility of personal danger has ended. Firefighters or police might control the return to the site. The salvage team must identify sources of expertise, equipment, and supplies that can make the return to the site possible. The salvage team supervises and expedites the cleaning of equipment or storage media that may have suffered from smoke damage, the removal of standing water, and the drying of water-damaged media and papers.
This team is often also given the authority to declare when the site is up and running again - that is, when the resumption of normal duties can begin at the primary site. This responsibility is large, because many elements of production must be examined before the green light is given to the recovery team that operations can return.
Normal Operations Resume
This job is normally the task of the recovery team, or another, separate resumption team may be created. The plan must have full procedures on how the company will return production processing from the alternate site to the primary site with the minimum of disruption and risk. It’s interesting to note that the steps to resume normal processing operations will be different from the steps in the recovery plan; that is, the least critical work should be brought back first to the primary site.
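The reversal of order between recovery and resumption can be made concrete with a short sketch. It assumes each business function has a numeric criticality score (higher means more critical); the function names and scores below are hypothetical examples, not taken from the text.

```python
from typing import Dict, List


def recovery_order(criticality: Dict[str, int]) -> List[str]:
    """Order for bringing functions up at the alternate site: most critical first."""
    return sorted(criticality, key=criticality.get, reverse=True)


def resumption_order(criticality: Dict[str, int]) -> List[str]:
    """Order for returning functions to the primary site: least critical first."""
    return sorted(criticality, key=criticality.get)


# Hypothetical business functions with assumed criticality scores.
functions = {"order processing": 9, "payroll": 6, "internal wiki": 2}

print(recovery_order(functions))    # ['order processing', 'payroll', 'internal wiki']
print(resumption_order(functions))  # ['internal wiki', 'payroll', 'order processing']
```

Moving the least critical work back first means that any remaining problems at the primary site surface before the most important processing depends on it.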
It’s important to note that the emergency is not over until all operations are back in full production mode at the primary site. Reoccupying the site of a disaster or emergency should not be undertaken until a full safety inspection has been done. Ideally the investigation into the cause of the emergency has been completed and all damaged property has been salvaged and restored before returning. During and after an emergency, the safety of personnel must be monitored, any remaining hazards must be assessed, and security must be maintained at the scene. After all safety precautions have been taken, an inventory of damaged and undamaged property must be done to begin salvage and restoration tasks. Also, the site must not be reoccupied until all on-site investigative processes have been completed. Detailed records must be kept of all disaster-related costs, and valuations must be made of the effect of the business interruption.
All elements discussed here involve well-coordinated logistical plans and resources. Managing and dispatching a recovery team, a salvage team, and perhaps a resumption team is a major effort, and the brevity of the descriptions here should not obscure how serious a task it is.




