Anyone can take a backup, but can they recover? Even with today's improved capabilities, the actual recovery process continues to be a challenge for IT managers in Windows shops.
On the surface, disaster recovery seems simple enough. All you need are the backup files and similar hardware at another location, right? In some cases, this concept works if you're restoring just a workstation or server. But everyone who has restored a workstation or server knows that sometimes these seemingly simple tasks are more difficult than they should be.
Regardless of the technical approach, the following are critical factors that will improve your confidence in the entire disaster recovery process, from planning to implementation:
1. Streamline business and technology -- IT managers always hope their disaster recovery plans are never activated. But, if disaster strikes, your company's investment in disaster recovery and its ongoing support for the process will pay dividends. That's why it's important to get top-level management to buy into disaster recovery right from the start.
A disaster recovery plan based on a business-continuity approach takes into account all the staffing and technology that are needed to keep the business going. Like an insurance policy, a solid disaster recovery plan brings peace of mind that even top executives can appreciate.
To get everyone on board, IT and business professionals from around the company should be enlisted to serve on a corporate disaster recovery team. A formal project plan should be written for any major new effort so that all goals are formalized. Business professionals can provide key input for the process, such as how much downtime will cost and how long each department can operate offline. They can also help rank all application systems in their order of importance so that the order of recovery can be developed. Key files and their associated retention periods should be continuously reviewed.
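The ranking exercise described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Application` record, its field names and the sample figures are hypothetical, and the rule used here (shortest tolerable outage first, with ties going to the higher hourly downtime cost) is one reasonable choice rather than a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Application:
    # All fields are hypothetical inputs gathered from the business side.
    name: str
    max_offline_hours: float       # how long the department can operate offline
    downtime_cost_per_hour: float  # estimated cost of each hour of downtime


def recovery_order(apps):
    """Sort so the least tolerable outage comes first; ties go to the costlier app."""
    return sorted(apps, key=lambda a: (a.max_offline_hours, -a.downtime_cost_per_hour))


# Sample figures are invented for illustration only.
apps = [
    Application("payroll", 72, 500.0),
    Application("order-entry", 4, 12000.0),
    Application("email", 24, 2000.0),
    Application("warehouse", 4, 3000.0),
]

ordered = [a.name for a in recovery_order(apps)]
print(ordered)
```

In practice the team may weigh other factors, such as dependencies between systems, so a list like this is best treated as a starting point for discussion.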
2. Continuously fine-tune documentation -- Each application should be documented so that files, application programs and the network infrastructure can be restored correctly. As changes take place that affect master files, applications or operating systems, the disaster recovery plan must be updated.
For Microsoft-based applications, the documentation must include server operating system requirements, Windows domains, security settings, domain controllers, DHCP and WINS servers, and communications. Additional documentation should cover virtual machine servers, other operating systems (such as Unix, Linux or Solaris), storage area networks, the network topology, communication requirements, vendor contacts, employee home contacts and other helpful information.
One approach that works well is publishing standard templates, application write-ups and disaster recovery standards on our intranet. Besides being a great source of documentation, the intranet is always backed up and would be readily available during recovery procedures.
The IT change control process should also require a disaster recovery review when major application changes are made. Integrating DR reviews into change control will help keep documentation updated. When your disaster recovery plan starts to become outdated, it will slow down or even stop recovery efforts.
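One way to make the DR review in change control concrete is a completeness check on each application write-up. The sketch below is a minimal illustration: the section names in `REQUIRED_SECTIONS` are hypothetical labels loosely based on the documentation items listed earlier, and the write-up is assumed to be a simple dictionary of section name to text.

```python
# Hypothetical section names; adapt them to your own template.
REQUIRED_SECTIONS = [
    "server_os_requirements",
    "windows_domains",
    "security_settings",
    "network_topology",
    "vendor_contacts",
]


def missing_sections(write_up):
    """Return the required sections that are absent or blank in a write-up."""
    return [s for s in REQUIRED_SECTIONS if not (write_up.get(s) or "").strip()]


# Sample write-up invented for illustration only.
write_up = {
    "server_os_requirements": "Windows Server, 8 GB RAM",
    "windows_domains": "CORP",
    "security_settings": "",
    "network_topology": "see diagram on intranet",
}

gaps = missing_sections(write_up)
print(gaps)
```

A reviewer could run a check like this before signing off on a major change, and treat any gap as a reason to update the write-up first.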
3. Maintain security at all times -- Information and security settings must be carefully preserved at all points during the process. As a starting point, all on-site and off-site backups must be physically protected. These files contain a wealth of sensitive information about customers, employees and the company's business.
During recovery testing, all network security must be carefully maintained. The disaster recovery checklist should include security testing to ensure that network and server settings are working. This is particularly important if domain controllers are being restored along with the applications in larger, full-scale tests. Do not take temporary shortcuts around security controls during the recovery process just to get applications working quickly.
4. Ensure success by actively testing -- Change is the watchword for the IT industry. Active testing of a disaster recovery plan is required because hardware, software and applications are constantly changing. If a DR test has not been conducted for an application in a couple of years, the recovery process may not work well. Just as most companies do not release changes to their application systems without rigorous testing, the disaster recovery environment also needs to be well tested.
Systematically testing your disaster recovery plan will provide confidence in the recovery process for each application you evaluate. Until the DR process is actually tested, it is only a best guess at how to restore the system. Testing also gives the recovery team the training it needs to be better prepared in the event of a real emergency.
Even plans that are well thought out can have missing steps or other surprises when they are tested. For example, when we started the process, we found that recovering Windows NT backups to a spare server would cause blue screen of death (BSOD) errors because of hardware differences. We were more successful in rebuilding the OS environment and then restoring the application. Recovery then improved in newer Windows Server releases, where we could use full image copies. These types of issues can only be found when the process is thoroughly tested.
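The earlier warning about applications that have gone a couple of years without a DR test lends itself to a simple automated reminder. The following is a minimal sketch, assuming test dates are tracked per application; the two-year threshold, the function name and the sample dates are all hypothetical.

```python
from datetime import date, timedelta

# "A couple of years" expressed as an assumed threshold; tune to your own policy.
MAX_TEST_AGE = timedelta(days=2 * 365)


def overdue_for_test(last_tested, today):
    """Return names of apps never tested or last tested more than MAX_TEST_AGE ago."""
    return sorted(
        name
        for name, tested in last_tested.items()
        if tested is None or today - tested > MAX_TEST_AGE
    )


# Sample dates invented for illustration; None means no DR test on record.
last_tested = {
    "order-entry": date(2007, 3, 15),
    "payroll": date(2004, 9, 1),
    "email": None,
}

overdue = overdue_for_test(last_tested, today=date(2007, 11, 1))
print(overdue)
```

A report like this does not replace the test itself, but it can help the recovery team decide which applications to schedule next.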
Disaster recovery – just do it
Although these four points are not a complete list, they do offer key points of improvement we have incorporated from our own evaluations of the process. When IT and business professionals work together, it helps to ensure critical needs are not overlooked.
Documentation is critical, especially if a true disaster strikes the data center and no one is allowed access on site. Building disaster recovery documentation updates into the change control process helps ensure the documentation is kept up to date. You must maintain security to ensure corporate information is protected within the disaster recovery environment.
Finally, active testing is the only way companies can be sure their business continuity plans will work as expected. During a true recovery, any delays will create additional expenses as well as lost business opportunities.
Harry L. Waldron has more than 35 years of experience in the IT profession. A Microsoft MVP, he works as a senior developer for Parsippany, N.J.-based Fairfax Information Technology Services where he provides technical, business and leadership support on key development projects. He writes about security and best practices for several technical forums, including myITforum.com.
This was first published in November 2007