Backup and Recovery: Backup Seems Simple …

Conceptually, a backup strategy is simple. A system administrator decides what data is critical for business operation, determines a backup schedule that has a minimal effect on operations, and uses a backup utility program to make the copies. The backups are stored in a safe place so they can be used to recover from a failure.

Though a backup strategy is quite simple in concept, the difficulty comes in the details. Architecting a backup and recovery strategy is more involved than most people realize. One of the most frustrating and discouraging tasks is determining where to start. What at first seems a simple task becomes daunting as you start digging deeper and realize how many elements of the backup strategy are interconnected. For example, as a system administrator of a large enterprise, chances are you would not want the burden of deciding what data is backed up when, and for how long it is kept. In fact, you may be presented with various analysis summaries of the business units or own the task of interviewing the business unit managers yourself in order to have them determine the data, the window in which backup may run, and the retention level of the data once it is stored on the backup media. This is often called a business impact analysis (BIA) and should yield some results that will be useful during the policy-making process. The results of these reports should also help define the recovery window, should this particular business unit suffer a disaster where data cannot be accessed or updated. Knowledge of these requirements may, in fact, change the entire budget structure for your backup environment, so it is imperative during the design and architecture phase that you have some understanding of what the business goals are with regard to recovery.

You will find that most business unit managers are not as concerned about backup as they are with recovery. As you can see from the level of complexity of our example, too often the resulting frustration may lead to inactivity where nothing gets done, or at least not done in the most effective manner. The obvious intent of a backup and recovery system is to provide data protection. Since we are setting up a system to protect the data, the next step also seems obvious: Determine how much data is in the enterprise and where it resides. This is an important part of establishing the backup and recovery system, but it does not provide enough information to architect a strategy. In addition to knowing how much data you have and where it is, you must also have a good understanding of why the data is being backed up and what the recovery requirements are. This is necessary so you can make the appropriate decisions about the overall backup and recovery strategy. The more you understand the nature of the data and the level of protection required, the better decisions you can make in setting up the entire backup and recovery environment.

The Goals of Tape Backup

You always want to keep in mind that the overall goal of tape backup is to make copies of your data that can be used to recover from any kind of data loss. The primary goals of the tape backup portion of an overall data protection strategy are to do the following:

Understand the goals of the business in order to deliver a properly configured backup environment.
Enable information services to resume as quickly as is physically possible after any system component failure or application error.
Enable data to be relocated to where it's needed, when it's needed by the business.
Meet regulatory and business policy data retention requirements.
Meet recovery goals; in the event of a disaster, return the business to a predetermined operating level.

Each of these goals relates to a specific area of data protection and needs to be considered as we put together our overall backup strategy. Specifically, you should ask why data is being backed up. As you consider each system or group of systems, keep in mind whether the data is being backed up to protect against failure, disaster, or regulatory requirements, and if the goals of the business will be met in the event of a failure or disaster. In reality, your success as a backup administrator will not be measured by how fast you are able to back up your data but how swiftly you are able to meet the aforementioned goals. Stated simply, your success will be defined by the restorability of the data in the environment.

The Role of Tape Backup

For a personal computer user, backup typically means making a copy of the data on the computer's hard drive onto a tape or CD-ROM. Personal backup media are often labeled by hand and are 'managed' by storing them in a drawer or cabinet located in the room with the computer. In the enterprise, data protection is a little more complex. Enterprise backup must be able to do the following:

Make copies of your data, whether organized as files, databases, or the contents of logical volumes or disks.
Manage the backup media that contain these copies so that any backup copy of any data can be quickly and reliably located when required, and so that the media can be tracked accurately, regardless of the number.
Provide mechanisms to duplicate sets of backed up data so that while a copy remains on-site for quick restores, another copy can be taken off-site for archival or disaster protection purposes.
Track the location of all copies of all data accurately.

Why Is the Data Backed Up?

Why you are backing up data seems like a trivial question, but it really needs to be answered for all the data in the enterprise. Some of the most common answers to this question are as follows:

Business requirement
Hardware failure protection
Disaster recovery (DR)
Protection from application failure
Protection from user error
Specific service-level agreements with the users/customers (SLA)
Legal requirements

You need to understand what data on what systems falls into each category. By interviewing the data owners, you will be better equipped to categorize the data. In most cases, the administrators know what it takes to recover the operating system and, in some cases, the database engines and other applications. However, the onus must be placed upon the data owner (customer) for the administrators to fully understand the impact to the business in the event there is a data loss (BIA). Addressing their expectations up front will save much time, money, and potential embarrassment. Several years ago, one of us was given the task of architecting a backup solution that would allow for quick recovery. 'Quick' recovery is subjective, so the question asked was this: 'What is your expectation of a "quick" recovery?' Based on the response of 30 minutes, a proposal was drafted for the type of system that would need to be designed to meet this 30-minute recovery window. Soon after management reviewed the proposal, we agreed to a more realistic time frame. So you can see how this would give you an opportunity to show customers how much money their requirements will cost without you having to lose sleep in the process.
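To show a data owner what a stated recovery window implies, a rough calculation helps. The following is a minimal sketch, not part of any backup product: the function name and the 500-GB data size are hypothetical (the anecdote does not say how much data was involved), and the model ignores tape mount, positioning, and verification time.

```python
def required_restore_mb_per_sec(data_gb: float, window_minutes: float) -> float:
    """Sustained restore rate (MB/sec) needed to move data_gb within the window.

    Uses decimal units (1 GB = 1000 MB) and assumes the whole window is
    spent moving data, which is optimistic.
    """
    if data_gb < 0 or window_minutes <= 0:
        raise ValueError("data_gb must be >= 0 and window_minutes must be > 0")
    return (data_gb * 1000) / (window_minutes * 60)


if __name__ == "__main__":
    # Hypothetical 500-GB system with the 30-minute expectation from the text.
    rate = required_restore_mb_per_sec(500, 30)
    print(f"{rate:.0f} MB/sec sustained")
```

Under these assumptions the hypothetical system would need roughly 278 MB/sec sustained for the full 30 minutes, which is the kind of figure that turns a subjective 'quick' into a cost discussion.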

You will usually find some of the systems have fairly static data and would probably be backed up to protect against hardware failure or for DR. Other systems are very dynamic with a very active user base. Backup of this data should be considered for protection against application failure or user error. What is generally seen on systems is a mixture of these data types. The core operating system (OS) and base applications are usually static and can be rebuilt from release materials, while data used by the application can be very volatile. We will discuss each of these in more detail. Defining data types is vital, because understanding the data allows us to determine the recovery requirements. In most cases, the recovery requirements dictate the backup strategy.

Hardware Failure

Some of the data in an enterprise is backed up specifically to protect against hardware failure. You want to be sure you can recover an entire volume or database in case a disk or server fails. (The probability of doing any restore of less than an entire volume is very small.) The backup protection will be geared to this recovery requirement.

The best pure hardware failure protection is disk mirroring, that is, keeping a complete second copy of the data on another disk. However, this practice does not eliminate the need for backups. For the data that falls into this category, you might consider raw volume backups where all the data in a disk volume is backed up at disk read speed. A raw partition backup is a bit-by-bit backup of a partition of a disk drive on UNIX. On Windows NT/2000, this is called a disk-image backup. You do not read the data via the filesystem, so you avoid adding this process to the system overhead. A raw volume backup can give you much better backup performance; however, it has some restrictions. The primary restriction is that you back up the entire volume. For example, if a 50-GB volume is only 50 percent full, a filesystem backup would result in 25 GB being backed up. However, a raw volume backup would result in 50 GB being backed up, and, accordingly, more tape being used. Then, on the restore, the entire volume is restored regardless of how much data actually resides in it. You need to take this into account when determining whether to do raw backups.
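The arithmetic in the 50-GB example can be written down directly. This is only an illustrative sketch: the function names are hypothetical, and compression and filesystem metadata are ignored.

```python
def filesystem_backup_gb(volume_gb: float, fraction_full: float) -> float:
    """Data moved by a file-level backup: only the space actually in use."""
    if not 0.0 <= fraction_full <= 1.0:
        raise ValueError("fraction_full must be between 0 and 1")
    return volume_gb * fraction_full


def raw_backup_gb(volume_gb: float, fraction_full: float) -> float:
    """Data moved by a raw volume backup: the whole volume, used or not."""
    if not 0.0 <= fraction_full <= 1.0:
        raise ValueError("fraction_full must be between 0 and 1")
    return volume_gb


if __name__ == "__main__":
    # The example from the text: a 50-GB volume that is 50 percent full.
    print("filesystem backup:", filesystem_backup_gb(50, 0.5), "GB")
    print("raw volume backup:", raw_backup_gb(50, 0.5), "GB")
```

The gap between the two numbers widens as the volume gets emptier, which is why raw backups suit volumes that are mostly full.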

The backup strategy for this protection could be configured around the hardware layout of each system. If you know that a system will be backed up solely for hardware protection, you can lay out the system to optimize the backup and recovery performance. A lot of the data that could fall into this category is more static; it would be backed up less frequently and would usually involve full backups. This data can be entire systems within your enterprise or some of the static data that is found on more dynamic systems, such as the OS-related data or the actual applications that are loaded on a system.

Disaster Recovery

For systems that are a part of your DR strategy, you need to ensure you have all the data required to rebuild a system in an easily identified group. You must also ensure you have all the supporting data necessary to recover these systems. This can include the supporting OS data as well as everything required for the backup application in order to do full system restores. Using a vault-type solution where backups are sent off-site to be stored until needed in conjunction with the backup application greatly helps this task.

The biggest challenge here is identifying which systems and applications are critical and determining how fast they have to be back online. A part of the DR strategy should include the priority of recovering these systems. The speed of recovery can dictate some of the backup decisions. It is very likely that systems that are a part of your DR strategy might also require protection within one of the other strategies. You would actually configure your backup and recovery system to provide the necessary DR protection in addition to any other requirements. Keep in mind that when you declare a disaster it may mean you no longer have access to your primary site. So any reports, documentation, call lists, operations guides, and so on that you may require should be in an off-site location along with your DR backup media. Many DR test plans fail because of one document or component that was overlooked.

Application Failure

The data that needs protection against application corruption usually requires more frequent backups. In these instances, the use of both incremental and full backups is very important. The highest risk of application data corruption is database applications, so you should develop a specific backup strategy for these applications. Most of the backup applications can interface with the database applications to allow both full and incremental backups that can be done either hot (with the database still active) or cold (with the database shut down). The systems that require this type of data protection might also be part of your DR strategy, so they would be part of multiple strategies.

User Error

For the data that is directly user-generated or -accessed, you might want to consider a backup strategy for user error protection. This might also include mailboxes, but in these instances, the backup and recovery strategy is dictated by the mail application. The very nature of providing user error protection implies that there are many more instances of single file or directory restores, so the backup strategy needs to support this. This strategy would generally involve more frequent incremental backups. The frequency of backups is an important consideration if it involves data that users are deleting and restoring on a regular basis. You would also want to ensure the backups are configured to facilitate faster browsing and recovery.

Service Level Agreements

You might find some of the data is being backed up to meet a specific service level agreement (SLA). The backup strategy will depend on the exact agreement. It is very possible that the SLA will actually be for a recovery requirement. If that is the case, the backup strategy will be governed by these requirements. This is often the situation where there is a dedicated backup and recovery administration staff that provides this service for a particular company or agency. The other groups or business units become the customers of the backup group and could have specific SLAs. These will usually dictate the backup strategy. This is also the case in hosting centers. It is very important to determine exactly what the requirements are. These can involve any of the backup types mentioned, with the additional requirement to have systems or applications back online within a specific time frame. It is common to have an agreement that any request for the recovery of any file or directory must be accomplished within a given time. All of this information is required to allow you to actually put together a backup strategy.

Legal Requirements

Your company may be required by law to keep certain data for a particular time period, without exception. There is also the possibility that the legal department will strictly require that certain data types not be kept longer than a particular time period. These factors will further shape the way you architect the collective backup solution; for example, one server may be a member of multiple policies in order to meet the legal requirements for its data. It is good practice to include the legal department whenever possible when determining the data retention requirements. This is essentially a component of a business impact analysis.
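A retention rule with both a minimum and a maximum can be expressed as a small check on the age of each backup. The sketch below is an assumption-laden illustration: the function name, the three action labels, and the day counts in the usage lines are hypothetical and do not come from any regulation or backup product.

```python
from datetime import date
from typing import Optional


def retention_action(
    backup_date: date,
    today: date,
    keep_at_least_days: int,
    keep_at_most_days: Optional[int] = None,
) -> str:
    """Classify one backup copy against a minimum and optional maximum retention.

    Returns "retain" while the minimum period has not elapsed, "destroy" once
    the maximum period (if any) has been exceeded, and "eligible to expire"
    in between.
    """
    if keep_at_most_days is not None and keep_at_most_days < keep_at_least_days:
        raise ValueError("maximum retention is shorter than minimum retention")
    age_days = (today - backup_date).days
    if keep_at_most_days is not None and age_days > keep_at_most_days:
        return "destroy"
    if age_days < keep_at_least_days:
        return "retain"
    return "eligible to expire"


if __name__ == "__main__":
    today = date(2007, 12, 19)
    # Hypothetical rule: keep at least 90 days, never longer than 365 days.
    for taken in (date(2007, 11, 1), date(2007, 6, 1), date(2006, 6, 1)):
        print(taken, retention_action(taken, today, 90, 365))
```

A server whose data falls under two different rules would simply be evaluated against each rule separately, which mirrors the idea of one server belonging to multiple policies.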

Complexity in Enterprise Backup

The functions of enterprise tape backup may seem straightforward. But implementing a truly functional backup environment that meets enterprise data protection requirements can be a complex undertaking. When you design or update a backup strategy, complexity can arise for several reasons:

Ability to back up all of the data. For the backup strategy to be useful, it must ensure that all data that can be lost is backed up. In an enterprise with large numbers of information servers, some of which may share data with others, identifying the sets of data objects to be backed up can be a significant effort.
Frequency. Backup frequency is essentially a trade-off between resources (network and I/O bandwidth, processor capacity, tape and library hardware, and application access) and the need for the most current data possible. Again, with many information services needing data protection, finding the right balance between backup frequency and resource consumption is a challenge.
Integration of all data managers. Enterprises with many information services are likely to use multiple data management systems (filesystems and database management systems), each with its own mechanisms for backing up data objects that it recognizes. Your task is integrating these mechanisms into a schedule that provides a consistent backup of all required data for a service and keeping them up-to-date as the service changes.
Continuous availability. Continuous application availability is increasingly required in today's enterprise. A variety of mechanisms enable consistent backups with minimal application downtime. Choosing among these and implementing the choice can be a complex task.
Media management. Business or regulatory requirements can result in multiyear data retention requirements. Enterprises can find themselves responsible for maintaining backups and archives on tens or even hundreds of thousands of media (tape cartridges, optical disk platters, etc.). The procedures for managing large numbers of media can also be complex.
Management of multiple locations. Business considerations may require that servers and data be located in multiple locations. Maintaining a consistent set of backup procedures across multiple data centers can require extensive design or management talent.

The backup component of an enterprise data protection strategy has to accommodate all of these factors.

Where Do We Start?

As you start planning your backup and recovery system, you need to start gathering detailed information on your enterprise. You need to know the network layout for all systems. If your enterprise is made up of multiple networks, you need to know how much data resides on each and the speed of each network or subnet. Obviously, it is much faster to move data across a 100-Mb/sec (100Base-T) network than a 10-Mb/sec (10Base-T) network. You need to understand the network layout and the corresponding data to help identify potential bottlenecks and take them into consideration as you architect your backup and recovery system. (This information is also necessary in determining where to put media servers and tape devices, but we will get to that in a later chapter.)
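The difference between the two network speeds is easy to quantify. The following sketch estimates the best-case time to move a given amount of data over a link; the function name, the `efficiency` parameter, and the 100-GB figure in the usage lines are assumptions made for illustration, and real throughput will be lower because of protocol overhead and competing traffic.

```python
def transfer_hours(data_gb: float, link_mbit_per_sec: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb gigabytes over a link rated in megabits/sec.

    efficiency is the fraction of the rated speed actually achieved
    (1.0 means the theoretical maximum). Decimal units are used throughout.
    """
    if data_gb < 0 or link_mbit_per_sec <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed, or efficiency")
    megabits = data_gb * 1000 * 8  # GB -> MB -> megabits
    seconds = megabits / (link_mbit_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 100 GB of backup data on each kind of network.
    for name, speed in (("10Base-T", 10), ("100Base-T", 100)):
        print(f"{name}: {transfer_hours(100, speed):.1f} hours at rated speed")
```

Even at the full rated speed, the slower network takes ten times as long, which is often enough to push a backup outside its window.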

As you look at the network that makes up your enterprise, you need to understand the network speed and topology. You also need to understand the disk layout, especially for the larger file servers and database servers, or identify who has this knowledge. You should watch for bottlenecks involving the disks, as well as the networks, SCSI connections, and any other appropriate I/O paths. When considering the decisions that need to be made when architecting backup strategy, the two things you must always keep in mind are the effect on normal production and effects on restore speed and performance. This usually involves making the necessary cost trade-offs to achieve the best of all worlds.

Here are some of the steps necessary for you to gather the information needed before establishing the backup strategy:

Identify all the systems, noting the order in which they would need to be recovered following a disaster.
Identify all networks involved, including speed of network and existing load at various times throughout the 24-hour day and night.
Locate all existing backup-related hardware, such as tape drives and libraries.
Identify recovery requirements.
Identify data and application availability requirements during backup.
Determine the best way to move the data.

We discuss each of these points in a little more detail in the sections that follow.

Identify All Systems

You need to identify all systems that need to be backed up. Generally this will be most if not all of the systems in the enterprise, with the exception of user workstations. There may be some systems that are basically replicated systems and can be easily re-created. In general, it is only necessary to back up one of these systems. The following information should be gathered for all the systems:

Amount of data
Speed of system
Number and type of networks
Type of data: database or filesystem?
Priority of recovery in DR
Tape drive or library installed?
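The items in this list map naturally onto one record per system. The sketch below shows one possible shape for such an inventory; the class name, field names, and sample systems are all hypothetical, and a spreadsheet would serve equally well.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemRecord:
    """One row of the system inventory described in the list above."""
    name: str
    data_gb: float           # amount of data
    system_speed: str        # free-text note on CPU and I/O capability
    networks: List[str]      # number and type of networks
    data_type: str           # "database" or "filesystem"
    dr_priority: int         # 1 = recover first in a disaster
    has_tape_device: bool = False


def recovery_order(systems: List[SystemRecord]) -> List[str]:
    """System names sorted by DR priority, lowest number first."""
    return [s.name for s in sorted(systems, key=lambda s: s.dr_priority)]


if __name__ == "__main__":
    inventory = [
        SystemRecord("fileserver1", 250, "4 CPUs", ["100Base-T"], "filesystem", 3),
        SystemRecord("orders-db", 120, "8 CPUs", ["100Base-T"], "database", 1, True),
        SystemRecord("mail1", 60, "2 CPUs", ["10Base-T"], "filesystem", 2),
    ]
    print(recovery_order(inventory))
```

Keeping the DR priority alongside the sizing data means the same inventory answers both "how much must be backed up" and "what gets restored first."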

Identify All Networks Involved

The network layout is an important part of the information required. Identifying the layout can be very critical to establishing the backup and recovery strategy. This step addresses the potential performance bottlenecks, because slow networks are often some of the primary bottlenecks. If there is a significant amount of data on a slow network, a media server may need to be located on the network. Any systems that have large amounts of backup data, such as a system with more than 100 GB, should be considered as media servers and have direct connections to a tape drive or drives. Following is the information needed for the networks:

Speed of network
Amount of data residing on the systems
Location of any backup hardware
Current and proposed production traffic
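The two rules of thumb in this section, totaling the data behind each network and flagging large systems as media server candidates, can be sketched as follows. The function names and sample data are hypothetical; the 100-GB default comes from the example threshold in the text and should be adjusted to your environment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each entry: (system name, network name, amount of backup data in GB)
SystemEntry = Tuple[str, str, float]


def data_per_network(systems: Iterable[SystemEntry]) -> Dict[str, float]:
    """Total backup data (GB) residing on the systems of each network."""
    totals: Dict[str, float] = defaultdict(float)
    for _name, network, data_gb in systems:
        totals[network] += data_gb
    return dict(totals)


def media_server_candidates(systems: Iterable[SystemEntry], threshold_gb: float = 100.0) -> List[str]:
    """Names of systems holding more than threshold_gb of backup data."""
    return sorted(name for name, _network, data_gb in systems if data_gb > threshold_gb)


if __name__ == "__main__":
    systems = [
        ("orders-db", "lan-a", 120.0),
        ("fileserver1", "lan-a", 250.0),
        ("mail1", "lan-b", 60.0),
    ]
    print(data_per_network(systems))
    print(media_server_candidates(systems))
```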

Locate Backup Hardware

Identifying all the systems and mapping the network topology should provide an idea of the total backup requirements. Part of this information is the location of the potential backup devices. The next step is to make sure the hardware is correctly located within the enterprise. Any enterprise backup and recovery strategy should be based on an application that supports library and drive sharing to ensure the tape drives and libraries are connected throughout the enterprise in such a way as to minimize bottlenecks, as well as to gain the most use from these very expensive tape drive resources. In a pure local area network (LAN) environment, it might be advisable to physically locate the tape library or libraries close enough to the systems that have the largest amounts of data so they can be directly connected to the tape drives and therefore perform backups and restores without data being moved across the network. These systems become media servers and control access to their drives. To handle data from other LAN-based systems, you either need to add more drives and give these systems access to their own drives or use the media servers to handle the backups for the systems that do not have their own drives. Also, the systems must be physically located close enough to the tape devices to be directly connected via SCSI cables.

Backup is a part of any data protection strategy, but there are other technologies, such as replication, that are part of it as well. The key to a sound strategy is to incorporate all the different technologies. I have been involved in too many discussions with people who were trying to recover from an outage only to discover they were not as protected as they thought. One particular case involved a company that had lost their primary server that ran their most critical application. They spent several hours trying to recover from mirrored disks, when the actual failure was filesystem corruption. Mirrors did not help in this case. Their outage was extended, but they were able to recover, since they had backups. My worst call while working in support was from a system administrator who had done a mass delete from the wrong window and had removed enough of the operating system that he could not reboot. When he asked what he had to do to recover, I told him part of the process would be to restore from his latest backup. To this, he answered that configuring for backups was on his 'list of things to do.'

Locate Backup Hardware: SAN Alternative

If a storage area network (SAN) is available, it can allow for more flexibility in the backup and recovery strategy. The backup hardware can be better shared amongst the large data-resident systems while still keeping the data off the production LAN. This can also allow large systems to be backed up directly to tape without making the application servers general-purpose media servers and having these systems back up other LAN-based clients.

Identify Recovery Requirements

As you identify all the systems in the enterprise, you should note the specific recovery requirements of each system. This is very helpful in setting up the backup strategy. If an order-processing application can tolerate an eight-hour outage without severe business consequences, for example, an incremental backup strategy that minimizes backup time at the expense of restore time may be appropriate. For a Web retail application, on the other hand, where every minute of downtime means permanently lost sales, a strategy that replicates data in real time might be more appropriate, even with its greater impact on application performance. The other item to note is the order in which systems need to be recovered as part of your overall disaster recovery (DR) plan.
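The restore-time cost of an incremental strategy comes from the number of backups that must be read back. The sketch below assumes that each incremental contains only the changes since the previous backup of any kind (not a cumulative incremental); the function name and the labels are hypothetical.

```python
from typing import List, Tuple

# Each entry: (label, kind), in chronological order; kind is "full" or "incr".
Backup = Tuple[str, str]


def restore_chain(backups: List[Backup]) -> List[str]:
    """Labels of the backups needed to restore to the most recent backup.

    That is the latest full backup plus every incremental taken after it.
    """
    last_full = None
    for index, (_label, kind) in enumerate(backups):
        if kind not in ("full", "incr"):
            raise ValueError(f"unknown backup kind: {kind!r}")
        if kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup available to restore from")
    return [label for label, _kind in backups[last_full:]]


if __name__ == "__main__":
    week = [("sun", "full"), ("mon", "incr"), ("tue", "incr"), ("wed", "incr")]
    print(restore_chain(week))
```

The longer the interval between full backups, the shorter each nightly backup but the longer this chain becomes, which is the trade-off an eight-hour outage tolerance can absorb and a retail Web site cannot.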

Identify Data and Application Availability Requirements during Backup

As you assess the backup requirements of each system, you should also make sure you know which of the database applications must be kept up (remain 'hot') during the backup and which can be shut down to be backed up 'cold.' There are performance trade-offs involved with backing up a database while it is online, but sometimes this is necessary. The trade-offs arise from the increased I/O activity: the database activity continues while the backup adds its own I/O. There are other methods of handling database backups, either hot or cold, using frozen image technologies and possibly off-host backup methods. These are discussed later in the book.

Determine the Best Way to Move the Data

You have several options for moving the data from disk to tape. Each has its own advantages and disadvantages. The methods include the following:

Files. This involves using the operating system to read all the appropriate files within the backup set and move that data from disk to tape. This method has more operating system overhead but allows for single files to be backed up and restored. It also enables the application to check each file to determine access or modification time so incremental backups can be performed.
Volumes. An entire volume can be backed up without reading the filesystem structure but by doing a bit-by-bit copy of the data from disk to tape. This is called a raw backup. This method allows for much faster data transfers but in general does not allow for single-file backups and restores. It also does not allow incremental backups. This backup method results in an entire volume being backed up, even the portions that do not contain valid data.
Block level. If the filesystem has enough information about the files, it is possible to determine which blocks have been changed. If the backup application can interface with the filesystem, you can back up just the changed blocks. This type of backup is called block level incremental.
Mapped raw backup. Some backup applications, such as VERITAS Software's NetBackup, can map a raw volume and then perform a raw volume backup while retaining the filesystem map so single files can be restored. This also allows for incremental backups. This type of backup is discussed in more detail in the section on frozen image backups in Chapter 7, 'Evaluating Other Backup-Related Features and Options.'
Off-host backups. This is a mechanism where data is moved from disk to tape without the application host being directly involved in the disk reads or tape writes.
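As an illustration of the file-based method, the sketch below walks a directory tree and selects the files whose modification time is later than the previous backup, which is the core test behind a file-level incremental. It is a simplified, hypothetical example rather than how any particular backup product works: the function name is invented, and deleted files, renames, and permission-only changes are ignored.

```python
import os
from typing import List


def files_modified_since(root: str, since_epoch: float) -> List[str]:
    """Paths under root whose modification time is later than since_epoch.

    since_epoch is the time of the previous backup, in seconds since the
    epoch (the same scale as os.path.getmtime).
    """
    changed: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > since_epoch:
                    changed.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it in this sketch.
                continue
    return sorted(changed)


if __name__ == "__main__":
    import time

    # Files under the current directory changed in the last 24 hours.
    one_day_ago = time.time() - 24 * 60 * 60
    for path in files_modified_since(".", one_day_ago):
        print(path)
```

Because every file must be examined individually, this approach carries the operating system overhead described for file-based backups, while a raw volume copy skips that per-file work entirely.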


Wednesday, December 19, 2007
