Wednesday, December 19, 2007

Evaluating Storage Media Requirements

Backup Window and Amount of Data

One of the most critical steps in evaluating storage media requirements is determining the actual backup window. This is the actual amount of time you are allowed to have backups running: controlling the backup hardware, consuming a major part of the network, and using resources on the systems being backed up. The size of your backup window is becoming a much harder number to pin down. You must be able to determine the number of hours in a day and in a week that can be dedicated to backups, as this is an integral part of the equation for determining media requirements.

Based on the data we collected in the earlier chapters, you should now have a very good idea of how much data needs to be backed up each day and each week. You need to know how much data must be backed up during each window. Generally, the largest backups will be the full backups. In the past, most administrators performed daily incremental backups and did all their full backups over the weekend when most people were not working. This is changing. It is now very common for a percentage of the systems, say one-fifth, to have full backups done each day while the remaining systems have incremental backups each day, with the weekends reserved for maintenance or for catching up. If this is closer to your model, then your window is the time each day when backups are performed, and the amount of data is the average total backed up each day, roughly a fifth of your total data. If you do not have specific operational information on the amount of data that will make up your incremental backups, you can estimate using a percentage of change; 20 percent is a common figure unless you have a more accurate measurement. The goal here is to get as close as possible to your actual environment.
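To make this estimate concrete, here is a minimal Python sketch of the daily-volume calculation. The 1 TB total, the one-fifth rotation, and the 20 percent change rate are illustrative assumptions, not measured values:

    # Estimate the average daily backup volume when one-fifth of the
    # systems get a full backup each day and the rest get incrementals.
    total_data_gb = 1000      # total data across all systems (assumed)
    full_fraction = 1 / 5     # fraction of systems fully backed up per day
    change_rate = 0.20        # estimated daily rate of change

    full_gb = total_data_gb * full_fraction
    incremental_gb = total_data_gb * (1 - full_fraction) * change_rate
    daily_backup_gb = full_gb + incremental_gb

    print(f"Fulls per day:        {full_gb:.0f} GB")
    print(f"Incrementals per day: {incremental_gb:.0f} GB")
    print(f"Daily backup volume:  {daily_backup_gb:.0f} GB")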

Drives

Now that we have the amount of data and the number of hours needed to store that data, all we have left to do is some basic math. Just take the total amount of data that has to be backed up daily and divide by the duration of the daily backup window:

Ideal data transfer rate = Amount of data to back up ÷ Backup window

If you have 100 GB of data and an 8-hour window, your ideal data transfer rate would be 12.5 GB/hr.
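The same calculation in a couple of lines of Python, using the 100 GB and 8-hour figures from the example:

    # Ideal data transfer rate = amount of data / backup window.
    data_to_back_up_gb = 100   # GB to back up in this window
    backup_window_hr = 8       # hours available for backups

    ideal_rate_gb_hr = data_to_back_up_gb / backup_window_hr
    print(f"Ideal data transfer rate: {ideal_rate_gb_hr:.1f} GB/hr")  # 12.5 GB/hr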

After you have an idea of the ideal data transfer rate, you can then look at the different drive types to see which might offer the best fit for your needs. Not surprisingly, this is a little more complicated than just looking at the base numbers, though. With potential drive technology, you must consider both performance and capacity. In larger enterprise environments, one size usually does not fit all. As mentioned several times, you need to look at the recovery requirements first and work back. This might mean you will need two different types of drives, some that are very high performance but with less capacity and some that offer higher capacity with lower performance. Data that is being kept for long retention periods, especially to fulfill legal requirements, might be better suited for the lower-performance but higher-capacity media. Data that might be required for immediate restores where time is money might be better suited for the high-performance media. It is not uncommon to have backups done to high-performance drives and media and then the images vaulted to high-capacity drives and media for off-site storage.

A sample of tape drive transfer rates, capacities, and access times is given in Table 4.1. This information can be very helpful in determining which drive technology you need, but never forget these are all theoretical numbers and are given without taking into account the internal drive compression. Drive manufacturers advertise compression rates for the different drive technologies. These vary depending on the drive but are also theoretical numbers. These specifications can change with new firmware levels or versions of the drives. To get the most accurate numbers, contact the drive vendor or go to their Web site, where you'll find up-to-date specification sheets.

Table 4.1: Tape Drive Data Transfer Rates and Capacities

DRIVE            THEORETICAL           THEORETICAL        ACCESS TIME        COMPRESSION
                 TRANSFER RATE GB/HR   CAPACITY GB        (EXCLUDING
                 (NO COMPRESSION)      (NO COMPRESSION)   LOAD TIME)

4mm (HP DDS-2)   1.8                   4                  -                  -
4mm (HP DDS-3)   3.6                   12                 -                  -
Mammoth          11                    20                 60 sec             2:1
Mammoth-2        42.4                  60                 60 sec             2:1
DLT 4000         5.4                   20                 68 sec             2:1
DLT 7000         18                    35                 60 sec             2:1
DLT 8000         21.5                  40                 60 sec             2:1
SDLT             39.6                  110                70 sec             2:1
9840             36                    20                 11 sec             2.5:1
9940             36                    60                 41 sec             3.5:1
LTO              52.7                  100                25 sec             2:1
AIT-2            21.1                  50                 27 sec             2.6:1
AIT-3            42                    100                27 sec             2.6:1

When you start actually figuring how many of which kind of drive you will need, we recommend using the native transfer rates and capacities without compression. It is very difficult to estimate the compression rate you will experience, as it is totally dependent on the makeup of your data. Some data is very compressible, while other data will yield very little compression. If you base your architecture on no compression, the only surprises you experience should be good ones: plenty of capacity, with room for growth.

Capacity

After selecting the appropriate drive technology that provides the performance and cartridge capacity you need, you next want to look at how many cartridges you will need to have available. This involves all the elements we have looked at so far. The number of cartridges required depends on the amount of data that you are backing up, the frequency of your backups, your retention periods, and the capacity of the media used to store your backups. A simple formula that can be used is as follows:

Number of tapes = (Total data to back up × Frequency of backups × Retention period)/Tape capacity

Following is an example:

  • Total amount of data = 100 GB

  • Full backups per month = 4

  • Retention period for full backups = 6 months

  • Incremental backups per month = 30

  • Retention period for incremental backups = 1 month

Preliminary calculations:

  • Size of full backups = 100 GB × 4 per month × 6 months = 2.4 TB

  • Size of incremental backups = (20 percent of 100 GB) × 30 × 1 month = 600 GB

  • Total data stored = 2.4 TB + 600 GB = 3 TB

Solution:

  • Tape drive = DLT 7000

  • Tape capacity without compression = 31.5 GB

  • Total tapes needed for full backups = 2.4 TB ÷ 31.5 GB = 76.2, rounded up to 77

  • Total tapes needed for incremental backups = 600 GB ÷ 31.5 GB = 19.0, rounded up to 20

  • Total tapes needed = 77 + 20 = 97

By looking at this example, you would expect to have a minimum of 97 active cartridges at any given time. This assumes that every cartridge is filled to capacity with no unused tape, and the calculations are based on no compression. It does give you an idea of the steps necessary to plan for an appropriately sized tape library. We would never recommend implementing an enterprise backup strategy that does not include a robotic tape library with a barcode reader. Without these, media management can become overwhelming and very susceptible to human error. It is much better to turn media management over to an enterprise backup application.
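As a sanity check, here is the same cartridge arithmetic as a short Python sketch, using the figures from the example above (including its assumed 31.5-GB native capacity and 20 percent change rate):

    import math

    total_data_gb = 100
    fulls_per_month, full_retention_months = 4, 6
    incrs_per_month, incr_retention_months = 30, 1
    change_rate = 0.20
    tape_capacity_gb = 31.5    # native capacity assumed by the example

    full_gb = total_data_gb * fulls_per_month * full_retention_months                # 2400 GB
    incr_gb = total_data_gb * change_rate * incrs_per_month * incr_retention_months  # 600 GB

    full_tapes = math.ceil(full_gb / tape_capacity_gb)       # 77
    incr_tapes = math.ceil(incr_gb / tape_capacity_gb)       # 20
    print(f"Total tapes needed: {full_tapes + incr_tapes}")  # 97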

When figuring out how many slots are required to support your environment, do not forget to include some slots for cleaning tapes and at least two for the catalog backups. Actually, you will want to reserve twice as many slots for catalog backups as are needed so you can keep a copy of the catalog. If you are including an off-site storage solution of some type (vaulting) as part of your backup strategy, you need to include this in your total capacity calculations, since creating duplicate copies requires additional tapes.

Library?

As stated in the previous section, most enterprise backup strategies will include some type of robotic tape library. There are several library manufacturers, each with an entire line of libraries from small to very large. Part of this decision will be based on the drive technology you select, as some libraries support only certain drives. The considerations for selecting a library are as follows:

  • Does it handle the desired drive type?

  • Will it handle the required number of drives?

  • Does it support the needed number of slots?

  • Does it have expansion capability?

  • What type of connection does it use, SCSI or Fibre Channel?

  • Does it support barcode labels?

As you look at the different libraries available, you should also consider if your strategy is best served by one large library that contains all the drives and media or by smaller libraries that are distributed throughout your enterprise. We will discuss some of the reasons for picking one or the other in a later chapter, but part of this decision is whether you plan to implement a SAN or distributed media servers (or both). Generally, it is cheaper to buy one large library than two smaller libraries that equal the same capacity in drives and slots.

A sampling of library vendors includes ADIC, ATL, Compaq, Exabyte, Fujitsu, HP, IBM, NEC, Sony, Spectra Logic, and StorageTek. Each of these companies has a Web site containing information on its entire line of libraries, which is an excellent place to start.

An Introduction to NetBackup

Many commercial backup products are available on the market today. The leader amongst them on UNIX platforms is VERITAS Software's NetBackup DataCenter. We will use this product, which we refer to as just NetBackup, in our explanations and examples of setting up a backup domain. We start with an introduction to NetBackup, including an explanation of the unique architecture of the product, and then we define the terms that it uses. Most of the examples use the latest release, 4.5, but we will mention when there is a significant difference with older releases.

NetBackup Tiered Architecture

NetBackup uses a four-tiered architecture for backup domains.

The tiers are as follows:

  • Client. Any system that contains data that needs to be backed up.

  • Media server. Any system that has physically connected storage devices to be used for backups. These can be robotic devices, standalone tape drives, or optical storage devices.

  • Master. The NetBackup server that provides administration and control for backups and restores for all clients and servers. It is also the system that contains all the catalog information for the backup domain.

  • Global Data Manager. A master of masters that can monitor and facilitate management of multiple master servers and multiple backup domains.

All systems within a NetBackup domain fall within at least one of these tiers and can actually fit into more than one. The first three tiers are always found, even if on the same system. The fourth tier, the Global Data Manager tier, is usually found when there are multiple NetBackup domains that are monitored and administered from a single location. This tiered architecture is one of the things that make NetBackup so scalable and flexible. As you start out, you can have a single master server that gives you a single point of administration, and at the same time, you can have as many media servers as are needed to support your configuration. As you grow, you can add more tape devices and just add more media servers without having a great impact on your overall configuration. If your enterprise continues to grow, you can simply add another master server with media servers as needed. At this point, you might add the fourth tier. The first tier, clients, can be added or deleted easily, since the configuration is kept on the master server.

Explanation of Specific NetBackup Concepts

The NetBackup product can be thought of as being made up of two major components: netbackup and media manager. The netbackup component is responsible for the who, what, when, where, and how aspects of the backup jobs:

  • Who needs to be backed up? The client.

  • What needs to be backed up? The file or file list.

  • When do the backups run? The schedule.

  • Where should they be stored? The logical storage unit.

  • How should this policy be handled? Specific attributes of the backup policy.

It also tracks and manages all the backups and all of the backup images.

The media manager component is responsible for managing all the physical media and all the devices. In general, the netbackup component deals with logical devices, and the media manager deals with physical devices.

The netbackup component tracks and manages all the data that is backed up by using the unique backup identifier assigned to each backup image when it is created. It also manages the overall catalog and the scheduling of new tasks. It selects the appropriate media server to match each backup job.

The media manager component manages all the physical storage devices and the physical media. It is through the media manager that the physical tape libraries and drives are configured, and the volume database is populated. The media manager controls the tape libraries and maintains the inventory of all the volumes.

Laying Out the NetBackup Domain

Now the fun begins. You have gathered tons of data and know more about your enterprise than you ever thought was possible. It is time to put all of this knowledge to use. If this is the first time an actual backup and recovery strategy has been implemented, you will be able to tailor the backup domain. If this is an upgrade or application change, you will probably have to work within the confines of the existing layout, making changes as required.

Using NetBackup as the application in this domain, you first want to list all the systems that will be backed up as clients. This will give you an idea of the number of systems that need to be backed up and the distribution of data. Any systems with a large amount of data, over 100 GB for example, should be noted, as you might want to make them media servers. The other important thing to track with the clients is their network connectivity. If it looks like there are a lot of network-based clients on slow networks, you should consider installing a high-speed backup network. This gives you increased backup and recovery performance and keeps backup and recovery traffic off the production network. It is now common to install a 100Base-T or Gigabit Ethernet network dedicated to backup and recovery.

A NetBackup domain requires at least one master server. In most situations, there will be only one; however, in a later chapter we discuss some reasons to have more than one master. The system that you choose for the master will depend on the size of your enterprise: the number of clients, the total number of files being backed up, and the number of storage units you will need. In a smaller environment, the master server can be a system that is already being used for other work, or it could be a combined master and media server if it is attached to a backup device. Figure 3.2 shows an example of a configuration where the master server is also a media server and all the client backups are basically LAN-based backups.

In larger environments, the master server is usually a dedicated NetBackup server, although it could still be a media server. This server must have enough disk capacity to handle the NetBackup catalogs and, potentially, the debug logs. Most of the debug logs are located in /usr/openv/netbackup/logs. If this directory is not in a separate partition, you must make sure the logs do not grow to fill the disk. The largest part of the catalog is the image database, which is located in /usr/openv/netbackup/db/images. It is not uncommon for this directory to be a separate partition. All of the metadata for the backups is sent to the master and stored in this image database portion of the catalog. The maximum amount of disk space that NetBackup requires at any given time varies according to the following factors:

  • Number of files that you are backing up

  • Frequency of full and incremental backups

  • Number of user backups and archives

  • Retention period of backups

  • Average length of full pathname of files

  • File information (such as owner permissions)

  • Average amount of error log information existing at any given time

  • Whether you have enabled the master catalog compression option



To estimate the disk space required for the image database portion of the NetBackup catalog:

  1. Estimate the maximum number of files that each schedule for each policy backs up during a single backup of all its clients.

  2. Determine the frequency and retention period of the full and incremental backups for each policy.

  3. Use the information from Steps 1 and 2 to calculate the maximum number of files that exist at any given time.

    • Assume you schedule full backups every seven days with a retention period of four weeks and differential incremental backups daily with a retention period of one week. The number of file paths you must allow space for is four times the number of files in a full backup plus one week's worth of incrementals.

    • The following formula expresses the maximum number of files that can exist at any given time for each type of backup (daily, weekly, etc.):

      Files per Backup × Backups per Retention Period = Maximum Number of Files

    • If a daily differential incremental schedule backs up 1200 files for all its clients and the retention period is seven days, the maximum number of files resulting from these incrementals that can exist at one time are as follows:

      1200 × 7 days = 8400

    • If a weekly full backup schedule backs up 3000 files for all its clients and the retention period is four weeks, the maximum number of files due to weekly full backups that can exist at one time are as follows:

      3000 × 4 weeks = 12,000

    • Obtain the total for a server by adding the maximum files for all the schedules together. The maximum number of files that can exist at one time due to the preceding two schedules is the sum of the two totals, which is 20,400.


      Note

      For policies that collect true image restore information, an incremental backup collects catalog information on all files (as if it were a full backup). This changes the preceding calculation for the incremental from 1200 × 7 = 8400 to 3000 × 7 = 21,000. After adding 12,000 for the fulls, the total for the two schedules is 33,000, rather than 20,400.

  4. Obtain the number of bytes by multiplying the number of files by the average length of the file's full pathnames and file information.

    1. Determining the space required for binary catalogs: If you are unsure of the average length of a file's full pathname, use 100. Using the results from the examples in Step 3 yields the following:

      (8400 × 100) + (12,000 × 100) = 2,040,000 bytes ≈ 1992 KB (1024 bytes per kilobyte)

    2. Determining the space required for ASCII catalogs: If you are unsure of the average length of a file's full pathname, use 150. (Averages from 100 to 150 are common.) Using the results from the examples in Step 3 yields the following:

      (8400 × 150) + (12,000 × 150) = 3,060,000 bytes ≈ 2988 KB (1024 bytes per kilobyte)


      Note

      If you have ASCII catalogs and use catalog indexing, increase the number from Step 4 by 1.5 percent.

  5. If you are running with debug logging, add 10 to 15 MB to the total calculated in Step 4. This is the average space for the error logs. Increase the value if you anticipate problems.

  6. Allocate space so all this data remains in a single partition.
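A minimal Python sketch of Steps 1 through 5, using the example schedules above and the 100-byte average path length suggested for binary catalogs:

    # (files per backup, backups retained) for each schedule
    schedules = [
        (1200, 7),   # daily differential incrementals, 7-day retention
        (3000, 4),   # weekly fulls, 4-week retention
    ]
    avg_path_bytes = 100   # average full pathname plus file information
    debug_log_mb = 15      # Step 5 allowance if debug logging is enabled

    max_files = sum(files * kept for files, kept in schedules)   # 20,400
    catalog_kb = max_files * avg_path_bytes / 1024               # ~1992 KB

    print(f"Maximum files tracked:   {max_files}")
    print(f"Image database estimate: {catalog_kb:.0f} KB")
    print(f"With debug logs:         {catalog_kb / 1024 + debug_log_mb:.1f} MB")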

You must take many factors into account when determining what kind of system to use for the master server. It can be any UNIX or Windows system on the supported systems list. If the master is a dedicated system, you will need enough computing power to support the network adapters plus the NetBackup processes, as well as enough memory to support each. If the system is also a media server, the system resource requirements are higher. With NetBackup, it is not uncommon to share a tape library among multiple media servers. In many of these cases, the robotic control is handled by the master, while media servers share the tape drives, either directly connected or truly shared in a storage area network (SAN). (We look at using the Shared Storage Option in a SAN in a later chapter.) The following tables provide information about the number of CPUs and the amount of memory needed to support several hardware and software components, as well as the I/O adapter performance numbers. Use the numbers listed in Tables 3.1 through 3.3 to design your master and media servers.

Table 3.1: CPUs Needed per Backup Server Component

COMPONENT        NUMBER OF CPUS PER COMPONENT

Network cards    1 per 2-3 100Base-T cards
                 1 per 5-7 10Base-T cards
                 1 per 2-3 FDDI cards
                 1 per ATM card
                 1 per 1-2 Gb Ethernet cards (preferably 1)
Tape drives      1 per 2-3 DLT 8000 drives
                 1 per 2-3 DLT 7000 drives
                 1 per 3-4 DLT 4000 drives
                 1 per 2-4 8mm and 4mm drives
OS + NetBackup   1

Table 3.2: Memory Needed per Backup Server Component

COMPONENT                 MEMORY NEEDED PER COMPONENT

Network cards             16 MB per network card
Tape drives               128 MB per DLT 8000 drive
                          128 MB per DLT 7000 drive
                          64 MB per DLT 4000 drive
                          32 MB per 8mm or 4mm drive
OS + NetBackup            256 MB
OS + NetBackup + GDM      512 MB
NetBackup multiplexing    2 MB × no. of streams × no. of drives
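As a rough illustration of how Tables 3.1 and 3.2 combine, here is a Python sketch that sizes a hypothetical media server with two Gigabit Ethernet cards and four DLT 7000 drives. The component counts are assumptions for the example, not recommendations:

    import math

    gige_cards = 2                    # 1 CPU per 1-2 Gb Ethernet cards
    dlt7000_drives = 4                # 1 CPU per 2-3 drives; 128 MB each
    mpx_streams, mpx_drives = 4, 4    # multiplexing: 2 MB x streams x drives

    cpus = (math.ceil(gige_cards / 2)         # network cards
            + math.ceil(dlt7000_drives / 3)   # tape drives (optimistic 1 per 3)
            + 1)                              # OS + NetBackup

    memory_mb = (gige_cards * 16              # 16 MB per network card
                 + dlt7000_drives * 128       # 128 MB per DLT 7000 drive
                 + 256                        # OS + NetBackup
                 + 2 * mpx_streams * mpx_drives)

    print(f"CPUs needed:   {cpus}")           # 4
    print(f"Memory needed: {memory_mb} MB")   # 832 MB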

Figure 3.3 shows an example of a typical shared library configuration where there is a single master server and two media servers, each with two drives from a shared four-drive library. This would be a good option if the media servers either had a large amount of data or if you wanted to share the workload of backing up network clients.

If the amount of data is small enough, the drives could be directly connected to the master, making it the master and media server. When determining if you need a media server or multiple media servers, you should consider the following:

  • Amount of data

  • Location of data

  • Speed of networks

  • Backup window

Let's look at these in more detail.

Table 3.3: Drive Controller Data Transfer Rates

DRIVE CONTROLLER   THEORETICAL MB/SEC   THEORETICAL GB/HR

SCSI               5                    18
Narrow SCSI-2      10                   36
Wide SCSI-2        20                   72
Ultra ATA          33                   118.8
Ultra SCSI-3       40                   144
Ultra ATA 66       66                   237.6
Ultra2 SCSI-3      80                   288
Fibre Channel      100                  360

Amount of Data

The total amount of data that must be backed up when full backups are done is a good estimate for the maximum data that would be required to be managed. It is also very important to determine how much data is to be backed up on a daily basis. This is usually an estimation based on the amount of user or application data and the daily rate of change. If a filesystem contains 100 GB of data but only has a rate of change of 2 percent, you only have to worry about 2 GB of data for your daily backups. These two numbers, total data and changed data, are also used to determine how many tape drives are needed and are part of the media requirements formula.

Location of Data

If all the data is located on a couple of large file servers, you should make them media servers by physically connecting them to tape drives and maybe have one more to handle all the network-based clients. If the data is spread throughout your enterprise, you must decide how you want to configure the backup domain. You can configure a dedicated media server or servers and back up all the data over the LAN, or you can distribute media servers closer to the clients. The restriction here will be the SCSI cable length restrictions from the media servers to the libraries.

Speed of Networks

If a significant amount of data resides on clients on a slow network, you should consider either installing a high-speed backup network or, if there is enough data, making one of these clients a media server. The other consideration is the amount of traffic the backup and recovery requirements will add to the existing networks. If possible, you should put the backup and recovery traffic on a dedicated network. If this is not possible, you might have to throttle large backup clients on slow networks or they will dominate the network. Table 3.4 will help you determine how different networks will affect the overall backup performance.

Backup Window

The backup window can also come into play when you are determining media server requirements. Some straightforward formulas are used to calculate how many tape drives are required to back up a known amount of data in a fixed amount of time, assuming no other bottlenecks. We discuss these in the next chapter. If the amount of data to be backed up and the amount of time available result in too many drives required for a single media server, this would indicate another media server is needed. You must always stay within the system constraints when configuring media servers. It does no good to put more tape devices on a server than it has the I/O bandwidth to handle. You do not want to create any unnecessary bottlenecks.
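As a preview of those formulas, here is a minimal sketch: divide the data by the window to get the required rate, then divide by the drive's native rate and round up. The 500 GB of data is an illustrative assumption; the 18 GB/hr figure is the DLT 7000 native rate from Table 4.1:

    import math

    data_gb = 500              # data to back up in the window (assumed)
    window_hr = 8
    drive_rate_gb_hr = 18      # DLT 7000 native rate (Table 4.1)

    required_rate = data_gb / window_hr                   # 62.5 GB/hr
    drives = math.ceil(required_rate / drive_rate_gb_hr)  # 4 drives
    print(f"Required rate: {required_rate:.1f} GB/hr -> {drives} drives")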

Table 3.4: Network Data Transfer Rates

NETWORK TECHNOLOGY                THEORETICAL GB/HR

10Base-T                          3.6
100Base-T                         36
FDDI                              36
Gigabit Ethernet (GbE)            360
Quad FastEthernet (QFE), trunked  144
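Here is a small Python sketch that uses the Table 3.4 rates to check whether an assumed 50-GB client fits an 8-hour window on each network. Remember these are theoretical ceilings; real throughput will be lower:

    # Theoretical network rates from Table 3.4, in GB/hr.
    network_gb_hr = {
        "10Base-T": 3.6,
        "100Base-T": 36,
        "FDDI": 36,
        "Gigabit Ethernet (GbE)": 360,
        "Quad FastEthernet (QFE), trunked": 144,
    }

    client_data_gb = 50   # assumed client size
    window_hr = 8

    for network, rate in network_gb_hr.items():
        hours = client_data_gb / rate
        verdict = "fits" if hours <= window_hr else "does NOT fit"
        print(f"{network}: {hours:.1f} hr ({verdict})")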

Business Requirements of Backup Systems

After determining why the data needs to be backed up and the recovery requirements, you are ready to look at how your particular business requirements come into play. You need to determine how often each type of data or each system needs to be backed up, what the restore requirements are, what the data retention policy needs to be, any security requirements, off-site storage requirements, and unique business unit requirements. All of these items must be addressed.

Developing a Backup Strategy

To start this phase of architecting your backup and recovery strategy, you need to look at the frequency of backups and the required retention of the data. This is usually controlled by the business, legal, and recovery requirements. The business requirements that generally affect the backup strategy are those that define how long specific types of data must be kept available either locally or in a storage facility. These requirements could also specify the number of copies of the data that must be retained. In some cases, there are specific business requirements regarding how often specific data is backed up. Legal requirements must also be considered, although they are usually the basis of the specific business requirements. When you are dealing with data that might fall under control of any of the many governmental regulatory agencies, you must make sure your strategy complies with all their requirements.

Business Requirements

The specific business requirements that you need to consider include the following:

  • Service-level agreements to business units. What backup and recovery guarantees do you have?

  • Unique requirements for specific data. For example, all original circuit design must be kept for seven years.

  • Recovery time objectives. How fast will specific systems/applications be recovered?

  • Recovery point objectives. How far back in time are you willing to go to recover?

Legal Requirements

The legal requirements you need to consider are generally those imposed by governmental regulatory agencies. These typically involve specific data retention requirements for specific kinds of data. What makes this even more challenging is that these requirements can change because of changes in administrations or new laws. These can also dictate how many copies of the data must be kept and where it must be kept.


Recovery Requirements

As you build your strategy, you should make a special note of systems or applications that have special recovery requirements. These are usually covered by the business requirements but are worth mentioning again. We have found it much better to always look first at the recovery requirements when building a backup strategy, since that is probably the reason you are doing backups.

With these absolutes in mind, the next step is to take the information gathered in the first chapter and start your backup matrix. As you put this together, you should also consider the type of backup you need. Following are the different backup options:

  • Full backup. This backup copies all the files and directories that are below a specified directory or filesystem to a storage unit.

  • Cumulative incremental backup. Scheduled by the administrator on the master server, this option backs up files that have changed since the last successful full backup. All files are backed up if no prior backup has been done. This is very similar to a differential incremental backup, which is covered next, with one major difference: in the event of a full system recovery, a cumulative incremental backup requires only two images, the last full backup and the most recent cumulative incremental. While this speeds the recovery process, this type of backup requires more tape than the differential incremental and may take more time, because you are backing up all the files that have changed since the last full backup.

  • Differential incremental backup. Scheduled by the administrator on the master server, this option backs up files that have changed since the last successful incremental or full backup. All files are backed up if no prior backup has been done. This is what most people traditionally mean by an incremental backup. During a full recovery, using this type of backup could require more tapes. However, do not base your architecture decisions just on these two definitions, but rather on the information gathered during your initial discovery phase.

  • True image restore. This type of backup restores the contents of a directory to what it was at the time of any scheduled full or incremental backup. Previously deleted files are ignored. You can also select Move Detection, which specifies that true image incremental backups include files that were moved, renamed, or newly installed.





[*]From the NetBackup 4.5 DataCenter System Administrator's Guide, VERITAS

Frequency of Backups

Once you understand the general backup requirements for all of the data and the business and legal requirements, you should have a pretty good idea of how much data needs to be backed up and at least a minimum requirement for the frequency. The trick in establishing the ideal frequency policy is to come up with a schedule that gives you adequate protection with minimal media usage. You don't want to back up any more often than needed to get the necessary level of protection, since 'more often' means more tapes, more data being moved, and more administration. When in doubt, however, go with more media.

Establishing the best frequency and retention policy for data that is not covered by business and legal requirements also involves knowing why the data is being backed up and what recovery requirements are. In general, truly static data or static systems should not need very frequent backups. They might be backed up as infrequently as once a week or even once a month. As far as the number of copies required, a normal practice is to keep between two and four copies of the backups.

Data that is more dynamic requires more frequent backups and probably needs one of the incremental types. The decision between differential incremental and cumulative incremental is based on recovery requirements versus media usage. Weekly full backups and daily differential incremental backups could require up to six tapes to restore a directory, filesystem, or database in a worst-case scenario. Each of the incremental backup images might be small, but each day's changes could be on a separate tape, or at least different images on the same tape. If you did the same backup sequence with cumulative incremental backups, no recovery would take more than two images that could reside on two tapes; however, if there were enough changes to the data, the cumulative incremental backups could approach a full backup in size. You must decide whether it is better to use fewer tapes with differential backups but run the risk of having to mount more tapes on a restore or potentially use more tapes on the backups but only have to restore two images to restore an entire backup.
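A short sketch of the worst-case restore arithmetic, assuming weekly fulls with daily incrementals on the five days in between, and a restore just before the next full:

    days_since_full = 5   # daily incrementals taken since the last full

    # Differential: the full plus every daily differential since it.
    differential_images = 1 + days_since_full   # 6 images, possibly 6 tapes

    # Cumulative: the full plus only the most recent cumulative.
    cumulative_images = 2

    print(f"Differential scheme: up to {differential_images} images")
    print(f"Cumulative scheme:   {cumulative_images} images")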

Obviously, another piece of this equation is the anticipated type of recovery activity. If the data is being backed up for DR protection or to protect against hardware failures, the question of differential versus cumulative is important. If the data is being backed up to protect against user deletion or error, you should stay with differential backups. The recovery requirement also comes into play. By using the information you gathered during the interview process with the data owners, you can realistically plan based on the expectations you set during your discovery. If the absolute speed of recovery is important, the use of cumulative incremental backups is desired. The incremental type generally comes into play when you are working with filesystem backups. With databases, you generally work with the application tools that are integrated with the backup application, and this will dictate much of what you do.

Retention of Backups

A very common mistake people make is to retain their backups for too long. This increases the cost of backups, since you will need more media. You need to make sure you understand all the legal and business requirements for keeping copies of the backups and ensure you meet them. For normal operations, you want to make sure you have a retention level that at least exceeds the frequency. If you do a particular backup weekly, the retention level must be at least one week or you will be unprotected. A general practice is to keep at least two cycles of each type of backup. This way you will always have two copies of the data on tape when doing the next backup. This method allows you to recover from a crash that might occur on the day/time for the next backup, plus it provides an extra copy in case there is a problem with one of the tapes. Another common practice is to assign off-site tapes a different retention level than tapes kept on-site. The reasons people do this vary, but in most cases, it is driven by the business. For instance, perhaps the business requires that the backup images be kept on-site for 30 days for recovery purposes, while off-site images must be kept for 180 days.

Give this entire issue some careful thought, and don't just say you are going to keep everything forever. If this describes your particular situation, you do have an uphill battle ahead of you, but it's not the end of the world. The best thing to do from this point on is to make sure you can classify your data properly, then redesign your backup policies so you are only backing up and keeping the data you need for the periods of time you require.

Security of Backups

Security of the backups is something else that must be taken into consideration. Your business might require encrypted backups so that the data on the tapes cannot be recovered without the proper key or password. Many backup and recovery applications offer this kind of backup, but there is a performance penalty associated with encrypted backups. The data will be encrypted on the client system before being sent across the network, so this will require CPU cycles on the client and will also slow down the rate of data being presented to the network for backup. The data is very secure, since it is encrypted before it is sent across the network and is still encrypted when written to tape. This also requires the key be known in order to do a restore. Most people rely on keeping the media secure rather than implementing data security by using encryption.

Off-Site Storage Requirements

Most people today realize the need to implement a true data protection strategy of which backup is an integral part. One of the components of this strategy is management of off-site media. You should always select a backup product such as VERITAS NetBackup that offers an automated vaulting or off-site storage solution so you have the management tools for tracking the media that is off-site. The questions that you are addressing for an off-site or vault solution are as follows:

  • What images need to be sent off-site?

  • How many copies of the images need to be kept off-site?

  • How long should each image or type of image be kept off-site?

  • Do we have enough tape media on hand to allow for two or more copies?

You must determine which systems need to have off-site copies of their data and if all or only part of that system needs to be sent off-site. Sending a backup tape of the operating system off-site is typically not necessary and not cost-effective. You will need to know if incremental backup images as well as full backups are required for off-site. This requirement can be different for each type or class of system. It also depends on the existence of a DR site and legal requirements.

How Many Copies?

The next question is how many copies of each backup image should be kept off-site. This usually depends on why the data is being stored off-site. If there are legal requirements, then these requirements usually stipulate how many copies. If it is being done for pure DR, one or two copies might be adequate. This decision is usually based on either specific legal and business requirements or on your overall DR strategy. In cases where an off-site storage facility is used, you might only need one copy. If there are multiple sites, it is common to send copies to a sister site that can act as a DR site. In these cases, you might want two off-site copies so one can be kept locally and the other at the remote site. What you really need to do is keep an open mind and take a look at all the possibilities and requirements.

How Long Should They Be Kept Off-Site?

If there are legal or business requirements, these must be met. A common legal requirement is that all data related to any financial trades must be retained for seven years. The other considerations are how many potentially different copies of an image do you want to store off-site and how many tape cartridges are you willing to maintain out of production? Again, it may be up to the business unit managers as to how long they want their data stored off-site. In most companies, the IT department views the business units as their customers; as such, there are costs involved in the services you are providing to your customers. Similarly, when the business unit manager requests that his or her data remain off-site for two years, you must present the costs associated with the management of this data. There will be the cartridge cost, pickup/delivery costs, and storage costs. Even if you are not charging this back to the managers, it is a good idea to document this for your management. Once you are able to present people with the facts, reality sets in and you are able to help them reasonably determine the proper length of time for off-site storage given their business requirements.

Differences between Business Units

Sometimes data must be segregated between business units. There are several ways to do this. The most secure way is to have multiple tape libraries and do the different business unit backups to different physical libraries. Sometimes this is not possible or practical. If this is the case, there are other ways to accomplish this. With a tool such as VERITAS Software's NetBackup, you can establish unique volume pools and assign media to a specific volume pool. You then assign specific backups to specific volume pools. This allows you to logically segregate the data within a single tape library. We discuss this in more detail later when we look at installing and configuring a backup product. One word of caution: Just because your software solution supports it doesn't mean you have to implement it. In other words, if you do not have a compelling reason to implement multiple volume pools, then by all means do not add a level of complexity to your environment simply because you can. Backup products like VERITAS NetBackup are quite scalable and easily modified should that be a requirement down the road.

Other differences between business units may potentially affect how you architect your backup strategy. Often these will result in unique backup and recovery requirements for some systems. With most backup products, you can set up backup policies for similar clients with similar requirements and separate the clients that have unique needs. You must always be aware of any of these unique requirements before putting your strategy together.

Backup Seems Simple …

Conceptually, a backup strategy is simple. A system administrator decides what data is critical for business operation, determines a backup schedule that has a minimal effect on operations, and uses a backup utility program to make the copies. The backups are stored in a safe place so they can be used to recover from a failure.

Though a backup strategy is quite simple in concept, the difficulty comes in the details. Architecting a backup and recovery strategy is more involved than most people realize. One of the most frustrating and discouraging tasks is determining where to start. What at first seems a simple task becomes daunting as you start digging deeper and realize how many elements of the backup strategy are interconnected. For example, as a system administrator of a large enterprise, chances are you would not want the burden of deciding what data is backed up when, and for how long it is kept. In fact, you may be presented with various analysis summaries of the business units or own the task of interviewing the business unit managers yourself in order to have them determine the data, the window in which backup may run, and the retention level of the data once it is stored on the backup media. This is often called a business impact analysis (BIA) and should yield some results that will be useful during the policy-making process. The results of these reports should also help define the recovery window, should this particular business unit suffer a disaster where data cannot be accessed or updated. Knowledge of these requirements may, in fact, change the entire budget structure for your backup environment, so it is imperative during the design and architecture phase that you have some understanding of what the business goals are with regard to recovery.

You will find that most business unit managers are not as concerned about backup as they are about recovery. As you can see from the level of complexity of our example, too often the resulting frustration leads to inactivity where nothing gets done, or at least not done in the most effective manner. The obvious intent of a backup and recovery system is to provide data protection. Since we are setting up a system to protect the data, the next step also seems obvious: Determine how much data is in the enterprise and where it resides. This is an important part of establishing the backup and recovery system, but it does not provide enough information to architect a strategy. In addition to knowing how much data you have and where it is, you must also have a good understanding of why the data is being backed up and what the recovery requirements are. This is necessary so you can make the appropriate decisions about the overall backup and recovery strategy. The more you understand the nature of the data and the level of protection required, the better decisions you can make in setting up the entire backup and recovery environment.

The Goals of Tape Backup

You always want to keep in mind that the overall goal of tape backup is to make copies of your data that can be used to recover from any kind of data loss. The primary goals of the tape backup portion of an overall data protection strategy are to do the following:

  • Understand the goals of the business in order to deliver a properly configured backup environment.

  • Enable information services to resume as quickly as is physically possible after any system component failure or application error.

  • Enable data to be relocated to where it's needed, when it's needed by the business.

  • Meet regulatory and business policy data retention requirements.

  • Meet recovery goals; in the event of a disaster, return the business to a predetermined operating level.

Each of these goals relates to a specific area of data protection and needs to be considered as we put together our overall backup strategy. Specifically, you should ask why data is being backed up. As you consider each system or group of systems, keep in mind whether the data is being backed up to protect against failure, disaster, or regulatory requirements, and if the goals of the business will be met in the event of a failure or disaster. In reality, your success as a backup administrator will not be measured by how fast you are able to back up your data but how swiftly you are able to meet the aforementioned goals. Stated simply, your success will be defined by the restorability of the data in the environment.

The Role of Tape Backup

For a personal computer user, backup typically means making a copy of the data on the computer's hard drive onto a tape or CD-ROM. Personal backup media are often labeled by hand and are 'managed' by storing them in a drawer or cabinet located in the room with the computer. In the enterprise, data protection is a little more complex. Enterprise backup must be able to do the following:

  • Make copies of your data, whether organized as files, databases, or the contents of logical volumes or disks.

  • Manage the backup media that contain these copies so that any backup copy of any data can be quickly and reliably located when required, and so that the media can be tracked accurately, regardless of the number.

  • Provide mechanisms to duplicate sets of backed up data so that while a copy remains on-site for quick restores, another copy can be taken off-site for archival or disaster protection purposes.

  • Track the location of all copies of all data accurately.

Why Is the Data Backed Up?

Why you are backing up data seems like a trivial question, but it really needs to be answered for all the data in the enterprise. Some of the most common answers to this question are as follows:

  • Business requirement

  • Hardware failure protection

  • Disaster recovery (DR)

  • Protection from application failure

  • Protection from user error

  • Specific service-level agreements with the users/customers (SLA)

  • Legal requirements

You need to understand what data on what systems falls into each category. By interviewing the data owners, you will be better equipped to categorize the data. In most cases, the administrators know what it takes to recover the operating system and, in some cases, the database engines and other applications. However, the onus must be placed upon the data owner (the customer) for the administrators to fully understand the impact to the business in the event of a data loss (the BIA). Addressing expectations up front will save much time, money, and potential embarrassment. Several years ago, one of us was given the task of architecting a backup solution that would allow for quick recovery. 'Quick' recovery is subjective, so the question asked was this: 'What is your expectation of a quick recovery?' Based on the response of 30 minutes, a proposal was drafted for the type of system that would need to be designed to meet this 30-minute recovery window. Soon after management reviewed the proposal, we agreed to a more realistic time frame. So you can see how this gives you an opportunity to show customers how much money their requirements will cost without you having to lose sleep in the process.

You will usually find some of the systems have fairly static data and would probably be backed up to protect against hardware failure or for DR. Other systems are very dynamic with a very active user base. Backup of this data should be considered for protection against application failure or user error. What is generally seen on systems is a mixture of these data types. The core operating system (OS) and base applications are usually static and can be rebuilt from release materials, while data used by the application can be very volatile. We will discuss each of these in more detail. Defining data types is vital, because understanding the data allows us to determine the recovery requirements. In most cases, the recovery requirements dictate the backup strategy.

Hardware Failure

Some of the data in an enterprise is backed up specifically to protect against hardware failure. You want to be sure you can recover an entire volume or database in case a disk or server fails. (The probability of doing any restore of less than an entire volume is very small.) The backup protection will be geared to this recovery requirement.

The best pure hardware failure protection is disk mirroring: making a complete second copy of the data on disk to another disk. However, this practice does not eliminate the need for backups. For the data that falls into this category, you might consider raw volume backups where all the data in a disk volume is backed up at disk read speed. A raw partition backup is a bit-by-bit backup of a partition of a disk drive on UNIX. On Windows NT/2000, this is called a disk-image backup. You do not read the data via the filesystem, so you avoid adding this process to the system overhead. A raw volume backup can give you much better backup performance; however, it has some restrictions. The primary restriction is that you back up the entire volume. For example, if a 50-GB volume is only 50 percent full, a filesystem backup would result in 25 GB being backed up. However, a raw volume backup would result in 50 GB being backed up, and, accordingly, more tape being used. Then, on the restore, the entire volume is restored regardless of how much data actually resides in it. You need to take this into account when determining whether to do raw backups.
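A trivial sketch of that trade-off, using the 50-GB half-full volume from the example:

    volume_gb = 50
    utilization = 0.50    # fraction of the volume holding live data

    filesystem_backup_gb = volume_gb * utilization   # 25 GB: only live files
    raw_backup_gb = volume_gb                        # 50 GB: every bit, used or not

    print(f"Filesystem backup moves {filesystem_backup_gb:.0f} GB")
    print(f"Raw volume backup moves {raw_backup_gb:.0f} GB")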

The backup strategy for this protection could be configured around the hardware layout of each system. If you know that a system will be backed up solely for hardware protection, you can lay out the system to optimize the backup and recovery performance. A lot of the data that could fall into this category is more static; it would be backed up less frequently and would usually involve full backups. This data can be entire systems within your enterprise or some of the static data that is found on more dynamic systems, such as the OS-related data or the actual applications that are loaded on a system.

Disaster Recovery

For systems that are a part of your DR strategy, you need to ensure you have all the data required to rebuild a system in an easily identified group. You must also ensure you have all the supporting data necessary to recover these systems. This can include the supporting OS data as well as everything the backup application requires in order to do full system restores. Using a vault-type solution in conjunction with the backup application, where backups are sent off-site and stored until needed, greatly helps this task.

The biggest challenge here is identifying which systems and applications are critical and determining how fast they have to be back online. A part of the DR strategy should include the priority of recovering these systems. The speed of recovery can dictate some of the backup decisions. It is very likely that systems that are a part of your DR strategy might also require protection within one of the other strategies. You would actually configure your backup and recovery system to provide the necessary DR protection in addition to any other requirements. Keep in mind that when you declare a disaster it may mean you no longer have access to your primary site. So any reports, documentation, call lists, operations guides, and so on that you may require should be in an off-site location along with your DR backup media. Many DR test plans fail because of one document or component that was overlooked.

Application Failure

The data that needs protection against application corruption usually requires more frequent backups. In these instances, the use of both incremental and full backups is very important. Database applications carry the highest risk of application data corruption, so you should develop a specific backup strategy for them. Most backup applications can interface with the database applications to allow both full and incremental backups that can be done either hot (with the database still active) or cold (with the database shut down). The systems that require this type of data protection might also be part of your DR strategy, so they would be part of multiple strategies.

User Error

For the data that is directly user-generated or -accessed, you might want to consider a backup strategy for user error protection. This might also include mailboxes, but in these instances, the backup and recovery strategy is dictated by the mail application. The very nature of providing user error protection implies that there are many more instances of single file or directory restores, so the backup strategy needs to support this. This strategy would generally involve more frequent incremental backups. The frequency of backups is an important consideration if it involves data that users are deleting and restoring on a regular basis. You would also want to ensure the backups are configured to facilitate faster browsing and recovery.

Service Level Agreements

You might find some of the data is being backed up to meet a specific service level agreement (SLA). The backup strategy will depend on the exact agreement. It is very possible that the SLA will actually be for a recovery requirement. If that is the case, the backup strategy will be governed by these requirements. This is often the situation where there is a dedicated backup and recovery administration staff that provides this service for a particular company or agency. The other groups or business units become the customers of the backup group and could have specific SLAs. These will usually dictate the backup strategy. This is also the case in hosting centers. It is very important to determine exactly what the requirements are. These can involve any of the backup types mentioned, with the additional requirement to have systems or applications back online within a specific time frame. It is common to have an agreement that any request for the recovery of any file or directory must be accomplished within a given time. All of this information is required to allow you to actually put together a backup strategy.

Legal Requirements

Your company may be required by law to keep certain data for a particular time period, without exception. The legal department may also be strict in requiring that certain data types not be kept beyond a particular time period. These factors will further shape how you architect the collective backup solution; for example, one server may be a member of multiple policies in order to meet the legal requirements for its data. It is good practice to include the legal department whenever possible when determining data retention requirements. This is essentially a component of a business impact analysis.

Complexity in Enterprise Backup

The functions of enterprise tape backup may seem straightforward. But implementing a truly functional backup environment that meets enterprise data protection requirements can be a complex undertaking. When you design or update a backup strategy, complexity can arise for several reasons:

  • Ability to back up all of the data. For the backup strategy to be useful, it must ensure that all data that can be lost is backed up. In an enterprise with large numbers of information servers, some of which may share data with others, identifying the sets of data objects to be backed up can be a significant effort.

  • Frequency. Backup frequency is essentially a trade-off between resources (network and I/O bandwidth, processor capacity, tape and library hardware, and application access) and the need for the most current data possible. Again, with many information services needing data protection, finding the right balance between backup frequency and resource consumption is a challenge.

  • Integration of all data managers. Enterprises with many information services are likely to use multiple data management systems (filesystems and database management systems), each with its own mechanisms for backing up data objects that it recognizes. Your task is integrating these mechanisms into a schedule that provides a consistent backup of all required data for a service and keeping them up-to-date as the service changes.

  • Continuous availability. Continuous application availability is increasingly required in today's enterprise. A variety of mechanisms enable consistent backups with minimal application downtime; choosing among them and implementing the choice can be a complex task.

  • Media management. Business or regulatory requirements can result in multiyear data retention requirements. Enterprises can find themselves responsible for maintaining backups and archives on tens or even hundreds of thousands of media (tape cartridges, optical disk platters, etc.). The procedures for managing large numbers of media can also be complex. (A rough sizing sketch appears at the end of this section.)

  • Management of multiple locations. Business considerations may require that servers and data be located in multiple locations. Maintaining a consistent set of backup procedures across multiple data centers can require extensive design or management talent.

The backup component of an enterprise data protection strategy has to accommodate all of these factors.
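
To get a feel for how quickly media counts grow under multiyear retention, here is a back-of-the-envelope sketch; every figure in it is a made-up example.

    # Rough estimate of cartridge counts under a multiyear retention
    # requirement. All inputs are hypothetical; substitute measured values.
    weekly_full_tb  = 10     # data written per weekly full backup, in TB
    retention_years = 7
    cartridge_tb    = 0.4    # native capacity per cartridge, in TB

    weeks_retained = retention_years * 52
    cartridges = weekly_full_tb / cartridge_tb * weeks_retained
    print(f"~{cartridges:,.0f} cartridges under retention")   # ~9,100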

Where Do We Start?

As you start planning your backup and recovery system, you need to gather detailed information on your enterprise. You need to know the network layout for all systems. If your enterprise is made up of multiple networks, you need to know the speed of each network or subnet and how much data resides on each. Obviously, it is much faster to move data across a 100-Mb/sec (100Base-T) network than a 10-Mb/sec (10Base-T) network. You need to understand the network layout and the corresponding data to help identify potential bottlenecks and take them into consideration as you architect your backup and recovery system. (This information is also necessary in determining where to put media servers and tape devices, but we will get to that in a later chapter.)
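
A quick sketch of that arithmetic, using hypothetical data sizes and an assumed 70 percent usable bandwidth:

    # Estimate how long it takes to move backup data across networks of
    # different speeds. All figures are illustrative only.
    def transfer_hours(data_gb, network_mbps, efficiency=0.7):
        """Hours to move data_gb over a link of network_mbps, assuming
        only `efficiency` of the raw bandwidth is usable in practice."""
        usable_mb_per_sec = network_mbps / 8 * efficiency
        return data_gb * 1024 / usable_mb_per_sec / 3600

    for speed in (10, 100):   # 10Base-T vs. 100Base-T
        print(f"{speed:>3} Mb/sec: "
              f"{transfer_hours(100, speed):.1f} hours for 100 GB")
    # prints roughly 32.5 hours at 10 Mb/sec and 3.3 hours at 100 Mb/sec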

As you look at the network that makes up your enterprise, you need to understand its speed and topology. You also need to understand the disk layout, especially for the larger file servers and database servers, or identify who has this knowledge. You should watch for bottlenecks involving the disks as well as the networks, SCSI connections, and any other relevant I/O paths. When making the decisions involved in architecting a backup strategy, the two things you must always keep in mind are the effect on normal production and the effect on restore speed and performance. This usually involves making the necessary cost trade-offs to achieve the best of both worlds.

Here are some of the steps necessary for you to gather the information needed before establishing the backup strategy:

  1. Identify all the systems, noting the order in which they would need to be recovered following a disaster.

  2. Identify all networks involved, including the speed of each network and its existing load at various times throughout the day and night.

  3. Locate all existing backup-related hardware, such as tape drives and libraries.

  4. Identify recovery requirements.

  5. Identify data and application availability requirements during backup.

  6. Determine the best way to move the data.

We discuss each of these points in a little more detail in the sections that follow.

Identify All Systems

You need to identify all systems that need to be backed up. Generally this will be most, if not all, of the systems in the enterprise, with the exception of user workstations. Some systems may be essentially replicas that can easily be re-created; in general, it is only necessary to back up one of these. The following information should be gathered for all the systems (one way to record this inventory is sketched after the list):

  • Amount of data

  • Speed of system

  • Number and type of networks

  • Type of data: database or filesystem?

  • Priority of recovery in DR

  • Tape drive or library installed?
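
One way to record this inventory, sketched in Python; the field names and sample systems are illustrative only.

    # A minimal record for each system to be backed up; extend as needed.
    from dataclasses import dataclass

    @dataclass
    class SystemInventory:
        name: str
        data_gb: float           # amount of data
        cpu_class: str           # rough speed/class of the system
        networks: list           # number and type of networks
        data_type: str           # "database" or "filesystem"
        dr_priority: int         # recovery order in a disaster (1 = first)
        has_tape_hardware: bool  # tape drive or library installed?

    systems = [
        SystemInventory("erp-db01", 350, "4-way server", ["100Base-T"],
                        "database", 1, True),
        SystemInventory("file-srv02", 80, "2-way server", ["10Base-T"],
                        "filesystem", 3, False),
    ]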

Identify All Networks Involved

The network layout is an important part of the information required, and identifying it can be critical to establishing the backup and recovery strategy. This step addresses potential performance bottlenecks, since slow networks are often among the primary culprits. If a significant amount of data resides on a slow network, a media server may need to be located on that network. Any system with a large amount of backup data, say more than 100 GB, should be considered as a media server and given a direct connection to a tape drive or drives. Following is the information needed for the networks (a short sketch applying the 100-GB rule of thumb appears after the list):

  • Speed of network

  • Amount of data residing on the systems

  • Location of any backup hardware

  • Current and proposed production traffic
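
The 100-GB rule of thumb from above might be applied like this; the threshold comes from the text, while the system names and sizes are made up.

    # Flag systems that hold enough backup data to justify a direct tape
    # connection, i.e., candidates to become media servers.
    MEDIA_SERVER_THRESHOLD_GB = 100

    systems = [("erp-db01", 350), ("file-srv02", 80), ("mail-srv01", 120)]

    candidates = [name for name, data_gb in systems
                  if data_gb > MEDIA_SERVER_THRESHOLD_GB]
    print("Media server candidates:", candidates)
    # ['erp-db01', 'mail-srv01']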

Locate Backup Hardware

Identifying all the systems and mapping the network topology should provide an idea of the total backup requirements. Part of this information is the location of the potential backup devices. The next step is to make sure the hardware is correctly located within the enterprise. Any enterprise backup and recovery strategy should be based on an application that supports library and drive sharing to ensure the tape drives and libraries are connected throughout the enterprise in such a way as to minimize bottlenecks, as well as to gain the most use from these very expensive tape drive resources. In a pure local area network (LAN) environment, it might be advisable to physically locate the tape library or libraries close enough to the systems that have the largest amounts of data so they can be directly connected to the tape drives and therefore perform backups and restores without data being moved across the network. These systems become media servers and control access to their drives. To handle data from other LAN-based systems, you either need to add more drives and give these systems access to their own drives or use the media servers to handle the backups for the systems that do not have their own drives. Also, the systems must be physically located close enough to the tape devices to be directly connected via SCSI cables.

Locate Backup Hardware: SAN Alternative

If a storage area network (SAN) is available, it can allow for more flexibility in the backup and recovery strategy. The backup hardware can be better shared among the systems holding large amounts of data while still keeping backup traffic off the production LAN. A SAN also allows large systems to be backed up directly to tape without turning the application servers into general-purpose media servers that must also back up other LAN-based clients.

Identify Recovery Requirements

As you identify all the systems in the enterprise, you should note the specific recovery requirements of each system. This is very helpful in setting up the backup strategy. If an order-processing application can tolerate an eight-hour outage without severe business consequences, for example, an incremental backup strategy that minimizes backup time at the expense of restore time may be appropriate. For a Web retail application, on the other hand, where every minute of downtime means permanently lost sales, a strategy that replicates data in real time might be more appropriate, even with its greater impact on application performance. The other item to note is the order in which systems need to be recovered as part of your overall disaster recovery (DR) plan.
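
A sketch of the restore-time arithmetic behind that trade-off, with made-up throughput and change-rate figures:

    # Compare worst-case restore time for weekly fulls plus daily
    # incrementals versus daily fulls. All figures are hypothetical.
    restore_rate_gb_per_hr = 20.0
    full_gb        = 100.0
    incr_gb        = 20.0   # assumed daily change
    incrs_to_apply = 6      # worst case: failure just before the next full

    incr_hours = (full_gb + incrs_to_apply * incr_gb) / restore_rate_gb_per_hr
    full_hours = full_gb / restore_rate_gb_per_hr

    print(f"weekly full + incrementals: {incr_hours:.1f} hours")  # 11.0
    print(f"daily fulls:                {full_hours:.1f} hours")  # 5.0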

Identify Data and Application Availability Requirements during Backup

As you assess the backup requirements of each system, you should also make sure you know which of the database applications must be kept up (remain 'hot') during the backup and which can be shut down to be backed up 'cold.' There are performance trade-offs involved with backing up a database while it is online, but sometimes this is necessary. The cost comes from the increased I/O load: database activity continues while the backup adds its own I/O. There are other methods of handling database backups, either hot or cold, using frozen image technologies and possibly off-host backup methods; these are discussed later in the book.
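
In outline, a cold backup shuts the database down first, while a hot backup coordinates with the running database. The stub functions below are placeholders, not a real database or backup API.

    # Sketch contrasting cold vs. hot database backup orchestration.
    # Every function here is a hypothetical stand-in for real commands.
    def shutdown_db():        print("database shut down")
    def start_db():           print("database started")
    def enter_backup_mode():  print("database in hot-backup mode")
    def exit_backup_mode():   print("hot-backup mode ended")
    def copy_data_files():    print("data files copied to tape")

    def cold_backup():
        shutdown_db()         # no activity: a consistent on-disk image
        copy_data_files()
        start_db()

    def hot_backup():
        enter_backup_mode()   # database stays up; backup I/O competes
        copy_data_files()     # with ongoing database I/O
        exit_backup_mode()

    hot_backup()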

Determine the Best Way to Move the Data

You have several options for moving the data from disk to tape. Each has its own advantages and disadvantages. The methods include the following:

  • Files. This involves using the operating system to read all the appropriate files within the backup set and move that data from disk to tape. This method has more operating system overhead but allows for single files to be backed up and restored. It also enables the application to check each file to determine access or modification time so incremental backups can be performed.

  • Volumes. An entire volume can be backed up without reading the filesystem structure but by doing a bit-by-bit copy of the data from disk to tape. This is called a raw backup. This method allows for much faster data transfers but in general does not allow for single-file backups and restores. It also does not allow incremental backups. This backup method results in an entire volume being backed up, even the portions that do not contain valid data.

  • Block level. If the filesystem maintains enough information about the files, it is possible to determine which blocks have changed. If the backup application can interface with the filesystem, you can back up just the changed blocks. This type of backup is called a block-level incremental. (A toy sketch of the idea follows this list.)

  • Mapped raw backup. Some backup applications, such as VERITAS Software's NetBackup, can map a raw volume and then perform a raw volume backup while retaining the filesystem map so single files can be restored. This also allows for incremental backups. This type of backup is discussed in more detail in the section on frozen image backups in Chapter 7, 'Evaluating Other Backup-Related Features and Options.'

  • Off-host backups. This is a mechanism where data is moved from disk to tape without the application host being directly involved in the disk reads or tape writes.
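
Returning to the block-level item above, here is a toy illustration of the idea: checksum fixed-size blocks and save only those that changed since the last run. Real products obtain change maps from the filesystem or volume manager rather than re-reading everything, so treat this as conceptual only.

    # Toy block-level incremental: hash fixed-size blocks and report the
    # ones whose checksums changed since the previous run.
    import hashlib

    BLOCK_SIZE = 64 * 1024   # 64-KB blocks

    def block_checksums(path):
        sums = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                sums.append(hashlib.sha256(block).hexdigest())
        return sums

    def changed_blocks(old_sums, new_sums):
        """Indexes of blocks that differ (or are new) since last run."""
        return [i for i, s in enumerate(new_sums)
                if i >= len(old_sums) or old_sums[i] != s]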