BACKUP - WHO NEEDS IT?

Your hard drive has crashed, but there’s an inner contentment. Not only did you back up the whole system last night, but you keep a spare new hard disk drive at hand for just such an eventuality. No problem: just put in the new disk, partition and format it, add the operating system, install the backup programme (or on Unix and Linux use tar or cpio), and watch the tape drive recover the whole lot whilst you have some coffee. You’ll have it all restored before most people get in for work.

You start the restore, but why is the tape drive thrashing around so much? Your heart sinks as you realise the restore is not restoring. An error message appears on the screen. It doesn’t matter what the message says; you know you are now in deep trouble. The restore has failed and you cannot get at your information! You eject the tape and put in the previous backup tape. OK, it’s last week’s, but that’s better than nothing. Misery sets in as the same thing happens again. The phone rings. Your boss and his boss want to know why their computers are not online. You start to explain, and they make it quite clear that if you cannot get their computers online in an hour, you won’t have a job at the end of that hour.

Why? What could have brought you to this miserable episode? Perhaps if you had used BRU backup software you might not be in this position. There are a number of possible causes, and it’s worth examining them to make sure you don’t get caught like this. Analysts have determined that 20-30% of backups fail, and the user doesn’t even know it. It doesn’t matter which technology you are using to back up (tape, disk, optical, whatever); there are some golden rules you need to follow. Generally the backup hardware will be dictated by the amount of data you want to back up, but don’t be fooled by what at first sight appears to be obvious.

Case 1
You’re safe. Your drive has Read After Write (RAW). Just about all tape drives have Read After Write capability. This means that there is a read head positioned just after the write head, and the tape drive verifies that what it reads is exactly what it has written.

[Figure: backup layout]

If there is a dropout on the tape, Read After Write will detect it and the backup application will try to write the data again. If there are still problems, it will move down the tape and write again on a good piece of the tape, so no problem there! Or is there? Some backup applications rely on the Read After Write function within the tape drive to serve as the backup verification mechanism, but there are a lot of hurdles in the way of the data trying to get to the tape head. If the data is going across a network, the problems are magnified, as corruption can occur anywhere in the network hardware or software before the data gets close to the backup device. Read After Write won’t help you if you present corrupt data to the tape drive.

Consider data coming off a disk and going to a SCSI tape drive on the same computer. It travels from the disk onto the bus to memory, and then back from memory to the bus to the SCSI host bus adapter (HBA), where the software driver has to be correctly matched to the operating system. Then it goes through the adapter hardware to the SCSI cable to the tape drive, where the tape drive’s firmware needs to match the adapter card driver. Finally it passes through the tape drive hardware. Within a tape drive, the data presented to the tape is often manipulated in structured ways to ensure that it gets the best distribution of flux transitions onto the tape. This makes sure your data is in the most robust format there can be.

It is not unknown for something to go wrong between the data connector and the write head. In such cases Read After Write will report that all is well, because the read head reads the data that the write head wrote, but this data is not the data you wanted to write! It is corrupt, and if something has gone badly wrong it could be random data. The result? Garbage on the restore.

Golden Rule: Don’t rely on RAW, or on backup software (including Unix/Linux embedded apps) that relies on Read After Write. ALWAYS run a verify pass on your backup. If the backup application does not support verify, ditch it for one that does, and do it now before it is too late.
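A verify pass of the kind recommended above can be sketched in Python. This is a minimal illustration under stated assumptions, not how any particular backup product works: it assumes an uncompressed tar archive and compares a SHA-256 hash of each archived file with the copy still on disk. The function names (`sha256_of`, `verify_archive`, `demo`) and the file names are invented for the example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a file-like object in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_archive(archive_path, source_root):
    """Return names of archived files that are missing or differ on disk."""
    mismatches = []
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            source_file = Path(source_root) / member.name
            if not source_file.is_file():
                mismatches.append(member.name)
                continue
            with source_file.open("rb") as on_disk:
                archived_hash = sha256_of(archive.extractfile(member))
                if archived_hash != sha256_of(on_disk):
                    mismatches.append(member.name)
    return mismatches


def demo():
    """Back up two files, verify, then change one file and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "data"
        source.mkdir()
        (source / "report.txt").write_text("quarterly figures\n")
        (source / "notes.txt").write_text("meeting notes\n")

        archive_path = Path(tmp) / "backup.tar"
        with tarfile.open(archive_path, "w") as archive:
            for path in sorted(source.iterdir()):
                archive.add(path, arcname=path.name)

        clean = verify_archive(archive_path, source)
        # Simulate the archive and the disk disagreeing after the backup.
        (source / "notes.txt").write_text("meeting notes, edited\n")
        changed = verify_archive(archive_path, source)
        return clean, changed
```

Calling `demo()` returns an empty list for the first verify and the name of the changed file for the second. A compare-against-disk verify like this only works if nothing changes on disk between the backup and the verify.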

BRU checks that the data read from the tape matches what was backed up. BRU applies 32-bit checksum technology to both the metadata and the file data coming off the filesystem, and places the checksum into the header of each buffer block. These buffer blocks are then assembled into blocks written to tape. The checksum value in each buffer block is the critical factor used during the verification operation to validate the accuracy of the backup.
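The general idea of storing a checksum in each block header can be sketched as follows. This is an illustration only and does not reproduce BRU’s actual block format: the block size, the header layout (payload length plus CRC-32) and the function names are all assumptions made for the example.

```python
import struct
import zlib

BLOCK_PAYLOAD = 2048            # hypothetical payload size per buffer block
HEADER = struct.Struct(">II")   # hypothetical header: payload length, CRC-32


def write_blocks(data):
    """Split data into blocks, each prefixed with a length + CRC-32 header."""
    blocks = []
    for offset in range(0, len(data), BLOCK_PAYLOAD):
        payload = data[offset:offset + BLOCK_PAYLOAD]
        header = HEADER.pack(len(payload), zlib.crc32(payload))
        blocks.append(header + payload)
    return blocks


def verify_blocks(blocks):
    """Return indexes of blocks whose payload no longer matches its checksum."""
    bad = []
    for index, block in enumerate(blocks):
        length, expected = HEADER.unpack_from(block)
        payload = block[HEADER.size:HEADER.size + length]
        if len(payload) != length or zlib.crc32(payload) != expected:
            bad.append(index)
    return bad


def demo():
    """Write three blocks, damage one byte in the second, and verify."""
    blocks = write_blocks(bytes(range(256)) * 20)   # 5120 bytes of data
    intact = verify_blocks(blocks)
    damaged = bytearray(blocks[1])
    damaged[HEADER.size + 10] ^= 0xFF               # corrupt one payload byte
    blocks[1] = bytes(damaged)
    return len(blocks), intact, verify_blocks(blocks)
```

Because the checksum is computed from the data as it comes off the filesystem, a verify pass can check the archive on its own, without rereading the original files.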

Case 2
You go to restore a data file and find it is not on the backup set. Why? Because the backup application had a complicated user interface and you misunderstood its include/exclude feature, or you mistyped the command for the latest free command-line backup programme by one letter. The result: you backed up only system files when in fact you wanted to back up only data files. This can easily happen when a new job is created, because once a backup job has been created it runs in the background each time and the administrator forgets all about it. When a new job is required, the administrator has to “relearn” the application because it is used so rarely, sometimes only once every two or three years. Often, by the time a new backup job is needed, the administrator has changed, so the person creating the new job has to start from scratch with a package they have never seen before, and no one else is around to act as a mentor.

Golden Rule: ALWAYS try a restore from a backup whenever a new backup job has been configured (restoring to a test directory is useful) to make sure the files you want are actually there. You should do this even if you have run a verify pass, as a verify pass will only confirm that the files you selected to be backed up are actually there. If you selected the wrong files, verify alone will not help you. It helps if your backup package is easy to use and doesn’t have too many bells and whistles to learn. Don’t choose a backup package that does everything unless you really need the extras.
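A test restore that checks for the files you actually care about might look like the sketch below. It assumes a plain tar archive that you created yourself (so member names are trusted), and the function name `missing_after_restore`, the directory layout and the file names are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def missing_after_restore(archive_path, expected_files, restore_dir):
    """Restore into a test directory; return expected files that are absent."""
    restore_dir = Path(restore_dir)
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Member names are trusted here because we made the archive.
            target = restore_dir / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.extractfile(member).read())
    return [name for name in expected_files
            if not (restore_dir / name).is_file()]


def demo():
    """Create a job that selects the wrong directory, then test the restore."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "source"
        (root / "system").mkdir(parents=True)
        (root / "data").mkdir()
        (root / "system" / "hosts").write_text("127.0.0.1 localhost\n")
        (root / "data" / "customers.csv").write_text("id,name\n1,Ann\n")

        archive_path = Path(tmp) / "backup.tar"
        with tarfile.open(archive_path, "w") as archive:
            # The mistake: only the system directory was selected.
            archive.add(root / "system", arcname="system")

        return missing_after_restore(
            archive_path,
            ["system/hosts", "data/customers.csv"],
            Path(tmp) / "restore_test",
        )
```

Here the restore itself succeeds, and a verify pass would also succeed, yet the check reports that the data file was never in the backup.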

Case 3
You have backed up, run a verify pass and a test restore, but three months later the restore fails with an error message that usually says the restore will be aborted. Typically tar or cpio will generate “tape I/O read error” and abort the restore. Why? The backup application met a bad spot on the tape and quite rightly reported an error because it couldn’t read the data properly. Now you have the first few files from the backup, but the bulk of it is still on the tape. This is typical of a backup application that is just a user interface built on top of tar or cpio.

Another problem can arise when the backup application multi-streams data from different client systems to the same tape. In this technique, data from one client group is interleaved with data from other client groups on the same tape. This means that any particular client group’s data will be divided up and spread amongst the data of the other client groups on the backup media. If the backup is large, it may be spread over a number of tapes. The danger in this approach is that if one block of data cannot be read during a recovery, data from multiple clients will be lost. Restores are also complex, requiring the management of multiple tape sets just to recover a single client system.

Golden Rule: Choose a backup application that has built-in error handling. Surprisingly, very few backup applications can satisfactorily accommodate errors during a restore. Check with the software company to understand what they do to ensure the availability of the data. An application’s bells and whistles are no good if the underlying technology cannot deliver the data. Your data is important, so take meticulous care to check a backup application’s capabilities and fully understand the level of protection it affords. BRU does not use multi-archive interleaving like this, but it can manage a number of backups to different tape drives at the same time.
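The difference between a restore that aborts at the first bad spot and one that carries on can be shown with a toy model. This sketch makes strong simplifying assumptions: each file is one in-memory record with its own CRC-32, so the restore always knows where the next record starts. Real tape formats need a way to resynchronise after a damaged block. All names here are invented for the example.

```python
import zlib


def make_record(name, data):
    """Build a hypothetical archive record carrying a checksum of its data."""
    return {"name": name, "crc": zlib.crc32(data), "data": data}


def restore_abort_on_error(records, restored):
    """Stop at the first damaged record, as a naive restore would."""
    for record in records:
        if zlib.crc32(record["data"]) != record["crc"]:
            raise OSError(f"read error in {record['name']}; restore aborted")
        restored[record["name"]] = record["data"]


def restore_skip_bad(records):
    """Recover every intact record and report the damaged ones."""
    restored, damaged = {}, []
    for record in records:
        if zlib.crc32(record["data"]) == record["crc"]:
            restored[record["name"]] = record["data"]
        else:
            damaged.append(record["name"])
    return restored, damaged


def demo():
    """Damage the second of five records and compare the two restores."""
    records = [make_record(f"file{i}.txt", f"contents {i}".encode())
               for i in range(5)]
    data = records[1]["data"]
    records[1]["data"] = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit

    partial = {}
    try:
        restore_abort_on_error(records, partial)
    except OSError:
        pass  # the naive restore gave up; only earlier files were recovered

    restored, damaged = restore_skip_bad(records)
    return len(partial), len(restored), damaged
```

In this model the aborting restore recovers one file out of five, while the tolerant restore recovers four and names the one that was lost.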

Case 4
You backed up with a verify pass and the restore runs perfectly, but then the complaints start rolling in. The data has errors; some files have characters missing. Why? When a backup application is based on the cpio format, the checksums used to verify the data’s accuracy are calculated only on the metadata (data “about” the data block), not on the actual data. A cpio verify pass therefore cannot confirm that the actual data is correct, only that the header information is correct.

Some backup applications verify the backup by conducting test restores on random backup sets. The same issue applies here. If the backup data hasn’t been 100% verified, users can still experience aborted restores because corrupt data can still be encountered. If the first bit of the restore is bad, the entire backup will be lost even though segments of the backup set proved to be accurate.

It should be noted that a tar archive can be fully verified using a bit-by-bit check against the disk. This doubles the time it takes to do a complete backup. Nothing must change on the disk between the backup and the verify, otherwise errors will be generated and each error will have to be investigated. This approach is impractical because of today’s shrinking backup windows.

Some applications note the problems with a backup and record them in the fault log. As described above, BRU adds checksums on the disk data and uses these to verify the archive. The differences between applications built on tar and cpio are described in a white paper (PDF) from the TOLIS Group.
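The weakness of a header-only checksum can be demonstrated with Python’s standard `tarfile` module. Python has no cpio module, so tar’s ustar format stands in here as an assumption; its header checksum likewise covers only the 512-byte header, not the file data. The helper names and the sample file are invented for the example.

```python
import io
import tarfile


def build_archive(name, data):
    """Create a one-file ustar archive in memory and return its raw bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as archive:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return bytearray(buffer.getvalue())


def read_member(raw, name):
    """Read one file back out of the raw archive bytes."""
    with tarfile.open(fileobj=io.BytesIO(bytes(raw)), mode="r:") as archive:
        return archive.extractfile(name).read()


def demo():
    """Flip one bit in the file data and restore without any error."""
    original = b"invoice total: 1000.00\n"
    raw = build_archive("invoice.txt", original)
    # Bytes 0-511 are the header; the file data starts at offset 512.
    raw[512 + 15] ^= 0x01
    restored = read_member(raw, "invoice.txt")
    return original, restored
```

The archive opens and the file is read back with no error raised, but the restored contents differ from the original by one character. Only a checksum over the data itself, or a comparison against the source, would catch this.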

It is very easy to forget to check the fault log, particularly if it is some way down the directory chain, so you will not know if your backup has failed.

Golden Rule: Make sure your backup application incorporates the checks and balances to assure that the data you believe you backed up actually made it to the backup media accurately and can be successfully and accurately recovered. Without this assurance, all other application functionality is window dressing. Make sure your backup application has some sort of notification, usually by email, that alerts you when there is a problem with the backup. Of course, BRU provides this.
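If an application only writes problems to a log, a small script can turn them into a notification. The sketch below assumes a plain-text log and a set of failure keywords, both hypothetical; real log formats vary by product. It builds the email message but does not send it.

```python
from email.message import EmailMessage

# Hypothetical keywords; adjust to whatever your backup log actually writes.
FAILURE_MARKERS = ("error", "failed", "checksum mismatch")


def find_problems(log_lines):
    """Return the log lines that contain any failure keyword."""
    return [line.strip() for line in log_lines
            if any(marker in line.lower() for marker in FAILURE_MARKERS)]


def build_alert(problems, sender, recipient):
    """Build an alert email, or return None when the log is clean."""
    if not problems:
        return None
    message = EmailMessage()
    message["Subject"] = f"Backup problems: {len(problems)} issue(s) found"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(problems))
    return message


def demo():
    """Scan a sample log and build the alert for it."""
    log_lines = [
        "02:00:01 backup started\n",
        "02:14:37 ERROR: tape I/O read error on block 4412\n",
        "02:14:38 backup finished\n",
    ]
    problems = find_problems(log_lines)
    return problems, build_alert(problems, "backup@example.com",
                                 "admin@example.com")
```

To deliver the message, it could be passed to `smtplib.SMTP(...).send_message(message)` using whatever mail server is available; the addresses above are placeholders.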

Finally: “I don’t need backup – I’ve got RAID.”
RAID is fault tolerant; it is not fault free. The Internet is full of tales of RAID arrays that failed. Remember also that users deleting their own data is one of the most common causes of lost data, and in that case RAID will not help you. If the building catches fire, the RAID array may be lost, but a tape backup made with a reliable backup package and stored offsite will save the day. Backing up data is a simple concept: just move data to a safe place and bring it back when it’s needed. In reality, how this work gets done is very complex. The process should not be an “art form” but good science and engineering. The availability of your data, and your sanity, depend on it.