on my previous blog i've discussed some of the viable approaches to data protection with virtual machines, before i dwell into the pros and cons of each approach i'd like to discuss the fundamental differences between file level and block level backup (and solicit your input :-) ).
Encapsulation is one of the basic rules for software design, simply put, it's the computer geek's equivalent of the famous "Don't ask, don't tell" policy. The idea is pretty simple, let's assume our File System is component A and our Disk System is component B. Component A and B publish a public interface that others can use, but they hide their internal mechanisms from the other components. This enables us to do some nifty tricks, such as RAID, as far as the file system is concerned it is working with a "regular disk", it is unaware of the fact that our disk system had actually taken the 100GB disk space that we defined and partitioned it into multiple strips that are actually located across 5 different disks in order to provide it (the FS) with better performance and hardware fault tolerant. There are other ways where this principle is used but you have to agree that it comes in pretty handy.
But, why do i even mention "encapsulation", and how is that relevant to File VS Block level backups?
The point i am trying to make is that the Disk level is not aware of the "file contents"and the File System is not aware of the "disk layout", this actually dictates the pros and cons of those two very different approaches to data protection.
With file level backups it's really easy to define which files you want to protect, than when the time comes, someone has to access the files and move the data they contain to some sort of data repository, in order to do that you must deal with issues such as:
- Open files
- Interdependencies between multiple files
- Identifying which (sub)files have changed
- For structured data (databases etc.), do we backup the entire file (or file group) or only the portions that have changed?
Block level backups are usually pretty straight forward, there's a mechanism that keeps track of the changed in "realtime" (this usually enables CDP, but that's a whole different story) and when the time comes the data will be moved to the data repository, but this technology has its own challenges
- Minimum granularity is usually a volume
- Hard to exclude unused file data (page file?)
- Recovering files from a block level backup
- Communicating with applications (and File System) to ensure backup consistency.
Generally speaking block level backups have a "lower overhead" than file level backup, so, if you decided to virtualize your environment and keep using agents on the individual virtual machines, you would probably want to use a block level backup solution. File Level backups are still viable (especially if they skip the "indexing" process by using an FS filter or journaling and allow for "sub file" incremental backup), but you will need to be more careful when planning your backup windows in order to prevent VM sprawl.
Stay tuned, next we'll discuss other approaches such as proxy backups