Compression Isn't Always What You Might Think
Contents
About this document
    Related information
What is compression?
What happens if you have already compressed a file?
How does the file become bigger?
Where is compression performed?
What about "best case" compression ratios?
So what can be done?
What else should I know about compression?
About this document
The purpose of this document is to clarify the use of 
compression in backing up data, and to answer questions such as the following:
   Why am I not getting three-to-one compression?
   My tapes hold 120 gigs, but I'm only getting 38 gigs. Why?
The information in this document applies to all versions of AIX.
Related information
AIX and related product documentation is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html
See "Understanding Data Compression" in System Management Concepts: 
Operating System and Devices
What is compression?
Compression finds repeating patterns in data, keeps the data only one time, and
leaves a tag in place of each repeat. This means that:
   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
could become something like:
   /2A
Here / is the escape character (in this non-representative example), 2 is the
count, since the character 2 has a value of 50 in US ASCII, and A is the
character to repeat 50 times. There are obviously more complicated schemes than
this, but in the end they all fall into the category of reducing repetition in
a file.
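The toy scheme above can be sketched in a few lines of Python. This is only an
illustration of the example in this document, not how any real compression
utility works; the function names, the rule of encoding only runs of four or
more characters, and the cap of 255 on a run are assumptions made for the
sketch.

```python
def rle_encode(text: str, escape: str = "/") -> str:
    """Toy run-length encoder: a run becomes escape + chr(count) + character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        # Count how far the current character repeats (assumed cap of 255).
        while i + run < len(text) and text[i + run] == ch and run < 255:
            run += 1
        if run >= 4 or ch == escape:
            # Long runs, and any literal escape character, are written as a tag.
            out.append(escape + chr(run) + ch)
        else:
            # Short runs are cheaper to leave as they are.
            out.append(ch * run)
        i += run
    return "".join(out)


def rle_decode(data: str, escape: str = "/") -> str:
    """Reverse rle_encode: expand each escape + count + character tag."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == escape:
            count, ch = ord(data[i + 1]), data[i + 2]
            out.append(ch * count)
            i += 3
        else:
            out.append(data[i])
            i += 1
    return "".join(out)


print(rle_encode("A" * 50))  # prints /2A, because chr(50) is "2"
```

Note that text with no long runs comes out of this encoder no smaller than it
went in, and every literal / grows from one character to three.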
What happens if you have already compressed a file?
Compressed data generally does NOT recompress.  Binary programs are
semi-compressed in that the source code (text) is converted into machine code
(binary), which is more compact.  Graphics files tend to be compressed
already, because raw image data is usually highly repetitive and the common
graphics formats take advantage of that.
Files produced by compression utilities are, of course, already compressed.
These include .ZIP by Phil Katz, the old UNIX compress .Z, GNU compressed .gz
files, the newer open compression .bz, and numerous other available formats.
Compressed data is already mostly non-repeating.  If you try to recompress it, 
one of three things will happen:
 - A lot of CPU time is spent making the data only half a percent smaller.
 - The file ends up being the same size.
 - The file ends up being larger.
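You can try this for yourself. The following Python sketch compresses some
made-up, text-like data with the standard zlib module and then compresses the
result a second time. The word list, the seed, and the data size are arbitrary
assumptions, and the exact byte counts will vary with the zlib version, so no
particular output is promised here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical sample data: a long stream of words drawn from a small list.
words = ["backup", "tape", "drive", "file", "data", "server", "client", "volume"]
original = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

once = zlib.compress(original, 9)   # first pass removes the repetition
twice = zlib.compress(once, 9)      # second pass has almost nothing left to find

print("original bytes:        ", len(original))
print("compressed once bytes: ", len(once))
print("compressed twice bytes:", len(twice))
```

The first pass should shrink the data considerably, while the second pass
should leave it at about the same size as the first.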
 
How does the file become bigger?
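Compressed files require the inclusion of a decompression table, similar to a
key, which is used during the extraction of data from the files. When the data
has little or no repetition left to remove, the table adds more bytes than the
compression saves, and the file ends up larger than it started.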
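The overhead is easy to observe. In the sketch below, Python's zlib module
stands in for a compression utility; its overhead is a stream header, block
framing, and a checksum rather than a literal table, but the effect on data
that cannot be compressed is the same. The pseudo-random bytes are an assumed
stand-in for such data.

```python
import random
import zlib

# Even empty input produces some output: this is pure overhead.
empty_packed = zlib.compress(b"", 9)
print("empty input becomes", len(empty_packed), "bytes")

# Pseudo-random bytes have no repetition for the compressor to remove.
rng = random.Random(1)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(100_000))
noise_packed = zlib.compress(noise, 9)

print("random input bytes: ", len(noise))
print("random output bytes:", len(noise_packed))
```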
Where is compression performed?
There are several ways compression takes place:
 - Sometimes it is part of the file format for your application.  In this case,
the application uses CPU time to make the file smaller, but the file does not
always end up smaller.
 - You might use a third-party application, such as RAR, WinZip, ARJ, GNU
Zip, your backup software, or pencil and paper to make calculations.
 - Your operating system may have a feature to compress an entire disk,
partition, filesystem, or individual files in a manner that is transparent to
your applications.
 - You may use hardware, such as the CPU inside your tape drive, or an
add-in card for your disk drives.
 
What about "best case" compression ratios?
Sales and marketing literature from most major companies and many other
organizations tends to quote compression ratios that are either best case,
such as a file consisting of 37 gigabytes of the same character, or values
that are not attainable in a real-world device.
While the technology, in best cases, can actually achieve huge
compression ratios, these are virtually impossible to achieve unless you are
backing up empty databases.
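To see how far a best case can be from everyday data, the following Python
sketch measures the ratio zlib achieves on one character repeated a million
times. The helper name compression_ratio and the choice of zlib are
assumptions for illustration; a tape drive uses its own algorithm and will
give different numbers.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


best_case = b"A" * 1_000_000  # the same character, over and over
print(f"best case: about {compression_ratio(best_case):.0f} to 1")
```

Real backups, which mix text, binaries, and files that are already
compressed, fall far below a figure like this.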
So what can be done?
Be aware of the native (raw) capacity of your storage media, and of the
characteristics of the data you are storing.  If you are backing up users'
workstations, don't expect to get 120 gigs on your 40-gig tape.
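The arithmetic behind those numbers is simple. The sketch below assumes a tape
with a native capacity of 40 gigs that is advertised at three-to-one, which
matches the 120-gig figure used in this document; the function names are
made up for the example.

```python
def effective_capacity(native_gb: float, ratio: float) -> float:
    """Data that fits on the media at a given compression ratio."""
    return native_gb * ratio


def achieved_ratio(native_gb: float, stored_gb: float) -> float:
    """Ratio you actually got, from the amount of data that fit."""
    return stored_gb / native_gb


print(effective_capacity(40, 3.0))  # 120.0, the advertised figure
print(achieved_ratio(40, 38))       # 0.95, the data grew slightly
```

A ratio just under one, as in the 38-gig case, suggests that the data was
already compressed and picked up a little overhead on the way to tape.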
What else should I know about compression?
Strong encryption randomizes data so thoroughly that it can no longer be
compressed.
Performance, a complex topic, is affected by compression. If your network is
heavily loaded, slow, or poorly implemented, you will want to compress data
before you send large amounts of it over the network. If your network is good,
or if your client systems' CPUs are slow, then don't use client-side
compression. Instead, let your tape drive do the work.
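A very rough way to reason about this tradeoff is sketched below. The model
is an assumption made for illustration: it supposes that compressing and
sending overlap, so the slower of the two stages sets the pace, and it
ignores the tape drive and everything else. All of the rates and the ratio
are made-up inputs, not measurements.

```python
def seconds_without_compression(size_mb: float, network_mb_s: float) -> float:
    """Time to send the data as it is."""
    return size_mb / network_mb_s


def seconds_with_client_compression(
    size_mb: float, network_mb_s: float, compress_mb_s: float, ratio: float
) -> float:
    """Time when the client compresses first; the slower stage dominates."""
    compress_time = size_mb / compress_mb_s
    send_time = (size_mb / ratio) / network_mb_s
    return max(compress_time, send_time)


# Slow network (1 MB/s): compression halves the time.
print(seconds_without_compression(1000, 1))                # 1000.0
print(seconds_with_client_compression(1000, 1, 10, 2.0))   # 500.0

# Fast network (100 MB/s), same slow client CPU: compression makes it worse.
print(seconds_without_compression(1000, 100))              # 10.0
print(seconds_with_client_compression(1000, 100, 10, 2.0)) # 100.0
```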
Note that the streaming throughput of your tape drive may be rated at its
maximum compressed rate.  Before you decide your drive is running too slowly,
you will need to determine the head-to-tape transfer speed of your tape
drive. It may have several such speeds, since tape drives often run at
multiple speeds to prevent buffer underruns.
Compression is unpredictable. For example, a user on your network may have a
hidden stash of MP3s on an encrypted volume, so although you might see the
data, it may be incompressible.
Also, some backup utilities, such as Tivoli Storage Manager, will not report
the values you expect for the data on your volumes if the data was compressed
by the client. The server sees what the client sent, not what the client
copied. Because data that is already compressed will not recompress in the
tape drive, the server sees each tape fill at about its raw capacity; however,
the data on that tape represents more client data than the server indicates.
If you have further questions about compression or would like information about
IBM's consulting services, contact your local service representative.
[ Doc Ref: 98406788611840     Publish Date: Mar. 23, 2001     4FAX Ref: 1029 ]