Compression Isn't Always What You Might Think

About this document
Related information
What is compression?
What happens if you have already compressed a file?
How does the file become bigger?
Where is compression performed?
What about "best case" compression ratios?
So what can be done?
What else should I know about compression?

About this document

The purpose of this document is to clarify the use of compression in backing up data, and to answer questions such as the following:

   Why am I not getting three-to-one compression?
   My tapes hold 120 gigs, but I'm only getting 38 gigs. Why?

The information in this document applies to all versions of AIX.

Related information

AIX and related product documentation is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html

See "Understanding Data Compression" in System Management Concepts: Operating System and Devices.

What is compression?

Compression finds repeating patterns in data, leaves a tag in their place, and keeps the repeated data only once. This means that:

   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

could become something like:

/2A

Here / is the escape character (in this deliberately simplified example), 2 is the count (the character 2 has the value 50 in US ASCII), and A is the character to repeat 50 times. Real compression schemes are more complicated than this, but in the end they all work by reducing repetition in a file.
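The escape/count/character idea above can be sketched in a few lines of Python. This is only an illustration of the toy scheme in this document, not the algorithm used by any real tool or tape drive. The function names, the rule that only runs longer than three bytes are encoded, and the handling of a literal / in the input are assumptions made for the sketch.

```python
ESC = ord("/")  # escape byte, as in the toy example above


def rle_encode(data: bytes) -> bytes:
    """Replace long runs of one byte with ESC, count, byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        # Measure the run; the count must fit in one byte, so stop at 255.
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        if run > 3 or byte == ESC:
            # Long runs save space; a literal "/" must always be escaped.
            out += bytes([ESC, run, byte])
        else:
            out += bytes([byte]) * run
        i += run
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    """Reverse rle_encode."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESC:
            count, byte = blob[i + 1], blob[i + 2]
            out += bytes([byte]) * count
            i += 3
        else:
            out.append(blob[i])
            i += 1
    return bytes(out)


print(rle_encode(b"A" * 50))              # b'/2A'
print(len(rle_encode(b"no long runs")))   # 12, nothing to save
```

Note that data without long runs does not shrink at all, and any / in the input costs two extra bytes.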

What happens if you have already compressed a file?

Compressed data generally does NOT recompress. Binary programs are semi-compressed, in that the source code (text) has been converted into machine code (binary), which is more compact. Graphics files tend to be compressed already, since image data is usually highly repetitive. Obviously, files produced by compression tools are already compressed. These include .ZIP by Phil Katz, the old UNIX compress .Z, GNU gzip .gz files, the newer open compression .bz, and numerous other formats.

Compressed data is already mostly non-repeating. If you try to recompress it, one of three things will happen:

A lot of CPU time is spent making the data only half a percent smaller.
The file ends up being the same size.
The file ends up being larger.
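A quick way to see these outcomes is to compress the same data twice. The sketch below uses Python's standard zlib module as a stand-in for whatever compressor you actually use. The sample data is made-up pseudo-text from a small word list, so the exact sizes are illustrative only.

```python
import random
import zlib

# Hypothetical sample data: pseudo-text built from a small vocabulary,
# standing in for ordinary, fairly repetitive files.
WORDS = ["backup", "tape", "drive", "data", "file", "system", "user",
         "network", "server", "client", "volume", "disk", "restore",
         "archive", "block", "record"]
rng = random.Random(0)  # fixed seed so repeated runs use the same data
original = " ".join(rng.choice(WORDS) for _ in range(50_000)).encode("ascii")

first_pass = zlib.compress(original, 9)     # compress the raw data
second_pass = zlib.compress(first_pass, 9)  # try to compress it again

for label, blob in [("original", original),
                    ("compressed once", first_pass),
                    ("compressed twice", second_pass)]:
    print(f"{label:<18}{len(blob):>10} bytes")
```

The first pass should shrink the data considerably. The second pass spends CPU time again and should leave the size essentially unchanged or slightly larger.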

How does the file become bigger?

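Compressed files require the inclusion of a decompression table, similar to a key, which is used during the extraction of data from the files. If the data itself does not shrink, the table is still added, so the output ends up larger than the input.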
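The fixed cost of the compressed format can be measured directly. In the sketch below, random bytes stand in for data that cannot be compressed, and Python's standard zlib module stands in for your compressor. In zlib's case the extra bytes are a header, a checksum, and block framing rather than a literal table, but the effect is the same: the output is larger than the input.

```python
import os
import zlib

# Random bytes have no repeating patterns, so they stand in for data
# that is already compressed.
payload = os.urandom(100_000)
packed = zlib.compress(payload, 9)

# Everything beyond the payload size is format overhead.
overhead = len(packed) - len(payload)
print(f"input {len(payload)} bytes, output {len(packed)} bytes, "
      f"overhead {overhead} bytes")

# Even an empty input produces a few bytes of output.
print(f"empty input compresses to {len(zlib.compress(b''))} bytes")
```

The overhead per file is small, but it is never zero, which is why a file that is already compressed tends to grow slightly.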

Where is compression performed?

There are several ways compression takes place:

Sometimes it is part of the file format for your application. In this case, the application uses CPU time to make the file smaller, but the savings are not guaranteed.
You might use a third-party application, such as RAR, WinZip, ARJ, GNU Zip, your backup software, or pencil and paper to make calculations.
Your operating system may have a feature to compress an entire disk, partition, filesystem, or individual files in a transparent manner to your applications.
You may use hardware, such as the CPU inside your tape drive, or an add-in card for your disk drives.

What about "best case" compression ratios?

Sales and marketing literature from most major companies and many other organizations tends to quote compression ratios that are either best case, such as a file consisting of 37 gigabytes of the same character, or values that are not attainable in a real-world device. While the technology can, in the best case, achieve huge compression ratios, these are virtually impossible to achieve unless you are backing up empty databases.
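The gap between a best-case ratio and a realistic one is easy to reproduce. The sketch below uses Python's standard zlib module, and the two payloads are invented for illustration: one is a single repeated character, and the other is half repeated character and half random bytes (standing in for files that are already compressed).

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Compression ratio: original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# Best case: one character repeated, like the marketing example above.
best_case = b"A" * 1_000_000

# Mixed case: half of the data is incompressible.
mixed = b"A" * 500_000 + os.urandom(500_000)

print(f"best case: {ratio(best_case):8.1f} to 1")
print(f"mixed    : {ratio(mixed):8.1f} to 1")
```

Even though half of the mixed payload compresses almost perfectly, the incompressible half limits the overall result to roughly two to one.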

So what can be done?

Be aware of the native (raw) capacity of your storage media, and of the characteristics of the data you are storing. If you are backing up users' workstations, don't expect to fit 120 gigabytes on a tape with a native capacity of 40 gigabytes.
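The arithmetic behind the advertised and the observed figures is simple. The sketch below uses hypothetical helper names and assumes a tape with 40 gigabytes of native capacity that is advertised at a three-to-one ratio, which matches the figures used in this document.

```python
def rated_capacity_gb(native_gb: float, ratio: float) -> float:
    """Capacity the label promises if the data compresses at `ratio` to one."""
    return native_gb * ratio


def achieved_ratio(stored_gb: float, native_gb: float) -> float:
    """Ratio you actually got, given how much data fit on a full tape."""
    return stored_gb / native_gb


NATIVE_GB = 40  # assumed native capacity of the tape

print(rated_capacity_gb(NATIVE_GB, 3.0))  # advertised capacity in GB
print(achieved_ratio(38, NATIVE_GB))      # ratio when only 38 GB fit
```

A result just below one to one, as in the 38 gigabyte case, suggests the data was already compressed or otherwise incompressible.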

What else should I know about compression?

Strong encryption randomizes data thoroughly enough that it can no longer be compressed.

Performance, a complex topic, is affected by compression. If your network is heavily loaded, slow, or poorly implemented, you will want to compress large amounts of data on the client before sending them over it. If your network is fast and lightly loaded, or if your client systems' CPUs are slow, then don't use client-side compression. Instead, let your tape drive do the work.
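The trade-off can be put into rough numbers. The sketch below is a deliberately simplistic model, not a sizing tool: it assumes the client compresses while it sends, so the sustained rate is the slower of the CPU and the network, and it ignores the tape drive, protocol overhead, and other load. The function name and all rates are hypothetical.

```python
def backup_seconds(size_mb, network_mb_s, client_cpu_mb_s=None, ratio=1.0):
    """Rough time to push size_mb of data to the backup server.

    If client_cpu_mb_s is given, the client compresses as it sends. The
    CPU handles client_cpu_mb_s of original data per second, and the
    network carries network_mb_s * ratio of original data per second.
    """
    if client_cpu_mb_s is None:
        return size_mb / network_mb_s
    return size_mb / min(client_cpu_mb_s, network_mb_s * ratio)


SIZE_MB = 10_000  # hypothetical backup size

# Slow network, reasonably fast client CPU: compression helps.
print(backup_seconds(SIZE_MB, network_mb_s=10))
print(backup_seconds(SIZE_MB, network_mb_s=10, client_cpu_mb_s=50, ratio=2.0))

# Fast network, slow client CPU: compression hurts.
print(backup_seconds(SIZE_MB, network_mb_s=100))
print(backup_seconds(SIZE_MB, network_mb_s=100, client_cpu_mb_s=20, ratio=2.0))
```

In the first pair the compressed transfer takes half the time. In the second pair the slow CPU becomes the bottleneck and the compressed transfer takes five times longer.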

Note that the streaming throughput of your tape drive may be rated at its maximum compressed rate. Before you decide your drive is running too slowly, you will need to determine the head-to-tape transfer speed of your tape drive. It may have several such speeds, since tape drives often run at multiple speeds to prevent buffer underruns.

Compression is unpredictable. For example, a user on your network may have a hidden stash of MP3s on an encrypted volume, so although you might see the data, it may be incompressible.

Also, some backup utilities, such as Tivoli Storage Manager, will not report the values you expect for the data on your volumes if the data was compressed by the client. The server sees what was sent from the client, not what the client originally copied. Because client-compressed data does not recompress, the server sees tapes filling at about their raw capacity. Since the data was compressed before it was sent, each tape actually holds more of the client's original data than the server indicates.

If you have further questions about compression or would like information about IBM's consulting services, contact your local service representative.

[ Doc Ref: 98406788611840 Publish Date: Mar. 23, 2001 4FAX Ref: 1029 ]