Deduplication and Compression in Hybrid Storage

These aren’t just buzzwords; I think you’ve heard them at least a few times in relation to storage.

So what are they, and how much can you gain from these technologies?

Let’s begin with compression.

If you have a 10 TB storage appliance with 4 TB already occupied, how much more data can you really fit on it? Another 6 TB?

Wrong!

Nowadays, storage solutions offer on-the-fly data compression with negligible impact on performance, so if your data can be compressed at all, you get more usable drive space at the same price.

The question is what kind of data you have and how well it compresses.

Most storage solutions use some sort of ‘zip’-style algorithm, so you can take a sample chunk of your data and try compressing it yourself to estimate the ratio.
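If you want a rough estimate before touching any storage settings, here is a minimal Python sketch that compresses a sample file chunk by chunk with the standard zlib module (a DEFLATE, i.e. ‘zip’-style, codec); the file path and chunk size are just placeholders for whatever you want to test:

    import sys
    import zlib

    def estimate_compression_ratio(path, level=6, chunk_size=64 * 1024):
        """Compress a file chunk by chunk and return raw_size / compressed_size."""
        raw_bytes = 0
        compressed_bytes = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                raw_bytes += len(chunk)
                # Each chunk is compressed independently, roughly the way a
                # block-based storage engine would do it.
                compressed_bytes += len(zlib.compress(chunk, level))
        return raw_bytes / compressed_bytes if compressed_bytes else 1.0

    if __name__ == "__main__":
        # Pass any sample file: a VM disk image, a mail store dump, a video...
        sample_path = sys.argv[1]
        print(f"{sample_path}: ~{estimate_compression_ratio(sample_path):.2f}x")

As a rough guide, a ratio of about 1.4x to 2.0x corresponds to the 30 to 50 percent savings mentioned below for VM disks.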

In my experience, compression ratios for VM disk data vary from 30 to 50 percent.

For mail storage it can be unpredictable, varying from 10 to 80 percent depending on the attachments.

And most audio and video content can barely be compressed at all, since it is already compressed by its own codecs.

Still, it’s a good idea to keep compression enabled, because it costs you almost no resources and can even speed up read operations a bit, since less data has to be read from disk. One of the best-known filesystems that supports data compression natively is ZFS.
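If you are on ZFS specifically, you can check what compression is actually buying you via the compressratio property. Here is a small Python sketch that shells out to the zfs command-line tool; it assumes the zfs CLI is available, and the dataset name tank/data is only an example:

    import subprocess

    def zfs_compressratio(dataset):
        """Return the ZFS 'compressratio' property of a dataset, e.g. '1.45x'.

        Assumes the zfs CLI is installed and the dataset exists.
        """
        result = subprocess.run(
            ["zfs", "get", "-H", "-o", "value", "compressratio", dataset],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    # Example usage (the dataset name is hypothetical):
    #   zfs set compression=lz4 tank/data    # enable cheap on-the-fly compression
    #   print(zfs_compressratio("tank/data"))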


Now let’s deal with deduplication.

This is a more complicated topic, I’d say.

The idea is very simple: imagine that the storage is divided into blocks of some fixed size. If an identical block has already been written to the storage, we can write a small pointer to that existing block instead of writing the data again, and save the space.

The system identifies each block with some sort of hash algorithm and acts based on the calculated checksum. The best-known filesystem that supports data deduplication natively is, by the way, ZFS as well.
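To make the idea concrete, here is a toy Python sketch of block-level deduplication: data is split into fixed-size blocks, each block is hashed, and only blocks that haven’t been seen before are actually stored, while duplicates cost only a hash entry. This is a simplified illustration, not how ZFS or any particular appliance implements it internally:

    import hashlib

    BLOCK_SIZE = 64 * 1024   # fixed block size; see the trade-offs below

    def dedup_store(data, block_store=None):
        """Split data into blocks and store only the unique ones.

        Returns a "recipe" (ordered list of block hashes) plus the shared
        block store (hash -> block bytes), which plays the role of the
        deduplication table here.
        """
        if block_store is None:
            block_store = {}
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in block_store:
                block_store[digest] = block      # first occurrence: write the block
            recipe.append(digest)                # duplicates only cost a pointer/hash
        return recipe, block_store

    def dedup_restore(recipe, block_store):
        """Reassemble the original data from a recipe."""
        return b"".join(block_store[digest] for digest in recipe)

    # Two "files" that share most of their content deduplicate nicely:
    file_a = b"A" * (3 * BLOCK_SIZE)
    file_b = b"A" * (2 * BLOCK_SIZE) + b"B" * BLOCK_SIZE
    recipe_a, store = dedup_store(file_a)
    recipe_b, store = dedup_store(file_b, store)
    assert dedup_restore(recipe_a, store) == file_a
    stored = sum(len(block) for block in store.values())
    print(f"logical: {len(file_a) + len(file_b)} bytes, physically stored: {stored} bytes")
    # logical: 393216 bytes, physically stored: 131072 bytes (2 unique blocks out of 6)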

But there are two problems:

  • Block size.

If we set it too low, the overhead of tracking so many blocks eats into the space savings; if we make it big, the deduplication ratio drops, because fewer blocks turn out to be exact matches.

  • Deduplication table.

We need a table where the checksums of all known blocks are stored.
And this table must fit into RAM, so we can access it very quickly when writing data to and reading it from the storage.

So let’s get to the numbers:

Assume we have the same 10 TB storage appliance, now with deduplication enabled. If we use a relatively large block size of 64 KB, we’ll need at least 5 GB of RAM for every 1 TB of storage, which gives us 50 GB of RAM in total.

And that’s for the deduplication table only; you’ll also want some RAM for the system, cache, and so on!

A large block size can work well if you’re storing large files.

If not, you’ll need a smaller block size, and then the RAM requirements will at least double.
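As a back-of-the-envelope check of those numbers (assuming roughly 320 bytes per deduplication table entry, which is the usual ZFS rule of thumb), the arithmetic looks like this in Python:

    def ddt_ram_gb(storage_tb, block_kb, bytes_per_entry=320):
        """Rough RAM needed to keep the whole deduplication table in memory."""
        blocks = storage_tb * 2**40 / (block_kb * 2**10)   # number of blocks on the appliance
        return blocks * bytes_per_entry / 2**30            # table size in GiB

    print(ddt_ram_gb(10, 64))   # 64 KB blocks:  ~50 GiB for a 10 TB appliance
    print(ddt_ram_gb(10, 32))   # 32 KB blocks: ~100 GiB, i.e. the requirement doubles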

What about the gain?

It may surprise you, but it’s not fantastic for most customers.

You’ll see deduplication ratios of about 1.1 to 2.0 in most real-life cases, though I did once see a ratio of about 20. That was a virtual desktop infrastructure storage host holding many very similar VDI images.

So I’d recommend not being over-optimistic about deduplication.

It can be useful in some cases, but simple data compression will usually give you far more benefit at practically no cost.

Most hybrid storage systems offer both data compression and deduplication, so you can easily enable and disable them, test how they behave with your data, and then make a decision.

David Kovacs is passionate about entrepreneurship and triathlon. Presently, he is spending most of his time with his friends trying to kick off an awesome project.

https://www.linkedin.com/in/kovacsd