I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival. I want to make the most of the drive’s capacity, so I plan to compress everything at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images so that saving an entire image only takes up as much space as the files it actually contains, and in my experience raw images can shrink by a third or even half at max compression (and I assume that’s lossless, since file-level compression can regenerate the original file in its entirety?).
I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the standard Linux/Unix utilities, since this is my disaster recovery plan and I want to be able to work with it from a Linux live image without installing any extra packages when my server dies. Hence I’m only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
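(To make that last question concrete, by separate checksum files I mean something like this rough sketch, which writes a SHA256SUMS-style file next to the archives; the paths are placeholders.)

```python
import hashlib
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/archive")  # placeholder path

# One "<sha256>  <name>" line per archive, in the layout `sha256sum -c` expects.
with open(ARCHIVE_DIR / "SHA256SUMS", "w") as sums:
    for path in sorted(ARCHIVE_DIR.glob("*.tar.*")):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        sums.write(f"{h.hexdigest()}  {path.name}\n")
```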
Whichever compresses your data the smallest, while still having acceptable runtime?
The zeroed-out free space will compress well with any of them, to my knowledge. Everything else in your dataset doesn’t sound highly compressible anyway. Plus, you should really consider decompression time. In general your trade-off seems wrong: if you need compression at all in your case, go with the fastest decompression. If you’re using a filesystem like ZFS, you can have it handle compression automatically.
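If you go the ZFS route, it’s just a property on the dataset. A sketch of the idea (the dataset name is made up, zstd assumes a reasonably recent OpenZFS, and you could just as well type the zfs commands in a shell):

```python
import subprocess

DATASET = "tank/archive"  # hypothetical pool/dataset name

# Let ZFS compress transparently; files read back normally either way.
subprocess.run(["zfs", "set", "compression=zstd", DATASET], check=True)

# Later, see how much the dataset actually compressed.
subprocess.run(["zfs", "get", "compressratio", DATASET], check=True)
```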
From my perspective zstd seems like it’s becoming the new standard over gz/bzip2/xz, so maybe consider this.
It feels like an XY problem. If I’m not wrong, a single bit flip leaving a file completely unextractable is a bit wild; my guess is that in those horror stories the file’s headers were blown.
As for general stuff: use a filesystem which does parity calculation and such, or use something like RAID to have redundant drives (you can set it up so that 1 drive in 5 can fail, or 2 in 5, but the more failures you allow, the less usable space you get). Or just keep really simple extra backups.
As to physical media: do not go flash-based (SSDs/SD cards/USB pen drives) if you want to leave them unpowered; they expect to be powered every few months, and are effectively RAM disks, just much more stable. Hard disk drives are better, but they handle physical shocks much worse: you can drop an SSD and expect it to work, whereas for a hard disk it’s almost game over. Magnetic tapes are better still; they are much less data-dense, but they are cheap.
I would assume it’s lossless
Yes, these are lossless algorithms.
Now, coming to compression: no compression format really deals with bit corruption. Practically all of them aim for small size and/or fast (de)compression, and spending bytes on parity is seen as wasteful.
If you can install something, try, for example, https://github.com/Parchive/par2cmdline. You give it any file (compressed or not) and it generates parity data so you can repair the file later. So use whatever compression you want, and prepare parity data for the worst case.
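A minimal sketch of how that could look, driving par2 from Python (this assumes par2cmdline is on your PATH and that -r sets the redundancy percentage; double-check the flags against your version):

```python
import subprocess

ARCHIVE = "photos-2024.tar.xz"  # hypothetical archive name

# Create recovery files with roughly 10% redundancy next to the archive.
subprocess.run(["par2", "create", "-r10", ARCHIVE], check=True)

# Years later, after copying everything off the cold drive:
subprocess.run(["par2", "verify", ARCHIVE + ".par2"], check=True)
# ...and if verify reports damage:
# subprocess.run(["par2", "repair", ARCHIVE + ".par2"], check=True)
```

The .par2 files only help if they survive too, so keep them next to the archives and, ideally, on more than one drive.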
As to which compression algorithm: zip or gzip (deflate), bzip2 (or the newer bzip3), xz (LZMA in general), zstandard (or the older lz4), and brotli are practically not going anywhere. Most distros use them, they are used on the web, and in many other places. My favorite is zstandard, as it gives great compression and is extremely fast.
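On the availability point: gzip, bzip2, and xz are so standard that even Python’s standard library ships them, which also makes it easy to spot-check ratios on a sample of your own data (a sketch; level 9 is the maximum for all three, and the sample path is made up):

```python
import bz2, gzip, lzma
from pathlib import Path

sample = Path("sample.dng").read_bytes()  # hypothetical raw photo

for name, compress in [
    ("gzip", lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2", lambda d: bz2.compress(d, compresslevel=9)),
    ("xz", lambda d: lzma.compress(d, preset=9)),
]:
    out = compress(sample)
    print(f"{name}: {len(out) / len(sample):.1%} of original size")
```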
Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting?
Recovery: no. Detection: mostly, actually. gzip stores a CRC-32 of the uncompressed data, bzip2 checksums each block, and xz embeds a CRC-32/CRC-64 (or SHA-256) check, so corruption is almost always caught at extraction time; it just isn’t an absolute certainty, and none of them can repair anything. Separate checksums of the originals still don’t hurt.
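A quick way to convince yourself of the detection part (a sketch using Python’s lzma module, which wraps the same xz format; depending on where the flip lands you get either a format error or a failed integrity check, but both surface as an exception):

```python
import lzma

data = b"some archive contents " * 1000
blob = bytearray(lzma.compress(data, preset=9))

# Flip a single bit somewhere in the middle of the compressed stream.
blob[len(blob) // 2] ^= 0x01

try:
    lzma.decompress(bytes(blob))
    print("decompressed without complaint (unlikely)")
except lzma.LZMAError as exc:
    print("corruption detected:", exc)
```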
You should also consider the archive format. For example, zip (the format, not the algorithm) and tar are effectively standards, stable, and practically here forever. There are also mountable ones like squashfs (fairly common; most Linux distros use it for live images) and dwarfs (not yet a standard; imagine squashfs, but also deduplicating).
Do compression formats exist (or could they exist) which correct for bit flips? Yes.
If your goal is that a single bit flip should not ruin things, you should probably not look into the deduplicating ones: they reduce the number of bits stored by sharing data between files, so when a bit flip does happen, more files get corrupted at once.
Now, coming to another part: do you want to compress the data at all? If so, why?
When you compress data, you are literally reducing the number of bits. Now imagine that every bit on your disk is equally likely to undergo bitrot; in that case, compressing makes your files less likely to be hit at all.
But, as you have also said, it is possible that when compressed files do corrupt, the corruption is more catastrophic (in an uncompressed plain-text file a flipped bit may just mean one mutated character, or in an image some changed color; hardly problematic).
So you should also check: is compression worth it? Come up with a number, let’s say 90%. If the compression algorithm only reduces the file to 0.93 of the original size, do not compress; if it does better than your number, do compress. I am not saying pick 90% specifically, just decide on a threshold you are content with.
Here is a stupid idea: if compression reduces the file size by 2x, then compress, and make yet another copy of the compressed file. Now even if one copy is corrupted, you still have a pristine one.
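Tying those last two points together, a sketch of that decision (the threshold and paths are made up; it compresses with xz at maximum, keeps the result only if it beats the threshold, and writes a second copy when the ratio is good enough to afford one):

```python
import lzma
import shutil
from pathlib import Path

SRC = Path("backup.img")   # hypothetical input file
THRESHOLD = 0.90           # keep compression only if it beats this ratio

data = SRC.read_bytes()
compressed = lzma.compress(data, preset=9)
ratio = len(compressed) / len(data)

if ratio <= THRESHOLD:
    out = SRC.parent / (SRC.name + ".xz")
    out.write_bytes(compressed)
    print(f"kept {out} ({ratio:.2f} of original)")
    if ratio <= 0.5:
        # Cheap insurance: a 2x reduction pays for a second, independent copy.
        shutil.copy2(out, out.parent / (out.name + ".copy"))
else:
    print(f"not worth it ({ratio:.2f}), storing the original uncompressed")
```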
Seconding the zstd rec.
For anyone like OP worried about having an offline copy of the utility on hand, I strongly recommend throwing a copy of cosmos on your recovery drive. It’s a set of common tools, zstd included, compiled to a special binary format that’ll run on all major OSes (and ARM64).
You can of course mitigate the risk of corruption by making several compressed files instead of just one big one. Then you will lose only part of the data in the worst-case scenario.
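For instance, a sketch of one-archive-per-directory instead of a single monolithic tarball (the directory names are made up, and the same idea works with plain tar and xz on the command line):

```python
import tarfile
from pathlib import Path

SRC = Path("/srv/backups")   # hypothetical source tree
DST = Path("/mnt/archive")   # hypothetical destination drive

# One independent .tar.xz per top-level directory, so a corrupted
# archive only takes that one slice of the data with it.
for entry in sorted(SRC.iterdir()):
    if entry.is_dir():
        out = DST / f"{entry.name}.tar.xz"
        with tarfile.open(out, "w:xz", preset=9) as tar:
            tar.add(entry, arcname=entry.name)
```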