I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival. I want to make the most of the drive’s capacity, so I plan to compress everything at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images so that saving an entire image only takes up as much space as the files it actually contains, and in my experience raw images can shrink by a third or even half at max compression (and I assume that’s lossless, since file-level compression can regenerate the original file in its entirety?).
I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the standard Linux/Unix utilities, since this is my disaster recovery plan and I want to be able to work with it from a Linux live image without installing any extra packages when my server dies. Hence I’m only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
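(To make that last question concrete, by separate checksum files I mean something like this rough sketch, which writes a SHA256SUMS-style file next to the archives; the paths are placeholders.)

```python
import hashlib
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/archive")  # placeholder path

# One "<sha256>  <name>" line per archive, in the layout `sha256sum -c` expects.
with open(ARCHIVE_DIR / "SHA256SUMS", "w") as sums:
    for path in sorted(ARCHIVE_DIR.glob("*.tar.*")):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        sums.write(f"{h.hexdigest()}  {path.name}\n")
```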
Whichever compresses your data the smallest, while still having acceptable runtime?
The zeroed-out free space will compress well with any of them, to my knowledge. Everything else in your dataset doesn’t sound highly compressible anyway. Plus, you should really consider decompression time. In general your trade-off seems wrong: if you need compression at all in your case, go with the fastest decompression. If you’re using a filesystem like ZFS, you can have it handle compression automatically.
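If you go the ZFS route, it’s just a property on the dataset. A sketch of the idea (the dataset name is made up, zstd assumes a reasonably recent OpenZFS, and you could just as well type the zfs commands in a shell):

```python
import subprocess

DATASET = "tank/archive"  # hypothetical pool/dataset name

# Let ZFS compress transparently; files read back normally either way.
subprocess.run(["zfs", "set", "compression=zstd", DATASET], check=True)

# Later, see how much the dataset actually compressed.
subprocess.run(["zfs", "get", "compressratio", DATASET], check=True)
```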
From my perspective zstd seems like it’s becoming the new standard over gz/bzip2/xz, so maybe consider this.
It feels like an XY problem. If I’m not wrong, a single bit flip leaving a file completely unextractable is a bit wild; my guess is that in those horror stories the file’s headers were blown.
As for general stuff: use a filesystem which does parity calculation and such, or use something like RAID to have redundant drives (you can set it up so that 1 drive in 5 can fail, or 2 in 5, but the more failures you allow, the less usable space you get). Or just keep really simple extra backups.
As to physical media: do not go flash-based (SSDs/SD cards/USB pen drives) if you want to leave them unpowered; they expect to be powered every few months, and are effectively RAM disks, just much more stable. Hard disk drives are better, but they handle physical shocks much worse: you can drop an SSD and expect it to work, whereas for a hard disk it’s almost game over. Magnetic tapes are better still; they are much less data-dense, but they are cheap.
I would assume it’s lossless
Yes, these are lossless algorithms.
Now, coming to compression: no compression format really deals with bit corruption. Practically all of them aim for small size and/or fast (de)compression, and spending bytes on parity is seen as wasteful.
If you can install something, try, for example, https://github.com/Parchive/par2cmdline. You give it any file (compressed or not) and it generates parity data so you can repair the file later. So use whatever compression you want, and prepare parity data for the worst case.
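A minimal sketch of how that could look, driving par2 from Python (this assumes par2cmdline is on your PATH and that -r sets the redundancy percentage; double-check the flags against your version):

```python
import subprocess

ARCHIVE = "photos-2024.tar.xz"  # hypothetical archive name

# Create recovery files with roughly 10% redundancy next to the archive.
subprocess.run(["par2", "create", "-r10", ARCHIVE], check=True)

# Years later, after copying everything off the cold drive:
subprocess.run(["par2", "verify", ARCHIVE + ".par2"], check=True)
# ...and if verify reports damage:
# subprocess.run(["par2", "repair", ARCHIVE + ".par2"], check=True)
```

The .par2 files only help if they survive too, so keep them next to the archives and, ideally, on more than one drive.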
As to which compression algorithm: zip or gzip (deflate), bzip2 (or the newer bzip3), xz (LZMA in general), zstandard (or the older lz4), and brotli are practically not going anywhere. Most distros use them, they are used on the web, and in many other places. My favorite is zstandard, as it gives great compression and is extremely fast.
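On the availability point: gzip, bzip2, and xz are so standard that even Python’s standard library ships them, which also makes it easy to spot-check ratios on a sample of your own data (a sketch; level 9 is the maximum for all three, and the sample path is made up):

```python
import bz2, gzip, lzma
from pathlib import Path

sample = Path("sample.dng").read_bytes()  # hypothetical raw photo

for name, compress in [
    ("gzip", lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2", lambda d: bz2.compress(d, compresslevel=9)),
    ("xz", lambda d: lzma.compress(d, preset=9)),
]:
    out = compress(sample)
    print(f"{name}: {len(out) / len(sample):.1%} of original size")
```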
Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting?
Recovery: no. Detection: mostly, actually. gzip stores a CRC-32 of the uncompressed data, bzip2 checksums each block, and xz embeds a CRC-32/CRC-64 (or SHA-256) check, so corruption is almost always caught at extraction time; it just isn’t an absolute certainty, and none of them can repair anything. Separate checksums of the originals still don’t hurt.
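A quick way to convince yourself of the detection part (a sketch using Python’s lzma module, which wraps the same xz format; depending on where the flip lands you get either a format error or a failed integrity check, but both surface as an exception):

```python
import lzma

data = b"some archive contents " * 1000
blob = bytearray(lzma.compress(data, preset=9))

# Flip a single bit somewhere in the middle of the compressed stream.
blob[len(blob) // 2] ^= 0x01

try:
    lzma.decompress(bytes(blob))
    print("decompressed without complaint (unlikely)")
except lzma.LZMAError as exc:
    print("corruption detected:", exc)
```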
You should also consider the archive format. For example, zip (the format, not the algorithm) and tar are effectively standards, stable, and practically here forever. There are also mountable ones like squashfs (fairly common; most Linux distros use it for live images) and dwarfs (not yet a standard; imagine squashfs, but also deduplicating).
Do compression formats exist (or could they exist) which correct for bit flips? Yes.
If your goal is that a single bit flip should not ruin things, you should probably not look into the deduplicating ones: they reduce the number of bits stored by sharing data between files, so when a bit flip does happen, more files get corrupted at once.
Now, coming to another part: do you want to compress the data at all? If so, why?
When you compress data, you are literally reducing the number of bits. Now imagine that every bit on your disk is equally likely to undergo bitrot; in that case, compressing makes your files less likely to be hit at all.
But, as you have also said, it is possible that when compressed files do corrupt, the corruption is more catastrophic (in an uncompressed plain-text file a flipped bit may just mean one mutated character, or in an image some changed color; hardly problematic).
So you should also check: is compression worth it? Come up with a number, let’s say 90%. If the compression algorithm only reduces the file to 0.93 of the original size, do not compress; if it does better than your number, do compress. I am not saying pick 90% specifically, just decide on a threshold you are content with.
Here is a stupid idea: if compression reduces the file size by 2x, then compress, and make yet another copy of the compressed file. Now even if one copy is corrupted, you still have a pristine one.
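Tying those last two points together, a sketch of that decision (the threshold and paths are made up; it compresses with xz at maximum, keeps the result only if it beats the threshold, and writes a second copy when the ratio is good enough to afford one):

```python
import lzma
import shutil
from pathlib import Path

SRC = Path("backup.img")   # hypothetical input file
THRESHOLD = 0.90           # keep compression only if it beats this ratio

data = SRC.read_bytes()
compressed = lzma.compress(data, preset=9)
ratio = len(compressed) / len(data)

if ratio <= THRESHOLD:
    out = SRC.parent / (SRC.name + ".xz")
    out.write_bytes(compressed)
    print(f"kept {out} ({ratio:.2f} of original)")
    if ratio <= 0.5:
        # Cheap insurance: a 2x reduction pays for a second, independent copy.
        shutil.copy2(out, out.parent / (out.name + ".copy"))
else:
    print(f"not worth it ({ratio:.2f}), storing the original uncompressed")
```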
Seconding the zstd rec.
For anyone like OP worried about having an offline copy of the utility on hand, I strongly recommend throwing a copy of cosmos on your recovery drive. It’s a set of common tools, zstd included, compiled to a special binary format that’ll run on all major OSes (and ARM64).
You can of course mitigate the risk of corruption by making several compressed files instead of just one big one. Then you will lose only part of the data in the worst-case scenario.
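For instance, a sketch of one-archive-per-directory instead of a single monolithic tarball (the directory names are made up, and the same idea works with plain tar and xz on the command line):

```python
import tarfile
from pathlib import Path

SRC = Path("/srv/backups")   # hypothetical source tree
DST = Path("/mnt/archive")   # hypothetical destination drive

# One independent .tar.xz per top-level directory, so a corrupted
# archive only takes that one slice of the data with it.
for entry in sorted(SRC.iterdir()):
    if entry.is_dir():
        out = DST / f"{entry.name}.tar.xz"
        with tarfile.open(out, "w:xz", preset=9) as tar:
            tar.add(entry, arcname=entry.name)
```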