Have you ever wished that tar or zip would deduplicate files when creating an archive? Well, here’s a hacky solution using git.
How It Works
Git already has deduplication functionality, due to the way it stores files. Internally, each file’s contents are stored as an object named after the checksum of those contents, so if two files have identical contents then only one copy is stored.
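To see this for yourself, here is a small sketch that puts two identical files in a throwaway repo and checks that git gives them the same object ID. The file names and the temp directory are made up for the demo, and it assumes git is installed.

```shell
#!/bin/sh
set -eu

# Throwaway repo in a temp directory (hypothetical demo location).
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

# Two files with different names but identical contents.
printf 'same contents\n' > a.txt
cp a.txt b.txt

# git hash-object prints the ID git would store each file under.
hash_a=$(git hash-object a.txt)
hash_b=$(git hash-object b.txt)
echo "a.txt -> $hash_a"
echo "b.txt -> $hash_b"

# Staging both files writes a single blob to the object store.
git add a.txt b.txt
blob_count=$(find .git/objects -type f -path '.git/objects/??/*' | wc -l)
echo "blobs stored: $blob_count"
```

Both IDs should match, and only one blob should be stored even though two files were added.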
So, to make use of this, you add and commit all the files to a new git repo, and git performs the deduplication. Then you archive the .git directory of the repo with zip or tar.
When unarchiving, you just do the opposite. Unzip the .git directory inside the destination directory. Run git reset --hard to bring back all the duplicate files. Then, just delete the .git folder.
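The steps above can be sketched roughly like this. This is a minimal sketch written for illustration, not the gitar.sh script itself: the function names gitar_pack and gitar_unpack, the placeholder committer identity, and the choice of tar over zip are all my own assumptions.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: pack folder $1 into archive $2 via a temporary git repo.
gitar_pack() {
    src=$1
    out=$2
    (
        cd "$src"
        git init -q .
        # -f also adds files that .gitignore rules would otherwise skip.
        git add -A -f .
        # Placeholder identity so the commit works without any git config.
        git -c user.name=gitar -c user.email=gitar@example.invalid \
            -c commit.gpgsign=false commit -q -m "gitar snapshot"
        # Pack the stored objects together.
        git gc -q --aggressive
    )
    # Archive only the .git directory, then remove it from the source folder.
    tar -cf "$out" -C "$src" .git
    rm -rf "$src/.git"
}

# Hypothetical helper: restore archive $1 into folder $2.
gitar_unpack() {
    archive=$1
    dest=$2
    mkdir -p "$dest"
    tar -xf "$archive" -C "$dest"
    (
        cd "$dest"
        # Check every committed file back out into the working directory.
        git reset -q --hard
    )
    rm -rf "$dest/.git"
}
```

Something like `gitar_pack myfolder myfolder.gitar` followed by `gitar_unpack myfolder.gitar restored` should round-trip a folder. Bear in mind that git does not track empty directories and only records the executable bit of file permissions, so this is not a faithful replacement for tar. The sketch also assumes the folder is not empty and does not already contain a .git directory.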
Git also compresses its objects with zlib, and running git gc --aggressive packs them together more tightly. Bzip2 compression is better, but why not have both?!
The Results
I took some recent work, which I know contains duplicate files, to test if this would actually work. Here are the results:
39M original
3.5M original.gitar
10M original.tar.bz2
2.7M original.tar.lrz *see update below
The original directory contained 39 MB of files. Running tar cjf original.tar.bz2 original, which uses bzip2 compression, compressed the folder to about 25% of its original size. The git method compressed the folder to about 10% of its original size. So it does actually work.
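If you want to check those percentages, here is a quick sketch that works them out from the sizes listed above. The helper name percent_of is made up, and since the listed sizes are already rounded, the results are only approximate.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print size $1 as a percentage of size $2.
percent_of() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part / whole * 100 }'
}

percent_of 10 39     # original.tar.bz2 -> 25.6
percent_of 3.5 39    # original.gitar   -> 9.0
percent_of 2.7 39    # original.tar.lrz -> 6.9
```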
Update: lrzip is better
After publishing this article, someone suggested trying lrzip, which I hadn’t heard of before. It doesn’t do file deduplication per se, but it does a good job of compressing files with large chunks of redundant data – such as a tarball of duplicate files. By default it uses LZMA compression, which seems to be better than bzip2.
Running tar cf original.tar original && lrzip original.tar produces a file named original.tar.lrz with a size of 2.7M, which is a bit better than the git method.
The Script
Here is a quick and nasty script called gitar.sh that makes these deduplicated archives. Use gitar.sh myfolder to create the myfolder.gitar archive. Then use gitar.sh myfolder.gitar to recreate the original folder.
Do whatever you want with the script. I’ve released it under the MIT license just because I don’t want to get sued if someone copy/pastes it onto a production server and everything explodes.