Have you ever wished that
zip would deduplicate files when creating an archive? Well here’s a hacky solution using git.
Git already has deduplication functionality, due to the way it stores files. Internally, files are named using their own checksums, so if two files have the same checksum then only one copy of the file is stored.
So, to make use of this, if you add all the files to a new git repo then it will perform the deduplication. Then, you archive the
.git directory of the repo with
When unarchiving, you just do the opposite. Unzip the
.git directory inside the destination directory. Run
git reset --hard to bring back all the duplicate files. Then, just delete the
Git will also do zlib compression if you run
git gc --aggressive. Bzip2 compression is better, but why not have both?!
I took some recent work, which I know contains duplicate files, to test if this would actually work. Here are the results:
39M original 3.5M original.gitar 10M original.tar.bz2 2.7M original.tar.lrz *see update below
The original directory contained 39mb of files. Running
tar cjf original.tar.bz2 original, which uses bzip2 compression, compressed the folder to about 25% of it’s original size. The git method compressed the folder to about 10% of it’s original size. So it does actually work.
After publishing this article, someone suggested trying lrzip, which I hadn’t heard of before. It doesn’t do file deduplication per se, but it does a good job of compressing files with large chunks of redundant data – such as a tarball of duplicate files. By default it uses LZMA compression, which seems to be better than bzip2.
tar cf original.tar original && lrzip original.tar produces a file named
original.tar.lrz with a size of
2.7M, which is a bit better than the git method.
Here is a quick and nasty script called
gitar.sh that makes these deduplicated archives. Use
gitar.sh myfolder to create the
myfolder.gitar archive. Then use
gitar.sh myfolder.gitar to recreate the original folder.
Do whatever you want with the script. I’ve released it under the MIT license just because I don’t want to get sued if someone copy/pastes it onto a production server and everything explodes.