Have you ever wished that tar or zip would deduplicate files when creating an archive? Well here’s a hacky solution using git.

## How It Works

Git already has deduplication functionality, due to the way it stores files. Internally, objects are named by a checksum of their contents, so if two files have identical contents then only one copy is stored.

To make use of this, add all the files to a new git repo, which deduplicates them as they are stored. Then, archive the .git directory of the repo with zip or tar.
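Sketched in shell, with myfolder standing in for any directory (the sample files here are just for illustration, and the user.name/user.email settings are only so the commit works in a fresh environment):

```shell
# Demo data: two identical copies of the same file.
mkdir -p myfolder
printf 'hello world\n' > myfolder/a.txt
cp myfolder/a.txt myfolder/b.txt

# Turn the folder into a deduplicated archive.
cd myfolder
git init --quiet
git add -A                 # identical files are stored as one object
git -c user.email=gitar@local -c user.name=gitar \
    commit --quiet -m "snapshot"
cd ..
tar czf myfolder.gitar -C myfolder .git   # archive only the .git directory
```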

When unarchiving, you just do the opposite. Unzip the .git directory inside the destination directory. Run git reset --hard to bring back all the duplicate files. Then, just delete the .git folder.
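A round trip of the restore steps might look like this (the setup lines just build a small archive to restore; src and restored are placeholder names):

```shell
# Setup: build a small .gitar as described above, with two identical files.
mkdir -p src
printf 'data\n' > src/a.txt
cp src/a.txt src/b.txt
git -C src init --quiet
git -C src add -A
git -C src -c user.email=gitar@local -c user.name=gitar \
    commit --quiet -m snapshot
tar czf src.gitar -C src .git

# Restore: unpack the .git directory, check the files out, delete the repo.
mkdir -p restored
tar xzf src.gitar -C restored
git -C restored reset --hard --quiet   # brings back every file, duplicates included
rm -rf restored/.git                   # leaves a plain folder behind
```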

Git already stores its objects zlib-compressed, and running git gc --aggressive repacks them with delta compression on top. Bzip2 compression is better, but why not have both?!
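For example, after committing, a repack gathers the loose objects into a single packfile (repo is a placeholder name; the config flags are only so the commit works in a fresh environment):

```shell
# Demo: gc repacks loose objects with delta compression on top of zlib.
mkdir -p repo
git -C repo init --quiet
seq 1 5000 > repo/big.txt
git -C repo add big.txt
git -C repo -c user.email=gitar@local -c user.name=gitar \
    commit --quiet -m snapshot
git -C repo gc --aggressive --quiet
ls repo/.git/objects/pack/     # objects now live in a packfile
```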

## The Results

I took some recent work, which I know contains duplicate files, to test if this would actually work. Here are the results:

```
 39M  original
3.5M  original.gitar
 10M  original.tar.bz2
2.7M  original.tar.lrz  *see update below
```

The original directory contained 39 MB of files. Running tar cjf original.tar.bz2 original, which uses bzip2 compression, shrank the folder to about 25% of its original size. The git method shrank it to about 10% of its original size. So it does actually work.

## Update: lrzip is better

After publishing this article, someone suggested trying lrzip, which I hadn’t heard of before. It doesn’t do file deduplication per se, but it does a good job of compressing files with large chunks of redundant data – such as a tarball of duplicate files. By default it uses LZMA compression, which seems to be better than bzip2.

Running tar cf original.tar original && lrzip original.tar produces a file named original.tar.lrz with a size of 2.7M, which is a bit better than the git method.

## The Script

Update: Sam Gleske has written a more robust script here: http://github.com/sag47/drexel-university/tree/master/bin.

Here is a quick and nasty script called gitar.sh that makes these deduplicated archives. Use gitar.sh myfolder to create the myfolder.gitar archive. Then use gitar.sh myfolder.gitar to recreate the original folder.
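The script itself isn't reproduced here, but a minimal sketch of the idea (not the original gitar.sh, and without its safety checks) might look like:

```shell
# gitar: minimal sketch of the dedup-archive idea (not the original script).
# Usage: gitar myfolder        -> creates myfolder.gitar
#        gitar myfolder.gitar  -> recreates myfolder/
gitar() {
  case "$1" in
    *.gitar)
      dir="${1%.gitar}"
      mkdir -p "$dir"
      tar xzf "$1" -C "$dir"               # unpack the .git directory
      git -C "$dir" reset --hard --quiet   # bring back every file
      rm -rf "$dir/.git"
      ;;
    *)
      git -C "$1" init --quiet
      git -C "$1" add -A                   # duplicates stored once
      git -C "$1" -c user.email=gitar@local -c user.name=gitar \
          commit --quiet -m "gitar snapshot"
      git -C "$1" gc --aggressive --quiet  # delta compression on top of zlib
      tar czf "$1.gitar" -C "$1" .git
      rm -rf "$1/.git"
      ;;
  esac
}
```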

Do whatever you want with the script. I’ve released it under the MIT license just because I don’t want to get sued if someone copy/pastes it onto a production server and everything explodes.

• beardyjay

Nice and thanks for the script too! :)

• Marshall Levin

Nice work! I made a few small changes to idiot-proof it so I wouldn’t accidentally clobber an existing git repo.

http://pastebin.com/CNecRQRq

• http://www.tomdalling.com/ Tom Dalling

Thanks. I put it in a github gist and applied the patch.

• Tintin

Works good!
But, when I deduplicate my file manually with sort -u, the result is much smaller.

For example:
The original file is 6851852227 kb
The gitar file is 2053674855 kb
The sort -u file is 16123844 kb

What makes this big difference? The indexes?

Can i use this script to have better results?

Thanks

• http://www.tomdalling.com/ Tom Dalling

I guess it could be the git indexes if you have millions of files. I’m not exactly sure what you’re doing with sort -u, but if I had to guess I’d say you might be deleting files that aren’t identical – they just have the same file name. See if lrzip gives you a better result.

• Tintin

Not millions of files, but millions of rows in a file.

sort -u is sorting the file and leaving only the unique rows.

• http://www.tomdalling.com/ Tom Dalling

Oh well that explains it. Git will only deduplicate identical files. It doesn’t do anything to the lines within a file.
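This is easy to verify: git names objects by the checksum of their contents, so two byte-identical files get the same object name, while a one-line difference produces a completely different one.

```shell
# Two identical files map to the same git object name; a similar but
# different file does not.
printf 'line one\nline two\n' > a.txt
cp a.txt b.txt
printf 'line one\nline two\nline three\n' > c.txt
git hash-object a.txt b.txt c.txt   # first two hashes match, third differs
```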

• Tintin

Thanks Tom!
Do you know what can help me?
Dedup within files..
Thanks

• http://www.tomdalling.com/ Tom Dalling

I would keep using sort -u, then run lrzip on it afterwards. That should give you very good compression.

• Tintin

The problem is that i want also to reconstruct the full file.

and when i am using sort -u i am losing this information..

i want to combine dedup within file and between files.

• http://www.tomdalling.com/ Tom Dalling

Try lrzip by itself and see what results you get. The file might be too large to get good results. Otherwise, you might have to write your own script that does run-length encoding or something (http://en.wikipedia.org/wiki/Run-length_encoding).

• Sam Gleske

So I couldn’t resist writing my own CLI utility for gitar :D. I present to you my own gitar.sh. The readme has some benchmarks and other fun information while I explored gitar. I wrote this script from scratch without any reference to your own script. Thanks for the neat hack Tom!

https://github.com/sag47/drexel-university/tree/master/bin#gitarsh—a-simple-deduplication-and-compression-script

• http://www.tomdalling.com/ Tom Dalling

Looks good! I’ve included it in the article.