
Have you ever wished that tar or zip would deduplicate files when creating an archive? Well, here's a hacky solution using git.

How It Works

Git already has deduplication built in, due to the way it stores files. Internally, each file is stored as an object named by the checksum of its own contents, so if two files have identical contents, only one copy is stored.
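
You can see this for yourself with git hash-object, which prints the checksum git uses to name a file's contents (the file names here are just examples):

    # Two files with identical contents get the same object name,
    # so git stores the contents only once.
    echo 'same contents' > a.txt
    cp a.txt b.txt
    git hash-object a.txt b.txt    # prints the same checksum twice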

So, to make use of this, add all the files to a new git repo, which performs the deduplication. Then, archive the repo's .git directory with zip or tar.
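
In command form, creating the archive might look something like this (myfolder and the commit message are placeholders, and it assumes git's user.name and user.email are configured):

    cd myfolder
    git init                  # fresh repo; identical files share one object
    git add -A                # stage everything; blobs are deduplicated here
    git commit -m snapshot    # record the tree so it can be checked out later
    tar cf ../myfolder.gitar .git    # archive only the .git directory
    cd ..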

When unarchiving, you just do the opposite: extract the .git directory inside the destination directory, run git reset --hard to bring back all the files (duplicates included), then delete the .git folder.
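
And the reverse, assuming an archive made as above:

    mkdir myfolder && cd myfolder
    tar xf ../myfolder.gitar    # restores the .git directory
    git reset --hard            # checks out every file, duplicates included
    rm -rf .git                 # delete the repo; only the files remain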

Git also compresses its objects with zlib, and running git gc --aggressive repacks them even more tightly. Bzip2 compression is better, but why not have both?!
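
To squeeze out that extra compression, run this inside the repo before the tar step in the sketch above:

    git gc --aggressive    # repack objects for tighter compression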

The Results

I took some recent work, which I know contains duplicate files, to test if this would actually work. Here are the results:

 39M	original
3.5M	original.gitar
 10M	original.tar.bz2
2.7M	original.tar.lrz *see update below

The original directory contained 39MB of files. Running tar cjf original.tar.bz2 original, which uses bzip2 compression, compressed the folder to about 25% of its original size. The git method compressed the folder to about 10% of its original size. So it does actually work.

Update: lrzip is better

After publishing this article, someone suggested trying lrzip, which I hadn’t heard of before. It doesn’t do file deduplication per se, but it does a good job of compressing files with large chunks of redundant data – such as a tarball of duplicate files. By default it uses LZMA compression, which seems to be better than bzip2.

Running tar cf original.tar original && lrzip original.tar produces a file named original.tar.lrz with a size of 2.7M, which is a bit better than the git method.
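
Getting the folder back is just the reverse (lrunzip ships with lrzip):

    lrunzip original.tar.lrz    # produces original.tar
    tar xf original.tar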

The Script

Update: Sam Gleske has written a more robust script here: http://github.com/sag47/drexel-university/tree/master/bin.

Here is a quick and nasty script called gitar.sh that makes these deduplicated archives. Use gitar.sh myfolder to create the myfolder.gitar archive. Then use gitar.sh myfolder.gitar to recreate the original folder.
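
The script itself was embedded as a gist, which isn't reproduced here, but a minimal sketch of the same idea (not the original code) might look like this:

    #!/usr/bin/env bash
    # gitar.sh -- a minimal sketch of the idea, not the original script.
    # Usage: gitar.sh myfolder        -> creates myfolder.gitar
    #        gitar.sh myfolder.gitar  -> recreates myfolder
    set -e
    case "$1" in
      *.gitar)
        # Unarchive: extract .git, check the files out, remove the repo
        dir="${1%.gitar}"
        mkdir "$dir"
        tar xf "$1" -C "$dir"
        (cd "$dir" && git reset --hard && rm -rf .git)
        ;;
      *)
        # Archive: commit everything, repack, then tar up .git
        # (assumes git user.name/user.email are configured)
        (cd "$1" && git init && git add -A &&
         git commit -m snapshot && git gc --aggressive)
        tar cf "$1.gitar" -C "$1" .git
        rm -rf "$1/.git"
        ;;
    esac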

Do whatever you want with the script. I’ve released it under the MIT license just because I don’t want to get sued if someone copy/pastes it onto a production server and everything explodes.

  • beardyjay

    Nice and thanks for the script too! :) 

  • Marshall Levin

    Nice work! I made a few small changes to idiot-proof it so I wouldn’t accidentally clobber an existing git repo.

    http://pastebin.com/CNecRQRq

  • http://www.tomdalling.com/ Tom Dalling

    Thanks. I put it in a github gist and applied the patch.

  • Tintin

Works well!
But when I deduplicate my file manually with sort -u, I get a much smaller file.

    For example:
    The original file is 6851852227 kb
    The gitar file is 2053674855 kb
    The sort -u file is 16123844 kb

What makes this big difference? The indexes?

Can I use this script to get better results?

    Thanks

  • http://www.tomdalling.com/ Tom Dalling

    I guess it could be the git indexes if you have millions of files. I’m not exactly sure what you’re doing with `sort -u`, but if I had to guess I’d say you might be deleting files that aren’t identical – they just have the same file name. See if `lrzip` gives you a better result.

  • Tintin

    Not millions of files, but millions of rows in a file.

    sort -u is sorting the file and leaving only the unique rows.

  • http://www.tomdalling.com/ Tom Dalling

    Oh well that explains it. Git will only deduplicate identical files. It doesn’t do anything to the lines within a file.

  • Tintin

Thanks Tom!
Do you know what could help me?
Dedup within files..
Thanks

  • http://www.tomdalling.com/ Tom Dalling

    I would keep using `sort -u`, then run `lrzip` on it afterwards. That should give you very good compression.

  • Tintin

The problem is that I also want to reconstruct the full file,

and when I use sort -u I lose this information.

I want to combine dedup within files and between files.

  • http://www.tomdalling.com/ Tom Dalling

Try `lrzip` by itself and see what results you get. The file might be too large to get good results. Otherwise, you might have to write your own script that does run-length encoding or something (http://en.wikipedia.org/wiki/Run-length_encoding).

  • Sam Gleske

    So I couldn’t resist writing my own CLI utility for gitar :D. I present to you my own gitar.sh. The readme has some benchmarks and other fun information while I explored gitar. I wrote this script from scratch without any reference to your own script. Thanks for the neat hack Tom!

    https://github.com/sag47/drexel-university/tree/master/bin#gitarsh—a-simple-deduplication-and-compression-script

  • http://www.tomdalling.com/ Tom Dalling

    Looks good! I’ve included it in the article.