Fast appending of files to a tar archive is impossible.

It turns out that tar is very slow at appending files to an existing tarball. I’m particularly talking about the following options:

-r – append files to the end of an archive
-u – only append files newer than copy in archive

Thinking about it logically, for -u to work, tar has to perform a linear search through the archive. The bigger the archive, the slower the search. Moreover, if you append in a loop, tar performs that search once per iteration. I would advise using it only in the most exceptional cases. Try to avoid:

# -u: slow, inefficient approach to tarring multiple files
for file in $(ls -A)
do
    tar -uf tarball.tar "$file"		# traverses the whole archive to append the file
done

You’d think that the -r option makes tar append files at the end of the archive, finding the position of the archive’s end from an index. It doesn’t. The tar format is designed in such a way that it has no index.
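
One way to convince yourself there is no index: even just listing the archive’s contents requires reading every member header, so a simple tar -t also gets slower as the archive grows. A quick check (on whatever tarball you have at hand):

# listing requires a full pass over the archive, since there is no table of contents
time tar -tf tarball.tar > /dev/null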

# -r approach is also slow and inefficient
for file in $(ls -A)
do
    tar -rf tarball.tar "$file"		# traverses the whole archive to append the file
done
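
If you want to see the slowdown for yourself, a rough benchmark along these lines works (file names and counts are placeholders, and the exact timings will vary):

# rough benchmark sketch: repeated appends vs. a single create
for i in $(seq 1 1000); do head -c 4096 /dev/urandom > "file_$i"; done
time for file in file_*; do tar -rf slow.tar "$file"; done	# re-reads the archive on every iteration
time tar -cf fast.tar file_*					# one linear pass over the input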

However, TAR does support several formats for its archives, though they are not well documented. I had a brief look at them, and it seems --format=gnu is the most recent and most featured one. It still has no index. I no longer understand why tar is even used. Despite that, below is a workaround that lets you pack any number of files in one go. I recommend never using the append options with the tar format. Instead, work out in advance what you are going to archive, prepare the list of files, and archive them all at once.

# faster approach to tarring multiple files: no appending
ls -A > list.txt
tar -cf backup.tar -T list.txt
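
If the file names may contain spaces or newlines, a null-delimited list is safer. This is just a sketch and assumes GNU tar, which accepts --null and can read the list from stdin with -T -:

# safer variant of the same idea: NUL-delimited names survive spaces and newlines
find . -maxdepth 1 ! -name '.' ! -name backup.tar -print0 | tar --null -cf backup.tar -T -
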
  • To understand what tar is doing and why it works this way, let’s consider two things:

    1. Making one big, single thing out of many small things
    2. Data compression

    1 = Apart from compression itself, tar was used a lot for sequential data backups (like tapes). The job was to take a directory (or many) and turn all the stuff in that place into a single file, so that it could be written to tape serially. This is still the number one job for tar, and it has nothing to do with compression.

    2 = As you probably know, the compression ratio of one big, single thing is much higher than the compression ratio of many small things (even though the total size in both scenarios might be the same). The point here is that compression depends heavily on patterns, and patterns are easier to find in something big than in something small.

    1 + 2 = How the two things work together: tar builds one big thing from lots of small ones, which in turn improves the compression ratio of gzip/bzip2 (normally applied on top of the tar archive via the -z and -j flags respectively); the sketch after this comment shows the effect.

    This was a kind of answer to the question “why do people still use tar”.
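
    A quick way to see the effect (the directory name is just a placeholder): compress a set of small text files one by one, then as a single tar stream, and compare the byte counts:

    # illustrative only: many small gzip streams vs. one gzip stream over a tar
    for f in docs/*.txt; do gzip -c "$f" > "$f.gz"; done	# each file compressed on its own
    cat docs/*.txt.gz | wc -c					# total bytes of the per-file results
    tar -czf - docs/ | wc -c					# the same tree as one compressed stream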

    • T1

      Searching for repeating patterns is effective on large inputs but fast only on small ones. Archiving with compression is a trade-off between two inversely related parameters: compression time and compression effectiveness.

      I want my archives to be created quickly, because I have a large amount of heterogeneous data. I have even decided not to compress it at all, so I’m just using plain tar.

      I’d say there is only a limited field where tar with compression really pays off: archiving plain text. For any other input, its efficiency is questionable.
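
      A rough way to check whether compression pays off for a given input (the path here is hypothetical): time both variants and compare the sizes; for already-compressed or very mixed data the size win is often not worth the extra time:

      # hypothetical path: compare time and size with and without compression
      time tar -cf  data.tar    ~/mixed-data && ls -l data.tar
      time tar -czf data.tar.gz ~/mixed-data && ls -l data.tar.gz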