Notes on keeping version histories of files ------------------------------------------- One of the lowest-level tasks of a version control system (such as RCS, SCCS, or CVS) is to maintain an archive of the changing contents of an individual file. For systems like RCS, this is most of the system's job, while for project-oriented version control systems the task is dwarfed by the larger bookkeeping problems of versioning directories full of files. This document will analyze three methods of keeping such an archive. METHOD #1: Plain texts of each revision This is the baseline method; it is not commonly used. Whenever the contents of a file change, a complete new plain text is stored. The time it takes to reconstruct any revision is proportional to the size of that revision's contents, assuming the archive is indexed efficiently. The space taken by the archive is proportional to the sum of the sizes of all revisions of the file. It may be that, as disk space grows cheaper, this method will become popular due to its simplicity. METHOD #2: The weave The weave method is used by SCCS and Bitkeeper. It is traditionally implemented with text files in mind, although nothing prevents the method from being applied to binary files. The archive contains all the lines of text present in any revision of the file, all "woven" together, with interspersed control instructions indicating which lines are included in which revisions of the file. For instance, in SCCS, the following weave represents the contents of a file containing the two lines "foo" and "bar" in the first revision and the two lines "bar" and "baz" in the second revision ("^A" denotes a control-A character): ^AI 1 ^AD 2 foo ^AE 2 bar ^AI 2 baz ^AE 2 ^AE 1 The time it takes to reconstruct any revision is proportional to the size of the archive. The archive size is difficult to characterize, but in general it grows with each new revision proportionally to the size of the changed lines in that revision relative to previous revisions. The weave method has the advantage of uniform retrieval time for all revisions of a file. Unfortunately, it does not appear to be the subject of much research, it is difficult to implement, and there are few, if any, open-source implementations of the weave to refer to. As such, it is not widely used. METHOD #3: Deltas Deltas are used by RCS, CVS, PRCS, arch, Subversion, and other version control systems. There are many variations on this method, but in general, the plaintext of the most recent revision is stored, and prior revisions are expressed in terms of their difference relative to the succeeding revision. For instance, in RCS, the following represents the contents of a file (as above) containing the two lines "foo" and "bar" in the first revision and the two lines "bar" and "baz" in the second revision: 1.2 text @bar baz @ 1.1 text @a0 1 foo d2 1 @ The most recent revision is expressed in plain text, while the previous revision is expressed as a delta (in this case, a text-format delta which uses append and delete commands to express changed lines). Similarly to the weave, the archive size grows with each new revision in proportion to the size of the changed lines relative to the previous revision. (In contrast to the weave, however, if a line disappears in one revision and reappears in the next, it will typically be expressed twice in the archive.) To reconstruct an old revision, we must (at least, naively) take the most recent revision and apply a series of deltas to it. The time it takes to apply a delta is proportional to the size of the source plus the size of the delta; applying N deltas, then, takes time proportional to the size of the deltas plus N times the average size of the file across all those revisions. The second term of the above expression may grow unreasonably large for an old revision of a big file. For instance, if a change log file grows from zero to one million bytes over the course of one thousand revisions, then roughly five hundred million bytes will have to be processed in order to reconstruct the oldest version of the file. In spite of this performance weakness, the delta approach enjoys more popularity than the weave approach, partly because there has been extensive published research into delta algorithms and there are many open source delta implementations to use or refer to. Also, reconstructing the most recent revision is faster in the delta approach, and that is usually a much more common case than examining an old revision. Fortunately, there are two ways to improve on the performance of the delta approach without abandoning it entirely: delta combination and skip-deltas. Delta combination seeks to reduce the time it takes to apply a series of deltas, by combining all of the deltas together into one big delta before looking at the source data. If a million-byte file has undergone a hundred small changes, then the million bytes will only need to be processed once, instead of being copied over a hundred times with minor changes each time. Delta combination alone brings the performance of the delta method in line with the weave method, while retaining quick reconstruction of the most recent revisions. Skip-deltas seek to drastically limit the number of delta applications required to reconstruct an old revision of the file. Instead of expressing each old revisions as a delta against the succeeding revision, some are expressed against more advanced revisions. A file with ten revisions (numbered 0-9) would be represented with: Revision 0 expressed as a delta against revision 8 Revision 1 expressed as a delta against revision 2 Revision 2 expressed as a delta against revision 4 Revision 3 expressed as a delta against revision 4 Revision 4 expressed as a delta against revision 8 Revision 5 expressed as a delta against revision 6 Revision 6 expressed as a delta against revision 8 Revision 7 expressed as a delta against revision 8 Revision 8 expressed as a delta against revision 9 Revision 9 expressed in plain text Or, graphically: 0 --------------------------------------------> 8 --> 9 4 --------------------> 8 2 --------> 4 6 --------> 8 1 --> 2 3 --> 4 5 --> 6 7 --> 8 With skip-deltas, the number of deltas required to reconstruct any revision of the file is proportional to the base-2 logarithm of the number of revisions in the file. If a file has a thousand revisions, it would take only eight delta applications to reconstruct revision 0 using the plain text at revision 999. Skip-deltas come with a space penalty; the average delta runs across lg(N) revisions instead of one. In practice, the space cost is generally much less than a factor of lg(N). Skip-deltas also come with a penalty in creation time. Whenever a new revision is added, an average of two previous revisions must be re-expressed against the new plain text, instead of just one. In the worst case, the base-2 logarithm of the current number of revisions must be re-expressed. The space and time penalty can be reduced by not applying the skip-delta technique to the first few revisions, by skipping the top row (which only helps revision 0), or by skipping some number of rows close to the bottom. Any of these variations can produce notable savings for practical data sets, while only increasing the number of deltas needed for reconstruction by a constant number. REFERENCES RCS http://www.gnu.org/software/rcs/rcs.html SCCS http://www.cvshome.org/cyclic/cyclic-pages/sccs.html PRCS http://prcs.sourceforge.net Bitkeeper http://www.bitmover.com arch http://www.regexps.com/arch.html Subversion http://subversion.tigris.org The technique of delta combination is currently implemented in Subversion, and is either planned for or implemented in PRCS. The author does not know of papers or other references describing the technique. The technique of skip-deltas is implemented in Subversion. AUTHOR This document was written by Greg Hudson on 2002-06-24. It was last updated on 2002-10-03.