Introduction to Version Control
===============================


Petr Baudis <pasky@suse.cz>
SuSE Labs CZ

Published in the Proceedings of OpenWeekend 2005 (ISBN 80-01-03349-X).


Introduction
~~~~~~~~~~~~

For every non-trivial project, it is immensely useful to be able to trace its
development over time and see the history of changes as they happened.
Moreover, if multiple developers collaborate on the project, you need some
means to enable this collaboration in a reasonably easy way, and it is also
useful to be able to see which developer made a particular change.

This is essentially all that version control tools are designed to care
about, yet doing it right turns out to be a challenging task. Note also
that we are not speaking only about software projects, since version control
is also useful for books, hardware design sheets, or whatever else. We will
however focus on software projects, since they usually place the greatest
demands on the tools. This holds for open source projects in particular: they
usually feature a large number of developers scattered around the world, there
is no stable set of developers (they usually get direct write access only
after showing some work), and contributors frequently work outside of the
system (just sending patches around).

"VCS" versus "SCM"
~~~~~~~~~~~~~~~~~~

Version control systems (VCS) are also called "source control management
tools" (SCM) by some people (e.g. among the Linux kernel developers). Neither
of these terms is particularly well defined, but the commonly perceived meaning
is that SCM tools are more focused on software projects and, aside from version
control, frequently also provide solutions for build control, automated
testing, etc. Examples include Aegis and Vesta. Due to their limited
applicability, we will not focus on them further.


Basic Concepts
~~~~~~~~~~~~~~

In the area of version control, one should understand a few basic terms. The
first one is "revision", which represents the state of a tree (or of a single
file in some older systems, CVS in particular) at a given moment, usually
explicitly marked by the developer. It is sometimes also called a "commit",
after the act of marking it. The term "changeset" is also used, but it always
refers to a revision of the whole tree, never of a single file. A revision is
usually identified by a sequence of small decimal numbers, a larger decimal
number, a huge hexadecimal number, or a string - the choice usually reflects
some of the basic design decisions of a VCS.

"Branch" is a given line of development (revisions). There may be multiple
lines, e.g. representing various release trains or development efforts not yet
ready for the main development line. Frequently, one might want to join
several branches back together, which is called a merge.

Together, revisions form a "revision graph" - in general this is a directed
acyclic graph, but in older systems unable to represent a merge (e.g. CVS,
SVN) it is actually a tree.
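
To make the distinction concrete, here is a minimal sketch of a revision graph
stored as a mapping from each revision to its parents. The revision names and
the helper functions are made up for illustration; no particular VCS stores its
graph exactly like this.

```python
from collections import deque

# Each revision maps to the list of its parent revisions.
# A merge is simply a revision with more than one parent.
parents = {
    "r1": [],
    "r2": ["r1"],
    "r3": ["r2"],        # continues the main line
    "b1": ["r2"],        # a branch forked from r2
    "m1": ["r3", "b1"],  # merges the branch back into the main line
}


def ancestors(graph, rev):
    """Return the set of all revisions reachable from rev via parent links."""
    seen = set()
    queue = deque(graph[rev])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph[current])
    return seen


def is_tree(graph):
    """True if no revision has more than one parent (no merge is recorded)."""
    return all(len(p) <= 1 for p in graph.values())


print(sorted(ancestors(parents, "m1")))
print(is_tree(parents))
```

A system that cannot record the second parent of "m1" ends up with a tree, and
the information that "b1" has already been merged is lost.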

"Repository" is some storage place (directory, database file, ...) holding all
the data. It can usually hold multiple branches of a project and frequently
even multiple independent projects.


Problems in version control
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Why do we have so many VCSes? Each of them attempts to tackle the version
control challenges in a different way, and many approaches are indeed
possible. We will present the three most notable problems.

The first problem is obviously how to actually store the history data, and
most design choices are already reflected in that. Different (frequently
contradictory) storage model choices are required to make different operations
fast (and thus feasible).

The second problem is merging. The research in this area investigates how to
merge two versions of a file that have some common and some distinct history.
The most common method used currently is the so-called "three-way merge",
which compares both versions against their common ancestor, but it is
perceived as not accurate enough and actually gives wrong results in some
cases. The "precise-CDV merge" is the most likely candidate to replace it as
the most widely used merge method in the future, but it is still in
development.
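
The following is a simplified sketch of a line-based three-way merge, not the
implementation used by any particular tool. It assumes the common ancestor
("base") is already known, and it relies on Python's difflib to decide which
lines are unchanged, so its results depend on difflib's matching heuristics.
The function names are hypothetical.

```python
import difflib


def match_map(base, other):
    """Map each base line index to its index in `other`, for unchanged lines."""
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    mapping = {}
    for a_start, b_start, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[a_start + k] = b_start + k
    return mapping


def three_way_merge(base, ours, theirs):
    """Merge two descendants of `base`; return (merged_lines, conflict_count)."""
    to_ours = match_map(base, ours)
    to_theirs = match_map(base, theirs)
    merged, conflicts = [], 0
    b = o = t = 0  # start of the not-yet-merged chunk in base, ours, theirs
    for i in range(len(base) + 1):
        if i == len(base):
            oi, ti = len(ours), len(theirs)    # sentinel: flush the last chunk
        elif i in to_ours and i in to_theirs:
            oi, ti = to_ours[i], to_theirs[i]  # base line kept by both sides
        else:
            continue
        base_chunk, our_chunk, their_chunk = base[b:i], ours[o:oi], theirs[t:ti]
        if our_chunk == their_chunk:
            merged += our_chunk                # same change (or none) on both sides
        elif our_chunk == base_chunk:
            merged += their_chunk              # only "theirs" changed this chunk
        elif their_chunk == base_chunk:
            merged += our_chunk                # only "ours" changed this chunk
        else:
            conflicts += 1                     # both changed it differently
            merged += ["<<<<<<< ours", *our_chunk, "=======",
                       *their_chunk, ">>>>>>> theirs"]
        if i < len(base):
            merged.append(base[i])
        b, o, t = i + 1, oi + 1, ti + 1
    return merged, conflicts


base = ["a", "b", "c", "d"]
print(three_way_merge(base, ["a", "B", "c", "d"], ["a", "b", "c", "D"]))
```

Note that the decision is made purely from the three texts. The history
between the base and each version is ignored, which is one reason why the
method can give wrong results.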

The third problem is enabling distributed development. This is a popular
notion amongst many VCS designers (as well as users). It means that when you
get the source from the version control system, you usually also get the
history and, in particular, the ability to do development on your local copy
of the source independently of the remote repository. This enables working
offline, and working even if you do not have any write access to the
repository. The disadvantage is that the concepts may initially be more
difficult to grasp than in centralized version control, and the systems
themselves are also frequently more complicated.


Existing systems
~~~~~~~~~~~~~~~~

We shall now briefly cover some of the existing systems. Note that this is
merely a selection, since there are many existing systems, so only the most
interesting ones (in the author's humble opinion) are described here.

Centralized VCSes
-----------------

These are version control systems which require you to access a single
centralized repository in order to do your work. All of those listed here also
store revision graphs as trees, making seamless merging of branches quite
difficult to do and usually requiring some user interaction.

SCCS

Source Code Control System was the first system for per-file version control,
dating as far back as 1972. It was quickly superseded by RCS when that appeared
some years later, since SCCS was reportedly quite difficult to use and user
unfriendly. It is notable for its "weave" storage format, which is otherwise
widely used only by BitKeeper and was considered obscure by other systems.
However, it has recently turned out that the weave is likely one of the key
ingredients of BitKeeper's smart merging method, and other systems are now
considering adopting it.

RCS

Revision Control System was quite popular in the 80s, offering simple per-file
revision control. In contrast to SCCS, it uses a simpler storage method: the
latest revision is stored verbatim and each older revision is stored as a
difference against the next newer one, forming a chain. RCS is still used today
when you need to track only a single file and the history isn't very
extensive.
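
The storage idea can be sketched as follows. This is only an illustration of
the "latest revision verbatim plus a chain of backward differences" scheme; it
is not the actual RCS file format, and the class and function names are
hypothetical.

```python
import difflib


def make_delta(new, old):
    """Describe how to turn `new` into `old` as a list of (start, end, lines)."""
    matcher = difflib.SequenceMatcher(a=new, b=old, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace new[i1:i2] with the corresponding lines of `old`.
            delta.append((i1, i2, old[j1:j2]))
    return delta


def apply_delta(lines, delta):
    """Apply a delta produced by make_delta and return the resulting lines."""
    result = list(lines)
    # Work from the end so that earlier indices stay valid.
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


class ReverseDeltaFile:
    """History of one file: newest revision verbatim, older ones as deltas."""

    def __init__(self):
        self.head = None   # latest revision, stored in full
        self.deltas = []   # deltas[k] turns revision k+2 into revision k+1

    def commit(self, lines):
        if self.head is not None:
            self.deltas.append(make_delta(lines, self.head))
        self.head = list(lines)

    def checkout(self, rev):
        """Reconstruct revision `rev` (numbered from 1) by walking back from head."""
        if self.head is None or not 1 <= rev <= len(self.deltas) + 1:
            raise ValueError("no such revision")
        lines = self.head
        for delta in reversed(self.deltas[rev - 1:]):
            lines = apply_delta(lines, delta)
        return list(lines)


history = ReverseDeltaFile()
history.commit(["hello"])
history.commit(["hello", "world"])
history.commit(["hello", "brave", "new", "world"])
print(history.checkout(1))
```

Checking out the latest revision is cheap, while the cost of retrieving an old
revision grows with its distance from the head of the chain.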

CVS

Concurrent Versions System is basically just an extension of RCS, making it
work on multiple files and over the network. This means that the history of
individual files is stored separately and reconstructing changesets (commits of
several files at once) is possible only heuristically; it also has a number of
other limitations. CVS is still very popular and widespread, since it is quite
easy to learn and use and has a historically large user base.
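
One plausible heuristic for reconstructing changesets is to group per-file
commits which share the same author and log message and which lie close
together in time. The sketch below assumes this approach; the FileCommit
record, the function name, and the 300-second window are illustrative
assumptions rather than anything defined by CVS itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileCommit:
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


def guess_changesets(commits, window=300):
    """Group per-file commits into probable changesets."""
    changesets = []
    latest = {}  # (author, message) -> most recent changeset with that key
    for commit in sorted(commits, key=lambda c: c.timestamp):
        key = (commit.author, commit.message)
        current = latest.get(key)
        if (current is not None
                and commit.timestamp - current[-1].timestamp <= window
                and all(commit.path != other.path for other in current)):
            # Same author and message, close in time, file not yet included.
            current.append(commit)
        else:
            current = [commit]
            changesets.append(current)
            latest[key] = current
    return changesets


log = [
    FileCommit("a.c", "alice", "Fix overflow", 1000),
    FileCommit("a.c", "bob", "Update docs", 1001),
    FileCommit("b.c", "alice", "Fix overflow", 1002),
    FileCommit("a.c", "alice", "Fix overflow", 5000),
]
for changeset in guess_changesets(log):
    print([c.path for c in changeset])
```

The guess can be wrong in both directions: two unrelated commits with the same
message may be joined, and one slow commit over the network may be split.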

Subversion (SVN)

Subversion is conceptually similar to CVS, although it offers some important
advantages and fixes many of its problems. It still isn't good at merging,
however, and does not support the distributed development model (note that
the SVK project, currently in development, aims to enable distributed
development of SVN-based projects).


Decentralized VCSes
-------------------

These version control systems support (or sometimes even enforce) the
distributed development model.

BitKeeper

BitKeeper is probably the oldest regular version control system supporting
distributed development. In contrast with the rest of the systems described
here, BitKeeper is non-free, non-open-source software. However, it can be
regarded as one of the key impulses for the big development in the VCS area,
since, at least early on, one of the key motivations for the new systems was
to compete with BitKeeper and be able to replace it in the Linux kernel
development model, which used BitKeeper until April 2005.

BitKeeper enforces a one-repository-per-branch rule, so forking a branch
equals making a new repository. It uses SCCS files as its backend format
(but tracks the commits as changesets, unlike CVS with its RCS usage).

GNU Arch

GNU Arch was probably the first free VCS focused on the distributed
development model, and is popular in some circles, but it suffers from a
complicated, user-unfriendly interface, and it is generally perceived as
unsuitable for tracking large projects. It uses tarballs with patch files as
its backend storage format. It is officially unmaintained now, but the Bazaar
and ArX projects, both based on GNU Arch, are still in development and mainly
aim to fix its user interface.

Darcs

Darcs is a very unusual VCS. It is one of the two big open source projects
written in Haskell ;-) and it uses a radically different approach to revision
control, reducing the whole history to patches. Any revision is then simply a
particular combination of patches, and merging is done by a clever "patch
commutation" logic.

Monotone

Monotone is also one of the more popular distributed VCSes. It reduces revision
control to the manipulation of objects (files, trees, and revisions) having
unique identifiers. It also makes the history immutable and generally places
strong emphasis on cryptographic security and accountability. It uses an
SQLite database as its backend, and is currently not very suitable for large
projects, mainly due to speed problems.

GIT

GIT is a very new VCS developed by the community around Linus Torvalds and
currently used for Linux kernel development. Design-wise, it borrows heavily
from Monotone, but GIT's design is much simpler and GIT is also very fast (that
being one of its key design points). Unlike many other version control
systems, it does not store the edges of the revision graph (changes), but the
vertices (snapshots) - however, it does so in a way that prevents most
redundant information from being stored multiple times, so it still scales
moderately well.
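
A rough sketch of how snapshots can be stored without much redundancy is to
name every object by a hash of its content, so that identical files in
different snapshots are stored only once. The code below illustrates just this
idea; it does not reproduce GIT's actual object format, and the ObjectStore
class is hypothetical.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: objects are keyed by their SHA-1 hash."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        object_id = hashlib.sha1(data).hexdigest()
        self.objects[object_id] = data  # storing the same content twice is a no-op
        return object_id

    def snapshot(self, files: dict) -> str:
        """Store a whole tree (path -> content) and return the id of its listing."""
        entries = [f"{self.put(content)} {path}"
                   for path, content in sorted(files.items())]
        return self.put("\n".join(entries).encode())


store = ObjectStore()
first = store.snapshot({
    "README": b"hello\n",
    "Makefile": b"all: main\n",
    "main.c": b"int main(void) { return 0; }\n",
})
second = store.snapshot({
    "README": b"hello\n",
    "Makefile": b"all: main\n",
    "main.c": b"int main(void) { return 1; }\n",
})
print(first != second, len(store.objects))
```

The second snapshot adds only two objects (the changed file and the new
listing), even though it describes the complete tree.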

Cogito

Cogito is basically a GIT frontend focused on making GIT easy to use and, in
general, on providing an excellent command-line user interface for version
control which is as simple and easy to use as possible while still not getting
in the way of power users.

Codeville (CDV)

Codeville is a VCS which primarily focuses on getting merging right, currently
a somewhat painful problem. Although it is still in development, its
"precise-CDV" merging method is already gaining popularity even amongst other
VCS developers. Its other focus is a simple user interface.


Conclusion
~~~~~~~~~~

There is a lot of competition in the VCS "market", and still no clear
winner. So far, most of the competition has been in the design area, introducing
and trying out new ideas, while the systems themselves were not very usable in
practice. It is not very likely that any radical new ideas will be unleashed
soon, and further development will hopefully go into making the systems more
practical to use - especially fixing the speed problems and the frequently bad
user interfaces.


About the author
~~~~~~~~~~~~~~~~

Petr Baudis is currently studying at the Faculty of Mathematics and Physics,
Charles University, Prague, where he also works as a network administrator. He
has been using and developing free software since childhood, his most notable
work so far probably being the ELinks text web browser. Currently, he
maintains the Cogito version control system with the generous sponsorship of
SuSE Labs Czech Republic.