Introduction to Version Control
===============================
Petr Baudis
SuSE Labs CZ

Published in the Proceedings of OpenWeekend 2005 (ISBN 80-01-03349-X).


Introduction
~~~~~~~~~~~~

For every non-trivial project, it is immensely useful to be able to trace its development over time and see the history of changes as they happened. Moreover, if multiple developers collaborate on the project, you need some means to enable this collaboration in a reasonably easy way, and it is also useful to be able to see which developer made a certain change. This is basically all that version control tools are designed to care about, yet it turns out to be a challenging task to do it right.

Note also that we are not speaking only about software projects, since version control is also useful for books, hardware design sheets, or whatever else. We will however focus on software projects, since they are the ones that usually pose the most challenges to the tools - open source projects in particular, since they usually feature a large number of developers scattered around the world, there is no stable set of developers (they usually get direct write access only after showing some work), and they frequently work outside of the system (just sending patches around).


"VCS" versus "SCM"
~~~~~~~~~~~~~~~~~~

Version control systems (VCS) are also called "Source control management tools" (SCM) by some people (e.g. among the Linux kernel developers). Neither of these terms is particularly well defined, but the commonly perceived meaning is that SCM tools are more focused on software projects, and aside from version control they also frequently provide solutions for build control, automated testing, etc. Examples include Aegis, Vesta, etc. Due to their limited applicability, we will not focus on them further.


Basic Concepts
~~~~~~~~~~~~~~

In the area of version control, one should understand a few basic terms.

The first one is "revision", which represents the state of a tree (or of a single file in some older systems, CVS in particular) at a given moment, usually explicitly marked by the developer. It is sometimes also called a "commit", after the act of marking it. The term "changeset" is also used, but it always refers to a revision of the whole tree, never of a single file. A revision is usually identified in some way - by a sequence of small decimal numbers, a larger decimal number, a huge hexadecimal number, or a string - and the choice usually reflects some of the basic design decisions of a VCS.

A "branch" is a given line of development (a sequence of revisions). There may be multiple lines, e.g. representing various release trains or development efforts not yet ready for the main development line. Frequently, one might want to join several branches back together, which is called a merge. Revisions then form a "revision graph" - in general this is a directed acyclic graph, but in older systems unable to represent a merge (e.g. CVS, SVN) it is actually a tree.

A "repository" is some storage place (directory, database file, ...) holding all the data. It can usually hold multiple branches of a project and frequently even multiple independent projects.


Problems in version control
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Why do we have so many VCSes? Each of them attempts to tackle the version control challenges in a different way, and many approaches are indeed possible. We will present the three most notable problems.

The first problem is obviously how to actually store the history data, and most design choices are already reflected in that. Different (frequently contradictory) storage model choices are required to make different operations fast (and thus feasible).

The second problem is merging. The research in this area investigates how, given two versions of a file with some common and some distinct history, to merge those two versions together.

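To make "some common and some distinct history" concrete, here is a minimal Python sketch. It assumes the revision graph is kept as a dictionary mapping each revision to the list of its parents; the names `ancestors` and `merge_bases` and the revision identifiers `r1` to `r6` are purely illustrative and not taken from any particular VCS. The sketch finds the most recent revisions that two lines of development share, which is the natural reference point when merging them.

```python
def ancestors(parents, rev):
    """Return rev together with every revision reachable through parent links."""
    seen = set()
    stack = [rev]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents, a, b):
    """Return the most recent revisions shared by the histories of a and b."""
    common = ancestors(parents, a) & ancestors(parents, b)
    # Collect every common revision that is a proper ancestor of another
    # common revision; those are older than necessary and get dropped.
    older = set()
    for rev in common:
        older |= ancestors(parents, rev) - {rev}
    return common - older


# Hypothetical history (an assumption for illustration): r2 follows r1,
# two branches fork from r2 (r3 on one, r4 and r5 on the other),
# and r6 merges them, so it has two parents.
history = {
    "r1": [],
    "r2": ["r1"],
    "r3": ["r2"],
    "r4": ["r2"],
    "r5": ["r4"],
    "r6": ["r3", "r5"],
}

print(merge_bases(history, "r3", "r5"))  # prints {'r2'}
```

In a tree-shaped history the result is always a single revision; once merges are recorded, the graph is a general directed acyclic graph and more than one such revision may exist, which is why the sketch returns a set.
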
The most common method used currently is the so-called "three-way merge", but this is perceived as not accurate enough and as actually giving wrong results in some cases. The "precise-CDV merge" is the most likely candidate to become the most widely used merge method in the future, but it is still in development.

The third problem is enabling distributed development. This is a popular notion amongst many VCS designers (as well as users). It means that when you get the source from the version control system, you usually also get the history, and above all the possibility to do local development on your local copy of the source, independently of the remote repository - this enables working offline and working even if you do not have any write access to the repository. The disadvantage is that the concepts may initially be more difficult to grasp than in centralized version control, and the systems themselves are also frequently more complicated.


Existing systems
~~~~~~~~~~~~~~~~

We shall now briefly cover some of the existing systems. Note that this is merely a selection, since there are a lot of existing systems, so only the most interesting ones (in the author's humble opinion) are described here.


Centralized VCSes
-----------------

These are version control systems which require you to access a single centralized repository in order to do your work. All of those listed here also store revision graphs as trees, making seamless merging of branches quite difficult and usually requiring some user interaction.

SCCS
    Source Code Control System was the first system for per-file version control, dating as far back as 1972. It was quickly superseded by RCS when the latter appeared some years later, since SCCS was reportedly quite difficult to use and user-unfriendly. It is notable for its storage format, the "weave", which is otherwise widely used only by BitKeeper and was considered obscure by other systems.
    However, it has recently turned out that the weave was likely one of the key ingredients of BitKeeper's smart merging method, and its adoption by other systems is now being considered.

RCS
    Revision Control System was quite popular in the 80s, offering simple per-file revision control. In contrast to SCCS, it used a simpler storage method - the latest revision is stored verbatim and the older revisions are stored as a chain of differences, each against the previous revision in the chain. RCS is still used today if you need to track only a single file and the history isn't very extensive.

CVS
    Concurrent Versions System is basically just an extension of RCS, making it work on multiple files and over the network. This means that the history of individual files is stored separately and reconstructing changesets (commits of several files at once) is possible only heuristically; it also has a number of other limitations. CVS is still very popular and widespread, since it is quite easy to learn and use, and it has a large, historically established user base.

Subversion (SVN)
    Subversion is conceptually similar to CVS, although it offers some important advantages and fixes many of its problems. It still isn't good at merging, however, and does not support the distributed development model (note that the SVK project, currently in development, is aimed at distributed development of SVN-based projects).


Decentralized VCSes
-------------------

These version control systems support (or sometimes even enforce) the distributed development model.

BitKeeper
    BitKeeper is probably the oldest regular version control system supporting distributed development. In contrast with the rest of the systems described here, BitKeeper is non-free, non-opensource software.
    However, it can be regarded as one of the key impulses for the big development in the VCS area, since, at least early on, one of the key motivations for the new systems was to compete with BitKeeper and to be able to replace it in the Linux kernel development model, which used BitKeeper until April 2005. BitKeeper enforces a one-repository-per-branch rule, so forking a branch equals making a new repository. It uses SCCS files as its backend format (but tracks the commits as changesets, unlike CVS with its RCS usage).

GNU Arch
    GNU Arch was probably the first free VCS focused on the distributed development model, and it is popular in some circles, but it suffers from a complicated, user-unfriendly interface and is generally perceived as unsuitable for tracking large projects. It uses tarballs with patch files as its backend storage format. It is officially unmaintained now, but the Bazaar and ArX projects, both based on GNU Arch, are still in development and mainly aim to fix its user interface.

Darcs
    Darcs is a very unusual VCS. It is one of the two big opensource projects written in Haskell ;-) and it uses a radically different approach to revision control, reducing the whole history to patches. Any revision is then simply a particular combination of patches, and merging is done by a clever "patch commutation" logic.

Monotone
    Monotone is also one of the more popular distributed VCSes. It reduces revision control to the manipulation of objects (files, trees, and revisions) having unique identifiers. It also makes the history immutable and generally places strong emphasis on cryptographically strong security and accountability. It uses an sqlite database as its backend, and is currently not very suitable for large projects, mainly due to speed problems.

GIT
    GIT is a very new VCS developed by the community around Linus Torvalds and currently used for Linux kernel development.
    Design-wise, it borrows heavily from Monotone, but GIT's design is much simpler and GIT is also very fast (that being one of its key design points). Unlike many other version control systems, it does not store the edges of the revision graph (changes), but the vertices (snapshots) - however, it does so in a way that prevents most of the redundant information from being stored multiple times, so it still scales moderately well.

Cogito
    Cogito is basically a GIT frontend focused on making GIT easy to use, and in general on providing an excellent command-line user interface for version control which is as simple and easy to use as possible, yet does not get in the way of power users.

Codeville (CDV)
    Codeville is a VCS which primarily focuses on getting merging right, currently a somewhat painful problem. Although it is still in development, its "Precise-CDV" merging method is already gaining popularity even amongst other VCS developers. Its other focus is a simple user interface.


Conclusion
~~~~~~~~~~

There is a lot of competition in the VCS "market", and still no clear winner. So far most of the competition has been in the design area, introducing and trying out new ideas, but the systems were not very usable in practice. It is not very likely that any radical new ideas will be unleashed soon, and further development will hopefully go into making the systems more practical to use - especially fixing the speed problems and the frequently bad user interfaces.


About the author
~~~~~~~~~~~~~~~~

Petr Baudis is currently studying at the Faculty of Math and Physics, Charles University, Prague, where he also works as a network administrator. He has been using and developing free software since childhood, his most notable work so far probably being the ELinks text web browser. Currently, he maintains the Cogito version control system with the generous sponsorship of SuSE Labs Czech Republic.