07 An introduction to version control 2: concepts (versions, checking out, committing, repositories)

This post is the sixth 'lesson' in the Participating in Free Software
LinuxChix course. You can find previous lessons at
http://www.linuxchix.org/content/courses/tools/ . Questions and
discussion are welcome; please make sure the string "[Tools]" is in the
subject of your mail.

I'll be on IRC to discuss this lesson at two times:

- Sun Oct 16 2005, 00:00 (midnight) UTC/GMT [1]
- Wed Oct 19 2005, 10:00 UTC/GMT [2]

The channel is #tools-course on server irc.linuxchix.org

In this lesson we'll cover the basics of what you do with version
control in more detail. This is still an abstract discussion; we'll move
on to running actual commands in two lessons' time.

--- Versions ---

This section covers the idea of "versions" in detail. You don't need to
know all of this in great detail to use a version control system. The
key things you need to know are:
- a version control system stores and can access every version of a
file that was ever committed
- these versions are accessed (almost always) by their version *number*

As you know by now (from both the name of the systems themselves and the
previous lesson), one of the key features of version control systems is
that they store versions or revisions of files (at least, of text
files). You can then at any later time do the following things:

- access a particular older version of a file
- access information about that version, such as when it was committed,
and who committed it (in most systems)
- compare any two versions of a file
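
To make those three operations concrete, here is a minimal Python
sketch of a toy version store for a single file. The names (Version,
FileHistory) and the committers are made up for illustration. Real
version control systems store versions far more compactly and are not
built this way; this only shows the idea.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Version:
    number: int       # sequential version number, starting at 1
    author: str       # who committed it
    date: datetime    # when it was committed
    text: str         # full contents of the file at this version


class FileHistory:
    """Toy history of one text file: every committed version is kept."""

    def __init__(self):
        self.versions = []

    def commit(self, author, text, date=None):
        date = date or datetime.now(timezone.utc)
        version = Version(len(self.versions) + 1, author, date, text)
        self.versions.append(version)
        return version.number

    def get(self, number):
        """Access a particular older version by its number."""
        return self.versions[number - 1]

    def compare(self, old, new):
        """Return a unified diff between any two versions."""
        a = self.get(old).text.splitlines()
        b = self.get(new).text.splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))


history = FileHistory()
history.commit("mary", "Chapter 2\nIntroduction\n")
history.commit("alex", "Chapter 2\nIntroduction\nManipulating foozbits\n")

print(history.get(1).author)             # who committed version 1
print("\n".join(history.compare(1, 2)))  # what changed between 1 and 2
```

The diff printed at the end marks the added line with a leading "+",
which is the usual way these comparisons are displayed.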

In most systems, you can also do a line by line analysis of any file,
and get the version control system's report on who was the last person
to edit that line, and what date they committed their edits. If you're
puzzling over the purpose of some part of a file, it can be useful to
ask the version control system who is responsible for it so that you can
ask them.
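
If you're curious how a line by line report like that could be worked
out, here is a rough Python sketch. It assumes we have every version of
the file, oldest first, and it uses Python's difflib to decide which
lines survived each commit unchanged. The function name annotate and
the sample authors and dates are invented for this example; real tools
use their own, more careful, algorithms.

```python
import difflib


def annotate(versions):
    """versions: list of (author, date, text) tuples, oldest first.

    Returns one (author, date, line) tuple per line of the newest
    version, naming the commit that last changed that line.
    """
    annotated = []  # annotation of the previous version
    for author, date, text in versions:
        new_lines = text.splitlines()
        old_lines = [line for _, _, line in annotated]
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
        result = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their earlier author and date.
                result.extend(annotated[i1:i2])
            elif tag in ("replace", "insert"):
                # New or edited lines belong to this commit.
                result.extend((author, date, line) for line in new_lines[j1:j2])
            # "delete": the lines are gone, so there is nothing to carry over.
        annotated = result
    return annotated


versions = [
    ("mary", "2005-10-01", "Chapter 1\nFoozbits are small.\n"),
    ("alex", "2005-10-08", "Chapter 1\nFoozbits are small.\nSee chapter 2.\n"),
    ("sam", "2005-10-12", "Chapter 1\nFoozbits are tiny.\nSee chapter 2.\n"),
]

report = annotate(versions)
for author, date, line in report:
    print(f"{author:5} {date}  {line}")
```

In this made-up history the first line is still credited to mary, the
second to sam (who edited it last) and the third to alex.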

I'm not going to cover version analysis commands in this course, but in
the next lesson I'll refer you to more complete documentation of the
tools in question so that you can find out these commands later if need
be.

The usual model for versions is that they have numbers, going up
sequentially. The way in which branches are numbered varies, but the
details are fairly uninteresting for a user. It doesn't particularly
matter *how* things are numbered, just that they are. You can think of
the number simply as a handy code for a particular version.

Revision numbers can get slightly confusing at times, because they can
look rather like release numbers. Software versions released to the
public are typically numbered, for example, Firefox 1.4 or Linux 2.6.12.
These release numbers are assigned by the developers responsible for
putting together a bunch of files and calling it a "release". They are
unrelated to the numbers that the version control system automatically
assigns to the files stored in it.

There are two different ways of numbering:

1. file-based numbering. CVS does this. Each file that is in the
repository is numbered separately. Imagine that you have the most
up-to-date copy of every file. a.txt has been edited 57 times, so it
is numbered 1.57 (CVS uses 1.x numbers for the main branch, which
looks even more like a release number). b.txt has been edited twice,
so it is numbered 1.2.

2. tree-based or whole-repository numbering. Subversion and Bazaar do
this. In this system, each *commit* has its own number. If you make
the 150th commit to the system, then every file in the repository is
marked as "commit #150". If I change a.txt and make the 151st
commit, every file in the repository is marked as "commit #151",
even though only a.txt changed. Version #150 and #151 of b.txt will
actually be exactly the same, because my commit didn't change b.txt.

The advantage of tree-based numbering is that often changes to two files
only work when taken together. Imagine I am writing a book. I have a
file for chapter 1 and a file for chapter 2. I add a section to chapter
2 called "manipulating foozbits". In chapter 1, I use the words "for more
on manipulating foozbits, see chapter 2". The changes to chapter 1 are
closely related to the changes to chapter 2 and it makes sense to be
able to access them as a *single* version, even though they're in
different files. Code has similar kinds of inter-file dependencies.
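
The two numbering schemes can be compared side by side in a small
Python sketch. The class names FileNumbered and TreeNumbered are my own
invention, and the counters are a simplification: this only models how
the numbers advance, not how CVS or Subversion actually store anything.

```python
class FileNumbered:
    """File-based numbering: each file carries its own counter."""

    def __init__(self):
        self.revisions = {}  # filename -> number of commits that touched it

    def commit(self, changed_files):
        for name in changed_files:
            self.revisions[name] = self.revisions.get(name, 0) + 1

    def revision(self, name):
        # 1.x style, as on the main branch in the a.txt example above
        return f"1.{self.revisions[name]}"


class TreeNumbered:
    """Tree-based numbering: one counter for the whole repository."""

    def __init__(self):
        self.commit_number = 0

    def commit(self, changed_files):
        # Every commit bumps the single counter, whatever it changed.
        self.commit_number += 1

    def revision(self, name):
        # Every file is at the latest commit number, changed or not.
        return self.commit_number


by_file = FileNumbered()
for _ in range(57):
    by_file.commit(["a.txt"])
for _ in range(2):
    by_file.commit(["b.txt"])
print(by_file.revision("a.txt"), by_file.revision("b.txt"))  # 1.57 1.2

by_tree = TreeNumbered()
for _ in range(150):
    by_tree.commit(["a.txt", "b.txt"])
by_tree.commit(["a.txt"])  # the 151st commit changes only a.txt
print(by_tree.revision("a.txt"), by_tree.revision("b.txt"))  # 151 151
```

With the tree-based counter, asking for "commit #151" always gives you
a matching pair of chapter files, which is exactly the property the
book example relies on.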

--- Repositories ---

Repositories, as explained in the previous lesson, are the place where
all the versions are stored. They are usually files on disk in a special
format (you're not meant to edit them directly) and in a special place.
You don't have to have a working copy on the same machine as the
repository: version control systems have ways of accessing the
repository over the network.

Central code repositories are usually not open for anyone to write to
(Free Software projects usually let you read them at least). Typically
you will need to get someone to agree to give you access. This is
usually done by convincing them that you are going to help make their
project better. It's demonstrated in various ways, but usually this
involves sending your work to existing developers for review a few
times. Once it seems to them that you tend to make satisfactory
additions to the project, they will allow you to commit your changes
directly.

--- Sandboxes and working copies ---

When working with a revision control system, you do not directly edit
files in the repository. Instead, you and every other developer will be
working on what's called a "sandbox" or "working copy". It is simply a
copy of the files in the repository. Because each developer has an
independent copy of the files, you can change them, break them, fix
them, and so on, and the other developers' copies and the repository are
unaffected.

A working copy really is just a normal set of files, with one exception:
it will tend to have some special extra files which the version control
system will use to interact with the repository. These files note where
the repository is, what the last version you checked out was, and things
like that. You will not edit these special files directly. (We'll see
where they are in later lessons.)
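
As a rough picture of what "normal files plus a few special extra
files" means, here is a Python sketch that builds a pretend working
copy in a temporary directory. The directory name .toyvcs, the file
info.json, the function check_out and the repository address are all
hypothetical; each real system has its own names and formats for these
files.

```python
import json
import tempfile
from pathlib import Path


def check_out(repository_url, version, files, target):
    """Write `files` (name -> text) into `target`, plus a metadata file."""
    target = Path(target)
    for name, text in files.items():
        (target / name).write_text(text)   # the ordinary, editable files
    meta_dir = target / ".toyvcs"          # hypothetical special directory
    meta_dir.mkdir()
    (meta_dir / "info.json").write_text(
        json.dumps({"repository": repository_url, "checked_out_version": version})
    )
    return target


with tempfile.TemporaryDirectory() as tmp:
    working_copy = check_out(
        "http://example.org/repo", 150, {"a.txt": "hello\n"}, tmp
    )
    names = sorted(p.name for p in working_copy.iterdir())
    info = json.loads((working_copy / ".toyvcs" / "info.json").read_text())

print(names)  # ['.toyvcs', 'a.txt']
print(info["repository"], info["checked_out_version"])
```

You edit a.txt as you would any other file; the contents of the special
directory are for the version control system, not for you.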

--- Checking out and updating ---

"Checking out" a version is meant to be analogous to getting books out
of a library. It's not a great analogy though: essentially what you're
doing is making a copy of the repository at a particular version and
putting that copy in your working directory.

Checking out is a rare activity: it tends to refer to making a complete
copy of the repository files. A more common activity is working with an
existing checkout and asking the version control system to give you a
new copy of the files if anyone has changed any of them. This is called
"updating" your files.

Usually when working collaboratively with others, you will update your
files regularly, perhaps once or twice a day. This means you will find
out as soon as possible if you and another person are working at
cross-purposes, and minimises the chance that you and another person
have made completely different changes to the same part of a file.

You will not usually be allowed to commit (or "check in") files to a
repository unless you have updated to the latest version.

--- Committing ---

"Committing" is what you do when you want to share changes in your
working copy with other developers. You ask the version control system
to add them to the repository as the latest version. The next time other
people update, they will see them.
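
The whole cycle of checking out, updating and committing, including the
rule that you must be up to date before you commit, can be sketched in
a few lines of Python. Everything here (Repository, WorkingCopy,
OutOfDate, the file names) is invented for illustration, and the update
step is much cruder than a real one: it refuses to continue if two
people edited the same file, where real systems would try to merge.

```python
class OutOfDate(Exception):
    """Raised when a working copy must be updated before committing."""


class Repository:
    def __init__(self, files):
        self.versions = [dict(files)]  # versions[0] is version 1

    @property
    def latest(self):
        return len(self.versions)

    def snapshot(self, version):
        return dict(self.versions[version - 1])


class WorkingCopy:
    def __init__(self, repo):
        # "Checking out": copy the latest version and remember its number.
        self.repo = repo
        self.base = repo.latest
        self.files = repo.snapshot(self.base)

    def update(self):
        """Bring in changes other people have committed since our base."""
        base = self.repo.snapshot(self.base)
        latest = self.repo.snapshot(self.repo.latest)
        changed = [n for n in latest if latest[n] != base.get(n)]
        clashes = [n for n in changed if self.files.get(n) != base.get(n)]
        if clashes:
            # Real systems try to merge; this toy just refuses.
            raise RuntimeError(f"conflicting edits in {clashes}")
        for name in changed:
            self.files[name] = latest[name]
        self.base = self.repo.latest

    def commit(self):
        if self.base != self.repo.latest:
            raise OutOfDate("update to the latest version first")
        self.repo.versions.append(dict(self.files))
        self.base = self.repo.latest
        return self.base


repo = Repository({"chapter1.txt": "one\n", "chapter2.txt": "two\n"})
mary = WorkingCopy(repo)
alex = WorkingCopy(repo)

mary.files["chapter2.txt"] += "Manipulating foozbits\n"
mary.commit()                      # becomes version 2

alex.files["chapter1.txt"] += "See chapter 2.\n"
try:
    alex.commit()                  # refused: alex is still based on version 1
except OutOfDate:
    alex.update()                  # picks up mary's change to chapter 2
    alex.commit()                  # becomes version 3

print(repo.latest)  # 3
```

Note that version 1 is still stored untouched in the repository after
all of this, and that each working copy only sees the other person's
changes once it updates.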

When you commit files, version control systems also allow you to specify
what is called a "log message" at the same time. This is where you write
a short summary of what your changes do. For example, you might have a
log message that says "added the manipulating foozbits section to
chapter 2", or "copyedited chapter 1" or "translated chapter 3 into
Spanish" or "fixed bug 10334: the program will now do an emergency save
when there's an error".

Log messages are typically a sentence or two, and should be descriptive
enough for your collaborators to be able to tell basically what the
commit does (bad commit messages are things like "at last, fixed that
damned bug" or "that was stupid of me" or just blank). There are a
couple of reasons for this:

1. at some point later (particularly with code), someone may look at
your changes and ask "why on earth is this doing that?", for example
"why is this saving data here?". Your log message may help answer
their question.

2. quite often projects have a mailing list or IRC bot or similar that
notifies everyone of changes to the repository. Rather than simply
saying "changes to a.txt and b.txt" it's useful to be able to use
your summary.
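
To show why the summary matters for notifications, here is a small
Python sketch of the kind of one-line notice a mailing list or IRC bot
might send. The Commit record, the notification function and the
"r151" style of numbering are assumptions made for this example, not
the output format of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    number: int
    author: str
    files: list
    message: str   # the log message written at commit time


def notification(commit):
    """Format a one-line notice about a commit, using its log message."""
    message = commit.message.strip()
    # Use the first line of the log message as the summary, if there is one.
    summary = message.splitlines()[0] if message else "(no log message)"
    return f"r{commit.number} by {commit.author}: {summary} [{', '.join(commit.files)}]"


good = Commit(151, "mary", ["chapter2.txt"],
              "added the manipulating foozbits section to chapter 2")
blank = Commit(152, "alex", ["a.txt", "b.txt"], "")

print(notification(good))
print(notification(blank))
```

The first notice tells your collaborators what happened; the second
tells them only that a.txt and b.txt changed, which is the situation a
good log message avoids.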

Generally, you commit files as often as they are working. Ideally this
is every few hours at least. When you're making a lot of changes that
aren't going to work separately and which won't come together for a
while, or changes that need to be approved by others, then it is good
practice to make a branch of the files and commit work there, rather
than leave it in your working copy only. That way it's versioned and
public, but still not interfering with the main work. Some systems make
this easier than others.

You may already be wondering what to do if you don't want to, or
can't, check your files back into the main repository. Perhaps you're
going to maintain some local set of changes that you'd like to have
versioned, but don't intend to submit to upstream just now, or ever.
Even if you are intending to submit them upstream, if you don't have
write access to the repository, then you can only send code to
developers, you can't use their version control system. (This was one of
the first things I wanted to do with version control systems.) Branching
traditionally only allowed you to have a separate set of versions of a
file *in the same overall repository*. Having a whole separate
repository and easily merging files back and forth between them is
a relatively recent innovation. The two best-known version control
systems (CVS and Subversion) don't make this terribly easy. The new
distributed (or decentralised) version control systems are the first to
make this easy by design. We'll be looking at one, Bazaar, later in the
course.

-Mary

[1] Local times for Sun Oct 16, 00:00 available from
http://tinyurl.com/dfmue

[2] Local times for Wed Oct 19, 10:00 available from
http://tinyurl.com/djb9a