Managing Git Submodules With git.rake

Tuesday, May 06, 2008

Update 2008-05-21: Tim Dysinger and Pat Maddox pointed out that git submodules are inherently not well-suited for frequently updated projects. Read the comments for more details, and please use submodules with caution on projects where you can't guarantee that the shared repository has not changed between your 'pull' and 'push' operations.

Today I'm releasing git.rake into the wild under an open-source license. It's a rakefile for managing multiple git submodules in a shared-server development environment.

We've been using it internally at my company, Pharos Enterprise Intelligence, for the last 5 months and it's been a huge timesaver for us. Read below for a detailed description of the features and its use.

The code is being released under the MIT license and the git repository is being hosted on github. Take a look:

http://github.com/flavorjones/git-rake

What git.rake Is

A set of rake tasks that will:

  • Keep your superproject in sync with multiple submodules, and vice versa. This includes branching, merging, pushing and pulling to/from a shared server, and committing. (Biff!)

  • Keep a description of all changes made to submodules in the commit log of the superproject. (Bam!)

  • Display the status of each submodule and the superproject in an easily-scannable representation, suppressing what you don't want or need to see. (Pow!)

  • Execute arbitrary commands in each repository (submodule and superproject), terminating execution if something fails. (Whamm!)

  • Configure a rails project for use with git. (Although, you've seen that elsewhere and are justifiably unimpressed.)

Prerequisites

If you're not sure how to add a submodule to your repo, or you're not sure what a submodule is, take a quick trip over to the Git Submodule Tutorial, and then come back. In fact, even if you ARE familiar with submodules, it's probably worth reviewing.

The Problem We're Trying to Solve Here

Let's start with stating our basic assumptions:

  1. you're using a shared repository (like github)
  2. you're actively developing in one or more submodules

This model of development can get very tedious very quickly if you don't have the right tools, because every time you decide to "checkpoint" and commit your code (either locally or up to the shared server), you have to:

  • iterate through your submodules, doing things like:
    • making sure you're on the right branch,
    • making sure you've pulled changes down from the server,
    • making sure that you've committed your changes,
    • and making sure you've pushed all your commits
  • and then making sure that your superproject's references to the submodules have also been committed and pushed.

If you do this a few times, you'll see that it's tedious and error-prone. You could mistakenly push a version of the superproject that refers to a local commit of a submodule. When people try to pull that down from the server, all hell will break loose because that commit won't exist for them.

Ugh! This is monkey work. Let's automate it.

Simple Solution

OK, fixing this issue sounds easy. All we have to do is:

  • develop some primitives for iterating over the submodules (and optionally the superproject),
  • and then throw some actual functionality on top for sanity checking, pulling, pushing and committing.
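As a rough illustration of the first bullet, here is a minimal sketch of an iteration primitive. It is not taken from git.rake's source: the helper names `submodule_paths` and `each_repository` are hypothetical, and the sketch assumes submodules are discovered by reading the `path = ...` entries that git records in the superproject's `.gitmodules` file.

```ruby
# Hypothetical helper (not git.rake's code): extract submodule paths from
# the text of a .gitmodules file by collecting its "path = ..." entries.
def submodule_paths(gitmodules_text)
  gitmodules_text.each_line.map { |line|
    match = line.match(/\A\s*path\s*=\s*(.+?)\s*\z/)
    match && match[1]
  }.compact
end

# Hypothetical helper: yield each submodule directory under `root`, and
# then (optionally) the superproject directory itself.
def each_repository(root, gitmodules_text, include_superproject: false)
  dirs = submodule_paths(gitmodules_text).map { |path| File.join(root, path) }
  dirs << root if include_superproject
  dirs.each { |dir| yield dir }
end
```

A caller would typically pass `File.read(".gitmodules")` as the text; keeping the parsing separate from the file access makes the primitive easy to exercise without a real repository.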

The Tasks

git-rake presents a set of tasks for dealing with the submodules:

    git:sub:commit     # git commit for submodules
    git:sub:diff       # git diff for submodules
    git:sub:for_each   # Execute a command in the root directory of each submodule.\
                         Requires CMD='command' environment variable.
    git:sub:pull       # git pull for submodules
    git:sub:push       # git push for submodules
    git:sub:status     # git status for submodules

And the corresponding tasks that run for the submodules PLUS the superproject:

    git:commit         # git commit for superproject and submodules
    git:diff           # git diff for superproject and submodules
    git:for_each       # Run command in all submodules and superproject. \
                         Requires CMD='command' environment variable.
    git:pull           # git pull for superproject and submodules
    git:push           # git push for superproject and submodules
    git:status         # git status for superproject and submodules

It's worth noting here that most of these tasks do pretty much just what they advertise, in some cases less, and certainly nothing more (well, maybe a sanity check or two, but no destructive actions).

The exception is git:commit, which depends on git:update, and that has some pixie dust in it. More on this below.

Leaving only the following specialty tasks to be explained:

    git:configure      # Configure Rails for git
    git:update         # Update superproject with current submodules

The first is simple: configuration of a rails project for use with git.

The other, git:update, does two powerful things:

  1. (Only if on branch 'master') Submodules are pushed to the shared server. This guarantees that the superproject will not have any references to local-only submodule commits.

  2. For each submodule, retrieve the git-log for all uncommitted (in the superproject) revisions, and jam them into a superproject commit message.

Here's an example of such a superproject commit message:

    commit 17272d53c298bd6a8ccee6528e0bc0d62104c268
    Author: Mike Dalessio <mike@csa.net>
    Date:   Mon May 5 20:48:13 2008 -0400

            updating to latest vendor/plugins/pharos_library

            > commit f4dbbce6177de4b561aa8388f3fa9f7bf015fa0b
            > Author: Mike Dalessio <mike@csa.net>
            > Date:   Mon May 5 20:47:46 2008 -0400
            >
            >     git:for_each now exits if any of the subcommands fails.
            >
            > commit 6f15dee8c52ced20c98eef63b3f3fd1c29d91bbf
            > Author: Mike Dalessio <mike@csa.net>
            > Date:   Fri May 2 13:58:17 2008 -0400
            >
            >     think i've got the tempfile handling correct now. awkward, but right.
            >

Excellent! Not only did git:update automatically generate a useful log message for me (indicating that we're updating to the latest submodule version), but it's also embedding original commit logs for all the changes included in that commit! That makes it much easier to find a specific submodule commit in the superproject commit log.
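For the curious, a message shaped like the example above could be assembled along these lines. This is only a sketch under stated assumptions, not git.rake's actual implementation: `superproject_commit_message` is a hypothetical helper, and it assumes the submodule's log text has already been captured (for instance from `git log`).

```ruby
# Hypothetical helper (not git.rake's code): build a superproject commit
# message from the raw log text of one submodule. Each log line is quoted
# with "> ", mirroring the layout of the example message above.
def superproject_commit_message(submodule_path, submodule_log)
  quoted = submodule_log.each_line.map { |line|
    text = line.chomp
    text.empty? ? ">" : "> #{text}"
  }
  (["updating to latest #{submodule_path}", ""] + quoted).join("\n") + "\n"
end
```

The first line stays short so that one-line log views of the superproject remain readable, while the quoted body keeps the full submodule history searchable.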

A Note on Branching and Merging

Note that there are no tasks for handling branching and merging. This is intentional! It could be very dangerous to try to read your mind about actions on branches, and frankly, I'm just not up to it today.

For example, let's say I invented a task to copy the current branch master to a new branch foo (the equivalent of git checkout -b foo master) in all submodules, but one of the submodules already has a branch named foo!

Do we reduce this action to a simple git checkout foo for that submodule? That could yield unexpected results if a) we forgot we had a branch named foo and b) that branch is very different from the master we expected to copy.

Well, then -- we can delete (or rename) the existing foo branch and follow that up by copying master to foo. But then we're silently renaming (or deleting) branches that a) could be upstream on the shared server or b) we intended to keep around, but forgot to git-stash.

In any case, my point is that it can get complicated, and so I'm punting. If you want to copy branches or do simple checkouts, you should use the git:for_each command.

Everyday Use of git.rake

In my day job, I've taken the vendor-everything approach and refactored lots of common code (across clients) into plugins, each of which is a git submodule. My current project has 14 submodules, of which I am actively coding in probably 5 to 7 at any one time. (Plenty of motivation for creating git.rake right there.)

Let's say I've hacked for an hour or two and am ready to commit to my local repository. Let's first take a look at what's changed:

    $ rake git:status

    All repositories are on branch 'master'
    /home/mike/git-repos/demo1/vendor/plugins/core: master, changes need to be committed
    #   modified:   app/models/user_mailer.rb
    #   public/images/mail_alert.png        (may need to be 'git add'ed)
    WARNING: vendor/plugins/core needs to be pushed to remote origin
    /home/mike/git-repos/demo1/vendor/plugins/pharos_library: master, changes need to be committed
    #   deleted:    tasks/rake/git.rake

You'll notice first of all that, despite having 14 submodules, I'm only seeing output for the ones that need commits, and even that output is minimal, listing only the specific files and not all the cruft in the original message. It tells me that all submodules are on the same branch. It's smart enough to tell me that a file may need to be git-added. It will even alert me when a repo needs to be pushed to the origin.
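To give a feel for how such condensed output might be produced, here is a small sketch. It is not how git.rake does it: the helper name `summarize_status` is hypothetical, and it assumes the input is the text of `git status --porcelain`, a machine-readable format that newer versions of git provide (two status characters, a space, then the path).

```ruby
# Hypothetical helper (not git.rake's code): condense the text of
# `git status --porcelain` into short lines. Returns [] for a clean
# repository, so a caller can print nothing at all in that case.
def summarize_status(porcelain_text)
  labels = { "M" => "modified", "D" => "deleted", "A" => "added" }
  porcelain_text.each_line.map { |line|
    code = line[0, 2]               # two-character status, e.g. " M" or "??"
    path = line[3..-1].to_s.chomp   # the path starts after the separator space
    if code == "??"
      "#   #{path}        (may need to be 'git add'ed)"
    else
      label = labels[code.strip[0]] || "changed"
      "#   #{label}:   #{path}"
    end
  }
end
```

Suppressing clean repositories then amounts to skipping any submodule whose summary is empty.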

I'll have to manually cd to the submodule and git-add that one file, but once that's done, I can commit my changes by running:

    $ rake git:commit

which will run git commit -a -v for each submodule, fire up the editor for commit messages along the way, push each submodule to the shared server, and then automagically create verbose commit logs for the superproject.

To pull changes from the shared server:

    $ rake git:pull

When you run this command, you'll notice that the output is filtered, so if no changes were pulled, you'll see no output. Silence is golden.

To push?

    $ rake git:push

Not only will this be silent if there's nothing to push, but the rake task is smart enough to not even attempt to push to the server if master is no different from origin/master. So it's silent and fast.
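The "don't bother pushing" check can be sketched as a comparison of two commit IDs. This is a minimal illustration, not git.rake's code: `rev_parse` and `needs_push?` are hypothetical names, and the `resolver:` parameter exists only so the logic can be tried without a real repository. Note that a simple inequality test cannot tell "ahead of origin" from "behind origin"; it only mirrors the "no different" rule described above.

```ruby
require "open3"

# Hypothetical helper: resolve a ref to its commit SHA using
# `git rev-parse --verify --quiet`; returns nil if the ref is unknown.
def rev_parse(dir, ref)
  out, status = Open3.capture2e("git", "rev-parse", "--verify", "--quiet", ref, chdir: dir)
  status.success? ? out.strip : nil
end

# Hypothetical helper: true when the local branch exists and points at a
# different commit than its counterpart on the remote.
def needs_push?(dir, branch = "master", remote = "origin", resolver: method(:rev_parse))
  local = resolver.call(dir, branch)
  upstream = resolver.call(dir, "#{remote}/#{branch}")
  !local.nil? && local != upstream
end
```

A push task could call `needs_push?(dir)` for each repository and skip the network round trip entirely when it returns false.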

Let's say I want to copy the current branch, master, to a new branch, working.

    $ rake git:for_each CMD='git checkout -b working master'

If the command fails for any submodules, the rake task will terminate immediately.
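The fail-fast behaviour can be sketched in a few lines. Again, this is an assumption-laden illustration rather than git.rake's implementation: `for_each_repository` is a hypothetical name, and the list of directories is assumed to have been gathered elsewhere.

```ruby
# Hypothetical helper (not git.rake's code): run `command` in each
# directory in turn, stopping at the first failure so that later
# repositories are left untouched.
def for_each_repository(dirs, command)
  dirs.each do |dir|
    # Kernel#system returns true on success, false on a non-zero exit
    # status, and nil if the command could not be started at all.
    ok = system(command, chdir: dir)
    raise "command failed in #{dir}: #{command}" unless ok
  end
end
```

Stopping early matters for commands like `git checkout -b`, where carrying on after a failure would leave the repositories on a mix of branches.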

Merging changes from 'working' back into 'master' for every submodule (and the superproject)?

    $ rake git:for_each CMD='git checkout master'
    $ rake git:for_each CMD='git merge working'

What git.rake Doesn't Do

A couple of things that come quickly to mind that git.rake should probably do:

  • Push to the shared server for ANY branch that we're tracking from a remote branch.

  • Be more intelligent about when we push to the server. Right now, the code pushes submodules to the shared server every time we want to commit the superproject. We might be able to get away with only pushing the submodules when we push the superproject.

  • Stop relying on parsing the output of various 'git' commands, which is prone to breakage if the git crew starts modifying some of the strings.

  • There should probably be some unit/functional tests. See previous item.

Anyway, the code is all up on github. Go hack it, and send back patches!

15 comments:

dysinger said...

I don't use submodules anymore. They are a hassle when dealing with a bunch of inter-related sub-projects. Instead I use subtree and merge. It's much more like merging a branch (because that's exactly what it is). The branch just gets placed in a sub-directory. Read about it here http://dysinger.net/2008/04/29/replacing-braid-or-piston-for-git-with-40-lines-of-rake/ The rspec project had a disaster, from which they luckily recovered, where they lost 15 days of commits due to a submodule mess-up. Anyway submodules are good for some things. But that list is pretty small IMO.

Mike Dalessio (aka "Flavor") said...

@dysinger - I'll be looking into subtrees. Thanks for the feedback!

Mike Dalessio said...

@dysinger - After reading up on subtrees in the How-To, it seems like they're not appropriate for the situation git-rake was developed to handle -- namely, code under active development.

Specifically, subtrees are great for merging upstream changes into your local (downstream) repository. But they do not provide functionality for committing changes back upstream to the (shared) repository.

If your big complaint with submodules is that they're a hassle, and git-rake makes handling submodules easy, then that's problem solved, no?

dysinger said...

I think the exact opposite. I think submodules are for code that is not edited often. Every time you edit code in the submodule you have to make an additional commit in the parent module so that everyone can see the latest changes in the submodule = hassle.

You can do two things for working on projects with sub-trees where you need to edit the sub-tree project.

1 - branch the sub-tree's remote just like you would branch any remote. Commit to them and push them back to their origin repo (which has already been added as a remote).

2 - you can always work on the sub-treed project in another directory where you have the project cloned on its own just like any git project. When you are done, push to the remote, then pull back (merge) into your "parent" project.

It works and is less cumbersome than submodules IMO. It's just branches and merging - active development as it's supposed to be.

dysinger said...

I have used both and both work to be fair. I just like sub-trees better. I think it's cleaner.

Mike Dalessio said...

I understand where you're coming from. However, git-rake automatically pulls your commit logs from the submodule into the superproject ... without a hassle. That's the whole point.

I don't spend any more time managing my submodules than I would if I had all my code in a single project. If I make a change, I run "rake git:commit", and whatever changes were made in my submodules get committed, pushed upstream, and the superproject pulls the commit logs into its own log. Easy, peasy.

I will take another look at subtrees; however, your description of pushing changes back upstream sounds like more work than I'm doing now with git-rake.

But really, that's the wonderful thing about git. There's a workflow for everyone's taste.

McNaz said...

Hi Mike.

I've just posted a new article with a work-around using Git's sub-project.

It was inspired by your work.

More info: http://panthersoftware.com/articles/view/4/git-svn-dcommit-workaround-for-git-submodules

Thanks.

Pat Maddox said...

dysinger was right when he said that the RSpec project had a lot of trouble with submodules and found them unsuitable for active development.

They seem to work fine, as does git-rake, until you have more than one developer. Then the submodules become a real PITA.

Let's say you've got a super project with submodule a. I make changes to some file in a and commit my changes, and update the submodule reference to be commit abc123. Now you make some changes to a file in a, commit your changes and update the submodule ref to be def456. When either of us updates, there will be a merge conflict on the submodule reference, because we're both saying that the submodule reference has a different commit ID. You can just accept your own SHA, because you know you've already merged the submodules. But it's annoying to have to do that every time. On top of that, it's way too easy to blow away local changes by doing "git submodule update". So submodules, for us, had no upsides and plenty of downsides.

I experimented with git-rake for a bit and it doesn't appear to handle this problem. Are you using it on a team with multiple active developers, or are you using it alone, managing external dependencies?

As far as how we solved it in the RSpec codebase, we just check out the "child" repos into the proper dirs and use .gitignore to ignore their files from the "parent" repo. But there's no direct relationship between them, so we don't have any of the headaches of submodules.

Mike Dalessio said...

@pat: I understand the issue as you've described, and was able to reproduce it, and it's clarified some things for me.

I (and my company) have not walked into this spinning propeller of submodule hell for two reasons:

1) We always make sure our submodules are on a branch, and not on the (default) detached HEAD. This allows us to ...

2) Always pull changes from the shared repo before committing local changes.

Now, this is only feasible on branches with relatively low update frequencies.

That is, if you can't be reasonably sure that the shared repository hasn't changed between the times you do a 'git pull' and a 'git push' on your submodule, then (proverbially speaking) you're F-ed in the A.

So, I acknowledge that there is no reasonable way to make submodules work for frequently-updated projects, even with git.rake.

Many thanks to you and dysinger for pointing this out. It's definitely an issue. I'll be updating the documentation in git.rake's github repo to alert people.

Anonymous said...

We now use the following sequence to clone a repo:

git clone whatever
cd whatever_dir
git submodule init
git submodule update
rake git:sub:checkout_master
rake git:update
rake git:commit
rake git:push

The checkout_master step is crucial of course, as it ensures that the submodules are all on the master branch. It is something of a precaution.

This complicated sequence (which we have turned into a shell script) would not be necessary if all submodules are switched to the master branch as they are added to the project, before the commit.

Robert said...

Wahh, this disappeared off github?

Forked here:
http://github.com/rmatei/git-rake/tree/master

Mike Dalessio said...

Sorry, I renamed my github account from 'mdalessio' to 'flavorjones'. I'll update the article in a moment, but for now you can see this repo at:

http://github.com/flavorjones/git-rake

Anonymous said...

It is extremely interesting for me to read this post. Thanks for it. I like such topics and anything connected to them. I definitely want to read a bit more soon.

Anonymous said...

It was very interesting for me to read this blog. Thanx for it. I like such themes and anything that is connected to them. I would like to read a bit more on that blog soon.

Marlene said...

We just set up our dev env in git with submodules so that we can get partial workspaces and versions for each of our dependent modules. Not everyone needs everything. Now I came across this post. Is this still the situation with submodules? Is there any better way to have sparse workspaces?
Thanks.