4th Dec 2024
Git submodules let you embed one repo as a subdirectory inside another. There have been many debates about whether this is a good idea. Before using submodules extensively I was on the fence. Many of the reasons given for not using them didn’t sound that compelling because the fundamental idea of composing repos sounds reasonable.
However after using them in anger the flaws in Git’s specific implementation became very apparent. This is a list of those flaws so you don’t have to discover them yourself.
Worktrees are an extremely useful feature that let you have multiple
copies of a repo that share the same .git
directory (so
commits and branches stay synchronised). Unfortunately they don’t work
reliably with submodules.
Quoting the man page:
Multiple checkout in general is still experimental, and the support for submodules is incomplete. It is NOT recommended to make multiple checkouts of a superproject.
In my experience this warning is accurate.
When you switch branches you need to run
git submodule update --init --recursive
to update the
submodules to point to wherever your target branch points them too. This
is a bit of a pain, and Git provides a flag to do it automatically for
you: git switch --update-submodules
. It can be enabled by
default so that all git switch
commands automatically
update submodules.
DO NOT USE THIS. It can seriously break your .git
directory to the point that you may have to delete it and start from
scratch. Common error messages are:
fatal: not a git repository: /some/path
Failed to clone 'some/submodule'. Retry scheduled
BUG: submodule considered for cloning, doesn't need cloning any more?
fatal: could not get a repository handle for submodule 'some/submodule'
This message contains some ideas for how to unbreak things if you get stuck in this mess.
Having to run git submodule update --init --recursive
endlessly is rather tedious.
If you work with multiple branches that contain different sets of submodules be prepared for pain. When switching from a branch with a submodule to another without that submodule, Git will just leave the files on disk and they will become untracked files. You have to manually delete them.
Often you will be met with unable to rmdir
errors or other similar where Git just can’t quite deal with
submodules changing when you switch branches.
This is similar to the behaviour when switching between branches with
different .gitignore
s, but much more annoying!
Without submodules, a Git repo is entirely self-contained and portable. It doesn’t know where it is hosted; it can be freely forked, cloned and mirrored.
With submodules this is no longer the case - it adds a .gitmodules
file which usually contains absolute URLs to the
submodules. If you clone the repo the submodules will still point to the
original URLs.
Even worse, the protocol (https or ssh) is often explicitly stated, which can even lead to the inability to clone the repo at all (for example if SSH URLs are used and you don’t have appropriate access permissions).
It is possible to use relative
paths but many people forget, and the exact format of these (number
of ..
s etc.) depends on where you are hosting it.
Technically this is also a problem with any third party dependency, for example if your project uses NPM, Pypi or Cargo libraries. These tools provide ways to replace dependencies if necessary. The problem arises when submodules are used for closely linked first party dependencies.
For example in VSCode by default if you click on the diff of a submodule, it will just show you what the hash changed from/to. Looking that up is very not fun.
There is a way to improve this via
git config --global diff.submodule log
, but like many of
Git’s UX fixes it is not enabled by default. Most developers are not
going to know that option even exists.
In general you will find that submodules are an afterthought in most tools.
Ignoring the above points which are all things that are Git-specific and could theoretically be fixed (but don’t hold your breath!), are submodules a good idea even in principle?
There are two situations where you might want to use submodules:
The first is generally reasonable in situations where a proper dependency manager (Cargo, vcpkg, etc.) is not available and you are not planning to edit that dependency.
The second is more debatable. You should consider the following:
A key difference between code that lives in a submodule and code in a subdirectory is that code in a subdirectory has a bounded set of users (other code in the repo), whereas the submodule has an unbounded set (literally any other repo might use it). This means that your code suddenly gains a public API.
By putting the code in a submodule you can no longer trivially make breaking changes to the code because you don’t know who might be using it.
Even if you develop tooling to identify downstream users of your submodule, actually testing changes is extremely difficult. You can make a change and run CI for your submodule, but did you also bump the version of your submodule in all the other repos that include it and run their CI? Probably not. I have done this for some projects using Gitlab CI, which supports triggering downstream pipelines. It’s better than nothing, but barely.
Making any kind of breaking change (including refactoring) becomes very difficult to coordinate across repos. In a monorepo you can just do the refactor, send it to CI and call it a day. In a multi-repo setup this is much much harder. You cannot test the overall change in one go. You have to do it in stages and in a backwards compatible way. It can become especially complicated when two repos depend on each other (yes it can happen!).
If you have a project that has two dependencies as submodules, and those dependencies both have a third dependency as a submodule (a diamond dependency graph), then you will have two copies of the third dependency in your project. This isn’t theoretical; I’ve worked in a repo that heavily uses submodules and ends up with 6 copies of the same submodule.
One alternative to submodules is a monorepo. Unfortunately Git has quite poor support for monorepos too. As far as I know, only Microsoft uses Git with a large monorepo. Google and Facebook both use custom VCS’s.
Git can become extremely slow with large repos. Monorepos can become huge, and in that case you really need sparse checkouts and file monitoring.
Git has experimental support for sparse checkouts, and file monitoring (not supported on Linux).
If possible, use a package manager. E.g. for Rust use cargo, for C++ use vcpkg, etc.
For first party dependencies use a monorepo, and a build system which works well with monorepos (Bazel, Buck2, Pants, Please, etc).
For third party dependencies where you can’t use a package manager
you could consider git subtree
(which is essentially
vendoring the repo). This works fairly well and is especially good if
you will be modifying the contents of the submodules. It does make
pulling and pushing to the original repo a bit more of a pain
though.
Another option is git subrepo
,
which looks very similar to git subtree
(I haven’t used
it).
Ultimately I don’t think Git has a good answer to the question that leads people to submodules. I wonder if Pijul or Sapling are any better…
Git has a habit of fixing UX issues and then hiding the fix behind a flag that nobody notices. Here are a few I have discovered. Set them all!
git config --global diff.submodule log
- better
submodule diffsgit config --global alias.pushf 'push --force-with-lease --force-if-includes'
- safe pushing; this checks you aren’t clobbering someone else’s commits
accidentallygit config --global push.autoSetupRemote true
- never
have to copy and paste
git push --set-upstream origin branchname
again!git config --global merge.conflictStyle zdiff3
- better
merge conflicts (though Git is still pretty shit at these in my
experience)git config --global fetch.prune true
- automatically
delete your local copy of origin/foo
when that branch is
deleted on the remote (usually because the PR/MR merged)In copy/pasteable form:
git config --global diff.submodule log
git config --global alias.pushf 'push --force-with-lease --force-if-includes'
git config --global push.autoSetupRemote true
git config --global merge.conflictStyle zdiff3
git config --global fetch.prune true