Thursday, May 7, 2015

Git Commits are Not Transactions

A Git commit is not a list of differences between itself and the previous commit.  A branch is not a transaction log of file content changes.  Git does not store your files as differences from one another (at least not until later, when garbage collection packs them).

Each Git commit contains a compressed snapshot of your entire working directory*.  To be more precise, it contains a reference to a list of compressed files.  That list covers every file you had "git add"ed as of that commit, including the files that you did not change since the previous one.

To make this clearer and more concrete, run the following command.

    $ git cat-file -p HEAD
    tree b1da7229394351fa209533717de8a741d227aa1b
    parent b6b293e9bc2dacb114463b36286b33e83c79c0b7
    author Dominic Muller <nicklink483@gmail.com> 1431043752 -0700
    committer Dominic Muller <nicklink483@gmail.com> 1431043752 -0700

    removed new_file

This will output the last commit** in plain text as it is stored in git***.  What you're looking at is literally how git saves a commit.  You may notice that the top line says the word "tree" followed by a SHA-1 hash.  This hash identifies the tree object, which is the top-level list of files and directories that your index knew about at the time of the commit (each subdirectory gets a tree object of its own).  To show this, use the following command.

    $ git cat-file -p HEAD^{tree}
    100644 blob 6b5ffcfb6de1421c1b7a90b0b98febbceb75e70e    .gitignore
    100644 blob f73693a16cdf594532ee4c423a46d32ce3430c4e    blah.txt
    040000 tree 86c2509f4c12c5d3bf9a486925ed051666ee2d97    new_dir
    100644 blob 0861b9114fba8c82892d89e53f2a34447bd4c9e7    new_file.txt

That is the list of files and directories that your index knew about at the moment of that commit.  For any entry marked "blob", if you run the following on its hash (I'm doing it on blah.txt here):

    $ git cat-file -p f73693a16
    "test"

You will see an entire file of yours. Mine here only contained the content "test" (quotes included).
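To see where a blob's hash comes from, here is a minimal Python sketch of how Git names a blob: it hashes a short header ("blob", the content length in bytes, then a NUL byte) followed by the raw content.  The helper name `git_blob_id` is mine, not part of Git, and the sample content is an assumption: I don't know whether blah.txt ended with a newline, so no particular hash is claimed for it.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw file bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The ID depends only on the bytes, not on the file name or the commit.
same_a = git_blob_id(b'"test"\n')   # hypothetical content for illustration
same_b = git_blob_id(b'"test"\n')
changed = git_blob_id(b'"test2"\n')

print(same_a == same_b)   # True: identical content, identical ID
print(same_a == changed)  # False: different content, different ID
```

You can compare the result against "git hash-object <file>" for any file on disk; the two should agree.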

Every commit in Git can reconstruct the working directory at the time of its creation without any knowledge of any other commit.  The reason commits have parents is a) for convenience on your part to know a kind of history, and b) to help git know if there has been any data loss/change at some point.  Every time you add a changed file, git stores a complete new copy of that file (an unchanged file is simply referenced again by the same hash), so each commit has its own file references without help from any other commit.  This may seem like file bloat over time, but fear not: git will eventually pack its objects during garbage collection and store deltas in the database, while each commit still references its own fully reconstructable blobs.
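The snapshot idea can be sketched as a toy content-addressed store in Python.  This is an illustration under simplifying assumptions, not Git's actual implementation: trees and commits are serialized as JSON rather than Git's own formats, nothing is compressed, and the names `ToyRepo`, `commit`, and `checkout` are invented for this example.

```python
import hashlib
import json


class ToyRepo:
    """A toy content-addressed store: every object is keyed by its hash."""

    def __init__(self):
        self.objects = {}  # object id -> bytes

    def _put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data  # storing identical bytes twice changes nothing
        return oid

    def commit(self, files: dict, parent=None) -> str:
        """Snapshot every file, changed or not, and return the commit id."""
        tree = {name: self._put(content) for name, content in files.items()}
        tree_id = self._put(json.dumps(tree, sort_keys=True).encode())
        record = {"tree": tree_id, "parent": parent}
        return self._put(json.dumps(record, sort_keys=True).encode())

    def checkout(self, commit_id: str) -> dict:
        """Rebuild the full file set from one commit, never reading its parent."""
        record = json.loads(self.objects[commit_id])
        tree = json.loads(self.objects[record["tree"]])
        return {name: self.objects[oid] for name, oid in tree.items()}


repo = ToyRepo()
first = repo.commit({"blah.txt": b'"test"\n', "new_file.txt": b"hello\n"})
second = repo.commit(
    {"blah.txt": b'"test"\n', "new_file.txt": b"hello again\n"}, parent=first
)

print(repo.checkout(first))
print(repo.checkout(second))
print(len(repo.objects))  # 7: three blobs, two trees, two commits
```

Both commits list blah.txt in their trees, yet its content is stored once because the hash is the same.  The parent field is recorded but never consulted by `checkout`, which is the point: a commit is a snapshot, not a patch.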

TL;DR If anyone tells you that a commit contains the difference between the last commit and that one, they are wrong, and you can prove it.  I've heard this falsehood more than once even from well-seasoned git enthusiasts.

For more on this, please read chapter 10 of Pro Git, available online for free here: https://git-scm.com/book/tr/v2/Git-Internals-Plumbing-and-Porcelain

*well, only the files that your index knew about from your working directory, but if you git add ./ and don't have anything in your .gitignore file, these will almost certainly be the same.

**git cat-file -p HEAD might not show you the most recent commit if your HEAD isn't pointing to the most recent commit, but you get the point.  You can always see what HEAD is referencing with "cat .git/HEAD" (unless you're on Windows, I guess)

***Except for the header and the compression.  The stored object would be "commit 241\0" + that plain text, zlib-compressed on disk.  The number 241 is the length in bytes of that commit's content on my computer.
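The header from the last footnote can be checked with a short Python sketch.  It rebuilds the commit text printed by "git cat-file -p HEAD" earlier, assuming the message ends with a single newline, then prepends the header, hashes the result, and compresses it with zlib the way a loose object file is stored.  The variable names are mine; the resulting hash is printed rather than claimed, since I can't confirm it against the original repository.

```python
import hashlib
import zlib

# The commit content as shown by `git cat-file -p HEAD` in this post,
# assuming the commit message ends with exactly one newline.
body = (
    b"tree b1da7229394351fa209533717de8a741d227aa1b\n"
    b"parent b6b293e9bc2dacb114463b36286b33e83c79c0b7\n"
    b"author Dominic Muller <nicklink483@gmail.com> 1431043752 -0700\n"
    b"committer Dominic Muller <nicklink483@gmail.com> 1431043752 -0700\n"
    b"\n"
    b"removed new_file\n"
)

# Header: object type, a space, the content length in bytes, then a NUL byte.
header = b"commit " + str(len(body)).encode("ascii") + b"\0"
store = header + body

commit_id = hashlib.sha1(store).hexdigest()  # the commit's object ID
on_disk = zlib.compress(store)               # zlib-compressed, as in a loose object file

print(len(body))  # 241
print(header)     # b'commit 241\x00'
print(commit_id)
```

Under that newline assumption the length comes out to 241, matching the number in the footnote.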
