Optional: Git Internals

Objectives

  • Understand what is easy to do with Git, and what is not.

Instructor note

  • 15 min teaching/type-along

  • 15 min exercise

Down the rabbit hole

When working with Git, you will normally never need to go inside .git, but in this exercise we will, in order to learn about

  • how branches are implemented in Git, and how to use them freely

  • how you can avoid losing data with Git.

Prerequisites

For this exercise, create a new repository and commit a couple of changes. Alternatively, you can clone this repository:

$ git clone https://github.com/mmesiti/merge-fu.git

Now that we have a couple of commits, let us look at what is happening under the hood. From the top-level directory of the repository, run:

$ cd .git
$ ls -l

drwxr-xr-x   - user 25 Aug 15:51 branches
.rw-r--r-- 499 user 25 Aug 15:52 COMMIT_EDITMSG
.rw-r--r--  92 user 25 Aug 15:51 config
.rw-r--r--  73 user 25 Aug 15:51 description
.rw-r--r--  21 user 25 Aug 15:51 HEAD
drwxr-xr-x   - user 25 Aug 15:51 hooks
.rw-r--r-- 137 user 25 Aug 15:52 index
drwxr-xr-x   - user 25 Aug 15:51 info
drwxr-xr-x   - user 25 Aug 15:52 logs
drwxr-xr-x   - user 25 Aug 15:52 objects
drwxr-xr-x   - user 25 Aug 15:51 refs

Git stores everything under the .git folder in your repository.
We will have a look at the objects and the refs directories.
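As a first peek at refs, the following sketch suggests that a branch is simply a small file holding the hash of a commit. It builds a throwaway repository in a temporary directory; the file name file.txt, the identity "Demo User" and the branch name experiment are made up for illustration. It also assumes the default loose-file ref storage (refs can be packed into .git/packed-refs, in which case the individual files may be missing).

```shell
# Throwaway repository with a single commit (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'hello' > file.txt
git add file.txt
git commit -q -m 'first'

# HEAD is a text file that names the current branch.
cat .git/HEAD

# The branch itself is a file containing the hash of a commit.
ref=$(git symbolic-ref HEAD)
branch_file_content=$(cat ".git/$ref")
head_commit=$(git rev-parse HEAD)
echo "$branch_file_content"
echo "$head_commit"

# Creating a branch only writes one more small file with the same hash.
git branch experiment
experiment_content=$(cat .git/refs/heads/experiment)
echo "$experiment_content"
```

Because a branch is only a pointer of this kind, creating or deleting one is cheap and never touches the commits themselves.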

In the objects directory we find, among others, 3 kinds of objects:

  • commits: These represent the commits we have made with git commit

  • blobs: These represent snapshots of all the files we have ever added to the repo with git add.

  • trees: These represent directories containing the files we have added, and reference other trees (subdirectories) and blobs (files that we have added).

Commit objects contain information about the author and the commit message, and every commit object references a single tree object.

All objects are named after the SHA-1 hash (a 40-character hexadecimal string) that is computed from their content.
This means that all objects are immutable: changing the content would produce a different hash, and therefore a different object.
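To make the naming rule concrete, here is a minimal sketch that computes the name of a blob in two ways: once with git hash-object, and once by hand. It relies on the fact that Git hashes a short header ("blob", the size in bytes, and a NUL byte) followed by the content. The sketch assumes that the sha1sum tool is available (on macOS, shasum plays the same role) and that Git uses its default SHA-1 object format.

```shell
# Name that Git would give to a blob whose content is "hello\n" (6 bytes).
git_hash=$(printf 'hello\n' | git hash-object --stdin)

# The same name computed by hand: SHA-1 over "blob 6", a NUL byte, and the content.
manual_hash=$(printf 'blob 6\000hello\n' | sha1sum | cut -d' ' -f1)

echo "$git_hash"
echo "$manual_hash"
```

Both lines should print the same 40-character string. Changing a single character of the content gives a completely different name.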

A commit inside Git

States of a Git repository. Image from the Pro Git book. License CC BY 3.0.

Changes and their effect: files and commits

Refer to the figure above, and discuss: which SHA-1 hashes would change in the diagram if:

  • the content of the first file is changed,

  • we recreate a commit with another message or author,

  • we recreate a commit with the same message and author.

Is it possible to have multiple commits refer to the same tree? What happens when you use git revert?
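If you would like to check your answer to the last question experimentally, one possible approach is sketched below. It creates a throwaway repository (the file name, commit messages and identity are made up for illustration), reverts the latest commit, and then prints the tree hash of every commit with the %T format placeholder of git log.

```shell
# Throwaway repository with two commits (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'version 1' > file.txt
git add file.txt
git commit -q -m 'first'
echo 'version 2' > file.txt
git add file.txt
git commit -q -m 'second'

# Undo the second commit by adding a third commit on top of it.
git revert --no-edit HEAD > /dev/null

# Abbreviated commit hash, tree hash and subject of every commit.
git log --format='%h tree=%T %s'

tree_of_first=$(git rev-parse 'HEAD~2^{tree}')
tree_of_revert=$(git rev-parse 'HEAD^{tree}')
commit_first=$(git rev-parse HEAD~2)
commit_revert=$(git rev-parse HEAD)
```

Compare the tree= column of the first and the last commit with their commit hashes.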

Once you have several commits, each commit object also links to the hash of the previous commit(s) (merge commits have more than one previous commit). The commits form a directed acyclic graph (do not worry if the term is not familiar).
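The link to the previous commit can be seen directly in the raw commit object. The sketch below, run in a throwaway repository with made-up file names, messages and identity, extracts the parent line of the latest commit and compares it with the hash of the commit before it.

```shell
# Throwaway repository with two commits (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'one' > file.txt
git add file.txt
git commit -q -m 'first'
echo 'two' > file.txt
git add file.txt
git commit -q -m 'second'

# The raw commit object lists its tree, its parent(s), author, committer and message.
git cat-file -p HEAD

# The hash on the "parent" line is the hash of the previous commit.
parent_in_object=$(git cat-file -p HEAD | awk '$1 == "parent" {print $2}')
previous_commit=$(git rev-parse HEAD~1)
echo "$parent_in_object"
echo "$previous_commit"
```

The very first commit of a repository has no parent line at all, and a merge commit has one parent line per merged branch.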


A commit and its parents. Image from the Pro Git book. License CC BY 3.0.

Changes and their effect: changing history

Refer to the figure above, and discuss: which SHA-1 hashes would change in the diagram if:

  • The 3rd commit were changed

  • The 2nd commit were changed

Git is at its core a content-addressed storage system.

A look at the objects

Let us poke a bit into raw objects! Start with:

$ git cat-file -p HEAD

Then explore the tree object, then the file object, etc. recursively using the hashes you see.
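The walk from commit to tree to blob could look like the sketch below. It uses a throwaway repository so that it runs anywhere; the file name file.txt and the identity are made up, and in your own repository you would copy the hashes printed by the previous command instead of using git rev-parse.

```shell
# Throwaway repository with a single commit (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'hello' > file.txt
git add file.txt
git commit -q -m 'first'

# 1. The commit object: prints a "tree <hash>" line, author, committer and message.
git cat-file -t HEAD
git cat-file -p HEAD

# 2. The tree object referenced by the commit: one line per file or subdirectory.
tree=$(git rev-parse 'HEAD^{tree}')
tree_type=$(git cat-file -t "$tree")
git cat-file -p "$tree"

# 3. The blob object referenced by the tree: the content of the file itself.
blob=$(git rev-parse HEAD:file.txt)
blob_type=$(git cat-file -t "$blob")
blob_content=$(git cat-file -p "$blob")
echo "$blob_content"
```

Here git cat-file -t prints the type of an object and git cat-file -p pretty-prints its content.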

Demo: If you add it, you don’t lose it (for a while)

A common way to (apparently) lose work is to use git add indiscriminately.

You make some changes to a file (let us call this version A) and git add them; then you make some more changes (let us call this version B) and git add those as well.

Now version A is apparently lost, and if we realize that we need it back, we typically click nervously on the “undo” arrow of our editor.

But fear not! Try this.

  1. Create a file named test-add with the following command:

    $ echo 'Once a file has been git added, it is hard to lose!' > test-add
    
  2. Add it to the repository

    $ git add test-add
    
  3. Now change the content of the file to be

    Ops
    
  4. And repeat the add command

    $ git add test-add
    
  5. Apparently we have lost the previous version of the file. But it is actually still there, stored in a dangling blob object (an object that is not referenced, even indirectly, by any ref). We can see this with the command git fsck:

    $ git fsck
    Checking object directories: 100% (256/256), done.
    dangling blob dc3b15f60045eea7a87639436ed75021130579e0
    

    We can see the content of that blob by passing its hash (shortened for convenience) to the git cat-file -p command:

    $ git cat-file -p dc3b
    Once a file has been git added, it is hard to lose!
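The whole demo, including writing the recovered content back to a file, can be sketched as a script. It runs in a throwaway repository; the names README, test-add.recovered and the identity are made up for illustration, and the awk filter assumes that there is exactly one dangling blob, which holds for a fresh repository like this one.

```shell
# Throwaway repository with one commit (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'demo' > README
git add README
git commit -q -m 'first'

# Stage version A, then overwrite it and stage the new content.
echo 'Once a file has been git added, it is hard to lose!' > test-add
git add test-add
echo 'Ops' > test-add
git add test-add

# Pick the hash of the dangling blob out of the git fsck report.
hash=$(git fsck 2>/dev/null | awk '/^dangling blob/ {print $3}')

# Write the old content to a new file instead of overwriting test-add.
git cat-file -p "$hash" > test-add.recovered
recovered=$(cat test-add.recovered)
echo "$recovered"
```

Writing to a separate file keeps the current version of test-add untouched while you decide which content you want.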
    

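Dangling objects are eventually deleted by Git's garbage collector (git gc), which some commands trigger automatically. By default it only prunes unreachable objects that are older than two weeks, so a version that you added recently can normally still be recovered.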
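To see garbage collection at work, the sketch below forces an immediate cleanup with git gc --prune=now and counts the dangling blobs before and after. It uses a throwaway repository with made-up names. Do not run the gc command in a repository where you still hope to recover something, since it removes exactly the objects that the demo above relies on.

```shell
# Throwaway repository with one commit (all names are illustrative).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'demo' > README
git add README
git commit -q -m 'first'

# Create a dangling blob by staging two versions of the same file.
echo 'version A' > test-add
git add test-add
echo 'version B' > test-add
git add test-add

before=$(git fsck 2>/dev/null | grep -c '^dangling blob')

# Collect garbage and prune unreachable objects regardless of their age.
git gc -q --prune=now

after=$(git fsck 2>/dev/null | grep -c '^dangling blob')
echo "dangling blobs before gc: $before, after gc: $after"
```

Version B survives because it is still referenced by the index; only version A, which nothing refers to any more, is removed.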

Discussion

Discuss the findings with other course participants.