Git Internals – Basic object data storage

Well, what makes Git super fast? Let’s take a look into Git’s underbelly.

Before I begin, I will set up an empty repository.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git init
Initialized empty Git repository in /home/nikhil/dev/blog/git/.git/
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ ls -a
. .. .git

Initializing the repository creates a .git directory; its contents are listed next. Note that the objects directory holds no objects yet: Git has initialized it and created the pack and info subdirectories in it, but there are no regular files.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ tree
.
|-- branches
|-- config
|-- description
|-- HEAD
|-- hooks
|   |-- applypatch-msg.sample
|   |-- commit-msg.sample
|   |-- post-update.sample
|   |-- pre-applypatch.sample
|   |-- pre-commit.sample
|   |-- prepare-commit-msg.sample
|   |-- pre-rebase.sample
|   `-- update.sample
|-- info
|   `-- exclude
|-- objects
|   |-- info
|   `-- pack
`-- refs
    |-- heads
    `-- tags

9 directories, 12 files

At the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time. To demonstrate, you can use the plumbing command hash-object, which takes some data, stores it in your .git directory, and gives you back the key under which the data is stored. Note that hash-object is a plumbing command and is not meant for day-to-day use.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ echo 'supercompiler' | git hash-object -w --stdin
755eb4004ee1ac36d0dd51008ed6279c2fb200e5

The -w option tells hash-object to store the object; without it, the command only reports what the key would be. --stdin tells the command to read the content from stdin; if you omit it, hash-object expects the path to a file. The output from the command is a 40-character SHA-1 hash: a checksum of the content you’re storing plus a header.
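To make the "content plus a header" part concrete, here is a minimal Python sketch of how such a key can be derived for a blob. It assumes the header has the form "blob", a space, the content size in bytes, and a NUL byte; the function name git_blob_hash is my own and is not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 key Git would assign to `content` stored as a blob."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # The key is the SHA-1 of header + content, not of the content alone.
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # echo appends a trailing newline, so the newline is part of the hashed content.
    print(git_blob_hash(b"supercompiler\n"))
```

If the assumptions hold, the printed value should agree with the key that hash-object reports for the same input. Forgetting the trailing newline that echo adds is the usual reason for a mismatch.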

Let us move to the objects directory and see how the content is stored:


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ tree
.
|-- 75
|   `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- info
`-- pack

3 directories, 1 file

The objects directory now contains a file. This is how Git stores the content initially: as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA-1, and the filename is the remaining 38 characters.
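The 2 + 38 character layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Git's actual implementation: the helper name write_loose_object is hypothetical, it writes into a throwaway temporary directory rather than a real .git/objects, and it relies on the fact that Git zlib-compresses the header plus content before writing a loose object to disk.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, content: bytes) -> str:
    """Store `content` as a loose blob under `objects_dir` and return its key."""
    # What gets hashed and stored is the header followed by the raw content.
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    # The first 2 hex characters name the subdirectory, the other 38 the file.
    subdir = os.path.join(objects_dir, sha[:2])
    os.makedirs(subdir, exist_ok=True)
    with open(os.path.join(subdir, sha[2:]), "wb") as f:
        # Loose objects are zlib-compressed on disk.
        f.write(zlib.compress(store))
    return sha


if __name__ == "__main__":
    # A temporary directory stands in for .git/objects in this sketch.
    with tempfile.TemporaryDirectory() as objects_dir:
        key = write_loose_object(objects_dir, b"supercompiler\n")
        print(os.path.join(key[:2], key[2:]))
```

Because the file name is derived from the content, writing the same content twice lands on the same path, so identical content is only ever stored once.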

You can pull the content back out of Git with the cat-file command. This command is sort of a Swiss army knife for inspecting Git objects. Passing -p to it instructs the cat-file command to figure out the type of content and display it nicely for you.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ git cat-file -p 755eb4004ee1ac36d0dd51008ed6279c2fb200e5
supercompiler

Ok, let us play around a bit.

I create version 1 of a file and write it to the object store, then modify the file and write version 2 as well. The cat-file command shows the contents of both versions, and the objects directory now holds a total of three objects, each under its own hash.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 1" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
83baae61804e65cc73a7201a7252750c76066a30
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 2" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree .git/objects/
.git/objects/
|-- 1f
|   `-- 7a7a472abf3dd9643fd615f6da379c4acb3e3a
|-- 75
|   `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- 83
|   `-- baae61804e65cc73a7201a7252750c76066a30
|-- info
`-- pack

5 directories, 3 files

You can have Git tell you the object type of any object in Git, given its SHA-1 key, with cat-file -t:

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -t 83baae61804e65cc73a7201a7252750c76066a30
blob
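The type that cat-file -t reports is recorded in the object's header. As a rough sketch of the idea, the Python below decompresses a loose object and reads the type from the text before the first space. The function name object_type is hypothetical, the input is built in memory instead of being read from .git/objects, and packed objects (which cat-file also handles) are ignored here.

```python
import zlib


def object_type(raw: bytes) -> str:
    """Return the type recorded in the header of a zlib-compressed loose object."""
    data = zlib.decompress(raw)
    # The header ends at the first NUL byte; the content follows it.
    header, _, _body = data.partition(b"\0")
    # The header reads "<type> <size>", so the type is the part before the space.
    kind, _, _size = header.partition(b" ")
    return kind.decode("ascii")


if __name__ == "__main__":
    # In-memory stand-in for the bytes of a loose object file.
    content = b"version 1\n"
    raw = zlib.compress(b"blob " + str(len(content)).encode("ascii") + b"\0" + content)
    print(object_type(raw))  # prints "blob"
```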

Two things have to be mentioned here.

  1. Git does not store the file itself here: the file name is not part of the object. Git stores only the contents.
  2. The contents are stored as a blob object.
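One consequence of the first point is that two files with different names but identical contents map to the same blob key. The Python sketch below illustrates this under the same header assumption as before ("blob", a space, the size, a NUL byte); blob_key_for_file and the file name renamed-copy.txt are made up for the example.

```python
import hashlib
import os
import tempfile


def blob_key_for_file(path: str) -> str:
    """Compute the blob key for a file; note that `path` never enters the hash."""
    with open(path, "rb") as f:
        content = f.read()
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "manual.txt")
        second = os.path.join(tmp, "renamed-copy.txt")  # hypothetical second name
        for path in (first, second):
            with open(path, "wb") as f:
                f.write(b"version 1\n")
        # Different names, same contents: the keys are identical.
        print(blob_key_for_file(first) == blob_key_for_file(second))  # prints True
```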
