SQLite on Git, Part I: The .git folder - Falling down the Rabbithole

Join me on a jump right down the rabbit hole of what we find when we look closer at the .git folder. We'll explore how git stores your code in loose objects and discover a flag in a 30-year-old library used by git that could let us run databases inside git.

What has happened so far

In the prologue, I explained why random access matters—the ability to read specific parts of files stored in Git without decompressing the entire thing—which is essential for running SQLite databases on top of Git's storage.

Now that we have done the sane reasoning part - let's go crazy.

How git stores your data - the .git folder

Whether you've been using git for 6 months or 6 years, you've probably never looked inside the .git folder. Let's change that.

I assume that you're using Git already. If you want to learn how to use it - you may want to check out https://git-scm.com/learn or Git Full Course first. But what you're most likely less familiar with is the .git folder that lives in all your repos.

The following section describes the parts of the .git data structure needed to understand the technique used to enable random access. It is inspired by the great Pro Git book by Scott Chacon. The parts excerpted and described here should be sufficient to follow along.

If you are already familiar with git's underlying data structure - feel free to jump ahead to the compression section.

Loose objects - a snapshot of the content of the file

One misconception I had for a long time: Git does NOT store differences, it stores snapshots. What does that mean? Whenever you make a commit, git keeps a copy of each file's content as it was at the point of the commit.

Git writes the snapshot to its internal database the moment you add the file to the staging area - the action you do before you commit.

It does that by creating so-called loose objects. You find them inside .git/objects/.

So let's play this through - follow these steps:

  1. We create an empty folder

mkdir my_ducktale_test_repo

  2. We change the working directory into that folder

cd my_ducktale_test_repo

  3. We create a new repo

git init

Initialized empty Git repository in /Users/your_user/git_temp/.git/
  4. Let's look into the repo's .git folder - more specifically into its internal objects folder

ls -1 .git/objects

info
pack

As you can see, there are no objects yet - only the info and pack directories which are empty placeholders.

  5. Create a readme.md file containing Hello World

echo -n "Hello World" > readme.md

(The -n flag prevents adding a newline, which would change the hash)

You now have a fresh repo with one file that git doesn't know about just yet. A look into .git/objects still shows only the two folders from before.

ls -1 .git/objects

info
pack

  6. To make git aware of the readme file we created before, we have to add it to git:

git add readme.md

(No output - git add runs silently)

  7. Now let's take a look at that objects folder again.

ls -1 .git/objects

5e
info
pack

We see that a new folder 5e was created - let's look into that one as well.

ls -1 .git/objects/5e

1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

What is that weirdly named file? Let's look into it.

cat .git/objects/5e/1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

xK??OR04e?-P?H????/?I?R?

Cool! So git turned "Hello World" into that weirdly named compressed file. But here's the problem: if this was a 1GB file, git would still need to decompress the ENTIRE thing just to read the last byte. Remember the random access problem from the prologue? That's what we're trying to solve.

Ok we have a weirdly named file with some weird content in it. Let's untangle that.

The weird filename - an address for the content of the file

As outlined earlier, git keeps a copy of every state that you commit. It stores that content so that it can be addressed by the content itself. What does that mean?

In the given case "Hello World" gets a "unique identifier" of 40 hexadecimal characters: 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689. Git uses SHA-1 - a hash function that's like a digital fingerprint. Same content = same hash, always.

SHA1(Header+"Hello World") -> 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

Note

We're gonna look into the Header a bit later

You can generate this unique identifier (hash) yourself by calling a command in git:

git hash-object readme.md

5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

For git the file name doesn't play a role (at that stage). So piping the content into git's hash-object function directly produces the exact same hash.

echo -n "Hello World" | git hash-object --stdin

5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
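If you'd rather see what hash-object does under the hood, here is a minimal sketch in Node.js. It assumes the header has the form `blob <size>` followed by a null byte (the header is explained in the compression section); the function name gitBlobHash is made up for this example.

```javascript
const { createHash } = require("crypto");

// Compute the object ID git assigns to `content` when it is stored as a blob.
// gitBlobHash is a hypothetical helper name, not part of git or Node.js.
function gitBlobHash(content) {
  const body = Buffer.from(content, "utf-8");
  // Header: object type, a space, the content size in bytes, then a null byte.
  const header = Buffer.from(`blob ${body.length}\0`, "utf-8");
  return createHash("sha1").update(header).update(body).digest("hex");
}

console.log(gitBlobHash("Hello World"));
// 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689
```

The printed value should match what git hash-object returned above, because both hash the same header plus content.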

So if we have the content, we know its unique identifier (hash) - and if we have the hash of some content, we know the path (or address) to find it. The address for "Hello World" is 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689 and the path to it is ".git/objects/5e/1c309dae7f45e0f39b1bf3ac3cd9db12e7d689".

Note

Git splits the hash after the first two characters and uses them as the name of a subfolder, which spreads the objects over up to 256 folders (00 to ff). A repo easily holds thousands of objects - this trick reduces the number of files per folder by a factor of roughly 256.
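The mapping from hash to file path is simple enough to sketch in a few lines. The helper name looseObjectPath and its gitDir parameter are assumptions for illustration, not something git exposes.

```javascript
const path = require("path");

// Turn a 40-character object hash into the path of its loose object file.
// looseObjectPath is a hypothetical helper name used only for this sketch.
function looseObjectPath(hash, gitDir = ".git") {
  if (!/^[0-9a-f]{40}$/.test(hash)) {
    throw new Error("expected a 40-character lowercase hex hash");
  }
  // First two characters name the subfolder, the remaining 38 name the file.
  return path.join(gitDir, "objects", hash.slice(0, 2), hash.slice(2));
}

console.log(looseObjectPath("5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689"));
```

On macOS or Linux this prints the same path we looked at with cat earlier.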

The weird characters in the file behind the address

Got it - those 40 characters point to that file but those question marks don't look like Hello World. The answer is pretty straightforward here - git compresses the data.

Luckily git comes with a tool to give us the content behind the address.

git cat-file -p 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

Hello World

Nice! We see the content of the file Hello World again.

What we learned so far: Git snapshots the state of a file and stores it under an address that a hash function derives from the content itself. This concept is called "content addressable storage".

Think of it like a library where books are filed by their ISBN instead of by author. If you know the ISBN (the hash), you know exactly where to find the book - no catalog needed! This is the core of how git structures data, and it's brilliant because the same content always produces the same address.

This means that if we have the address of the content (only 40 characters), we can access the content. How git uses the address internally is not important for the next steps, and we will cover it in a later article.
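To make the idea of content addressable storage tangible, here is a toy in-memory store. It is a simplified sketch, not how git implements it: the ContentStore class is made up, it keeps everything in a Map, and it hashes the raw content without git's header, so its addresses differ from git's.

```javascript
const { createHash } = require("crypto");

// A toy content-addressable store: the key is derived from the value itself.
// Simplification: no git header is added, so addresses differ from git hashes.
class ContentStore {
  constructor() {
    this.objects = new Map();
  }

  // Store the content and return the address it can be found under.
  put(content) {
    const data = Buffer.from(content, "utf-8");
    const address = createHash("sha1").update(data).digest("hex");
    this.objects.set(address, data); // same content overwrites the same slot
    return address;
  }

  // Look the content up by its address; returns undefined if unknown.
  get(address) {
    return this.objects.get(address);
  }
}

const store = new ContentStore();
const first = store.put("Hello World");
const second = store.put("Hello World");
console.log(first === second);             // true: same content, same address
console.log(store.get(first).toString());  // Hello World
```

Storing the same content twice costs nothing extra, which is one reason snapshots are cheaper than they sound.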

Going one step deeper - how git compresses loose objects

Note

Want to see how to decompress git objects with JavaScript? Skip this if you don't have Node.js - I'll show you the result anyway!

So git cat-file -p [hash] provides us with the file content - served directly from the git folder. This is great - but we want to only read a fraction. Since Git doesn't provide this out of the box let's check out how Git stores the data.

Git uses zlib to compress – deflate – the loose objects (compare pro-git-book).

So let's uncompress (inflate) the object - I'm gonna switch to JavaScript since this is the language most devs will be able to follow, it runs in the browser... many more arguments but the main reason is my Rust skills...

To run this code locally, you need Node.js installed first.

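node -e '
// Node.js built-ins: file access and zlib (de)compression
const { readFileSync } = require("fs");
const { inflateSync } = require("zlib");

// Read the raw, zlib-compressed loose object from disk
const compressedData = readFileSync(".git/objects/5e/1c309dae7f45e0f39b1bf3ac3cd9db12e7d689");

// Inflate it and print the result as text
const inflated = inflateSync(compressedData);
console.log(inflated.toString("utf-8"));
'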

blob 11Hello World

Ok the result blob 11Hello World looks way less cryptic. Hello World is obviously the content of the file but what is blob 11? That's the header - and it's actually structured like this:

blob 11[null byte]Hello World

The header tells us two things:

  1. The type of the object, here blob (file content)
  2. The size of the object, here 11 (11 bytes of content)

Git adds a null byte (an invisible character) after the header to separate it from the actual content. Knowing the size up front helps git allocate the right amount of memory and verify integrity when inflating the object.
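Here is a small sketch that builds such an object in memory and takes it apart again, so you can see the header fields without touching the .git folder. The function names encodeBlob and decodeObject are made up for this example.

```javascript
const { deflateSync, inflateSync } = require("zlib");

// Build the bytes of a loose blob: header, null byte, content, then deflate.
// encodeBlob and decodeObject are hypothetical helper names for this sketch.
function encodeBlob(content) {
  const body = Buffer.from(content, "utf-8");
  const header = Buffer.from(`blob ${body.length}\0`, "utf-8");
  return deflateSync(Buffer.concat([header, body]));
}

// Inflate a loose object and split it into type, declared size and content.
function decodeObject(compressed) {
  const raw = inflateSync(compressed);
  const nul = raw.indexOf(0); // the first null byte ends the header
  if (nul === -1) throw new Error("missing null byte after header");
  const [type, sizeText] = raw.subarray(0, nul).toString("ascii").split(" ");
  const content = raw.subarray(nul + 1);
  const size = Number(sizeText);
  if (size !== content.length) throw new Error("size in header does not match content");
  return { type, size, content };
}

const decoded = decodeObject(encodeBlob("Hello World"));
console.log(decoded.type, decoded.size, decoded.content.toString("utf-8"));
// blob 11 Hello World
```

decodeObject should also accept the bytes of the real file under .git/objects, since that file is the same header plus content run through zlib.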

Note

git knows 3 other types: tree, commit, and tag. Read more here

The compression is the reason why reading from the middle or the end of those blobs is not possible. Zlib always deflates from start to end, and inflation works the same way. Git must always begin inflating at the start and continue to the point of interest. Since a zlib stream is sequential, partial reads are impossible without decompressing everything up to the bytes you want to read.
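To see what that costs, here is a naive sketch of reading a byte range out of a loose object. The helper readRange is made up for this example, and for simplicity it inflates the whole object rather than stopping at the requested bytes, but either way it has to start at byte 0.

```javascript
const { deflateSync, inflateSync } = require("zlib");

// Read `length` bytes of content starting at `start` from a loose object.
// readRange is a hypothetical helper; note that it inflates from the very
// beginning even if we only want the last few bytes.
function readRange(compressed, start, length) {
  const raw = inflateSync(compressed);       // no way to skip ahead
  const contentStart = raw.indexOf(0) + 1;   // content begins after the header
  return raw.subarray(contentStart + start, contentStart + start + length);
}

// Build a loose-object-like buffer in memory so the sketch is self-contained.
const object = deflateSync(Buffer.from("blob 11\0Hello World", "utf-8"));
console.log(readRange(object, 6, 5).toString("utf-8"));
// World
```

For 11 bytes nobody cares, but for a database file of several hundred megabytes every small read would pay for a full inflate.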

Impossible? Huh - let's google that. A search for "random access zlib possible" revealed a very interesting approach by Heng Li: Random access to zlib compressed files. Turns out genomics researchers who deal with files >100GB found a way to use a parameter called Z_FULL_FLUSH when deflating data, which then allows random seeks into the compressed file.

The key insight: when the compressor flushes with Z_FULL_FLUSH at regular intervals, zlib resets its compression state at each of those points, creating checkpoints where decompression can start independently - like chapters in a book instead of one continuous stream.
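A rough sketch of the checkpoint idea in Node.js follows. It rests on a few assumptions: it uses raw deflate (no zlib header or checksum) to keep things short, it compresses each block with a separate call that ends in Z_FULL_FLUSH rather than flushing one long-running stream, and the names compressWithCheckpoints and readBlock are made up. It is not the block-based library from the next article, and the result is not a git-compatible object.

```javascript
const zlib = require("zlib");

const { Z_FULL_FLUSH, Z_SYNC_FLUSH } = zlib.constants;

// Compress `data` block by block and record where each block starts.
// compressWithCheckpoints and readBlock are hypothetical names for this sketch.
function compressWithCheckpoints(data, blockSize) {
  const parts = [];
  const offsets = []; // byte offset of each block inside the compressed output
  let written = 0;
  for (let pos = 0; pos < data.length; pos += blockSize) {
    const block = data.subarray(pos, pos + blockSize);
    // Ending a block with Z_FULL_FLUSH byte-aligns the output and leaves no
    // back-references into earlier data, so the next block can be inflated alone.
    const part = zlib.deflateRawSync(block, { finishFlush: Z_FULL_FLUSH });
    offsets.push(written);
    parts.push(part);
    written += part.length;
  }
  // An empty, finished stream supplies the final block that ends the whole stream.
  parts.push(zlib.deflateRawSync(Buffer.alloc(0)));
  return { compressed: Buffer.concat(parts), offsets };
}

// Inflate only block `index`, starting at its checkpoint.
function readBlock({ compressed, offsets }, index) {
  const end = index + 1 < offsets.length ? offsets[index + 1] : compressed.length;
  const slice = compressed.subarray(offsets[index], end);
  // Z_SYNC_FLUSH lets inflate return its output even though the slice
  // does not contain the final block of the stream.
  return zlib.inflateRawSync(slice, { finishFlush: Z_SYNC_FLUSH });
}

const data = Buffer.from("0123456789".repeat(400), "utf-8"); // 4000 bytes
const archive = compressWithCheckpoints(data, 1000);

// Jump straight to the third block without touching the first two.
console.log(readBlock(archive, 2).equals(data.subarray(2000, 3000)));
// The concatenated output is still one stream a normal inflate can read.
console.log(zlib.inflateRawSync(archive.compressed).equals(data));
```

The price is a slightly worse compression ratio, since every block starts without any history to refer back to, plus an index of offsets that has to be stored somewhere.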

Oh my god... what if I could use the same mechanism with objects in Git?

That's a lead. What followed this discovery was a sleepless night, a deep dive into zlib, and an implementation of a block-based compression library.

In the upcoming article we're gonna look behind the curtain - look deeper into zlib, look at the implementation of my block-based compression library, and use it to random seek into a loose object that is compatible with Git. Stay tuned!