I want to start this blog post with a quick TL;DR before I get into musings – this post is about a rewrite of Edward Sciore’s simple DB from the book “Database Design and Implementation”, which is originally in Java, but we ported (a good chunk of it) to OCaml. If you’re interested in the book, but don’t want to do Java, I’m posting this forward to encourage you to take the code and hack away on the OCaml instead! More details later in the blog post, but the code can be found here on github: Simple DB in OCaml

Right now I am posting the main branch, but I intend to clean it up and make a subset that is much cleaner and easier to browse through, and have that as a separate branch for people who just want a core base to work on.

Now to the longer form post, where I start with some musings on the coding process and what I am aiming to get out of this, before getting into some technical details.

My friend Tony and I have been meeting up every day, monday to friday, at a local coffee shop to do job apps, practice leetcode, and more excitingly, work on our own relational database that we’re building from the ground up. I decided to puck up Sciore’s book since Phil Eaton is hosting a book club, and we’ve been following Phil’s blog for a couple months now, which made us interested to jump into the realm of databases. What’s nice about the book is that it offers a complete implementation of a database from end to end. Even though the purpose of the book is to use the Java code that’s provided and build on top of that, it provides the base implementation for each module as a core component of each chapter. I think each chapter offers just enough knowledge to get exposed to the topic at hand, and start diving into the code.

After a couple of weeks of rewriting into OCaml, however, I reached a bit of a philosophical quandary. What am I really doing? For me coding feels like a translation layer from a design or thought process into an entity that is a running process. At a certain point, using Sciore’s design and code felt more like I was translating someone else’s design and thought process into the code. At first it felt like a sufficient challenge since I had to warm up my OCaml skills, and I was also faced with the task of actually understanding the code I was creating and writing test cases for it – which is probably where I got the most benefit so far. I had this thought initially at around chapter 5, where I felt I wasn’t entirely satisfied with the design of the concurrency and transaction management systems. We decided to hold off on implementing concurrency to make our own implementation of snapshot isolation. So right now, the DB is just single threaded.

So I made it my goal to just rush through and make a base implementation of chapter 6-9, which included facilities for record page management, pipelined query processing, and parsing. I also wrote a REPL that can now execute some simple commands. This is really the departure point for us, and where I think things will get interesting. I want to make sure I am crafting my own designs now, however imperfect they may be. I think designing API’s and systems from scratch, given a base of knowledge for context, is a muscle I have not sufficiently flexed. So our next goals are to implement some basic form of indexing and then build snapshot isolation, i.e., at long last have concurrency management in our database.

Tony brought up an interesting dichotomy of approaches between two people he knows – let’s call them A and B. Approach A is you want to learn something, so you go really deep from the foundation layer and build up the entire thing, which probably delays learning the higher level concept you want to learn. Approach B is forget the lower level details, just make a wrapper API that mimics all the functionality of the lower layer, and work at the layer that interests you. I think I’ve been stuck in the mindset of approach A for quite some time – I think there can be some value to it, i.e., I definitely don’t regret learning all the layers of the database we’ve built up, but I want to get more comfortable in the second approach. In a way, approach B is more true to the whole project of computer science. Part of what makes computer science so powerful is its facility for creating abstractions and building on top of these abstractions, reducing the mental overload of handling all the details that are below an abstraction.

Enough rambling, here is a quick recap of what the database looks like.

Technical Overview

Running the database

The database is written in OCaml, and uses the dune build system along with alcotest. It should be as easy as running dune build and dune test to run all the test cases. You can also run the test executable in _build/default/test with alcotest specific commands, i.e. to pass a regex for which suite of tests to run.

There is also a simple REPL which is built as a standalone executable in the REPL directory.

File Manager

This is the lowest level of the database - it provides facilities for file management (since each table is a file on disk treated as a sequence of blocks). It includes page and block management; pages are the in memory data structure that hold a “block”, and a block is a logical identifier for a file and offset into the file on disk.

(Post in progress…)