Implementing B+ Trees On Disk With OCaml, Part 1 - File and Storage Management
In this series we will be implementing a disk based B+ tree in OCaml. Here in part 1, we will cover the file and storage managers which we will use for our B+ tree implementation. I will start with a top down introduction that will help give an intuition for what we want out of our file/storage manager design, such that it will seamlessly integrate with the actual algorithm implementation later on. In part 2, I will go over how the B+ tree nodes are laid out on disk and serialized and deserialized between disk and the in memory node representation. Finally, part 3 will cover implementation of the B+ insertion algorithm. Beyond that there will likely be posts on implementing deletion and auxiliary operations built on top of the main insertion and deletion operations.
I will assume some familiarity with the general purpose of a B+ tree: we want a disk based data structure that can perform efficient lookups on a certain key. Since disk access is very expensive, we want to minimize the number of disk operations when we are doing a look-up/insert/delete/etc., and B+ trees accomplish this by having a very high fanout in the tree structure, making the overall height of the tree very low. Even though log base 2 and log base 1000 are asymptotically the same (log(n)), a balanced tree with a branching factor of 1000 will have a much lower height.
There are many resources on the basics of B+ trees and the associated algorithms for inserting and deleting from the tree, but they miss one crucial aspect: actually storing and maintaining the B+ tree on disk. It is arguably simple enough to make an in memory B+ tree implementation, with no need for persistence between runs of the program. It is a bit more challenging to efficiently persist and run operations on a disk based B+ tree.
Code Links
The full code for the File manager, blocks, and pages can be found at this link
The full code for the storage manager can be found at this link
And the full code for the B+ tree can be found at this link
Some OCaml Syntax
Feel free to skip this part if you are familiar with OCaml; I use very bare bones OCaml in this project and I believe the code can be quite readable with some handy syntax conversions. It’s much easier to read than Rust.
If you see something like:
f a b c
This can be thought of as
f(a, b, c)
And we won’t be doing any fancy currying that is found in functional languages.
If I give an API as the following, let’s say in the file Example.ml:
t
val f1: t -> int -> some_type
I am defining an abstract datatype in the namespace of Example, with an abstract datatype Example.t, and a function f1 that takes and Example.t and an int as parameters, and returns some_type. The actual implementation of the type t and the function are in a separate implementation file. In most of the examples in this code base, the underlying type t is just a struct. The code is similar to Go or C in the sense that we will be defining functions that take some struct type as a parameter and performs an operation on the struct.
For example, in very rough C pseudocode this would be something like:
struct t { .... }
some_type f1(param1 t, param2 int);
If you see the following
let x = 5 in
...
This is the same as introducing x = 5 and having it in scope in the following code.
Finally, this one is familiar if you know Rust; let’s say we have an enum type with several cases:
type t = Leaf | Internal
We can pattern match on the code if we are inspecting an object of type t, and perform different code based on the type using a match construct
match t_instance with
| Leaf -> // Handle Leaf instance
| Internal -> // Handle Internal instance
This is similar to the following:
t = enum {leaf, internal}
if t_instance == leaf {
// Handle Leaf instance
} else if t_instance == internal {
// Handle internal instance
}
A Straw Man Design
What is the main challenge in implementing a disk backed B+ tree? It is useful to start with a straw man design to see what the issues are. The non efficient approach would be much simpler to implement: what if we had an im memory representation of the entire B+ tree, perform all operations in memory, and when we need to persist the tree we perform an entire tree serialization procedure that stores it in disk. Then, whenever we need it we can deserialize the entire tree back into memory and continue on. This would allow us with the simplicity of only worrying about the algorithm implementation for an in memory data structure, and containing the disk operations to when we need to store or fetch the tree from disk.
Of course, there are a few big issues with this approach. The first issue is that we are reading or writing the entire tree from and to disk, and the number of keys is proportional to the records we are storing in our main DB table. This would take an amount of time that is proportional to scanning the entire table from disk. Even if you only do this once and keep it cached in memory, the table we are indexing can be massive, and the B+ tree can either exceed RAM or eat up an inordinate amount of memory, making it impractial.
A Top Down Exposition
Before we get to implementing the lower level disk operations, we want to consider what we want at a higher level when implementing the B+ tree algorithm to get a sense of what we need to implement in the lower level API. We want to ensure we are keeping disk accesses to a minimum during our operations. Suppose we have a degenerate case of a B+ tree that is just a binary search tree, with a branching of two. This is what we would want to code to look like when we are searching for a key:
fun find_key key node_pointer =
node = read_from_disk node_pointer
match node.type with
| Internal ->
if key <= node.key:
then
find_key key node.left_pointer
else
find_key key node.right_pointer
| Leaf -> ...
And the traversal can be kicked off with a pointer to the root node:
find_key key root_pointer
Instead of passing around an explicit node, we will be passing along “pointers” to where the nodes can be found on disk. For a traversal, we deserialize the node from disk and then use the in memory representation within our function logic. Instead of recursively calling the function on a node, we pass a child pointer to where the next node is located on disk. This will localize our disk accesses to nodes that are only along the path of nodes are reached along the algorithm. We don’t need the entire tree in memory, we can just access the nodes we need as we go along.
If we have to make some modification to a node, we just make sure we write back to disk to the location specified by the pointer. For example, if we have a function to insert a value in a leaf node (assuming there is space in the node):
fun insert_in_leaf key record_pointer leaf_pointer =
node = read_from_disk leaf_pointer
// Insert (key, record_pointer) in appropriate location.
...
write_to_disk node leaf_pointer
With these examples in mind, we can state our initial design decision. The B+ tree will be stored in a file on disk. Each node of the B+ tree corresponds to a disk block within the file. The block size is configurable, but it is chosen to match the hardware and OS block size to ensure efficient data transfer. This means we will want to pad as many key and pointer values into the block as possible; that will be discussed in part 2.
Hence the API we want to provide for our B+ tree is based on block creation, updates, and deletion operations. The file and storage managers shouldn’t worry about what is being stored on each block, they are just providing us an API to do block access.
The File Manager
We will implement this in two levels of abstraction: the lowest level will be the file manager. The file manager is responsible for handling multiple files in the DB, and provides operations for arbitrary block operations within the file. In principle this would be sufficient, but as we will see, we want one extra layer of abstraction on top of the file manager, which will be our storage manager. First, let’s go over the file manager implementation.
The file manager has two associated types: Block_id.t and Page.t. Block_id’s are a way of referencing blocks on disk. They consist of a file name and offset pairing that uniquely determines the location of a block on disk. The Page datatype is a wrapper for a bytes buffer in memory that allows us to store blocks in memory while we are performing operations on blocks before writing them back to disk. We assume the block size is a fixed OS/Hardware determined constant that we initialize everything with at the beginning.
Block_id.t
Here is our API for block_id’s:
type t
(* Given a Block_id.t, return associated filename. *)
val file_name : t -> string
(* Return associated block offset in the file *)
val block_num : t -> int
(* Initialize block_id with params. )
val make : filename:string -> block_num:int -> t
val to_string : t -> string
val eq : t -> t -> bool
And an implementation:
type t = string * int
let make ~filename ~block_num = (filename, block_num)
let file_name (filename, _) = filename
let block_num (_, block_num) = block_num
let to_string (s, n) = Printf.sprintf "%s, %d" s n
let eq (s1, n1) (s2, n2) = s1 = s2 && n1 = n2
Page.t
And here is our API for the Page datatype. Note that there is more here than we will explicitly need, as some of these are used for other parts of the Ocaml_DB project.
type t
val make : block_size:int -> t
val from_bytes : bytes -> t
val get_int32 : t -> int -> int32
val set_int32 : t -> int -> int32 -> unit
val contents : t -> bytes
val get_bytes : t -> int -> bytes
val set_bytes : t -> int -> bytes -> unit
val get_string : t -> int -> string
val set_string_raw : t -> int -> string -> unit
val get_string_raw : t -> int -> int -> string
val set_string : t -> int -> string -> unit
val max_len : int -> int
val ( = ) : t -> t -> bool
val zero_out : t -> unit
With an implementation:
type t = bytes
let make ~block_size = Bytes.make block_size '\000'
let from_bytes b = b
let get_int32 page offset = Bytes.get_int32_ne page offset
let set_int32 page offset n = Bytes.set_int32_ne page offset n
let contents page = page
let get_bytes page offset =
let len = Int32.to_int (get_int32 page offset) in
Bytes.sub page (offset + 4) len
let set_bytes page offset b =
assert (offset + 4 + Bytes.length b <= Bytes.length page);
let len = Bytes.length b in
let _ = set_int32 page offset (Int32.of_int len) in
Bytes.blit b 0 page (offset + 4) len
let set_string_raw page offset s =
let b = Bytes.of_string s in
let len = Bytes.length b in
Bytes.blit b 0 page offset len
let get_string_raw page offset n = Bytes.to_string (Bytes.sub page offset n)
let get_string page offset = Bytes.to_string (get_bytes page offset)
let set_string page offset s = set_bytes page offset (Bytes.of_string s)
(* only ascii encoding for now *)
let max_len l = 4 + l
let ( = ) = Bytes.equal
let zero_out page = Bytes.fill page 0 (Bytes.length page) '\000'
The Page is really a wrapper for the OCaml bytes datatype, so that we can provide a constrained API for our page functionality on top of a mutable Bytes array. For example, the set_string_raw method takes a page type that has been initialize already, an offset within the page, and a string. It converts the string into bytes, and uses an underlying Bytes.blit method that sets the bytes at an offset within the page. The methods for this are all fairly straighforward.
An OCaml specific aside: 31 bit ints, and endian-ness.
One peculiarity here is the get_int32 and set_int32 functions. OCaml ints aren’t actually 32 bits, they are 31 bits with the extra bit used for the OCaml runtime environment. This poses a bit of an issue if we want to encode everything as 32 bits, on disk. So the choice made here was to use the OCaml int32 datatype at the page level, even though a the rest of the database uses regular ocaml ints (since there aren’t cases where we actually need values that use up all 32 bits).
Also, we use the get_int32_ne and set_int32_ne functions, where ne stands for native encoding of big or little endianess. We have our int32 on the programmatic level, and this will ensure that whenever we get an int or set an int it will do it in native encoding. This makes the file on disk non portable between different machines, but for any particular DB instance running on any particular machine, it will produce the correct results, since we always serialize to native encoding and deserialize back from native encoding. To make the disk file portable between machines, we would have to pick a specific endianess and make sure we always serialize/deserialize ints with that endianess at this point of the code. For now the database is not distributed so it isn’t an issue yet, but it isn’t a complicated fix to make later on, especially since we can maintain the Page API to users.
File Manager Implementation
Now we can implement our file manager. Some stuff in here is specific to having it integrated with the database, so I will gloss over those components and just go over the core functionality we need for building our B+ tree storage manager. Here is the API for the file manager:
type t
(* Initialize file manager with directory for files and block size. *)
val make : db_dirname:string -> block_size:int -> t
val is_new : t -> bool
val get_blocksize : t -> int
val get_file : t -> string -> Unix.file_descr
(* Read block in the file into provided page. *)
val read : t -> Block_id.t -> Page.t -> unit
(* Write a page to a block in the file. *)
val write : t -> Block_id.t -> Page.t -> unit
(* Appen a block in a file, and return the block identifier.* )
val append : t -> string -> Block_id.t
(* Return number of blocks in the file. *)
val size : t -> string -> int
Let’s dissect the API a bit before I provide the implementation. The constructor function is a bit specific to our DB, since it takes a directory where we will want to put our files. The more important aspect is the block_size parameter, which will then be used throughout our code as the block size for everything. No files are actually created with the constructor. Internally, the make function will also initialize the file manager with a hashtable of active file descriptors for each file in the database.
get_file can be used to create a new file; if it doesn’t exist then a new file is created and a file descriptor is returned.
The read, write, and append operations give us arbitrary block level access within the file. This is very similar to what we already want for ou B+ tree, but as we will see we want to add some constraint and file organization on top of this.
Here is the implementation:
type t = {
is_new : bool;
db_dir : Unix.dir_handle;
db_dirname : string;
block_size : int;
open_files : (string, Unix.file_descr) Hashtbl.t;
}
exception InitDbErr
exception FileMgrReadErr
let rec clean_temp_dirs db_dirname db_dir =
try
let cur_file = Unix.readdir db_dir in
if String.length cur_file >= 4 && String.sub cur_file 0 4 = "temp" then (
Sys.remove (Filename.concat db_dirname cur_file);
clean_temp_dirs db_dirname db_dir)
else clean_temp_dirs db_dirname db_dir
with End_of_file -> ()
(* File Manager constructor. *)
let make ~db_dirname ~block_size =
(* Create open file handler for DB directory *)
let db_dir, is_new =
try
let stat = Unix.stat db_dirname in
if stat.st_kind = Unix.S_DIR then (Unix.opendir db_dirname, false)
else raise InitDbErr
with
(* If it doesn't exist already, create it. *)
| Unix.Unix_error (Unix.ENOENT, _, _) ->
let _ = Unix.mkdir db_dirname 0o755 in
(Unix.opendir db_dirname, true)
| _ -> raise InitDbErr
in
(* Remove leftover temporary tables. *)
clean_temp_dirs db_dirname db_dir;
Unix.rewinddir db_dir;
let open_files = Hashtbl.create 10 in
{ is_new; db_dir; db_dirname; block_size; open_files }
let is_new file_mgr = file_mgr.is_new
let get_blocksize file_mgr = file_mgr.block_size
let get_file file_mgr filename =
match Hashtbl.find_opt file_mgr.open_files filename with
| Some fd -> fd
| None ->
let full_path = Filename.concat file_mgr.db_dirname filename in
let fd = Unix.openfile full_path Unix.[ O_RDWR; O_CREAT; O_SYNC ] 0o755 in
Hashtbl.add file_mgr.open_files filename fd;
fd
let read file_mgr block page =
let fd = get_file file_mgr (Block_id.file_name block) in
let offset = Block_id.block_num block * file_mgr.block_size in
let _ = Unix.lseek fd offset SEEK_SET in
let n = Unix.read fd (Page.contents page) 0 file_mgr.block_size in
if n = 0 then Page.zero_out page else ()
(* Since Unix.write doesn't guarantee writing all n bytes,
we have a helper function to repeatedly call write until we
have written all n bytes.
Note there is a possible uncaught exception here, if anything
goes wrong with writing.
*)
let rec write_n fd page offset n =
if n = 0 then ()
else
let bytes_written = Unix.write fd page offset n in
write_n fd page (offset + bytes_written) (n - bytes_written)
let write file_mgr block page =
let fd = get_file file_mgr (Block_id.file_name block) in
let offset = Block_id.block_num block * file_mgr.block_size in
let _ = Unix.lseek fd offset SEEK_SET in
write_n fd (Page.contents page) 0 file_mgr.block_size
let size file_mgr filename =
let _ = get_file file_mgr filename in
let full_path = Filename.concat file_mgr.db_dirname filename in
let stat = Unix.stat full_path in
stat.st_size / file_mgr.block_size
let append file_mgr filename =
let block_num = size file_mgr filename in
let block = Block_id.make ~filename ~block_num in
let b = Bytes.make file_mgr.block_size '\000' in
let fd = get_file file_mgr filename in
let _ = Unix.lseek fd (block_num * file_mgr.block_size) SEEK_SET in
write_n fd b 0 file_mgr.block_size;
block
Note the use of the OCaml Unix library, which provides us with OS system calls to handle the file operations. I will briefly cover the write function. We get the file descriptor from our file manager, then using the block offset in the file (the block we are trying to write to), we calculate the byte offset within the file. Then we use the lseek system call to position at the start of where we want to write to. Then we call an auxiliary write_n function, which recursively writes until all n bytes are written. This is due to the semantics of the Unix.write system call: it is not guaranteed that all n bytes will be written at once, rather we get the number of bytes successfully written, then keep trying to write the other bytes. Note there is an uncaught exception here if anything goes wrong.
Storage Manager
Although the file manager already gives us block level operations, we are going to build the main B+ tree sotrage manager that is used for the B+ tree on top of the file manager. The main reason for this is that we will want a to implement a free list of deallocated blocks, so that when we delete a node and its associated block from the B+ tree, we can add the block to a free list from which future append operations can fetch new blocks, instead of having to append at the end of the file. In principle, this could be implemented within the file manager as well. I thought it would be better to separate the two, however, so that any module can build on top of the file manager and handle block layout how it sees fit. Also, whereas the file manager has a one to many relationship between itself and the files it is managing, the storage manager will have a one to one relationship between itself and the B+ tree storage file it is handling.
Here is our API for the storage manager:
type t
val make : File_manager.t -> string -> t
val append : t -> Page.t -> Block_id.t
val delete : t -> Block_id.t -> unit
val update : t -> Block_id.t -> Page.t -> unit
val update_block_num : t -> int -> Page.t -> unit
val get_block : t -> int -> Page.t
This is very similar to the file manager, with the difference being the delete function and an internal free list from which we can fetch new nodes whenever we append. When we delete a node from the tree (hence deleting the associated block using this API), the block is added to a free list for future use.
If we didn’t have a free list, then everytime we deleted a node we would have three basic options: 1) The file is append only, and deleted blocks will sit around taking up space (Perhaps with an intermittent garbage collection stage) or 2) we explicitly shift everything after a deleted block, which is costly, or 3) we can take a used block at the end, after the deleted block, and put it in place of the deleted block; however this would require the additional complexity of pointer manipulation, to make sure that any nodes pointing to that block are updated to the new location; this means explicitly maintaining this information and performing additional disk reads and writes to update the pointers.
The downside with the free list approach is in the kind of scenario where we have many inserts followed by many deletes, which will produce a lot of unused blocks in the file, hanging around in the free list. To remedy this we may add a garbage collection step in the future, similar to the remedy for approach #1.
Our storage manager will always reserve the first block in the file to point to the first element of the free list. Initially, the head block is initialized to 0, since it is not pointing anywhere. Here is what the initialization function looks like:
let make ~file_manager ~storage_file =
let block_size = File_manager.get_blocksize file_manager in
let head_page = Page.make ~block_size in
let block = Block_id.make ~filename:storage_file ~block_num:0 in
if File_manager.size file_manager storage_file = 0 then (
Page.set_int32 head_page 0 (Int32.of_int 0);
File_manager.write file_manager block head_page
) else
File_manager.read file_manager block head_page;
{file_manager; storage_file; head_page}
Note that this may be initialized to an existing B+ tree in the file, i.e. on startup. If the associated file has 0 blocks in it, then we initialize the first block to just contain all 0’s, then write it to disk. Otherwise the file is empty, and we read the first block into a page, which we keep around fresh in memory. Since we’re keeping an active copy in memory, we have to ensure that we keep it updated while we perform operations.
Next, let’s look into how we can append new blocks to our storage manager. Here is an example scenario from the test cases:
Assume the storage manager starts out empty, so we just have block 0 which doesn’t point anywhere. The contents of our storage manager are as follows:
Block 0: 0
Then, let’s append 5 blocks. We’ll put the random value 42 in each block. Note that the block numbering scheme for block N is really the N’th offset block, or if we are numbering the blocks from 1 it is the (N+1)st block in the file. So block 5 is at offset 5, but if we number from 1 it is the 6th block in the file. We will stick to offsets.
Block 0: 0
Block 1: 42
Block 2: 42
Block 3: 42
Block 4: 42
Block 5: 42
Then we delete block 5:
Block 0: 5
Block 1: 42
Block 2: 42
Block 3: 42
Block 4: 42
Block 5: 0
Free list: 5
Now block 0 is pointing to block 5, and block 5 is the end of the free list and has the value 0 to denote it is the end of the free list (not to denote that block 0 is itself the next element in the free list, as block 0 will never be used for storage).
Let’s now delete block 2:
Block 0: 2
Block 1: 42
Block 2: 5
Block 3: 42
Block 4: 42
Block 5: 0
Free list: 2 -> 5
What happened here? When we add a new element to the free list, we want to always append it to the front of the list to make it an O(1) operation in terms of disk reads and writes. What happened here was we inspected the current head of the list, which points to 5. We overwrite 5 into block 2, which is the same as inserting it into the front of the list; then we have block 0 point to block 2 instead of block 5. Since we are always keeping block 0 in memory, this means we only have to perform one disk read and write to update block 2.
Let’s delete blocks 1 and 3 in that order:
Block 0: 3
Block 1: 2
Block 2: 5
Block 3: 1
Block 4: 42
Block 5: 0
Free list: 3 -> 1 -> 2 -> 5
Now if we want to append a new block, we can fetch from the free list. We will append the value 9999 in a new block:
Block 0: 1
Block 1: 2
Block 2: 5
Block 3: 9999
Block 4: 42
Block 5: 0
Free list: 1 -> 2 -> 5
Before the insert operation, 3 was the head of the free list. So we fetched block 3, which was pointing to block 1 in the free list. We update the root node to point to block 1, and insert our new value 9999 into block 3.
Here is the full implementation for the storage manager. Note that the storage manager relies on the file manager for the implementation, so that we do not have to worry about low level file operations in the implementation. Rather, the main logic being implemented here is for the free list, and providing block allocation/deletion/updates using the free list approach:
open File
(* Note on block layout:
Block 0 contains metadata.
It only modifies the first 4 bytes, for the free list, and keeps other data consistent.
Programs using the storage manager can modify the remaining block space as necessary.
All the other blocks not currently in the free list are being
used.
*)
type t = {
file_manager: File_manager.t;
storage_file: string;
mutable head_page: Page.t;
}
let get_head_page ~storage_manager = storage_manager.head_page
let set_head_page ~storage_manager page =
storage_manager.head_page <- page;
let block_id = Block_id.make ~filename:storage_manager.storage_file ~block_num:0 in
File_manager.write storage_manager.file_manager block_id storage_manager.head_page
let make ~file_manager ~storage_file =
let block_size = File_manager.get_blocksize file_manager in
let head_page = Page.make ~block_size in
let block = Block_id.make ~filename:storage_file ~block_num:0 in
if File_manager.size file_manager storage_file = 0 then (
Page.set_int32 head_page 0 (Int32.of_int 0);
File_manager.write file_manager block head_page
) else
File_manager.read file_manager block head_page;
{file_manager; storage_file; head_page}
let append ~storage_manager ~page =
let fm = storage_manager.file_manager in
let block_size = File_manager.get_blocksize fm in
let sfile = storage_manager.storage_file in
let head_page = storage_manager.head_page in
let head_ptr = Block_id.make ~filename:sfile ~block_num:0 in
let next_ptr = Int32.to_int (Page.get_int32 head_page 0) in
(* if next is 0, free list is empty so we append at end of file. *)
if next_ptr = 0 then (
let blocksize = File_manager.get_blocksize fm in
let block = File_manager.append fm sfile in
File_manager.write fm block page;
block
(* otherwise, we get a block from the free list*)
) else
(* Read first element from free list into a page*)
let next_ptr = Block_id.make ~filename:sfile ~block_num:next_ptr in
let next_page = Page.make ~block_size in
File_manager.read fm next_ptr next_page;
(* Save pointer from first element in list, and update head pointer.*)
let next_next_ptr = Page.get_int32 next_page 0 in
Page.set_int32 head_page 0 next_next_ptr;
File_manager.write fm head_ptr head_page;
(* Finally, write append data to block we fetched from the freelist.*)
File_manager.write fm next_ptr page;
next_ptr
let delete ~storage_manager ~block =
let fm = storage_manager.file_manager in
let block_size = File_manager.get_blocksize fm in
let sfile = storage_manager.storage_file in
let head_page = storage_manager.head_page in
let head_ptr = Block_id.make ~filename:sfile ~block_num:0 in
let next_ptr = Page.get_int32 head_page 0 in
let page = Page.make ~block_size in
Page.set_int32 page 0 next_ptr;
Page.set_int32 head_page 0 (Int32.of_int (Block_id.block_num block));
File_manager.write fm head_ptr head_page;
File_manager.write fm block page
let update ~storage_manager ~block ~page =
let fm = storage_manager.file_manager in
let block_size = File_manager.get_blocksize fm in
let sfile = storage_manager.storage_file in
File_manager.write fm block page
let update_block_num ~storage_manager ~block_num ~page =
let block = Block_id.make ~filename:storage_manager.storage_file ~block_num in
update ~storage_manager ~block ~page
let get_block ~storage_manager ~block_num =
let block_size = File_manager.get_blocksize storage_manager.file_manager in
let page = Page.make ~block_size in
let block = Block_id.make ~filename:storage_manager.storage_file ~block_num in
File_manager.read storage_manager.file_manager block page;
page
Thanks for tuning in, in part 2 we will define the B+ tree datatypes and how to serialize/deserialize these on disk!