Implementing B+ Trees On Disk With OCaml, Part 2 - Disk Layout, Memory Representation, and (De)Serialization.
Welcome back to part 2 of this series, in which we are implementing a disk-backed B+ tree. In this part we will go over the B+ tree in-memory data structure, how to serialize that data structure into a disk layout, and how to deserialize the disk layout back into the in-memory node type.
In [part 1](https://artjomplaunov.github.io/database/b+/tree/2025/01/13/btrees1.html) we implemented the file and storage managers. This provided us with a block access API to fetch, update, and delete blocks on disk. Each block on disk stores a node of our B+ tree. Let's start with what we need to store in each B+ tree node.
In B+ trees, we will have N keys per node and an N+1 branching factor for each node. For example, if we have N = 2 keys per node, we have a branching factor of 3. For a key K_i in an internal node, the pointer immediately to its left points to the subtree containing keys K with K < K_i, and the pointer immediately to its right points to the subtree containing keys K with K >= K_i.
Here is an example B+ tree, with N = 2.
                               ┌─────────────────────────┐
                               │ P1 │ 20 │ P2 │ 40 │ P3  │
                               └──┬────────┬─────────┬───┘
                                  │        │         │
             ┌────────────────────┘        │         └─────────────────────┐
             ▼                             ▼                               ▼
┌─────────────────────────┐    ┌─────────────────────────┐    ┌─────────────────────────┐
│ R1 │ 10 │ R2 │ 19 │ SP1 │───►│ R3 │ 25 │ R4 │ 30 │ SP2 │───►│ R5 │ 40 │ R6 │ 50 │ Nil │
└─────────────────────────┘    └─────────────────────────┘    └─────────────────────────┘
Internal nodes and leaf nodes will have the same exact layout, but the pointer fields can have different meanings. Internal node pointers are used to traverse the B+ tree according to the search property. At the leaf level, since we don’t have any further to search, the pointers will have a different meaning. Each pointer directly to the left of a key on the leaf level will be a record pointer, which points to a location where we can find the record with the corresponding search key value. Since we have N keys per node, this requires N pointers, but we still have one pointer left over. This last pointer will be used by the B+ tree as a sibling pointer, to point to the next node on the leaf level. Note that in the initial implementation, we will not have sibling pointers on internal levels. For now they are only at the leaf level.
Node Data Structure
Let's get the in-memory representation out of the way first, since that is the easy part. Here is the node type for the B+ tree:
type node = {
  (* Leaf | Internal *)
  mutable node_type : node_type;
  (* Block offset pointer to parent. *)
  mutable parent : int;
  (* Current number of keys. *)
  mutable cur_size : int;
  keys : KeyType.value array;
  pointers : int array;
  (* Max number of keys that can be stored. *)
  capacity : int;
  key_type : KeyType.t;
}
We have a node_type field which tells us if the node is a leaf node or internal node. We maintain a parent pointer since that will be necessary for implementing B+ tree operations later on. cur_size is the current number of keys being stored in a node, and capacity is the total number of keys that we are allowed to store. As we will see, this will be defined by the corresponding block size on disk, since we will try to fit as many key/pointer pairs on a block as we can.
We have two separate arrays for storing keys and pointers. The keys array will be initialized to size capacity, and pointers will be size (capacity+1). We will also use N as a shorthand for the computed capacity, so a node holds N keys and (N+1) pointers. This isn't important yet, but pinning these conventions down now makes it easier to reason through the implementation later, without getting confused about what we mean by the (N+1)'st pointer, the N'th key, and so on.
Capacity = number of keys we can store in a node. Capacity and the variable “N” will be synonymous. Since we have N keys per node, we will have an N+1 branching factor, or N+1 pointers per node.
This point is very important throughout the implementation: since we are storing the keys and pointers as two separate arrays, we need an easy mapping between this in-memory representation and the visual B+ tree layout above. We can think of the pointer at offset i in the pointers array as the left pointer of the key at offset i in the keys array. There is one extra pointer, the very last pointer in the node: in an internal node it is the right pointer of the final key in the keys array; in a leaf node it is the sibling pointer field, so even if there are fewer keys than capacity, this slot is always occupied by the sibling pointer.
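To make this convention concrete, here is a tiny hypothetical sketch (these helpers are not part of the original code) of how the left pointer, right pointer, and sibling pointer would be looked up under the mapping just described:

(* Hypothetical accessors illustrating the array convention described above. *)
let left_pointer node i = node.pointers.(i)
let right_pointer node i = node.pointers.(i + 1)

(* In a leaf node, the final slot of the pointers array is always the sibling pointer. *)
let sibling_pointer node = node.pointers.(node.capacity)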
Key Types
Notice that the keys array has its own key type, whereas pointers are just ints. We will describe the significance of pointers as ints in a bit (it is mainly just a lazy hack for now), but for keys it was worthwhile to make an abstract type right away, as it can be extended to handle more key types later on.
module KeyType = struct
  type t = TVarchar of int | TInteger
  type value = Varchar of string | Integer of Int32.t

  let ( < ) k1 k2 =
    match (k1, k2) with
    | Varchar s1, Varchar s2 -> s1 < s2
    | Integer n1, Integer n2 -> n1 < n2
    | _ -> failwith "incomparable keys"

  let ( = ) k1 k2 =
    match (k1, k2) with
    | Varchar s1, Varchar s2 -> s1 = s2
    | Integer n1, Integer n2 -> n1 = n2
    | _ -> failwith "incomparable keys"

  let ( <= ) k1 k2 = k1 < k2 || k1 = k2
  let ( > ) k1 k2 = not (k1 <= k2)
  let ( >= ) k1 k2 = not (k1 < k2)

  let string_of_key k =
    match k with
    | Varchar s -> Printf.sprintf "Varchar %s" s
    | Integer d -> Printf.sprintf "Integer %d" (Int32.to_int d)

  let sizeof_key key_type =
    match key_type with TVarchar n -> n | TInteger -> 4

  let empty_key key_type =
    match key_type with
    | TVarchar n -> Varchar (String.make n '"') (* 0x22222222 *)
    | TInteger -> Integer Int32.max_int
end
Right now keys are either integer keys or VARCHAR n keys, i.e. strings of some fixed length n. We provide some utility functions that will be used later on, such as getting the size of a key in bytes, comparison functions, and an empty key function. Throughout the implementation, we will have unused values within a node, and hence unused fields on disk storage. Instead of having random values there, it is useful to have some predefined constants when we are debugging with a hexdump. For instance, the empty_key function will initialize a key with a default value, so that when we look at a hexdump we see 0x22222222 instead of a random value. We will have some more constants for other values.
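As a quick, hypothetical usage example (not from the original code), here is how the KeyType utilities behave for the two key kinds:

(* Hypothetical usage of the KeyType module above. *)
let () =
  (* Integer keys occupy 4 bytes; Varchar n keys occupy n bytes. *)
  assert (KeyType.sizeof_key KeyType.TInteger = 4);
  assert (KeyType.sizeof_key (KeyType.TVarchar 8) = 8);
  (* Comparisons are only defined between keys of the same kind. *)
  assert (KeyType.(Integer 5l < Integer 7l));
  assert (KeyType.(Varchar "abc" <= Varchar "abd"));
  (* Empty keys are recognizable sentinel values in a hexdump. *)
  assert (KeyType.empty_key KeyType.TInteger = KeyType.Integer Int32.max_int)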
B+ Tree Type
type t = {
  sm : Storage_manager.t;
  key : KeyType.t;
  mutable root : node;
  mutable root_num : int;
}
The B+ tree type doesn't have much: an underlying storage manager initialized to handle the file being used to store the B+ tree, a key type for the tree, an in-memory copy of the root node, and the block offset on disk where the root node is stored. The latter two are maintained as a very simple efficiency gain; we will always be accessing the B+ tree via the root node, so we keep it around and make sure it stays current in memory. This adds some very minor complexity to the code, since whenever we make any changes to the root node we need to keep these fields up to date.
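As a small illustrative sketch (the helper below is hypothetical, not part of the original code), the invariant is simply that the in-memory copy, its block offset, and the on-disk block are updated together; the write_node helper it relies on is defined later in this post:

(* Hypothetical helper: whenever the root changes, update both in-memory fields
   and write the node back to its block on disk. *)
let set_root btree node block_num =
  btree.root <- node;
  btree.root_num <- block_num;
  write_node btree node block_num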
Disk Layout
Now let's go over how we will lay out nodes on disk. The goal here is to encode enough information in a block on disk to be able to reconstruct the in-memory data structure. Here is how each node will be laid out on disk; there are two separate diagrams for internal and leaf nodes:
Internal Node Layout
M = size of key (the byte offsets shown assume M = 4)

Byte offset
 |
 v
 0  +------------------------+
    | node_type    [4 bytes] |  = INTERNAL
 4  +------------------------+
    | parent_ptr   [4 bytes] |
 8  +------------------------+
    | num_keys     [4 bytes] |  (0-3 used out of 3 max)
12  +------------------------+
    | ptr_1        [4 bytes] |  → child block for keys < key_1
16  +------------------------+
    | key_1        [M bytes] |
20  +------------------------+
    | ptr_2        [4 bytes] |  → child block for keys between key_1 & key_2
24  +------------------------+
    | key_2        [M bytes] |
28  +------------------------+
    | ptr_3        [4 bytes] |  → child block for keys between key_2 & key_3
32  +------------------------+
    | key_3        [M bytes] |
36  +------------------------+
    | ptr_4        [4 bytes] |  → child block for keys >= key_3
40  +------------------------+
    Remaining space (no more room in the block to fit another (pointer, key) pair).
Here we have the layout and meaning of the fields for an INTERNAL B+ tree node. Note that although the layout will be exactly the same for LEAF nodes, it is useful to see two separate diagrams to see what the fields mean in each case. The first 4 bytes of the block hold the node type, the next 4 bytes store the parent pointer, and the next 4 store the current number of keys in the node. After that we store the pointers and keys in alternating order. Note that we don't have to alternate the keys and pointers in the physical layout; this was just an arbitrary decision made early on. At the very least, the diagrams look nicer and correspond to how we visualize a B+ tree. A more practical implementation would probably store all the keys in one chunk, followed by all the pointers.
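To make the offset arithmetic concrete, here is a small hypothetical sketch (the helpers are not part of the original code, but they mirror the offsets used by the serialization code below), where pair_size = 4 + M:

(* Hypothetical offset helpers for the layout above. The 12-byte header holds
   node_type, parent_ptr and num_keys; after that, pointer i and key i occupy
   one pair_size-sized slot. *)
let pointer_offset ~pair_size i = 12 + (i * pair_size)
let key_offset ~pair_size i = 12 + (i * pair_size) + 4

(* With 4-byte integer keys (pair_size = 8): ptr_1 at offset 12, key_1 at 16,
   ptr_2 at 20, and so on, matching the diagram. *)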
Leaf Node Layout
Byte offset
 |
 v
 0  +------------------------+
    | node_type    [4 bytes] |  = LEAF
 4  +------------------------+
    | parent_ptr   [4 bytes] |
 8  +------------------------+
    | num_keys     [4 bytes] |  (0-3 used out of 3 max)
12  +------------------------+
    | record_ptr_1 [4 bytes] |  → points to record for key_1
16  +------------------------+
    | key_1        [M bytes] |
20  +------------------------+
    | record_ptr_2 [4 bytes] |  → points to record for key_2
24  +------------------------+
    | key_2        [M bytes] |
28  +------------------------+
    | record_ptr_3 [4 bytes] |  → points to record for key_3
32  +------------------------+
    | key_3        [M bytes] |
36  +------------------------+
    | sibling_ptr  [4 bytes] |  → points to next leaf node
40  +------------------------+
Same layout, but now each left pointer of a key stores a record pointer instead of a pointer to another B+ tree node. Now is a good time to talk about what the "pointers" really are. As we can see, they are just 4-byte ints, but how do we interpret these ints as locations? They can mean two different things: either a pointer within the B+ tree itself (all the pointers in an internal node, plus the sibling pointer of a leaf node), or a record pointer that points into an entirely different file. The issue with the latter is that we don't have any reference to that file; unlike keys, we didn't wrap an abstract type around pointers.
This is a bit of a hack, but it keeps the code simpler and we can justify it as follows: implicitly, a pointer can only refer to the B+ tree storage file or to the file that contains the actual records. So we can assume that the index manager using the B+ tree maintains the relationship between a B+ tree and the actual database file being indexed. When the index manager requests the location of a record, the B+ tree is traversed and we return a leaf node's record pointer. This "pointer" is just the offset within the database table being indexed, i.e. the block that will contain the record. The index manager can resolve this offset and look in the actual file.
The B+ tree pointers, on the other hand, implicitly resolve to the B+ tree storage file itself. So the B+ tree algorithm can just pass the number to the underlying storage manager, and the block offset refers to the B+ tree file. In this manner we do not need to encode a filename in any pointer field, thus increasing the number of keys we can pack into a block. This puts a little more work on the index manager to maintain the mapping between the index and the file being indexed, but that is something the index manager has to do anyway.
Another way to put it is that the pointer fields are very similar to the Block_id type introduced in part 1, where a block_id is a (filename, block offset) pair that uniquely identifies a block of a storage file. The main hack here is that the filename is implicit: each pointer in the B+ tree can refer only to the B+ tree file itself, or to the database file that contains the actual records. The resolution happens implicitly, depending on where we are in the code; within the B+ tree algorithm we resolve B+ tree pointers to fetch blocks from the storage file, and to resolve record pointers the index manager maps the offset into the file storing the table.
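To restate the idea with a purely illustrative sketch (the filenames and functions below are made up, not part of the original code), the same int resolves against different files depending on who interprets it:

(* Illustrative only: a B+ tree pointer resolves against the index file itself,
   while a record pointer is resolved by the index manager against the table
   file being indexed. *)
let resolve_btree_pointer node_ptr = ("btree_index_file", node_ptr)
let resolve_record_pointer record_ptr = ("table_data_file", record_ptr)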
Serialization
Now we get to serialization. The serialization function will take a B+ tree node, and lay it out in a Page object (from part 1), which we can then write to disk using the storage manager.
Here is the function in all its glory:
(* Serialize a B+ tree node into its layout on a Page. *)
let serialize node block_size =
  (* Initialize the Page which will hold our node layout. *)
  let page = Page.make ~block_size in
  (* Set node type in the first 4 bytes. *)
  Page.set_int32 page 0 (serialize_node_type node.node_type);
  (* Set parent at offset 4. *)
  Page.set_int32 page 4 (Int32.of_int node.parent);
  (* Set current size at offset 8. *)
  Page.set_int32 page 8 (Int32.of_int node.cur_size);
  (* Calculate the (pointer, key) pair size:
     4 bytes for the pointer + M bytes for the key. *)
  let pair_size = 4 + KeyType.sizeof_key node.key_type in
  (* Lay out keys. *)
  for i = 0 to node.capacity - 1 do
    let key_offset = 12 + (i * pair_size) + 4 in
    match node.keys.(i) with
    | Varchar s -> Page.set_string_raw page key_offset s
    | Integer n -> Page.set_int32 page key_offset n
  done;
  (* Lay out pointers. *)
  for i = 0 to node.capacity do
    let pointer_offset = 12 + (i * pair_size) in
    Page.set_int32 page pointer_offset (Int32.of_int node.pointers.(i))
  done;
  (* Calculate the final pointer offset. *)
  let final_pointer_offset = 12 + (node.capacity * pair_size) in
  (* If we are at a leaf node, make sure the sibling pointer ends up in the
     final pointer slot. The pointers loop above already wrote this slot, so
     this write is redundant, but it makes the intent explicit. *)
  if node.node_type = Leaf then
    Page.set_int32 page final_pointer_offset
      (Int32.of_int node.pointers.(node.capacity));
  page
The code is fairly self-explanatory with the comments. We allocate a page with the given block_size, which gives us space to lay out the node's byte representation.
Deserialization
Now we want to go the opposite way. Given a page of bytes with the B+ tree block size, we want to convert it into our node data structure.
let get_num_keys block_size key_ty =
  (block_size - 16) / (4 + KeyType.sizeof_key key_ty)

(* Convert a page layout into the B+ tree node data structure. *)
let deserialize page key_ty block_size =
  (* Read node type. *)
  let node_type = Page.get_int32 page 0 |> int32_to_node_type in
  (* Read parent. *)
  let parent = Page.get_int32 page 4 |> Int32.to_int in
  (* Read N = current number of keys. *)
  let cur_size = Page.get_int32 page 8 |> Int32.to_int in
  (* Calculate capacity using block size and key type. *)
  let capacity = get_num_keys block_size key_ty in
  let keys = Array.init capacity (fun _ -> KeyType.empty_key key_ty) in
  (* All unused pointers hold the value 0xDDDDDDDD. *)
  let pointers = Array.init (capacity + 1) (fun _ -> unused_pointer_constant) in
  let pair_size = 4 + KeyType.sizeof_key key_ty in
  (* Read keys and pointers. *)
  for i = 0 to cur_size - 1 do
    (* Read pointer i. *)
    let pointer_offset = 12 + (i * pair_size) in
    let key_offset = 12 + (i * pair_size) + 4 in
    pointers.(i) <- Page.get_int32 page pointer_offset |> Int32.to_int;
    (* Read key i. *)
    match key_ty with
    | TVarchar n ->
        let str = Page.get_string_raw page key_offset n in
        keys.(i) <- Varchar str
    | TInteger ->
        let num = Page.get_int32 page key_offset in
        keys.(i) <- Integer num
  done;
  (* Read the final pointer. *)
  let last_pointer_offset = 12 + (cur_size * pair_size) in
  pointers.(cur_size) <- Page.get_int32 page last_pointer_offset |> Int32.to_int;
  (* The sibling pointer is always the last pointer in our preset-size array,
     at index capacity. *)
  let sibling_pointer_offset = 12 + (capacity * pair_size) in
  if node_type = Leaf then
    pointers.(capacity) <-
      Page.get_int32 page sibling_pointer_offset |> Int32.to_int;
  { node_type; parent; cur_size; keys; pointers; capacity; key_type = key_ty }
We have a helper function get_num_keys that calculates our node capacity. Given the key type and the block size, we can work out how many keys fit in a block alongside the metadata and the N+1 pointers. To calculate it, we first subtract the 12 bytes of metadata plus 4 bytes for the final pointer, then pack the remaining space with as many (pointer, key) pairs as will fit.
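As a worked example (hypothetical numbers, assuming a 4096-byte block size):

(* Hypothetical capacity calculations for a 4096-byte block. *)
let _ = get_num_keys 4096 KeyType.TInteger      (* (4096 - 16) / (4 + 4)  = 510 keys, 511 pointers *)
let _ = get_num_keys 4096 (KeyType.TVarchar 12) (* (4096 - 16) / (4 + 12) = 255 keys, 256 pointers *)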
The function is doing the reverse of serialize. We extract the metadata from the head of the block and copy over all the key and pointer pairs, making sure to copy the final pointer. Admittedly the code between serializing and deserializing is a bit non-uniform, since serialize writes all key and pointer slots up to capacity, whereas deserialize only reads values up to cur_size and then handles the last-pointer logic separately. In either case, the code works for now, but this post and code may be edited once the code is cleaned up and made more uniform.
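One way to sanity-check the pair of functions is a quick round-trip test. Here is a minimal sketch (hypothetical, not part of the original code); it only compares the slots up to cur_size, since the unused slots hold sentinel values:

(* Hypothetical round-trip check: serializing a node and deserializing the
   resulting page should reproduce the metadata and the first cur_size
   key/pointer pairs. *)
let round_trip_ok node block_size =
  let page = serialize node block_size in
  let node' = deserialize page node.key_type block_size in
  node'.node_type = node.node_type
  && node'.parent = node.parent
  && node'.cur_size = node.cur_size
  && Array.sub node'.keys 0 node.cur_size = Array.sub node.keys 0 node.cur_size
  && Array.sub node'.pointers 0 (node.cur_size + 1)
     = Array.sub node.pointers 0 (node.cur_size + 1)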
Reading and writing from disk
Now with our handy serialization and deserialization functions, we can write some utilities to read and write from disk. This will make it easy for us to implement the B+ tree insertion algorithm in part 3 without having to worry about how we are accessing disk.
let write_node btree node n =
  let block_size = File_manager.get_blocksize btree.sm.file_manager in
  let page = serialize node block_size in
  Storage_manager.update_block_num ~storage_manager:btree.sm ~block_num:n ~page

let write_node_append btree node =
  let block_size = File_manager.get_blocksize btree.sm.file_manager in
  let page = serialize node block_size in
  let block = Storage_manager.append ~storage_manager:btree.sm ~page in
  Block_id.block_num block
The write_node function takes a B+ tree node and an offset n into the B+ tree file. We make a call to the underlying storage manager to update the block within the file. This is used whenever we are updating a node on disk with new values.
The write_node_append function appends a block to the underlying B+ tree file and returns the offset of the new block's location. (Note that this may also return a block from the free list we implemented in part 1, not necessarily a new block at the end of the file.) This is used whenever we are creating a new node in the B+ tree.
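As a hypothetical usage example (the helper below is not in the original code, and the sentinel choices for parent and unused pointers are assumptions), creating a brand new empty leaf and persisting it might look like this:

(* Hypothetical helper: build an empty leaf filled with sentinel values, append
   it to the B+ tree file, and return its block offset. *)
let append_empty_leaf btree =
  let block_size = File_manager.get_blocksize btree.sm.file_manager in
  let capacity = get_num_keys block_size btree.key in
  let leaf =
    {
      node_type = Leaf;
      parent = 0; (* assumption: 0 used as "no parent yet" for a fresh node *)
      cur_size = 0;
      keys = Array.init capacity (fun _ -> KeyType.empty_key btree.key);
      pointers = Array.make (capacity + 1) unused_pointer_constant;
      capacity;
      key_type = btree.key;
    }
  in
  write_node_append btree leaf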
(* Fetch a block from the btree and deserialize it into a btree node.
   params: p is a pointer to a block in the btree. *)
let get_node btree p =
  let block_size = File_manager.get_blocksize btree.sm.file_manager in
  let page = Storage_manager.get_block ~storage_manager:btree.sm ~block_num:p in
  deserialize page btree.key block_size
This is our utility for fetching a node from disk. We pass a pointer p, which is the offset within the B+ tree storage file, read the block from disk into a page, then deserialize into our node type.
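For example (a hypothetical call, assuming an already-initialized tree), following the i'th child pointer out of the root is just:

(* Follow the i'th child pointer of the in-memory root node. *)
let child_of_root btree i = get_node btree btree.root.pointers.(i)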
With these utilities handy, we have finished the disk-backed portion of our B+ tree implementation. We have everything we need in terms of disk access and retrieval, and all that is left is to start implementing some B+ tree algorithms!