File Format

Overview

_images/file-format-layout.png

vsql stores the entire database (also known as the catalog) in a single file. The default extension is .vsql, although that is not required.

The file must include a Header, followed by zero or more pages. That is, when the database file is initially created it will only have a Header and zero pages.

Every page must be the same size, but this is configurable. The page size cannot be changed after the file is created (the page size is stored in the Header).

A page can be one of two types: leaf and non-leaf, and they may appear in any order in the file. They may also be reordered or removed as the file expands and contracts.

Also see Limitations.

Page

_images/file-format-page.png

All pages types (defined by Kind) will use the same basic structure with 3 bytes reserved for metadata and the remaining bytes available for Data. The format of the data itself will depend on the Kind:

Kind

Page Type

0

Leaf Page

1

Non-leaf Page

Since Used has a max value of 65536 this is also the maximum possible size for a page. Although, the default page size for new files is 4096 bytes. 1

The Used includes the size of Meta. So an empty page would have a Used of 3 and an entirely full page would have a Used equal to the page size.

Leaf Page

_images/file-format-leaf-page.png

The first 3 bytes are reserved for Meta, which will contain the kind (0 for leaf) and the number of used bytes (remember that it also contains the size of Meta).

The Data will contain zero or more objects. Although, zero objects is valid if that were the case the page should be removed as part of the garbage collection.

Each Object may be variable length and can be any number of bytes so long as the Object size is equal to or less than the (page size - 3). For objects that cannot fit into a page, they are stored in as a Blob Object and referenced from the Object.

Objects within a page are kept sorted by key and unused space is always located after the used data. You should not assume that the unused portion is zeroed.

Non-leaf Page

_images/file-format-non-leaf-page.png

A non-leaf page share a lot of similarities with a Leaf Page. However, the Objects are pointers to Leaf pages. That is the Value within the Object is a i32 that references the Leaf Page number.

Objects within a non-leaf page are ordered based on the objects Key. This is important for scanning (such as range queries) to improve performance as the Key itself is the first Key in the referenced Leaf Page. Here is an example of how a Non-leaf Page references a Leaf Page:

_images/file-format-non-leaf-reference.png

It’s also important to note that as the B-tree gets larger, there may be multiple non-leaf pages that need to be traversed this way. The process will always end with a leaf page. A parent of a non-leaf (which must also be a non-leaf) uses the Key that points to the first Key in the child page.

Object

An Object may be any length, but this example uses an object of 230 bytes:

_images/file-format-object.png

Individual objects can be of different types and are encoded/decoded based on the first byte of the Key:

First Byte

Object Type

B

Blob Object

F

Fragment Object

Q

Sequence Object

R

Row Object

S

Schema Object

T

Table Object

V

Sequence Value Object

Every object contains 15 bytes of metadata:

Part

Format

Description

Length

i32 (4 bytes)

Is the total length of the object (including the metadata).

TID

i32 (4 bytes)

Transaction ID that created this object. 2

XID

i32 (4 bytes)

Transaction ID that expired this object. 2

Ref

u8 (1 byte)

When true, the Value will be 5 bytes containing. See Blob Object.

Key Len

i16 (2 bytes)

The number of bytes in the proceeding Key.

Using this metadata we can say that the length of Value will be: (Length - 15 - Key Length).

Blob Object

When an object is added to the B-tree that is too large to fit into a single page, it must be split into blob (B) and fragment (F) objects. For example, if the page size was 256 bytes, but we try to insert a object that is 529 bytes:

_images/file-format-blob-1.png

It is split into 3 objects:

_images/file-format-blob-2.png

Where entire pages consist of one more blob objects followed by an optional fragement object containing any left over data. The fragment is optional because the object might happen to fit perfectly in a whole number of blob objects.

Finally, the original object is replaced with a reference (blue indiciated replacements):

_images/file-format-blob-3.png

Fragment Object

A fragment object (uses the prefix F) contains a portion of data from splitting a large object. See Blob Object.

Row Object

A Row Object (has the R prefix) contains a table row. The serialization does not need to be explained in detail here. You can check the code for Row.bytes() and new_row_from_bytes() respectively.

Schema Object

A Schema Object (has the S prefix) contains a schema definition. The serialization does not need to be explained in detail here. You can check the code for Schema.bytes() and new_schema_from_bytes() respectively.

Sequence Object

A Sequence Object (has the Q prefix) contains the definition for a sequence (not including the next value, see Sequence Value Object).

Sequence Value Object

A Sequence Value Object (has the V prefix) contains the next value for a sequence. Since a sequence’s next value needs to be atomic, even outside of transaction isolation, changing the value will be always persistent.

Table Object

A Table Object (has the T prefix) contains a table definition. The serialization does not need to be explained in detail here. You can check the code for Table.bytes() and new_table_from_bytes() respectively.

Notes for Future Improvements

Sequences

The properties of a sequence (such as the INCREMENT BY, etc) are held in the same record as the next value. Since the next value of a sequence needs to be atomic (and separate from the transaction isolation) a ROLLBACK on a transaction that contains an ALTER SEQUENCE will not undo any changes.

Ideally, the properties of a SEQUENCE can be stored in a separate location on disk.

Notes

1

See default_connection_options() in https://github.com/elliotchance/vsql/blob/main/vsql/connection.v.

2(1,2)

This is used for transaction visibility. See MVCC.