Checkpoints
Tree walk order
Checkpointer walks OrioleDB trees in LNR-order. The walk is divided into steps. The result of each step is a message, which determines the next step. The possible messages are given below.
WalkDownwards
– the last step found an in-memory downlink within an internal page. The next step should be to visit a page on the lower level and start processing it.WalkUpwards
– the last step finished processing the page. The next step should continue processing the parent.WalkContinue
– continue working with the current page after releasing a lock. That happens when the checkpointer has to wait for the concurrent operation.
The picture below is the example of OrioleDB B-tree walking by the checkpointer.
Checkpointer comes to the root page n1
with the WalkDownwards
message (step 1), then checks the first downlink l1
. Since l1
is the in-memory downlink, the checkpointer moves to the leaf page n2
with the WalkDownwards
message (step 2). After flushing n2
, checkpointer comes back to n1
with the WalkUpwards
message (step 3) and continues iteration over n1
downlinks. Similarly, it walks down to the n3
and back via the l2
downlink (steps 4 and 5). l3
appears to be IO in-progress downlink, and the checkpointer has to unlock the n1
page, and wait till IO is completed and continue with WalkContinue
message (step 6). After relocking n1
, checkpointer finds l3
to be an on-disk downlink and copies it "as is". Finally, checkpointer walks down to the n4
and back via the l4
downlink (steps 7 and 8). Then n1
is done, the checkpointer finishes the walk with the WalkUpwards
message.
Checkpoint state
While the checkpointer is writing children of non-leaf page, concurrent splits and merges could happen. Therefore, the checkpoint state contains images of non-leaf pages under checkpointing as its reconstructed as the checkpointer visits the downlinks. If there are no concurrent changes to the non-leaf page, the reconstructed state finally matches the page state. Otherwise, the reconstructed page state could not even match to page state at any moment of time but always match the history of the checkpointer tree walk.
Note that the reconstructed state does not contain in-memory downlinks. In-memory downlinks are replaced with on-disk downlinks as we wrote the children's pages.
The picture below represents an example of a checkpoint state.
The root page n1
, internal page n3
, and leaf page n8
are currently under checkpointing. The downlink l2
is written to the reconstructed state. The downlink l3
is the next downlink
in the reconstructed state. Its key is known, but the link is not because the corresponding children were not written yet. Similarly, downlink l6
is written, downlink l7
is the next downlink
, and downlink l8
is not processed yet.
Let us imagine that the following event happened:
- Page
n7
was written by checkpointer. - Page
n8
was split inton8
andn9
. - Pages
n6
andn7
were merged. The result is marked asn67
. - Checkpointer has written page
n8
and started processing pagen9
.
The resulting state is given in the picture below. Note that the reconstructed page image contains links l6
and l7
(as we visited them before the merge) but contains l8
and the next downlink
corresponding to l9
(as we visited those downlinks after the split).
Autonomous non-leaf pages
If the non-leaf page under checkpointing gets modified concurrently, it becomes an "autonomous" non-leaf page. Autonomous pages work with the rules below.
- If the page is marked as "autonomous", all its parents to the root are also marked as "autonomous".
- If the page has an associated on-disk location, this association is cleared. The corresponding location is marked as free space at the current checkpoint.
- The autonomous page will be processed until its hikey is met, disregarding how many pages will be visited to meet this target (due to concurrent insertion, it could be many pages).
- Even if the initial page corresponding to the autonomous page has been split. The page holding the initial hikey is tracked. The merge, which would remove that hikey, is prevented.
- If the autonomous page is full, but the corresponding hikey is not yet met, current contents are flushed to the disk (and parent got the corresponding downlink with
WalkUpwards
), but processing of the autonomous page continues till the hikey is met. - When flushing autonomous, the corresponding "on-disk" location is marked as free for the future checkpoint.
Checkpointer messages
Consider more details regarding the checkpointer messages we enumerated above.
WalkDownwards
This message has the parameters below.
- The number and change count of the in-memory page to be visited.
- Low key ("lokey"). The lokey is from the parent page downlink, or it is the parent page lokey if the downlink is the first on the page.
The checkpointer has to process the referenced page. After the page is processed, the WalkUpwards
message must be returned. If the referenced page is non-leaf, more messages will be issued during its processing, but finally, there must be WalkUpwards
of the referenced page.
There might be a failure due to concurrent operations: the in-memory page might have a different change count. In this case, the corresponding WalkUpwards
should return the invalid downlink. Also, in a failure case, WalkUpwards
message should go just after WalkDownwards
: once we start processing the non-leaf page, we must finish it.
WalkUpwards
This message has the parameters below.
- On-disk downlink. This link might be invalid, as described above.
- Next key. That is actually a hikey of the page written. It might not match the subsequent downlink of the parent page due to concurrent splits and merges. On mismatch, the parent page must be marked "autonomous".
- Flag indicating that parent page must be marked as "dirty". This flag is set when the page has been written to the new place after the previous checkpoint. This flag is not set if the page and its children will not be modified then. The parent must be marked as "dirty" to be written and reflect the new on-disk downlink.
- The flag indicates that we must save the existing
next downlink
on the parent page. That happens to the autonomous page when the current reconstructed image is finished: we have the page written and need to insert a new downlink to the parent, but we still need to visit the samenext downlink
.
This message indicates that the child page has been processed, and the parent needs to add the downlink. If there is no parent, we have processed the root and now have a pointer to the new root on-disk location.
WalkContinue
This message has no parameters. It just indicates that checkpointer must continue processing the same page with the same next downlink
. That happens when the checkpointer has to wait for the concurrent operation. Such as meeting IO in-process downlink and having to release the log and wait till the IO is finished.